Multiword Expressions

Timothy Baldwin and Su Nam Kim

While it may appear relatively innocuous, the question of what constitutes a “word” is a surprisingly vexed one. First, are dog and dogs two separate words, or variants of a single word? The traditional view from lexicography and linguistics is to treat them as separate inflected wordforms of the lexeme dog, as any difference in the syntax/semantics of the two words is predictable from the general process of noun pluralisation in English. Second, what is the status of expressions like top dog and dog days? A speaker of English who knew top, dog and day in isolation but had never been exposed to these two expressions would be hard put to predict the semantics of “person who is in charge” and “period of inactivity”, respectively. To be able to retrieve the semantics of these expressions, they must have lexical status of some form in the mental lexicon, which encodes their particular semantics. Expressions such as these which have surprising properties not predicted by their component words are referred to as multiword expressions (MWEs). The focus of this chapter is the precise nature and types of MWEs, and the current state of MWE research in NLP. In addition to providing a brief foray into the linguistic complexities of MWEs, this chapter details the key MWEs in MWE research, and outlines various approaches to the primary computational challenges associated with MWEs, namely: identification, extraction and interpretation.

Bibtex Citation

    author = {Timothy Baldwin and Su Nam Kim},
    title = {Multiword Expressions},
    booktitle = {Handbook of Natural Language Processing, Second Edition},
    editor = {Nitin Indurkhya and Fred J. Damerau},
    publisher = {CRC Press, Taylor and Francis Group},
    address = {Boca Raton, FL},
    year = {2010},
    note = {ISBN 978-1420085921}



LREC MWE workshop (2008) Resources

  • 16 corpora (1 Czech, 6 English, 2 Estonian, 2 French, 5 German)
  • Reference Data for Czech Collocation Extraction
    • Czech/collocation,non-collocation
    • Three reference data sets provided for the MWE evaluation campaign focused on ranking MWE candidates. The data sets comprise bigrams extracted from the Prague Dependency Treebank and the Czech National Corpus. The extracted bigrams are annotated as collocations and non-collocations and provided with corpus frequency information.
  • Cranberry Expressions in English and in German
    • English/
    • English,German/MWE evaluation
    • The dataset consists of 77 English cranberry words and 444 German cranberry words extracted from the Collection of Distributionally Idiosyncratic Items (CoDII), an electronic linguistic resource of lexical items with idiosyncratic occurrence patterns.
  • Compound Nouns in sentences
    • English/Compound Nouns
    • The dataset contains a random sample of 1000 sentences from the BNC. It also contains a total of 345 binary compound nouns annotated according to the 5-way classification of SUB, DOB, POB, NA or NV.
  • Standardised Evaluation of English Noun Compound Interpretation
    • English/Noun Compound/Semantic Interpretation
    • The dataset contains 2,169 noun compounds tagged with the set of semantic relations defined in Barker and Szpakowicz 1998. They are extracted from the POS-tagged Wall Street Journal component of the Penn Treebank 2.0.
  • Paraphrasing Verbs for Noun Compound Interpretation
    • English/Noun Compound/Semantic Interpretation
    • The dataset contains 250 noun-noun compounds proposed in the linguistic literature and their paraphrasing verbs collected using Amazon's Mechanical Turk. It also contains a dataset of pairs of sentences representing a special kind of textual entailment task, where a binary decision is to be made about whether an expression involving a verb and two nouns can be transformed into a noun compound.
  • Interpreting Compound Nominalisations
    • Jeremy dataset
    • English/Noun Compound/Semantic Interpretation
  • A Resource for Evaluating the Deep Lexical Acquisition of English Verb-Particle Constructions
    • English/Verb-particle construction/Extraction
    • The dataset provides the platform for a task on the extraction of English verb-particle constructions with basic valence information.
  • The VNC-Tokens Dataset
    • English/Verb-Noun/Identification
    • The dataset contains almost 3000 English verb-noun combination usages annotated as to whether they are literal or idiomatic.
  • Multi-Word Verbs of Estonian: a Database and a Corpus
    • Estonian/Multi-word verbs
    • The resource contains a database of 13,000 Estonian multiword verbs (MWVs) without annotation, as well as a manually tagged corpus of Estonian MWVs (300,000 tokens).
  • A French Corpus Annotated for Multiword Nouns
    • French/Multiword noun
    • The dataset contains a French corpus annotated for multiword nouns. This corpus is designed for investigation in information retrieval and extraction as well as in deep and shallow syntactic parsing.
  • A French Corpus Annotated for Multiword Adverbs
    • French/Multiword adverb
    • The dataset contains an electronic dictionary of 6,800 French multiword adverbs. This corpus is designed for investigation in information retrieval and extraction as well as in deep and shallow syntactic parsing.
  • A Lexicon of Shallow-typed German-English MW-Expressions and a German Corpus of MW-Expressions annotated Sentences
    • German,English/MWE
    • The resource contains a bilingual German-English lexicon of 871 multiword expressions, and a corpus of 536 German sentences from various sources in which MWEs are marked with XML tags.
  • A Lexicographic Evaluation of German Adjective-Noun Collocations
    • German/Adjective-Noun collocation
    • The database contains 1252 German adjective-noun pairs annotated by professional lexicographers at the dictionary department of Langenscheidt KG, Munich.
  • Description of evaluation resource -- German PP-Verb data
    • German/PP-Verb
    • The dataset contains 21796 German combinations of prepositional phrase (PP) and governing verb, manually annotated and extracted from the Frankfurter Rundschau corpus.

Preposition Project

  • English/Verb-particle construction
  • The Preposition Project (TPP) is designed to provide a comprehensive characterization of English preposition senses suitable for use in natural language processing. Each of 673 preposition senses for 334 prepositions (mostly phrasal prepositions) has been described by giving it a semantic role or relation name and by characterizing the syntactic and semantic properties of its complement and attachment point.

Korean Noun Compound data and dictionary

Compound Noun bracketing dataset of Mark Lauer

  • English/Compound Noun/Syntactic Analysis
  • The dataset contains bracketing of 3-term noun compounds found in the Penn Treebank.

Noun Phrase bracketing dataset of David Vadas

  • English/Compound Noun/Syntactic Analysis
  • Annotated data and annotation guidelines described in the paper "Adding Noun Phrase Structure to the Penn Treebank" (ACL 2008). Requires the Penn Treebank 3 corpus, and Python.

Biomedical Compound Noun bracketing dataset of Preslav Nakov

  • English/Compound Noun/Syntactic Analysis
  • The dataset contains 430 noun compounds extracted from biomedical text. Details are given in "Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing" (CoNLL 2005).

Compound Noun set of Judith Levi

  • English/Compound Noun/Semantic Interpretation
  • The dataset contains about 330 compounds including 240 noun-noun pairs.

Compound Noun sets (2 sets) by Ken Barker

  • English/Compound Noun/Semantic Interpretation
  • The two datasets contain 505 and 335 compound nouns annotated with 20 semantic relations defined in Barker and Szpakowicz 1998.

Compound Noun set by Dr. Nastase

  • English/Compound Noun/Semantic Interpretation
  • The dataset contains 600 compound nouns annotated with 20 semantic relations defined in Barker and Szpakowicz 1998.

Compound Noun set by Dr. Ó Séaghdha

  • English/Compound Noun/Semantic Interpretation
  • The dataset contains 1,443 noun-noun compounds extracted from the British National Corpus (BNC). Each compound is annotated with one of six semantic relations: BE, HAVE, IN, ACTOR, INST and ABOUT.

VPC compositionality data by Bannard

  • English/Verb-particle constructions/Modeling Compositionality
  • 160 English verb-particle constructions tagged with verb and particle compositionality. Details and the dataset can be found in "Learning about the Meaning of Verb Particle Constructions from Corpora", Colin Bannard, Journal of Computer Speech and Language, 2005, 19(4), pp. 467-478, and in "Acquiring Phrasal Lexicons from Corpora", Colin Bannard, Ph.D. dissertation, University of Edinburgh, UK, 2006.

Verb particle constructions with compositionality judgements

  • English/Verb-particle constructions/Modeling Compositionality
  • The dataset contains English verb-particle constructions tagged with human judgments obtained from 3 native (British) English speakers (all computational linguists) on the verbs sampled as described in the above paper. NonNativeSpeaker is the judgment obtained from a non-native speaker of English working in our department.

Verb particle constructions with Levin verb classes and Google frequencies

  • English/Verb-particle construction
  • Each line in the file consists of a single VPC entry, with the following format: FREQ,VPARTICLE,VERB,PARTICLE,CLASS,QUERY where FREQ is the Google-based estimate of the number of web pages containing the given VPC, VPARTICLE is the full verb particle construction, VERB is the head verb, PARTICLE is the (prepositional) particle, CLASS is the unique Levin verb class for the given VPC, and QUERY is the Google query on which the web estimation is based. Note that where a VPC occurs in multiple Levin classes, an individual listing is given for each class. In that case, the estimated web frequency is that for the VPC type across all Levin classes, rather than for each sense of the VPC as instantiated in the different Levin classes.
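
The record format described above can be read with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: the names VPCEntry, parse_vpc_lines and group_by_vpc are hypothetical, the sample lines (including the class labels and frequencies) are invented for illustration, and it assumes that any commas inside the QUERY field are protected by double quotes.

```python
import csv
from typing import NamedTuple


class VPCEntry(NamedTuple):
    freq: int          # FREQ: estimated number of web pages
    vparticle: str     # VPARTICLE: full verb-particle construction
    verb: str          # VERB: head verb
    particle: str      # PARTICLE: (prepositional) particle
    levin_class: str   # CLASS: Levin verb class for this listing
    query: str         # QUERY: query behind the web estimate


def parse_vpc_lines(lines):
    """Parse FREQ,VPARTICLE,VERB,PARTICLE,CLASS,QUERY records."""
    entries = []
    for row in csv.reader(lines):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        freq, vparticle, verb, particle, levin_class, query = row
        entries.append(
            VPCEntry(int(freq), vparticle, verb, particle, levin_class, query)
        )
    return entries


def group_by_vpc(entries):
    """Map each VPC type to (frequency, list of Levin classes).

    The frequency is NOT summed over listings, because each listing
    already carries the estimate for the whole VPC type.
    """
    grouped = {}
    for e in entries:
        freq, classes = grouped.get(e.vparticle, (e.freq, []))
        classes.append(e.levin_class)
        grouped[e.vparticle] = (freq, classes)
    return grouped


# Invented sample lines, for illustration only (not taken from the dataset).
SAMPLE = [
    '1200,hand in,hand,in,13.1,"hand in"',
    '560,eat up,eat,up,39.1,"eat up"',
    '560,eat up,eat,up,39.4,"eat up"',
]

if __name__ == "__main__":
    for vpc, (freq, classes) in group_by_vpc(parse_vpc_lines(SAMPLE)).items():
        print(vpc, freq, classes)
```

Grouping by VPARTICLE rather than summing FREQ reflects the note above: a VPC listed under several Levin classes repeats the same type-level frequency on each line.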

English and Russian Prepositional Phrases

  • English,Russian/Verb-particle construction
  • A list of Russian PP MWEs and a ranked list of PP MWE candidates; a ranked list of English PP MWE candidates from the BNC; a tool for extracting MWEs
  • The MWEs in the dataset are those found to be statistically important in the pilot version of the Russian Reference Corpus (55 million words).

Light-verb construction corpus at NUS

  • English/Light-verb construction/Identification
  • The dataset consists of questions and responses. Each question contains a sentence as well as a verb-object pair extracted from the sentence.

Idiom data (SAID)


Collocation Searcher

Light-verb construction tool at NUS

Ted Pedersen's N-gram Statistics

  • English/collocation
  • The system allows the user to identify word and character n-grams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice Coefficient, etc. It has been designed to allow a user to add their own tests with minimal effort.
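
Three of the association tests named above can be sketched for a single bigram as follows. This is an illustrative Python sketch using the textbook formulas over a 2x2 contingency table, not the package's own implementation. The function names and the count names are assumptions: n11 is the bigram frequency, n1p and np1 are the frequencies of the first and second word in their respective positions, and npp is the total number of bigrams.

```python
import math


def contingency(n11, n1p, np1, npp):
    """Return observed and expected 2x2 cell counts for a word pair."""
    # Observed cells in the order o11, o12, o21, o22.
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    if min(observed) < 0 or 0 in (n1p, np1, npp - n1p, npp - np1):
        raise ValueError("inconsistent or degenerate counts")
    rows = (n1p, npp - n1p)
    cols = (np1, npp - np1)
    # Expected cells under independence, in the same order as `observed`.
    expected = [r * c / npp for r in rows for c in cols]
    return observed, expected


def dice(n11, n1p, np1):
    """Dice coefficient: 2 * joint / (sum of marginals)."""
    return 2 * n11 / (n1p + np1)


def chi_squared(n11, n1p, np1, npp):
    """Pearson's chi-squared statistic over the four cells."""
    observed, expected = contingency(n11, n1p, np1, npp)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def log_likelihood(n11, n1p, np1, npp):
    """Log-likelihood ratio (G2); empty cells contribute zero."""
    observed, expected = contingency(n11, n1p, np1, npp)
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


if __name__ == "__main__":
    counts = (10, 20, 30, 1000)  # invented counts, for illustration only
    print("dice:", dice(*counts[:3]))
    print("chi-squared:", chi_squared(*counts))
    print("log-likelihood:", log_likelihood(*counts))
```

When the observed counts match the expected counts exactly, both chi-squared and the log-likelihood ratio are zero; larger values indicate a stronger departure from independence.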

Stefan Evert's UCS toolkit

  • English/collocation
  • The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data. Data sets – each one containing a list of word pairs together with their joint and marginal frequencies – are stored in a tabular format in plain (compressed) text files. They can be viewed, printed, manipulated in various ways, annotated with association scores from a wide range of built-in measures, ranked, and sorted with the UCS/Perl system. Additional functionality for the graphical evaluation of association measures in a collocation extraction task is provided by the UCS/R system.
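
The workflow described above (a table of word pairs with joint and marginal frequencies, annotated with an association score and then ranked) can be illustrated in a few lines of Python. This sketch does not use the UCS toolkit and does not reproduce its file format: the column names w1, w2, f, f1, f2, the sample rows, and the function rank_pairs are all assumptions made for illustration, and pointwise mutual information is used simply as one example of an association measure.

```python
import csv
import io
import math

# Invented tab-separated sample: joint frequency f, marginals f1 and f2.
SAMPLE_TSV = (
    "w1\tw2\tf\tf1\tf2\n"
    "strong\ttea\t30\t500\t200\n"
    "powerful\ttea\t2\t400\t200\n"
    "of\tthe\t5000\t30000\t60000\n"
)


def rank_pairs(tsv_text, sample_size):
    """Score each word pair with pointwise MI and sort, best first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scored = []
    for row in reader:
        f, f1, f2 = int(row["f"]), int(row["f1"]), int(row["f2"])
        # PMI = log2(observed joint frequency / expected joint frequency)
        pmi = math.log2(f * sample_size / (f1 * f2))
        scored.append((row["w1"], row["w2"], pmi))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


if __name__ == "__main__":
    for w1, w2, score in rank_pairs(SAMPLE_TSV, sample_size=1_000_000):
        print(f"{w1} {w2}\t{score:.2f}")
```

Swapping in a different measure only requires changing the line that computes the score, which is why the frequencies are kept in the table rather than precomputed scores.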


ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment

ACL 2004 Workshop on Multiword Expressions: Integrating Processing

EACL 2006 Workshop on Multi-word-expressions in a multilingual context

COLING/ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties

A Broader Perspective on Multiword Expressions at ACL 2007

Towards a Shared Task for Multiword Expressions (MWE 2008) at LREC 2008

Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009) at ACL/IJCNLP

1st ACL-SIGSEM Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications

2nd ACL-SIGSEM Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications

3rd ACL-SIGSEM Workshop on Prepositions

4th ACL-SIGSEM Workshop on Prepositions

Journal Special Issue on Multiword Expressions

Computer Speech and Language, Special Issue on Multiword Expressions (2005 vol. 19)

International Journal of Language Resources and Evaluation, Special issue on Multiword expressions: hard going or plain sailing? (2009)

Bibliography

MWE bibliography

Noun Compound bibliography

Nominal compound MWE bibliography

Verb Particle Construction bibliography

Light Verb Construction bibliography

Collocation bibliography

Idiom bibliography

Adjective-Noun bibliography