Introduction to Corpus Linguistics

Lecture notes for the JSI postgraduate school

Tomaž Erjavec
Dept. for Knowledge Technologies
Jožef Stefan Institute
Jamova 39
1000 Ljubljana

March 21st, 2007

Published at: http://nl.ijs.si/et/teach/jsi06-hlt/

1. Overview

1.1. What is a corpus?

  • The Collins English Dictionary (1986):
    1. a collection or body of writings, esp. by a single author or topic.
  • Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES:
    Corpus: A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
    Computer corpus: A corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.

1.2. Using corpora

Research on actual language: descriptive approach, study of performance, empirical linguistics.
  • Applied linguistics:
    • Lexicography: mono-lingual dictionaries, terminological, bi-lingual
    • Language studies: hypothesis verification, knowledge discovery
      (lexis, morphology, syntax, ...)
    • Translation studies: a source of translation equivalents and their contexts
      translation memories, machine aided translations
    • Language learning: real-life examples
      "idiomatic teaching", curriculum development
  • Language technology:

1.3. Characteristics of a corpus

  1. Quantity:
    the bigger, the better
  2. Quality:
    the texts are authentic; the mark-up is validated
  3. Simplicity:
    the computer representation is understandable, with the markup easily separated from the text
  4. Documented:
    the corpus contains bibliographic and other meta-data

1.4. Typology of corpora

  • Corpora of written language, spoken and speech corpora (authenticity/price)
    e.g. the agency ELRA catalog
  • Reference corpora (representative) and sub-language corpora (specialised)
    e.g. BNC, ICE, COLT
  • Corpora with integral texts or of text samples (historical and legal reasons)
    e.g. Brown
  • Static and monitor corpora (language change)
  • Monolingual and multilingual parallel and comparable corpora
    e.g. Hansard, Europarl
  • Plain text and annotated corpora

1.5. History

(Computational) linguistic paradigms:
  • 1950 -- 1960: empiricism
    weak computers: frequency lists
  • 1970 -- 1980: cognitive modeling (generative approaches, artificial intelligence)
    deep analysis / "basic science": computational linguistics
  • 1990 -- ...: empiricist revival, also combined approaches
    quantity / usefulness: language technologies
  • 2000 -- ...: The Web
The history of computer corpora:
  • First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
  • The spread of reference corpora: Cobuild Bank of English (monitor, 100..200..M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) 1999...
  • Slovene language reference corpora: FIDA (100M), Nova Beseda (100M...) 1998; FIDA+ (600M) 2006.
  • EU corpus-oriented projects in the '90s: NERC, MULTEXT-East,...
  • Language resources brokers: LDC 1992, ELRA 1995

1.6. Literature on corpora

  • Corpus Linguistics by Tony McEnery and Andrew Wilson. Edinburgh: Edinburgh University Press, 1996
  • An Introduction to Corpus Linguistics by Graeme D. Kennedy. Studies in Language and Linguistics, London, 1998
  • Corpus Linguistics: Investigating Language Structure and Use by Douglas Biber, Susan Conrad, Randi Reppen. Cambridge University Press, 1998
  • Uvod v korpusno jezikoslovje, Vojko Gorjanc. Domžale: Izolit, 2005
  • LREC conferences:
    Fifth international conference on Language Resources and Evaluation, LREC'06
  • Slovenian Conferences on LANGUAGE TECHNOLOGIES 2006, 2004, 2002, 2000, 1998

1.7. Slovene language corpora

Text corpora:
  1. J. Toporišič (ed.): Besedila slovenskega jezika, 1975.
  2. P. Tancig et al. (IJS): Napadi na JNA, 1989.
  3. M. Hladnik et al. (FF): Literat, 1995--
  4. P. Jakopin et al. (ZRC): TELRI 'Plato' corpus, 1998; Beseda, 1999; Nova beseda, 1999--
  5. S. Krek et al. (DZS, Amebis, FF, IJS): FIDA, 1998--, FidaPlus, 2006
  6. T. Erjavec et al. (IJS): MULTEXT-East, 1998--, IJS-ELAN, 1999--.
  7. Š. Vintar et al. (FF): TRANS, 2002
  8. T. Erjavec et al. (IJS): SVEZ-IJS, 2004
  9. T. Erjavec et al. (IJS): SDT, 2006
  10. DSI, VoiceTran, ...

2. Compilation of corpora

2.1. Steps in the preparation of a corpus

  1. Choosing the component texts:
    linguistic and non-linguistic criteria; availability; simplicity; size
  2. Copyright
    sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication
  3. Acquiring digital originals
    Web transfer; visit; OCR
  4. Up-translation
    conversion to standard format; consistency; character set encodings
  5. Linguistic annotation
    language dependent methods; errors
  6. Documentation
    TEI header; Open Archives etc.
  7. Use / Download
    • (Web-based) concordancers for linguists
    • download needed for HLT use
    • licences for use
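
The character set part of step 4 (up-translation) can be sketched in Python as follows. This is only an illustration under stated assumptions: it supposes the digital original arrives in ISO 8859-2, and the function name `to_utf8` is hypothetical, not part of any tool mentioned in these notes.

```python
def to_utf8(raw, source_encoding="iso-8859-2"):
    """Decode legacy bytes and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# Assumed sample input: in ISO 8859-2 the letter 'ž' is the single byte 0xBE.
legacy = b"Toma\xbe Erjavec"
print(to_utf8(legacy).decode("utf-8"))
```

In practice the source encoding has to be known or guessed per text, which is one reason consistency checks belong to the same step.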

2.2. What annotation can be added to the text of the corpus?

Annotation = interpretation
  • Documentation about the corpus (example)
  • Document structure (example)
  • Basic linguistic markup: sentences, words (example), punctuation, abbreviations (example)
  • Lemmas and morphosyntactic descriptions (example)
  • Syntax (example)
  • Alignment (example)
  • Terms, semantics, anaphora, pragmatics, intonation,...

2.3. Markup Methods

  • hand annotation: documentation, first steps
    generic (XML, spreadsheet) editors or specialised editors
  • semi-automatic: morphosyntactic and other linguistic annotation
    cyclic approach: machine, hand, validate, correct, machine, ...
  • machine, with hand-written rules: tokenisation
    regular expressions
  • machine, with inductively built models from annotated data:
    "supervised learning"; HMMs, decision trees, inductive logic programming,...
  • machine, with inductively built models from un-annotated data:
    "unsupervised learning"; clustering techniques
  • overview of the field
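
As a hedged illustration of tokenisation with hand-written rules, the Python sketch below splits text into word and punctuation tokens with one regular expression. The pattern and the three-item abbreviation list are assumptions made for this example; they are not the rules of any particular tokeniser.

```python
import re

# Hypothetical abbreviation list; a real tokeniser would need a much larger,
# language-specific one.
ABBREVIATIONS = ["Dr.", "e.g.", "etc."]

# Alternatives are tried left to right: known abbreviations first,
# then runs of word characters, then any single non-space symbol.
TOKEN_RE = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS) + r"|\w+|[^\w\s]"
)

def tokenise(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)

print(tokenise("Dr. Smith arrived, e.g. by train."))
```

Listing the abbreviations before the general word pattern keeps their final full stop attached, while a sentence-final full stop still becomes a token of its own.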

2.4. Computer coding of corpora

A good encoding must ensure durability and enable interchange between computer platforms and applications.
  • The basic standard used is the Extensible Markup Language, XML
  • There are a number of companion standards and technologies: XML transformations (XSLT), data definition (DTD, XML Schema, ISO Relax NG), addressing and queries (XPath, XQuery), ...
  • The vocabulary of annotations for corpora and other language resources is defined by the Text Encoding Initiative, TEI
XML/TEI is used much more widely than just for corpora.

2.5. Examples of TEI encoding in corpora: meta-data

<teiHeader id="ecmr.H" type="text" lang="sl-en" creator="ET"
     status="update" date.created="1999-04-13" date.updated="1999-06-22" >
  <fileDesc>
  <titleStmt>
    <title lang="sl">Ekonomsko ogledalo; 13 &scaron;tevilk 98/99</title>
    <title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title>
    <respStmt>
      <name>Andrej Skubic, FF</name>
      <resp lang="sl">Zagotovitev digitalnega originala, poravnava</resp>
      <resp lang="en">Provision of digital original, alignment</resp>
      <name>Toma&zcaron; Erjavec, IJS</name>
      <resp lang="sl">Tokenizacija, pretvorba v TEI</resp>
      <resp lang="en">Tokenisation, conversion to TEI</resp>
    </respStmt>
  </titleStmt>
... 

2.6. Examples of TEI encoding in corpora: Structure of the text

<quote id="Osl.1.8.18" rend="center;it">
  <lg id="Osl.1.8.18.1">
    <l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
    <l id="Osl.1.8.18.1.2">izdala si me,</l>
    <l id="Osl.1.8.18.1.3">izdal sem te,</l>
    <l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l>
  </lg>
</quote>
<p id="Osl.1.8.19">
  <s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s>
  <s id="Osl.1.8.19.2">Toda ko je <name>Winston</name>
  znova pogledal v Rutherfordov propadli obraz, je opazil, 
da so njegove oči polne solz.</s>
... 

2.7. Examples of TEI encoding in corpora: Morphosyntactic descriptions

 
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
  <w lemma="aprilski" ana="Aopmsn">aprilski</w>
  <w lemma="dan" ana="Ncmsn">dan</w>
  <w lemma="in" ana="Ccs">in</w>
  <w lemma="ura" ana="Ncfpn">ure</w>
  <w lemma="biti" ana="Vcip3p--n">so</w>
  <w lemma="biti" ana="Vmps-pfa">bile</w>
  <w lemma="trinajst" ana="Mcnpnl">trinajst</w><c>.</c>
</s>
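
A sentence annotated in this way can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened inline copy of the sentence above; the function name `read_tokens` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sentence above, kept inline so the sketch is self-contained.
SENTENCE = """
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>
"""

def read_tokens(xml_text):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text.strip())
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]

for form, lemma, msd in read_tokens(SENTENCE):
    print(f"{form}\t{lemma}\t{msd}")
```

Punctuation in `<c>` elements is skipped here on purpose; iterating over all children of `<s>` instead would keep it.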

<fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/>
<fs id="Vcps-sman----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.n V13.n"/>
<fs id="Vcps-smay----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.y V13.n"/>
<fs id="Vcps-sna" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a"/>
<fs id="Vcps-snan----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a V8.n V13.n"/>

<fLib type="Verb">
  <f id="V0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"><sym value="Verb"/></f>
  <f id="V1.m" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="main"/></f>
  <f id="V1.a" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="auxiliary"/></f>
  <f id="V1.o" select="en ro sl cs et hr sr sl-rozaj" name="Type"><sym value="modal"/></f>
  <f id="V1.c" select="ro sl cs hr sr sl-rozaj" name="Type"><sym value="copula"/></f>
  <f id="V1.b" select="en" name="Type"><sym value="base"/></f>
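
Judging from the `<fs>` examples above, each morphosyntactic description (MSD) is a positional code: the first character gives the category, and every later non-hyphen character selects a feature named by category, position and code. The Python sketch below reconstructs the `feats` value from an MSD under that assumption. The naming scheme is inferred from the examples shown here, not taken from the full specification, and `msd_to_feature_ids` is a hypothetical helper.

```python
def msd_to_feature_ids(msd):
    """Expand a positional MSD such as 'Vcps-sma' into feature identifiers."""
    category = msd[0]
    ids = [f"{category}0."]          # position 0 is the part of speech
    for position, code in enumerate(msd[1:], start=1):
        if code != "-":              # '-' marks an attribute that is not used
            ids.append(f"{category}{position}.{code}")
    return ids

print(" ".join(msd_to_feature_ids("Vcps-sma")))
# V0. V1.c V2.p V3.s V5.s V6.m V7.a
```

Each resulting identifier can then be looked up in the feature library to obtain the attribute name and value, for example `V1.c` for Type = copula.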

2.8. Examples of TEI encoding in corpora: Alignment

<linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl">
<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"/>
<link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"/>
<link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"/>
<link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"/>
... <link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"/>
<link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7"/>
<link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"/>
... 
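
Each `xtargets` value above pairs sentence identifiers from the two texts, separated by a semicolon, with several identifiers on one side when the alignment is not one-to-one. A minimal Python sketch for reading such values, with the hypothetical helper name `parse_xtargets`:

```python
def parse_xtargets(xtargets):
    """Split an xtargets value into (source ids, target ids)."""
    source, target = xtargets.split(";")
    return source.split(), target.split()

# Two values copied from the example above.
links = [
    "Osl.1.2.2.1 ; Oen.1.1.1.1",
    "Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7",
]
for value in links:
    sl_ids, en_ids = parse_xtargets(value)
    print(f"{len(sl_ids)}:{len(en_ids)}", sl_ids, en_ids)
```

The second link comes out as a 1:2 alignment: one Slovene sentence corresponds to two English ones.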

3. Examples of use

3.1. Lexicology

  • Concordances and collocations
    “You shall know a word by the company it keeps.” (Firth, 1957)
  • Induction of multilingual lexica:
    • D. Tufiş, Ana-Maria Barbu: Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing, in International Journal of Speech Technology, Vol. 5, No. 3, 2002 Kluwer Pbls.
    • Nancy Ide, Tomaž Erjavec and Dan Tufiş: Sense Discrimination with Parallel Corpora, in Proceedings of the SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions. ACL2002, July Philadelphia 2002, pp. 56-60.
    Automatically built 7-language dictionary from '1984' corpus of EU project MULTEXT-East:
    first 100 entries
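
A concordance in its simplest form lists every occurrence of a keyword with a few words of context on each side (KWIC, keyword in context). The Python sketch below shows the idea on the Firth quotation above; the function name `kwic` and the context width are choices made for this example only.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

text = "You shall know a word by the company it keeps"
for left, token, right in kwic(text.split(), "word"):
    # Right-align the left context so that the keywords line up in a column.
    print(f"{left:>20} [{token}] {right}")
```

Counting the words that appear in the left and right contexts over a whole corpus is the starting point for finding collocations.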

3.2. Automatic translation

  • VIČIČ, Jernej, ERJAVEC, Tomaž. Statistično strojno prevajanje na osnovi vzporednih korpusov. ERK 2002, 23.-25. 2002.
The Menola translator
 
Slovene sentence:   evropi vlada veliki brat 
ELAN model:         europe government big brother 
Bible model:        evropi brother chief upright . 
Czech translation:  evropi vláda velké bratr .

3.3. Concordances at nl2.ijs.si

At nl.ijs.si we have two interfaces:
Fuzzy matching and regular expressions:
  1. Search for RE: "hoditi" (search)
  2. Search for RE: "hodi.*" (search)
  3. Search for RE: ".*hodi.*" (search)
  4. Search for RE: "[bcčdfghjklmnprsštvzž]{5,}" (search)
Show results:
  1. ".*hod.*" as frequency list (search)
  2. "prihodki" as KWIC (search)
  3. "prihodki" bi-lingual (search)
Bi-lingual searching:
  1. "prihodki" and "income" (search)
  2. "prihodki" and not "income" (search)
  3. "community" and not "skupnost" (search)
Words, lemmas and annotations:
  1. Word "iti" in '1984' (search)
  2. Lemma "iti" in '1984' (search)
  3. Lemma "iti" in '1984' as frequency list (search)
Effect of corpus:
  1. "šel" in '1984' (search) in 'VAYNA' (search) in 'GORE' (search)
  2. "okrevanje" in 'ELAN-SL' (search) and "sožitje" (search)
Multiword searches and collocations:
  1. "star* mam*" in 'ELAN-SL' (search)
  2. "* and death" in 'ELAN-EN' (search)
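
The regular expression searches and the frequency list display can be approximated offline with a few lines of Python. This is a sketch under assumptions: the contrast between "hodi.*" and ".*hodi.*" above suggests that a pattern must match the whole word, so `fullmatch` is used, and the token list is made up for illustration. It is not the implementation behind the concordancer.

```python
import re
from collections import Counter

def frequency_list(tokens, pattern):
    """Count the tokens that match the pattern in full, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter(t for t in tokens if regex.fullmatch(t))
    return counts.most_common()

# Tiny made-up token list, for illustration only.
tokens = ["prihodki", "hodi", "hoditi", "prihodki", "dan", "hodi"]
print(frequency_list(tokens, r"hodi.*"))
print(frequency_list(tokens, r".*hod.*"))
```

With whole-word matching, the first pattern finds only words that begin with "hodi", while the second also finds "prihodki".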

4. The future of corpus and data-driven linguistics

4.1. The future of corpus and data-driven linguistics

Size:
  • Larger quantities of readily accessible data (Web as corpus)
  • Larger storage and processing power (Moore's law)
Complexity:
  • Deeper analysis:
    syntax, deixis, semantic roles, dialogue acts, ...
  • Multimodal corpora:
    speech, film, transcriptions,...
  • Annotation levels and linking:
    co-existence and linking of varied types of annotations; ambiguity
  • Development of tools and platforms:
    precision, robustness, unsupervised learning, meta-learning

4.2. Development of corpus linguistics for smaller languages

  • varied, high-quality and accessible corpora
  • technology of morphosyntactic annotation / lemmatisation
  • syntactically annotated corpora (treebanks)
  • application of developed methods
  • development of curricula...