Introduction to Corpus Linguistics

Lecture notes for the JSI postgraduate school

Tomaž Erjavec
Dept. for Knowledge Technologies
Jožef Stefan Institute
Jamova 39
1000 Ljubljana

March 21st, 2007

Published at: http://nl.ijs.si/et/teach/jsi06-hlt/

1. Overview
- 1.1. What is a corpus?
- 1.2. Using corpora
- 1.3. Characteristics of a corpus
- 1.4. Typology of corpora
- 1.5. History
- 1.6. Literature on corpora
- 1.7. Slovene language corpora
2. Compilation of corpora
3. Examples of use
- 3.1. Lexicology
- 3.2. Automatic translation
- 3.3. Concordances at nl2.ijs.si
4. The future of corpus and data-driven linguistics
- 4.1. The future of corpus and data-driven linguistics
- 4.2. Development of corpus linguistics for smaller languages

1. Overview

1.1. What is a corpus?

The Collins English Dictionary (1986):
1. a collection or body of writings, esp. by a single author or topic.
Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES:
Corpus: A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
Computer corpus: A corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.

1.2. Using corpora

Research on actual language: descriptive approach, study of performance, empirical linguistics.

Applied linguistics:
- Lexicography: mono-lingual dictionaries, terminological, bi-lingual
- Language studies: hypothesis verification, knowledge discovery
  (lexis, morphology, syntax, ...)
- Translation studies: a source of translation equivalents and their contexts;
  translation memories, machine-aided translation
- Language learning: real-life examples
  "idiomatic teaching", curriculum development
Language technology:
- testing set for developed methods;
- training set for inductive learning
- (statistical Natural Language Processing)

1.3. Characteristics of a corpus

Quantity:
the bigger, the better
Quality:
the texts are authentic; the mark-up is validated
Simplicity:
the computer representation is understandable, with the markup easily separated from the text
Documented:
the corpus contains bibliographic and other meta-data

1.4. Typology of corpora

Corpora of written language, spoken and speech corpora (authenticity/price)
e.g. the catalogue of the ELRA agency
Reference corpora (representative) and sub-language corpora (specialised)
e.g. BNC, ICE, COLT
Corpora with integral texts or of text samples (historical and legal reasons)
e.g. Brown
Static and monitor corpora (language change)
Monolingual and multilingual parallel and comparable corpora
e.g. Hansard, Europarl
Plain text and annotated corpora

1.5. History

(Computational) linguistic paradigms:

1950 -- 1960: empiricism
weak computers: frequency lists
1970 -- 1980: cognitive modeling (generative approaches, artificial intelligence)
deep analysis / "basic science": computational linguistics
1990 -- ...: empiricist revival, also combined approaches
quantity / usefulness: language technologies
2000 -- ...: The Web

The history of computer corpora:

First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
The spread of reference corpora: Cobuild Bank of English (monitor, 100..200..M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) 1999...
Slovene language reference corpora: FIDA (100M), Nova Beseda (100M...) 1998; FIDA+ (600M) 2006.
EU corpus-oriented projects in the 1990s: NERC, MULTEXT-East,...
Language resources brokers: LDC 1992, ELRA 1995

1.6. Literature on corpora

Corpus Linguistics by Tony McEnery and Andrew Wilson. Edinburgh: Edinburgh University Press, 1996
An Introduction to Corpus Linguistics by Graeme D. Kennedy. Studies in Language and Linguistics, London, 1998
Corpus Linguistics: Investigating Language Structure and Use by Douglas Biber, Susan Conrad, Randi Reppen. Cambridge University Press, 1998
Uvod v korpusno jezikoslovje, Vojko Gorjanc. Domžale: Izolit, 2005
LREC conferences:
Fifth international conference on Language Resources and Evaluation, LREC'06
Slovenian Conferences on Language Technologies 2006, 2004, 2002, 2000, 1998

1.7. Slovene language corpora

Text corpora:

J. Toporišič (ed.): Besedila slovenskega jezika, 1975.
P. Tancig et al. (IJS): Napadi na JNA, 1989.
M. Hladnik et al. (FF): Literat, 1995--
P. Jakopin et al. (ZRC): TELRI 'Plato' corpus, 1998; Beseda, 1999; Nova beseda, 1999--
S. Krek et al. (DZS, Amebis, FF, IJS): FIDA, 1998--, FidaPlus, 2006
T. Erjavec et al. (IJS): MULTEXT-East, 1998--, IJS-ELAN, 1999--.
Š. Vintar et al. (FF): TRANS, 2002
T. Erjavec et al. (IJS): SVEZ-IJS, 2004
T. Erjavec et al. (IJS): SDT, 2006
DSI, VoiceTran, ...

Speech corpora:

Laboratory for Digital Signal Processing, University of Maribor:
SpeechDat, ONOMASTICA...
Laboratory of Artificial Perception, Systems and Cybernetics, University of Ljubljana:
SQEL, GOPOLIS,...

2. Compilation of corpora

2.1. Steps in the preparation of a corpus

Choosing the component texts:
linguistic and non-linguistic criteria; availability; simplicity; size
Copyright
sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication
Acquiring digital originals
Web transfer; visit; OCR
Up-translation
conversion to standard format; consistency; character set encodings
Linguistic annotation
language dependent methods; errors
Documentation
TEI header; Open Archives etc.
Use / Download
- (Web-based) concordancers for linguists
- download needed for HLT use
- licences for use
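
As a small illustration of the character set issue in the up-translation step above, the following Python sketch converts text from a legacy encoding to UTF-8. The choice of Windows-1250 and ISO 8859-2 as source encodings and the function name to_utf8 are assumptions made for this example; they do not describe how any particular corpus was converted.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a declared legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


name = "Tomaž Erjavec"
legacy = name.encode("cp1250")  # bytes as a Windows-1250 source would store them

# Correct declaration: the text survives the conversion.
converted = to_utf8(legacy, "cp1250").decode("utf-8")

# Wrong declaration: the conversion raises no error but corrupts the ž,
# which is why the source encoding has to be documented and checked.
damaged = to_utf8(legacy, "iso-8859-2").decode("utf-8")

print(converted == name, damaged == name)
```

The second conversion shows why consistency checks are needed: a wrongly declared encoding is not reported by the conversion itself.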

2.2. What annotation can be added to the text of the corpus?

Annotation = interpretation

Documentation about the corpus (example)
Document structure (example)
Basic linguistic markup: sentences, words (example), punctuation, abbreviations (example)
Lemmas and morphosyntactic descriptions (example)
Syntax (example)
Alignment (example)
Terms, semantics, anaphora, pragmatics, intonation,...

2.3. Markup Methods

hand annotation: documentation, first steps
generic (XML, spreadsheet) editors or specialised editors
semi-automatic: morphosyntactic and other linguistic annotation
cyclic approach: machine, hand, validate, correct, machine, ...
machine, with hand-written rules: tokenisation
regular expression
machine, with inductively built models from annotated data:
"supervised learning"; HMMs, decision trees, inductive logic programming,...
machine, with inductively built models from un-annotated data:
"unsupervised learning"; clustering techniques
overview of the field
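
The rule-based tokenisation with regular expressions mentioned above can be sketched in Python as follows. The rule set (a handful of abbreviations, numbers, words, single punctuation characters) and the name tokenise are hypothetical and chosen only for illustration; a production tokeniser needs many more language-dependent rules.

```python
import re

# Hypothetical hand-written rules. Alternatives are tried in order,
# so abbreviations are recognised before plain words and punctuation.
TOKEN_RE = re.compile(
    r"""
    (?P<abbr>(?:[^\W\d_]\.){2,}|(?:dr|prof|npr|itd)\.)  # initials and listed abbreviations
  | (?P<num>\d+(?:[.,]\d+)*)                            # integers and decimals
  | (?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)                  # letters, including non-ASCII ones
  | (?P<punct>[^\w\s])                                  # any single punctuation character
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenise(text):
    """Return (token type, token) pairs in text order."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


for kind, token in tokenise("Dr. Novak je leta 1998 zbral 3,5 milijona besed."):
    print(kind, token)
```

The order of the alternatives is the main design choice: because the abbreviation rule comes first, the full stop in "Dr." stays attached to the token, while the sentence-final full stop is emitted as punctuation.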

2.4. Computer coding of corpora

A good encoding must ensure durability and enable interchange between computer platforms and applications.

The basic standard used is the Extensible Markup Language, XML
There are a number of companion standards and technologies: XML transformations (XSLT), data definition (DTD, XML Schema, ISO RELAX NG), addressing and queries (XPath, XQuery), ...
The vocabulary of annotations for corpora and other language resources is defined by the Text Encoding Initiative, TEI

XML/TEI is used much more widely than just for corpora:

documentation: these slides
annotation of dictionaries: English-Slovene, Japanese-Slovene (from jaSlo)
for annotating text-critical editions

2.5. Examples of TEI encoding in corpora: meta-data

<teiHeader id="ecmr.H" type="text" lang="sl-en" creator="ET"
     status="update" date.created="1999-04-13" date.updated="1999-06-22" >
  <fileDesc>
  <titleStmt>
    <title lang="sl">Ekonomsko ogledalo; 13 &scaron;tevilk 98/99</title>
    <title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title>
    <respStmt>
      <name>Andrej Skubic, FF</name>
      <resp lang="sl">Zagotovitev digitalnega originala, poravnava</resp>
      <resp lang="en">Provision of digital original, alignment</resp>
      <name>Toma&zcaron; Erjavec, IJS</name>
      <resp lang="sl">Tokenizacija, pretvorba v TEI</resp>
      <resp lang="en">Tokenisation, conversion to TEI</resp>
    </respStmt>
  </titleStmt>
...

2.6. Examples of TEI encoding in corpora: Structure of the text

<quote id="Osl.1.8.18" rend="center;it">
  <lg id="Osl.1.8.18.1">
    <l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
    <l id="Osl.1.8.18.1.2">izdala si me,</l>
    <l id="Osl.1.8.18.1.3">izdal sem te,</l>
    <l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l>
  </lg>
</quote>
<p id="Osl.1.8.19">
  <s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s>
  <s id="Osl.1.8.19.2">Toda ko je <name>Winston</name>
  znova pogledal v Rutherfordov propadli obraz, je opazil, 
da so njegove oči polne solz.</s>
...

2.7. Examples of TEI encoding in corpora: Morphosyntactic descriptions

 
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
  <w lemma="aprilski" ana="Aopmsn">aprilski</w>
  <w lemma="dan" ana="Ncmsn">dan</w>
  <w lemma="in" ana="Ccs">in</w>
  <w lemma="ura" ana="Ncfpn">ure</w>
  <w lemma="biti" ana="Vcip3p--n">so</w>
  <w lemma="biti" ana="Vmps-pfa">bile</w>
  <w lemma="trinajst" ana="Mcnpnl">trinajst</w><c>.</c>
</s>

<fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/>
<fs id="Vcps-sman----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.n V13.n"/>
<fs id="Vcps-smay----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.y V13.n"/>
<fs id="Vcps-sna" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a"/>
<fs id="Vcps-snan----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a V8.n V13.n"/>

<fLib type="Verb">
  <f id="V0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"><sym value="Verb"/></f>
  <f id="V1.m" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="main"/></f>
  <f id="V1.a" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="auxiliary"/></f>
  <f id="V1.o" select="en ro sl cs et hr sr sl-rozaj" name="Type"><sym value="modal"/></f>
  <f id="V1.c" select="ro sl cs hr sr sl-rozaj" name="Type"><sym value="copula"/></f>
  <f id="V1.b" select="en" name="Type"><sym value="base"/></f>
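
To show how such annotation can be read by a program, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The sentence is abridged from the example above, and the feature library is reduced to two of the entries shown and closed so that it parses. The function names words, feature_table and expand are hypothetical.

```python
import xml.etree.ElementTree as ET

SENTENCE = """<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
  <w lemma="dan" ana="Ncmsn">dan</w><c>.</c>
</s>"""

FS = '<fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/>'

FLIB = """<fLib type="Verb">
  <f id="V0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"><sym value="Verb"/></f>
  <f id="V1.c" select="ro sl cs hr sr sl-rozaj" name="Type"><sym value="copula"/></f>
</fLib>"""


def words(sentence_xml):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(sentence_xml)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


def feature_table(flib_xml):
    """Map feature ids such as 'V1.c' to (attribute name, value) pairs."""
    table = {}
    for f in ET.fromstring(flib_xml).iter("f"):
        table[f.get("id")] = (f.get("name"), f.find("sym").get("value"))
    return table


def expand(fs_xml, table):
    """Resolve the feature ids of one <fs>; unknown ids are left unresolved."""
    fs = ET.fromstring(fs_xml)
    return fs.get("id"), [table.get(fid, fid) for fid in fs.get("feats").split()]


print(words(SENTENCE))
print(expand(FS, feature_table(FLIB)))
```

Feature ids that are not defined in the reduced library (V2.p, V3.s and so on) are returned unchanged rather than guessed.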

2.8. Examples of TEI encoding in corpora: Alignment

<linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl">
<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"/>
<link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"/>
<link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"/>
<link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"/>
...
<link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"/>
<link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7"/>
<link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"/>
...
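
The alignment links can be turned into sentence-id pairs with a few lines of Python. The sketch below is only an illustration: it reads the xtargets attribute with a regular expression rather than an XML parser, so that it also accepts fragments that are not complete XML documents, and the function names read_links and alignment_types are hypothetical.

```python
import re

LINKS = """
<link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"/>
<link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7"/>
<link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"/>
"""


def read_links(text):
    """Return (source ids, target ids) for every xtargets attribute."""
    pairs = []
    for value in re.findall(r'xtargets="([^"]*)"', text):
        # The two sides of a link are separated by a semicolon;
        # each side holds one or more space-separated sentence ids.
        source, target = (side.split() for side in value.split(";"))
        pairs.append((source, target))
    return pairs


def alignment_types(pairs):
    """Summarise each link as 'n-m' (n source sentences, m target sentences)."""
    return [f"{len(src)}-{len(tgt)}" for src, tgt in pairs]


pairs = read_links(LINKS)
print(pairs)
print(alignment_types(pairs))
```

The second link is a 1-2 alignment: one Slovene sentence corresponds to two English sentences.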

3. Examples of use

3.1. Lexicology

Concordances and collocations
“You shall know a word by the company it keeps.” (Firth, 1957)
Induction of multilingual lexica:
- D. Tufiş, Ana-Maria Barbu: Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing, in International Journal of Speech Technology, Vol. 5, No. 3, 2002, Kluwer Academic Publishers.
- Nancy Ide, Tomaž Erjavec and Dan Tufiş: Sense Discrimination with Parallel Corpora, in Proceedings of the SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions. ACL 2002, Philadelphia, July 2002, pp. 56-60.
Automatically built 7-language dictionary from '1984' corpus of EU project MULTEXT-East:
first 100 entries
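
The concordances and collocations named at the start of this section can be sketched in a few lines of Python. The token list is a toy example, the function names kwic and collocates are hypothetical, and the collocates are ranked by raw co-occurrence counts only, which is a simplification.

```python
from collections import Counter

TOKENS = "the party and big brother watch ; big brother is watching you".split()


def kwic(keyword, tokens, width=2):
    """Key Word In Context: (left context, keyword, right context) per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(keyword, tokens, span=1):
    """Count the tokens occurring within `span` positions of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


for left, key, right in kwic("brother", TOKENS):
    print(f"{left:>12} [{key}] {right}")
print(collocates("brother", TOKENS).most_common())
```

The most frequent neighbour of "brother" in this toy text is "big", which is the kind of "company" the Firth quotation refers to.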

3.2. Automatic translation

VIČIČ, Jernej, ERJAVEC, Tomaž. Statistično strojno prevajanje na osnovi vzporednih korpusov [Statistical machine translation based on parallel corpora]. ERK 2002, 23.-25. 2002.

The Menola translator

 
Slovene sentence:   evropi vlada veliki brat 
ELAN model:         europe government big brother 
Bible model:        evropi brother chief upright . 
Czech translation:  evropi vláda velké bratr .

3.3. Concordances at nl2.ijs.si

At nl.ijs.si we have two interfaces:

Fuzzy matching and regular expressions:

Search for RE: "hoditi" (search)
Search for RE: "hodi.*" (search)
Search for RE: ".*hodi.*" (search)
Search for RE: "[bcčdfghjklmnprsštvzž]{5,}" (search)
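
The effect of the four expressions above can be tried out in Python. This is a sketch under the assumption that a pattern has to match a whole token, which would explain why "hodi.*" and ".*hodi.*" give different results. The word list is a toy example and the function name search is hypothetical; the actual concordancer may behave differently.

```python
import re

WORDS = ["hoditi", "hodi", "hodim", "prihodi", "prihodki", "prst", "čmrlj"]


def search(pattern, tokens):
    """Return the tokens that the regular expression matches in full."""
    rx = re.compile(pattern)
    return [t for t in tokens if rx.fullmatch(t)]


for pattern in ("hoditi", "hodi.*", ".*hodi.*", "[bcčdfghjklmnprsštvzž]{5,}"):
    print(pattern, search(pattern, WORDS))
```

Under the whole-token assumption the last expression finds words that consist of five or more consonants and nothing else.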

Show results:

".*hod.*" as frequency list (search)
"prihodki" as KWIC (search)
"prihodki" bi-lingual (search)

Bi-lingual searching:

"prihodki" and "income" (search)
"prihodki" and not "income" (search)
"community" and not "skupnost" (search)

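The bi-lingual queries above combine a condition on the Slovene side with a condition on the English side of an aligned segment pair. A minimal Python sketch of this idea follows; the four segment pairs are invented toy data, and the function name search_pairs and its parameters are hypothetical.

```python
# Invented toy data: (Slovene segment, English segment) pairs.
PAIRS = [
    ("prihodki so se povečali", "income has increased"),
    ("prihodki od prodaje", "revenues from sales"),
    ("evropska skupnost", "european community"),
    ("lokalno prebivalstvo", "the local community"),
]


def contains(word, segment):
    """True if the word occurs as a whole token in the segment."""
    return word in segment.lower().split()


def search_pairs(pairs, sl=None, en=None, not_sl=None, not_en=None):
    """Keep pairs containing the wanted words and none of the excluded ones."""
    hits = []
    for sl_seg, en_seg in pairs:
        wanted = (sl is None or contains(sl, sl_seg)) and \
                 (en is None or contains(en, en_seg))
        excluded = (not_sl is not None and contains(not_sl, sl_seg)) or \
                   (not_en is not None and contains(not_en, en_seg))
        if wanted and not excluded:
            hits.append((sl_seg, en_seg))
    return hits


print(search_pairs(PAIRS, sl="prihodki", en="income"))
print(search_pairs(PAIRS, sl="prihodki", not_en="income"))
print(search_pairs(PAIRS, en="community", not_sl="skupnost"))
```

The "and not" queries are the interesting ones for lexicography, since they surface the less expected translation equivalents.
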
Words, lemmas and annotations:

Word "iti" in '1984' (search)
Lemma "iti" in '1984' (search)
Lemma "iti" in '1984' as frequency list (search)
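
The difference between searching by word form and by lemma, and the frequency-list view of the results, can be illustrated with a short Python sketch. The (word form, lemma) pairs are a toy sample and the function names by_word and by_lemma are hypothetical.

```python
from collections import Counter

# Toy sample of (word form, lemma) pairs from a lemmatised text.
TAGGED = [
    ("šel", "iti"), ("je", "biti"), ("gre", "iti"),
    ("iti", "iti"), ("šel", "iti"), ("bil", "biti"),
]


def by_word(form, tagged):
    """Word forms that are identical to the query."""
    return [w for w, _ in tagged if w == form]


def by_lemma(lemma, tagged):
    """All word forms whose lemma equals the query."""
    return [w for w, l in tagged if l == lemma]


print(by_word("iti", TAGGED))                          # only the infinitive
print(by_lemma("iti", TAGGED))                         # every inflected form
print(Counter(by_lemma("iti", TAGGED)).most_common())  # frequency list
```

For a highly inflected language such as Slovene, the lemma query retrieves suppletive forms like "šel" and "gre" that a search for the word "iti" cannot find.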

Effect of corpus:

"šel" in '1984' (search) in 'VAYNA' (search) in 'GORE' (search)
"okrevanje" in 'ELAN-SL' (search) and "sožitje" (search)

Multiword searches and collocations:

"star* mam*" in 'ELAN-SL' (search)
"* and death" in 'ELAN-EN' (search)

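Multiword queries with wildcards, as in the last two examples of section 3.3, can be approximated in Python as follows. The sketch assumes that each space-separated item of the query matches exactly one token and that "*" stands for any sequence of characters; the real query syntax of the concordancer may differ. The token lists are toy data and the function names are hypothetical.

```python
import re
from collections import Counter


def compile_query(query):
    """Turn each query item into a regular expression; '*' becomes '.*'."""
    return [re.compile(re.escape(part).replace(r"\*", ".*"))
            for part in query.split()]


def find_sequences(query, tokens):
    """Return every run of adjacent tokens matching the query items in order."""
    patterns = compile_query(query)
    n = len(patterns)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(p.fullmatch(t) for p, t in zip(patterns, window)):
            hits.append(" ".join(window))
    return hits


SL = "moja stara mama in tvoja stara mama poznata staro mamo".split()
EN = "life and death , love and death , life and death".split()

print(find_sequences("star* mam*", SL))
hits = find_sequences("* and death", EN)
print(Counter(h.split()[0] for h in hits).most_common())
```

Counting the token that fills the "*" slot gives a simple list of the words that occur with "and death" in the toy text.
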
4. The future of corpus and data-driven linguistics

4.1. The future of corpus and data-driven linguistics

Size:

Larger quantities of readily accessible data (Web as corpus)
Larger storage and processing power (Moore's law)

Complexity:

Deeper analysis:
syntax, deixis, semantic roles, dialogue acts, ...
Multimodal corpora:
speech, film, transcriptions,...
Annotation levels and linking:
co-existence and linking of varied types of annotations; ambiguity
Development of tools and platforms:
precision, robustness, unsupervised learning, meta-learning

4.2. Development of corpus linguistics for smaller languages

varied, high-quality and accessible corpora
technology of morphosyntactic annotation / lemmatisation
syntactically annotated corpora (treebanks)
application of developed methods
development of curricula...