<?xml version="1.0" encoding="ISO-8859-2"?>
<!DOCTYPE TEI.2 SYSTEM 'teixlite.dtd' [
<!--
  <?xml-stylesheet type="text/xsl" href="tlslides.xsl"?>
-->
  <!NOTATION XML SYSTEM 'http://www.w3.org/MarkUp/'>

  <!ENTITY nbsp   "&#160;"    >    

  <!ENTITY larr   "&#x2190;" ><!--/leftarrow /gets A: =leftward arrow-->
  <!ENTITY rarr   "&#x2192;" ><!--/rightarrow /to A: =rightward arrow-->
  <!ENTITY uarr   "&#x2191;" ><!--/uparrow A: =upward arrow-->
  <!ENTITY darr   "&#x2193;" ><!--/downarrow A: =downward arrow-->

]>
<TEI.2>
  <teiHeader creator="et" status="update" date.created="2004-04-19" date.updated="2007-03-20">
    <fileDesc>
      <titleStmt>
        <title>Introduction to Corpus Linguistics</title>
        <author>
          <name>
            <xref url="http://nl.ijs.si/et/">Tomaž Erjavec</xref>
          </name>
          <address>
            <addrLine>Dept. for Knowledge Technologies</addrLine>
            <addrLine>Jožef Stefan Institute</addrLine>
            <addrLine>Jamova 39</addrLine>
            <addrLine>1000 Ljubljana</addrLine>
          </address>
        </author>
      </titleStmt>
      <publicationStmt>
        <pubPlace>
          <xref url="http://nl.ijs.si/et/teach/jsi06-hlt/">http://nl.ijs.si/et/teach/jsi06-hlt/</xref>
        </pubPlace>
      </publicationStmt>
      <sourceDesc>
        <p>Created in electronic form.</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language id="en">English</language>
        <language id="sl">Slovene</language>
        <language id="xml">XML</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text lang="en">
    <front>
      <titlePage>
        <docTitle>
          <titlePart>Introduction to Corpus Linguistics</titlePart>
          <titlePart>Lecture notes for the JSI postgraduate school</titlePart>
        </docTitle>
        <docAuthor>
          <name>
            <xref url="http://nl.ijs.si/et/">Tomaž Erjavec</xref>
          </name>
          <address>
            <addrLine>Dept. for Knowledge Technologies</addrLine>
            <addrLine>Jožef Stefan Institute</addrLine>
            <addrLine>Jamova 39</addrLine>
            <addrLine>1000 Ljubljana</addrLine>
          </address>
        </docAuthor>
        <docDate>March 21st, 2007</docDate>
        <docImprint>
          <pubPlace>
          <xref url="http://nl.ijs.si/et/teach/jsi06-hlt/">http://nl.ijs.si/et/teach/jsi06-hlt/</xref>
          </pubPlace>
        </docImprint>
      </titlePage>
    </front>
    <body>
      <div1>
        <head>Overview</head>
        <!--
      <div2>
      <head>Overview of the talk</head>
	
      <list type="ordered">
      <item><ref target="what">what is a corpus?</ref>:
      <ref target="qualia">corpus qualities</ref> and
      <ref target="typo">typology</ref></item>
      <item><ref target="history">history</ref></item>
      <item><ref target="slko">Slovene language corpora</ref></item>
      <item><ref target="use">using corpora</ref></item>
      <item><ref target="exptools">computer tools</ref></item>
      <item><ref target="markup">corpus markup</ref></item>
      <item><ref target="tryit">on-line concordances</ref></item>
      <item><ref target="link">literature</ref></item>
      </list>
      </div2>
	-->
        <div2 id="what">
          <head>What is a corpus?</head>
          <p>
            <list>
              <item>The Collins English Dictionary (1986):<lb/>
                <hi>1. a collection or body of writings, esp. by a single author or topic.</hi>
              </item>
              <item>Guidelines of the Expert Advisory Group on Language Engineering Standards, <xref
                  url="http://www.ilc.cnr.it/EAGLES96/home.html">EAGLES</xref>:<lb/>
                <hi>

                  <xref url="eagles-corpus.html">
                    <hi rend="bold">Corpus</hi>
                  </xref>: A collection of pieces of language that are selected and ordered
                  according to explicit linguistic criteria in order to be used as a sample of the language.<lb/>
                  <xref url="eagles-corpus.html">
                    <hi rend="bold">Computer corpus</hi>
                  </xref>: a corpus which is encoded in a standardised and homogeneous way for
                  open-ended retrieval tasks.
Its constituent pieces of language are documented as to their origins and provenance. 
</hi>
              </item>
            </list>
          </p>
        </div2>
        <div2 id="use">
          <head>Using corpora</head>
          <p>Research on <hi>actual</hi> language: descriptive approach, study of performance, empirical linguistics.<list>
              <item>Applied linguistics:
<list>
                <item><hi>Lexicography</hi>:  mono-lingual dictionaries, terminological, bi-lingual</item>
              <item>
                <hi>Language studies</hi>: hypothesis verification, 
                  knowledge discovery<lb/> (lexis, morphology, syntax, ...)</item>
              <item>
                <hi>Translation studies</hi>: a source of translation equivalents and their contexts<lb/> 
                  translation memories, machine-aided translation</item>
              <item>
                <hi>Language learning</hi>: real-life examples<lb/> "idiomatic teaching", curriculum development</item>
</list>
</item>
              <item><hi>Language technology</hi>: 
<list>
<item>testing set for developed methods;</item>
<item><hi>training set</hi> for inductive learning </item>
<item>(<xref url="http://www-nlp.stanford.edu/links/statnlp.html#Syllabi">statistical Natural Language Processing</xref>)
</item>
</list>
               </item>
            </list>
          </p>
        </div2>
        <div2 id="qualia">
          <head>Characteristics of a corpus</head>
          <list type="ordered">
            <item>
              <hi>Quantity</hi>:<lb/> the bigger, the better</item>
            <item>
              <hi>Quality</hi>:<lb/> the texts are authentic; the mark-up is validated</item>
            <item>
              <hi>Simplicity</hi>:<lb/> the computer representation
              is understandable, with the markup easily 
              separated from the text</item>
            <item>
              <hi>Documented</hi>:<lb/> the corpus contains bibliographic and other meta-data</item>
          </list>
        </div2>
        <div2 id="typo">
          <head>Typology of corpora</head>
          <list>
            <item>Corpora of <hi>written language</hi>, <hi>spoken</hi> and <hi>speech</hi> corpora
              (authenticity/price)<lb/> e.g. the 
              agency <xref url="http://www.elra.info/">ELRA</xref>
              <xref url="elra-telephone.html">catalog</xref>
            </item>
            <item>
              <hi>Reference</hi> corpora (representative) and <hi>sub-language corpora</hi>
              (specialised)<lb/> e.g. <xref url="http://info.ox.ac.uk/bnc/">BNC</xref>, <xref
                url="http://www.ucl.ac.uk/english-usage/ice/index.htm">ICE</xref>, <xref
                url="http://torvald.aksis.uib.no/colt/">COLT</xref>
            </item>
            <item>Corpora with <hi>integral</hi>
              texts or of text <hi>samples</hi> (historical and legal reasons)<lb/> e.g.
                <xref url="http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM">Brown</xref>
            </item>
            <item>
              <hi>Static</hi> and <hi>monitor</hi> corpora (language change) </item>
            <item>
              <hi>Monolingual</hi> and multilingual <hi>parallel</hi> and <hi>comparable</hi>
              corpora<lb/> e.g. <xref
                url="http://www.ldc.upenn.edu/Catalog/LDC95T20.html">Hansard</xref>,
                <xref url="http://people.csail.mit.edu/koehn/publications/europarl/">Europarl</xref>
            </item>
            <item><hi>Plain text</hi> and <hi>annotated</hi> corpora</item>
          </list>
        </div2>
        <div2 id="history">
          <head>History</head>
          <p>(Computational) linguistic paradigms: <list>
              <item>1950 -- 1960: empiricism<lb/> weak computers: frequency lists</item>
              <item>1970 -- 1980: cognitive modeling (generative approaches, artificial
                intelligence)<lb/> deep analysis / "basic science": computational linguistics</item>
              <item>1990 -- ...: empiricist revival, also combined approaches<lb/> quantity /
                usefulness: language technologies</item>
              <item>2000 -- ...: The Web</item>
            </list>
          </p>
          <p> The history of  computer corpora: <list>
              <item>First milestones: 
                <xref url="http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM">Brown</xref>
                (1 million words) 1964; 
                <xref url="http://khnt.hit.uib.no/icame/manuals/lob/INDEX.HTM">LOB</xref> (also 1M) 1974</item>
              <item>The spread of reference corpora:
                  Cobuild Bank of English (monitor, 100..200..M) 1980;
                  <xref url="http://info.ox.ac.uk/bnc/">BNC</xref>
                (100M) 1995; Czech <xref url="http://ucnk.ff.cuni.cz/">CNC</xref> (100M) 1998;
                Croatian <xref
url="http://www.hnk.ffzg.hr/">HNK</xref> (100M) 1999...</item>
<item>Slovene language reference corpora: 
<xref url="http://www.fida.net/">FIDA</xref> (100M), 
<xref url="http://bos.zrc-sazu.si/">Nova Beseda</xref> (100M...) 1998; 
<xref url="http://www.fidaplus.net/">FIDA+</xref> (600M) 2006.
</item>
              <item>EU corpus-oriented projects in the '90s: 
                  NERC, <xref url="http://nl.ijs.si/ME/">MULTEXT-East</xref>,...
              </item>
              <item>Language resource brokers: <xref url="http://www.ldc.upenn.edu">LDC</xref> 1992, <xref
                  url="http://www.elra.info/">ELRA</xref> 1995</item>
            </list>
          </p>
        </div2>
        <div2 id="public">
          <head>Literature on corpora</head>
          <list>
<item><hi>Corpus Linguistics</hi> by Tony McEnery and Andrew Wilson. Edinburgh: Edinburgh
              University Press, 1996</item>
<item><hi>An Introduction to Corpus Linguistics</hi> by Graeme D. Kennedy.
              Studies in Language and Linguistics, London, 1998</item>
<item><hi>Corpus Linguistics: Investigating
                Language Structure and Use</hi> by Douglas Biber, Susan Conrad, Randi Reppen.
              Cambridge University Press, 1998</item>
<item>Uvod v korpusno jezikoslovje, Vojko Gorjanc. Domžale: Izolit, 2005</item>

            <item>LREC conferences:<lb/>
              Fifth international conference on Language Resources and
                  Evaluation, <xref url="http://www.lrec-conf.org/lrec2006/">LREC'06</xref>
            </item>
              <item>Slovenian Conferences on LANGUAGE TECHNOLOGIES 
<xref url="http://nl.ijs.si/is-ltc06/"
                  >2006</xref>,
<xref url="http://nl.ijs.si/isjt04/"
                  >2004</xref>,<xref
url="http://nl.ijs.si/isjt02/">2002</xref>, <xref
                  url="http://nl.ijs.si/isjt00/">2000</xref>, <xref url="http://nl.ijs.si/isjt98/"
                  >1998</xref>
              </item>
          </list>
        </div2>
        <div2 id="slko">
          <head>Slovene language corpora</head>
          <p>Text corpora:<list type="ordered">
              <item>J. Toporišič (ed.): <hi>Besedila slovenskega jezika</hi>, 1975.</item>
              <item>P. Tancig et al. (IJS): <xref url="vayna-hdr.html">Napadi na JNA</xref>, 1989.</item>
              <item>M. Hladnik et al. (FF): <xref url="http://www.ijs.si/lit/leposl.html-l2"
                >Literat</xref>, 1995--</item>
              <item>P. Jakopin et al. (ZRC): <xref url="http://nl.ijs.si/telri/Republic/">TELRI
                  'Plato' corpus</xref>, 1998;
                <xref url="http://bos.zrc-sazu.si/">Beseda</xref>,
                 1999; Nova beseda, 1999--</item>
              <item>S. Krek et al. (DZS, Amebis, FF, IJS): <xref url="http://www.fida.net/"
                >FIDA</xref>, 1998--, <xref url="http://www.fidaplus.net/">FidaPlus</xref>, 2006</item>
              <item>T. Erjavec et al. (IJS): <xref url="http://nl.ijs.si/ME/">MULTEXT-East</xref>,
                1998--, <xref url="http://nl.ijs.si/elan/#corpus">IJS-ELAN</xref>, 1999--.</item>
              <item>Š. Vintar et al. (FF): <xref url="http://www-ai.ijs.si/~spela/trans-index.html"
                  >TRANS</xref>, 2002</item>
              <item>T. Erjavec et al. (IJS): <xref url="http://nl.ijs.si/svez/">SVEZ-IJS</xref>, 2004</item>
              <item>T. Erjavec et al. (IJS): <xref url="http://nl.ijs.si/sdt/">SDT</xref>, 2006</item>
              <item>DSI, VoiceTran, ...</item>
            </list>
          </p>
          <p>Speech corpora:<list>
              <item>
                <xref url="http://www.dsplab.uni-mb.si/">Laboratory for Digital Signal Processing, University of Maribor</xref>:<lb/> SpeechDat, ONOMASTICA...</item>
              <item>
                <xref url="http://luks.fe.uni-lj.si/">Laboratory of Artificial Perception, Systems and Cybernetics, University of Ljubljana</xref>:<lb/> SQEL, GOPOLIS,...</item>
            </list>
          </p>
        </div2>
        <!--div2 id="slkoj">
          <head>Corpus linguistics in Slovenia</head>
          <p>Conferences:<list>
              <item>Conferences LANGUAGE TECHNOLOGIES 
<xref url="http://nl.ijs.si/is-ltc06/"
                  >2006</xref>
<xref url="http://nl.ijs.si/isjt04/"
                  >2004</xref>,<xref url="http://nl.ijs.si/isjt02/">2002</xref>,<xref
                  url="http://nl.ijs.si/isjt00/">2000</xref>,<xref url="http://nl.ijs.si/isjt98/"
                  >1998</xref>
              </item>
              <item> 13th intl.congres of Slavists <xref
                  url="http://www.ff.uni-lj.si/msk/program/p7-tematski.htm">Tematic block "Corpus
                  Linguistics for Slavic Languages"</xref>, 17. August 2003, Cankarjev Dom,
                Ljubljana.</item>
              <item>
                <xref url="http://www.telri.de/telri2/seminar/5th/">5th TELRI Seminar</xref>: Corpus
                Linguistics: How to Extract Meaning from Corpora<lb/> 22. - 24. September 2000,
                Filozofska Fakulteta, Ljubljana.</item>
              <item>
                <xref url="http://nl.ijs.si/eamt00/">EAMT 2000</xref>: European Association for
                Machine Translation Workshop<lb/> 10. - 12. maja 2000, Austrotel, Ljubljana.</item>
              <item>
                <xref url="http://www2.arnes.si/~svinta/workshop.htm">Workshop on Language
                  Technologies - Multilingual Aspects</xref>
                <lb/> 8. - 9. julija 1999, FF, Ljubljana</item>
              <item>
                <xref url="http://www-ai.ijs.si/SasoDzeroski/ICML99/main.html">ICML'99</xref>: 16th
                Int. Conference on Machine Learning<lb/> 30. junija 1999, Bled<list>
                  <item>
                    <xref url="http://www-ai.ijs.si/DunjaMladenic/ICML99/">Workshop on Machine
                      Learning in Text Data Analysis</xref>
                  </item>
                  <item>
                    <xref url="http://www.cs.york.ac.uk/mlg/lll/workshop/">Learning Language in
                      Logic (LLL) Workshop</xref>
                  </item>
                </list>
              </item>
            </list>
          </p>
          <p> Teaching (FF, Ljubljani University):<list>
              <item>Izbirni predmet za filologe na FF:<xref
                  url="http://www.ff.uni-lj.si/hp/pj/seminar/bes_in_rac.html">
                  <hi>Besedilo in računalniki</hi>
                </xref>
              </item>
              <item>
                <xref url="http://www.ff.uni-lj.si/primjez/jezikoslovje.htm">Oddelek za splošno in
                  primerjalno jezikoslovje</xref>
              </item>
              <item>
                <xref url="http://www.ff.uni-lj.si/prevajanje/">Oddelek za prevajanje in
                tolmačenje</xref>, FF, Ljubljani University (Š. Vintar)</item>
            </list>
          </p>
        </div2>
-->
      </div1>
      <div1>
        <head>Compilation of corpora</head>
        <div2 id="exptools">
          <head>Steps in the preparation of a corpus</head>
          <list type="ordered">
            <item>Choosing the component texts:<lb/> linguistic and non-linguistic criteria; availability; simplicity; size </item>
            <item>Copyright<lb/> sensitivity of source (financial and
            privacy considerations); agreement with providers; usage,
            publication</item>
            <item>Acquiring digital originals<lb/> Web transfer;  visit; OCR</item>
            <item>Up-translation<lb/> conversion to standard format; consistency; character set encodings</item>
            <item>Linguistic annotation<lb/> language dependent methods; errors</item>
            <item>Documentation<lb/> TEI header; Open Archives etc.</item>
            <item>Use / Download
<list>
<item>(Web-based) concordancers for linguists</item>
<item>download needed for HLT use</item>
<item>licences for use</item>
</list>
</item>
          </list>
        </div2>
        <div2 id="markup">
          <head>What annotation can be added to the text of the corpus?</head>
          <p>Annotation = interpretation</p>
          <list>
            <item>Documentation about the corpus (<xref url="mte-cesana-hdrs.html#sourceDesc"
              >example</xref>)</item>
            <item>Document structure (<xref url="mteosm-ro.html">example</xref>)</item>
            <item>Basic linguistic markup: sentences, words (<xref url="glass.html"
              >example</xref>), punctuation, abbreviations (<xref url="sample.html#ecmr">example</xref>)</item>
             <item>Lemmas and morphosyntactic descriptions (<xref url="svez-tst.xls">example</xref>)</item>
            <item>Syntax (<xref url="SDT.bmp">example</xref>)</item>
            <item>Alignment (<xref url="http://nl.ijs.si/elan/sample.html#ijs-elan-sample.t"
              >example</xref>)</item>
            <item>Terms, semantics, anaphora, pragmatics, intonation,...</item>
          </list>
        </div2>

        <div2>
          <head>Markup Methods</head>
          <list>
            <item>
              <hi>hand annotation</hi>: documentation, first steps<lb/> 
              generic (XML, spreadsheet) editors or specialised editors</item>
            <item>
              <hi>semi-automatic</hi>: morphosyntactic and other linguistic annotation<lb/> cyclic approach:
              machine, hand, validate, correct, machine, ...</item>
            <item>
              <hi>machine, with hand-written rules</hi>: tokenisation<lb/> regular expressions</item>
            <item>
              <hi>machine, with inductively built models from annotated data</hi>: <lb/> 
                "supervised learning"; HMMs, decision trees, inductive logic programming,...
              </item>
            <item>
              <hi>machine, with inductively built models from un-annotated data</hi>: 
              <lb/> "unsupervised learning"; clustering techniques</item>
            <item>
              <xref url="http://www-nlp.stanford.edu/links/statnlp.html">overview of the field</xref>
            </item>
          </list>
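          <p>As a rough illustration of rule-based tokenisation with regular expressions, the
            following Python sketch splits a text into word and punctuation tokens. The pattern and
            the function name <hi>tokenise</hi> are assumptions made for this example; they are not
            the rules of any particular tokeniser, and real tokenisers also need to handle
            abbreviations, numbers and similar cases.</p>
          <p>
<eg><![CDATA[
```python
import re

# Hypothetical pattern: a run of word characters (optionally joined by
# internal hyphens or apostrophes), or else a single non-space symbol.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Return the word and punctuation tokens found in text, in order."""
    return TOKEN_RE.findall(text)


tokens = tokenise("Bil je jasen, mrzel aprilski dan.")
print(tokens)
```
]]></eg>
          </p>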
        </div2>
        <div2>
          <head>Computer coding of corpora</head>
          <p>A good encoding must ensure durability and enable interchange between computer platforms and applications:
              <list>
              <item>The basic standard used is the <hi>Extensible Markup
                Language</hi>, <xref url="http://www.w3.org/XML">XML</xref>
              </item>
              <item>There are a number of companion standards and technologies: 
                XML transformations (XSLT), data definition (DTD, XML Schema, ISO Relax NG),
                addressing and queries (XPath, XQuery), ...</item>
              <item>The vocabulary of annotations for corpora and
              other language resources is defined by the <hi>Text
              Encoding Initiative</hi>, <xref
                  url="http://www.tei-c.org/">TEI</xref>
              </item>
            </list>
          </p>
          <p>XML/TEI is used much more widely than just for corpora:<list>
              <item>documentation: these <xref url="corpus.xml">slides</xref>
              </item>
              <item>annotation of dictionaries: 
               <xref url="cnc-mte.html">English-Slovene</xref>, <xref url="jaslo.bmp">Japanese-Slovene</xref>
(from <xref url="http://nl.ijs.si/jaslo/">jaSlo</xref>)
              </item>
              <item>for annotating <xref url="http://nl.ijs.si/e-zrc/slomsek/data/eSlomsek.html#sl1d"
                  >text-critical editions</xref>
              </item>
            </list>
          </p>
        </div2>
        <div2>
          <head>Examples of TEI encoding in corpora: meta-data</head>
          <p>
<eg>
&lt;teiHeader id="ecmr.H" type="text" lang="sl-en" creator=ET 
     status="update" date.created="1999-04-13" date.updated="1999-06-22" &gt;
  &lt;fileDesc&gt;
  &lt;titleStmt&gt;
    &lt;title lang="sl"&gt;Ekonomsko ogledalo; 13 &amp;scaron;tevilk 98/99&lt;/title&gt;
    &lt;title lang="en"&gt;Slovenian Economic Mirror; 13 issues, 98/99&lt;/title&gt;
    &lt;respStmt&gt;
      &lt;name&gt;Andrej Skubic, FF&lt;/name&gt;
      &lt;resp lang="sl"&gt;Zagotovitev digitalnega originala, poravnava&lt;/resp&gt;
      &lt;resp lang="en"&gt;Provision of digital original, alignment&lt;/resp&gt;
      &lt;name&gt;Toma&amp;zcaron; Erjavec, IJS&lt;/name&gt;
      &lt;resp lang="sl"&gt;Tokenizacija, pretvorba v TEI&lt;/resp&gt;
      &lt;resp lang="en"&gt;Tokenisation, conversion to TEI&lt;/resp&gt;
    &lt;/respStmt&gt;
  &lt;/titleStmt&gt;
... 
</eg>
          </p>
        </div2>
        <div2>
          <head>Examples of TEI encoding in corpora: Structure of the text</head>
          <p>

<eg>
&lt;quote id="Osl.1.8.18" rend="center;it"&gt;
  &lt;lg id="Osl.1.8.18.1"&gt;
    &lt;l id="Osl.1.8.18.1.1"&gt;Tam pod kostanjevim drevesom&lt;/l&gt;
    &lt;l id="Osl.1.8.18.1.2"&gt;izdala si me,&lt;/l&gt;
    &lt;l id="Osl.1.8.18.1.3"&gt;izdal sem te,&lt;/l&gt;
    &lt;l id="Osl.1.8.18.1.4"&gt;ne da bi trenila z o&#x10D;esom.&lt;/l&gt;
  &lt;/lg&gt;
&lt;/quote&gt;
&lt;p id="Osl.1.8.19"&gt;
  &lt;s id="Osl.1.8.19.1"&gt;Trije mo&#x17E;je se niso niti ganili.&lt;/s&gt;
  &lt;s id="Osl.1.8.19.2"&gt;Toda ko je &lt;name&gt;Winston&lt;/name&gt;
  znova pogledal v Rutherfordov propadli obraz, je opazil, 
da so njegove o&#x10D;i polne solz.&lt;/s&gt;
... 
</eg>
          </p>
        </div2>
        <div2>
          <head>Examples of TEI encoding in corpora: Morphosyntactic descriptions</head>
          <p>
<eg> 
&lt;s id="Osl.1.2.2.1"&gt;
  &lt;w lemma="biti" ana="Vcps-sma"&gt;Bil&lt;/w&gt;
  &lt;w lemma="biti" ana="Vcip3s--n"&gt;je&lt;/w&gt;
  &lt;w lemma="jasen" ana="Afpmsnn"&gt;jasen&lt;/w&gt;&lt;c&gt;,&lt;/c&gt;
  &lt;w lemma="mrzel" ana="Afpmsnn"&gt;mrzel&lt;/w&gt;
  &lt;w lemma="aprilski" ana="Aopmsn"&gt;aprilski&lt;/w&gt;
  &lt;w lemma="dan" ana="Ncmsn"&gt;dan&lt;/w&gt;
  &lt;w lemma="in" ana="Ccs"&gt;in&lt;/w&gt;
  &lt;w lemma="ura" ana="Ncfpn"&gt;ure&lt;/w&gt;
  &lt;w lemma="biti" ana="Vcip3p--n"&gt;so&lt;/w&gt;
  &lt;w lemma="biti" ana="Vmps-pfa"&gt;bile&lt;/w&gt;
  &lt;w lemma="trinajst" ana="Mcnpnl"&gt;trinajst&lt;/w&gt;&lt;c&gt;.&lt;/c&gt;
&lt;/s&gt;

&lt;fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/&gt;
&lt;fs id="Vcps-sman----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.n V13.n"/&gt;
&lt;fs id="Vcps-smay----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.y V13.n"/&gt;
&lt;fs id="Vcps-sna" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a"/&gt;
&lt;fs id="Vcps-snan----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a V8.n V13.n"/&gt;

&lt;fLib type="Verb"&gt;
  &lt;f id="V0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"&gt;&lt;sym value="Verb"/&gt;&lt;/f&gt;
  &lt;f id="V1.m" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"&gt;&lt;sym value="main"/&gt;&lt;/f&gt;
  &lt;f id="V1.a" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"&gt;&lt;sym value="auxiliary"/&gt;&lt;/f&gt;
  &lt;f id="V1.o" select="en ro sl cs et hr sr sl-rozaj" name="Type"&gt;&lt;sym value="modal"/&gt;&lt;/f&gt;
  &lt;f id="V1.c" select="ro sl cs hr sr sl-rozaj" name="Type"&gt;&lt;sym value="copula"/&gt;&lt;/f&gt;
  &lt;f id="V1.b" select="en" name="Type"&gt;&lt;sym value="base"/&gt;&lt;/f&gt;
</eg>
          </p>
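          <p>The indirection in this encoding (the <hi>ana</hi> attribute of a word names an
            <hi>fs</hi>, whose <hi>feats</hi> attribute lists feature identifiers defined in an
            <hi>fLib</hi>) can be followed programmatically. The Python sketch below does so for a
            shortened copy of the example; the <hi>sample</hi> wrapper element and the function
            name <hi>decode</hi> are assumptions added so that the fragment parses on its own. Only
            features whose definitions appear in the excerpt are resolved.</p>
          <p>
<eg><![CDATA[
```python
import xml.etree.ElementTree as ET

# Shortened copy of the example, wrapped in a hypothetical <sample> root.
# The select attributes on <f> are omitted here for brevity.
SAMPLE = """
<sample>
  <s id="Osl.1.2.2.1">
    <w lemma="biti" ana="Vcps-sma">Bil</w>
    <w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
  </s>
  <fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/>
  <fLib type="Verb">
    <f id="V0." name="PoS"><sym value="Verb"/></f>
    <f id="V1.c" name="Type"><sym value="copula"/></f>
  </fLib>
</sample>
"""


def decode(root):
    """Return (form, lemma, features) for each word; features is a dict
    holding only the attribute-value pairs defined in this excerpt."""
    feats = {f.get("id"): (f.get("name"), f.find("sym").get("value"))
             for f in root.iter("f")}
    structs = {fs.get("id"): fs.get("feats").split()
               for fs in root.iter("fs")}
    rows = []
    for w in root.iter("w"):
        ids = structs.get(w.get("ana"), [])
        known = dict(feats[i] for i in ids if i in feats)
        rows.append((w.text, w.get("lemma"), known))
    return rows


rows = decode(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```
]]></eg>
          </p>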
        </div2>
        <div2>
          <head>Examples of TEI encoding in corpora: Alignment</head>
          <p>
<eg>
&lt;linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl"&gt;
&lt;link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"&gt;
&lt;link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"&gt;
&lt;link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"&gt;
&lt;link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"&gt;
... &lt;link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"&gt;
&lt;link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7"&gt;
&lt;link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"&gt;
... 
</eg>
          </p>
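          <p>Each <hi>xtargets</hi> value pairs sentence identifiers from the two texts, with a
            semicolon separating the two sides; one side may hold several identifiers, as in the
            1-2 alignment above. A minimal Python sketch of reading such values follows; the
            function name <hi>parse_xtargets</hi> is an assumption made for this example.</p>
          <p>
<eg><![CDATA[
```python
def parse_xtargets(value):
    """Split an xtargets value into two lists of sentence identifiers."""
    left, right = value.split(";")
    return left.split(), right.split()


# Two values copied from the example above.
links = [
    "Osl.1.2.6.5 ; Oen.1.1.5.5",
    "Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7",
]

for value in links:
    sl_ids, en_ids = parse_xtargets(value)
    # Print the alignment type (e.g. 1-1 or 1-2) and the identifiers.
    print(f"{len(sl_ids)}-{len(en_ids)}", sl_ids, en_ids)
```
]]></eg>
          </p>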
        </div2>
      </div1>
      <div1>
        <head>Examples of use</head>
        <div2>
          <head>Lexicology</head>
          <list>
            <item>Concordances and collocations<lb/>
              <q>You shall know a word by the company it keeps.</q> (Firth, 1957)</item>
            <item id="danlex">Induction of multilingual lexica: <list>
                <item>
                  <xref url="http://www.racai.ro/~tufis/">D. Tufiş</xref>, Ana-Maria Barbu:
                  Revealing translators' knowledge: statistical methods in constructing practical
                  translation lexicons for language and speech processing, in International Journal
                  of Speech Technology, Vol. 5, No. 3, 2002 Kluwer Pbls.</item>
                <item>Nancy Ide, Tomaž Erjavec and Dan Tufiş: Sense Discrimination with Parallel
                  Corpora, in Proceedings of the SIGLEX Workshop on Word Sense Disambiguation:
                  Recent Successes and Future Directions. ACL 2002, Philadelphia, July 2002, pp.
                  56-60.</item>
              </list>
              <p> An automatically built 7-language dictionary from
                  the '1984' corpus of the EU project <xref
                  url="http://nl.ijs.si/ME/">MULTEXT-East</xref>:<lb/>
                <xref url="orw-multiLex.html">first 100 entries</xref>
              </p>
            </item>
          </list>
        </div2>
        <div2 id="mt-try">
          <head>Automatic translation</head>
          <list>
            <item> VIČIČ, Jernej, ERJAVEC, Tomaž. Statistično strojno prevajanje na osnovi
              vzporednih korpusov. ERK 2002, 23.-25. 2002.</item>
          </list>
          <p>The <xref url="http://www.pef.upr.si/menola/">Menola</xref> translator
            <lb/>
            <eg> 
Slovene sentence:   evropi vlada veliki brat 
ELAN model:         europe government big brother 
Bible model:        evropi brother chief upright . 
Czech translation:  evropi vláda velké bratr .</eg>
          </p>
        </div2>

        <!--
<div2 id="concord"><head>Mrežne konkordance</head>

<p>Konkordančniki za druge jezike:
<list>
<item><xref url="http://sara.natcorp.ox.ac.uk/lookup.html">BNC</xref></item>
<item><xref url="http://corpora.ids-mannheim.de/~cosmas/ProtoDocs/Deutsch/start.html">COSMAS </xref></item>
<item><xref url="http://www-rali.iro.umontreal.ca/TransSearch/TS-simple-uen.cgi">Hansard</xref></item>
<item><xref url="http://www.tekstlab.uio.no/Bosnian/Corpus.html">Oslo Corpus of Bosnian Texts</xref></item>
<item><xref url="http://www.webcorp.org.uk/">WebCorp</xref>
</item>
</list>
</p>

<p>Slovenski konkordančniki:
<list>
<item><xref url="http://bos.zrc-sazu.si/s_beseda.html">nova beseda</xref></item>
<item><xref url="http://www.fida.net/">FIDA</xref></item>
<item><xref url="http://nl2.ijs.si/corpus/">nl.ijs.si</xref>
</item>
</list>
</p>

<p>Obstajajo seveda tudi navadni konkordančniki:
<list>
<item><xref url="http://www.liv.ac.uk/~ms2928/">Wordsmith</xref></item>
<item><xref url="http://www.ruf.rice.edu/~barlow/mono.html">MonoConcd</xref>
    in 
    <xref url="http://www.ruf.rice.edu/~barlow/parac.html">ParaConc</xref></item>
<item><xref url="">WordCruncher</xref>
</item>
</list>

</p>
</div2>
-->

        <div2>
          <head>Concordances at nl2.ijs.si</head>
          <p>At <xref url="http://nl2.ijs.si/corpus/">nl.ijs.si</xref> we have two interfaces:<list>
              <item>
                <xref url="http://nl2.ijs.si/corpus/index-mono.html">monolingual</xref>
              </item>
              <item>
                <xref url="http://nl2.ijs.si/corpus/index-bi.html">bi-lingual</xref>
              </item>
            </list>
          </p>
          <p>Fuzzy matching and regular expressions:
 		<list type="ordered">
              	<item>Search for RE: <hi rend="bold">"hoditi"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=60&amp;Corpus=ORW-SL&amp;Query=%22hoditi%22"
                  >search</xref>)</item>
              <item>Search for RE: <hi rend="bold">"hodi.*"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=60&amp;Corpus=ORW-SL&amp;Query=%22hodi.*%22"
                  >search</xref>)</item>
              <item>Search for RE: <hi rend="bold">".*hodi.*"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=60&amp;Corpus=ORW-SL&amp;Query=%22.*hodi.*%22"
                  >search</xref>)</item>
              <item>Search for RE: <hi rend="bold">"[bcčdfghjklmnprsštvzž]{5,}"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=KWIC&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22[bcčdfghjklmnprsštvzž]{5,}%22"
                  >search</xref>)</item>
            </list>
          </p>
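          <p>The effect of these patterns can be approximated offline. The Python sketch below
            assumes that a quoted pattern has to match a whole token, which is how the examples
            above read; the function name <hi>search</hi> and the token list are invented for this
            illustration and do not reproduce the search engine behind the interface.</p>
          <p>
<eg><![CDATA[
```python
import re


def search(pattern, tokens):
    """Return the tokens that the pattern matches from start to end."""
    rx = re.compile(pattern)
    return [t for t in tokens if rx.fullmatch(t)]


# Invented sample tokens, not taken from the corpus.
tokens = ["hoditi", "hodil", "prihodil", "shod", "dan", "čmrlj"]

print(search("hoditi", tokens))      # the exact word form only
print(search("hodi.*", tokens))      # forms beginning with "hodi"
print(search(".*hodi.*", tokens))    # forms containing "hodi"
print(search("[bcčdfghjklmnprsštvzž]{5,}", tokens))  # 5+ consonants
```
]]></eg>
          </p>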
          <p>Show results:
	<list type="ordered">
              <item>
                <hi rend="bold">".*hod.*"</hi> as frequency list (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=LIST&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22.*hod.*%22"
                  >search</xref>)</item>
              <item>
                <hi rend="bold">"prihodki"</hi> as KWIC (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=KWIC&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22prihodki%22"
                  >search</xref>)</item>
              <item>
                <hi rend="bold">"prihodki"</hi> bi-lingual (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=PARA&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22prihodki%22"
                  >search</xref>)</item>
            </list>
          </p>
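          <p>The two monolingual display modes, keyword in context (KWIC) and frequency list, can be
            sketched in a few lines of Python. The function names, the token-based context width
            and the sample sentence are assumptions made for this illustration; the on-line
            interface may measure context differently.</p>
          <p>
<eg><![CDATA[
```python
import re
from collections import Counter


def kwic(tokens, pattern, width=3):
    """Yield (left context, keyword, right context) for matching tokens.
    The context width is counted in tokens (an assumption of this sketch)."""
    rx = re.compile(pattern)
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, tok, right


def frequency_list(tokens, pattern):
    """Return (word form, count) pairs for matching tokens, most frequent first."""
    rx = re.compile(pattern)
    return Counter(t for t in tokens if rx.fullmatch(t)).most_common()


# Invented sample text, already tokenised on spaces.
tokens = "prihodki so večji kot odhodki , prihodki rastejo".split()

for left, key, right in kwic(tokens, "prihodki", width=2):
    print(f"{left:>12} [{key}] {right}")
print(frequency_list(tokens, ".*hod.*"))
```
]]></eg>
          </p>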
          <p>Bi-lingual searching:<list type="ordered">
              <item>
                <hi rend="bold">"prihodki"</hi> and <hi rend="bold">"income"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=PARA&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22prihodki%22+:ELAN-EN+%22income%22"
                  >search</xref>)</item>
              <item>
                <hi rend="bold">"prihodki"</hi> and not <hi rend="bold">"income"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=PARA&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22prihodki%22+:ELAN-EN+!%22income%22"
                  >search</xref>)</item>
              <item>
                <hi rend="bold">"community"</hi> and not <hi rend="bold">"skupnost"</hi> (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=PARA&amp;Context=60&amp;Corpus=ELAN-EN&amp;Query=%22community%22+:ELAN-SL+!%22skupnost.*%22"
                  >search</xref>)</item>
            </list>
          </p>
          <p>Words, lemmas and annotations:<list type="ordered">
              <item>Word "iti" in '1984' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=60&amp;Corpus=ORW-SL&amp;Query=%22iti%22"
                  >search</xref>)</item>
              <item>Lemma "iti" in '1984' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=60&amp;Corpus=ORW-SL&amp;Query=[lemma%3D%22iti%22]"
                  >search</xref>)</item>
              <item>Lemma "iti" in '1984' as frequency list (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=LIST&amp;Corpus=ORW-SL&amp;Query=[lemma%3D%22iti%22]"
                  >search</xref>)</item>
            </list>
          </p>
          <p>Effect of corpus:<list type="ordered">
              <item>"šel" in '1984' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=40&amp;Corpus=ORW-SL&amp;Query=%22šel%22"
                  >search</xref>) in 'VAYNA' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=40&amp;Corpus=VAYNA&amp;Query=%22šel%22"
                  >search</xref>) in 'GORE' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search?Display=KWIC&amp;Context=40&amp;Corpus=GORE&amp;Query=%22šel%22"
                  >search</xref>)</item>
              <item> "okrevanje" in 'ELAN-SL' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=KWIC&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22okrevanj.*%22"
                  >search</xref>) and "sožitje" (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=KWIC&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22sožitj.*%22"
                  >search</xref>)</item>
            </list>
          </p>
          <p>Multiword searches and collocations:<list type="ordered">
              <item>"star* mam*" in 'ELAN-SL' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=KWIC&amp;Context=60&amp;Corpus=ELAN-SL&amp;Query=%22star.*%22+%22mam.*%22"
                  >search</xref>)</item>
              <item>"* and death" in 'ELAN-EN' (<xref
                  url="http://nl2.ijs.si/cgi-bin/corpus-search-bi?Display=KWIC&amp;Context=60&amp;Corpus=ELAN-EN&amp;Query=%22.*%22+%22and%22+%22death%22"
                  >search</xref>)</item>
            </list>
          </p>
        </div2>
      </div1>
      <div1 id="conc">
        <head>The future of corpus and data-driven linguistics</head>
        <div2>
          <head>The future of corpus and data-driven linguistics</head>
          <p>Size:
          <list>
            <item>Larger quantities of readily accessible data (Web as corpus)</item>
            <item>Greater storage capacity and processing power (Moore's law)</item>
          </list>
          Complexity:
          <list>
            <item>Deeper analysis:<lb/> syntax, deixis, semantic roles, dialogue acts, ...</item>
            <item>Multimodal corpora:<lb/> speech, film, transcriptions, ...</item>
            <item>Annotation levels and linking:<lb/> co-existence and
              linking of varied types of annotations; ambiguity</item>
            <item>Development of tools and platforms:<lb/> precision,
                 robustness, unsupervised learning, meta-learning</item>
          </list>
        </p>
        </div2>
        <div2>
          <head>Development of corpus linguistics for smaller languages</head>
          <list>
            <item>varied, high-quality and accessible corpora</item>
            <item>technology for morphosyntactic annotation and lemmatisation</item>
            <item>syntactically annotated corpora (treebanks)</item>
            <item>application of developed methods</item>
            <item>development of curricula...</item>
          </list>
        </div2>
      </div1>

    </body>
  </text>
</TEI.2>
