

Making the ELAN Slovene/English Corpus

Tomaz Erjavec
tomaz.erjavec@ijs.si
Department for Intelligent Systems
Jozef Stefan Institute
Jamova 39
SI-1000 Ljubljana
Slovenia

July 2, 1999

Abstract:

Parallel corpora are a basic resource for research and development of multilingual language technologies and for translation and terminology studies. The paper presents a sentence-aligned Slovene/English corpus which was developed within the scope of the EU ELAN project.

The IJS-ELAN corpus contains 1 million words, with minimal restrictions on further exploitation and distribution. The corpus is composed of fifteen terminologically rich texts, mostly written in the 1990s. To facilitate interchange and re-usability of the digital data, the computer encoding of the corpus is standardised in accordance with the guidelines of the Text Encoding Initiative. The corpus is tokenised into (various types of) words and punctuation marks; one component, Orwell's '1984', is also annotated with disambiguated lemmas and morphosyntactic descriptions.

The paper outlines the making of the corpus, details its component bi-texts, explains the markup and distribution, and concludes with possibilities for the use and further annotation of the corpus.

Introduction

For bilingual research, parallel corpora are an essential language resource. For the Slovene language, so far the only available parallel corpus has been the one released on the TELRI CD-ROM [Erjavec et al. 1998], which comprised Plato's Republic and the MULTEXT-East corpus [Erjavec and Ide1998]. The MULTEXT-East corpus derives most of its value from the fact that it contains parallel texts in many languages and is heavily annotated; the markup includes document structure, together with quotes and sentences, disambiguated lemmas and morphosyntactic descriptions of its words, and alignment of sentences with those of the English original. To facilitate the reusability of the corpus, it is annotated in accordance with international recommendations for written text corpora targeted towards language engineering research, in particular the Corpus Encoding Specification, CES [Ide1998]. While the encoding of the MULTEXT-East corpus makes it suitable for further processing, annotation and exploitation, its parallel English-Slovene part consists of only one novel-length text, Orwell's '1984'. The text is without doubt interesting, but as an English-Slovene 'corpus' it is very limited in both size and variety.

The European Language Activity Network (the EU MLIS project ELAN) provided an opportunity to somewhat remedy this lack. Our contribution to ELAN was, in part, to collect and annotate a 1 million word Slovene-English / English-Slovene corpus. The corpus contains fifteen recent texts from active areas of text production; the texts and the corpus encoding have been chosen to carry minimal restrictions on further use, so that the corpus can be made widely available as a standardised dataset for bilingual language engineering research on the Slovene language.

The article is structured as follows: Section 2 reports on the corpus compilation project and the processing issues involved. Section 3 presents the fifteen component texts of the corpus. Section 4 turns to the digital storage and annotation of the corpus. Section 5 gives conclusions and directions for further work.

Making of the Corpus

The project operated under tight time and labour constraints, so it was imperative to maximise the results by minimising the most costly steps in the production process. These steps include obtaining permission from the copyright holders to use the texts for the purposes of the project, obtaining the digital originals of the texts, converting, segmenting and aligning the bi-texts, tokenising them, converting them to a standard format, and writing the text and corpus headers.

Rather than acquiring the texts and performing the conversion, segmentation, alignment and, especially, hand-validation locally, these tasks were carried out by external collaborators of the project. From the language departments at the University of Ljubljana, Andrej Skubic was responsible for governmental texts that are available on the Web, while Spela Vintar acquired RTF texts from the Office of the Government of the Republic of Slovenia for European Affairs. Both also performed the conversion from the source format as well as the segmentation and alignment. The software used was Déjà Vu, a commercial translation memory program, which offers an interactive alignment environment. Roman Maurer was the translator as well as the aligner of the book 'Linux Installation and Getting Started', which was already annotated in SGML. For conversion and alignment he, of course, used GNU programs. The other texts were chosen because they were readily available and were seen to fit in with the overall scheme of the corpus. Various methods previously used in the MULTEXT-East project, e.g., the Vanilla aligner [Danielsson and Ridings1997], were used for the segmentation and alignment of these components.

This first step produced predominantly ASCII bi-texts, stripped of the original markup, with aligned segments represented side-by-side in a simple tabular form. These bi-texts were then normalised on the character level, and remnants of formatting were cleaned up with Perl filters. We thus obtained aligned bi-texts in a common format, which lack all structural information above the translation unit segments, e.g., divisions and paragraphs, and all formatting below this level, e.g., bolding and font changes. This approach loses information from the originals, which is regrettable, but it also has certain advantages: the conversion process is fast, the resulting structure is simple, and the loss of formatting helps protect the copyright of the texts, since the originals are difficult to reprint on the basis of the corpus.
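
A minimal sketch of the kind of normalisation filter applied to the intermediate bi-texts is given below; the project itself used Perl filters, and the exact tabular layout (one aligned pair per line, with the two segments separated by a tab) as well as the input encoding are assumptions made purely for illustration.

# Sketch of a bi-text clean-up filter; the tab-separated layout and the
# ISO Latin-2 input encoding are assumptions, not taken from the project.
import re
import sys

def clean(segment):
    """Collapse whitespace and remove stray control characters."""
    segment = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", segment)
    return re.sub(r"\s+", " ", segment).strip()

def read_bitext(path):
    """Yield (source, translation) pairs from a tab-separated bi-text."""
    with open(path, encoding="iso-8859-2") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue            # skip malformed or empty alignments
            src, trg = clean(parts[0]), clean(parts[1])
            if src and trg:
                yield src, trg

if __name__ == "__main__":
    for src, trg in read_bitext(sys.argv[1]):
        print(src + "\t" + trg)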

These bi-texts were then tokenised into words and punctuation marks. This step was performed with the MULTEXT tool 'mtseg' [Cristo1996], using resources for English and Slovene developed in the MULTEXT-East project [Dimitrova et al. 1998]. The tokenisation also flags numerals, compounds, abbreviations, etc. This step introduced many errors, most of which were corrected with Perl filters and Emacs macros. Finally, the tokenised aligned texts were converted into a TEI conformant encoding; at this stage the header information was added to the bi-texts, and the bi-texts and the corpus as a whole were encoded as an SGML document.
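
To give an impression of what the tokenisation step distinguishes, the sketch below classifies tokens into the same rough categories (ordinary words, numeric expressions, abbreviations and punctuation). It is only a regex-based stand-in, not the mtseg tool, and its abbreviation list is a placeholder for the real language resources.

# Illustrative token classifier; a stand-in for mtseg, not a reimplementation.
import re

ABBREVIATIONS = {"dipl.", "tar.", "itd."}   # placeholder for the real resource files

TOKEN_RE = re.compile(r"""
      (?P<dig>\d+(?:[./%-]\d+)*[.%]?)    # numeric expressions: 1984, 3., 20%, 1993-1996
    | (?P<word>\w+(?:[.'-]\w+)*\.?)      # words, possibly with internal punctuation
    | (?P<punct>[^\w\s])                 # single punctuation characters
""", re.VERBOSE)

def tokenise(text):
    """Yield (token, type) pairs mimicking the corpus 'type' attribute."""
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if match.lastgroup == "dig":
            yield token, "dig"
        elif match.lastgroup == "punct":
            yield token, "c"
        elif token in ABBREVIATIONS or (token.endswith(".") and len(token) <= 5):
            yield token, "abbr"
        else:
            yield token, "w"

print(list(tokenise("Article 70, 20% tar.")))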

Corpus Composition

The small scale of the project prohibited any attempt at making an English-Slovene reference-type corpus, except maybe at the level of encoding. Rather, we tried to maximise the size of the corpus while ensuring availability and retaining those aspects of the texts that are interesting for target applications. The composition of the IJS-ELAN corpus is thus the result of a balancing act between usability and ease of acquisition.

The corpus mostly contains recent (1990s) terminology-rich texts from active topic areas which are, from the copyright point of view, easy to distribute further. Having collaborators who chose the kind of texts they were themselves most interested in studying also gave a certain coherence to the corpus.

The corpus comprises fifteen bi-texts; they are mostly complete texts, but with omissions of predominantly non-textual data (numerical charts, etc.). In the corpus, each bi-text is given its ID and constitutes, along with its header, one element of the corpus.

The texts are usefully divided into those with a Slovene original and an English translation, and those with an English original translated into Slovene. Apart from the linguistic differences that follow from the opposition between original and translation, the two parts also differ considerably in composition.

The Slovene-English half was, for the most part, acquired from various branches of the Slovene government. It consists of eleven texts, containing somewhat more than half of the corpus material. The Slovene-English texts, together with their IDs, approximate sizes in kilobytes and kilowords, and year of publication, are as follows:

usta
364 Kb, 20 kW, 1997
Constitution of the Republic of Slovenia
Ustava Republike Slovenije
Constitutional Court of the Republic of Slovenia

kuca
1102 Kb, 69 kW, 1990-95
Speeches by the President of Slovenia, M. Kucan
Govori predsednika RS, M. Kucana
The Office of the President of the Republic of Slovenia

parl
325 Kb, 20 kW, 1998
Functioning of the National Assembly
Delovanje Drzavnega zbora
The National Assembly of the Republic of Slovenia

ecmr
4056 Kb, 239 kW, 1998/1999
Slovenian Economic Mirror; 13 issues
Ekonomsko ogledalo; 13 stevilk
Institute of Macroeconomic Analysis and Development of the Republic of Slovenia

ekol
1222 Kb, 70 kW, 1999
National Environmental Protection Programme
Nacionalni program varstva okolja
Office of the Government of the Republic of Slovenia for European Affairs

spor
589 Kb, 34 kW, 1996
Europe Agreement
Evropski sporazum
Office of the Government of the Republic of Slovenia for European Affairs

anx2
483 Kb, 25 kW, 1996
Europe Agreement - Annex II
Evropski sporazum - Priloga II
Office of the Government of the Republic of Slovenia for European Affairs

stra
1511 Kb, 89 kW, 1997
Slovenia's Strategy for Integration into EU
Strategija Slovenije za vkljucevanje v EU
Office of the Government of the Republic of Slovenia for European Affairs

kmet
543 Kb, 29 kW
Slovenia's programme for accession to EU - agriculture
Drzavni program za prilagajanje zakonodaje - kmetijstvo
Office of the Government of the Republic of Slovenia for European Affairs

ekon
394 Kb, 23 kW
Slovenia's programme for accession to EU - economy
Drzavni program za prilagajanje zakonodaje - gospodarstvo
Office of the Government of the Republic of Slovenia for European Affairs

vade
471 Kb, 24 kW, 1995
Vademecum by Lek, 1995
Vademecum Lekove domace lekarne
Lek d.d.; OTC Division

The English-Slovene part of the corpus contains almost half of the corpus material, but is composed of only four elements, two of which are full-length books. It also has different text types from the Slovene-English part: two components deal with computers, one with European agricultural legislation, and one with a rather grim projection of the future, from the past:

vino
1182 Kb, 69 kW, 1994
EC Council Regulation No 3290/94 - agriculture
Uredba sveta ES st. 3290/94 - kmetijstvo
Office of the Government of the Republic of Slovenia for European Affairs
ligs
3044 Kb, 173 kW, 1999
Linux Installation and Getting Started
Namestitev in zacetek dela z Linuxom
Linux Documentation Project; English: Specialized Systems Consultants / Slovene: Linux Users Group of Slovenia, LUGOS

gnpo
353 Kb, 13 kW, 1999
GNU PO localisation files
GNU PO lokalizacije datoteke
Free Software Foundation, Linux Documentation Project

orwl
6698 Kb, 195 kW, 1948
G. Orwell: Nineteen Eighty-Four
G. Orwell: 1984
The Slovene translation of the book was published by Knjiznica Kondor, Mladinska knjiga in 1983 (translator: Alenka Puhar). The first digital versions of the English and Slovene texts (as well as of the Serbian and Croatian translations) were keyed in at the School of Oriental and African Studies at London University, then became part of the Oxford Text Archive and were published, with minimal changes, on the ECI-I CD-ROM. This served as the basis of the marked-up MULTEXT-East version.

Corpus Encoding

One of the principles of the project was to use, in line with most other similar efforts, the Text Encoding Initiative Guidelines (TEI P3, [Sperberg-McQueen and Burnard1994]) for the annotation of our corpus; the Guidelines provide a comprehensive and general framework for encoding linguistic data for scholarly purposes and are, in turn, an application of the ISO Standard Generalized Markup Language, SGML. We use an instantiation of TEI which keeps the benefits of the 'off the shelf' TEI encoding for aligned corpora (header, sub-segment markup) but treats corpus texts as a direct collection of translation units. This document type is very similar to the one used in PLUG (Parallel Corpora in Linköping, Uppsala, and Göteborg, [Ahrenberg et al. 1999,Tiedemann1998]). The main difference lies in the method of construction: PLUG uses its own XML DTD, whereas we parametrise the TEI in conformance with the procedures outlined in Chapter 29 of TEI P3 [Sperberg-McQueen and Burnard1994, pp.737-744]. Our TEI parametrisation is defined as follows:

<!DOCTYPE teiCorpus.2 PUBLIC 
  "-//TEI P3//DTD Main Document Type//EN" [

  <!-- base tag set -->
  <!ENTITY % TEI.prose    'INCLUDE'>

  <!-- add: basic linguistic analysis -->
  <!ENTITY % TEI.analysis 'INCLUDE'>

  <!-- add: pointer mechanisms -->
  <!ENTITY % TEI.linking  'INCLUDE'>

  <!-- add: local extensions -->
  <!ENTITY % TEI.extensions.ent 
    SYSTEM "teitmx.ent">
  <!ENTITY % TEI.extensions.dtd 
    SYSTEM "teitmx.dtd">
]>

In short, the above SGML prolog specifies that the root element of the corpus is <teiCorpus.2> and that the document uses the TEI.prose base tagset, and two additional tagsets: TEI.linking which implements linking elements, and TEI.analysis, the module for basic linguistic analysis, which specifies, inter alia, the definition of segments, <seg>, and words, <w>. Finally, the above parametrisation makes reference to two files with local extensions to TEI, which we give below:

teitmx.ent:               
<!ENTITY % body 'IGNORE' >
                          
teitmx.dtd:                             
<!ELEMENT %n.body;      - -  (tu+)>     
<!ELEMENT tu            - -  (seg, seg)>
<!ATTLIST tu                 %a.global;>

The entity extension IGNOREs the standard definition of the TEI <body>, while the DTD extension redefines <body> to be composed of translation units only; each translation unit contains two segments and bears the standard global attributes, in particular the identifier (id) and language (lang) attributes.

The structure of our corpus is explained in more detail in [Erjavec1999]; here we give a brief overview and focus on the actual usage of tags in the corpus.

Corpus structure

The corpus as a whole is a valid SGML document, and therefore contains the following components:

1.
the SGML Declaration, which defines local processing options. It makes the usual (TEI) assumptions about capacity points but limits the character set to ASCII: all the language-specific characters in the corpus, e.g., č and Ć, are encoded as SGML entities, e.g., &ccaron; and &Cacute; (a minimal conversion sketch is given after this list). The declaration also prohibits tag minimisation, so the corpus encoding is XML-like.

2.
the SGML DTD, the Document Type Definition, which defines the annotation grammar of the corpus. As explained above it is a parametrisation of TEI.

3.
the SGML Document itself, which contains the SGML Prolog, the corpus header and references to all the corpus components, i.e.,  headers and texts.
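
As a minimal illustration of handling these entity-encoded characters, the sketch below maps a handful of the ISO character entities occurring in the corpus back to Unicode; the table is deliberately incomplete and purely illustrative.

# Sketch: decode a few of the SGML character entities used in the corpus.
ENTITY_MAP = {
    "&ccaron;": "\u010d",   # c with caron
    "&Ccaron;": "\u010c",
    "&scaron;": "\u0161",   # s with caron
    "&Scaron;": "\u0160",
    "&zcaron;": "\u017e",   # z with caron
    "&Zcaron;": "\u017d",
    "&cacute;": "\u0107",   # c with acute
    "&Cacute;": "\u0106",
}

def decode_entities(text):
    """Replace the known SGML character entities with Unicode characters."""
    for entity, char in ENTITY_MAP.items():
        text = text.replace(entity, char)
    return text

print(decode_entities("&Scaron;pela, oble&ccaron;en"))   # prints: Špela, oblečen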

Each of the fifteen corpus elements is stored in two files, one containing the component header and the other the aligned bi-text. As we expect that many users will be interested only in parts of the corpus, a significant amount of information identical across texts is kept in the text headers, and not solely in the header of the corpus.

The headers

The corpus as a whole, as well as each of its components, has its own TEI header. This header contains detailed information about the file itself, the source of its text, its encoding, and its revision history.

As the corpus is bilingual, it seemed only proper to have the headers bilingual as well. This is achieved by doubling the header elements and distinguishing their language via the lang attribute.

To give an impression of the information encoded in the headers, we give below some examples. The first is the start of the corpus and of the corpus header:

<teiCorpus.2>
  <teiheader type="corpus" lang="slen"  id="ijs-e
   creator="et" status="update" date.created="199
   date.updated="1999-06-22"                     
  >                                              
    <filedesc>                                       
      <titlestmt>                                     
        <title lang="en">The IJS-ELAN Slovene/Eng
        <title lang="sl">Slovenskoangle&scaron;ki

Part of the responsibility statement from a text header:

  <respstmt>                                     
   <name>Jasna Belc, SVEZ</name>                 
   <resp lang="sl">Zagotovitev digitalnega origin
   <resp lang="en">Provision of digital original<
   <name>&Scaron;pela Vintar, FF</name>          
   <resp lang="sl">Poravnava</resp>              
   <resp lang="en">Alignment</resp>

The bibliography of the source texts in a text header:

<bibl lang="en" default="yes">                   
  <title lang="en">Linux Installation and Getting
  <xref type="URL">http://metalab.unc.edu/LDP/LDP
  <xref type="URL">ftp://metalab.unc.edu/pub/Linu
  <publisher>Specialized Systems Consultants
    <xref type="URL">http://www.ssc.com/</xref>
  </publisher>
</bibl>

The tags declaration in a text header:

<tagsdecl>
 <tagusage gi=text occurs=1></tagusage>
 <tagusage gi=body occurs=1></tagusage>
 <tagusage gi=tu occurs=956></tagusage>
 <tagusage gi=seg occurs=1912></tagusage>
 <tagusage gi=w occurs=33765></tagusage>
 <tagusage gi=c occurs=4198></tagusage>
</tagsdecl>

The texts

Each text is composed of translation units, i.e., <tu> elements, each containing two segments: the original and its translation. The definition of the segment element is taken directly from the TEI.analysis module and allows significant sub-segment markup. Our corpus currently encodes word and punctuation elements, i.e., it is tokenised. Below we give some translation units from the corpus:

<tu lang="sl-en" id="usta.301">
<seg lang="sl"><w type=dig>70.</w> <w>&ccaron;
<seg lang="en"><w>Article</w> <w type=dig>70</
</tu>
...
<tu lang="sl-en" id="spor.301">               
<seg lang="sl"><w>ii</w><c>)</c> <w>za</w> <w>
<seg lang="en"><c type=open>(</c><w>ii</w><c t
</tu>
...
<tu lang="sl-en" id="kmet.301">               
<seg lang="sl"><c>-</c> <w>razvoj</w> <w>pode&
<seg lang="en"><c>-</c> <w>Pillar</w> <w>IV</w
</tu>
...
<tu lang="sl-en" id="vade.301">               
<seg lang="sl"><w>Na</w> <w>bole&ccaron;e</w> 
<seg lang="en"><w>Apply</w> <w>a</w> <w>thin</
</tu>
...
<tu lang="en-sl" id="ligs.301">               
<seg lang="en"><w>Many</w> <w>text</w> <w>proc
<seg lang="sl"><w>Za</w> <w>Linux</w> <w>je</w
</tu>
...
<tu lang="en-sl" id="gnpo.301">               
<seg lang="en"><w>Usage</w><c>:</c> <w>%s</w> 
<seg lang="sl"><w>Uporaba</w><c>:</c> <w>%s</w
</tu>

As can be seen, the structure of a translation unit is straightforward, which makes it suitable for direct processing with limited tools or computer expertise.
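
As an illustration of this point, the sketch below extracts plain-text segment pairs from a bi-text file using nothing but regular expressions, stripping the token markup along the way; the file name follows the distribution naming scheme described below and is used here only as an example.

# Sketch: pull aligned segment pairs out of a bi-text file with plain regexes.
import re

TU_RE  = re.compile(r"<tu[^>]*>(.*?)</tu>", re.DOTALL)
SEG_RE = re.compile(r"<seg[^>]*>(.*?)</seg>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def segments(path):
    """Yield (first, second) plain-text segment pairs from a bi-text file."""
    with open(path, encoding="ascii") as handle:
        data = handle.read()
    for tu in TU_RE.finditer(data):
        segs = SEG_RE.findall(tu.group(1))
        if len(segs) == 2:
            first, second = (re.sub(r"\s+", " ", TAG_RE.sub("", s)).strip()
                             for s in segs)
            yield first, second

for src, trg in segments("usta-txt.tei"):    # example file name
    print(src + " ||| " + trg)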

The above encoding also, to a large extent, restricts the usage of the corpus to the level of its segments: since all supra-segmental markup is lost, recreating the texts in their entirety would require a substantial amount of effort. This implicitly protects the copyright, while the texts remain perfectly suitable as language resources.

Tokenisation

As has been seen, the texts have also been tokenised, i.e., marked up for words and punctuation symbols. This markup is of course not meant for reading; rather, it saves software that exploits the corpus from having to perform tokenisation itself, which is usually the first step before any further processing.

We have assigned some potentially useful values to the type attribute of the token elements. Since both the programs that assigned these types and the corpus texts themselves contain errors, so do the types of some tokens. The word element takes the following values of type, with examples taken from the corpus:

<w type=comp>
Compound (lexical multiword unit), e.g.,  medtem ko, vice versa, New York
<w type=dig>
Digit (numeric expression), e.g., 1984, 3., IV, 20%, 1993-1996, 25/76, 16MB
<w type=abbr>
Abbreviation (ending in a period), e.g., tar., et al., S.u.S.E., dipl.
<w>
implied type for 'normal' words, e.g.,  Slovenije, market, 's, Article, zivinorejo, INAVGURACIJSKI, Hurt-Andreatta, Hrup51, E-postni, D'you

The punctuation element, <c>, can be marked as type=open or type=close, e.g., <c type=open>[</c>. This type is especially relevant for quotes: the quotes themselves have been normalised to the 'directionless' single or double quote, and it is the type attribute that specifies whether a given quote is opening or closing.
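
A quick way to inspect this markup, and to spot tokenisation errors of the kind mentioned above, is to tally the type attributes actually used on the <w> and <c> elements; the sketch below does this with simple pattern matching, allowing for both quoted and unquoted attribute values.

# Sketch: count the usage of the 'type' attribute on token elements.
import re
from collections import Counter

TOKEN_RE = re.compile(r'<([wc])\b(?:\s+type="?(\w+)"?)?[^>]*>')

def type_usage(path):
    """Return a Counter over (element, type) pairs found in the file."""
    counts = Counter()
    with open(path, encoding="ascii") as handle:
        for element, token_type in TOKEN_RE.findall(handle.read()):
            counts[(element, token_type or "plain")] += 1
    return counts

for (element, token_type), n in sorted(type_usage("spor-txt.tei").items()):
    print("<" + element + " type=" + token_type + ">: " + str(n))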

A special component of the corpus is the MULTEXT-East English-Slovene '1984'. In addition to alignment segments, the text is also annotated for sentences. More importantly, the word tokens of this component are also marked up with the lemma and morphosyntactic specification, an invaluable annotation for lexical studies, extraction programs and other applications. The Slovene morphosyntactic specifications and the lexicon are further explained in [Erjavec1998]; below we give the first translation unit of this component:

<tu lang="en-sl" id="orwl.1">
<seg lang="en">
<s id="Oen.1.1.1.1"><w>It</w> <w>was</w> 
<w>a</w> <w>bright</w> <w>cold</w> <w>day</w> 
<w>in</w> <w>April</w><c>,</c> <w>and</w> 
<w>the</w> <w>clocks</w> <w>were</w> 
<w>striking</w> <w>thirteen</w><c>.</c></s>
</seg>
<seg lang="sl">
<s id="Osl.1.2.2.1">
<w lemma="biti" function="Vcps-sma">Bil</w>
<w lemma="biti" function="Vcip3s--n">je</w>
<w lemma="jasen" function="Afpmsnn">jasen</w>
<c>,</c>                                     
<w lemma="mrzel" function="Afpmsnn">mrzel</w>
<w lemma="aprilski" function="Aopmsn">aprilsk
<w lemma="dan" function="Ncmsn">dan</w>      
<w lemma="in" function="Ccs">in</w>          
<w lemma="ura" function="Ncfpn">ure</w>      
<w lemma="biti" function="Vcip3p--n">so</w>  
<w lemma="biti" function="Vmps-pfa">bile</w> 
<w lemma="trinajst" function="Mcnpnl">trinajs
<c>.</c>
</s>
</seg>
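
Such annotation can be exploited quite directly. As a small illustration, the sketch below harvests (word form, lemma, morphosyntactic description) triples from this component; the attribute names follow the sample above, where the morphosyntactic description is stored in the function attribute.

# Sketch: harvest (word form, lemma, MSD) triples from the annotated '1984'.
import re
from collections import Counter

ANN_RE = re.compile(
    r'<w\s+lemma="([^"]+)"\s+function="([^"]+)"[^>]*>([^<]+)</w>')

def harvest(path):
    """Return a Counter over (form, lemma, MSD) triples."""
    lexicon = Counter()
    with open(path, encoding="ascii") as handle:
        for lemma, msd, form in ANN_RE.findall(handle.read()):
            lexicon[(form.strip(), lemma, msd)] += 1
    return lexicon

for (form, lemma, msd), n in harvest("orwl-txt.tei").most_common(10):
    print(form + "\t" + lemma + "\t" + msd + "\t" + str(n))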

The annotation of the corpus is readable directly in the TEI format, but hardly pleasing to the eye. One of the benefits of SGML encoding is easy down-translation into whatever format an application requires. We have implemented a conversion to HTML for the headers and texts; the IJS-ELAN corpus headers and a text sample in this rendition are available on the WWW page of the project.

Availability

The question of reusability has long been a key issue for digital language resources. It is well known that making such resources is a lengthy process, yet the work is all too often done again and again, because existing resources were either not available in a usable format or not available to others at all.

Reusability suffers where resources are stored in proprietary, diverse and poorly documented encodings. This makes them difficult to port between applications and computer platforms. If the format is directly linked to a specific piece of software, e.g., an editor, and the software becomes obsolete, so does the data. This was the motivation behind the SGML standard and the TEI guidelines, which have to a certain extent solved the problem.

The other obstacle is that resources are simply not available, either because they do not exist or because they are not distributed. The latter is less due to a lack of distribution mechanisms than to the resources being considered proprietary. With corpora this problem is doubly acute: copyright restrictions can be exercised both on the corpus annotation, i.e., on the corpus as a whole, and, of course, on the component texts.

In line with the idea of the ELAN project, namely to make language resources available to the language community, and with the local GNU orientation, we aimed at a very simple distribution mechanism: the IJS-ELAN deliverables are available for downloading via the WWW. To respect the copyright on the original texts, we mostly chose providers of public texts, we require acknowledgement of the resource and its sources, and we did not encode typographical information in the corpus, making it unsuitable for reprinting.

The corpus distribution is packed as a 3.6 MB .tar.gz file, which extracts 22 MB of corpus files into the ijs-elan directory. The corpus proper consists of 3 + 2 × 15 = 33 files.

The three top-level files are the SGML declaration, ijs-elan.decl; the one-file SGML DTD, ijs-elan.dtd; and the SGML corpus document itself, ijs-elan.sgml, which contains the corpus header and references to the corpus components. Each component is stored in two files, one with the text header, ID-hdr.tei, and the other with the aligned bi-text itself, ID-txt.tei.
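
As a small sanity check after unpacking, the sketch below verifies that the three top-level files are present and that every component contributes a matching header/text pair, giving the 3 + 2 × 15 = 33 files in all; the directory and file names follow the description above.

# Sketch: check the expected layout of the unpacked corpus distribution.
from pathlib import Path

corpus_dir = Path("ijs-elan")            # directory created by the archive

top_level = ["ijs-elan.decl", "ijs-elan.dtd", "ijs-elan.sgml"]
missing = [name for name in top_level if not (corpus_dir / name).exists()]

headers = {p.name[:-len("-hdr.tei")] for p in corpus_dir.glob("*-hdr.tei")}
texts   = {p.name[:-len("-txt.tei")] for p in corpus_dir.glob("*-txt.tei")}

print("missing top-level files:", missing or "none")
print("components with both header and text:", len(headers & texts))
print("unpaired components:", sorted(headers ^ texts) or "none")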

Conclusions

The paper presented the IJS-ELAN 1 million word Slovene/English parallel aligned corpus; further information about the corpus, including headers and samples in HTML, the distribution, and on-line concordancing can be found at http://nl.ijs.si/elan/ .

The Slovene language has so far lacked such a resource. At this point it is important for the corpus to be, on the one hand, used, and, on the other, further developed. Further work would involve enriching the annotation of the corpus and, as is always the case with corpora, making it more representative in both composition and size.

Currently, the most pressing need and the most interesting task seem to be the lemmatisation and morphosyntactic tagging of the corpus. Such annotation opens up opportunities for further computational exploitation, as lemmatised words and simple syntactic patterns can be used in processing the corpus. This enables work on shallow syntactic parsing (e.g., bracketing of NPs), term recognition and translation, named entity extraction, etc.

Automatic part-of-speech and lemma annotation of the English half should be relatively simple, as publicly available taggers exist for the language, although it will still take some time. The Slovene part presents significantly greater problems: quality tagging in the manner of '1984' means either hand-tagging the corpus, or having a substantial hand-annotated corpus with which to train a stochastic tagger, together with, preferably, the environment and labour to correct the results.
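
To make the baseline concrete, here is a toy sketch of a unigram tagger of the kind that could be trained on (word form, morphosyntactic description) pairs harvested from '1984'; it is only an illustrative stand-in for the stochastic taggers discussed here, and the inlined training data is merely an example.

# Toy unigram tagger: most frequent tag per word form, with a global fallback.
from collections import Counter, defaultdict

def train_unigram(pairs):
    """pairs: iterable of (word form, tag); returns a tagging function."""
    by_form = defaultdict(Counter)
    overall = Counter()
    for form, tag in pairs:
        by_form[form.lower()][tag] += 1
        overall[tag] += 1
    default = overall.most_common(1)[0][0]
    def tag(tokens):
        return [(token,
                 by_form[token.lower()].most_common(1)[0][0]
                 if by_form[token.lower()] else default)
                for token in tokens]
    return tag

# Tags below are MSDs taken from the sample translation unit shown earlier.
tagger = train_unigram([("dan", "Ncmsn"), ("je", "Vcip3s--n"), ("mrzel", "Afpmsnn")])
print(tagger(["je", "dan", "jasen"]))    # unknown 'jasen' gets the fallback tag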

While we have trained and tested a few taggers on '1984', with seemingly good results [Dzeroski et al. 1999], the task becomes much harder when dealing with texts that are lexically and syntactically different from the training set. How best to approach this problem is a topic of further research, most likely in cooperation with partners in the FIDA project [Krek et al. 1998].

Acknowledgements

The author would like to thank Jaro Lajovic, Vojko Gorjanc, and Spela Vintar for comments on previous drafts of this paper.

Thanks are due to the Offices of the Government of Slovenia, especially the Office for European Affairs, to the Linux Users Group of Slovenia, LUGOS, and to Lek d.d., OTC Division, for providing the source texts for the corpus.

The work presented in this paper was in part supported by a subcontract to the MLIS-ELAN 121 project, Institut für deutsche Sprache, and by grant MZT L2-0461-0106 from the Ministry of Science and Technology of Slovenia.

References

Ahrenberg et al. 1999
Lars Ahrenberg, Magnus Merkel, Daniel Ridings, Anna Sågvall Hein, and Jörg Tiedemann.
1999.
Automatic processing of parallel corpora: A Swedish perspective.
http://numerus.ling.uu.se/~corpora/plug/.

Cristo1996
Philippe Di Cristo.
1996.
MtSeg: The MULTEXT multilingual segmenter tools.
MULTEXT Deliverable MSG 1, Version 1.3.1, CNRS, Aix-en-Provence.
http://www.lpl.univ-aix.fr/projects/multext/MtSeg/.

Danielsson and Ridings1997
Pernilla Danielsson and Daniel Ridings.
1997.
Practical presentation of a 'vanilla' aligner.
In U. Reyle and C. Rohrer, editors, Presented at the TELRI Workshop on Alignment and Exploitation of Texts. Institute Jozef Stefan, Ljubljana.
http://svenska.gu.se/PEDANT/workshop/workshop.html.

Dimitrova et al. 1998
Ludmila Dimitrova, Tomaz Erjavec, Nancy Ide, Heiki-Jan Kaalep, Vladimír Petkevic, and Dan Tufis.
1998.
Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages.
In COLING-ACL '98, pages 315-319, Montréal, Québec, Canada.

Dzeroski et al. 1999
Saso Dzeroski, Tomaz Erjavec, and Jakub Zavrel.
1999.
Morphosyntactic tagging of Slovene: Evaluating POS taggers and tagsets.
Research Report IJS-DP 8018, Jozef Stefan Institute, Ljubljana.

Erjavec and Ide1998
Tomaz Erjavec and Nancy Ide.
1998.
The MULTEXT-East corpus.
In First International Conference on Language Resources and Evaluation, LREC'98, pages 971-974, Granada. ELRA.

Erjavec et al. 1998
Tomaz Erjavec, Ann Lawson, and Laurent Romary.
1998.
East meets West: Producing Multilingual Resources in a European Context.
In First International Conference on Language Resources and Evaluation, LREC'98, pages 233-240, Granada. ELRA.
http://www.ids-mannheim.de/telri/cdrom.html.

Erjavec1998
Tomaz Erjavec.
1998.
The Multext-East Slovene Lexicon.
In Proceedings of the 7th Slovene Electrotechnical Conference, ERK '98, pages 189-192, Portoroz, Slovenia.
http://nl.ijs.si/et/Bib/ERK98/.

Erjavec1999
Tomaz Erjavec.
1999.
A TEI encoding of aligned corpora as translation memories.
In Proceedings of the EACL-99 Workshop on Linguistically Interpreted Corpora (LINC-99), Bergen. ACL.

Ide1998
Nancy Ide.
1998.
Corpus Encoding Standard: SGML guidelines for encoding linguistic corpora.
In First International Conference on Language Resources and Evaluation, LREC'98, pages 463-470, Granada. ELRA.
http://www.cs.vassar.edu/CES/.

Krek et al. 1998
Simon Krek, Marko Stabej, Vojko Gorjanc, Tomaz Erjavec, Miro Romih, and Peter Holozan.
1998.
FIDA: korpus slovenskega jezika.
http://www.fida.net.

Sperberg-McQueen and Burnard1994
C. M. Sperberg-McQueen and Lou Burnard, editors.
1994.
Guidelines for Electronic Text Encoding and Interchange.
Chicago and Oxford.

Tiedemann1998
Jörg Tiedemann.
1998.
Parallel corpora in Linköping, Uppsala and Göteborg (PLUG).
Work package 1., Department of Linguistics, Uppsala University.
http://numerus.ling.uu.se/~corpora/plug/.
