The SVEZ-IJS English-Slovene ACQUIS Corpus

The SVEZ-IJS ACQUIS Corpus Version 1 contains approximately 10 million words of English-Slovene translation memory segments, which were produced in the process of translating EU legislation (the ACQUIS) into Slovene.

The translation memory that is the source of this corpus was produced by the Translation Section of the Office of the Government of the Republic of Slovenia for European Affairs, which also provides a description of it and on-line searches of a monitor corpus derived from the translation memory.

At the Dept. of Knowledge Technologies, Jožef Stefan Institute, we then processed this translation memory by:

  1. normalising (removing formatting)
  2. tokenising (splitting into words, punctuation marks and sentences)
  3. tagging (marking each word with its context-disambiguated MULTEXT-East morphosyntactic description)
  4. lemmatising (assigning the base form to each word)
  5. encoding (annotating it in a standard format, the TEI P4)
  6. mounting (on our Web-based concordancer)
  7. distributing (making it available via the Web-based licence agreement)

The annotation, encoding and availability of the corpus are meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the Slovene language.

The corpus processing steps were performed automatically, so the linguistic annotation contains errors. While a report of a single annotation mistake in a 10-million-word corpus is of little use, reports on patterns of errors can be valuable. We are therefore glad to hear of systematic types of errors that you notice in the corpus, or of general design flaws.
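One way to look for a pattern of errors rather than a single mistake is to count how often a suspicious combination of word form and tag recurs. The sketch below is only an illustration under stated assumptions: the heuristic (a token tagged as a verb whose form is not purely alphabetic) and the function name `suspicious_verbs` are hypothetical, and the sample data are made up from tags that appear in the encoding example on this page.

```python
from collections import Counter

# Hypothetical (word, MSD) pairs; not taken verbatim from the corpus.
SAMPLE_TOKENS = [
    ("mg/kg", "Vmpp"),
    ("containing", "Vmpp"),
    ("eggs", "Ncnp"),
    ("mg/kg", "Vmpp"),
    ("fat", "Afp"),
    ("in", "Sp"),
]


def suspicious_verbs(tokens):
    """Count verb-tagged tokens whose form contains non-letter characters.

    This is a crude heuristic: it also flags legitimate hyphenated verbs,
    so the counts are candidates for inspection, not confirmed errors.
    """
    counts = Counter()
    for word, msd in tokens:
        if msd.startswith("V") and not word.isalpha():
            counts[(word, msd)] += 1
    return counts


for (word, msd), freq in suspicious_verbs(SAMPLE_TOKENS).most_common():
    print(word, msd, freq)
```

A combination that recurs many times across the corpus is more likely to be a systematic error worth reporting than one that occurs once.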

Encoding

The SVEZ-IJS ACQUIS corpus is encoded in XML, in compliance with the Text Encoding Initiative Guidelines P4. The corpus is encoded as one <TEI.2> element, which is then composed of the TEI header and the text. The TEI header contains meta-information, i.e. it describes the corpus. To make it more accessible, it is also available in HTML, where the heading of each element is linked to its description in TEI P4.
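As a rough illustration of this layout, the sketch below parses a miniature document with Python's standard `xml.etree.ElementTree` module and separates the header from the text. The sample document is hypothetical, and the element names `teiHeader`, `text` and `body` are assumed from general TEI P4 conventions rather than copied from the corpus itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real header and text are far larger.
SAMPLE = """<TEI.2>
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><ab n="1"/><ab n="2"/></body></text>
</TEI.2>"""


def split_corpus(xml_string):
    """Return the root tag, the header element and the text element."""
    root = ET.fromstring(xml_string)
    return root.tag, root.find("teiHeader"), root.find("text")


root_tag, header, text = split_corpus(SAMPLE)
# Collect the n attribute of every translation unit found in the text.
unit_ids = [ab.get("n") for ab in text.iter("ab")]
print(root_tag, header is not None, unit_ids)
```

For the full corpus, loading everything into memory at once may be impractical; an incremental parser such as `ET.iterparse` is one option.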

The body of the corpus text is simply a series of translation units, in the same order as they appeared in the original TM. Each one contains meta-information about the TU and two segments - one in English, and one in Slovene. The tokens in these segments are further linguistically annotated. Below is an example, together with some explanation:

<ab n="163">  <!-- Start of translation unit No. 163 -->
 <interpGrp resp="svez" type="seg">  <!-- Meta-information about the TU -->
  <interp type="status" value="legal" corresp="status.legal"/>  <!-- Translation status: valid categories are given in the header -->
  <interp type="acquis" value="3" corresp="acquis.3"/>  <!-- ACQUIS field code: valid categories are given in the header -->
  <interp type="celex" value="32000L0042"/>  <!-- CELEX source document identifier; there can be more than one -->
 </interpGrp>  <!-- End of meta-information -->
 <seg lang="en">  <!-- Start of English language segment -->
  <w ana="Sp" ctag="IN" lemma="for">For</w>  <!-- Words are marked with <w> -->
  <w ana="Ncnp" ctag="NNS" lemma="egg">eggs</w>  <!-- Value of lemma is the base form of the word -->
...
  <w ana="Sp" ctag="IN">in</w>  <!-- lemma is present only when it differs from the word -->
  <w ana="Vmpp" ctag="VBG">mg/kg</w>  <!-- Value of ana is the MULTEXT-East morphosyntactic description: valid categories are given in the header (note also the mistake in tagging!) -->
  <w ana="Afp" ctag="JJ">fat</w>  <!-- Value of ctag is the Penn Treebank tag; only for English -->
  <c>.</c>  <!-- Punctuation symbols are marked with <c> -->
 </seg>  <!-- End of English language segment -->
 <seg lang="sl">  <!-- Start of Slovene language segment -->
  <w ana="Spsa" lemma="za">Za</w>  <!-- lemma is present even if it differs only in capitalisation from the word -->
  <w ana="Ncnpa" lemma="jajce">jajca</w>  <!-- Nothing much to add here. -->
...
 </seg>  <!-- End of Slovene language segment -->
</ab>  <!-- End of translation unit -->
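
A translation unit of this shape can be read with a few lines of Python. The sketch below is a minimal illustration, not part of the corpus distribution: the sample unit is an abridged version of the example above (the elided tokens are simply dropped), and the function name `parse_unit` and the dictionary layout are hypothetical choices. It fills in a missing lemma with the word form itself, following the rule that lemma is omitted in English when it equals the word.

```python
import xml.etree.ElementTree as ET

# Abridged from the example above; not a verbatim corpus entry.
UNIT = """<ab n="163">
 <interpGrp resp="svez" type="seg">
  <interp type="status" value="legal" corresp="status.legal"/>
  <interp type="acquis" value="3" corresp="acquis.3"/>
  <interp type="celex" value="32000L0042"/>
 </interpGrp>
 <seg lang="en">
  <w ana="Sp" ctag="IN" lemma="for">For</w>
  <w ana="Ncnp" ctag="NNS" lemma="egg">eggs</w>
  <w ana="Sp" ctag="IN">in</w>
  <c>.</c>
 </seg>
 <seg lang="sl">
  <w ana="Spsa" lemma="za">Za</w>
  <w ana="Ncnpa" lemma="jajce">jajca</w>
 </seg>
</ab>"""


def parse_unit(ab):
    """Return (unit number, metadata dict, tokens per language)."""
    meta = {}
    for interp in ab.iter("interp"):
        # A type such as celex can occur more than once, so keep lists.
        meta.setdefault(interp.get("type"), []).append(interp.get("value"))
    segments = {}
    for seg in ab.findall("seg"):
        tokens = []
        for tok in seg:
            if tok.tag == "w":
                tokens.append({
                    "word": tok.text,
                    # A missing lemma means it is identical to the word.
                    "lemma": tok.get("lemma", tok.text),
                    "msd": tok.get("ana"),
                    "ctag": tok.get("ctag"),  # None for Slovene
                })
            elif tok.tag == "c":
                tokens.append({"word": tok.text, "lemma": tok.text,
                               "msd": None, "ctag": None})
        segments[seg.get("lang")] = tokens
    return ab.get("n"), meta, segments


number, meta, segments = parse_unit(ET.fromstring(UNIT))
print(number, meta["celex"], [t["lemma"] for t in segments["en"]])
```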

Concordancing

The corpus is freely available for searching through the IJS on-line concordancer, by selecting SVEZ-IJS-SL or SVEZ-IJS-EN as the corpus. Note that the corpus can also be searched from the government's EVROKORPUS page.

The IJS concordancer uses the IMS Corpus Query Processor as its back-end. CQP allows for quite complex queries, especially as the corpus has positional attributes for lemma and msd, in addition to the default word. A more detailed explanation of the query language is given in the concordance query help, but for illustration we give below some search examples:
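
As a hedged illustration of the query syntax, the Python sketch below assembles a few CQP token expressions over the `word`, `lemma` and `msd` attributes mentioned above. The helper names `token` and `sequence` are hypothetical, the helpers do not escape quotes or regular-expression metacharacters in the values, and the queries are constructed for illustration rather than copied from the concordancer's help page.

```python
def token(**constraints):
    """Build one CQP token expression from attribute="regex" constraints.

    With no constraints the result is [], which matches any token.
    """
    if not constraints:
        return "[]"
    parts = ['{}="{}"'.format(attr, value)
             for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token expressions into a query over adjacent tokens."""
    return " ".join(tokens)


# All word forms of the lemma "egg".
print(token(lemma="egg"))
# An adjective immediately followed by a common noun (MSDs as regexes).
print(sequence(token(msd="A.*"), token(msd="Nc.*")))
# A specific Slovene word form with a specific MSD.
print(token(word="jajca", msd="Ncnpa"))
```

The printed strings, for example `[msd="A.*"] [msd="Nc.*"]`, can be pasted into the concordancer's query field.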

Download

The corpus is also freely available for downloading, but for research purposes only, and on the condition that the authors of the corpus (Office of the Government of the Republic of Slovenia for European Affairs and Department of Knowledge Technologies of the Jožef Stefan Institute) are acknowledged in any work making use of the corpus. To get access to the corpus, please fill out and submit the on-line licence. You will then receive by email a user name and password that enable you to download the corpus.

As the forms are processed manually, it might take some time to receive a reply. If nothing happens for a couple of days, please let us know by sending an email to tomaz.erjavec at ijs.si.

Bibliography

The SVEZ-IJS corpus is described in

Please use this reference when acknowledging the use of the corpus.

Related Sites

Acknowledgments

The compilation of the corpus was financed by the Ministry of Science, Education and Sport of Slovenia, under the project CRP V2-0894 "Izdelava virov in sistema za simultano prevajanje slovenščina-angleščina", and by EU 6FWP projects SEKT and ALVIS.

Page last updated 2006-05-19, et