TEI Headers for
MULTEXT-East cesDoc multilingual corpus


Headers

MULTEXT-East cesDoc multilingual corpus

TEI header (corpus)

creator: et
status: update
date: 1996-10-31 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
MULTEXT-East cesDoc multilingual corpus
Responsibility Statement:
Tomaž Erjavec, JSI
TEI encoding
MULTEXT-East corpus workpackage leader
Responsibility Statement:
Jean Veronis, Nancy Ide; Laboratoire Parole et Langage, Centre National de la Recherche Scientifique, Aix-en-Provence, France
MULTEXT-East Project management
Responsibility Statement:
Nancy Ide, Greg Priest-Dorman; Vassar
CES encoding
English data
Responsibility Statement:
Dan Tufiş, RACAI
Romanian data
Responsibility Statement:
Tomaž Erjavec, JSI
Slovene data
Responsibility Statement:
Vladimír Petkevič, ITCL
Czech data
Responsibility Statement:
Ludmila Dimitrova, BAS
Bulgarian data
Responsibility Statement:
Heiki-Jaan Kaalep, TU
Estonian data
Responsibility Statement:
Csaba Oravecz, HAS
Hungarian data
Responsibility Statement:
Paul Sokolovsky, SIT
Russian data
Responsibility Statement:
Andrius Utka
Lithuanian data
Responsibility Statement:
Cvetana Krstev
Serbian data
Funder:
EU Copernicus Project COP106 "MULTEXT-East"
Funder:
EU Copernicus Concerted Action "TELRI"
Funder:
EU Copernicus Project PL96-1142 "Concede"
Funder:
Individual partners' grants and contracts.
Edition Statement:
Edition: Version 3
Extent: 2,029,874 words
Publications Statement:
Distributor:
MULTEXT-East Web site
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Individual partners, cf. component headers
Availability:

Available for research purposes upon receipt of signed agreement.

Source Description:

See individual component source descriptions.

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. <http://nl.ijs.si/ME/>

Editorial declaration:
Normalisation:

Encoded to Corpus Encoding Standard Level 1 (CES1)

See individual component headers.

Quotation:

All quotation marks converted to q, with the original rendering recorded in the rend attribute. At times marked with other attributes (who, type). q sometimes occurs within s; TEI was extended to accommodate this. See also individual component headers.

Segmentation:

Marked up to the level of paragraph: p, quote plus marking of sub-paragraph element q. Some marking of particular sub-paragraph elements, e.g. name, date, abbr, mentioned, distinct, foreign. See also individual component headers.

No end-of-line hyphenation present in texts.

The two-letter language codes follow ISO 639.

Class declaration:
abbr = 6706
author = 168
bibl = 168
body = 33
byline = 1240
cell = 75
closer = 2
corr = 1
date = 1721
dateline = 220
div = 3433
figDesc = 115
figure = 115
foreign = 1306
group = 1
head = 3414
hi = 6705
item = 167
l = 785
label = 2
lg = 140
list = 33
measure = 18
mentioned = 1635
name = 44823
note = 56
num = 5261
opener = 290
p = 34684
ptr = 42
q = 30236
quote = 1200
ref = 27
row = 15
s = 78905
sp = 251
speaker = 6
table = 3
term = 6
text = 34
time = 87
title = 975
Taxonomy:
Category orwl:
Nineteen Eighty-Four
Category fict:
Fiction
Category news:
Newspapers
Category spch:
Speech
Category oana:
Nineteen Eighty-Four, Morphosyntactically Annotated
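
The per-element counts listed under the class declaration are the kind of figure that can be recomputed from the encoded text. The following is a minimal sketch, assuming the document is well-formed XML without namespaces; the function name count_tags and the sample string are illustrative and are not part of the MULTEXT-East distribution.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text: str) -> Counter:
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    # iter() yields the root element and every descendant element.
    # With namespaced documents the keys would look like "{uri}name".
    return Counter(elem.tag for elem in root.iter())


# Hypothetical miniature document, used only to show the output format.
sample = (
    "<text><body><p>"
    "<s>One <name>Winston</name>.</s><s>Two.</s>"
    "</p></body></text>"
)
counts = count_tags(sample)
for tag in sorted(counts):
    print(f"{tag} = {counts[tag]}")
```

The printed lines follow the same "name = count" layout as the class declaration, which makes a recomputed list easy to compare against the header by eye.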

Profile Description

Language use:
sl-rozaj: Resian (dialect of Slovene)
be: Byelorussian
bg: Bulgarian
br: Breton
ca: Catalan
co: Corsican
cs: Czech
cy: Welsh
da: Danish
de: German
el: Greek/Latin
en: English
es: Spanish
et: Estonian
eu: Basque
fi: Finnish
fr: French
ga: Irish
gd: Scots Gaelic
gl: Galician
hr: Croatian
hu: Hungarian
hy: Armenian
ik: Inupiak
is: Icelandic
it: Italian
ji: Yiddish
ka: Georgian/Ibero
kl: Greenlandic
la: Latin/Latin
lt: Lithuanian
lv: Latvian;Lettish
mk: Macedonian
mo: Moldavian
nl: Dutch
no: Norwegian
oc: Occitan
pl: Polish
pt: Portuguese
rm: Rhaeto-Romance
ro: Romanian
ru: Russian
sh: Serbo-Croatian
sk: Slovak
sl: Slovene
sq: Albanian
sr: Serbian
sv: Swedish
tr: Turkish
tt: Tatar
uk: Ukrainian

Revision Description



TEI header (English text)

creator: NMI
status: update
date: 1995-05-10 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, English
Responsibility Statement:
Nancy Ide
Modified ECI tags of first chapter to conform to CES. Added or modified some sub-paragraph level tagging.
Responsibility Statement:
Tomaz Erjavec
Modified full ECI Orwell to conform to CES V3.15
Responsibility Statement:
Greg Priest-Dorman
Modified Tomaz Erjavec's full Orwell to conform to CES V3.21. Checked and modified markup for correctness down to the paragraph level.
Responsibility Statement:
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSeg and English resources.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 104302 words. WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Vassar College Computer Science Department
Address:
124 Raymond Avenue, Poughkeepsie, New York, USA 12604
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, English
Responsibility Statement:
Nancy Ide
Modified ECI tags of first chapter to conform to CES. Added or modified some sub-paragraph level tagging.
Responsibility Statement:
Tomaz Erjavec
Modified full ECI Orwell to conform to CES V3.15
Responsibility Statement:
Greg Priest-Dorman
Modified Tomaz Erjavec's full Orwell to conform to CES V3.21. Checked and modified markup for correctness down to the paragraph level.
Responsibility Statement:
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSeg and English resources.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Vassar College Computer Science Department
Address:
124 Raymond Avenue, Poughkeepsie, New York, USA 12604
Availability:

Available for research purposes upon receipt of signed agreement

October 1st, 1997
Source Description:
Title Statement:
Title:
The European Corpus Initiative Multilingual Corpus 1: 1984 by George Orwell (English)
Responsibility Statement:
Association for Computational Linguistics
Converted from OTA's DTD to ECI DTD
Publications Statement:
Distributor:
ACL
Address:
ACL
Availability:

Available for research purposes upon receipt of signed agreement

1994
Source Description:
Title Statement:
Title:
Orwell's 1984: electronic edition
Responsibility Statement:
Oxford Text Archive
The four versions of Orwell's 1984 in the OTA were all prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML-like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's DTD by myself and Alan Morrison. The other languages were converted to TEI-conformant SGML by the ECI project in 1993.) -- LB, Nov 1992
Edition Statement:

Public Domain TEI edition prepared at the Oxford Text Archive

Publications Statement:
Distributor:
Oxford Text Archive
Address:
Oxford University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK; archive@ox.ac.uk
Availability:

Freely available for non-commercial use provided that this header is included in its entirety with any copy distributed

19 Nov 1992
Source Description:
Title:
Nineteen Eighty Four
1949; reprinted 1961
Publisher:
New American Library
Place of publication:
New York

Encoding Description

Project description:

This English version of Orwell's 1984 is encoded conformant to level 1 specifications of the Corpus Encoding Standard for the MULTEXT-EAST project. The English is to serve as the base for the parallel corpus, which will include aligned versions of the text in Romanian, Bulgarian, Estonian, Slovenian, Czech, and Hungarian.

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 2.0. CES LEVEL: 1

Quotation:

Rendition attribute values on Q, QUOTE, MENTIONED and TERM tags are adapted from ISOpub and ISOnum standard entity set names when used. If the rend attribute is omitted in the markup, the rendition on the first set of Q, QUOTE, MENTIONED or TERM tags is "PRE lsquo POST rsquo", and the rendition on a Q, MENTIONED or TERM tag nested in a Q or QUOTE tag is "PRE ldquo POST rdquo".
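
The default-rendition rule above can be read as a small lookup. The following is a sketch under assumptions, not the project's tooling: the function name effective_rend and the inside_quotation flag are hypothetical, and the handling of a QUOTE nested inside another quotation is a guess, since the rule does not mention that case.

```python
from typing import Optional

OUTER_DEFAULT = "PRE lsquo POST rsquo"   # first (outermost) level
NESTED_DEFAULT = "PRE ldquo POST rdquo"  # nested in a Q or QUOTE

QUOTING_TAGS = {"q", "quote", "mentioned", "term"}
NESTABLE_TAGS = {"q", "mentioned", "term"}


def effective_rend(tag: str, rend: Optional[str],
                   inside_quotation: bool) -> Optional[str]:
    """Return the rendition implied for a tag when rend may be omitted."""
    if rend is not None:
        return rend                      # an explicit value always wins
    name = tag.lower()
    if name not in QUOTING_TAGS:
        return None                      # the rule does not cover other tags
    if inside_quotation and name in NESTABLE_TAGS:
        return NESTED_DEFAULT
    # Assumption: anything else, including a nested QUOTE, gets the outer default.
    return OUTER_DEFAULT


print(effective_rend("Q", None, inside_quotation=False))
print(effective_rend("MENTIONED", None, inside_quotation=True))
```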

Segmentation:

Marked up to the level of paragraph: P, QUOTE plus marking of sub-paragraph element Q. Some marking of particular sub-paragraph elements: NAME, DATE, ABBR, MENTIONED, DISTINCT, FOREIGN.

No end-of-line hyphenation present in the ECI original.

Class declaration:
abbr = 38
Abbreviations are marked only within marked names. Other abbreviations are not marked.
body = 1
date = 40
All dates which contain one or more digits (the characters 0-9) are marked, including dates specifying day/month/year and dates consisting only of a year. No attempt was made to identify or mark dates in other forms.
distinct = 1
div = 28
foreign = 39
The Newspeak words "thoughtcrime" and "doublethink" are consistently marked as FOREIGN, when they do not appear in some other tag where the lang attribute provides the language information. Latin and French words are also marked.
head = 1
hi = 103
The highlighting tag is used to mark words and phrases which were typographically distinguished in the printed version of the text, and for which no other more precise tag is applicable. In most of these cases, such highlighting signifies emphasis.
item = 4
l = 32
list = 1
mentioned = 261
Rendition information has not been systematically retained. When no rendition information is provided, rendering is generally in italics in the 1949 Harcourt, Brace and World Edition of Nineteen Eighty-Four. The original electronic version contained rendition information inconsistent with the 1949 Harcourt edition.
name = 1744
Frequently occurring names of people, places, organizations, products, languages, and events, are marked. If a name is marked, every occurrence of that name is marked. Person names in the genitive are not marked to include the English genitive suffix "'s". For other names, only those occurrences which function as stand-alone proper nouns are marked; adjectival uses (e.g., "Newspeak words") are not marked.
note = 2
num = 52
Anything containing one or more digits (the characters 0-9) that is not part of a date, and all roman numerals, are marked as a number. In cases where a ratio is expressed (per cent, per thousand), the entire phrase (e.g., "10 per cent") is marked as a number.
p = 1286
lg = 10
ptr = 2
q = 2209
The Q tag is used to mark quoted dialogue. The attribute "type=indirect" is used when attributed speech is marked typographically in the printed text (e.g., "I know you," he seemed to say). The attribute "type=written" is used in those cases where Winston's writing in his diary is represented as quoted thought (e.g., "If there is hope," he wrote, "it lies in the Proles."). If no "rend" attribute is provided on the Q tag, the value is assumed to be "PRE ldquo" on the first Q in a series of Qs within the same P unbroken by #PCDATA and "POST rdquo" on the last Q in the series. The attribute "broken=yes" is used when no sentence terminating punctuation (either inside the Q itself or in the intervening text between two Qs) appears between two dialogue fragments by the same speaker.
quote = 35
QUOTE marks quotations from outside sources, including extensive quotations from Winston's diary and Goldstein's treatise.
s = 6701
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the English resource files.
text = 1
title = 46
Rendition information has not been systematically retained. The original electronic version contained rendition information inconsistent with the 1949 Harcourt, Brace and World edition.

Profile Description

Language use:
ns: Newspeak
ns-jg: Newspeak official jargon
en-ck: British Cockney English

Revision Description



TEI header (Romanian text)

creator: Ştefan Bruda
status: update
date: 1995-12-10 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Romanian
Responsibility Statement:
Dan Tufiş, Center for Artificial Intelligence, NLP division, Romanian Academy
Overall editorship.
Ştefan Bruda, Center for Artificial Intelligence, NLP division, Romanian Academy
Error correction and CES1 conformance.
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Romanian resources.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 118093 words. wordCount was computed treating clitics as distinct words and several words making up a compound as a single word. This count was computed on the segmented document with word mark-up. If the counting ignores clitics and compounds, the wordCount would be 98074; the command sequence that produced this count is:
sed -e '1,/<\/ces[Hh]eader>/d' < ces-file | sed -e 's/<[^<].*>//g' | sed -e 's/<.*$//g' | sed -e 's/^.*>//g' | wc -w
byteCount: disk space occupied by the full SGML text.
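
For readers without sed at hand, the word-count command sequence quoted in the extent note can be approximated in Python. This is a hedged sketch rather than the tool the Romanian team used: count_words and the sample text are hypothetical, and str.split may treat a few Unicode spaces differently from wc -w. Note that, as in the original substitutions, everything between the first and the last tag on a line is discarded, so the count is meaningful mainly when tags and running text sit on separate lines.

```python
import re

HEADER_END = re.compile(r"</ces[Hh]eader>")


def count_words(ces_text: str) -> int:
    """Approximate the sed | wc -w sequence on the text of a CES file."""
    lines = ces_text.splitlines()
    # sed '1,/re/d': delete line 1 through the first LATER line that matches
    # the closing header tag (the end pattern is not tested on line 1).
    body_start = len(lines)              # no match: every line is deleted
    for index in range(1, len(lines)):
        if HEADER_END.search(lines[index]):
            body_start = index + 1
            break
    total = 0
    for line in lines[body_start:]:
        line = re.sub(r"<[^<].*>", "", line)  # a '<' through the last '>' on the line
        line = re.sub(r"<.*$", "", line)      # an unclosed '<' through end of line
        line = re.sub(r"^.*>", "", line)      # start of line through a stray '>'
        total += len(line.split())            # whitespace-separated tokens
    return total


# Hypothetical miniature file; the header words must not be counted.
sample = "\n".join([
    "<cesDoc>",
    "<cesHeader>",
    "<title>x y</title>",
    "</cesHeader>",
    "<p>",
    "Era o zi",
    "<s>senina si rece",
    "de aprilie</s>",
    "</p>",
    "</cesDoc>",
])
print(count_words(sample))
```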
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Romanian Academy, Centre for Artificial Intelligence
Address:
13, 13 Septembrie Str., Bucharest, Romania
Address:
eAddress: tufis@valhalla.racai.ro
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Romanian
Responsibility Statement:
Dan Tufiş, Center for Artificial Intelligence, NLP division, Romanian Academy
Overall editorship.
Ştefan Bruda, Center for Artificial Intelligence, NLP division, Romanian Academy
Error correction and CES1 conformance.
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Romanian resources.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Romanian Academy, Centre for Artificial Intelligence
Address:
13, 13 Septembrie Str., Bucharest, Romania
Address:
eAddress: tufis@valhalla.racai.ro
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title:
O mie nouă sute optzeci şi patru
George Orwell 1991
Publisher:
Editura Univers
Place of publication:
Bucharest

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0. CES LEVEL: 1

Segmentation:

Marked up to the level of paragraph: P, QUOTE, LIST, POEM plus marking of particular sub-paragraph elements: HI, Q, FOREIGN, NAME

Class declaration:
name = 2159
title = 1
div = 28
text = 1
foreign = 429
l = 26
body = 1
quote = 23
item = 4
p = 1335
num = 3
lg = 7
hi = 413
q = 2137
Q tags with an attribute of "type=MI" have been inserted automatically after S insertion.
head = 28
s = 6487
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the Romanian resource files.
note = 3
abbr = 3
list = 1
date = 7

Profile Description

1996-05-06
Language use:
ns-ro: Nouvorbă

Revision Description



TEI header (Slovene text)

creator: ET
status: update
date: 1996-04-18 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Slovene
Responsibility Statement:
Tomaž Erjavec
Error correction and CES1 conformance.
Olga Vuković
Up-translation of ECI version to CES1 V2.0 conformance, using the printed edition as the reference; proof-reading the text.
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Slovene resources.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 91619 words. WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jožef Stefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Slovene
Responsibility Statement:
Tomaž Erjavec
Error correction and CES1 conformance.
Olga Vuković
Up-translation of ECI version to CES1 V2.0 conformance, using the printed edition as the reference; proof-reading the text.
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Slovene resources.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jožef Stefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
The European Corpus Initiative Multilingual Corpus 1: 1984 by George Orwell (Slovene)
Responsibility Statement:
Association for Computational Linguistics
Converted from OTA's DTD to ECI DTD
Publications Statement:
Distributor:
ACL
Address:
ACL
Availability:

Available for research purposes upon receipt of signed agreement

1994
Source Description:
Title Statement:
Title:
Orwell's 1984: electronic edition
Responsibility Statement:
Oxford Text Archive
The four versions of Orwell's 1984 in the OTA were all prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML-like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's DTD by myself and Alan Morrison. The other languages were converted to TEI-conformant SGML by the ECI project in 1993.) -- LB, Nov 1992
Edition Statement:

Public Domain TEI edition prepared at the Oxford Text Archive

Publications Statement:
Distributor:
Oxford Text Archive
Address:
Oxford University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK; archive@ox.ac.uk
Availability:

Freely available for non-commercial use provided that this header is included in its entirety with any copy distributed

19 Nov 1992
Source Description:
Title:
1984
George Orwell Translator: Alenka Puhar 1983
Publisher:
Knjižnica Kondor
Publisher:
Mladinska knjiga
Place of publication:
Ljubljana

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3. CES LEVEL: 1

Typographical mistakes corrected

Quotation:

Rendition attribute values on HI, Q and QUOTE tags are adapted from ISOpub and ISOnum standard entity set names. The 'default' rendition of Q (PRE mdash) has not been included in Q.

All end-of-line hyphenation removed.

Segmentation:

Marked up to the level of paragraph: P, QUOTE, LIST, POEM plus marking of particular sub-paragraph elements: NAME, Q. Page breaks are left in the document as comments.

No end-of-line hyphenation present in the ECI original.

Class declaration:
abbr = 26
body = 1
date = 33
div = 28
foreign = 7
head = 29
hi = 242
item = 4
l = 34
list = 1
name = 1327
note = 1
p = 1288
lg = 10
ptr = 1
q = 2260
Q tags with an attribute of "type=MI" have been inserted automatically after S insertion.
quote = 35
s = 6689
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the Slovene resource files.
text = 1
title = 10

Profile Description

1996-04-18
Language use:
ns-sl: Newspeak Slovene

Revision Description



TEI header (Czech text)

creator: VP
status: update
date: 1996-04-20 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Czech
Responsibility Statement:
Vladimír Petkevič
Checked and modified markup for correctness down to the subparagraph level
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Czech resources.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 80317 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic (ÚTKL FFUK)
Address:
Celetná 13, Prague, Czech Republic
Address:
eAddress: vladimir.petkevic@ff.cuni.cz
Address:
eAddress: ucnk.ff.cuni.cz directory: pub/corpora/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Czech
Responsibility Statement:
Vladimír Petkevič
Checked and modified markup for correctness down to the subparagraph level
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Czech resources.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic (ÚTKL FFUK)
Address:
Celetná 13, Prague, Czech Republic
Address:
eAddress: vladimir.petkevic@ff.cuni.cz
Address:
eAddress: ucnk.ff.cuni.cz directory: pub/corpora/ME
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
Electronic form of 1984 by George Orwell in Czech, obtained via OCR
Responsibility Statement:
Vladimír Petkevič, Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic (ÚTKL FFUK)
OCR'ed the novel
Publications Statement:
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic (ÚTKL FFUK)
Address:
Celetná 13, Praha 1, Czech Republic
Availability:

Available for research purposes only

May 1, 1996
Source Description:
Title:
1984
George Orwell 1991
Publisher:
Naše vojsko
Place of publication:
Prague, Czech Republic

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0. CES LEVEL: 1

The OCR'ed text of the novel has been automatically spell-checked.

The text contains no hyphens

Segmentation:

Two levels of DIV are used: the first denotes the PARTS, the second denotes the CHAPTERS within PARTS. Marked up down to the subparagraph level according to the CES canonical markup of the English version.

Class declaration:
abbr = 23
body = 1
date = 39
div = 28
foreign = 91
head = 1
hi = 75
item = 4
l = 33
list = 1
mentioned = 244
name = 2181
note = 2
num = 48
p = 1285
lg = 11
ptr = 1
q = 2208
Q tags with an attribute of "type=MI" have been inserted automatically after S insertion.
quote = 36
s = 6714
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the Czech resource files.
term = 2
text = 1
title = 45

Profile Description

1996-04-18
Language use:
cs-cl: Czech colloquial
ns-cs: Newspeak Czech
ns-jg-cs: Newspeak official jargon Czech

Revision Description



TEI header (Bulgarian text)

creator: LS
status: update
date: 1996-06-05 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Bulgarian
Responsibility Statement:
Lydia Sinapova, Ludmila Dimitrova, Kiril Simov
Typing-in '1984', inserting paragraph and some sub-paragraph level tagging.
Responsibility Statement:
Lydia Sinapova
Modified full Orwell markup down to sub-paragraph level to conform to CES V4.0, using the English version as a base
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Bulgarian resources.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 87235 words. WordCount represents the number of words in this text exclusive of tags and header information. Microsoft Word 6.0 was used to count words. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Institute of Mathematics, Bulgarian Academy of Sciences
Address:
Acad. G. Bonchev St., bl. 8, 1113 Sofia, Bulgaria
Address:
eAddress: mult@ling.math.acad.bg
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Bulgarian
Responsibility Statement:
Lydia Sinapova, Ludmila Dimitrova, Kiril Simov
Typing-in '1984', inserting paragraph and some sub-paragraph level tagging.
Responsibility Statement:
Lydia Sinapova
Modified full Orwell markup down to sub-paragraph level to conform to CES V4.0, using the English version as a base
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Bulgarian resources.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Institute of Mathematics, Bulgarian Academy of Sciences
Address:
Acad. G. Bonchev St., bl. 8, 1113 Sofia, Bulgaria
Address:
eAddress: mult@ling.math.acad.bg
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title:
Nineteen Eighty Four (Bulgarian)
George Orwell
Responsibility Statement:
Translator:
Lydia Bozhilova
1989
Publisher:
Profizdat
Place of publication:
Sofia

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0. CES LEVEL: 1

Quotation:

No quotation marks are preserved in the text. Rendition attribute values on Q and QUOTE tags are adapted from ISOpub and ISOnum standard entity set names. Two rendition short-cuts are used: 'rend=mdash' stands for 'rend="PRE mdash POST mdash"', and 'rend=dblq' stands for 'rend="PRE ldquo POST rdquo"'. 'rend="PRE mdash"' (or "PRE ldquo") is used when the quoted dialogue ends with the paragraph (there is no other typographical distinction). 'rend="POST mdash"' (or "POST rdquo") is used when there is no typographical distinction (except ordinary punctuation) for the beginning of the quoted dialogue. No default rendition is used.
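
The two rendition short-cuts described above lend themselves to a simple expansion step before any further processing. The following is a minimal sketch; the names REND_SHORTCUTS, expand_rend and parse_rend are hypothetical and do not come from the MULTEXT-East tools, and the parser assumes a rend value is a sequence of position/entity pairs such as "PRE mdash".

```python
from typing import Dict

# The two short-cuts named in the Bulgarian header.
REND_SHORTCUTS = {
    "mdash": "PRE mdash POST mdash",
    "dblq": "PRE ldquo POST rdquo",
}


def expand_rend(rend: str) -> str:
    """Replace a short-cut by its full form; leave other values unchanged."""
    return REND_SHORTCUTS.get(rend, rend)


def parse_rend(rend: str) -> Dict[str, str]:
    """Split a rend value into a mapping such as {'PRE': 'ldquo'}."""
    tokens = expand_rend(rend).split()
    # Tokens alternate: position keyword (PRE/POST), then entity name.
    return dict(zip(tokens[0::2], tokens[1::2]))


print(expand_rend("mdash"))
print(parse_rend("dblq"))
print(parse_rend("PRE mdash"))
```

A value with only PRE or only POST yields a one-entry mapping, which matches the cases where the dialogue ends with the paragraph or has no typographically marked beginning.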

Segmentation:

Marked up to the level of paragraph: P, QUOTE, POEM, NOTE, plus marking of sub-paragraph element Q. Some marking of particular sub-paragraph elements: NAME, DATE, TIME, MENTIONED, FOREIGN, ABBR.

No end-of-line hyphenation present.

Class declaration:
abbr = 28
All abbreviations are marked.
body = 1
date = 40
All dates which contain one or more digits (the characters 0-9) are marked, including dates specifying day/month/year and dates consisting only of a year. The attribute 'iso8601' is used consistently except in two cases: when the date specifies year and consists only of digits, and within quoted Newspeak sentences. No attempt was made to identify or mark dates in other forms.
div = 28
foreign = 29
Only those Newspeak words which are typographically distinguished in the printed version of the text are marked as FOREIGN if they do not appear in some other tag where the lang attribute provides the language information. Latin words are also marked.
head = 1
hi = 103
The highlighting tag is used to mark words and phrases which were typographically distinguished, and for which no other more precise tag is applicable. In most of these cases, such highlighting signifies emphasis.
item = 4
l = 26
list = 1
mentioned = 256
name = 1704
All names of people, places, organizations, products, and events, are marked. Person names in the genitive are not marked. All names of countries and towns are marked with type=place. Names of rivers and oceans are not marked. While in the English version the word INGSOC is marked with NAME LANG=NS, in the Bulgarian version it is marked only if typographically distinguished from the rest of the text. In the English version the word NEWSPEAK is marked with NAME TYPE=LANGUAGE, while in the Bulgarian version it is marked only if typographically distinguished from the rest of the text.
note = 8
num = 34
Anything containing one or more digits (the characters 0-9) that is not part of a date, and all roman numerals, are marked as a number. In cases where a ratio is expressed (per cent, per thousand), the entire phrase (e.g., "10 per cent") is marked as a number.
p = 1321
lg = 7
ptr = 8
q = 2203
The Q tag is used to mark quoted dialogue. The attribute "broken=yes" is used when no sentence terminating punctuation (either inside the Q itself or in the intervening text between two Qs) appears between two dialogue fragments by the same speaker. Q tags with an attribute of "type=MI" have been inserted automatically after S insertion.
quote = 34
QUOTE marks quotations from outside sources, including extensive quotations from Winston's diary and Goldstein's treatise.
s = 6649
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the Slovene resource files.
text = 1
title = 41

Profile Description

Language use:
bg-cl: Bulgarian colloquial
ns-bg: Newspeak Bulgarian
ns-jg-bg: Newspeak official jargon Bulgarian

Revision Description



TEI header (Estonian text)

creator: HJK
status: update
date: 1995-10-18 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Estonian
Responsibility Statement:
Viire Villandi
entered and validated the text
Heili Orav
added CES tags
Heiki-Jaan Kaalep
supervised the work
Heiki-Jaan Kaalep
modified the tags and header for version 4
Leho Paldre
modified the tags and header for version 4.1
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Estonian resources.
Leho Paldre
Manually checked automatic tagging of sentences. Corrected 48 typos.
Heiki-Jaan Kaalep
Corrected the final bytecount and wordcount.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 79334 words. WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
TÜ arvutuslingvistika uurimisgrupp
Address:
Tiigi 78-232, Tartu, Estonia
Address:
eAddress: hkaalep@psych.ut.ee
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Estonian
Responsibility Statement:
Viire Villandi
entered and validated the text
Heili Orav
added CES tags
Heiki-Jaan Kaalep
supervised the work
Heiki-Jaan Kaalep
modified the tags and header for version 4
Leho Paldre
modified the tags and header for version 4.1
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Estonian resources.
Leho Paldre
Manually checked automatic tagging of sentences. Corrected 48 typos.
Heiki-Jaan Kaalep
Corrected the final bytecount and wordcount.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
TÜ arvutuslingvistika uurimisgrupp
Address:
Tiigi 78-232, Tartu, Estonia
Address:
eAddress: hkaalep@psych.ut.ee
Availability:

Freely available

October 1, 1997
Source Description:
Title:
1984
George Orwell
Responsibility Statement:
Elias Treeman
Translator from English
Edition: Loomingu Raamatukogu 1990 nr. 48-51
Publisher:
Perioodika
Place of publication:
Tallinn
1990

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.1 CES LEVEL: 1

Segmentation:

Marked up to the level of paragraph: P, QUOTE plus marking of sub-paragraph element Q, incl. broken Qs. Some marking of particular sub-paragraph elements. BODY, DIV, HEAD, ITEM, L, LIST, NOTE, P, POEM, PTR, QUOTE, TEXT are used so as to be in harmony with the English electronic version of 1984 for MULTEXT-EAST v. 4; the differences are due only to the differences between the English electronic and Estonian printed versions. ABBR, DATE, FOREIGN, HI, MENTIONED, NAME, NUM, Q, TITLE are used sloppily.

No end-of-line hyphenation

Class declaration:
abbr = 73
body = 1
date = 18
div = 28
foreign = 93
head = 5
hi = 183
item = 4
l = 32
list = 1
mentioned = 44
name = 2457
note = 2
num = 14
p = 1289
lg = 10
ptr = 2
q = 2192
Q tags with an attribute of "type=MI" have been inserted automatically.
quote = 35
s = 6658
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the Estonian resource files.
text = 1
title = 29
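
Tag usage figures of the kind listed in this class declaration could be recomputed from a corpus file with a short script. The following is a minimal sketch, assuming an XML/TEI P4 file without namespaces; the function name count_tags and the inline SAMPLE document are illustrative only and are not part of the distribution.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text):
    """Return a Counter mapping lowercase element names to occurrence counts.

    Assumes the document uses no XML namespaces (hypothetical simplification).
    """
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(elem.tag.lower() for elem in root.iter())


# Tiny illustrative document, not taken from the corpus.
SAMPLE = (
    "<text><body><div><p>"
    "<s>One <name>Winston</name>.</s><s>Two.</s>"
    "</p></div></body></text>"
)

if __name__ == "__main__":
    # Print counts in the same "tag = n" layout as the class declaration.
    for tag, n in sorted(count_tags(SAMPLE).items()):
        print(f"{tag} = {n}")
```

Comparing such counts against the declared figures is one way to spot a header that has drifted from its text.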

Profile Description

Language use:
ns-et: Newspeak Estonian

Revision Description



TEI header (Hungarian text)

creator: CO
status: update
date: 1996-04-22 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Hungarian
Responsibility Statement:
Csaba Oravecz
CES1 conformant tagging
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Hungarian resources.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 81167 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Hungarian
Responsibility Statement:
Csaba Oravecz
CES1 conformant tagging
Greg Priest-Dorman
Added tagging of sentences in paragraphs using MtSgml and Hungarian resources.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title:
1984
George Orwell 1989
Publisher:
Európa Könyvkiadó
Place of publication:
Budapest

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

The following errors in the Hungarian edition have been corrected (original -> corrected):
p. 18. l. 9.: első számú -> elsőszámú
p. 52. l. 14.: nemlétező -> nem létező
p. 85. l. 9.: reménytvesztve -> reményt vesztve
p. 128. l. 9.: lovasszobor -> lovas szobor
p. 148. l. 17.: éleselméjűnek -> éles elméjűnek
p. 212. l. 21.: kell hogy -> kell, hogy
p. 233. l. 26.: lelkitusa -> lelki tusa
p. 295. l. 34.: kell hogy -> kell, hogy
p. 127. l. 32. - p. 128. l. 3.: A lány sietve befejezte az ebédet, és eltávozott. Winston még ott maradt, rágyújtott egy cigarettára. Többet nem beszéltek, s amennyire két, ugyanannál az asztalnál egymással szemközt ülő ember egyáltalán megteheti, nem is néztek egymásra. -> Többet nem beszéltek, s amennyire két, ugyanannál az asztalnál egymással szemközt ülő ember egyáltalán megteheti, nem is néztek egymásra. A lány sietve befejezte az ebédet, és eltávozott. Winston még ott maradt, rágyújtott egy cigarettára.

Class declaration:
abbr = 38
body = 1
date = 39
All dates which contain one or more digits are marked, including dates specifying day/month/year and dates consisting only of a year.
div = 28
foreign = 43
The Newspeak words "gondolatbűn" (thoughtcrime), "gondolatbűnöző" (thoughtcriminal) and "duplagondol" (doublethink) are consistently marked as FOREIGN, when they do not appear in some other tag where the lang attribute provides the language information. Latin and French words are also marked.
head = 5
hi = 71
The highlighting tag is used to mark words and phrases which were typographically distinguished in the printed version of the text, and for which no other more precise tag is applicable.
item = 4
l = 32
list = 1
mentioned = 218
name = 1843
Frequently occurring names of people, places, organizations, products, languages, and events, are marked.
note = 2
num = 10
p = 1292
lg = 10
ptr = 2
q = 2197
The Q tag is used to mark quoted dialogue. The attribute "type=indirect" is used when attributed speech is marked typographically in the printed text. The attribute "type=written" is used in those cases where Winston's writing in his diary is represented as quoted thought. If no "rend" attribute is provided on the Q tag, the value is assumed to be "PRE mdash POST mdash". The exception is the second section of a broken Q tag (see below): there, the absence of a rendition on the tag indicates a lack of typographical marking in the text, and any typographical marking that is present is given explicitly in the "rend" attribute on the tag. The attribute "broken=yes" is used when no sentence terminating punctuation (either inside the Q itself or in the intervening text between two Qs) appears between two dialogue fragments by the same speaker. Q tags with an attribute of "type=MI" have been inserted automatically after S insertion.
quote = 35
QUOTE marks quotations from outside sources, including extensive quotations from Winston's diary and Goldstein's treatise.
s = 6732
S tags have been inserted automatically and then cleaned up by hand in the locations (character offsets) provided by MTSeg version 1.3.1 using the Hungarian resource files.
text = 1
title = 40
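
The defaulting rule for the "rend" attribute on Q, described in this class declaration, can be read as a small decision function. The sketch below is one possible reading only: the function name effective_q_rend and the flag second_part_of_broken are hypothetical, and how the second section of a broken Q is recognised in the markup is left to the caller.

```python
# Default rendition assumed for a Q tag that carries no "rend" attribute.
DEFAULT_Q_REND = "PRE mdash POST mdash"


def effective_q_rend(rend=None, second_part_of_broken=False):
    """Return the rendition that applies to a Q tag, or None if unmarked.

    rend: value of the "rend" attribute, or None when the attribute is absent.
    second_part_of_broken: True for the second section of a broken Q
        (hypothetical flag; the caller decides how to detect this).
    """
    if rend is not None:
        # An explicit rendition always wins.
        return rend
    if second_part_of_broken:
        # No attribute here means no typographical marking in the text.
        return None
    return DEFAULT_Q_REND


if __name__ == "__main__":
    print(effective_q_rend())
    print(effective_q_rend(second_part_of_broken=True))
```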

Profile Description

Language use:
ns-hu: Newspeak Hungarian
ns-jg-hu: Newspeak official jargon Hungarian

Revision Description



TEI header (Russian text)

creator: SIT
status: update
date: 1997-09-27 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Russian
Responsibility Statement:
Paul Sokolovsky, Sergey Sryvkin
Proofreading, hyphenation deletion, formatting, inserting paragraph and sub-paragraph level tagging.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 76469 words. WordCount represents the number of words in this text exclusive of tags and header information, counted before the markup process. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Severodonetsk Institute of Technology, East-Ukraine State University
Address:
Sovetsky st., bl. 3a, Severodonetsk, Lugansk reg., Ukraine
Address:
eAddress: Paul.Sokolovsky@technologist.com
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Russian
Responsibility Statement:
Paul Sokolovsky, Sergey Sryvkin
Proofreading, hyphenation deletion, formatting, inserting paragraph and sub-paragraph level tagging.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Severodonetsk Institute of Technology, East-Ukraine State University
Address:
Sovetsky st., bl. 3a, Severodonetsk, Lugansk reg., Ukraine
Address:
eAddress: Paul.Sokolovsky@technologist.com
Availability:

Available for research purposes upon receipt of signed agreement

January 1st, 1998
Source Description:
Title Statement:
Title:
Orwell's 1984, Russian: plaintext electronic edition
Responsibility Statement:
Maxim Moshkov's Library
Made the electronic edition available on the Internet
Publications Statement:
Distributor:
Maxim Moshkov's Library
Address:
http://www.moshkow.orc.ru/koi http://www.alkar.net/moshkow/html-KOI
Unknown
Source Description:
Title:
Nineteen Eighty Four (Russian)
George Orwell Translator: V. Golyshev Unknown
Publisher:
Unknown
Place of publication:
Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106. This text is a volunteer contribution to the project.

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

Quotation:

No quotation marks are preserved in the text. In keeping with the conventions of written Russian, only double quotes are used in the rendition ("PRE ldquo POST rdquo").

Segmentation:

Marked up to the paragraph level: P, QUOTE, NOTE, plus marking of sub-paragraph element Q. Some marking of particular sub-paragraph elements: NAME, DATE, TIME, MENTIONED, FOREIGN, ABBR.

No hyphenation marks are present in the text.

Class declaration:
abbr = 11
All abbreviations are marked.
body = 1
date = 36
All dates which contain one or more digits (the characters 0-9) are marked, including dates specifying day/month/year and dates consisting only of a year. The attribute 'iso8601' is used consistently. Where a phrase contains two dates, one written in digits and the other in words, the latter is marked up too, e.g. "in 1944 and forty-five". No attempt was made to identify or mark dates in other forms.
div = 28
foreign = 348
Some of the other mteO-??.ces headers state that only words highlighted typographically in the printed text were marked. We instead mark up Newspeak words that, for some reason (mostly morphological), cannot be correct Russian. A typical example is the translation of 'telescreen', which in our view carries the idea of Newspeak over wonderfully. Instead of translating "telescreen" literally as "теле-экран", the translator contracted 'е' and 'э' (an unusual phenomenon in Russian), giving "телекран". It sounds even more awful given that the stem "кран" has no semantic relation to the original "экран". So this word cannot be plain Russian: it is 100% Newspeak Russian!
head = 1
hi = 10
This applies to the rend attribute of other tags too. As our primary source for markup was an electronic plaintext version, the original contained no character-level typographical rendition except capitalization. Paragraph-level rendition included only line breaking and centering. Although we consulted the printed book during the process, we decided not to add character-level rendition, because the book is a different version and because CES advises that all rendition be resolved into descriptive tags, leaving no rend attributes whose mere purpose is to recreate the original appearance. So only the CA and CE values of rend are used. As in the other mteO-?? texts, capitalized text was decapitalized.
item = 4
l = 39
list = 1
mentioned = 216
CES and TEI give somewhat vague criteria for what to mark as mentioned. We tried to carry over the occurrences from Oen, though the marking may be inconsistent in places.
name = 2105
All names of people, places, organizations, products, and events are marked. Person names in the genitive are not marked. All names of countries and towns are marked with type=place. Names of rivers and oceans are also marked with type=place. Some other proper nouns (or noun groups) denoting places were marked, e.g. Golden Country and Chestnut Tree Café.
note = 2
Strangely, the electronic version had no notes, although they exist in the printed reference. We have reinserted them.
num = 12
Numbers are marked only if the corresponding number in the English version is marked too, so only some occurrences are marked.
lg = 10
ptr = 2
q = 2160
The Q tag is used to mark slogans and quoted dialogue. The attribute "broken=yes" is currently not inserted when no sentence terminating punctuation (either inside the Q itself or in the intervening text between two Qs) appears between two dialogue fragments by the same speaker.
quote = 30
QUOTE marks quotations from outside sources, including extensive quotations from Winston's diary and Goldstein's treatise.
ref = 1
s = 0
S tags have not yet been inserted.
text = 1
title = 44

Profile Description

Language use:
ns-ru: Newspeak Russian
ns-jg-ru: Newspeak official jargon Russian
ru-cl: Russian colloquial

Revision Description



TEI header (Lithuanian text)

creator: ET
status: update
date: 1997-11-03 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Lithuanian
Responsibility Statement:
Andrius Utka
Overall editorship
Responsibility Statement:
Aurelijus Gruodis
Scanning and automatic transliteration
Responsibility Statement:
Tomaž Erjavec
CES conformance
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 71252 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
TELRI
Address:
IDS, Mannheim
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Lithuanian
Responsibility Statement:
Andrius Utka
Overall editorship
Responsibility Statement:
Aurelijus Gruodis
Scanning and automatic transliteration
Responsibility Statement:
Tomaž Erjavec
CES conformance
Edition Statement:

TELRI Final Release

Publications Statement:
Distributor:
TELRI
Address:
IDS, Mannheim
Availability:

Available on receipt of signed agreement

January 1st, 1998
Source Description:
Title:
Džordžas Orvelas 1984-ieji
George Orwell From English translated by Virgilijus Čepliejus; 1991
Publisher:
Vyturys
Place of publication:
Vilnius

Encoding Description

Project description:

TELRI

Class declaration:
body = 1
div = 28
head = 28
hi = 136
The highlighting tag is used to mark words (including Newspeak words) and phrases, which were typographically distinguished in the printed version of the text.
item = 4
l = 36
list = 1
note = 1
p = 1331
lg = 12
ptr = 1
q = 40
Only used for mentioned words and 'short' passages, to avoid a clash with the S hierarchy. One Q is marked with TYPE=MI, to deal with the two embedded sentences under Olt.1.5.24.1.
quote = 38
s = 6675
text = 1
title = 7

Profile Description

Language use:
ns-lt: Newspeak Lithuanian

Revision Description



TEI header (Serbian text)

creator: CK
status: update
date: 1997-12-03 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Nineteen Eighty-Four, Serbian
Responsibility Statement:
Cvetana Krstev
Error correction, CES1 conformance.
Dusko Vitas
Consulting.
Tomaž Erjavec
CES1 conformance, encoding harmonisation with the MULTEXT-East '1984' corpus.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 89749 words. WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Computer Science Department, Faculty of Mathematics
Address:
Studentski trg 16, 11000 Belgrade, Yugoslavia (Serbia)
Address:
eAddress: cvetana@matf.bg.ac.yu
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Nineteen Eighty-Four, Serbian
Responsibility Statement:
Cvetana Krstev
Error correction, CES1 conformance.
Dusko Vitas
Consulting.
Tomaž Erjavec
CES1 conformance, encoding harmonisation with the MULTEXT-East '1984' corpus.
Edition Statement:

TELRI Final Release

Publications Statement:
Distributor:
Computer Science Department, Faculty of Mathematics
Address:
Studentski trg 16, 11000 Belgrade, Yugoslavia (Serbia)
Address:
eAddress: cvetana@matf.bg.ac.yu
Availability:

Available for research purposes upon receipt of signed agreement

January 1st, 1998
Source Description:
Title Statement:
Title:
Orwell's 1984: electronic edition
Responsibility Statement:
Oxford Text Archive
The four versions of Orwell's 1984 in the OTA were all prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML-like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's dtd by myself and Alan Morrison). The other languages were converted to TEI conformant SGML by the ECI project 1993. --LB, Nov 1992
Edition Statement:

Public Domain TEI edition prepared at the Oxford Text Archive

Publications Statement:
Distributor:
Oxford Text Archive
Address:
Oxford University Computing Service 13 Banbury Road Oxford OX2 6NN UK archive@ox.ac.uk
Availability:

Freely available for non-commercial use provided that this header is included in its entirety with any copy distributed

19 Nov 1992
Source Description:
Title:
1984
George Orwell Translator: Vlada Stojiljković Edition: Second edition 1984
Publisher:
Beogradski izdavačko-grafički zavod
Place of publication:
Beograd

Encoding Description

Project description:

TELRI

Editorial declaration:
Normalisation:

Corpus Encoding Specification, Version 4.3 CES LEVEL: 1

Typographical mistakes corrected while preparing the electronic edition, though not systematically.

Quotation:

Rendition attribute values on HI, Q and QUOTE tags are adapted from ISOpub and ISOnum standard entity set names. The 'default' rendition of Q (PRE mdash) has not been included in Q.

All end-of-line hyphenation removed.

Segmentation:

Marked up to the level of paragraph: P, QUOTE, LIST, POEM plus marking of particular sub-paragraph elements: NAME, Q. Page breaks left in the document as comments.

End-of-line hyphenation present in the OTA digital original.

Class declaration:
abbr = 14
body = 1
date = 39
div = 28
foreign = 7
head = 27
hi = 323
item = 4
l = 32
list = 1
name = 1371
note = 1
p = 1282
lg = 10
q = 2245
Q tags with an attribute of "type=MI" have been inserted automatically after S insertion.
quote = 35
s = 6643
S tags have been inserted automatically using the awk script written for this purpose and then cleaned up by hand.
text = 1
title = 4

Profile Description

1997-12-03
Language use:
ns-sr: Newspeak Serbian

Revision Description



TEI header (English text)

creator: ET
status: update
date: 1997-09-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, English
Responsibility Statement:
Tomaž Erjavec, IJS E8
CES encoding
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 2250 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, English
Responsibility Statement:
Tomaž Erjavec, IJS E8
CES encoding
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P, S

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 196
Not all passages have 5 sentences!
text = 1
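
The figures in this class declaration (40 passages, 196 sentences, and the note that not all passages have 5 sentences) suggest a simple consistency check. The following is a minimal sketch, assuming lowercase element names and no XML namespaces; the function name odd_passages, the helper make_passage and the sample data are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET


def odd_passages(xml_text, expected=5):
    """List (paragraph_number, sentence_count) for paragraphs whose
    number of S elements differs from the expected value."""
    root = ET.fromstring(xml_text)
    result = []
    # Paragraphs are numbered from 1 in document order.
    for index, p in enumerate(root.iter("p"), start=1):
        n = len(p.findall(".//s"))
        if n != expected:
            result.append((index, n))
    return result


def make_passage(n):
    """Build an illustrative DIV holding one P with n dummy sentences."""
    return "<div><head>h</head><p>" + "<s>x.</s>" * n + "</p></div>"


# Two illustrative passages: one with 5 sentences, one with 4.
SAMPLE = "<text><body>" + make_passage(5) + make_passage(4) + "</body></text>"

if __name__ == "__main__":
    for number, count in odd_passages(SAMPLE):
        print(f"passage {number}: {count} sentences")
```

A check like this only reports where the counts differ; whether a given passage really should have five sentences still has to be decided against the source.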

Profile Description

Revision Description



TEI header (Romanian text)

creator: ET
status: update
date: 1997-09-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, Romanian
Responsibility Statement:
ICI
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 2283 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, Romanian
Responsibility Statement:
ICI
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 194
Probable errors!
text = 1

Profile Description

Revision Description



TEI header (Slovene text)

creator: ET
status: update
date: 1996-04-18 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, Slovene
Responsibility Statement:
Tomaž Erjavec, IJS E8
CES conformance, translation correction
Damjan Bojadžiev, IJS E8
English-Slovene translation
Aleš Dobinikar, IJS E8
Recording supervision, digitisation
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 1957 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, Slovene
Responsibility Statement:
Tomaž Erjavec, IJS E8
CES conformance, translation correction
Damjan Bojadžiev, IJS E8
English-Slovene translation
Aleš Dobinikar, IJS E8
Recording supervision, digitisation
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P, S

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 196
text = 1

Profile Description

Revision Description



TEI header (Czech text)

creator: ET
status: update
date: 1997-09-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, Czech
Responsibility Statement:
Vladimír Petkevič
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 2073 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, Czech
Responsibility Statement:
Vladimír Petkevič
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P, S

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 194
Probable errors!
text = 1

Profile Description

Revision Description



TEI header (Bulgarian text)

creator: ET
status: update
date: 1997-09-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, Bulgarian
Responsibility Statement:
Ludmila Dimitrova
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 2062 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, Bulgarian
Responsibility Statement:
Ludmila Dimitrova
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 201
text = 1

Profile Description

Revision Description



TEI header (Estonian text)

creator: ET
status: update
date: 1997-09-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, Estonian
Responsibility Statement:
Heiki-Jaan Kaalep
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 1887 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, Estonian
Responsibility Statement:
Heiki-Jaan Kaalep
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P, S

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 199
text = 1

Profile Description

Revision Description



TEI header (Hungarian text)

creator: ET
status: update
date: 1997-09-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Speech, Hungarian
Responsibility Statement:
Laszlo Tihanyi
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 1705 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Speech, Hungarian
Responsibility Statement:
Laszlo Tihanyi
Translation from English
Tomaž Erjavec, IJS E8
CES encoding
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, SI-1000 Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Freely available

October 1, 1997
Source Description:
Title Statement:
Title:
Blocks O..R, 0..9 from the English section of the SAM EUROM.1 speech database.
Responsibility Statement:
Daniel Hirst, Laboratoire Parole et Langage
Selected the passages
Publications Statement:
Distributor:
EUROM
Address:
Unknown
Availability:

Freely available

Unknown

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Segmentation:

Marked up for DIV, P

Class declaration:
body = 1
div = 40
head = 40
p = 40
s = 196
Probable errors!
text = 1
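
The "Probable errors!" note above flags a declared count that may not match the text. A minimal sketch of how tag counts could be recomputed and compared with a class declaration, assuming the component is available as well-formed XML without namespaces; the function names and the sample string are hypothetical and are not part of the MULTEXT-East tools:

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(xml_text):
    """Return a Counter mapping element name -> number of occurrences."""
    root = ET.fromstring(xml_text)
    # iter() visits the root element and all of its descendants
    return Counter(el.tag for el in root.iter())


def compare_with_declaration(counts, declared):
    """Return {tag: (declared, actual)} for every tag whose counts disagree."""
    tags = set(counts) | set(declared)
    return {
        t: (declared.get(t, 0), counts.get(t, 0))
        for t in sorted(tags)
        if declared.get(t, 0) != counts.get(t, 0)
    }


# Hypothetical miniature document, only for illustration
SAMPLE = (
    "<text><body><div><head>H</head>"
    "<p><s>A.</s><s>B.</s></p></div></body></text>"
)

if __name__ == "__main__":
    declared = {"text": 1, "body": 1, "div": 1, "head": 1, "p": 1, "s": 3}
    print(compare_with_declaration(count_tags(SAMPLE), declared))
```

An empty result would mean that the declaration and the text agree for every tag.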

Profile Description

Revision Description



TEI header (Romanian text)

creator: SB
status: update
date: 1996-01-15 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Fiction, Romanian
Responsibility Statement:
Mircea Nicolescu, Liviu Anca, students, Polytechnic University of Bucharest
tagged text to conform to CES
Ştefan Bruda, Center for Artificial Intelligence, NLP division, Romanian Academy of Sciences
Error correction and CES1 conformance.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 164263 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Romanian Academy of Sciences, Centre for Artificial Intelligence
Address:
13, 13 Septembrie Str., Bucharest, Romania
Address:
eAddress: tufis@valhalla.racai.ro
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Fiction, Romanian
Responsibility Statement:
Mircea Nicolescu, Liviu Anca, students, Polytechnic University of Bucharest
tagged text to conform to CES
Ştefan Bruda, Center for Artificial Intelligence, NLP division, Romanian Academy of Sciences
Error correction and CES1 conformance.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Romanian Academy of Sciences, Centre for Artificial Intelligence
Address:
13, 13 Septembrie Str., Bucharest, Romania
Address:
eAddress: tufis@valhalla.racai.ro
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title:
TESTAMENT ÎNTRE ÎNGER ŞI DIAVOL
MIHAI RĂDULESCU 1995
Publisher:
Tipografia PIKA S.R.L.
Place of publication:
Str. Vişinilor nr. 20, Bucureşti
Title:
Obreja
Mihai Radulescu 1993
Publisher:
Editura Ramida
Place of publication:
Bucharest
Title:
Flacari sub cruce
Mihai Radulescu 1995
Publisher:
Editura Ramida
Place of publication:
Bucharest

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

Segmentation:

Marked up to the level of paragraph: P, QUOTE, LIST, POEM plus marking of particular sub-paragraph elements: HI, Q, FOREIGN

Class declaration:
group = 1
body = 3
div = 80
head = 136
hi = 816
l = 222
p = 2844
lg = 18
q = 2193
quote = 341
foreign = 1
text = 1

Profile Description

1996-05-06

Revision Description



TEI header (Slovene text)

creator: ET
status: update
date: 1996-04-18 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Fiction, Slovene
Responsibility Statement:
Tomaž Erjavec, Dept. for Intelligent Systems, Jozef Štefan Institute
Error correction and CES conformance.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 95373 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Fiction, Slovene
Responsibility Statement:
Tomaž Erjavec, Dept. for Intelligent Systems, Jozef Štefan Institute
Error correction and CES conformance.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
Digital form of 'Galjot', obtained via OCR
Responsibility Statement:
The Slovene Society for Blind and Visually Impaired
OCR'ed the novel
Publications Statement:
Distributor:
The Slovene Society for Blind and Visually Impaired
Address:
Ljubljana
Availability:

Unknown
Source Description:
Title:
Galjot
Jančar, Drago
Publisher:
Mladinska knjiga
1984
Place of publication:
Ljubljana, Slovenia

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

The OCR'ed text of the novel has been automatically spell-checked.

Quotation:

Rendition attribute values on Q and QUOTE tags are adapted from ISOpub and ISOnum standard entity set names. Spoken passages are marked by Q even where there are no typographical marks to denote them.

All text was semi-automatically dehyphenated; errors are possible where both parts of a word are themselves words.

Segmentation:

Two levels of DIV are used: the first denotes the chapters, the second divisions which are marked by spacing in the original text. DIV type=chapter is usually followed by a HEAD and OPENER. Marked up to the level of paragraph plus marking of particular sub-paragraph elements: Q, ABBR.

Class declaration:
abbr = 31
body = 1
cell = 75
div = 208
foreign = 5
head = 52
hi = 47
item = 11
l = 4
list = 1
opener = 25
p = 1850
lg = 2
q = 903
quote = 4
row = 15
table = 3
text = 1

Profile Description

1996-04-18

Revision Description



TEI header (Czech text)

creator: VP
status: update
date: 1996-04-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Fiction, Czech
Responsibility Statement:
Vladimír Petkevič
CES-marked up and checked for correctness down to the subparagraph level
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 82000 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK
Address:
Celetná 13, Prague, Czech Republic
Address:
eAddress: Vladimir.Petkevic@ff.cuni.cz
Address:
eAddress: ftp: ucnk.ff.cuni.cz directory: pub/corpora/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Fiction, Czech
Responsibility Statement:
Vladimír Petkevič
CES-marked up and checked for correctness down to the subparagraph level
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK
Address:
Celetná 13, Prague, Czech Republic
Address:
eAddress: Vladimir.Petkevic@ff.cuni.cz
Address:
eAddress: ftp: ucnk.ff.cuni.cz directory: pub/corpora/ME
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
Opera - průvodce operní tvorbou. Received in electronic form from the publisher: Nakladatelství Svoboda - Libertas, Praha
Responsibility Statement:
publisher: Nakladatelství Svoboda - Libertas, Praha
typed in, in T602 (Czech editor) format
Publications Statement:
Distributor:
publisher: Nakladatelství Svoboda - Libertas, Praha. The publisher is the proprietor of the electronic version of the book and the distributor of the paper version of the book. The electronic version was given to the Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK, for research purposes.
Address:
Prague, Czech Republic
Availability:

Electronic form available for non-profit purposes for: Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK

1993
Source Description:
Title:
Opera - průvodce operní tvorbou
Anna Hostomská 1993
Publisher:
Nakladatelství Svoboda - Libertas, Praha
Place of publication:
Prague, Czech Republic

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

The text of the novel has been automatically spell-checked.

The text contains no hyphens

Segmentation:

Three levels of DIV are used: the first one denotes the chapters, the second one the composers and the third one the operas. Marked up down to the paragraph level and to some subparagraph level elements. Sentences are not marked up.

Class declaration:
body = 1
abbr = 133
date = 456
div = 176
foreign = 116
head = 175
hi = 574
name = 6241
num = 913
p = 837
q = 567
text = 1

Profile Description

1996-05-03

Revision Description



TEI header (Bulgarian text)

creator: LD
status: update
date: 1996-05-14 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Fiction, Bulgarian
Responsibility Statement:
Ludmila Dimitrova
Inserting paragraph and some sub-paragraph level tagging
Lydia Sinapova
Correcting spelling of the electronic text. Inserting additional paragraph and some sub-paragraph level tagging.
Responsibility Statement:
Ludmila Dimitrova
Comparing the electronic version with the printed publications of the novels 'PASSION or the Death of Alice' by Emilia Dvorianova and 'I want, I believe, I can' by Julia Berberyan and checking the electronic versions. Modifying the full Bulgarian Fiction corpus to conform to CES V4.0, inserting paragraph and sub-paragraph level tagging. Checked and modified markup down to sub-paragraph level.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 97251 words WordCount represents the number of words in this text exclusive of tags and header information. Microsoft Word 6.0 was used to count words. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Institute of Mathematics, Bulgarian Academy of Sciences
Address:
Acad G. Bonchev st. bl.8 1113 Sofia, Bulgaria
Address:
eAddress: mult@ling.math.acad.bg
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Fiction, Bulgarian
Responsibility Statement:
Ludmila Dimitrova
Inserting paragraph and some sub-paragraph level tagging
Lydia Sinapova
Correcting spelling of the electronic text. Inserting additional paragraph and some sub-paragraph level tagging.
Responsibility Statement:
Ludmila Dimitrova
Comparing the electronic version with the printed publications of the novels 'PASSION or the Death of Alice' by Emilia Dvorianova and 'I want, I believe, I can' by Julia Berberyan and checking the electronic versions. Modifying the full Bulgarian Fiction corpus to conform to CES V4.0, inserting paragraph and sub-paragraph level tagging. Checked and modified markup down to sub-paragraph level.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Institute of Mathematics, Bulgarian Academy of Sciences
Address:
Acad G. Bonchev st. bl.8 1113 Sofia, Bulgaria
Address:
eAddress: mult@ling.math.acad.bg
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
Electronic form of the novel "Passion or the death of Alice"
Responsibility Statement:
Publishing house "OBSIDIAN"
Provided the electronic version of the novel "Passion or the death of Alice"
Publications Statement:
Distributor:
Publishing house "OBSIDIAN"
Address:
Sofia 1124, Dobromir Hriz st. 31, Bulgaria
Availability:

Available for internal use only by the publishing house

1995
Source Description:
Title:
Passion или смъртта на Алиса
Емилия Дворянова 1995
Publisher:
Publishing House "OBSIDIAN"
Place of publication:
Sofia
Title Statement:
Title:
Electronic form of the novel "I Want, I believe, I can"
Responsibility Statement:
Publishing house "ABAGAR HOLDING"
Provided the electronic version of the novel "I Want, I believe, I can"
Publications Statement:
Distributor:
Publishing house "ABAGAR HOLDING"
Address:
Sofia 1124, Dobromir Hriz st. 31, Bulgaria
Availability:

Available for internal use only by the publishers

1995
Source Description:
Title:
Искам, вярвам, мога
Юлия Берберян 1995
Publisher:
Publishing House "ABAGAR HOLDING"
Place of publication:
Sofia

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

Quotation:

Rendition attribute values on Q, QUOTE and MENTIONED tags are adapted from ISOpub and ISOnum standard entity set names when used. Two rendition short-cuts are used:
'rend=mdash' stands for 'rend="PRE mdash POST mdash"'
'rend=dblq' stands for 'rend="PRE ldquo POST rdquo"'
'rend="PRE mdash"' (or "PRE ldquo") is used when the quoted dialogue ends with the paragraph (there is no other typographical distinction). 'rend="POST mdash"' (or "POST rdquo") is used when there is no typographical distinction (except ordinary punctuation) for the beginning of the quoted dialogue. No default rendition is used.
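
The short-cut convention described here can be illustrated with a small sketch. It assumes that a rend value is either one of the two short-cuts or a space-separated sequence of position/entity pairs; the names REND_SHORTCUTS, expand_rend and parse_rend are hypothetical and do not come from the corpus tools:

```python
# The two short-cuts named in the editorial declaration
REND_SHORTCUTS = {
    "mdash": "PRE mdash POST mdash",
    "dblq": "PRE ldquo POST rdquo",
}


def expand_rend(value):
    """Expand a rendition short-cut; any other value is returned unchanged."""
    return REND_SHORTCUTS.get(value, value)


def parse_rend(value):
    """Split a rend value into a dict such as {'PRE': 'ldquo', 'POST': 'rdquo'}.

    A position that is not present in the value is simply absent from the dict.
    """
    tokens = expand_rend(value).split()
    # Tokens come in (position, entity name) pairs
    return dict(zip(tokens[0::2], tokens[1::2]))


if __name__ == "__main__":
    for rend in ("mdash", "dblq", "PRE mdash", "POST rdquo"):
        print(rend, "->", parse_rend(rend))
```

Keeping the expansion in a single table means that a one-sided value such as "PRE mdash" passes through unchanged and yields only the position that was actually marked.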

Segmentation:

Marked up to the level of paragraph: P, QUOTE, POEM, plus marking of sub-paragraph element Q. Some marking of particular sub-paragraph elements: NAME, DATE, TIME, MENTIONED, FOREIGN, ABBR.

No end-of-line hyphenation present.

Class declaration:
abbr = 952
All abbreviations are marked.
body = 1
byline = 2
closer = 2
date = 110
All dates which contain one or more digits (the characters 0-9) are marked, including dates specifying day/month/year and dates consisting only of a year. No attempt was made to identify or mark dates in other forms.
dateline = 2
div = 20
docauthor = 2
foreign = 4
Only Latin words are marked as FOREIGN.
head = 18
hi = 233
The highlighting tag is used to mark words and phrases which were typographically distinguished, and for which no other more precise tag is applicable. In most of these cases, such highlighting signifies emphasis.
l = 2
mentioned = 206
name = 3410
All names of people, places, organizations, products, and events, are marked. Some person names in the genitive are also marked.
num = 996
Anything containing one or more digits (the characters 0-9) that is not part of a date is marked as a number. Numbers appearing in the form 6/4 or 10:30 are tagged separately (in the tagged text, scores of sports games are represented in this way).
opener = 2
p = 1332
lg = 1
q = 675
The Q tag is used to mark quoted dialogue.
quote = 57
QUOTE marks quotations from outside sources.
text = 1
time = 74
title = 52
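
The DATE and NUM notes in the class declaration above both depend on whether a token contains at least one of the characters 0-9. A minimal sketch of that test, assuming whitespace-separated tokens; the function names are hypothetical, and the sketch does not decide between DATE and NUM, since the header does not describe how that choice was made:

```python
import re

# Matches any one of the characters 0-9, as in the notes on DATE and NUM
DIGIT = re.compile(r"[0-9]")


def contains_digit(token):
    """True if the token contains one or more of the characters 0-9."""
    return DIGIT.search(token) is not None


def digit_candidates(tokens):
    """Return the tokens that would be candidates for DATE or NUM markup."""
    return [t for t in tokens if contains_digit(t)]


if __name__ == "__main__":
    sample = "On 15 May 1995 the score was 6/4 at 10:30".split()
    print(digit_candidates(sample))
```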

Profile Description

1997-03-18

Revision Description



TEI header (Estonian text)

creator: HJK
status: update
date: 1995-10-18 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Fiction, Estonian
Responsibility Statement:
Urve Talvik
entered the text
Riina Mosna
entered the text
Heiki-Jaan Kaalep
supervised the work
Heiki-Jaan Kaalep
modified the header for version 4
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 104435 words WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
TÜ arvutuslingvistika uurimisgrupp
Address:
Tiigi 78-232, Tartu, Estonia
Address:
eAddress: hkaalep@psych.ut.ee
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Fiction, Estonian
Responsibility Statement:
Urve Talvik
entered the text
Riina Mosna
entered the text
Heiki-Jaan Kaalep
supervised the work
Heiki-Jaan Kaalep
modified the header for version 4
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
TÜ arvutuslingvistika uurimisgrupp
Address:
Tiigi 78-232, Tartu, Estonia
Address:
eAddress: hkaalep@psych.ut.ee
Availability:

Freely available

May 1, 1997
Source Description:
Viivi Luik
Title:
Seitsmes rahukevad
Mihkel Tiks
Title:
Korvpalliromaan
Mats Traat
Title:
Üksi rändan
Mari Saat
Title:
Õun valguses ja varjus
Aimee Beekman
Title:
Loobumisvõimalus
Albert Uustulnd
Title:
Tuulte tallermaa
Teet Kallas
Title:
Öö neljandas mikrorajoonis
Enn Vetemaa
Title:
Möbiuse leht
Aadu Hint
Title:
Oma saar
Nikolai Baturin
Title:
Noor jää
Ene Mihkelson
Title:
Korter
Jaak Jõerüüt
Title:
Raisakullid
Raimond Kaugver
Title:
Pariisi lõbusad naised
Egon ja Vaike Rannet
Title:
Kivid ja leib
Nasta Pino
Title:
Igal õhtul Solenzaras
Raimo Männis
Title:
Heinatants
Mihkel Mutt
Title:
Keerukuju
Heljo Mänd
Title:
Umbjärv
Endla Tegova
Title:
Humalaaias
Arvo Valton
Title:
Üksildased ajas
Anu Raud
Title:
Jutte muist muinas- muist muid
Paul Kuusberg
Title:
Habemik
Milvi Seping
Title:
Värvilised linnud
Toomas Vint
Title:
Suur isane kala akvaariumis
Jaak Jõerüüt
Title:
Meeste tantsud
Einar Maasik
Title:
Tere Maria
Oskar Kruus
Title:
Ärkamised
Arvo Valton
Title:
Arvid Silberi maailmareis
Enn Vetemaa
Title:
Ah soo... või nii!!!
Mati Unt
Title:
Räägivad ja vaikivad
Enn Kreem
Title:
Kolm laiust
Andres Vanapa
Title:
Taevatalu
Raimond Kaugver
Title:
Meie pole süüdi
Herman Sergo
Title:
Näkimadalad III
Herman Sergo
Title:
Näkimadalad II
Jaan Kross
Title:
Professor Martensi ärasõit
Erni Krusten
Title:
Metalliotsija
Lennart Meri
Title:
Hõbevalgem
Rein Põder
Title:
Hilised astrid
Herman Sergo
Title:
Näkimadalad I
Ine Viiding
Title:
Jurmala lood
Ardi Liives
Title:
Vastuarmastus 2.
Endla Tegova
Title:
Laulatatud
Vladimir Beekman
Title:
Narva kosk
Oskar Kruus
Title:
Naiselikkuse seadus
Andrus Kasemaa
Title:
Rannamännid
Heino Kiik
Title:
Elupadrik
Ardi Liives
Title:
Passioon
Oskar Kruus
Title:
Aeg atra seada
Herta Laipaik
Title:
Hallid luiged
Endla Tegova
Title:
Põlvili

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 2.0 CES LEVEL: 1

Segmentation:

Up to the level of sentences

No end-of-line hyphenation

Class declaration:
abbr = 20
body = 1
byline = 51
date = 51
distinct = 85
div = 91
docauthor = 51
foreign = 30
head = 34
hi = 311
item = 17
l = 13
list = 3
name = 3689
num = 51
p = 2542
lg = 4
q = 1261
quote = 30
ref = 1
s = 9836
text = 1
title = 51

Profile Description

Revision Description



TEI header (Hungarian text)

creator: CO
status: update
date: 1996-04-30 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Fiction, Hungarian
Responsibility Statement:
Csaba Oravecz
CES1 conformant tagging
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 72002 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Fiction, Hungarian
Responsibility Statement:
Csaba Oravecz
CES1 conformant tagging
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
National Corpus for the Hungarian Historical Dictionary, SGML-like encoded version
Publications Statement:
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
Availability:

Available for research purposes upon agreement

Unknown
Source Description:
Title:
Számadás
Veres Péter
Title:
Budai oroszlán
Sőtér István
Title:
Falusi krónika
Veres Péter
Title:
Égető Eszter
Németh László
Title:
selected chapters from 20th century Hungarian novels

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

Class declaration:
abbr = 5
body = 1
byline = 4
date = 4
div = 10
docauthor = 4
foreign = 7
head = 1
hi = 26
mentioned = 49
name = 8
num = 4
p = 919
q = 565
quote = 1
text = 1
title = 7

Profile Description

Revision Description



TEI header (Romanian text)

creator: SB
status: update
date: 1995-04-29 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Newspapers, Romanian
Responsibility Statement:
Ştefan Bruda Center for Artificial Intelligence NLP Group Romanian Academy of Sciences
Error correction and CES1 conformance.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 27863 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Romanian Academy of Sciences, Center for Artificial Intelligence
Address:
13, 13 Septembrie Str., Bucharest, Romania
Address:
eAddress: tufis@valhalla.racai.ro
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Newspapers, Romanian
Responsibility Statement:
Ştefan Bruda Center for Artificial Intelligence NLP Group Romanian Academy of Sciences
Error correction and CES1 conformance.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Romanian Academy of Sciences, Center for Artificial Intelligence
Address:
13, 13 Septembrie Str., Bucharest, Romania
Address:
eAddress: tufis@valhalla.racai.ro
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title:
România Liberă
1995-04-14

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

Segmentation:

Marked up to the level of paragraph: P, QUOTE, LIST, POEM plus marking of particular sub-paragraph elements: Q, HI

Class declaration:
body = 1
div = 138
head = 179
hi = 681
p = 573
q = 76
text = 1
byline = 77
sp = 26

Profile Description

1996-05-06

Revision Description



TEI header (Slovene text)

creator: ET
status: update
date: 1996-05-07 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Newspapers, Slovene
Responsibility Statement:
Tomaž Erjavec, LST group, Dept. for Intelligent Systems, Jozef Štefan Institute
CES1 conformance.
Miro Romih, Amebis d.o.o.
Up-translation from diskette format, typographical error correction.
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 101749 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Newspapers, Slovene
Responsibility Statement:
Tomaž Erjavec, LST group, Dept. for Intelligent Systems, Jozef Štefan Institute
CES1 conformance.
Miro Romih, Amebis d.o.o.
Up-translation from diskette format, typographical error correction.
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Dept. for Intelligent Systems, Jozef Štefan Institute
Address:
Jamova 39, Ljubljana, Slovenia
Address:
eAddress: tomaz.erjavec@ijs.si
Address:
eAddress: http://nl.ijs.si/ME
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
Original digital form of the 'Dnevnik' articles: editor's diskettes with idiosyncratic markup
Responsibility Statement:
The 'Dnevnik' Daily
Collected and edited the texts from authors
Publications Statement:
Distributor:
The 'Dnevnik' Daily
Address:
Ljubljana, Slovenia
Availability:

Unknown
Source Description:
Title:
45 articles from the 'Dnevnik' Daily
Publisher:
Dnevnik
8--10 1995
Place of publication:
Ljubljana, Slovenia

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.2 CES LEVEL: 1

The OCR'ed text of the novel has been automatically spell-checked.

Quotation:

No rendition attribute values on Q. 'Top level' Qs are in '"', inner Qs in "'".

All text was semi-automatically dehyphenated; errors are possible where both parts of a word are themselves words.

Segmentation:

Each article proper is in a DIV type="article". The text of the article is in a DIV type="articletext". The sections of articletext, usu. with HEADER, are in DIV type="articlepart". After articletext come figures (DIV type="figure") and frames (DIV type="frame"). Marked up to the level of paragraph plus marking of particular sub-paragraph elements: Q; DATE, only for date of approx. article publication; NAME, only where they were typographically marked in the original.

Class declaration:
body = 1
byline = 54
date = 45
div = 396
docauthor = 55
figdesc = 67
figure = 67
head = 379
name = 83
opener = 45
p = 1204
q = 881
text = 1

Profile Description

1996-05-07 The CES1 Slovene Newspaper corpus comes into being

Revision Description



TEI header (Czech text)

creator: VP
status: update
date: 1996-05-10 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Newspapers, Czech
Responsibility Statement:
Vladimír Petkevič
Checked and modified markup for correctness down to the subparagraph level
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 90410 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK
Address:
Celetná 13, Prague, Czech Republic
Address:
eAddress: Vladimir.Petkevic@ff.cuni.cz
Address:
eAddress: ftp: ucnk.ff.cuni.cz directory: pub/corpora/ME
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Newspapers, Czech
Responsibility Statement:
Vladimír Petkevič
Checked and modified markup for correctness down to the subparagraph level
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK
Address:
Celetná 13, Prague, Czech Republic
Address:
eAddress: Vladimir.Petkevic@ff.cuni.cz
Address:
eAddress: ftp: ucnk.ff.cuni.cz directory: pub/corpora/ME
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
AA Lidové noviny - collection of articles, 1991-1994; Obtained in electronic form (WordPerfect format)
Responsibility Statement:
publisher: Lidové noviny, Praha
typed in, in electronic form (WordPerfect format)
Publications Statement:
Distributor:
publisher: Lidové noviny, Praha, distributor of the paper version of newspaper articles. The electronic texts were made available to the Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK, for research purposes.
Address:
Prague, Czech Republic
Availability:

Electronic form available for non-profit purposes. It was made available for: Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Czech Republic ÚTKL FFUK

1991-1994
Source Description:
Title:
Lidové noviny - collection of 451 articles from the 1991-1994 period
various newspapermen 1991-1994
Publisher:
Lidové noviny
Place of publication:
Praha

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

The texts contain no hyphens

Segmentation:

One level of DIV is used: each DIV denotes a separate article. Marked up down to the paragraph level and to some subparagraph level elements. Sentences are not marked up.

Class declaration:
abbr = 1239
body = 1
byline = 76
date = 547
dateline = 189
div = 450
docauthor = 74
foreign = 9
head = 537
hi = 59
name = 802
num = 1225
opener = 189
p = 1360
q = 587
text = 1

Profile Description

1996-05-10

Revision Description



TEI header (Bulgarian text)

creator: LS
status: update
date: 1996-05-14 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Newspapers, Bulgarian
Responsibility Statement:
Lydia Sinapova
Typing-in excerpts from Capital and Continent, Excerpting paragraph and some sub-paragraph level tagging to the electronic and typed-in texts.
Lydia Sinapova
Modified Newspaper corpus markup down to sub-paragraph level to conform to CES V4.0
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 96538 words WordCount represents the number of words in this text exclusive of tags and header information. Microsoft Word 6.0 was used to count words. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information. The size of the file with Cyrillics represented by SGML entities is approximately 5 times larger than the size of the originally tagged Cyrillic file.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Institute of Mathematics, Bulgarian Academy of Sciences
Address:
Acad G. Bonchev st. bl.8 1113 Sofia, Bulgaria
Address:
eAddress: mult@ling.math.acad.bg
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Newspapers, Bulgarian
Responsibility Statement:
Lydia Sinapova
Typing-in excerpts from Capital and Continent, Excerpting paragraph and some sub-paragraph level tagging to the electronic and typed-in texts.
Lydia Sinapova
Modified Newspaper corpus markup down to sub-paragraph level to conform to CES V4.0
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Institute of Mathematics, Bulgarian Academy of Sciences
Address:
Acad G. Bonchev st. bl.8 1113 Sofia, Bulgaria
Address:
eAddress: mult@ling.math.acad.bg
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title:
Capital (Bulgarian) April 29 - May 5, 1996
Publisher:
AII OOD
1996-05-28
Place of publication:
Sofia, Bulgaria
Title:
Continent (Bulgarian) 1995, January 15
Publisher:
Publishing House "MEGAPRESS" AD
1995-01-15
Place of publication:
Sofia, Bulgaria
Title Statement:
Title:
Selected articles from Pari Daily, in electronic form
Responsibility Statement:
Tsvetan Petrov - vice editor
The electronic texts of the excerpts from "Pari" were prepared by the journalists for internal usage only and kindly provided by Mr. Tsvetan Petrov for the MTE project in DOS Word 5 format. Not all of the actually published articles were included in the electronic files.
Publications Statement:
Distributor:
PARI Daily
Address:
1000 Sofia, "Tsarigradsko shosse" blvd 47
Availability:

The electronic texts are property of their authors and are not distributed

May 02, May 03 1996
Source Description:
Title:
Pari (Bulgarian) May 02, 1996
Title:
Pari (Bulgarian) May 03, 1996
Publisher:
"RUBICON" - Izdatelsko-targovski kompleks PARI OOD
May 02, May 03 1996
Place of publication:
Sofia, Bulgaria
Title Statement:
Title:
Selected articles from Standart Daily, in electronic form
Responsibility Statement:
Kiril Simov
The electronic texts of the excerpts from "Standart" were prepared by the journalists for internal use only. They were provided for the MTE project in DOS Word 5 format by Kiril Simov.
Publications Statement:
Distributor:
"Standart news" AD
Address:
1303 Sofia, Antim I, 53
Availability:

The electronic texts are property of their authors and are not distributed

February, May, 1995
Source Description:
Title:
Standart Daily
Publisher:
"Standart news" AD
February, May 1995
Place of publication:
Sofia, Bulgaria

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.3 CES LEVEL: 1

Quotation:

No quotation marks are preserved in the text. Rendition attribute values on Q and QUOTE tags are adapted from the ISOpub and ISOnum standard entity set names. Two rendition shortcuts are used: 'rend=mdash' stands for 'rend="PRE mdash POST mdash"', and 'rend=dblq' stands for 'rend="PRE ldquo POST rdquo"'. 'rend="PRE mdash"' (or "PRE ldquo") is used when the quoted dialogue ends with the paragraph (there is no other typographical distinction). 'rend="POST mdash"' (or "POST rdquo") is used when there is no typographical distinction (except ordinary punctuation) for the beginning of the quoted dialogue. No default rendition is used.
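
The rendition conventions above can be read as a small mapping from a 'rend' value to the opening and closing marks that were removed from the text. The following is a minimal sketch of that reading, not part of the corpus tooling; the names REND_SHORTCUTS and expand_rend are hypothetical and introduced only for illustration.

```python
# Shortcut values, as listed in the Quotation note of this header.
REND_SHORTCUTS = {
    "mdash": "PRE mdash POST mdash",
    "dblq": "PRE ldquo POST rdquo",
}


def expand_rend(value):
    """Return (pre, post) entity names for a rend value.

    A position that is not mentioned in the value is returned as None,
    which corresponds to "no typographical distinction" on that side.
    """
    full = REND_SHORTCUTS.get(value, value)
    tokens = full.split()
    marks = {"PRE": None, "POST": None}
    # Tokens come in (position, entity) pairs such as "PRE mdash".
    for position, entity in zip(tokens[0::2], tokens[1::2]):
        if position in marks:
            marks[position] = entity
    return marks["PRE"], marks["POST"]


print(expand_rend("dblq"))
print(expand_rend("PRE mdash"))
```

Since the header states that no default rendition is used, the sketch returns (None, None) for an unrecognised value rather than guessing a pair of marks.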

Segmentation:

Marked up to the level of paragraph: P, SP, QUOTE, NOTE, CAPTION, LIST, FIGURE, plus marking of sub-paragraph element Q. Some marking of particular sub-paragraph elements: NAME, TITLE, DATE, TIME, MENTIONED, DISTINCT, FOREIGN, ABBR.

No end-of-line hyphenation present.

Class declaration:
abbr = 1295
All abbreviations are marked. The 'expan' attribute is not always used.
body = 1
caption = 143
This tag is used mainly for phrases accompanying figures (with type=attached) and for phrases that are in some way separated from the surrounding text (type=display).
byline = 302
date = 395
All dates which contain one or more digits (the characters 0-9) are marked, including dates specifying day/month/year, day/month, and dates consisting only of a year. The attribute 'iso8601' is not used consistently.
dateline = 29
distinct = 6
This tag is used for foreign words that are not commonly used in Bulgarian but are written in Cyrillic. Foreign words that are widely used (e.g. computer) are not tagged with this tag.
div = 560
docauthor = 105
figdesc = 48
figure = 48
This tag is used to mark occurrences of figures. No reference is made to the objects representing the figures.
foreign = 9
This tag is used only with words that are not names.
head = 500
hi = 376
The highlighting tag is used to mark words and phrases which were typographically distinguished, and for which no other more precise tag is applicable. In most of these cases, such highlighting signifies emphasis.
item = 50
label = 2
list = 11
measure = 18
mentioned = 10
name = 4967
All names of people, places, organizations, products, events, and programmes are marked. This tag was also used for names of documents in cases where 'title' did not seem appropriate. Person names in the genitive are not marked.
note = 11
num = 1555
Anything containing one or more digits (the characters 0-9) that is not part of a date, and all roman numerals, are marked as a number. In cases where a ratio is expressed (per cent, per thousand), the entire phrase (e.g., "10 per cent") is marked as a number.
opener = 29
p = 1440
ptr = 23
q = 228
The Q tag is used to mark quoted dialogue. The attribute "broken=yes" is used when no sentence terminating punctuation (either inside the Q itself or in the intervening text between two Qs) appears between two dialogue fragments by the same speaker.
quote = 1
ref = 7
s = 155
The S tag is used only for broken sentences.
sp = 80
The SP tag is used for interviews.
speaker = 6
text = 1
term = 4
Words with specific usage in a particular domain are tagged with this tag. No attempt is made to identify the type of domain.
time = 13
All times which contain one or more digits (the characters 0-9) are marked. The attribute 'iso8601' is not used consistently.
title = 254
Titles of newspapers, books, songs, pictures, movies, and any other art objects are marked.
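
The digit-based rule given above for the 'num' tag (anything containing a digit 0-9 that is not part of a date, plus all roman numerals) can be illustrated with a short check. This is a sketch under stated assumptions: the header does not say how roman numerals or date membership were recognised, so the strict ROMAN pattern and the part_of_date flag are hypothetical, as is the function name is_num_candidate.

```python
import re

# One or more of the characters 0-9, as stated in the header.
DIGIT = re.compile(r"[0-9]")

# Assumption: strict roman numerals from 1 to 3999 in Latin capitals.
# The header gives no definition, so this pattern is illustrative only.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def is_num_candidate(token, part_of_date=False):
    """Return True if the token would qualify for 'num' under the rule.

    part_of_date must be supplied by the caller; deciding what counts
    as a date is outside the scope of this sketch.
    """
    if part_of_date:
        return False
    if DIGIT.search(token):
        return True
    # ROMAN also matches the empty string, so require a non-empty token.
    return bool(token) and ROMAN.fullmatch(token) is not None


for token in ["10", "XIV", "IIII", "Sofia"]:
    print(token, is_num_candidate(token))
```

The same digit test applies to the 'date' and 'time' rules; only the decision of which digit-bearing strings are dates or times differs, and the header does not spell that decision out.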

Profile Description

1996-03-20

Revision Description



TEI header (Estonian text)

creator: HJK
status: update
date: 1995-10-18 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Newspapers, Estonian
Responsibility Statement:
Urve Talvik
entered the text
Riina Mosna
entered the text
Heiki-Jaan Kaalep
supervised the work
Heiki-Jaan Kaalep
modified the header for version 4
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 112003 words. WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
TÜ arvutuslingvistika uurimisgrupp
Address:
Tiigi 78-232, Tartu, Estonia
Address:
eAddress: hkaalep@psych.ut.ee
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Newspapers, Estonian
Responsibility Statement:
Urve Talvik
entered the text
Riina Mosna
entered the text
Heiki-Jaan Kaalep
supervised the work
Heiki-Jaan Kaalep
modified the header for version 4
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
TÜ arvutuslingvistika uurimisgrupp
Address:
Tiigi 78-232, Tartu, Estonia
Address:
eAddress: hkaalep@psych.ut.ee
Availability:

Freely available

October 1, 1997
Source Description:
Title:
Õhtuleht 25/04/1985
Title:
Noorte Hääl 02/11/1985
Title:
Noorte Hääl 26/12/1985
Title:
Õhtuleht 26/12/1985
Title:
Rahva Hääl 21/03/1985
Title:
Rahva Hääl 15/05/1985
Title:
Rahva Hääl 19/05/1985
Title:
Noorte Hääl 28/05/1985
Title:
Noorte Hääl 29/05/1985
Title:
Punane Täht 11/06/1985
Title:
Sirp ja Vasar 20/09/1985

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 2.0 CES LEVEL: 1

Segmentation:

Up to the level of sentences

No end-of-line hyphenation

Class declaration:
abbr = 1864
author = 168
bibl = 168
body = 1
byline = 333
corr = 1
date = 11
distinct = 131
div = 388
docauthor = 333
foreign = 4
head = 356
hi = 1008
item = 244
list = 39
name = 7629
note = 2
num = 344
p = 2423
q = 385
quote = 89
ref = 3
s = 7758
text = 1
title = 333
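
The class declaration lists above follow a simple "tag = count" layout, sometimes with a descriptive line after an entry. A minimal sketch of reading such a list into a dictionary is shown below; the function name parse_tag_usage and the sample lines are illustrative, and the sketch assumes the plain-text layout of this listing rather than the underlying XML.

```python
import re

# A tag-usage line is a tag name, an equals sign, and an integer count.
TAG_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9]*)\s*=\s*(\d+)\s*$")


def parse_tag_usage(lines):
    """Collect 'tag = count' lines into a dict; other lines are skipped."""
    counts = {}
    for line in lines:
        match = TAG_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


sample = [
    "abbr = 1864",
    "author = 168",
    "s = 7758",
    "Up to the level of sentences",  # descriptive line, ignored
]
usage = parse_tag_usage(sample)
print(usage)
```

Because the pattern is anchored at both ends of the line, descriptive sentences that happen to contain an equals sign, such as "(type = display)", are not mistaken for counts.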

Profile Description

Revision Description



TEI header (Hungarian text)

creator: CO
status: update
date: 1996-04-20 (created) 2004-05-10 (updated)

File Description

Title Statement:
Title:
Multext-East cesDoc corpus: Newspapers, Hungarian
Responsibility Statement:
Csaba Oravecz
CES1 conformant tagging
Responsibility Statement:
Tomaž Erjavec
Conversion to XML/TEI P4
Edition Statement:

Version 3

Extent: 92233 words
Publications Statement:
Address:
http://nl.ijs.si/ME/V3/
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
2004-05-10
Source Description:
Title Statement:
Title:
Multext-East CES1: Newspapers, Hungarian
Responsibility Statement:
Csaba Oravecz
CES1 conformant tagging
Edition Statement:

MTE Final Release

Publications Statement:
Distributor:
Research Institute for Linguistics, Hungarian Academy of Sciences
Address:
Budapest, Színház u. 5-9.
Availability:

Available for research purposes upon receipt of signed agreement

October 1, 1997
Source Description:
Title Statement:
Title:
Magyar Hírlap: Pre-edited ASCII text version
Publications Statement:
Distributor:
Magyar Hírlap Publishing House Ltd.
Address:
Budapest, Kerepesi út 29/b.
Availability:

Available for research purposes upon agreement

Unknown
Source Description:
Title:
Magyar Hírlap, 25/01/1996 issue
Title:
Magyar Hírlap, 31/01/1996 issue
Title:
Magyar Hírlap, 22/01/1996 issue
Title:
MULTEXT-East Hungarian Newspaper Corpus

Encoding Description

Project description:

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106

Editorial declaration:
Normalisation:

Corpus Encoding Standard, Version 4.0 CES LEVEL: 1

Class declaration:
abbr = 289
body = 1
byline = 341
caption = 2
div = 316
docauthor = 179
foreign = 2
head = 545
hi = 7
mentioned = 131
name = 1110
note = 19
p = 1293
q = 218
ref = 15
sp = 145
text = 1
title = 11

Profile Description

Revision Description