creator: | et | |
---|---|---|
status: | update | |
date: | 2000-10-30 (created) | 2004-04-09 (updated) |
Available for research purposes upon receipt of signed agreement.
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. <http://nl.ijs.si/ME/>
Since the CD-ROM release of the cesAna Orwells, many errors in the linguistic annotation of the individual texts have been corrected.
In the process of conversion to TEI, various format errors were detected and corrected.
All the novels have their markup normalised: a) structure annotation with DIV and P (attributes ID and TYPE); b) segmentation annotation with S (attribute ID); c) tokenisation annotation with W and C (attribute TYPE); d) linguistic annotation with the W attributes LEMMA and ANA.
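The normalised markup can be read with any XML parser. The following is a minimal sketch, not the project's own tooling: the sample fragment, its ID values, and its ANA values are invented for illustration, and the lower-case element and attribute names are an assumption about how the XML files spell DIV, P, S, W, and C.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: IDs, lemmas and ANA values are illustrative only.
SAMPLE = """
<div id="d1" type="part">
  <p id="d1.p1">
    <s id="d1.p1.s1">
      <w lemma="it" ana="Pp3ns">It</w>
      <w lemma="be" ana="Vmis3s">was</w>
      <w lemma="cold" ana="Afp">cold</w>
      <c>.</c>
    </s>
  </p>
</div>
"""

def sentences(xml_text):
    """Return a list of (sentence id, tokens) pairs.

    Each token is a (form, lemma, tag) triple: for words the tag is the
    ANA value, for punctuation it is the TYPE value (or None if absent).
    """
    root = ET.fromstring(xml_text)
    result = []
    for s in root.iter("s"):
        tokens = []
        for tok in s:
            if tok.tag == "w":
                tokens.append((tok.text, tok.get("lemma"), tok.get("ana")))
            elif tok.tag == "c":
                tokens.append((tok.text, None, tok.get("type")))
        result.append((s.get("id"), tokens))
    return result

parsed = sentences(SAMPLE)
```

Here `parsed` holds one sentence whose tokens carry the word form, the lemma, and the ANA reference, which is usually all a downstream tool needs from the text body.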
All the novels use Unicode XML character entities to represent non-ASCII characters.
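As a small illustration, and assuming the entities take the form of numeric character references (the source does not say which form is used), a standard XML parser resolves them to Unicode characters without extra work. The token below is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical token: "&#x10D;" is the numeric reference for U+010D
# (Latin small letter c with caron).
fragment = '<w lemma="&#x10D;lovek">&#x10D;lovek</w>'

token = ET.fromstring(fragment)
decoded_text = token.text          # element content, reference resolved
decoded_lemma = token.get("lemma") # attribute value, reference resolved
```

Both the element content and the attribute value come back as ordinary Unicode strings, so no separate decoding step is needed after parsing.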
QUOTE elements have in general been changed to P.
Q markup has been omitted in some novels (see the individual Headers); in others it is present as the punctuation mark ", whose C element is marked with TYPE="open" or TYPE="close".
Segmentation into paragraphs follows the printed sources; it is therefore not 1-1 with the English original. Segmentation into sentences was performed automatically and then hand-validated.
Tokenisation into words and punctuation symbols was performed on the basis of the MULTEXT-East lexica, mostly with the MULTEXT tool 'mtseg', and then hand-validated.
No end-of-line hyphenation is present in the texts.
The linguistic interpretation of the text consists of marking up the word tokens with their context-disambiguated lemma and MULTEXT-East morphosyntactic description. The texts have undergone different amounts of validation, so their error rates differ.
The two-letter language codes follow ISO 639.
The MULTEXT-East morphosyntactic descriptions (MSDs) follow the revised MULTEXT-East/Concede common tables of lexical specifications. The lexical MSDs have been converted to an FSLIB, a feature-structure library, while their decomposition into features is given in an FLIB, a feature library. Each word in the texts has its MSD encoded as the value of the ANA (#IDREF) attribute. This attribute refers to an FS which, in turn, refers via its FEATS (#IDREFS) attribute to the F elements that define it.
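The chain of references from ANA to FS to F can be followed mechanically. The sketch below assumes a TEI-style layout in which each F carries a NAME attribute and a SYM child holding the value; the wrapper element, all IDs, and the feature names and values are hypothetical and chosen only to show the lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical libraries: IDs, feature names and values are illustrative.
LIBS = """
<libs>
  <fLib>
    <f id="f.cat.N" name="CATEGORY"><sym value="Noun"/></f>
    <f id="f.type.c" name="Type"><sym value="common"/></f>
    <f id="f.gen.m" name="Gender"><sym value="masculine"/></f>
    <f id="f.num.s" name="Number"><sym value="singular"/></f>
  </fLib>
  <fsLib>
    <fs id="Ncms" feats="f.cat.N f.type.c f.gen.m f.num.s"/>
  </fsLib>
</libs>
"""

def expand_msd(libs_xml, ana):
    """Resolve an ANA value to a dict of feature name -> value."""
    root = ET.fromstring(libs_xml)
    fs_by_id = {fs.get("id"): fs for fs in root.iter("fs")}
    f_by_id = {f.get("id"): f for f in root.iter("f")}

    fs = fs_by_id[ana]  # ANA is an IDREF to one FS
    features = {}
    # FEATS is an IDREFS list: whitespace-separated references to F elements.
    for ref in fs.get("feats", "").split():
        f = f_by_id[ref]
        sym = f.find("sym")
        features[f.get("name")] = sym.get("value") if sym is not None else None
    return features

expanded = expand_msd(LIBS, "Ncms")
```

Keeping the features in a shared library means each F is defined once and reused by every FS that needs it, which is why the lookup goes through two ID tables rather than one.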
creator: | et | |
---|---|---|
status: | update | |
date: | 2000-10-30 (created) | 2004-04-09 (updated) |
MULTEXT-East Morphosyntactic Specifications, Version 3 BETA
Freely available.
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
creator: | ET | |
---|---|---|
status: | update | |
date: | 1997-12-15 (created) | 2004-03-05 (updated) |
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
Public Domain TEI edition prepared at the Oxford Text Archive
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
creator: | DT | |
---|---|---|
status: | update | |
date: | 1997-11-04 (created) | 2004-03-05 (updated) |
Available for research purposes upon receipt of signed agreement.
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
The Concede project aimed to develop a unified dictionary encoding schema; the experiments were done with lexical tokens extracted from the multilingual corpus of Orwell's "1984" developed within the MULTEXT-East project. Headword extraction covered various frequency intervals and all word categories (POS), so that different kinds of encoding problems would be revealed. The MULTEXT-East corpus was significantly improved for the purposes of the Concede project.
creator: | VP | |
---|---|---|
status: | update | |
date: | 1997-11-28 (created) | 2004-03-05 (updated) |
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
creator: | ET | |
---|---|---|
status: | update | |
date: | 1997-11-04 (created) | 2004-04-06 (updated) |
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
Available for research purposes upon receipt of signed agreement.
Public Domain TEI edition prepared at the Oxford Text Archive
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
creator: | CK | |
---|---|---|
status: | update | |
date: | 2004-04-06 (created) | 2004-04-09 (updated) |
Available for non-commercial use provided that this Header is included in its entirety with any copy distributed
TELRI Final Release
Public Domain TEI edition prepared at the Oxford Text Archive
Freely available for non-commercial use provided that this header is included in its entirety with any copy distributed
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
creator: | LD | |
---|---|---|
status: | update | |
date: | 1997-11-30 (created) | 2004-03-05 (updated) |
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
creator: | HJK | |
---|---|---|
status: | update | |
date: | 1997-11-28 (created) | 2004-03-05 (updated) |
Freely available
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142
creator: | OCS | |
---|---|---|
status: | update | |
date: | 1997-11-24 (created) | 2004-03-05 (updated) |
Freely available for non-commercial use provided that this Header is included in its entirety with any copy distributed
MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. EU Copernicus Project COP106
Concede: Consortium for Central European Dictionary Encoding. EU Copernicus Project PL96-1142