<cesHeader version="4.1" type="text" lang=en creator=ET status="update" date.created="1997-12-15" date.updated="1997-12-20" > <filedesc> <titlestmt> <h.title>Multext-East cesAna: Nineteen Eighty-Four, English</h.title> <respstmt> <respname>Nancy Ide</respname> <resptype>Overall Responsibility</resptype> <respname>Greg Priest-Dorman</respname> <resptype>Generation of Lexical Data</resptype> <respname>Vladimír Petkevič</respname> <resptype>Conversion to cesAna DTD</resptype> </respstmt> </titlestmt> <editionstmt version="1.0">MTE Final Release</editionstmt> <extent> <wordCount>103997</wordCount> <byteCount units="MB">cca 38</byteCount> <extnote>wordCount represents the number of TOK TYPE=WORD elements in the text.</extnote> </extent> <publicationstmt> <distributor> Department of Computer Science, Vassar College </distributor> <pubaddress>Poughkeepsie, New York 12604-0252 USA</pubaddress> <eaddress type="email">ide@cs.vassar.edu</eaddress> <availability status="restricted"> Available for research purposes upon receipt of signed agreement </availability> <pubDate value="1998-01-01">January 1st, 1998</pubDate> </publicationstmt> <sourcedesc> <biblfull> <titlestmt> <h.title>Multext-East CES1: Nineteen Eighty-Four, English</h.title> </titlestmt> <publicationstmt> <distributor> Department of Computer Science, Vassar College </distributor> <pubaddress>Poughkeepsie, New York 12604-0252 USA</pubaddress> <eaddress type="email">ide@cs.vassar.edu</eaddress> <availability status="restricted"> Available for research purposes upon receipt of signed agreement </availability> <pubDate value="1997-10-01">October 1, 1997</pubDate> </publicationstmt> <sourcedesc> <biblfull> <titlestmt> <h.title> The European Corpus Initiative Multilingual Corpus 1: 1984 by George Orwell (English) </h.title> <respstmt> <respname>Association for Computational Linguistics</respname> <resptype>Converted from OTA's DTD to ECI DTD</resptype> </respstmt> </titlestmt> <publicationstmt> 
<distributor>ACL</distributor> <pubaddress>ACL</pubaddress> <availability status=restricted> Available for research purposes upon receipt of signed agreement </availability> <pubdate>1994</pubdate> </publicationstmt> <sourcedesc> <biblfull> <titlestmt> <h.title>Orwell's 1984: electronic edition</h.title> <respstmt> <respname>Oxford Text Archive</respname> <resptype> The four versions of Orwell's 1984 in the OTA were all prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental And African Studies at London University. The texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's dtd by myself and Alan Morrison. The other languages were converted to TEI conformant SGML by the ECI project 1993.) --LB, Nov 1992 </resptype> </respstmt> </titlestmt> <editionstmt> Public Domain TEI edition prepared at the Oxford Text Archive </editionstmt> <publicationstmt> <distributor>Oxford Text Archive</distributor> <pubaddress> Oxford University Computing Service 13 Banbury Road Oxford OX2 6NN UK archive@ox.ac.uk </pubaddress> <availability status=restricted> Freely available for non-commercial use provided that this header is included in its entirety with any copy distributed </availability> <pubdate>19 Nov 1992</pubdate> </publicationstmt> <sourcedesc> <biblstruct> <monogr> <h.title>1984</h.title> <h.author>George Orwell</h.author> <imprint> <pubdate>1949; reprinted 1961</pubdate> <publisher>New American Library</publisher> <pubplace>New York</pubplace> </imprint> </monogr> </biblstruct> </sourcedesc> </biblfull> </sourcedesc> </biblfull> </sourcedesc> </biblfull> </sourcedesc> </filedesc> <encodingdesc> <projectdesc> MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. 
EU Copernicus Project COP106 </projectdesc> <editorialdecl> <transduction> In the cesDoc to cesAna conversion, DIV, QUOTE, Q tags and HEAD, POEM, LIST elements have been omitted. cesDoc P elements are encoded as PAR, and S as S. cesDoc sub-S level tags are omitted: DATE, NAME, ABBR, etc. </transduction> <quotation> Q and QUOTE tags from the cesDoc source not retained. </quotation> <segmentation> S segmentation same as in cesDoc source (hand-validated). TOK segmentation performed with mtseg and manually corrected. </segmentation> </editorialdecl> <tagsdecl> <tagusage gi=chunklist occurs=1> Element corresponds to TEXT of the cesDoc source </tagusage> <tagusage gi=chunk occurs=1> Element corresponds to BODY of the cesDoc source </tagusage> <tagusage gi=par occurs=1286> Elements correspond to P elements of the cesDoc source. The FROM attribute gives the reference to the ID of the corresponding cesDoc P element. </tagusage> <tagusage gi=s occurs=6701> Elements correspond to S elements of the cesDoc source. The FROM attribute gives the reference to the ID of the corresponding cesDoc S element. </tagusage> <tagusage gi=tok occurs=118102> Tokens are of TYPE=WORD or PUNCT, with the CLASS attribute giving the mtseg class of the token (ABBR, COMP, INIT, TTL). The FROM attribute gives the reference to the ID of the corresponding cesDoc S element in which the token in question appears along with the character offset of the token within the sentence (the character offset is appended to the sentence ID). </tagusage> <tagusage gi=orth occurs=118102> Contains the orthography of the token, as found in the cesDoc source (except for COMP, which have underscore instead of blank). </tagusage> <tagusage gi=disamb occurs=187526> Contains disambiguated lexical information for WORDs. Disambiguation performed by Eric Brill's Unsupervised Part-of-Speech Tagger Version 0.8. Trained on chapters 1 and 2 of Multext-East CES1: Nineteen Eighty-Four, English. 
A token with several DISAMBs indicates that the Brill Tagger was not able to fully disambiguate the token. In such a case all equally-weighted possibilities are listed. </tagusage> <tagusage gi=lex occurs=214404> Contains undisambiguated lexical information for WORDs. </tagusage> <tagusage gi=base occurs=401930> Base or lemma of a WORD. In the event that the base of the WORD was not known, the content of this tag will be "??" (two question marks). </tagusage> <tagusage gi=msd occurs=401930> Morphosyntactic description of a WORD. In the event that the MSD of the WORD was not known, the content of this tag will be "??" (two question marks). </tagusage> <tagusage gi=ctag occurs=416035> Corpus tag (for tok type=WORD and for tok type=PUNCT). In the event that the CTAG of the WORD was not known, the content of this tag will be "??" (two question marks). </tagusage> </tagsdecl> </encodingdesc> <profiledesc> <creation date="1997-12-15"></creation> <langusage> <![ %ONECOMPONENT [ &ISOlang; ]]> <language id="ns" iso639="none">Newspeak</language> <language id="ns-jg" iso639="none">Newspeak official jargon</language> <language id="en-ck" iso639="none">British Cockney English</language> </langusage> </profiledesc> <revisiondesc> <change> <changedate>1997-12-16</changedate> <respname>Vladimír Petkevič</respname> <h.item>Revised several tagusage descriptions, and supplied counts in the header. </h.item> </change> <change> <changedate>1997-12-20</changedate> <respname>Tomaz Erjavec, IJS</respname> <h.item>Modified EDITIONSTMT</h.item> </change> </revisiondesc> </cesHeader>