This document is an HTML 3.2 rendering of a
Corpus Encoding Specification
DTD document, produced in the scope of the
Note that this HTML translation does not contain all the information from the cesHeader.
- Title Statement
- Multext-East CES1: Nineteen Eighty-Four, Bulgarian
- Lydia Sinapova, Ludmila Dimitrova, Kiril Simov
Typing-in '1984', inserting paragraph and some
sub-paragraph level tagging.
- Lydia Sinapova
Modified full Orwell markup down to sub-paragraph level
to conform to CES V4.0, using the English version as a base
Added tagging of sentences in paragraphs using MtSgml
and Bulgarian resources.
- MTE Final Release
- 87235 words
WordCount represents the number of words in this text exclusive
of tags and header information. Microsoft Word 6.0 was used to count words.
ByteCount reflects the approximate
size of the file containing the doctype and cesDoc element including
all text, tags and header information.
- Publication Statement
Institute of Mathematics,
Bulgarian Academy of Sciences
Acad G. Bonchev st. bl.8
1113 Sofia, Bulgaria
- Electronic address:
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Structured Bibliography
- Nineteen Eighty Four (Bulgarian)
- George Orwell
- Publication date:
- Project Description:
Multilingual Text Tools and Corpora for Central and
Eastern European Languages.
EU Copernicus Project COP106
- Tag declaration:
- abbr = 28
All abbreviations are marked.
- body = 1
- date = 40
All dates which contain one or more digits (the characters 0-9) are
marked, including dates specifying day/month/year and dates consisting
only of a year. The attribute 'iso8601' is used consistently except in two cases:
when the date specifies year and consists only of digits, and within quoted
Newspeak sentences. No attempt was made to identify or
mark dates in other forms.
- div = 28
- foreign = 29
Only those Newspeak words which are
typographically distinguished in the printed version of the text
are marked as FOREIGN if they do not appear in some other tag where the
lang attribute provides the language information.
Latin words are also marked.
- head = 1
- hi = 103
The highlighting tag is used to mark words and phrases
which were typographically distinguished, and
for which no other more precise tag is applicable.
In most of these cases, such highlighting signifies
- item = 4
- l = 26
- list = 1
- mentioned = 256
- name = 1704
All names of people, places, organizations,
products, and events are marked.
Person names in the genitive are not marked.
All names of countries and towns are marked with type=place.
Names of rivers and oceans are not marked.
While in the English version the word INGSOC is marked
with NAME LANG=NS, in the Bulgarian version it is marked
only if typographically distinguished from the rest of the text.
In the English version the word NEWSPEAK is marked with
NAME TYPE=LANGUAGE, while in the Bulgarian version it is marked
only if typographically distinguished from the
rest of the text.
- note = 8
- num = 34
Anything containing one or more digits (the characters 0-9) that is
not part of a date, and all Roman numerals, are marked as a
number. In cases where a ratio is expressed (per cent, per thousand),
the entire phrase (e.g., "10 per cent") is marked as a number.
- p = 1321
- poem = 7
- ptr = 8
- q = 2203
The Q tag is used to mark quoted dialogue.
The attribute "broken=yes" is used when no sentence terminating punctuation
(either inside the Q itself or in the intervening text between two Qs)
appears between two dialogue fragments by the same speaker.
Q tags with an attribute of "type=MI" have been inserted
automatically after S insertion.
- quote = 34
QUOTE marks quotations from outside sources, including extensive
quotations from Winston's diary and Goldstein's treatise.
- s = 6649
S tags have been inserted automatically in the locations
(character offsets) provided by MTSeg version 1.3.1
using the Slovene resource files, and then cleaned up by hand.
- text = 1
- title = 41
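The tag counts declared above could in principle be re-checked against the corpus file with a simple tally of opening tags. The following is a minimal sketch only, not the procedure used by the project: the function name `count_tags` and the sample string are hypothetical, and the approach assumes every element is written with an explicit start tag (no SGML tag minimisation).

```python
import re
from collections import Counter

def count_tags(sgml_text):
    """Tally opening tags by lower-cased element name (illustrative helper)."""
    # Match '<' immediately followed by a letter: this picks up start tags
    # and skips end tags ('</p>') and declarations ('<!DOCTYPE ...>').
    names = re.findall(r"<([A-Za-z][A-Za-z0-9.]*)", sgml_text)
    return Counter(name.lower() for name in names)

# Hypothetical fragment, not taken from the corpus.
sample = (
    '<p><s>He saw <name type="place">London</name>.</s> '
    '<s>It was <date>1984</date>.</s></p>'
)
counts = count_tags(sample)
for tag, n in sorted(counts.items()):
    print(f"{tag} = {n}")
```

The output uses the same "tag = count" layout as the declaration, so a tally over the real file could be compared with it line by line.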
- Date: 1996-10-25
- Replaced Q tags with MENTIONED tags where appropriate
- linked broken Q tags with "prev" and "next" attributes
- all occurrences of "..." have been replaced with the
ISO_8879:1986 Publishing entity "hellip"
- Date: 1996-02-20
- Replaced Q and HI tags with MENTIONED tags
in accordance with the English tagging where appropriate
- Changes in the use of NAME tag with TYPE=PLACE -
removed where previously used for names of rivers and oceans
- Tagged using Bulgarian jargon with LANG=BG-CL
corresponding to English LANG=EN-CK
- Using LIST tag corresponding to the English tagging
- Date: 1997-03-20
Tomaz Erjavec, IJS
- Normalisation of corpus component CESHEADER elements:
CESHEADER, EDITIONSTMT, TITLESTMT/H.TITLE
- ISO LANGUAGEs implemented as marked section PUBLIC ent
- Language (WSDs) implemented as PUBLIC entities
- Newspeak LANGUSAGE/LANGUAGE IDs now ns-xx for lang xx
- Now every QUOTE in 1984 has at least one P
- Date: 1997-03-27
Tomaz Erjavec, IJS
- Substituted IGCY entity with JCY
- Date: 1997-04-04
- inserted S tags in the locations given by MtSeg
inserted Q tags where necessary as a result of
S tag insertion
- updated TAGUSAGE for Q and S
- Date: 1997-08-06
- Removed empty S Obg.126.96.36.199.1 and
empty Q Obg.188.8.131.52.1
- updated TAGUSAGE for Q and S, BYTECOUNT
- Date: 1997-09-25
- Changed editionStmt, byteCount, pubDate, Availability
to final form
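The 1996-10-25 revision above records that every "..." was replaced with the ISO 8879:1986 publishing entity "hellip". A minimal sketch of such a substitution follows, assuming the entity is written as the reference `&hellip;` and that a plain left-to-right string replacement is acceptable. The function name `replace_ellipsis` is hypothetical, and this is not the tool the project used.

```python
def replace_ellipsis(text):
    """Replace each run of three full stops with the entity reference."""
    # str.replace scans left to right without overlap, so four dots
    # become '&hellip;.' (one entity followed by one full stop).
    return text.replace("...", "&hellip;")

print(replace_ellipsis("He waited... and waited."))
```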