This document is an HTML 3.2 rendering of a
Corpus Encoding Specification
DTD document, produced in the scope of the
MULTEXT-East
project, by
Fred.
Note that this HTML translation does not contain all the information from the cesHeader.
CES header
Creator: Ştefan Bruda
Created: 1995-12-10
Updated: 1997-10-03
File Description
- Title Statement
- Title:
-
MULTEXT-EAST corpus: 1984 Romanian
- Responsibility
-
Dan Tufiş
Center for Artificial Intelligence
NLP division
Romanian Academy
(
Overall editorship.
)
Ştefan Bruda
Center for Artificial Intelligence
NLP division
Romanian Academy
(
Error correction and CES1 conformance.
)
Greg Priest-Dorman
(
Added tagging of sentences in paragraphs using MtSgml and
Romanian resources.
)
- Edition:
- MTE Final Release
- Extent:
- 118093 words
1272607 bytes
Note:
The wordCount treats clitics as distinct words and counts the several words
of a compound as a single word. It was computed on the segmented document
with word mark-up.
If the counting ignores clitics and compounds, the wordCount is 98074;
the command sequence that produced this count is:
sed -e '1,/<\/ces[Hh]eader>/d' < ces-file | sed -e 's/<[^<].*>//g' | sed -e 's/<.*$//g' | sed -e 's/^.*>//g' | wc -w
The byteCount is the disk space occupied by the full SGML text.
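The stages of the counting pipeline quoted above can be traced on a small input. The following is a minimal sketch only: the miniature file and the variable names `ces_file` and `count` are hypothetical and are not part of the corpus distribution.

```shell
# Hypothetical miniature CES file; the text is illustrative only.
ces_file=$(mktemp)
cat > "$ces_file" <<'EOF'
<cesDoc>
<cesHeader>
<title>sample</title>
</cesHeader>
<text>
<p><s>It was a bright cold day
in April.</s></p>
</text>
</cesDoc>
EOF

# Stage 1: delete everything up to and including the </cesHeader> line.
# Stage 2: on each line, delete from the first tag opener to the last ">"
#          (the match is greedy).
# Stage 3: delete a tag left open at the end of a line.
# Stage 4: delete the tail of a tag continued from the previous line.
# wc -w then counts the remaining whitespace-delimited tokens;
# tr strips the padding that some wc implementations add.
count=$(sed -e '1,/<\/ces[Hh]eader>/d' < "$ces_file" \
  | sed -e 's/<[^<].*>//g' \
  | sed -e 's/<.*$//g' \
  | sed -e 's/^.*>//g' \
  | wc -w | tr -d '[:space:]')

echo "words: $count"
rm -f "$ces_file"
```

For this sample the pipeline reports 8 words. Because the second stage is greedy, any text that sits between the first and the last tag on the same line is discarded along with the tags, so the result depends on how the mark-up is laid out across lines.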
- Publication Statement
- Distributor:
-
Romanian Academy,
Centre for Artificial Intelligence
- Address:
-
13, 13 Septembrie Str.,
Bucharest, Romania
- Electronic address:
-
tufis@valhalla.racai.ro
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Structured Bibliography
- Monograph
- Title:
-
O mie nouă sute optzeci şi patru
- Author:
-
George Orwell
- Imprint
- Publication date:
-
1991
- Publisher:
-
Editura Univers
- Place:
-
Bucharest
Encoding Description
- Project Description:
-
MULTEXT-East:
Multilingual Text Tools and Corpora for Central and Eastern
European Languages.
EU Copernicus Project COP106
- Tag declaration:
- name = 2159
- title = 1
- div = 28
- text = 1
- foreign = 429
- l = 26
- body = 1
- quote = 23
- item = 4
- p = 1335
- num = 3
- poem = 7
- hi = 413
- q = 2137
Q tags with the attribute "type=MI" were
inserted automatically after S insertion.
- head = 28
- s = 6487
S tags were inserted automatically at the locations (character offsets)
provided by MTSeg version 1.3.1 using the Romanian resource files,
and then cleaned up by hand.
- note = 3
- abbr = 3
- list = 1
- date = 7
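Per-element counts like those in the tag declaration above could be gathered from the command line. This is a minimal sketch under stated assumptions, not the procedure the encoders used: the sample input and the variable names `sample` and `tagusage` are hypothetical, and matching is case-sensitive, so `S` and `s` would be counted separately.

```shell
# Hypothetical sample body; not taken from the corpus.
sample='<body><div><p><s>One.</s><s>Two.</s></p><p><s>Three.</s></p></div></body>'

# Extract the name of every start tag (end tags begin with "</" and are
# skipped), then count the occurrences of each name.
tagusage=$(printf '%s\n' "$sample" \
  | grep -o '<[A-Za-z][A-Za-z0-9]*' \
  | tr -d '<' \
  | sort \
  | uniq -c \
  | awk '{ print $2 " = " $1 }')

printf '%s\n' "$tagusage"
```

The output uses the same `name = count` layout as the list above, one element per line.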
Revision Description
- Date: 97-06-30
Dan Tufiş
-
Corrected several typos and added missing punctuation (mainly commas).
The Bytecount and Wordcount were updated.
- Date: 97-06-23
Dan Tufiş
-
Deleted empty Ss and Qs; inserted missing Ss.
- Date: 97-06-19
Ştefan Bruda
-
Corrected some typos; eliminated the blanks before
punctuation marks and between markup and words.
- Date: 97-05-16
Ştefan Bruda
-
Changed the paragraph structure for better
alignment with the English version; added a new paragraph
which was not translated from English; updated tagusage.
- Date: 97-04-3
Dan Tufiş
-
Eliminated spaces around punctuation; corrected some mark-up.
- Date: 97-03-6
Dan Tufiş
-
Added some lines overlooked during keyboarding.
Corrected some typos.
- Date: 97-02-18
Dan Tufiş
-
Corrected extent section of the header
- Date: 96-11-5
Ştefan Bruda
-
Corrected the header so that it better conforms to
CES recommendations.
- Date: 96-11-5
Georgiana Rotariu
-
Added name tags
- Date: 96-5-6
Ştefan Bruda
-
Corrected the header so that it better conforms to
CES recommendations.
- Date: 96-5-6
Ştefan Bruda
-
Added div tags
- Date: 95-12-10
Ştefan Bruda
-
Marked-up to CES1 compliance
- Date: 1997-04-02
Greg Priest-Dorman
- inserted S tags in the locations given by MtSeg
-
inserted Q and HI tags where necessary as a result of
S tag insertion
- updated and sorted TAGUSAGE
- Date: 1997-04-11
Ştefan Bruda
- added "dummy" Ps inside QUOTES for aligning purposes;
such paragraphs have the value "DUMMY" for the rend attribute.
- updated TAGUSAGE
- Date: 1997-04-13
Greg Priest-Dorman
- segmented newly added Ps with MtSeg
- inserted S tags in the locations given by MtSeg
- changed header to comply with Tomaz's header style
- changed lang="latin" to lang="la"
- removed rend="DUMMY" from Ps
- removed QUOTE /QUOTE pairs and moved QUOTE rend to P
where appropriate
- updated TAGUSAGE
- removed blank lines
- Date: 1997-09-25
Tomaž Erjavec
- Changed editionStmt, byteCount, pubDate, Availability
to final form
- Date: 97-10-03
Vasile Pătraşcu
-
Corrected several typos and added missing punctuation (mainly commas).
The Tagusage, Bytecount and Wordcount were updated. The entities
counted as words are those identified by the segmenter,
that is, words, clitics, compounds (each counted as one unit, irrespective
of the number of constituents), punctuation, and numbers.
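Several of the revision entries above describe mechanical clean-ups: removing blanks before punctuation marks and between words and markup, and changing lang="latin" to lang="la". The sketch below shows how such edits could be expressed with sed. It is an illustration under stated assumptions, not the commands the editors ran; the input line and the variable names `line` and `cleaned` are hypothetical.

```shell
# Hypothetical input line; not taken from the corpus.
line='<s>It was a bright cold day , in April . </s><foreign lang="latin">ipso facto</foreign>'

# 1st expression: drop blanks that precede a punctuation mark.
# 2nd expression: drop blanks between a word and the end tag that follows it.
# 3rd expression: replace the language value "latin" with "la".
cleaned=$(printf '%s\n' "$line" \
  | sed -e 's/ *\([,.;:!?]\)/\1/g' \
        -e 's/ *<\//<\//g' \
        -e 's/lang="latin"/lang="la"/g')

printf '%s\n' "$cleaned"
```

Blanks that follow a start tag are left untouched by this sketch; handling them would need a further expression.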