This document is an HTML 3.2 rendering of a
Corpus Encoding Specification
DTD document, produced in the scope of the
MULTEXT-East
project, by
Fred.
Note that this HTML translation does not contain all the information from the cesHeader.
CES header
Creator: LS
Created: 1996-05-14
Updated: 1997-09-25
File Description
- Title Statement
- Title:
- Multext-East CES1: Newspapers, Bulgarian
- Responsibility
- Lydia Sinapova
(
Typed in excerpts from Capital and Continent;
excerpted the texts and added paragraph-level and some
sub-paragraph-level tagging to the electronic and typed-in
texts.
)
Lydia Sinapova
(
Modified Newspaper corpus markup down to sub-paragraph level
to conform to CES V4.0
)
- Edition:
- MTE Final Release
- Extent:
- 96538 words
3241551 bytes
Note:
WordCount represents the number of words in this text exclusive
of tags and header information. Microsoft Word 6.0 was used to count words.
ByteCount reflects the approximate
size of the file containing the doctype and cesDoc element including
all text, tags and header information. The size of the file with Cyrillic
characters represented by SGML entities is approximately 5 times larger than
the size of the originally tagged Cyrillic file.
- Publication Statement
- Distributor:
-
Institute of Mathematics,
Bulgarian Academy of Sciences
- Address:
-
Acad G. Bonchev st. bl.8
1113 Sofia, Bulgaria
- Electronic address:
- mult@ling.math.acad.bg
- Availability:
-
Available for research purposes upon receipt of signed agreement
- Publication date:
- October 1, 1997
- Source Description
- Structured Bibliography
- Monograph
- Title:
- Capital (Bulgarian) April 29 - May 5, 1996
- Imprint
- Publisher:
- AII OOD
- Publication date:
- 1996-05-28
- Place:
- Sofia, Bulgaria
- Monograph
- Title:
- Continent (Bulgarian) 1995, January 15
- Imprint
- Publisher:
- Publishing House "MEGAPRESS" AD
- Publication date:
- 1995-01-15
- Place:
- Sofia, Bulgaria
- Full Bibliography
- Title Statement
- Title:
- Selected articles from Pari Daily,
in electronic form
- Responsibility
- Tsvetan Petrov - vice editor
(
The electronic texts of the excerpts from
"Pari" were prepared by the journalists
for internal use only and kindly provided
by Mr. Tsvetan Petrov for the MTE project
in DOS Word 5 format.
Not all of the published articles
were included in the electronic files.
)
- Publication Statement
- Distributor:
- PARI Daily
- Address:
- 1000 Sofia, "Tsarigradsko shosse" blvd 47
- Availability:
-
The electronic texts are property of their authors
and are not distributed
- Publication date:
- May 02, May 03 1996
- Source Description
- Structured Bibliography
- Monograph
- Title:
- Pari (Bulgarian) May 02, 1996
- Title:
- Pari (Bulgarian) May 03, 1996
- Imprint
- Publisher:
-
"RUBICON" - Izdatelsko-targovski kompleks PARI OOD
- Publication date:
- May 02, May 03 1996
- Place:
- Sofia, Bulgaria
- Full Bibliography
- Title Statement
- Title:
- Selected articles from Standart Daily,
in electronic form
- Responsibility
- Kiril Simov
(
The electronic texts of the excerpts from
"Standart" were prepared by the journalists
for internal use only.
They were provided for the MTE project in
DOS Word 5 format by Kiril Simov.
)
- Publication Statement
- Distributor:
- "Standart news" AD
- Address:
- 1303 Sofia, Antim I, 53
- Availability:
-
The electronic texts are property of their authors
and are not distributed
- Publication date:
- February, May, 1995
- Source Description
- Structured Bibliography
- Monograph
- Title:
- Standart Daily
- Imprint
- Publisher:
-
"Standart news" AD
- Publication date:
- February, May 1995
- Place:
- Sofia, Bulgaria
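The Extent note above says that the word count excludes tags and header information. A minimal sketch of such a count follows, assuming the header is a single cesHeader element and that words are whitespace-separated tokens. The function name word_count is hypothetical, and the result need not match the figure produced by Microsoft Word 6.0.

```python
import re

# Assumption: the header is one <cesHeader>...</cesHeader> element.
HEADER = re.compile(r"<cesHeader\b.*?</cesHeader>", re.DOTALL | re.IGNORECASE)
# Assumption: no attribute value contains a literal ">".
TAG = re.compile(r"<[^>]+>")


def word_count(sgml_text: str) -> int:
    """Count whitespace-separated tokens outside the header and outside tags."""
    body = HEADER.sub(" ", sgml_text)   # drop the header with its content
    body = TAG.sub(" ", body)           # drop remaining markup, keep the text
    return len(body.split())


sample = (
    "<cesDoc><cesHeader><title>Ignored words</title></cesHeader>"
    "<text><body><p>Three short words.</p></body></text></cesDoc>"
)
print(word_count(sample))
```

Entity references such as "&hellip;" are counted as ordinary tokens in this sketch.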
Encoding Description
- Project Description:
-
MULTEXT-East:
Multilingual Text Tools and Corpora for Central and
Eastern European Languages.
EU Copernicus Project COP106
- Tag declaration:
- abbr = 1295
All abbreviations are marked.
The 'expan' attribute is not always used.
- body = 1
- caption = 143
This tag is used mainly for phrases accompanying figures (with type=attached)
and for phrases that are in some way separated from the surrounding
text (type = display).
- byline = 302
- date = 395
All dates which contain one or more digits (the characters 0-9) are
marked, including dates specifying day/month/year, day/month,
and dates consisting only of a year. The attribute 'iso8601'
is not used consistently.
- dateline = 29
- distinct = 6
This tag is used for foreign words that are not commonly used in Bulgarian
but are written in Cyrillic. Foreign words that are widely used (e.g.
computer) are not tagged with this tag.
- div = 560
- docAuthor = 105
- figDesc = 48
- figure = 48
This tag is used to mark occurrences of figures. No reference is made
to the objects representing the figures.
- foreign = 9
This tag is used only with words that are not names.
- head = 500
- hi = 376
The highlighting tag is used to mark words and phrases
which were typographically distinguished, and
for which no other more precise tag is applicable.
In most of these cases, such highlighting signifies
emphasis.
- item = 50
- label = 2
- list = 11
- measure = 18
- mentioned = 10
- name = 4967
All names of people, places, organizations,
products, events, and programmes are marked.
This tag was also used for names of documents in cases
where 'title' did not seem appropriate.
Person names in the genitive are not marked.
- note = 11
- num = 1555
Anything containing one or more digits (the characters 0-9) that is
not part of a date, and all roman numerals, are marked as a
number. In cases where a ratio is expressed (per cent, per thousand),
the entire phrase (e.g., "10 per cent") is marked as a number.
- opener = 29
- p = 1440
- ptr = 23
- q = 228
The Q tag is used to mark quoted dialogue.
The attribute "broken=yes" is used when no sentence terminating punctuation
(either inside the Q itself or in the intervening text between two Qs)
appears between two dialogue fragments by the same speaker.
- quote = 1
- ref = 7
- s = 155
The S tag is used only for
broken sentences.
- sp = 80
The SP tag is used for interviews.
- speaker = 6
- text = 1
- term = 4
Words with specific usage in a particular domain are tagged with this tag.
No attempt is made to identify the type of domain.
- time = 13
All times which contain one or more digits (the characters 0-9) are
marked. The attribute 'iso8601' is not used consistently.
- title = 254
Titles of newspapers, books, songs, pictures, movies, and any other
art objects are marked.
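The figures in the tag declaration above are per-element usage counts. As an illustration of how such counts could be recomputed from a corpus file, here is a minimal sketch that counts start tags with a regular expression. The function name count_tags and the sample text are hypothetical, and the pattern assumes that no attribute value contains a literal ">"; it is not the tool that produced the counts listed here.

```python
import re
from collections import Counter

# Matches start tags only: "<" must be followed by a letter, so end tags
# ("</p>") and declarations ("<!DOCTYPE ...>") are skipped.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.]*)\b[^>]*>")


def count_tags(sgml_text: str) -> Counter:
    """Return a Counter mapping element names to start-tag occurrences."""
    return Counter(m.group(1) for m in START_TAG.finditer(sgml_text))


sample = (
    '<p>On <date>1996-05-02</date> <name>Pari</name> wrote '
    '<q broken="yes">text</q> about <name>Sofia</name>.</p>'
)
counts = count_tags(sample)
for tag in sorted(counts):
    print(f"{tag} = {counts[tag]}")
```

Element names are kept as written in the input; a real SGML parser would treat them case-insensitively.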
Revision Description
- Date: 1996-10-25
Lydia Sinapova
- Replaced Q tags with MENTIONED tags where appropriate
- linked broken Q tags with "prev" and "next" attributes
- Distinguished text within the mainstream to serve
as an "in-between" title has been tagged with CAPTION if the text
consists of whole sentences. Otherwise HI is used.
- all occurrences of "..." have been replaced with the
ISO_8879:1986 Publishing entity "hellip"
- all occurrences of "%" have been replaced with the
ISO_8879:1986 Publishing entity "percnt"
- all occurrences of paragraph character have been replaced
with the ISO_8879:1986 Publishing entity "sect"
- Date: 1996-01-28
Lydia Sinapova
- linked broken Q tags with "prev" and "next" attributes
- Distinguished text within the mainstream to serve
as an "in-between" title has been tagged with CAPTION,
whereby broken sentences are linked with "prev" and "next"
- inserted an ID attribute into the P tag in articles with sentences
broken by CAPTION, for linking purposes
- Date: 1997-03-20
Tomaz Erjavec, IJS
- Normalisation of corpus component CESHEADER elements:
CESHEADER, EDITIONSTMT, TITLESTMT/H.TITLE
- ISO LANGUAGEs implemented as marked section PUBLIC ent
- Language (WSDs) implemented as PUBLIC entities
- Date: 1997-03-27
Tomaz Erjavec, IJS
- Substituted IGCY entity with JCY (80 occurrences)
- Date: 1997-09-25
Tomaž Erjavec
- Changed editionStmt, byteCount, pubDate, Availability
to final form
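The 1996-10-25 revision above replaces three literal character sequences with SGML entity references. A minimal sketch of that substitution follows, assuming that the "paragraph character" is the section sign (U+00A7) and that entity references are written as "&name;". The function name to_entities is hypothetical; this is not the script used for the original revision.

```python
# Literal sequence -> entity reference, in the order listed in the revision.
# Assumption: the "paragraph character" is the section sign U+00A7.
REPLACEMENTS = [
    ("...", "&hellip;"),
    ("%", "&percnt;"),
    ("\u00a7", "&sect;"),
]


def to_entities(text: str) -> str:
    """Replace each literal sequence with its entity reference."""
    for literal, entity in REPLACEMENTS:
        text = text.replace(literal, entity)
    return text


print(to_entities("10% ... \u00a72"))
```

None of the inserted entity references contains one of the replaced literals, so the order of the three substitutions does not affect the result.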