The component corpora are currently stand-alone SGML documents. They
were collected for the milestone at the Ljubljana site via ftp and
email sent in parts, usually after several attempts, so it was
essential to adopt a relatively transparent file-naming convention
for data interchange and storage. As a general policy, filenames
were made descriptive rather than short, and they rely on the Unix
file-naming conventions. In particular, the following is assumed of
filenames:
- alphanumeric ASCII with no length limit,
- upper and lower case distinguished and
- hyphens and multiple extensions (``.'') allowed.
Each corpus component is stored in one file. For example, the
MULTEXT-East Sampler of the CES-1 encoded Bulgarian translation of
``1984'', compressed with GNU zip, is named
mte1984Smp-bg.ces1.gz.
The filename thus consists of:
- the ``project stamp'' mte;
- the component set: 1984, Fict, or
News;
- ``sampler'' files (which will continue to be SGML documents)
have a further Smp following the component set;
- a two letter ISO 639:1988 language code (bg, cs, en, et,
hu, ro, sl), separated from the preceding part by a hyphen;
- the type extension: the M corpus possibilities are ces1
for the CES1 corpus component or orig for the digital
``original'' on which the CES1 encoding was based;
- if the file is not compliant with MULTEXT-East standards, it is
marked with a 0; an error file giving the validation errors should
also be present in the same directory (no such files should be
present in the final distribution);
- at the end come the possible storage and compression extensions,
currently tar (Unix tape archive) and gz (GNU zip, gzip).
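The components listed above can be checked mechanically. What follows is only a minimal sketch in Python, not part of the MULTEXT-East tools: the names FILENAME_RE and parse_mte_filename are hypothetical, and the pattern assumes the components appear exactly in the order given. Since the text does not say precisely where the non-compliance marker 0 is placed, the sketch deliberately leaves it out, so marked files are reported as not matching.

```python
import re

# Pattern assembled from the components listed in the text.
# Assumption: the non-compliance marker "0" is NOT handled, because
# its exact position in the filename is not specified.
FILENAME_RE = re.compile(
    r"^mte"                              # project stamp
    r"(?P<component>1984|Fict|News)"     # component set
    r"(?P<sampler>Smp)?"                 # optional sampler marker
    r"-(?P<lang>bg|cs|en|et|hu|ro|sl)"   # two-letter language code
    r"\.(?P<type>ces1|orig)"             # type extension
    r"(?P<storage>(?:\.(?:tar|gz))*)$"   # storage and compression extensions
)


def parse_mte_filename(name):
    """Return a dict of filename components, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return {
        "component": m.group("component"),
        "sampler": m.group("sampler") is not None,
        "lang": m.group("lang"),
        "type": m.group("type"),
        # ".tar.gz" becomes ["tar", "gz"]; no extension gives an empty list.
        "storage": [ext for ext in m.group("storage").split(".") if ext],
    }


if __name__ == "__main__":
    print(parse_mte_filename("mte1984Smp-bg.ces1.gz"))
```

Applied to the example name mte1984Smp-bg.ces1.gz, the function reports the component set 1984, the sampler flag, the language code bg, the type ces1 and the single compression extension gz. A name with an unlisted language code or type extension yields None.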
Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996