
Filenaming conventions

The component corpora are currently stand-alone SGML documents, which were collected for the milestone at the Ljubljana site via ftp and multi-part email, usually after several attempts. It was therefore essential to adopt a relatively transparent file-naming convention for data interchange and storage. As a general policy, filenames were made descriptive rather than short, and they rely on Unix filenaming conventions.

Each corpus component is stored in one file. For example, the MULTEXT-East Sampler of the CES-1 encoded Bulgarian translation of ``1984'', compressed with GNU zip, is named mte1984Smp-bg.ces1.gz.

The filename thus consists of:

  1. the ``project stamp'' mte;
  2. the component set: 1984, Fict, or News;
  3. for ``sampler'' files (which will continue to be SGML documents), a further Smp following the component set;
  4. a two-letter ISO 639:1988 language code (bg, cs, en, et, hu, ro, sl), separated from the preceding part by a hyphen;
  5. the type extension: for corpus files this is either ces1, for the CES1 corpus component, or orig, for the digital ``original'' on which the CES1 encoding was based;
  6. if the file is not compliant with MULTEXT-East standards, it is marked with a 0; an error file giving the validation errors should then also be present in the same directory (no such files should be present in the final distribution);
  7. at the end come any storage and compression extensions, currently tar for Unix tape archive and gz for GNU zip (gzip).
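
The convention above can be checked mechanically. The following is a minimal sketch in Python, not part of the MULTEXT-East tools; the function name parse_mte_filename and the returned field names are hypothetical. It assumes, on the basis of the single example mte1984Smp-bg.ces1.gz, that the type extension and the storage extensions are each introduced by a period and that tar precedes gz. The text does not say where the 0 marker for non-compliant files is placed, so the sketch does not handle it and such files are simply not matched.

```python
import re

# Pattern assembled from the numbered parts of the convention.
# Assumptions (not stated in the text): "." separates the type and
# storage extensions, and "tar" comes before "gz" when both occur.
MTE_FILENAME = re.compile(
    r"mte"                              # 1. project stamp
    r"(?P<component>1984|Fict|News)"    # 2. component set
    r"(?P<sampler>Smp)?"                # 3. optional sampler marker
    r"-(?P<lang>bg|cs|en|et|hu|ro|sl)"  # 4. ISO 639 language code
    r"\.(?P<type>ces1|orig)"            # 5. type extension
    r"(?P<tar>\.tar)?"                  # 7. optional tape archive
    r"(?P<gz>\.gz)?"                    # 7. optional gzip compression
)


def parse_mte_filename(name):
    """Return the parts of a filename as a dict, or None if it does not match."""
    m = MTE_FILENAME.fullmatch(name)
    if m is None:
        return None
    return {
        "component": m.group("component"),
        "sampler": m.group("sampler") is not None,
        "language": m.group("lang"),
        "type": m.group("type"),
        "tar": m.group("tar") is not None,
        "gzip": m.group("gz") is not None,
    }


if __name__ == "__main__":
    print(parse_mte_filename("mte1984Smp-bg.ces1.gz"))
```

A name that does not follow the convention, such as one with an unlisted language code, yields None, which makes the function usable as a simple check before files are exchanged or archived.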


Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996