
Overview

The encoding of the new translations was harmonised, as far as possible, with the MULTEXT-East ``1984'' corpus components. Because of tight time and effort constraints, the encoding is often less detailed, and it may contain more encoding errors than the MULTEXT-East ``1984''. In general, however, the encodings are more similar than different, especially as regards their segmentation, which simplifies sentence alignment.

Below we give an estimate of the number of words, by language. The wordcounts were produced by removing the SGML tags from the texts and then applying a 'wc'-like procedure. The translations described here are in boldface:

Language	Words
English	104,302
Romanian	101,460
Slovene	91,619
Serbo-Croatian	89,749
Bulgarian	87,235
Latvian	81,956
Czech	80,366
Hungarian	81,147
Estonian	79,334
Lithuanian	71,252

Wordcounts in Orwell's ``1984''
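
The counting procedure described above (strip the SGML tags, then count words as 'wc' would) can be sketched roughly as follows. This is only an illustration, not the script actually used: the regular expression, the function name count_words, the choice to replace each tag with a space, and the sample element with its id attribute are all assumptions made for the example. SGML entities are left untouched and simply count as part of the surrounding token.

```python
import re

# Assumption: a tag is anything between "<" and the next ">".
# This is a simplification and ignores SGML corner cases
# (e.g. ">" inside attribute values, marked sections).
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text: str) -> int:
    """Strip tags, then count whitespace-separated tokens (like `wc -w`)."""
    # Replace each tag with a space so that adjacent elements
    # such as "<p>one</p><p>two</p>" do not merge into one token.
    plain = TAG_RE.sub(" ", sgml_text)
    return len(plain.split())


# Hypothetical sample markup, for illustration only.
sample = (
    '<s id="example.1">It was a bright cold day in April, '
    "and the clocks were striking thirteen.</s>"
)
print(count_words(sample))  # 14
```

Because punctuation stays attached to the neighbouring word, the result is a token count in the 'wc' sense, which is why the figures above are best read as estimates.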


Multext-East