Next: Corpus Encoding Up: TELRI Appendix 1: Additional Previous: TELRI Appendix 1: Additional

Overview

The encoding of the new translations was harmonised, as far as possible, with the MULTEXT-East ``1984'' corpus components. Because of tight constraints on time and effort, the encoding is often less detailed, and it may contain more encoding errors than the MULTEXT-East ``1984''. In general, however, the encodings are more similar than different, especially in their segmentation, so as to simplify sentence alignment.

Below we give an estimate for the number of words, by language. The wordcounts were produced by removing the SGML tags from the texts and then using a 'wc'-like procedure. The translations described here are in boldface:

English 104.302
Romanian 101.460
Slovene 91.619
Serbo-Croatian 89.749
Bulgarian 87.235
Latvian 81.956
Czech 80.366
Hungarian 81.147
Estonian 79.334
Lithuanian 71.252

Wordcounts in Orwell's ``1984''
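
The counting procedure described above (remove the SGML tags, then count words as 'wc' would) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that produced the figures in the table: the function name count_words, the tag pattern, and the sample markup are all hypothetical, and SGML entities are left untouched, so each entity simply counts as part of the token it appears in.

```python
import re

# Simplifying assumption: a tag is anything between "<" and the next ">".
# Attribute values that contain ">" would break this pattern.
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text: str) -> int:
    """Strip SGML tags, then count whitespace-separated tokens (like `wc -w`)."""
    # Replace each tag with a space so that words on either side of a tag
    # are not fused into a single token.
    plain = TAG_RE.sub(" ", sgml_text)
    return len(plain.split())


# Hypothetical sample markup; the element and attribute names are
# illustrative and not taken from the corpus.
sample = (
    '<p><s id="s1">It was a bright cold day in April,</s> '
    '<s id="s2">and the clocks were striking thirteen.</s></p>'
)
print(count_words(sample))
```

Because tokens are split on whitespace only, punctuation attached to a word is counted together with that word, which is one reason such figures are best read as estimates.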


Multext-East