The encoding of the new translations was harmonised, as far as possible, with the MULTEXT-East ``1984'' corpus components. Due to tight time and effort constraints, the encoding is often less detailed, and there may be more encoding errors than in the MULTEXT-East ``1984''. In general, however, the encodings are more similar than different, especially as regards their segmentation, which was kept close in order to simplify sentence alignment.
Below we give an estimate of the number of words in each language. The wordcounts were produced by removing the SGML tags from the texts and then applying a 'wc'-like procedure. The translations described here are in boldface:
Language | Words |
---|---|
English | 104.302 |
Romanian | 101.460 |
Slovene | 91.619 |
Serbo-Croatian | 89.749 |
Bulgarian | 87.235 |
Latvian | 81.956 |
Czech | 80.366 |
Hungarian | 81.147 |
Estonian | 79.334 |
Lithuanian | 71.252 |
Wordcounts in Orwell's ``1984''
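
The counting procedure described above (strip the SGML tags, then count words as 'wc' would) can be sketched roughly as follows. This is only an illustration, not the script actually used for the table: the function name `count_words`, the simple tag pattern, and the `<p>`/`<s>` elements in the sample are assumptions made for the example. Tags are replaced by a space so that words in adjacent elements are not fused together; SGML entities are left untouched and counted as part of the token they occur in.

```python
import re

# Assumption: a tag is anything between '<' and '>'. Comments, marked
# sections and other SGML constructs are not treated specially.
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text: str) -> int:
    """Remove SGML tags, then count whitespace-separated tokens (like `wc -w`)."""
    # Replace each tag with a space so that "</s><s>" does not join two words.
    plain = TAG_RE.sub(" ", sgml_text)
    # str.split() with no argument splits on any run of whitespace.
    return len(plain.split())


# Hypothetical sample markup; the element names are illustrative only.
sample = (
    '<p><s id="s1">It was a bright cold day in April,</s> '
    "<s>and the clocks were striking thirteen.</s></p>"
)
print(count_words(sample))  # prints 14
```

Because punctuation stays attached to the neighbouring word, such counts are estimates in the same sense as the figures in the table.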