Exam questions for Introduction to Human Language Technologies 07/08
2008-01-24

The exam lasts one and a half hours. You may use any reference materials you wish and, of course, Python with the NLTK library. If you cannot make a program work, submit what you did manage to write.

When you finish, put all the text answers, programs, and their output in an email and send it to tomaz.erjavec@ijs.si. If there are problems with email, give the file to Walter Scholger.

QUESTIONS:

1. Compare the ICE (http://www.ucl.ac.uk/english-usage/ice/) and COLT (http://torvald.aksis.uib.no/colt/) corpora by characteristics and typology.

2. What are the advantages and disadvantages of having a text encoded in ISO 8859-1 compared to UTF-16?

3. Take the text at http://nl.ijs.si/et/teach/graz07/hlt/Excercise/text.txt and write a program that substitutes each vowel with "V" and each consonant with "C". E.g. "2 Programming" becomes "2 CCVCCVCCVCC". Write out the result.

4. For the same (original) text, compute its vocabulary (lexicon) and output it sorted alphabetically. Each word in the output should be preceded by its rank and followed by its frequency, e.g.

   1. aardvark 1
   2. able 5
   3. act 4

5. Same as above, but sort the lexicon by the reversed words (e.g., "the" reversed is "eht" and should come before "feel", reversed "leef"), e.g.

   1. able 5
   2. aardvark 1
   3. act 4
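
One possible approach to questions 3 to 5 is sketched below. It is an illustration under stated assumptions, not a model answer. It uses only the Python standard library rather than NLTK. The function names (cv_pattern, vocabulary, ranked_lines, reversed_word) are made up for this sketch. It treats only a, e, i, o, u as vowels, so "y" and any accented letters count as consonants. It tokenizes by lower-casing the text and taking runs of the letters a-z. It also runs on a short invented sample string rather than the exam text.

```python
import re
from collections import Counter

# Assumption: only these letters are vowels; "y" is treated as a consonant.
VOWELS = set("aeiouAEIOU")


def cv_pattern(text):
    """Replace each vowel with "V" and every other letter with "C"."""
    out = []
    for ch in text:
        if ch in VOWELS:
            out.append("V")
        elif ch.isalpha():
            out.append("C")
        else:
            out.append(ch)  # digits, spaces and punctuation stay unchanged
    return "".join(out)


def vocabulary(text):
    """Count tokens; a token is a run of a-z after lower-casing (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def reversed_word(word):
    """Sort key for question 5: the word spelled backwards."""
    return word[::-1]


def ranked_lines(freq, key=None):
    """Return "rank. word frequency" lines, sorted by word or by key(word)."""
    words = sorted(freq, key=key)
    return ["%d. %s %d" % (rank, word, freq[word])
            for rank, word in enumerate(words, start=1)]


if __name__ == "__main__":
    # Invented sample; the exam text would be read from a file instead.
    sample = "2 Programming: the act, the able aardvark; act"
    print(cv_pattern(sample))
    freq = vocabulary(sample)
    print("\n".join(ranked_lines(freq)))
    print("\n".join(ranked_lines(freq, key=reversed_word)))
```

Passing the sort key as a parameter lets the same formatting function serve both question 4 (default alphabetical order) and question 5 (order by reversed spelling). A different tokenizer, for instance one from NLTK, could replace the regular expression in vocabulary without changing the rest.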