Japanese web corpus with difficulty levels jpWaC-L
http://nl.ijs.si/jaslo/

The Japanese Web corpus with difficulty levels jpWaC-L contains over
300 million words, with words and sentences annotated with their
difficulty level. The corpus is also available as 5 subcorpora, one
for each difficulty level, from 4 (easiest) to 0 (hardest).

The corpus was collected from the Web using WaCkY tools and then
processed by Chasen.

The difficulty levels of the words come from a lexicon provided by
Prof. Yoshiko Kawamura, Tokyo International University. Words are
assigned difficulty levels according to the Japanese Language
Proficiency Test Content Specifications (Revised Edition), Japan
Foundation & Association of International Education Japan. Tokyo:
Bonjinsha 2004.

The difficulty level of each sentence is computed using various
heuristics based on the difficulty levels of its words, sentence
length, etc. (cf. Hmeljak et al. 2010)

The corpus is encoded in CQP vertical format with structural
attributes for texts (one element for each file) and for sentences.
Each text element gives the @url and @domain of the text, and each
sentence element has the @level attribute giving the difficulty of
the sentence.

Positional attributes are:
1. token, as it appears in text
2. lemma of the word
3. the Chasen tag, translated to English
4. original Chasen tag in Japanese
5. difficulty level of the word

Example:

よろしく	よろしく	Adv.g	副詞-一般	4
お願い	お願い	N.Vs	名詞-サ変接続	4
し	する	V.free	動詞-自立	4
ます	ます	Aux	助動詞	4
。	。	Sym.p	記号-句点	0
...

================================================================================
Tomaz Erjavec, JSI
2013-05-01
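
The five positional attributes listed above can be read with a few lines of Python. The following is only a minimal sketch, not part of the corpus distribution. It assumes that each token sits on its own line with tab-separated columns (the usual CQP vertical convention) and that structural markup lines start with "<" and end with ">". The names Token and read_tokens are hypothetical, introduced here for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One token line; field names are illustrative, not from the corpus."""
    form: str     # 1. token, as it appears in text
    lemma: str    # 2. lemma of the word
    tag_en: str   # 3. Chasen tag, translated to English
    tag_ja: str   # 4. original Chasen tag in Japanese
    level: int    # 5. difficulty level of the word


def read_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from vertical-format lines, skipping everything else."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Assumption: structural markup occupies whole lines in angle brackets.
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        fields = line.split("\t")
        # Assumption: exactly five tab-separated columns, the last one numeric.
        if len(fields) != 5 or not fields[4].isdigit():
            continue
        form, lemma, tag_en, tag_ja, level = fields
        yield Token(form, lemma, tag_en, tag_ja, int(level))


# The example sentence from this README, written with explicit tabs.
sample = (
    "よろしく\tよろしく\tAdv.g\t副詞-一般\t4\n"
    "お願い\tお願い\tN.Vs\t名詞-サ変接続\t4\n"
    "し\tする\tV.free\t動詞-自立\t4\n"
    "ます\tます\tAux\t助動詞\t4\n"
    "。\t。\tSym.p\t記号-句点\t0\n"
)

tokens = list(read_tokens(sample.splitlines()))
for tok in tokens:
    print(tok.form, tok.lemma, tok.tag_en, tok.level)
```

Lines that do not match the assumed five-column layout are skipped silently rather than raising an error, which keeps the sketch tolerant of markup or malformed lines. Stricter handling may be preferable when validating the corpus.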