to copy, distribute, display, and perform the work
to make derivative works
Under the following conditions:
Attribution. You must give the original author credit. In scientific publications, this means citing the relevant publication or publications
listed on the home page of the project: http://nl.ijs.si/jos/.
Noncommercial. You may not use this work for commercial purposes.
Sampling for this corpus was performed in two steps. First, complete documents were sampled
from FidaPLUS (600M words), to make a corpus of 10M words; at this stage, FidaPLUS MSDs were
converted to JOS MSDs. Second, isolated paragraphs were sampled from the 10M corpus, to
arrive at 100k words.
Sampling complete documents: we chose random documents from FidaPLUS and kept those that
met the following criteria: 1. were larger than 5 paragraphs and 500 words; 2. were smaller
than 500k words; 3. had fewer than half of their paragraphs starting with upper-case words.
From the documents that passed, we also discarded some according to the following
per-category weights: $NONTECH = 0.5; $NEWS = 0.5; $JOURNAL = 0.5; $SPORTS = 0.05.
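The document-level filter described above could be sketched roughly as follows. This is
only an illustration, not the script actually used for the corpus. Several points are
assumptions: each weight is read as the probability of keeping a document of that category
(the text does not say whether it is a keep or a discard probability), categories without
a listed weight are always kept, an "upper-case word" is taken to be a first token written
entirely in capitals, and words are counted by whitespace splitting. The names
KEEP_WEIGHTS, passes_filters and keep_document are hypothetical.

```python
import random

# Weights as listed in the text. ASSUMPTION: each value is the probability
# of KEEPING a document of that category; the source does not specify this.
KEEP_WEIGHTS = {"NONTECH": 0.5, "NEWS": 0.5, "JOURNAL": 0.5, "SPORTS": 0.05}


def passes_filters(paragraphs):
    """Apply criteria 1-3 to a document given as a list of paragraph strings."""
    n_words = sum(len(p.split()) for p in paragraphs)
    # 1. larger than 5 paragraphs and 500 words
    if len(paragraphs) <= 5 or n_words <= 500:
        return False
    # 2. smaller than 500k words
    if n_words >= 500_000:
        return False
    # 3. fewer than half of the paragraphs start with an upper-case word
    #    (ASSUMPTION: "upper-case word" = first token is all capitals)
    upper_starts = sum(
        1 for p in paragraphs if p.split() and p.split()[0].isupper()
    )
    return upper_starts < len(paragraphs) / 2


def keep_document(paragraphs, category, rng):
    """Return True if the document survives both the filters and the weighting."""
    if not passes_filters(paragraphs):
        return False
    # ASSUMPTION: categories without a listed weight are always kept.
    weight = KEEP_WEIGHTS.get(category, 1.0)
    return rng.random() < weight


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    doc = ["besedilo " * 100] * 6  # 6 paragraphs, 600 words
    print(keep_document(doc, "NEWS", rng))
```

Passing the random generator in explicitly keeps the sampling repeatable, which is useful
when a corpus has to be rebuilt.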
In the second stage, the above corpus was used to sample random paragraphs that met the
following criteria: 1. longer than 10 words; 2. shorter than 1000 words.
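The paragraph-level step could look something like the sketch below. It is an illustration
under stated assumptions rather than the original procedure: paragraphs are drawn without
replacement in random order until a target word count is reached (the text gives only the
length limits and the final size of about 100k words, not the stopping rule), and words are
counted by whitespace splitting. The name sample_paragraphs is hypothetical.

```python
import random


def sample_paragraphs(paragraphs, target_words, rng):
    """Pick random eligible paragraphs until target_words is reached."""
    # Length criteria from the text: longer than 10 and shorter than 1000 words.
    eligible = [p for p in paragraphs if 10 < len(p.split()) < 1000]
    # ASSUMPTION: sampling is without replacement, in shuffled order.
    rng.shuffle(eligible)
    sample, total = [], 0
    for p in eligible:
        if total >= target_words:
            break
        sample.append(p)
        total += len(p.split())
    return sample


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    pool = ["beseda " * n for n in (5, 20, 30, 50, 1000)]
    picked = sample_paragraphs(pool, 60, rng)
    print(sum(len(p.split()) for p in picked))
```

Because whole paragraphs are added, the result can overshoot the target by up to one
paragraph, so the final size is approximate.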