to copy, distribute, display, and perform the work
to make derivative works
Under the following conditions:
Attribution. You must give the original author credit. In scientific publications this
means citing the relevant publication or publications, referred to on the home page of the
project: http://nl.ijs.si/jos/.
Noncommercial. You may not use this work for commercial purposes.
Sampling for this corpus was performed in two steps. First, complete documents
were sampled from FidaPLUS (600M words), to make a corpus of 10M words; at this stage, FidaPLUS
MSDs were converted to JOS MSDs. Second, isolated paragraphs were sampled from the 10M corpus,
to arrive at 100k words.
Sampling complete documents: we chose random documents from FidaPLUS and
selected those that met the following criteria: 1. were larger than 5 paragraphs and 500 words;
2. were smaller than 500k words; 3. had fewer than half of their paragraphs starting with upper-case
words; 4. from these, we also discarded documents according to the following weights: $NONTECH =
0.5; $NEWS = 0.5; $JOURNAL = 0.5; $SPORTS = 0.05.
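The document-level filter can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual script: the weights are read here as the probability of keeping a document of the given category (the text does not say whether they are keep or discard rates), "upper-case words" is read as words written entirely in capitals, and the function names and the treatment of unlisted categories are hypothetical.

```python
import random

# Assumption: each weight is the probability of KEEPING a document of that
# category. The source only says documents were discarded "according to" them.
KEEP_WEIGHTS = {"NONTECH": 0.5, "NEWS": 0.5, "JOURNAL": 0.5, "SPORTS": 0.05}


def meets_size_criteria(paragraphs):
    """paragraphs: list of paragraphs, each a list of word tokens."""
    n_words = sum(len(p) for p in paragraphs)
    # Criterion 1: larger than 5 paragraphs and 500 words.
    if len(paragraphs) <= 5 or n_words <= 500:
        return False
    # Criterion 2: smaller than 500k words.
    if n_words >= 500_000:
        return False
    # Criterion 3: fewer than half of the paragraphs start with an
    # upper-case word (read here as a word in all capitals).
    upper_initial = sum(1 for p in paragraphs if p and p[0].isupper())
    return upper_initial * 2 < len(paragraphs)


def keep_document(paragraphs, category, rng):
    """Apply criteria 1-3, then the category weight (criterion 4)."""
    if not meets_size_criteria(paragraphs):
        return False
    # Assumption: categories without a listed weight are always kept.
    weight = KEEP_WEIGHTS.get(category, 1.0)
    return rng.random() < weight
```

Passing an explicit `random.Random` instance keeps the sampling reproducible for a given seed.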
In the second stage, the above corpus was used to sample random paragraphs
that met the following criteria: 1. longer than 10 words; 2. shorter than 1000 words.
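A minimal sketch of this second stage, assuming paragraphs are drawn in random order until a word budget is reached. The stopping rule, the seed and the function name are illustrative assumptions; the text only states the two length criteria and the 100k-word target.

```python
import random


def sample_paragraphs(paragraphs, target_words=100_000, seed=0):
    """paragraphs: list of paragraphs, each a list of word tokens."""
    rng = random.Random(seed)
    # Keep only paragraphs longer than 10 and shorter than 1000 words.
    eligible = [p for p in paragraphs if 10 < len(p) < 1000]
    rng.shuffle(eligible)
    sample, total = [], 0
    for p in eligible:
        # Stop once the word budget has been reached (assumed rule).
        if total >= target_words:
            break
        sample.append(p)
        total += len(p)
    return sample
```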
Word-level linguistic annotation comprises the lemma of a word and its
morphosyntactic description (MSD). Both have been manually checked twice, with the third check
concentrating on the word tokens annotated differently in the first pass.
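One way to pick out the tokens for such a follow-up check is to compare two independent annotations token by token. The sketch below assumes each annotation is a list of (token, lemma, MSD) triples over the same tokenisation; the data layout and the function name are hypothetical and not taken from the project.

```python
def disagreements(pass_a, pass_b):
    """Return indices of tokens whose lemma or MSD differs between passes.

    Each pass is a list of (token, lemma, msd) tuples over the same tokens.
    """
    if len(pass_a) != len(pass_b):
        raise ValueError("both passes must cover the same tokens")
    return [
        i
        for i, (a, b) in enumerate(zip(pass_a, pass_b))
        if a[1:] != b[1:]  # compare lemma and MSD, ignore the token itself
    ]
```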
Syntactic annotation is in a dependency framework developed in the scope of
the project. The dependency relations have been, in the first pass, manually annotated twice,
with a second pass concentrating on dependencies marked up differently in the first pass.
Semantic annotation of this corpus was performed manually. The sense
repository used for the annotation process was sloWNet (http://lojze.lugos.si/darja/research/slownet/), a wordnet for
Slovene. Annotation was performed following the targeted annotation principle which aims at
determining senses for a selection of polysemous words in the corpus. This is why only certain
words in the sentence are annotated. In this corpus, all occurrences of the 102 most frequent
nouns in the corpus were assigned their most appropriate wordnet sense.
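Selecting the target nouns by frequency could look like the sketch below. It assumes the corpus is available as (lemma, MSD) pairs and that noun MSDs can be recognised by a predicate; the default predicate (MSD starting with "N") and the function name are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_nouns(tokens, n=102, is_noun=lambda msd: msd.startswith("N")):
    """tokens: iterable of (lemma, msd) pairs; returns the n most frequent noun lemmas."""
    counts = Counter(lemma for lemma, msd in tokens if is_noun(msd))
    return [lemma for lemma, _ in counts.most_common(n)]
```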
Semantically annotated word or phrase. Most have a @key attribute; the
value is the synset id from sloWNet. The @sortKey attribute gives the target word, i.e. one
of the nouns selected for semantic annotation. The @subtype attribute, where present,
indicates that a sense is missing from sloWNet; its values can be 'missing_hyponym' or
'missing_synset'. The former indicates that a multi-word unit was identified but could not
be assigned a synset; in this case, only the headword was included in the term and annotated with the hypernym.
The latter indicates a missing proper name or a missing idiom in sloWNet; such terms were not assigned a synset id.
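Reading these attributes could be sketched as below. The attribute names (key, sortKey, subtype) and the two subtype values come from the description above; the element name `term`, the wrapping `s` element, the placeholder synset ids and word forms, and the absence of an XML namespace are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: ids and word forms are placeholders, not corpus data.
SNIPPET = """<s>
  <term key="SYNSET-ID-1" sortKey="noun1">w1</term>
  <term sortKey="noun2" subtype="missing_synset">w2 w3</term>
  <term key="SYNSET-ID-2" sortKey="noun3" subtype="missing_hyponym">w4</term>
</s>"""


def describe_term(term):
    """Collect the annotation carried by one semantically annotated unit."""
    return {
        "text": "".join(term.itertext()).strip(),
        "target": term.get("sortKey"),  # the noun selected for annotation
        "synset": term.get("key"),      # None when no synset id was assigned
        "gap": term.get("subtype"),     # 'missing_hyponym', 'missing_synset' or None
    }


terms = [describe_term(t) for t in ET.fromstring(SNIPPET).iter("term")]
```

With real corpus files the elements would most likely sit in an XML namespace, so the tag passed to `iter` would need to be qualified accordingly.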
Root dependency: Root forms a link between the abstract node of the clause or sentence, as the source, and elements which form further connections in a dependency tree. The targets are typically clause predicates, predicateless elliptical parts of sentences or independent particles within a sentence. Furthermore, it forms a link with all other tokens (word or punctuation) without an explicit syntactic role in a sentence.
Predicate part: PPart forms a link between elements without a dependency relation in the usual head-dependent sense which are consequently defined merely as parts of a word phrase. Typically it is used to link parts of verb phrases with the finite verb form or a participle ending in -l, as the source, and morphemes »ne«, »se«, »si«, »bi«, or the forms of the auxiliary verb "be" used to form future and past tenses, i.e. »bo«, »je«, etc., as the target.
Attribute: Atr is used to link heads and their dependents in word phrases. The source is the head of the phrase, the target is its dependent. Typically it is used in noun phrases, adjectival and adverbial phrases or to connect parts of complex verb phrases with modal verbs and non-finite verb forms, as well as to link subject or object complements to the verb.
Subject: Sb is used to link parts of clauses or sentences that can be defined as traditional subjects. However, the nodes linked with this relation do not comply entirely with the definition of a subject in traditional grammars. On the clause level, it forms a link between the predicate node and the subject node, with the head of the verb phrase in the predicate, as the source, and the head of the noun phrase or other kinds of phrases in the subject, as the target. On the sentence level, it forms a link between the main clause and the subject clause with the head of the verb phrase in the main clause, as the source, and the head of the verb phrase in the subject clause, as the target.
Object: Obj is used to link parts of clauses or sentences that can be defined as traditional objects. However, the nodes linked with this relation do not comply entirely with the definition of an object in traditional grammars. On the clause level, it forms a link between the predicate node and the object node, with the head of the verb phrase in the predicate, as the source, and the head of the noun phrase or other kinds of phrases in the object, as the target. On the sentence level, it forms a link between the main clause and the object clause with the head of the verb phrase in the main clause, as the source, and the head of the verb phrase in the object clause, as the target.
Adverbial of manner: AdvM is used to link parts of clauses or sentences that can be defined as traditional adverbials of manner. However, the nodes linked with this relation do not comply entirely with the definition of such adverbials in traditional grammars. On the clause level, it forms a link between the predicate node and the adverbial node, with the head of the verb phrase in the predicate, as the source, and the head of the noun phrase or other kinds of phrases in the adverbial, as the target. On the sentence level, it forms a link between the main clause and the adverbial clause with the head of the verb phrase in the main clause, as the source, and the head of the verb phrase in the adverbial clause, as the target.
Adverbial, other: AdvO is used to link parts of clauses or sentences that can be defined as traditional adverbials, with the exception of adverbials of manner. However, the nodes linked with this relation do not comply entirely with the definition of such adverbials in traditional grammars. On the clause level, it forms a link between the predicate node and the adverbial node, with the head of the verb phrase in the predicate, as the source, and the head of the noun phrase or other kinds of phrases in the adverbial, as the target. On the sentence level, it forms a link between the main clause and the adverbial clause with the head of the verb phrase in the main clause, as the source, and the head of the verb phrase in the adverbial clause, as the target.
Coordination: Coord is used to link parts of coordinate structures on phrase level. It forms a link between the head of the first part of the coordinate structure and the head of the second part of the structure. The source is always the head in the left part of the structure and the target is the head in the right part of the structure.
Conjunction: Conj is used in combination with the Coord relation to link three elements – connected with Coord and Conj – in a triangle. Conj is used to link the head of the second part of the coordinate structure on the phrase level, as the source, and the coordinating conjunction or punctuation mark (if it functions as the coordinating conjunction), as the target.
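The interplay of Coord and Conj can be illustrated with a small sketch. The `Link` tuple, the use of token positions as node identifiers, and the English example phrase are hypothetical choices for illustration; only the labels and the source/target directions follow the descriptions above.

```python
from typing import NamedTuple


class Link(NamedTuple):
    label: str
    source: int  # position of the source token
    target: int  # position of the target token


def coordinate(left_head, right_head, conjunction):
    """Links for a two-part coordinate structure on the phrase level."""
    return [
        # Coord: head of the left part -> head of the right part.
        Link("Coord", left_head, right_head),
        # Conj: head of the right part -> conjunction (or punctuation mark).
        Link("Conj", right_head, conjunction),
    ]


tokens = ["bread", "and", "milk"]
links = coordinate(left_head=0, right_head=2, conjunction=1)
```

Here `links` holds a Coord link from "bread" to "milk" and a Conj link from "milk" to "and", so the three tokens are tied together by the two relations.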
Multi-word unit: MWU is used to link words which have a very strong tendency to appear together as a group forming a multi-word unit and do not show characteristics of a head-dependent phrase structure. Typically, this relation is used to link words with a variant spelling with or without a space, some multi-word conjunctions and similar elements.