slwac

TEI Header

§file description
§title statement
§title

slWaC Corpus
§statement of responsibility
§name Nikola Ljubešić
§responsibility

Acquisition of HTML source, conversion to text corpus.
§statement of responsibility
§name Tomaž Erjavec (IJS)
§responsibility

Linguistic annotation and XML encoding.
§edition statement
§edition V2.0
§extent
§term

1,258 million tokens, 2.8 million URLs
§publication statement
§date 2014-07-01
§publication place nl.ijs.si http://www.nljubesic.net/resources/corpora/slwac/
§availability

The corpus is available via concordancers at nl.ijs.si and for download under the CC BY-SA 4.0 licence.

§source description

The slWaC corpus contains texts extracted from the crawled HTML pages in Slovene (mostly) from the .si domain. This corpus is an extended version of the corpus described in: Nikola Ljubešić and Tomaž Erjavec: hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. Text, Speech and Dialogue 2011. Lecture Notes in Computer Science vol. 6836, 395-402, Springer.

§encoding description
§editorial practice declaration
§segmentation

Each text element corresponds to (the text extracted from) one Web page. Paragraphs have been preserved in the text as far as possible. The text inside paragraphs has been automatically marked up for sentences and tokens.

§interpretation

The text has been automatically tokenised, part-of-speech tagged and lemmatised with the ToTaLe tool. The morphosyntactic descriptions ("PoS" tags) follow the JOS specification.
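
The nesting described above (one text element per Web page, containing paragraphs, sentences and annotated tokens) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the element names p, s and w, the attribute names url, lemma and msd, the sample sentence and its tags, and the helper read_text are hypothetical and are not taken from the slWaC distribution, whose actual encoding may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Element names (p, s, w) and attribute names
# (url, lemma, msd) are assumptions for illustration only; the MSD
# values are merely JOS-style examples, not real corpus content.
SAMPLE = """
<text url="http://example.si/">
  <p>
    <s>
      <w lemma="biti" msd="Va-r3s-n">Je</w>
      <w lemma="lep" msd="Agpmsnn">lep</w>
      <w lemma="dan" msd="Ncmsn">dan</w>
    </s>
    <s>
      <w lemma="hvala" msd="Ncfsn">Hvala</w>
    </s>
  </p>
</text>
"""


def read_text(xml_string):
    """Parse one text element into paragraphs -> sentences -> tokens.

    Each token is returned as a (form, lemma, msd) tuple.
    """
    text = ET.fromstring(xml_string)
    paragraphs = []
    for p in text.iter("p"):
        sentences = []
        for s in p.iter("s"):
            tokens = [(w.text, w.get("lemma"), w.get("msd")) for w in s.iter("w")]
            sentences.append(tokens)
        paragraphs.append(sentences)
    return paragraphs


paragraphs = read_text(SAMPLE)
n_sentences = sum(len(p) for p in paragraphs)
n_tokens = sum(len(s) for p in paragraphs for s in p)
print(n_sentences, n_tokens)
```

Keeping the paragraph and sentence levels as nested lists mirrors the segmentation hierarchy, so counts per level fall out of simple length computations.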

§text-profile description
§language usage
§language

ident = sl
§term

Slovene
§revision description
§change Tomaž Erjavec: First release.
§date 2014-07-01


Date: 2014-08-18

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.