itwac

TEI Header

^§file description

^§title statement

^§title

itWaC corpus

^§statement of responsibility

^§name	The WaCkys
^§responsibility	Acquisition of HTML source, conversion to text corpus.

^§statement of responsibility

^§name	Tomaž Erjavec (IJS)
^§responsibility	Linguistic annotation and XML encoding.

^§edition statement

^§edition

V1.0

^§extent

^§term

1,900 million tokens, 1,870 thousand URLs

^§publication statement

^§date	2013-03-29
^§publication place	The source corpus is available from wacky.sslmit.unibo.it; concordances over the corpus are available from nl.ijs.si.
^§availability	The corpus is available via concordancers at CLARIN.SI and for download with permission from wacky.sslmit.unibo.it.

^§source description

^§encoding description

^§editorial practice declaration

^§segmentation	Each text element corresponds to the text extracted from one Web page. The text has been automatically marked up for sentences and tokens by TreeTagger.
^§interpretation	The text has been automatically tokenised, part-of-speech tagged and lemmatised with TreeTagger. The TreeTagger tags have then also been mapped to the common, MULTEXT-based SPOOK tagset.

^§text-profile description

^§language usage

^§language

ident = it

^§term

Italian

^§revision description

^§change

Tomaž Erjavec: First release.

^§date

2013-03-29

Date: 2018-03-01

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.