frwac

TEI Header

§file description
§title statement
§title

frWaC corpus
§statement of responsibility
§name The WaCkys
§responsibility

Acquisition of HTML source, conversion to text corpus.
§statement of responsibility
§name Tomaž Erjavec (IJS)
§responsibility

Linguistic annotation and XML encoding.
§edition statement
§edition V1.0
§extent
§term

1,600 million tokens, 2,270 thousand URLs
§publication statement
§date 2013-03-29
§publication place Source corpus is available from wacky.sslmit.unibo.it; concordances over the corpus are available from nl.ijs.si.
§availability

The corpus is available via concordancers at CLARIN.SI and for download with permission from wacky.sslmit.unibo.it.

§source description

cf. wacky.sslmit.unibo.it

§encoding description
§editorial practice declaration
§segmentation

Each text element corresponds to the text extracted from one Web page. The text has been automatically marked up for sentences and tokens by TreeTagger.

§interpretation

The text has been automatically tokenised, part-of-speech tagged and lemmatised with TreeTagger. The TreeTagger tags have then also been mapped to the common, MULTEXT-based SPOOK tagset.

§text-profile description
§language usage
§language

ident = fr
§term

French
§revision description
§change Tomaž Erjavec: First release.
§date 2013-02-29


Date: 2018-03-01

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.