hrwac

TEI Header

§file description
§title statement
§title

hrWaC Corpus (Croatian Web)
§statement of responsibility
§name Nikola Ljubešić
§responsibility

Crawl, conversion and annotation.
§statement of responsibility
§name Tomaž Erjavec (IJS)
§responsibility

Validation, conversion to CWB vertical format, TEI header.
§edition statement
§edition V2.2
§extent
§term

700 million tokens, 3,611,090 texts
§publication statement
§date 2016-06-13
§publication place http://nlp.ffzg.hr/resources/corpora/hrwac/
§availability

The corpus is available via concordancers at nl.ijs.si and for download from http://hdl.handle.net/11356/1064 under the CC BY-SA 4.0 licence.

In published work using this corpus please cite:

Nikola Ljubešić and Filip Klubička. {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.

§source description

The hrWaC corpus contains texts extracted from Croatian HTML pages from the .hr domain. The compilation of this corpus is described in:

Nikola Ljubešić and Filip Klubička. {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.

§encoding description
§editorial practice declaration
§segmentation

Each text element corresponds to the text extracted from one Web page. Paragraph boundaries have been preserved as far as possible. The text inside paragraphs has been automatically marked up for sentences and tokens.

§interpretation

The text has been automatically tokenised, rediacriticised, part-of-speech tagged and lemmatised. The morphosyntactic descriptions ("PoS" tags) follow the preliminary MULTEXT-East V5 Croatian specification.
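
The structure described above (texts, paragraphs, sentences, annotated tokens) can be read from a CWB vertical file with a few lines of code. The following is a minimal sketch, not the tooling used to build the corpus. It assumes that the structural tags are named `text`, `p` and `s`, and that each token line holds tab-separated columns such as word, lemma and MSD; the header does not state the tag names or the column order, so check them against the downloaded file. The sample tokens and tags are illustrative values, not taken from the corpus.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]      # one tuple per token line, e.g. (word, lemma, msd)
Sentence = List[Token]
Paragraph = List[Sentence]

# Assumed structural tag names; a line counts as markup only if it is
# exactly one of these tags, so a token such as "<" is not mistaken for one.
TAG = re.compile(r"^<(/?)(text|p|s)(?:\s[^>]*)?>$")


def read_vertical(lines: Iterable[str]) -> Iterator[List[Paragraph]]:
    """Yield each text as a list of paragraphs, each a list of sentences."""
    text: List[Paragraph] = []
    para: Paragraph = []
    sent: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        match = TAG.match(line)
        if match is None:
            # Token line: split the tab-separated annotation columns.
            sent.append(tuple(line.split("\t")))
            continue
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            # An opening tag starts a fresh container at that level.
            if name == "text":
                text = []
            elif name == "p":
                para = []
            else:
                sent = []
        elif name == "s":
            para.append(sent)
        elif name == "p":
            text.append(para)
        else:
            yield text


# Illustrative input only: the attribute, words and tags are made up.
SAMPLE = """<text id="example">
<p>
<s>
Ovo\tovaj\tPd-nsn
je\tbiti\tVar3s
test\ttest\tNcmsn
.\t.\tZ
</s>
</p>
</text>
"""

texts = list(read_vertical(SAMPLE.splitlines()))
first_sentence = texts[0][0][0]
print(len(texts), len(first_sentence), first_sentence[0])
```

Because the reader yields one text at a time, it can be pointed at an open file handle and will not load the whole corpus into memory.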

§sampling declaration

This version of the corpus is deduplicated at the paragraph level.
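
To illustrate what paragraph-level deduplication means in practice, here is a minimal sketch. The header does not say how duplicates were detected, so exact matching on a hash of the whitespace-normalised, lower-cased paragraph is an assumption made for this example only; the function name `dedup_paragraphs` is hypothetical, and the procedure actually used is the one described in the cited paper.

```python
import hashlib
from typing import Iterable, Iterator, List, Set


def dedup_paragraphs(texts: Iterable[List[str]]) -> Iterator[List[str]]:
    """Keep only the first occurrence of each paragraph across all texts.

    Assumption: two paragraphs are duplicates when they are identical
    after collapsing whitespace and lower-casing.
    """
    seen: Set[bytes] = set()
    for paragraphs in texts:
        kept: List[str] = []
        for paragraph in paragraphs:
            normalised = " ".join(paragraph.split()).lower()
            key = hashlib.sha1(normalised.encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                kept.append(paragraph)
        # A text whose paragraphs were all seen before is dropped entirely.
        if kept:
            yield kept


# Illustrative input only: three tiny "texts" given as lists of paragraphs.
docs = [
    ["Dobar dan.", "Uvjeti korištenja"],
    ["Uvjeti  korištenja", "Novi tekst."],
    ["Dobar dan."],
]
result = list(dedup_paragraphs(docs))
print(result)
```

Storing fixed-size digests instead of the paragraphs themselves keeps the memory needed for the `seen` set small, which matters for a corpus with millions of texts.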

§text-profile description
§language usage
§language

ident = hr
§term

Croatian
§revision description
§change Tomaž Erjavec: Introduced new CMC MSDs (Xe, Xw, Xh, Xa) and corrected some incorrect Adverb MSDs.
§date 2016-06-13
§change Tomaž Erjavec: Made teiHeader and vertical file.
§date 2016-05-11


Date: 2016-06-14

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.