hrwac

TEI Header

^§file description

^§title statement

^§title

hrWaC Corpus (Croatian Web)

^§statement of responsibility

^§name	Nikola Ljubešić
^§responsibility	Crawl, conversion and annotation.

^§statement of responsibility

^§name	Tomaž Erjavec (IJS)
^§responsibility	Validation, conversion to CWB vertical format, TEI header.

^§edition statement

^§edition

V2.2

^§extent

^§term

700 million tokens, 3,611,090 texts

^§publication statement

^§date

2016-06-13

^§publication place

http://nlp.ffzg.hr/resources/corpora/hrwac/

^§availability

The corpus is available via concordancers at nl.ijs.si and for download from http://hdl.handle.net/11356/1064 under the CC BY-SA 4.0 licence.

In published work using this corpus please cite:

Nikola Ljubešić and Filip Klubička. {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.

^§source description

The hrWaC corpus contains texts extracted from Croatian HTML pages from the .hr domain. The compilation of this corpus is described in:

Nikola Ljubešić and Filip Klubička. {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.

^§encoding description

^§editorial practice declaration

^§segmentation	Each text element corresponds to the text extracted from one Web page. Paragraphs have been preserved in the text as far as possible. The text inside paragraphs has been automatically marked up for sentences and tokens.
^§interpretation	The text has been automatically tokenised, rediacriticised, part-of-speech tagged and lemmatised. The morphosyntactic descriptions ("PoS" tags) follow the preliminary MULTEXT-East V5 Croatian specification.
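As a rough illustration of the structure described above (texts containing paragraphs, sentences and annotated tokens), the following sketch reads sentences from a CWB-style vertical file, where structural tags sit on their own lines and each token line holds tab-separated annotations. The tag names `text`, `p` and `s`, the column order (word form, MSD, lemma), the function name `iter_sentences` and the sample data are all assumptions made for this example and are not taken from the corpus documentation.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]  # tab-separated columns of one token line


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token tuples.

    Assumes sentences are delimited by <s> ... </s> lines and that
    every other structural tag occupies a line of its own.
    """
    sentence: List[Token] = []
    inside = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<s>") or line.startswith("<s "):
            sentence, inside = [], True
        elif line.startswith("</s>"):
            if inside:
                yield sentence
            sentence, inside = [], False
        elif line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue  # other structural tags such as <text> or <p>
        elif inside and line:
            sentence.append(tuple(line.split("\t")))


# Invented sample; the column order (word, MSD, lemma) is an assumption.
sample = """<text id="example">
<p>
<s>
Ovo\tPd-nsn\tovaj
je\tVar3s\tbiti
primjer\tNcmsn\tprimjer
.\tZ\t.
</s>
</p>
</text>
""".splitlines()

for sent in iter_sentences(sample):
    print(" ".join(tok[0] for tok in sent))
```

The same generator can be fed an open file handle instead of the in-memory sample, which keeps memory use flat for a corpus of this size.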

^§sampling declaration

This version of the corpus has been deduplicated at the paragraph level.
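The header does not state how the deduplication was carried out. Purely as an illustration of the general idea, the sketch below drops paragraphs whose whitespace- and case-normalised text has been seen before. The function name `dedup_paragraphs` and the normalisation are assumptions for this example; the procedure actually used for hrWaC may differ, for instance by detecting near-duplicates.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_paragraphs(paragraphs: Iterable[str]) -> Iterator[str]:
    """Yield each paragraph the first time its normalised text occurs."""
    seen = set()
    for para in paragraphs:
        # Collapse whitespace and lowercase before hashing (an assumption).
        normalised = " ".join(para.split()).lower()
        key = hashlib.sha1(normalised.encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield para


paragraphs = ["Dobar dan.", "dobar  dan.", "Hvala."]
print(list(dedup_paragraphs(paragraphs)))
```

Storing fixed-size digests instead of the paragraphs themselves keeps the set of seen items compact when the input is large.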

^§text-profile description

^§language usage

^§language

ident = hr

^§term

Croatian

^§revision description

^§change

Tomaž Erjavec: Introduced new CMC MSDs (Xe, Xw, Xh, Xa) and corrected some wrong Adverb MSDs.

^§date

2016-06-13

^§change

Tomaž Erjavec: Made teiHeader and vertical file.

^§date

2016-05-11

Date: 2016-06-14

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.