The hrWaC corpus contains texts extracted from Croatian HTML pages from the .hr domain.
The compilation of this corpus is described in:
Nikola Ljubešić and Filip Klubička.
{bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian.
Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.
Each text element corresponds to (the text extracted from) one Web page.
Paragraph boundaries have been preserved in the text as far as possible.
The text inside paragraphs has been automatically marked up for sentences and tokens.
The text has been automatically tokenised, rediacriticised, part-of-speech tagged and lemmatised.
The morphosyntactic descriptions ("PoS" tags) follow the preliminary
MULTEXT-East V5 Croatian specification.
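
The structure described above (text, paragraph, sentence, token) can be read with a few lines of Python. The following is only a minimal sketch, not a description of the actual distribution format: it assumes a one-token-per-line "vertical" layout with <text>, <p> and <s> tags on their own lines, and tab-separated columns in the order form, MSD tag, lemma. The element names, the column order, the function name iter_sentences and the sample data are all assumptions made for illustration; check them against the files you have.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str   # surface word form
    msd: str    # morphosyntactic description ("PoS" tag)
    lemma: str  # base form


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence.

    ASSUMPTION: token lines contain tabs (form, MSD, lemma), and
    structural tags such as <s> and </s> sit on lines without tabs.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if "\t" in line:
            # Token line: take the first three tab-separated columns.
            parts = line.split("\t")
            if len(parts) >= 3:
                sentence.append(Token(parts[0], parts[1], parts[2]))
        elif line.strip() == "</s>":
            # End of sentence: emit what has been collected so far.
            if sentence:
                yield sentence
            sentence = []
        # Any other line (<text>, <p>, <s>, blank) is skipped.
    if sentence:
        # Emit a trailing sentence that was never closed.
        yield sentence


# Illustrative sample only; it is not taken from the corpus.
SAMPLE = [
    "<text>",
    "<p>",
    "<s>",
    "Ovo\tPd-nsn\tovaj",
    "je\tVar3s\tbiti",
    "rečenica\tNcfsn\trečenica",
    ".\tZ\t.",
    "</s>",
    "</p>",
    "</text>",
]

if __name__ == "__main__":
    for sent in iter_sentences(SAMPLE):
        print(" ".join(tok.form for tok in sent))
```

To process a real file, pass an open file handle instead of SAMPLE, for example iter_sentences(open(path, encoding="utf-8")), where path is whatever file you downloaded. Detecting token lines by the presence of a tab, rather than by a leading "<", keeps tokens such as a literal "<" from being mistaken for markup.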