srwac

TEI Header

§file description
§title statement
§title

srWaC Corpus (Serbian Web)
§statement of responsibility
§name Nikola Ljubešić
§responsibility

Crawl, conversion and annotation.
§statement of responsibility
§name Tomaž Erjavec (IJS)
§responsibility

Validation, conversion to CWB vertical format, TEI header.
§edition statement
§edition V1.2
§extent
§term

555 million tokens, 1,353,237 URLs
§publication statement
§date 2016-06-13
§publication place http://nlp.ffzg.hr/resources/corpora/srwac/
§availability

The corpus is available via concordancers at nl.ijs.si and for download from http://hdl.handle.net/11356/1063 under the CC BY-SA 4.0 licence.

In published work using this corpus please cite:

Nikola Ljubešić and Filip Klubička. {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.

§source description

The srWaC corpus contains texts extracted from Serbian HTML pages from the .rs top-level domain. The compilation of this corpus is described in:

Nikola Ljubešić and Filip Klubička. {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), ACL, 2014.

§encoding description
§editorial practice declaration
§segmentation

Each text element corresponds to the text extracted from one Web page. Paragraph boundaries have been preserved as far as possible. The text inside paragraphs has been automatically marked up for sentences and tokens.

§interpretation

The text has been automatically tokenised, rediacritised, part-of-speech tagged and lemmatised. The morphosyntactic descriptions ("PoS" tags) follow the preliminary MULTEXT-East V5 specification.
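As a rough illustration of how the segmentation and annotation described above might be read from a CWB vertical file, the following minimal sketch groups token lines into sentences. It assumes that structural elements appear as tags on their own lines (here `<text>`, `<p>`, `<s>`) and that each token line holds tab-separated columns in the order word, lemma, MSD. The tag names, the `url` attribute, the column order and the sample tokens with their MSD values are all assumptions made for this example and should be checked against the distributed file.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]  # one tuple of tab-separated columns per token


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of column tuples from vertical-format lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("</s"):
            # End of a sentence element: emit the collected tokens.
            if sentence:
                yield sentence
            sentence = []
        elif line.startswith("<") and line.endswith(">"):
            # Any other structural tag (text, p, opening s) is skipped here.
            continue
        else:
            sentence.append(tuple(line.split("\t")))
    if sentence:
        yield sentence


# Made-up sample; column order (word, lemma, MSD) and tag values are assumed.
sample = [
    '<text url="http://example.rs/">',
    "<p>",
    "<s>",
    "Ovo\tovaj\tPd-nsn",
    "je\tbiti\tVar3s",
    "test\ttest\tNcmsn",
    ".\t.\tZ",
    "</s>",
    "</p>",
    "</text>",
]

sentences = list(iter_sentences(sample))
for word, lemma, msd in sentences[0]:
    print(word, lemma, msd)
```

The sketch treats any line that both starts with `<` and ends with `>` as markup, which is a simplification: a real reader would need to handle tokens that themselves look like tags.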

§sampling declaration

This version of the corpus has been deduplicated at the paragraph level.
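This header does not say how duplicate paragraphs were detected. To illustrate the general idea of paragraph-level deduplication only, the sketch below drops every paragraph whose normalised text has already been seen anywhere in the collection. The function name `dedup_paragraphs`, the whitespace and case normalisation, and the use of exact hash matching are assumptions for this example, not a description of the procedure applied to srWaC.

```python
import hashlib
from typing import List


def dedup_paragraphs(docs: List[List[str]]) -> List[List[str]]:
    """Keep only the first occurrence of each paragraph across all documents."""
    seen = set()
    result = []
    for paragraphs in docs:
        kept = []
        for paragraph in paragraphs:
            # Normalise whitespace and case so trivial variants collide.
            normalised = " ".join(paragraph.split()).lower()
            key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(paragraph)
        if kept:  # documents left with no paragraphs are dropped
            result.append(kept)
    return result


docs = [
    ["Prvi pasus.", "Sva prava zadržana."],
    ["Novi tekst.", "sva prava  zadržana. "],
]
print(dedup_paragraphs(docs))
```

In this usage example the repeated footer in the second document is removed, while the first occurrence is kept.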

§text-profile description
§language usage
§language

ident = sr
§term

Serbian
§revision description
§change Tomaž Erjavec: Introduced new CMC MSDs (Xe, Xw, Xh, Xa) and corrected some wrong Adverb MSDs.
§date 2016-06-13
§change Tomaž Erjavec: Made teiHeader and vertical file.
§date 2016-04-19


Date: 2016-06-14

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.