

In addition to the morphosyntactic specifications (described in detail in later sections) and the corpus tags (specified in the Multext-East files tbl.tag.corpus.xx, where xx stands for the two-letter language code), a common set of corpus tags for punctuation marks has been defined for all the languages involved in the project. The table below lists the punctuation marks along with the corpus tags assigned to them. The corpus punctuation tags appear in the cesAna format of the disambiguated parallel multilingual corpus, as described in Deliverable D2.3 F. All 7 components (for each language involved) share this common set of corpus tags for punctuation.

=========== ========== =============================
Orthography Corpus tag Definition
=========== ========== =============================
    .       PERIOD     period (full-stop)
    ,       COMMA      comma
    ;       SCOLON     semi-colon
    :       COLON      colon
    ?       QUEST      question mark
    !       EXCL       exclamation mark
   ...      HELLIP     ellipsis
    —       DASH       dash
    (       LPAR       left (opening) parenthesis
    )       RPAR       right (closing) parenthesis
    "       ODBLQ      open double-quotes
    "       CDBLQ      close double-quotes
    -       HYPHEN     hyphen
    /       SLASH      slash
    [       LSQR       left (opening) square bracket
    ]       RSQR       right (closing) square bracket
=========== ========== =============================


The Orthography column gives the overt form of each punctuation mark as it appears in the parallel corpus.
No compound punctuation marks are used.
Opening and closing double quotes are orthographically identical; they are distinguished only by the corpus tag.
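
As an illustration only, the table can be read as a lookup from orthography to corpus tag. The sketch below is in Python and is not part of the Multext-East tools: the names `PUNCT_TAGS` and `tag_punctuation` are hypothetical, the dash is assumed to be the em-dash character shown in the table, and the rule that double quotes simply alternate between opening and closing is an assumption made for the example, not the procedure used to annotate the corpus.

```python
# Orthography -> corpus tag, copied from the table above.
# Double quotes are handled separately because both tags share one form.
PUNCT_TAGS = {
    ".": "PERIOD",
    ",": "COMMA",
    ";": "SCOLON",
    ":": "COLON",
    "?": "QUEST",
    "!": "EXCL",
    "...": "HELLIP",
    "\u2014": "DASH",   # assumption: the dash is encoded as an em-dash
    "(": "LPAR",
    ")": "RPAR",
    "-": "HYPHEN",
    "/": "SLASH",
    "[": "LSQR",
    "]": "RSQR",
}


def tag_punctuation(tokens):
    """Return (token, tag) pairs; tokens that are not punctuation get None."""
    tagged = []
    quote_open = False  # assumption: quotes strictly alternate open/close
    for tok in tokens:
        if tok == '"':
            tag = "CDBLQ" if quote_open else "ODBLQ"
            quote_open = not quote_open
        else:
            tag = PUNCT_TAGS.get(tok)
        tagged.append((tok, tag))
    return tagged


if __name__ == "__main__":
    for token, tag in tag_punctuation(['"', "Yes", ",", '"', "he", "said", "."]):
        print(token, tag)
```

The alternation heuristic fails on unbalanced or nested quotations, so a real annotation pipeline would need context to choose between ODBLQ and CDBLQ.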