Table of Contents
Using standard Word formatting
Conclusions and further work
Appendix 1. Auto-generated sections
This document is meant as an exemplar and test Word file for a docx2tei profile of the TEI Stylesheets. It also functions as the source for a Word template (.dotx) that can serve for authoring new or editing existing Word documents (primarily books) with the intention of converting them to TEI. How Word structures are converted to TEI is here explained only briefly; to see the details it is best to compare the Word document with the generated TEI one.
This file and the associated profile, as well as a mini Web converter, are available at http://nl.ijs.si/tei/convert/
In this document we give the actual Word styles as examples; when we refer to them they are set in italic, e.g. the style Quote. To show the TEI structures that these styles are converted to, we use XPath and underline it; for example, for <note place="left"> we write note[@place = "left"].
Here are some general hints about the conversion of the Word document to TEI using this profile:
This section reviews what kind of formatting we can do in standard Word to get appropriate TEI elements. The following section explains basic Word formatting (paragraphs, links, text effects), while the next two deal with character-level and paragraph-level styles. It is important to understand the distinction between the two, because the conversion to TEI is defined in terms of these two levels of styles. At the same time, Word can silently change one type of style into the other, which can lead to bad conversion results. When the TEI elements are not as expected, it often helps to reveal the Word formatting, i.e. to press the “Show/Hide ¶” button and open the Style gallery.
Plain paragraphs are converted to p. Empty paragraphs are removed even if they contain whitespace, e.g.:
All types of links (to web pages, mail addresses, and document-internal cross-references, e.g. to the section on TEI element styles) should be converted correctly. References to documents on disk will of course not work.
Formatting is converted to the value of hi/@rend: bold, italic, underline, and strikethrough are preserved, even when more than one of them is applied, e.g. italic bold underline. Colours are also converted: yellow background, red, light green, dark red, orange, yellow, light blue, blue, violet, underlined, bold and unusual. Finer details, such as the colour of the underline and fancier text effects, are not preserved.
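As a rough illustration of the mapping just described, the following Python sketch builds a hi element whose rend attribute lists the active text effects. The function name make_hi and the choice of joining the effect names with spaces are assumptions made for this example; the profile itself may name or order the values differently.

```python
import xml.etree.ElementTree as ET

def make_hi(text, effects):
    """Build a <hi> element whose @rend lists the given text effects.

    `effects` is a sequence of effect names such as "bold" or "italic".
    Joining them with single spaces is an assumption of this sketch.
    """
    hi = ET.Element("hi")
    hi.set("rend", " ".join(effects))
    hi.text = text
    return hi

# A run that is italic, bold and underlined at the same time.
example = make_hi("italic bold underline", ["italic", "bold", "underline"])
print(ET.tostring(example, encoding="unicode"))
```

To see what the profile really produces, compare the Word file with the generated TEI, as recommended above.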
Inside paragraphs we can have dates (with the Date style), which are converted to date, e.g. “It was a bright cold day in April, and the clocks were striking thirteen.”
Bulleted and numbered lists are supported, although the numbering style will not survive the conversion, e.g.:
By using the paragraph level Quote style a quote can be produced:
My fake plants died because I did not pretend to water them.
Here is a standard footnote1 and another,2 which should be converted without problems. We can also use endnotes1 although the difference between the two is moot in on-line editions.
Critical editions can also use marginal notes. Following one of the existing TEI profiles we define 4 MarginNote styles (MarginNoteLeft, MarginNoteRight, MarginNoteInner, MarginNoteOuter), which are used to the left and right of this text. They are converted to note[@place = "margin_xx"] where xx ∈ {left, right, inner, outer}. Note that the exact positioning of these notes is rather tricky.
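The correspondence between the four MarginNote styles and the note/@place values can be written down as a small lookup table. The Python sketch below is only an illustration: the style names and the margin_xx values come from the paragraph above, while the dictionary and the function margin_note are hypothetical names introduced for the example.

```python
import xml.etree.ElementTree as ET

# Word style name -> value of note/@place, as listed in the text above.
MARGIN_STYLES = {
    "MarginNoteLeft": "margin_left",
    "MarginNoteRight": "margin_right",
    "MarginNoteInner": "margin_inner",
    "MarginNoteOuter": "margin_outer",
}

def margin_note(style, text):
    """Return a <note> element for a paragraph in one of the MarginNote styles."""
    note = ET.Element("note", {"place": MARGIN_STYLES[style]})
    note.text = text
    return note

print(ET.tostring(margin_note("MarginNoteLeft", "A note on the left."), encoding="unicode"))
```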
Figures and esp. tables are the more problematic aspects of conversion, as there are many ways to include them into a Word document. The pictures have to be embedded in the Word document. Because the conversion takes as input a Word document, references to external images are not supported.
The included pictures should be in as high resolution as possible – it is not a good idea to copy & paste them into Word, as this often loses resolution. Also, avoid embedding TIFF images if the TEI is afterwards to be converted to HTML, as most Web browsers do not display TIFF.
If the figure has a caption, it should be made with “Insert Caption” so that it is in the correct style (Caption) and has automatic figure numbering. Note that the caption has to be below the image in order to get converted.
So, in short, the conversion supports embedded images with captions and references to them, cf. Figure 1, which can also be referred to as the Figure below.
It is possible to have two images in one figure (i.e. with one Caption). As shown in Figure 2 they can be side by side, or they can be separated by a paragraph mark, cf. Figure 3.
Figures in Word can also be embedded Excel graphs, as is the case with Figure 4. However, this conversion currently does not work.
We can also have pictures without captions. With this profile these are wrapped in figure, cf. below:
Tables, even somewhat more complicated ones (e.g. Table 2) can also be converted to TEI. However, the details of their layout and formatting will not be preserved. As with Figures, it is currently not possible to convert embedded Excel spreadsheets.
Word also supports the making of indexes and they are preserved in the conversion, as the example below shows (click on “Reveal formatting”, i.e. “Show ¶” to see the index marks):
“Here we are indexing the Web, Web services, and Web apps, but also bugs and errors. Note that the index terms can be in Word also formatted, which is lost in the TEI. We can have ranges though, like this.”
Support for bibliography is quite basic – use the Bibliography style, as below, to get a listBibl element with nested bibl elements, i.e. listBibl/bibl+.
Page breaks are preserved in TEI, even soft ones. However, page breaks can be problematic,
as they can appear inside any (even otherwise empty) element, like p, head, div. Hard line breaks also work, and are converted to lb.
There was a hard line break just before this sentence, and a hard page break follows
it.[Page]
In addition to standard Word styles, there is a special group of styles that start with “tei:” followed by (typically) the name of a TEI element. In the Word document these styles are given lots of eye-watering effects to distinguish them from other text.
In some cases the styles are mapped to more complicated structures. An example is the tei:lg style: if a series of paragraphs uses this style then each series ending with an empty paragraph is converted to lg, with l for the individual paragraph. To see the details for each style it is easiest to compare this file with the derived TEI.
These styles are paragraph level, i.e. they should mark complete paragraphs.
A citation is styled with tei:cit and can (as all other paragraph level styles) have included character level styles, in this case tei:bibl to mark a bibliographic item (note that citations should have a bibliographic item, whereas quotations, i.e. quote, do not need to):
For poetry the tei:lg style should be used. This is converted to lg for each stanza, and l for a line in a stanza. Note that a series of stanzas can be styled with tei:lg, and an empty paragraph will separate the stanzas, as in the example below:
There once was a man from Nantucket
Who kept all his cash in a bucket.
But his daughter, named Nan,
Ran away with a man
And as for the bucket, Nantucket.
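The stanza grouping described for tei:lg can be sketched in Python as follows. This is a simplified model, not the stylesheet code: it assumes the paragraphs styled with tei:lg arrive as a plain list of strings, and the function name paragraphs_to_lg is hypothetical.

```python
import xml.etree.ElementTree as ET

def paragraphs_to_lg(paragraphs):
    """Group tei:lg paragraphs into <lg> stanzas with one <l> per paragraph.

    An empty (or whitespace-only) paragraph closes the current stanza.
    """
    stanzas = []
    current = None
    for para in paragraphs:
        if not para.strip():
            current = None  # stanza boundary: the next line opens a new <lg>
            continue
        if current is None:
            current = ET.Element("lg")
            stanzas.append(current)
        ET.SubElement(current, "l").text = para
    return stanzas

poem = [
    "There once was a man from Nantucket",
    "Who kept all his cash in a bucket.",
    "",
    "But his daughter, named Nan,",
    "Ran away with a man",
]
for lg in paragraphs_to_lg(poem):
    print(ET.tostring(lg, encoding="unicode"))
```

The empty string in the sample list stands for the empty paragraph that separates two stanzas in Word.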
We can also have individual lines of poetry, without line group; for these the tei:l style should be used, e.g.:
There once was a man from Nantucket
Who kept all his cash in a bucket.
The tei:sp style should be used to mark a drama speech (sp). We implement the convention that the first paragraph goes to speaker and the rest are lines. As with a bibliography list, an empty paragraph will separate two speeches. For example:
Polonius
Though this be madness, yet there is method in’t.
Will you walk out of the air, my lord?
Hamlet
Into my grave.
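The tei:sp convention (first paragraph is the speaker, the rest are lines, an empty paragraph separates speeches) can be modelled with a short Python sketch. It assumes the styled paragraphs are available as a list of strings; the function name paragraphs_to_sp is hypothetical, and encoding the lines as l elements is an assumption of this example.

```python
import xml.etree.ElementTree as ET

def paragraphs_to_sp(paragraphs):
    """Turn tei:sp paragraphs into <sp> elements.

    The first paragraph of each speech becomes <speaker>, the following
    ones become <l> (an assumption of this sketch); an empty paragraph
    ends the speech.
    """
    speeches = []
    current = None
    for para in paragraphs:
        if not para.strip():
            current = None  # speech boundary
            continue
        if current is None:
            current = ET.Element("sp")
            speeches.append(current)
            ET.SubElement(current, "speaker").text = para
        else:
            ET.SubElement(current, "l").text = para
    return speeches

dialogue = [
    "Polonius",
    "Though this be madness, yet there is method in’t.",
    "Will you walk out of the air, my lord?",
    "",
    "Hamlet",
    "Into my grave.",
]
for sp in paragraphs_to_sp(dialogue):
    print(ET.tostring(sp, encoding="unicode"))
```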
Catch-word
Signature
Page number
Running Head
When transcribing primary sources, the fw element is used to mark text in the headers and footers of the pages, where fw/@type is used to distinguish the different types of these “forme works”. To make this annotation easier, several styles are defined which already set the value of @type:
Character level Word styles map to various TEI phrase level elements:
Note that if we have several such elements in a row there should be at least one character in between marked with Normal style in order to separate them. For example, (Pančur, 2011, 45-47; Erjavec, 2012, 30) will produce one bibl, whereas (Pančur, 2011, 45-47; Erjavec, 2012, 30) will produce two; the difference is in the semicolon, it is marked with the tei:bibl style in the former and Normal in the latter.3
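The effect of the separating character can be modelled by treating the text as a sequence of (style, text) runs and merging neighbouring runs that carry the same style. The Python sketch below is only such a model, offered to explain the observed behaviour; it is not how Word or the Stylesheets are implemented, and the function name merge_runs is hypothetical.

```python
from itertools import groupby

def merge_runs(runs):
    """Merge adjacent (style, text) runs that share the same style."""
    return [
        (style, "".join(text for _, text in group))
        for style, group in groupby(runs, key=lambda run: run[0])
    ]

# Semicolon styled with tei:bibl: everything collapses into one run.
one = merge_runs([
    ("tei:bibl", "Pančur, 2011, 45-47"),
    ("tei:bibl", "; "),
    ("tei:bibl", "Erjavec, 2012, 30"),
])

# Semicolon styled with Normal: the two items stay separate.
two = merge_runs([
    ("tei:bibl", "Pančur, 2011, 45-47"),
    ("Normal", "; "),
    ("tei:bibl", "Erjavec, 2012, 30"),
])

print(len([r for r in one if r[0] == "tei:bibl"]))
print(len([r for r in two if r[0] == "tei:bibl"]))
```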
So-called Janus (two-faced) elements are used mostly in text-critical editions and are special in that they can represent two alternative paths through the text of the document. When taken as alternative encodings they are wrapped in a subst or choice element. In particular, a contiguous series of del and add elements gets the parent element subst in the TEI, while ordered pairs of abbr followed by expan, orig followed by reg, and sic followed by corr get a parent choice element.
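The wrapping rule for Janus elements can be sketched in Python over a flat list of sibling elements. This is a minimal illustration under stated assumptions, not the profile's XSLT: the function name wrap_janus is hypothetical, and the decision to wrap a del/add series only when it contains at least two elements is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# Ordered pairs that receive a <choice> parent, as listed in the text.
CHOICE_PAIRS = {("abbr", "expan"), ("orig", "reg"), ("sic", "corr")}

def wrap_janus(elements):
    """Wrap del/add series in <subst> and ordered Janus pairs in <choice>."""
    result = []
    i = 0
    while i < len(elements):
        el = elements[i]
        if el.tag in ("del", "add"):
            # Collect the whole contiguous series of del and add elements.
            j = i
            while j < len(elements) and elements[j].tag in ("del", "add"):
                j += 1
            run = elements[i:j]
            if len(run) >= 2:  # assumption: a lone del or add is left as is
                subst = ET.Element("subst")
                subst.extend(run)
                result.append(subst)
            else:
                result.extend(run)
            i = j
        elif i + 1 < len(elements) and (el.tag, elements[i + 1].tag) in CHOICE_PAIRS:
            choice = ET.Element("choice")
            choice.extend(elements[i:i + 2])
            result.append(choice)
            i += 2
        else:
            result.append(el)
            i += 1
    return result

siblings = [ET.Element(tag) for tag in ("del", "add", "sic", "corr", "abbr")]
print([el.tag for el in wrap_janus(siblings)])
```

Note that a pair in the wrong order, e.g. corr followed by sic, is left unwrapped by this sketch, in line with the "ordered pairs" wording above.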
Examples of use:
It is possible to define new tei: styles, which will also get converted to the TEI element that follows the tei: prefix. It is useful to give them visual features to distinguish them from the surrounding text and other styles, and, as with the other styles, to use them only in contexts where the TEI element is allowed.
A down-converter to HTML (together with CSS) is also available in the JSI profile. The HTML simulates the look of the JSI template.docx, in particular the supported elements should look the same as the tei:* styles in Word. The intention is to offer a “round trip” for the author / editor of the Word file, so that errors can be seen by visually comparing the DOCX with the HTML.
We’ve used this approach to authoring TEI documents via Word for many years now, but so far the workflow was from RTF to TEI with home-grown XSLT, cf. http://nl.ijs.si/e-zrc/rtf2tei/. Now we’ve switched to the standard Stylesheets, and this document and the associated profile are our attempt in this direction.
The plan is to:
Including facsimiles:
Word can auto-generate various tables of contents. It is not clear whether these are worth including in the TEI document (except if pagination is kept), as they would probably be better generated automatically from a tei:divGen.
bugs
horrible bugs. See bugs
error, 6
formatting, 6
Range
Subrange, 9
Web, 6
Web app, 6
Web service, 6
It is also possible to have character-level styles in notes, e.g., AUMÜLLER, Jutta: Assimilation: Kontroversen um ein migrationspolitisches Konzept. Bielefeld: Transcript Verlag, 2009.