DOCX 2 TEI
Instructions and Example Document
Tomaž Erjavec, Andrej Pančur
2024-09-19

Table of Contents

Introduction

Using standard Word formatting

Basic formatting

Character level styles

Paragraph level styles

Notes

Figures

Tables

Indexes

Bibliography

Page and line breaks

TEI element styles

Paragraph level styles

Character level styles

Janus elements

Defining your own

Conversion to HTML

Conclusions and further work

TEI Stylesheet bugs

Appendix 1. Auto-generated sections

Index

Table of Figures

Table of Tables

1. Introduction

This document is meant as an exemplar and test Word file for a docx2tei profile of the TEI Stylesheets. It also functions as the source for a Word template (.dotx) that can serve for authoring new or editing existing Word documents (primarily books) with the intention of converting them to TEI. How Word structures are converted to TEI is here explained only briefly; to see the details it is best to compare the Word document with the generated TEI one.

This file and the associated profile, as well as a mini Web converter, are available at http://nl.ijs.si/tei/convert/

In this document we give the actual Word styles used as examples, and when we refer to them they are set in italic, e.g. the style Quote. The TEI structures that these styles are converted to are given as XPath expressions and underlined, e.g. for <note place="left"> we write note[@place = "left"].
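To make the XPath notation concrete, the following is a minimal sketch of how such a path could be evaluated against a converted file with Python's standard library. The TEI fragment is hand-written for illustration and is not actual converter output; the second note and its place="foot" value are assumptions added only to show that the predicate filters.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; ElementTree needs an explicit prefix mapping for it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample, NOT real output of the docx2tei profile.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>Main text<note place="left">A note in the left margin</note></p>
  <p>More text<note place="foot">An ordinary footnote</note></p>
</body>
"""

root = ET.fromstring(SAMPLE)
# The document writes this path as note[@place = "left"]; with a namespaced
# tree we add the prefix and search all descendants.
left_notes = root.findall(".//tei:note[@place='left']", TEI_NS)
left_texts = [note.text for note in left_notes]
print(left_texts)
```

The same pattern works for the other paths mentioned in this document, as long as the element names are given the tei: prefix.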

Here are some general hints about the conversion of the Word document to TEI using this profile:

2. Using standard Word formatting

This section reviews what kind of formatting we can do in standard Word to get appropriate TEI elements. The following section explains basic Word formatting (paragraphs, links, text effects) while the next two deal with character level and paragraph level styles. It is important to understand the distinction between the two, because the conversion to TEI is defined in terms of these two levels of styles. At the same time, Word does its magic and can change one type of style to the other, which can lead to bad conversion results. When the TEI elements are not as expected, it often helps to show the Word formatting, i.e. to press the “Show/Hide ¶” button and to look at the Style gallery.

2.1. Basic formatting

Plain paragraphs are converted to p. Empty paragraphs are removed even if they contain white-space, e.g.:

All types of links (to web pages, mail addresses, and document-internal cross-references, e.g. to the section on TEI element styles) should be converted correctly. References to documents on disk will of course not work.

Formatting is converted to the value of hi/@rend: bold, italic, underline, and strikethrough are preserved, also if more than one style is used, e.g. italic bold underline. Colours are also converted: yellow background, red, light green, dark red, orange, yellow, light blue, blue, violet, underlined, bold and unusual. The exact details, such as the colour of the underline and fancier text effects, are not preserved.

2.2. Character level styles

Inside paragraphs we can have dates (with the Date style), which are converted to date, e.g. “It was a bright cold day in April, and the clocks were striking thirteen.”

2.3. Paragraph level styles

Bulleted and numbered lists are supported, although the numbering style will not survive the conversion, e.g.:

  1. First item
    1. Subitem
      1. Subsubitem

By using the paragraph level Quote style a quote can be produced:

My fake plants died because I did not pretend to water them.

2.4. Notes

Margin Note Left
Margin Note Right

Here is a standard footnote[1] and another,[2] which should be converted without problems. We can also use endnotes,[1] although the difference between the two is moot in on-line editions.

Margin Note Inner
Margin Note Outer

Critical editions can also use marginal notes. Following one of the existing TEI profiles we define four MarginNote styles (MarginNoteLeft, MarginNoteRight, MarginNoteInner, MarginNoteOuter), which are used to the left and right of this text. They are converted to note[@place = "margin_xx"], where xx ∈ {left, right, inner, outer}. Note that the exact positioning of these notes is rather tricky.
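The mapping from the four style names to the @place values can be sketched as a small Python function. This only illustrates the naming rule stated above; the function name margin_place is hypothetical and the profile itself implements the conversion in XSLT.

```python
# The four MarginNote styles named in this document.
MARGIN_STYLES = (
    "MarginNoteLeft",
    "MarginNoteRight",
    "MarginNoteInner",
    "MarginNoteOuter",
)

def margin_place(style_name):
    """Return the note/@place value for a MarginNote style, else None."""
    if style_name not in MARGIN_STYLES:
        return None
    # Strip the common prefix and lower-case the rest: Left -> margin_left.
    suffix = style_name[len("MarginNote"):]
    return "margin_" + suffix.lower()

places = [margin_place(style) for style in MARGIN_STYLES]
print(places)
```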

2.5. Figures

Figures and especially tables are the more problematic aspects of conversion, as there are many ways to include them in a Word document. The pictures have to be embedded in the Word document: because the conversion takes a Word document as input, references to external images are not supported.

The included pictures should be in as high resolution as possible – it is not a good idea to copy & paste them into Word, as this often loses resolution. Also, avoid embedding TIFF images if the TEI is to be afterwards converted to HTML as most Web browsers do not display TIFF.

If the figure has a caption, it should be made with “Insert Caption” so that it is in the correct style (Caption) and has automatic figure numbering. Note that the caption has to be below the image in order to get converted.

So, in short, the conversion supports embedded images with captions and references to them, cf. Figure 1, which can also be referred to as the Figure below.

Figure 1 Some Statistics, as a picture

It is possible to have two images in one figure (i.e. with one Caption). As shown in Figure 2, they can be side by side, or they can be one above the other, i.e. separated by a paragraph mark, cf. Figure 3.

Figure 2 Two images, side by side

Figure 3 Two images one above the other

Figures in Word can also be embedded Excel graphs, as is the case with Figure 4. However, this conversion currently does not work.

Figure 4 Embedded Excel example

We can also have pictures without captions. With this profile these are wrapped in figure, cf. below:

2.6. Tables

Tables, even somewhat more complicated ones (e.g. Table 2) can also be converted to TEI. However, the details of their layout and formatting will not be preserved. As with Figures, it is currently not possible to convert embedded Excel spreadsheets.

Table 1 A simple table
      Lendava  Murska Sobota  Beltinci  Surrounding villages  Total
1778  14       0              0         0                     14
1793  19       14             21        6                     60
1812  23       13             40        0                     76
Table 2 A more complex table with multi-column cells
SLOVENE LANDS
144 192 182
96 89 145
Ljubljana (74) (76) (95)
Judicial districts in the territory of today's Slovene Littoral

2.7. Indexes

Word also supports the making of indexes and they are preserved in the conversion, as the example below shows (click on “Reveal formatting”, i.e. “Show ¶” to see the index marks):

“Here we are indexing the Web, Web services, and Web apps, but also bugs and errors. Note that the index terms can be in Word also formatted, which is lost in the TEI. We can have ranges though, like this.”

2.8. Bibliography

Support for bibliography is quite basic – use the Bibliography style, as below, to get a listBibl element with nested bibl elements, i.e. listBibl/bibl+.

  1. Aumüller, J.: Assimilation: Kontroversen um ein migrationspolitisches Konzept. Bielefeld: Transcript Verlag, 2009.
  2. Beller, S.: Wien und die Juden 1867-1938. Wien, Köln and Weimar: Böhlau, 1993.
  3. Fišer, D., Ljubešić, N. & Erjavec, T.: The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation 54, 223–246 (2020). https://doi.org/10.1007/s10579-018-9425-z

2.9. Page and line breaks

Page breaks are preserved in TEI, even soft ones. However, page breaks can be problematic, as they can appear inside any (even otherwise empty) element, like p, head, div. Hard line breaks also work, and are converted to lb.
There was a hard line break just before this sentence, and a hard page break follows it.[Page]

3. TEI element styles

In addition to standard Word styles, there is a special group of styles that start with “tei:” followed by (typically) the name of a TEI element. In the Word document these styles are given lots of eye-watering effects to distinguish them from other text.

In some cases the styles are mapped to more complicated structures. An example is the tei:lg style: if a series of paragraphs uses this style then each series ending with an empty paragraph is converted to lg, with l for the individual paragraph. To see the details for each style it is easiest to compare this file with the derived TEI.

3.1. Paragraph level styles

These styles are paragraph level, i.e. they should mark complete paragraphs.

A citation is styled with tei:cit and can (as all other paragraph level styles) have included character level styles, in this case tei:bibl to mark a bibliographic item (note that citations should have a bibliographic item, whereas quotations, i.e. quote, do not need to):

‘A spectre is haunting Europe; the spectre of Communism.’ The Communist Manifesto (1848), by Karl Marx and Friedrich Engels

For poetry the tei:lg style should be used. This is converted to lg for each stanza, and l for a line in a stanza. Note that a series of stanzas can be styled with tei:lg, and an empty paragraph will separate the stanzas, as in the example below:

There once was a man from Nantucket

Who kept all his cash in a bucket.

But his daughter, named Nan,

Ran away with a man

And as for the bucket, Nantucket.
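The grouping rule for tei:lg can be sketched in Python as follows. This is only an illustration of the behaviour described above, not the XSLT used by the profile: paragraphs are assumed to arrive as plain strings, an empty or whitespace-only string stands for an empty paragraph, and the position of the stanza break in the sample data is chosen arbitrarily.

```python
import xml.etree.ElementTree as ET

def paragraphs_to_lg(paragraphs):
    """Group tei:lg-styled paragraphs into <lg> stanzas of <l> lines.

    An empty (or whitespace-only) paragraph closes the current stanza.
    """
    stanzas, current = [], []
    for text in paragraphs:
        if text.strip():
            current.append(text)
        elif current:
            stanzas.append(current)
            current = []
    if current:
        stanzas.append(current)

    groups = []
    for stanza in stanzas:
        lg = ET.Element("lg")
        for line in stanza:
            ET.SubElement(lg, "l").text = line
        groups.append(lg)
    return groups

# Sample input; the empty string marks an (assumed) stanza break.
limerick = [
    "There once was a man from Nantucket",
    "Who kept all his cash in a bucket.",
    "",
    "But his daughter, named Nan,",
    "Ran away with a man",
]
groups = paragraphs_to_lg(limerick)
for lg in groups:
    print(ET.tostring(lg, encoding="unicode"))
```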

We can also have individual lines of poetry, without line group; for these the tei:l style should be used, e.g.:

There once was a man from Nantucket

Who kept all his cash in a bucket.

The tei:sp style should be used to mark a drama speech (sp). We implement the convention that the first paragraph goes to speaker and the rest are lines. As with a bibliography list, an empty paragraph will separate two speeches. For example:

Polonius

Though this be madness, yet there is method in’t.

Will you walk out of the air, my lord?

Hamlet

Into my grave.
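The convention for tei:sp can be sketched in the same way. The sketch below is an assumption-laden illustration, not the profile's XSLT: paragraphs are plain strings, an empty string separates speeches, and the paragraphs after the speaker are emitted as l elements because the text above calls them lines.

```python
import xml.etree.ElementTree as ET

def split_on_empty(paragraphs):
    """Split paragraph strings into groups at empty paragraphs."""
    groups, current = [], []
    for text in paragraphs:
        if text.strip():
            current.append(text)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def paragraphs_to_sp(paragraphs):
    """First paragraph of each group becomes <speaker>, the rest <l>."""
    speeches = []
    for group in split_on_empty(paragraphs):
        sp = ET.Element("sp")
        ET.SubElement(sp, "speaker").text = group[0]
        for line in group[1:]:
            ET.SubElement(sp, "l").text = line
        speeches.append(sp)
    return speeches

dialogue = [
    "Polonius",
    "Though this be madness, yet there is method in’t.",
    "Will you walk out of the air, my lord?",
    "",
    "Hamlet",
    "Into my grave.",
]
speeches = paragraphs_to_sp(dialogue)
for sp in speeches:
    print(ET.tostring(sp, encoding="unicode"))
```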

Catch-word

Signature

[Page]

Page number

Running Head

When transcribing primary sources, the fw element is used to mark text in the headers and footers of the pages, where the fw/@type is used to distinguish different types of these “forme works”. To make this annotation easier, several styles are defined which already set the value of @type:

3.2. Character level styles

Character level Word styles map to various TEI phrase level elements:

Note that if we have several such elements in a row there should be at least one character in between marked with Normal style in order to separate them. For example, (Pančur, 2011, 45-47; Erjavec, 2012, 30) will produce one bibl, whereas (Pančur, 2011, 45-47; Erjavec, 2012, 30) will produce two; the difference is in the semicolon, it is marked with the tei:bibl style in the former and Normal in the latter.[3]
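Why the style of the semicolon matters can be seen in a small Python sketch, which models the text as a list of (style, text) runs and merges adjacent runs that share a style. This is a simplified model of the behaviour described above, not the actual conversion code; the run boundaries in the sample data are assumptions.

```python
from itertools import groupby

def merge_runs(runs):
    """Merge adjacent (style, text) runs that carry the same style."""
    merged = []
    for style, group in groupby(runs, key=lambda run: run[0]):
        merged.append((style, "".join(text for _, text in group)))
    return merged

# Semicolon styled as tei:bibl: everything fuses into a single bibl.
fused = merge_runs([
    ("tei:bibl", "Pančur, 2011, 45-47"),
    ("tei:bibl", "; "),
    ("tei:bibl", "Erjavec, 2012, 30"),
])

# Semicolon styled as Normal: the two references stay separate.
separate = merge_runs([
    ("tei:bibl", "Pančur, 2011, 45-47"),
    ("Normal", "; "),
    ("tei:bibl", "Erjavec, 2012, 30"),
])

print(len([r for r in fused if r[0] == "tei:bibl"]))
print(len([r for r in separate if r[0] == "tei:bibl"]))
```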

3.2.1. Janus elements

So-called Janus (two-faced) elements are used mostly in text-critical editions and are special in that they can represent two alternative paths through the text of the document. When taken as alternative encodings they are wrapped in a subst or choice element. In particular, a contiguous series of del and add elements gets the parent element subst in the TEI, while ordered pairs of abbr followed by expan, orig followed by reg, and sic followed by corr get a parent choice element.

Examples of use:

  • For deletions and additions in the source text use tei:del (del) and tei:add (add):
    Delete text DeleteAdd, text AddDelete, text Delete Add DeleteAdd, text Add”.
  • For abbreviations and their expansions use tei:abbr (abbr) and tei:expan (expan):
    Abbreviation text AbbreviationExpansion, text Expansion”.
  • For original and regularised text use tei:orig (orig) and tei:reg (reg):
    Orig text OrigRegularised, text Regularised”.
  • For errors and their corrections use tei:sic (sic) and tei:corr (corr):
    Sic text SicCorr SicCorr SicCorr, text Corr”.
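The wrapping rules for Janus elements can be sketched in Python over a flat sequence of element names. This is an illustrative model only, not the profile's XSLT. It assumes that a lone del or add (a series of length one) is left unwrapped, which the text above does not state explicitly, and the name wrap_janus is hypothetical.

```python
# Ordered pairs that receive a <choice> parent.
CHOICE_PAIRS = {"abbr": "expan", "orig": "reg", "sic": "corr"}

def wrap_janus(tags):
    """Group a flat list of element names into Janus wrappers.

    Items of the result are either a plain name or a tuple
    (wrapper_name, [child_names]).
    """
    out = []
    i = 0
    while i < len(tags):
        tag = tags[i]
        if tag in ("del", "add"):
            # Collect the whole contiguous del/add series.
            j = i
            while j < len(tags) and tags[j] in ("del", "add"):
                j += 1
            if j - i > 1:
                out.append(("subst", tags[i:j]))
            else:
                # Assumption: a single del or add is not wrapped.
                out.append(tag)
            i = j
        elif i + 1 < len(tags) and CHOICE_PAIRS.get(tag) == tags[i + 1]:
            out.append(("choice", [tag, tags[i + 1]]))
            i += 2
        else:
            out.append(tag)
            i += 1
    return out

wrapped = wrap_janus(["del", "add", "del", "abbr", "expan", "sic", "corr", "add"])
print(wrapped)
```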

3.3. Defining your own

It is possible to define new tei: styles, which will also get converted to the TEI element that follows the tei: prefix. It is useful to give them visual features to distinguish them from the surrounding text and other styles, and, as with the other styles, to use them only in contexts where the TEI element is allowed.
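The naming rule for user-defined styles amounts to stripping the prefix, as in the following Python sketch. The function name and the sample style tei:persName are hypothetical and serve only as illustration; styles that the profile maps to more complicated structures (such as tei:lg) are not covered by this simple rule.

```python
def style_to_element(style_name, prefix="tei:"):
    """Return the TEI element name encoded in a tei: style name, else None."""
    if style_name.startswith(prefix) and len(style_name) > len(prefix):
        return style_name[len(prefix):]
    return None

# A hypothetical user-defined style and an ordinary Word style.
print(style_to_element("tei:persName"))
print(style_to_element("Quote"))
```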

4. Conversion to HTML

A down-converter to HTML (together with CSS) is also available in the JSI profile. The HTML simulates the look of the JSI template.docx, in particular the supported elements should look the same as the tei:* styles in Word. The intention is to offer a “round trip” for the author / editor of the Word file, so that errors can be seen by visually comparing the DOCX with the HTML.

5. Conclusions and further work

We have used this approach to authoring TEI documents via Word for many years now, but so far the workflow went from RTF to TEI with home-grown XSLT, cf. http://nl.ijs.si/e-zrc/rtf2tei/. Now we have switched to the standard TEI Stylesheets, and this document and the associated profile are our attempt in this direction.

The plan is to:

  1. Fix bugs & add features, cf. below
  2. Make a better converter than the current one? Maybe install OxGarage?
  3. Split current profile into two:
    1. for social studies (focus on tables, figures, indexes, names, potentially soft pbs)
    2. for humanities (facsimile, text-critical and other tei: styles).

Including facsimiles:

5.1. TEI Stylesheet bugs

[Page]

6. Appendix 1. Auto-generated sections

Word can auto-generate various tables of contents. It is not clear if these are worth including in the TEI document (except if pagination is kept), as they would probably be better automatically generated from a tei:divGen.

6.1. Index

bugs

horrible bugs. See bugs

error, 6

formatting, 6

Range

Subrange, 9

Web, 6

Web app, 6

Web service, 6

6.2. Table of Figures

6.3. Table of Tables

Notes
1. Small footnote (we don’t want a paragraph inside it).

2. A footnote with two paragraphs. This is the first.

   And this is the second.

3. It is also possible to have character-level styles in notes, e.g., AUMÜLLER, Jutta: Assimilation: Kontroversen um ein migrationspolitisches Konzept. Bielefeld: Transcript Verlag, 2009.

Notes
1. This is an example of an endnote.

Tomaž Erjavec, Andrej Pančur. Date: 2014-01-20