Creating Digital Editions using TEI

Joey Takeda, SFU DHIL

November 08, 2023

Unceded territory of the səl̓ilw̓ətaʔɬ (Tsleil-Waututh), kʷikʷəƛ̓əm (Kwikwetlem), Sḵwx̱wú7mesh Úxwumixw (Squamish), and xʷməθkʷəy̓əm (Musqueam) Nations

Hello!

About me: Developer at the Digital Humanities Innovation Lab @ Simon Fraser University

Elected member of the TEI Technical Council (2023-2027)

Link to today's presentation: https://www.sfu.ca/~takeda/teiworkshops/2023-11-08/

Some definitions

A scholarly edition is the critical representation of historic documents.
Scholarly digital editions are scholarly editions that are guided by a digital paradigm in their theory, method and practice.
A digitized edition is not a digital edition.
Patrick Sahle, "What is a Scholarly Digital Edition?" Digital Scholarly Editing: Theories and Practices, edited by Matthew James Driscoll and Elena Pierazzo. Open Book Publishers, 2016.

The TEI

A set of guidelines for encoding text

A non-profit organization

A community or consortium of users

Website: https://tei-c.org/

The TEI

Is a markup language written in XML

Currently in its 5th major revision (P5 4.1.0)

Used by many projects across the world in many different languages and for many different reasons

Within the noisy market place of the Digital Humanities, the TEI is a kind of senior member, an annoying parental figure for some, a benevolent one for others, something just too old-fashioned even to be considered for others.
Yet, over the last decade, it has become increasingly clear that the TEI is part of what makes the digital humanities happen.
Lou Burnard, "Conclusion: what is the TEI?" What is the Text Encoding Initiative: How to Add Intelligent Markup to Digital Resources. OpenEdition Press, 2014.

What the TEI is not

A language that describes how a text should be displayed online or in print: "performative and expressive significance of the input" vs "the aesthetics of the output".

A programming language: encoding your texts in TEI does not automatically do anything to them

Caveat: There are many, many tools for transforming TEI into other formats (Word documents, PDFs, and, of course, websites)

Some Examples

The Pulter Project

https://pulterproject.northwestern.edu/

Walter Benjamin Digital

https://www.walter-benjamin.online/

The Lyon in Mourning Project

[In development]

A Note on Technology

All of these editions look and behave differently (and have different audiences)

But all of these rely on a process of markup and encoding

Encoding, markup, et cetera...

At its core, textual encoding is a way of identifying and differentiating bits of text from other bits of text.

We already do this all the time!

Italics for emphasis

Underlining for titles

Bold for extra emphasis

Quotation marks for outside attribution or skepticism

All capitals to YELL

Encoding, markup, et cetera...

But these are contextual and local

E.g. different types of punctuation for levels of quotation

And they are subject to varying interpretations

E.g. I think these quotation marks denote a term, but maybe the author is just being sarcastic...

What is markup?

Markup refers to a structured way to identify and separate textual information

The most common form of markup is a structure called XML (aka "pointy brackets")

Semantics v. Display

Semantic or Descriptive markup = encoding what the thing is

Display or Presentational markup = encoding how you want that thing to look

Encoding Texts as Literary Criticism

Marking up text is an assertion of your knowledge and your interpretation of the text

What does the text (form and content) express?

The process is analytical, strategic, and interpretive.
It is analytical, in identifying a set of components into which the text can meaningfully be broken and whose relationship can be represented
Markup is strategic, in that text encoding is always aimed (deliberately or by default) at some intellectual or practical goal
And markup is interpretive, in that the act of encoding will always take place through a connection between an observing individual and a source object.
Julia Flanders, Syd Bauman, and Sarah Connell. "Text Encoding." Doing Digital Humanities, edited by Constance Crompton, Richard Lane, and Ray Siemens. Routledge, 2016.

XML Markup

Markup codifies intentions

"Sure"

<quotation>Sure</quotation>

<sarcasm>Sure</sarcasm>

<skepticism>Sure</skepticism>

<title>Sure</title>

XML

XML = eXtensible Markup Language

XML is not a set language unto itself, but a grammar

There is nothing inherent about the function of XML

It is purely a structure--a way of organizing

Anyone can conceive of an XML dialect (i.e. it is extensible)

XML is Everywhere

HTML (HyperText Markup Language: every website; a close cousin of XML rather than a strict XML dialect)

KML (Keyhole Markup Language: Google Maps)

RDF (Resource Description Framework: Library catalogues)

SVG (Scalable Vector Graphics: Digital Images)

OOXML (Office Open XML: this presentation, Word documents, et cetera)

XML

XML is hierarchical

XML is a tree-like structure

And is often described in genealogical terms

XML


<song>
    <verse>
        <line>I am the very model of a modern Major-General,</line>
        <line>I've information vegetable, animal, and mineral,</line>
        <line>I know the kings of England, and I quote the fights historical</line>
        <line>From <location>Marathon</location> to <location>Waterloo</location>, in order categorical;</line>
    </verse>
</song>

XML Explained

A name wrapped in pointy brackets is called a tag; a start tag, an end tag, and everything between them make up an element

E.g. <song> would be called the song element

All elements have a start tag and an end tag (an empty element can use a single self-closing tag, such as <pb/>)

E.g. <song> is the start tag and </song> is the end tag

XML Explained

Elements can also have attributes and each attribute must have a value

E.g. <song genre="country"> has a genre attribute with the value of country

(Think of attributes as you would in everyday life; people don't have "height" or "age" without a value)
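To see attributes from a program's point of view, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The `tempo` attribute and its value are made up for illustration; only `genre="country"` comes from the slide.

```python
import xml.etree.ElementTree as ET

# A <song> element with two attributes (tempo is a hypothetical addition).
song = ET.fromstring('<song genre="country" tempo="fast">Sure</song>')

# Attributes are exposed as name/value pairs.
print(song.tag)           # song
print(song.attrib)        # {'genre': 'country', 'tempo': 'fast'}
print(song.get("genre"))  # country

# An attribute with no value is rejected by the XML parser.
try:
    ET.fromstring("<song genre>Sure</song>")
    value_required = False
except ET.ParseError:
    value_required = True
print(value_required)     # True
```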

XML Explained

Elements cannot overlap: an element must close inside the element in which it opened

Well-formed: <shelf><book>Anna Karenina</book></shelf>

Not well-formed: <shelf><book>Anna Karenina</shelf></book>
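Any XML parser enforces this rule. A minimal Python sketch, assuming only the standard library; the helper name `is_well_formed` is made up for this example:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

nested = "<shelf><book>Anna Karenina</book></shelf>"
overlapping = "<shelf><book>Anna Karenina</shelf></book>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

XML editors run this kind of check continuously, which is why they flag an overlapping tag as soon as it is typed.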

XML Terminology

Elements nest and use genealogical terms

Root: The main wrapper element (<song>)

Child: Element directly within another element:
<line> is a child of <verse>

Parent: The containing element:
<song> is the parent element of <verse>
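The genealogical terms map directly onto what a parser builds. The sketch below is one illustrative way to inspect the tree with Python's standard library, using a shortened version of the song. `ElementTree` does not store parent links, so the `parent_of` lookup is built by hand.

```python
import xml.etree.ElementTree as ET

song_xml = """
<song>
    <verse>
        <line>I am the very model of a modern Major-General,</line>
        <line>From <location>Marathon</location> to <location>Waterloo</location>, in order categorical;</line>
    </verse>
</song>
""".strip()

root = ET.fromstring(song_xml)

# Build a child -> parent lookup by visiting every element and its children.
parent_of = {child: parent for parent in root.iter() for child in parent}

verse = root[0]                      # first child of the root
location = root.find(".//location")  # first <location> anywhere in the tree

print("Root:", root.tag)                                 # song
print("Children of verse:", [c.tag for c in verse])      # ['line', 'line']
print("Parent of verse:", parent_of[verse].tag)          # song
print("Parent of location:", parent_of[location].tag)    # line
```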

Questions so far?

The TEI

Offers a rich vocabulary and method to encode:

Bibliographic and structural features: page breaks, headers, footers, page numbers, line breaks, divisions, paragraphs, line groups, etc

Interpretative features: stage movement, emphasis, place names, proper names, dialogue direction, etc

Editorial apparatus: hands, witnesses, collation, gaps, additions, deletions, etc

Linguistic features: morphemes, feature structures, orthographic form, etc

Spoken features: incidents, pauses, shifts, "communicative phenomenon", etc

Metadata: various classification schemes, provenance, manuscript description, etc

Components of a (basic) TEI file

Root <TEI> element

A <teiHeader> that describes both the file and the primary source that you are transcribing (if applicable)

A <text> that contains the text of the document

Within <text>, you can have a <front>, a <body>, and a <back> (front and back matter are optional)

For example:

                
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>The Major General's Song</title>
         </titleStmt>
         <publicationStmt>
            <p>Publication Information</p>
         </publicationStmt>
         <sourceDesc>
            <bibl>From Gilbert and Sullivan's The Pirates of Penzance</bibl>
         </sourceDesc>
      </fileDesc>
  </teiHeader>
  <text>
      <body>
        <div type="song">
            <lg>
                <l>I am the very model of a modern Major-General,</l>
                <l>I've information vegetable, animal, and mineral,</l>
                <l>I know the kings of England, and I quote the fights historical</l>
                <l>From <placeName>Marathon</placeName> to
                     <placeName>Waterloo</placeName>, in order categorical;</l>
            </lg>
        </div>
      </body>
  </text>
</TEI>
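Once a file is encoded this way, its parts can be queried. The following is a minimal Python sketch, not a recommended toolchain, that reads a shortened copy of the file above with the standard library. Every TEI element lives in the TEI namespace, so each step of a query needs the `tei:` prefix. The prefix is an arbitrary local choice.

```python
import xml.etree.ElementTree as ET

# Map a local prefix of our choosing to the TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tei_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>The Major General's Song</title></titleStmt>
      <publicationStmt><p>Publication Information</p></publicationStmt>
      <sourceDesc><bibl>From Gilbert and Sullivan's The Pirates of Penzance</bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="song">
        <lg>
          <l>I am the very model of a modern Major-General,</l>
          <l>From <placeName>Marathon</placeName> to <placeName>Waterloo</placeName>, in order categorical;</l>
        </lg>
      </div>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(tei_xml)

# Metadata comes from the header; content comes from the text.
title = root.findtext(
    "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
)
places = [p.text for p in root.iterfind(".//tei:placeName", TEI_NS)]
line_count = len(root.findall(".//tei:l", TEI_NS))

print(title)       # The Major General's Song
print(places)      # ['Marathon', 'Waterloo']
print(line_count)  # 2
```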
                
            

TEI

Note that the TEI is huge (586* elements)

No one uses the entirety of the TEI tagset

Individual projects customize the TEI for their own needs, usually using a small subset of the overall tagset

E.g. Drama projects will use the drama tagset (<sp> for speech, <speaker> for speaker, et cetera) and discard the linguistic/dictionary tagset (<entry> for dictionary entries, <m> for morpheme, etc).
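Real customizations are written in the TEI's own customization format (ODD) and enforced by a schema. As an informal stand-in only, the Python sketch below counts which elements a file actually uses and compares them with a project's chosen subset. The `ALLOWED` set and the sample document are hypothetical, and the header is left out for brevity, so the sample is well-formed but not a complete TEI file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical subset for a drama project: speeches and verse lines only.
ALLOWED = {"TEI", "text", "body", "div", "sp", "speaker", "l"}

doc = ET.fromstring("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="scene">
    <sp><speaker>Major-General</speaker>
      <l>I am the very model of a modern Major-General,</l>
      <l>I've information vegetable, animal, and mineral,</l>
    </sp>
    <entry>a stray dictionary entry</entry>
  </div></body></text>
</TEI>""")

# Tags look like "{namespace}name"; keep only the local name when counting.
usage = Counter(el.tag.split("}")[-1] for el in doc.iter())
unexpected = sorted(set(usage) - ALLOWED)

print(usage["l"])   # 2
print(unexpected)   # ['entry']
```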

What to encode?

Input =/= Output

Encode what you care about and what you have time to encode

If you don't encode it, you can't do much with it
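A small illustration of both points, as a Python sketch under stated assumptions: the function name `render_inline` and the CSS class `place` are made up for this example. Because Marathon and Waterloo were tagged with <placeName>, a transformation can treat them specially in the output. An untagged phrase such as "the kings of England" could not be picked out the same way.

```python
import xml.etree.ElementTree as ET
from html import escape

TEI = "{http://www.tei-c.org/ns/1.0}"

def render_inline(element) -> str:
    """Render an element's mixed content as HTML, wrapping <placeName> in a span."""
    parts = [escape(element.text or "")]
    for child in element:
        inner = render_inline(child)
        if child.tag == TEI + "placeName":
            inner = f'<span class="place">{inner}</span>'
        parts.append(inner)
        # Text that follows a child element is stored on that child as its tail.
        parts.append(escape(child.tail or ""))
    return "".join(parts)

line = ET.fromstring(
    '<l xmlns="http://www.tei-c.org/ns/1.0">From <placeName>Marathon</placeName> '
    'to <placeName>Waterloo</placeName>, in order categorical;</l>'
)

html_line = f"<p>{render_inline(line)}</p>"
print(html_line)
# <p>From <span class="place">Marathon</span> to <span class="place">Waterloo</span>, in order categorical;</p>
```

The same encoded line could just as easily be turned into plain text, an index of places, or a map. The encoding records what the thing is, and each output decides how to show it.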

Why not use something else?

Document-centric vs. data-centric

XML is criticized for being verbose and complex

But so are texts!

And the TEI is full of documentation

The Guidelines: Some Examples

Okay, I'm convinced...where do I start?

oXygen XML Editor

https://www.oxygenxml.com/

LEAF Writer

https://leaf-writer.leaf-vre.org/

VSCode + Scholarly XML Extension

https://marketplace.visualstudio.com/items?itemName=raffazizzi.sxml

My stuff is encoded...how do I make it available?

oXygen Built-In Transformation Scenarios

https://www.oxygenxml.com/

TEI Garage

https://teigarage.tei-c.org/

CETEIcean 🐳

https://github.com/TEIC/CETEIcean

Scholarly Editing

https://scholarlyediting.org

I want to learn more!

Learning more about the TEI

The TEI Guidelines: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html

TEI By Example: https://teibyexample.org/

TEI GitHub: https://github.com/TEIC/TEI

TEI-L mailing list: https://listserv.brown.edu/cgi-bin/wa?A0=TEI-L

Thanks! (Questions?)