Joey Takeda, SFU DHIL
November 08, 2023
About me: Developer at the Digital Humanities Innovation Lab @ Simon Fraser University
Elected member of the TEI Technical Council (2023-2027)
Link to today's presentation: https://www.sfu.ca/~takeda/teiworkshops/2023-11-08/
A scholarly edition is the critical representation of historic documents.
Scholarly digital editions are scholarly editions that are guided by a digital paradigm in their theory, method and practice.
A digitized edition is not a digital edition.
A set of guidelines for encoding text
A non-profit organization
A community or consortium of users
Website: https://tei-c.org/
The TEI is a markup language written in XML
Currently in its 5th major revision (P5 4.1.0)
Used by many projects across the world in many different languages and for many different reasons
Within the noisy market place of the Digital Humanities, the TEI is a kind of senior member, an annoying parental figure for some, a benevolent one for others, something just too old-fashioned even to be considered for others.
Yet, over the last decade, it has become increasingly clear that the TEI is part of what makes the digital humanities happen.
Not a language that describes how a text should be displayed online or in print: the TEI encodes the "performative and expressive significance of the input", not "the aesthetics of the output".
Not a programming language: encoding your texts in TEI does not automatically do anything to them
Caveat: There are many, many tools for transforming TEI into other formats (Word documents, PDFs, and, of course, websites)
All of these editions look and behave differently (and have different audiences)
But all of these rely on a process of markup and encoding
At its core, textual encoding is a way of identifying and differentiating bits of text from other bits of text.
We already do this all the time!
Italics for emphasis
Underlining for titles
Bold for extra-emphasis
Quotation marks for outside attribution
or skepticism
All capitals to YELL
But these are contextual and local
E.g. different types of punctuation for levels of quotation
And they are subject to varying interpretations
E.g. I think these quotation marks denote a term, but maybe the author is just being sarcastic...
Markup refers to a structured way to identify and separate textual information
The most common form of markup is a structure called XML (aka "pointy brackets")
Semantic or Descriptive markup = encoding what the thing is
Display or Presentational markup = encoding how you want that thing to look
Marking up text is an assertion of your knowledge and your interpretation of the text
What does the text (form and content) express?
The process is analytical, strategic, and interpretive.
It is analytical, in identifying a set of components into which the text can meaningfully be broken and whose relationship can be represented
Markup is strategic, in that text encoding is always aimed (deliberately or by default) at some intellectual or practical goal
And markup is interpretive, in that the act of encoding will always take place through a connection between an observing individual and a source object.
Markup codifies intentions
"Sure"
<quotation>Sure</quotation>
<sarcasm>Sure</sarcasm>
<skepticism>Sure</skepticism>
<title>Sure</title>
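The four encodings of "Sure" above can be read back by a program, which is one way to see that the tag carries the interpretation while the transcribed text stays the same. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The same word, encoded under four different (invented) interpretations.
readings = [
    "<quotation>Sure</quotation>",
    "<sarcasm>Sure</sarcasm>",
    "<skepticism>Sure</skepticism>",
    "<title>Sure</title>",
]

for xml_text in readings:
    element = ET.fromstring(xml_text)
    # element.tag holds the encoder's interpretation; element.text the transcription.
    print(f"{element.text!r} encoded as {element.tag}")

# The interpretations differ, but the transcribed text is identical in every case.
interpretations = [ET.fromstring(r).tag for r in readings]
texts = {ET.fromstring(r).text for r in readings}
```

Plain quotation marks would leave a reader (or a program) to guess which of these readings was meant.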
XML = eXtensible Markup Language
XML is not a set language unto itself, but a grammar
There is nothing inherent about the function of XML
It is purely a structure--a way of organizing
Anyone can conceive of an XML dialect (i.e. it is extensible)
HTML (HyperText Markup Language: Every website)
KML (Keyhole Markup Language: Google Maps)
RDF (Resource Description Framework: Library catalogues)
SVG (Scalable Vector Graphics: Digital Images)
OOXML (Office Open XML: This presentation, word documents, et cetera)
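Because XML is a grammar rather than a fixed vocabulary, a generic XML parser will accept a dialect it has never seen before. The sketch below invents a small recipe vocabulary (every element and attribute name in it is made up for illustration) and reads it with Python's standard library.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: none of these names is defined by any standard.
recipe_xml = """
<recipe serves="2">
  <ingredient quantity="2" unit="cup">flour</ingredient>
  <ingredient quantity="1" unit="cup">water</ingredient>
  <step>Mix and bake.</step>
</recipe>
""".strip()

# The parser only checks the grammar (tags, nesting, attributes), not the names.
recipe = ET.fromstring(recipe_xml)
ingredients = [item.text for item in recipe.findall("ingredient")]
print(recipe.tag, recipe.get("serves"), ingredients)
```

What the names mean, and which names are allowed, is decided by the dialect (HTML, SVG, TEI, and so on), not by XML itself.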
XML is hierarchical
XML is a tree-like structure
And is often described in genealogical terms
<song>
  <verse>
    <line>I am the very model of a
      modern Major-General,</line>
    <line>I've information vegetable,
      animal, and mineral,</line>
    <line>I know the kings of England,
      and I quote the fights historical</line>
    <line>From <location>Marathon</location>
      to <location>Waterloo</location>,
      in order categorical;</line>
  </verse>
</song>
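The tree shape of the song can be made visible by walking it from the root downwards. This is a rough sketch in Python, using a shortened copy of the song; the outline function is a hypothetical helper written for this example, not part of any library.

```python
import xml.etree.ElementTree as ET

song_xml = """<song>
  <verse>
    <line>I am the very model of a modern Major-General,</line>
    <line>From <location>Marathon</location> to <location>Waterloo</location>, in order categorical;</line>
  </verse>
</song>"""


def outline(element, depth=0):
    """Return one indented row per element, parents before their children."""
    rows = ["  " * depth + element.tag]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


tree_outline = outline(ET.fromstring(song_xml))
print("\n".join(tree_outline))
```

Each level of indentation in the printed outline corresponds to one generation in the tree.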
A name enclosed in two pointy brackets identifies an element
E.g. <song> would be called the song element
All elements have start and end tags (an empty element may combine the two in a single self-closing tag, such as <pb/>)
E.g. <song> is the start tag and </song> is the end tag
Elements can also have attributes and each attribute must have a value
E.g. <song genre="country"> has a genre attribute with the value of country
(Think of attributes as you would in everyday life; people don't have "height" or "age" without a value)
Elements cannot overlap
✅ <shelf><book>Anna Karenina</book></shelf>
❌ <shelf><book>Anna Karenina</shelf></book>
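The rules about attributes and overlap are enforced by any XML parser, so they can be checked mechanically. Here is a small sketch with Python's standard library; is_well_formed is a hypothetical helper for this example, and the tempo attribute is invented to show what a missing attribute looks like.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Attributes are read by name; an attribute that is not present comes back as None.
song = ET.fromstring('<song genre="country">(lyrics)</song>')
genre = song.get("genre")
tempo = song.get("tempo")  # hypothetical attribute, absent from this element

results = {
    "attribute with a value": is_well_formed('<song genre="country">(lyrics)</song>'),
    "attribute without a value": is_well_formed("<song genre>(lyrics)</song>"),
    "properly nested": is_well_formed("<shelf><book>Anna Karenina</book></shelf>"),
    "overlapping": is_well_formed("<shelf><book>Anna Karenina</shelf></book>"),
}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

A document that fails these checks is not XML at all as far as software is concerned, which is why editors flag such errors immediately.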
Elements nest and use genealogical terms
Root: The main wrapper element (<song>)
Child: Element directly within another element:
<line> is a child of <verse>
Parent: The containing element:
<song> is the parent element of <verse>
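The genealogical terms map directly onto how software navigates a document. The sketch below uses Python's standard library on a shortened song; since ElementTree elements only know about their children, it builds a child-to-parent lookup (parent_of is a name chosen for this example).

```python
import xml.etree.ElementTree as ET

song = ET.fromstring(
    "<song><verse>"
    "<line>I am the very model of a modern Major-General,</line>"
    "<line>I've information vegetable, animal, and mineral,</line>"
    "</verse></song>"
)

# Children: iterating over an element yields the elements directly inside it.
verse = song.find("verse")
children_of_verse = [child.tag for child in verse]

# Parents: record, for every element in the tree, which element contains it.
parent_of = {child: parent for parent in song.iter() for child in parent}

first_line = verse.find("line")
print(first_line.tag, "is a child of", parent_of[first_line].tag)
print(verse.tag, "is a child of", parent_of[verse].tag)
```

The root is the only element with no entry in the lookup, because nothing contains it.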
Offers a rich vocabulary and method to encode:
Bibliographic and structural features: page breaks, headers, footers, page numbers, line breaks, divisions, paragraphs, line groups, etc
Interpretative features: stage movement, emphasis, place names, proper names, dialogue direction, etc
Editorial apparatus: hands, witnesses, collation, gaps, additions, deletions, etc
Linguistic features: morphemes, feature structures, orthographic form, etc
Spoken features: incidents, pauses, shifts, "communicative phenomenon", etc
Metadata: various classification schemes, provenance, manuscript description, etc
Root <TEI> element
A <teiHeader> that describes both the file and the primary source that you are transcribing (if applicable)
A <text> that contains the text of the document
Within <text>, you can have an optional <front>, a <body>, and an optional <back>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Major General's Song</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <bibl>From Gilbert and Sullivan's The Pirates of Penzance</bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="song">
        <lg>
          <l>I am the very model of a modern Major-General,</l>
          <l>I've information vegetable, animal, and mineral,</l>
          <l>I know the kings of England, and I quote the fights historical</l>
          <l>From <placeName>Marathon</placeName> to
            <placeName>Waterloo</placeName>, in order categorical;</l>
        </lg>
      </div>
    </body>
  </text>
</TEI>
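Once a text is encoded this way, the header and the marked-up features can be queried separately. The following is a minimal sketch, assuming Python's standard library and a shortened copy of the document above; note that every TEI element lives in the TEI namespace, so queries need the tei: prefix mapping defined here.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for queries; the URI is the one declared on the <TEI> root.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tei_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>The Major General's Song</title></titleStmt>
      <publicationStmt><p>Publication Information</p></publicationStmt>
      <sourceDesc><bibl>From Gilbert and Sullivan's The Pirates of Penzance</bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="song">
        <lg>
          <l>I am the very model of a modern Major-General,</l>
          <l>From <placeName>Marathon</placeName> to <placeName>Waterloo</placeName>, in order categorical;</l>
        </lg>
      </div>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(tei_xml)

# Metadata from the header.
title = root.findtext("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS)

# Every place name, wherever it occurs in the document.
places = [place.text for place in root.iterfind(".//tei:placeName", TEI_NS)]

# The plain text of each verse line, with the inline markup flattened out.
lines = ["".join(line.itertext()) for line in root.iterfind(".//tei:l", TEI_NS)]

print(title, places, lines, sep="\n")
```

The list of places is only available because the encoder chose to tag them with <placeName>.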
Note that the TEI is huge (586* elements)
No one uses the entirety of the TEI tagset
Individual projects customize the TEI for their own needs, usually using a small subset of the overall tagset
E.g. Drama projects will use the drama tagset (<sp> for speech, <speaker> for speaker, et cetera) and discard the linguistic/dictionary tagset (<entry> for dictionary entries, <m> for morpheme, etc).
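The idea of working with a subset can be illustrated with a toy check that reports which elements in a document fall outside a project's chosen list. This is only a rough stand-in: real projects express their customization in a TEI ODD file and validate against a schema generated from it. The ALLOWED set and the sample fragment below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset for a small drama project; not an official TEI module list.
ALLOWED = {"body", "div", "sp", "speaker", "l", "stage"}

sample = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <sp>
    <speaker>Major-General</speaker>
    <l>I am the very model of a modern Major-General,</l>
  </sp>
  <entry><m>un</m></entry>
</body>"""


def elements_outside_subset(xml_text, allowed):
    """Return the sorted names of elements used in xml_text but not in allowed."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}name"; keep only the name.
    used = {element.tag.split("}")[-1] for element in root.iter()}
    return sorted(used - allowed)


unexpected = elements_outside_subset(sample, ALLOWED)
print(unexpected)
```

Here the dictionary elements are flagged because this imagined drama project chose not to include them.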
Input =/= Output
Encode what you care about and what you have time to encode
If you don't encode it, you can't do much with it
Document-centric vs. data-centric
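"Input =/= Output" can be shown with a very small transformation: the encoding says what something is, and a separate step decides how it should look. The sketch below turns a TEI line group into HTML using Python's standard library. In practice such transformations are often written in XSLT; the class names "verse" and "place" and the line_to_html helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

tei_xml = """<lg xmlns="http://www.tei-c.org/ns/1.0">
  <l>I am the very model of a modern Major-General,</l>
  <l>From <placeName>Marathon</placeName> to <placeName>Waterloo</placeName>, in order categorical;</l>
</lg>"""


def line_to_html(line):
    """Render one <l>: text is escaped, each <placeName> becomes a styled <span>."""
    parts = [escape(line.text or "")]
    for child in line:
        if child.tag == f"{{{TEI_URI}}}placeName":
            parts.append(f'<span class="place">{escape(child.text or "")}</span>')
        else:
            # Any other inline element is flattened to its plain text.
            parts.append(escape("".join(child.itertext())))
        # Text that follows the child element, up to the next element.
        parts.append(escape(child.tail or ""))
    return "".join(parts)


lg = ET.fromstring(tei_xml)
html_lines = [line_to_html(line) for line in lg.findall("tei:l", TEI_NS)]
html_output = '<p class="verse">' + "<br/>".join(html_lines) + "</p>"
print(html_output)
```

A different output (a PDF, a gazetteer of places, a word count) could be produced from the same encoded file without touching the encoding.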
XML is criticized for being verbose and complex
But so are texts!
And the TEI is full of documentation
Scholarly XML (Visual Studio Code extension): https://marketplace.visualstudio.com/items?itemName=raffazizzi.sxml
The TEI Guidelines: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html
TEI By Example: https://teibyexample.org/
TEI GitHub: https://github.com/TEIC/TEI
TEI-L mailing list (listserv): https://listserv.brown.edu/cgi-bin/wa?A0=TEI-L