Joey Takeda, SFU DHIL
March 17, 2023
At its core, textual encoding is a way of identifying and differentiating bits of text from other bits of text.
We do this all the time!
Italics for emphasis
Underlining for titles
Bold for extra-emphasis
Quotation marks for outside attribution or skepticism
All capitals to YELL
+++
But these are contextual and local
E.g. different types of punctuation for levels of quotation
And they are subject to varying interpretations
E.g. I think these quotation marks denote a term, but maybe the author is just being sarcastic...
Marking up text is an assertion of your knowledge and your interpretation of the text
What does the text (form and content) express?
The process is analytical, strategic, and interpretive.
It is analytical, in identifying a set of components into which the text can meaningfully be broken and whose relationship can be represented
Markup is strategic, in that text encoding is always aimed (deliberately or by default) at some intellectual or practical goal
And markup is interpretive, in that the act of encoding will always take place through a connection between an observing individual and a source object.
XML = eXtensible Markup Language
XML is not a set language unto itself, but a grammar
There is nothing inherent about the function of XML
It is purely a structure--a way of organizing
Anyone can conceive of an XML dialect (i.e. it is extensible)
HTML (HyperText Markup Language: Every website; strictly speaking, only its XHTML form is XML)
KML (Keyhole Markup Language: Google Maps)
RDF (Resource Description Framework: Library catalogues)
SVG (Scalable Vector Graphics: Digital Images)
OOXML (Office Open XML: This presentation, Word documents, et cetera)
Markup codifies intentions
"Sure"
<quotation>Sure</quotation>
<sarcasm>Sure</sarcasm>
<skepticism>Sure</skepticism>
<title>Sure</title>
XML is hierarchical
XML is a tree-like structure
And is often described in genealogical terms
Think of the hierarchy of the book:
Book
  Chapters
    Sections
      Paragraphs
        Sentences
          Words
            Letters
<book>
  <chapter>
    <section>
      <paragraph>
        <sentence>
          <word>
            <letter></letter>
          </word>
        </sentence>
      </paragraph>
    </section>
  </chapter>
</book>
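An element is a start tag, an end tag, and everything between them; each tag is a name wrapped in pointy brackets
E.g. <book>...</book> would be called the book element
All elements have start and end tags (an empty element, such as <lb/>, combines both in a single tag)
E.g. <book> is the start tag and </book> is the end tag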
Elements can also have attributes and each attribute must have a value
E.g. <book type="primary"> has a type attribute with the value of primary
(Think of attributes as you would in everyday life; people don't have "height" or "age" without a value)
Elements cannot overlap
<sentence><word>Word1</word></sentence> is right (the inner element closes before the outer one)
<sentence><word>Word1</sentence></word> is wrong (the tags overlap, so the file is not well-formed)
Elements nest and use genealogical terms
There is always a root element
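A minimal sketch of the genealogical terms, using a hypothetical mini-document. The element names and the n and type values are invented for illustration and are not TEI:

```xml
<!-- <book> is the root element: every other element sits inside it -->
<book type="primary">
  <!-- The two <chapter> elements are children of <book> and siblings of each other -->
  <chapter n="1">
    <!-- <paragraph> is a child of <chapter> and a descendant of <book> -->
    <paragraph>First paragraph.</paragraph>
  </chapter>
  <chapter n="2">
    <!-- <chapter> is the parent of this <paragraph>; <book> is its ancestor -->
    <paragraph>Second paragraph.</paragraph>
  </chapter>
</book>
```

Every start tag here has a matching end tag, and each element closes before its parent does, so nothing overlaps.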
The TEI (Text Encoding Initiative) is:
A set of guidelines for encoding text
A non-profit organization
A community or consortium of users
Website: https://tei-c.org/
A markup language written in XML
Currently in its 5th major revision (P5 4.5.0)
Used by many projects across the world in many different languages and for many different reasons
The TEI is not a language that describes how a text should be displayed online or in print: "performative and expressive significance of the input" vs "the aesthetics of the output".
The TEI is not a programming language: encoding your texts in TEI does not automatically do anything to them
Caveat: There are many, many tools for transforming TEI into other formats (Word documents, PDFs, and, of course, websites)
Root <TEI> element
A <teiHeader> that describes both the file and the primary source that you are transcribing (if applicable)
A <text> that contains the text of the document
Within <text>, you typically have a <body>, optionally preceded by a <front> and followed by a <back>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text here.</p>
    </body>
  </text>
</TEI>
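The skeleton above only uses <body>. As a rough sketch of how <front> and <back> sit around it, the <text> element of a longer document might look like the following. The type values and the placeholder text are assumptions for illustration, not part of the workshop files:

```xml
<text>
  <!-- Material that comes before the main text, e.g. a preface -->
  <front>
    <div type="preface">
      <p>Prefatory text here.</p>
    </div>
  </front>
  <!-- The main text of the document -->
  <body>
    <div type="chapter">
      <head>Chapter heading</head>
      <p>Some text here.</p>
    </div>
  </body>
  <!-- Material that follows the main text, e.g. an appendix -->
  <back>
    <div type="appendix">
      <p>Appended text here.</p>
    </div>
  </back>
</text>
```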
Download:
https://sfu.ca/~takeda/teiworkshop/2023-03-17/wea_bio.zip
Moving to screenshare....
To tag an element quickly, highlight the text, press Command + E, and type in the element name
Always make sure the file is valid! Validate, validate, validate! (Red checkmark OR Command + Shift + V)
The TEI offers a rich vocabulary and method to encode:
Bibliographic and structural features: page breaks, headers, footers, page numbers, line breaks, divisions, paragraphs, line groups, etc
Interpretative features: stage movement, emphasis, place names, proper names, dialogue direction, etc
Editorial apparatus: hands, witnesses, collation, gaps, additions, deletions, etc
Linguistic features: morphemes, feature structures, orthographic form, etc
Spoken features: incidents, pauses, shifts, "communicative phenomenon", etc
Metadata: various classification schemes, provenance, manuscript description, etc
+++++
Note that the TEI is huge (569 elements)
No one uses the entirety of the TEI tagset
Individual projects customize the TEI for their own needs, usually using a small subset of the overall tagset
E.g. Drama projects will use the drama tagset (<sp> for speech, <speaker> for speaker, et cetera) and discard the linguistic/dictionary tagset (<entry> for dictionary entries, <m> for morpheme, etc).
The TEI is one big schema: a set of rules about how things are structured
TEI projects usually customize their schema to use only a subset
The WEA uses 151 elements (and, in reality, probably way fewer)
And soon, we'll use more!
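Customizations like this are usually written in the TEI's own customization format (ODD) and then compiled into a schema. The fragment below is only a sketch of the idea: the ident value and the choice of elements are assumptions for illustration, and this is not the WEA's actual customization:

```xml
<schemaSpec ident="screenplay_sketch" start="TEI">
  <!-- The tei module supplies the infrastructure every TEI schema needs -->
  <moduleRef key="tei"/>
  <!-- The whole header module, so <teiHeader> and its contents are available -->
  <moduleRef key="header"/>
  <!-- Only a handful of elements from the core module -->
  <moduleRef key="core" include="p head pb lb name note title sp speaker"/>
  <!-- Only the structural elements this hypothetical project needs -->
  <moduleRef key="textstructure" include="TEI text body div opener byline"/>
</schemaSpec>
```

Any element not listed in an include attribute is left out of the resulting schema, so validation will flag it if someone uses it.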
<pb n="1"/>
<byline>
  Treatment and dialogue by:
  <name ref="pers:WE1">Winnifred Reeve</name>
</byline>
<noteGrp type="annotation" place="top right">
  <note>From P.1 to 38 this all new original
    material by Reeve</note>
  <note>Also last sequences practically
    original except for block incident</note>
</noteGrp>
<head>Ropes</head>
<opener>
  <byline>By
    <lb/><name>Wilbur Daniel
    Steele</name>
  </byline>
</opener>
<p><wea:slugline>Look-out station.</wea:slugline> Paul is
  smiling as he looks through glasses. The other boy,
  also with glasses, is looking through them at the water.</p>
<sp>
  <speaker>Life Guard</speaker>
  <p>The water's pretty rough.</p>
</sp>
<sp>
  <speaker>Paul</speaker>
  <p>Look at that girl.</p>
</sp>
<sp>
  <speaker>Life Guard</speaker>
  <p>And there's a powerful undertow off there.</p>
</sp>
Download encoding package from
https://jenkins.hcmc.uvic.ca/job/WEA/lastSuccessfulBuild/artifact/products/site/contribute.html
Open in oXygen
Change the purple lines at the top (the xml-model instructions) from "../sch/wea.rng" to "https://www.sfu.ca/~takeda/teiworkshop/2023-03-17/screenplay.rng"
Before:
<?xml-model href="../sch/wea.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="../sch/wea.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
After:
<?xml-model href="https://www.sfu.ca/~takeda/teiworkshop/2023-03-17/screenplay.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://www.sfu.ca/~takeda/teiworkshop/2023-03-17/screenplay.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
Paragraphs with <p>
Speeches tagged with <sp>; speaker tagged with <speaker>; text tagged with <p>
Headings tagged with <head>
Page beginnings: <pb/>
Page numbers: <fw type="pageNum">
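Put together, the tasks in this list might produce something like the sketch below. The page number, heading, and text are invented placeholders rather than a transcription from the archive, and whether each element is allowed in a given position depends on the workshop schema, so validate as you go:

```xml
<body>
  <!-- A new page begins here -->
  <pb n="2"/>
  <!-- The page number as it appears on the page -->
  <fw type="pageNum">2</fw>
  <head>Scene heading</head>
  <p>A short paragraph of description.</p>
  <!-- One speech: the speaker label, then the spoken text -->
  <sp>
    <speaker>Paul</speaker>
    <p>A line of dialogue.</p>
  </sp>
</body>
```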
If you notice something, look through the TEI Guidelines for something that might work
Use Ropes as an example (but note that the encoding is still experimental)
When in doubt, ask!
Presentation:
https://sfu.ca/~takeda/teiworkshop/2023-03-17/
Ropes:
https://sfu.ca/~takeda/teiworkshop/2023-03-17/Ropes.xml
TEI Guidelines:
https://tei-c.org