Joey Takeda, SFU DHIL
March 29, 2023
Materials here: https://sfu.ca/~takeda/xpath
XPath is a syntax for navigating and querying the XML tree
The core syntax for the "X-languages" (XQuery, XSLT) to transform and manipulate XML documents
XPath can be used in JavaScript (much like CSS selectors)
XQuery (to create server-side web applications); XSLT (to transform XML into HTML, PDF, et cetera)
Used frequently in Python for Natural Language Processing and data analysis
XML = eXtensible Markup Language
XML is not a set language unto itself, but a grammar
There is nothing inherent about the function of XML
It is purely a structure--a way of organizing
Anyone can conceive of an XML dialect (i.e. it is extensible)
XML is hierarchical
XML is a tree-like structure
And is often described in genealogical terms
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
Each component of an XML tree is called a node
There are three main types of nodes: elements, text nodes, and attributes
Element: a name wrapped in angle brackets (< and >)
E.g. <book> would be called the book element
All* elements have start and end tags
E.g. <book> is the start tag and </book> is the end tag
Text node: any text contained by an element
The <magazine> element has a single text node, "The New Yorker"
How many text nodes?
<books>
  <book>Fifteen Dogs</book>
  <book>I Am A Cat</book>
</books>
Wait...what about <knickknack/> ?
A special shortcut called a "self-closing" or "empty" element
<knickknack/> is equivalent to <knickknack></knickknack>
Elements can also have attributes and each attribute must have a value
E.g. <book author="André Alexis"> has an author attribute with the value "André Alexis"
(Think of attributes as you would in everyday life; people don't have "height" or "age" without a value)
Elements nest and use genealogical terms
Root: The main wrapper element (<shelf>)
Child: Element directly within another element:
<book> is a child of <books>
Parent: The containing element:
<magazines> is the parent element of <magazine>
Ancestor: A parent, the parent's parent, and so on up to the root
Descendant: Children, children's children, and so on down the tree
Preceding sibling and following sibling: Elements that share a parent and come before or after a given element
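The family vocabulary above maps closely onto what an XML parser exposes. A minimal sketch in Python, assuming the standard-library xml.etree.ElementTree module and the shelf document from earlier (with its magazines end tag closed); the variable names are illustrative only:

```python
import xml.etree.ElementTree as ET

SHELF = """<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>"""

root = ET.fromstring(SHELF)  # the root element, <shelf>

# Children: the elements directly inside the root
children = [child.tag for child in root]

# Descendants: every element below the root, at any depth
descendants = [el.tag for el in root.iter() if el is not root]

# ElementTree keeps no parent links, so build a child -> parent lookup
parent_of = {child: parent for parent in root.iter() for child in parent}
magazine = root.find("magazines/magazine")

print(root.tag, children, parent_of[magazine].tag)
```

Here <magazines> comes back as the parent of <magazine>, and the descendants list includes both <book> elements even though they are not direct children of <shelf>.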
An XPath expression specifies the location of a node in the XML tree
To navigate to the magazine element: /shelf/magazines/magazine
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
The leading slash represents the document root
Each / represents a step in the tree
The / is also known as the child axis
Think of "/" as meaning "direct child of"
So /shelf/magazines/magazine = "Magazine is a direct child of magazines, which is a direct child of shelf"
An XPath statement will return all nodes that match a path
Text nodes are special and are called text()
/shelf/books/book
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
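To see a path return every matching node, here is a rough sketch using Python's standard-library xml.etree.ElementTree. Note the assumption: ElementTree supports only a subset of XPath and evaluates paths relative to an element, so /shelf/books/book is written as ./books/book starting from <shelf>.

```python
import xml.etree.ElementTree as ET

SHELF = """<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>"""

root = ET.fromstring(SHELF)  # <shelf>

# Equivalent of /shelf/books/book: each "/" is one child step
books = root.findall("./books/book")
titles = [book.text for book in books]

print(titles)
```

Both <book> elements match the path, so both are returned, in document order.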
Attributes can be referred to using the "@" with their name
/shelf/books/book/@author
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
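A small sketch of reading the author attributes from Python. This assumes the standard-library xml.etree.ElementTree module, which cannot return attribute nodes from a path such as /shelf/books/book/@author; instead, the sketch selects the elements and reads the attribute with get().

```python
import xml.etree.ElementTree as ET

SHELF = """<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>"""

root = ET.fromstring(SHELF)

# Rough stand-in for /shelf/books/book/@author:
# select the <book> elements, then read each author value
authors = [book.get("author") for book in root.findall("./books/book")]

# An attribute that is absent simply yields None
magazine_author = root.find("./magazines/magazine").get("author")

print(authors, magazine_author)
```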
Text nodes can be referred to using text()
Any element can be referred to using * (as a wildcard selector)
Any node, including text and element nodes, can be referred to using node()
/shelf/*/* = All element grandchildren of shelf
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
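The wildcard step can be tried the same way. A brief sketch, assuming Python's standard-library xml.etree.ElementTree (which supports * but not node() or text()), with the path written relative to <shelf>:

```python
import xml.etree.ElementTree as ET

SHELF = """<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>"""

root = ET.fromstring(SHELF)

# ./* : any element that is a child of <shelf>
child_tags = [el.tag for el in root.findall("./*")]

# ./*/* : any element that is a child of a child, i.e. a grandchild
grandchild_tags = [el.tag for el in root.findall("./*/*")]

print(child_tags, grandchild_tags)
```

<knickknack/> has no children, so it contributes nothing to the grandchildren.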
/shelf/books/node()
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
/shelf/books/book/text()
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
What will /shelf/books/text() return?
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
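One way to explore this question is to list the text nodes that sit directly inside <books>. The sketch below assumes Python's standard-library xml.dom.minidom, which keeps whitespace-only text nodes, and assumes the document is indented with line breaks as shown; a parser configured to strip whitespace would behave differently.

```python
from xml.dom import minidom

SHELF = """<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>"""

doc = minidom.parseString(SHELF)
books = doc.getElementsByTagName("books")[0]

# Rough stand-in for /shelf/books/text(): text nodes that are
# direct children of <books>
direct_text = [n.data for n in books.childNodes
               if n.nodeType == n.TEXT_NODE]

# Rough stand-in for /shelf/books/book/text(): text nodes inside each <book>
titles = [n.data
          for book in books.getElementsByTagName("book")
          for n in book.childNodes
          if n.nodeType == n.TEXT_NODE]

print(len(direct_text), titles)
```

Under these assumptions, the only text directly inside <books> is the whitespace (line breaks and indentation) between the tags; the titles belong to the <book> elements one level down.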
Go to the workshop page: https://sfu.ca/~takeda/xpath
Click the "Download Hamlet" button (and remember where you save it)
Open oXygen and go to File > Open and open the Hamlet file
Go to Window > Show View > XPath/XQuery Builder
<div> = a division (act or scene)
<sp> = speech
<speaker> = speech prefix (i.e. the name printed in the play)
<l> = a verse line
<p> = a paragraph
<stage> = a stage direction
Retrieve all acts (direct children of body)
/body/div (5)
Retrieve all scenes (direct children of acts)
/body/div/div (20)
Retrieve all stage directions (descendant of body)
/body/div/div/stage
But also /body/div/div/sp/stage
But also /body/div/div/sp/l/stage
But also /body/div/div/sp/speaker/stage
But also /body/div/div/sp/p/stage
...and so on, for every other possible nesting
Descendants = "//"
//stage = every stage that is a descendant of the root element
Retrieve all stage directions
//stage (284)
How many speeches (<sp>) are in Hamlet?
//sp (1138)
Retrieve all stage directions contained in a speech
//sp/stage (134)
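The difference between child steps and the descendant shortcut can be sketched on a tiny stand-in document. Everything in PLAY below is invented for illustration (it is not the workshop's Hamlet file), and the sketch assumes Python's standard-library xml.etree.ElementTree, where // is written .// relative to the starting element.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, not the real Hamlet markup
PLAY = """<body>
  <div type="act" n="1">
    <div type="scene" n="1">
      <stage>Enter two sentinels</stage>
      <sp who="A">
        <speaker>A</speaker>
        <l>First line <stage>aside</stage></l>
        <stage>Exit</stage>
      </sp>
    </div>
  </div>
</body>"""

body = ET.fromstring(PLAY)

# /body/div/div/stage: only stages that are direct children of a scene
scene_level = [s.text for s in body.findall("./div/div/stage")]

# //stage: every stage, however deeply nested
all_stages = [s.text for s in body.findall(".//stage")]

# //sp/stage: stages that are direct children of a speech
in_speech = [s.text for s in body.findall(".//sp/stage")]

print(scene_level, all_stages, in_speech)
```

The stage nested inside <l> is found only by the descendant query, which is why a single //stage replaces the long list of child paths.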
Axis | Short | Full | Example | Note |
---|---|---|---|---|
Child | / | child:: | /body/div; /body/child::div | |
Descendant or self | // | descendant-or-self:: | //sp; /*/descendant-or-self::sp | |
Descendant | | descendant:: | /body//sp; /body/descendant::sp | |
Parent | .. | parent:: | /body/div/div/sp/parent::div; //div/.. | |
Ancestor | | ancestor:: | //speaker/ancestor::div | |
Ancestor or self | | ancestor-or-self:: | //div/ancestor-or-self::div | |
Self | | self:: | //stage/parent::*/self::stage | This is mostly useful in conjunction with other axes |
Axis | Short | Full | Example |
---|---|---|---|
Preceding-sibling | | preceding-sibling:: | //stage/preceding-sibling::l |
Following-sibling | | following-sibling:: | //div/stage/following-sibling::sp |
Preceding | | preceding:: | //stage/preceding::lb |
Following | | following:: | //sp/following::stage |
Retrieve the parent speech (<sp>) from all stage directions
//stage/parent::sp (99)
Retrieve all of the @who values for speech
//sp/@who (1137)
How many verse lines (<l>) follow a stage direction in a speech?
//sp/stage/following-sibling::l (403)
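Walking down to a stage and back up to its parent returns each parent only once, which is why counting stages in speeches and counting speeches that contain stages can give different numbers. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree (which supports the .. step but not the named axes) and an invented SCENE fragment:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only
SCENE = """<div type="scene" n="1">
  <stage>Enter</stage>
  <sp who="A"><stage>Aside</stage><l>One</l><stage>Rising</stage><l>Two</l></sp>
  <sp who="B"><l>Three</l><stage>Exit</stage></sp>
</div>"""

scene = ET.fromstring(SCENE)

# //sp/stage: one result per stage inside a speech
stages_in_speeches = scene.findall(".//sp/stage")

# //sp/stage/.. : step back up to the parent speech;
# a speech with two stages still appears only once
speeches_with_stage = scene.findall(".//sp/stage/..")

print(len(stages_in_speeches), [sp.get("who") for sp in speeches_with_stage])
```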
We've just learned how XPath helps us navigate the tree
XPath predicates help us filter these results by searching for parts of the tree so long as they satisfy a particular set of conditions.
Predicates are written in square brackets that follow the step they apply to
Think of the square brackets as the phrase “that is...” or “that has...”
//shelf[books] = all shelf elements that have a child books element
//div[descendant::stage] = all divs that have a descendant stage element
Predicates can be nested and can be chained
//sp[stage[stage]] = all speeches that have a child stage that itself has a child stage
//sp[stage][p] = all speeches that have a child stage AND a child paragraph
More formally, predicates resolve to a boolean value (true or false)
This means you can perform operations in the predicate
//div[@type='act'] = all divs that have a type attribute with the value "act"
//div[@type='scene'][stage] = all scenes that have a stage as a child
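These predicate forms can be tried on a small stand-in document. The PLAY fragment below is invented for illustration, and the sketch assumes Python's standard-library xml.etree.ElementTree, whose XPath subset covers [tag], [@attr='value'], and chained predicates (but not nested ones or named axes).

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, not the real Hamlet markup
PLAY = """<body>
  <div type="act" n="1">
    <div type="scene" n="1">
      <stage>Enter</stage>
      <sp who="A"><l>Verse</l></sp>
    </div>
    <div type="scene" n="2">
      <sp who="B"><p>Prose</p><stage>Exit</stage></sp>
    </div>
  </div>
</body>"""

body = ET.fromstring(PLAY)

# //div[@type='act']: divs whose type attribute is "act"
acts = body.findall(".//div[@type='act']")

# //div[@type='scene'][stage]: scenes that have a stage as a direct child
scenes_with_stage = body.findall(".//div[@type='scene'][stage]")

# //sp[stage][p]: speeches with both a child stage and a child paragraph
prose_with_stage = body.findall(".//sp[stage][p]")

# Predicates on successive steps: Act 1, Scene 2
act1_scene2 = body.findall(".//div[@type='act'][@n='1']/div[@type='scene'][@n='2']")

print(len(acts), [s.get("n") for s in scenes_with_stage],
      [sp.get("who") for sp in prose_with_stage], len(act1_scene2))
```

Scene 2 is not returned by the [stage] predicate because its only stage sits inside a speech rather than directly inside the scene.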
Write the XPath for getting Act 3, Scene 1
//div[@type='act'][@n='3']/div[@type='scene'][@n='1']
How many stages contain another stage?
//stage[descendant::stage] (16)
All nodes have an implicit position that can be used in a predicate
//books/book[2] = the second book in the sequence
You can also retrieve this using the position() function (we'll talk more about functions later)
//sp[position() = 3] = the speech that is third in a sequence
//book[2]
<shelf>
  <knickknack/>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
//book[2]
<shelf>
  <knickknack/>
  <books>
    <book author="Kate Beaton">Ducks</book>
  </books>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <books>
    <book author="Joy Kogawa">Obasan</book>
    <book author="Emily St. John Mandel">Station Eleven</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
What do you get when you try //sp[3]?
Why would there be more than one?
//sp[3] selects every speech that is the third <sp> child of its own parent, not the third speech in the entire set of results
(//book)[2]
<shelf>
  <knickknack/>
  <books>
    <book author="Kate Beaton">Ducks</book>
  </books>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <books>
    <book author="Joy Kogawa">Obasan</book>
    <book author="Emily St. John Mandel">Station Eleven</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>
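The contrast between //book[2] and (//book)[2] can be sketched in Python. This assumes the standard-library xml.etree.ElementTree, which supports the positional predicate but not parenthesised paths, so the second form is emulated here by indexing the full result list (Python lists count from 0, XPath positions from 1).

```python
import xml.etree.ElementTree as ET

SHELF = """<shelf>
  <knickknack/>
  <books>
    <book author="Kate Beaton">Ducks</book>
  </books>
  <books>
    <book author="André Alexis">Fifteen Dogs</book>
    <book author="Natsume Sōseki">I Am A Cat</book>
  </books>
  <books>
    <book author="Joy Kogawa">Obasan</book>
    <book author="Emily St. John Mandel">Station Eleven</book>
  </books>
  <magazines>
    <magazine>The New Yorker</magazine>
  </magazines>
</shelf>"""

root = ET.fromstring(SHELF)

# //book[2]: every book that is the second <book> within its own parent
second_in_each_parent = [b.text for b in root.findall(".//book[2]")]

# (//book)[2]: gather all books first, then take the second overall
second_overall = root.findall(".//book")[1].text

print(second_in_each_parent, second_overall)
```

The first <books> group holds only one book, so it contributes nothing to //book[2].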
Challenge: Finish the XPath to determine how many stages are in Act 3, Scene 1
//stage[
//stage[ancestor::div[@type='act'][@n='3']][ancestor::div[@type='scene'][@n='1']] (10)
//stage[ancestor::div[@type='scene'][@n='1'][../self::div[@n='3']]]
Challenge: How many prose speeches are in act 2, scene 1?
//div[@type='act'][@n='2']/div[@type='scene'][@n='1']/descendant::sp[p]
//div[2]/div[1]//sp[p]
//sp[p][ancestor::div[@type='scene'][@n='1']][ancestor::div[@type='act'][@n='2']]
Functions perform operations on sets of results
position() is a function that returns the position of a node
not() is a function that negates a boolean
//sp[not(stage)] = all speeches that DO NOT have a child stage
Are there any speeches that do not have a @who value?
//sp[not(@who)] (1)
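The same negative tests can be sketched in Python. The standard-library xml.etree.ElementTree does not implement not(), so this sketch (on an invented SCENE fragment) selects all speeches and filters them with ordinary Python conditions instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only
SCENE = """<div type="scene" n="1">
  <sp who="A"><l>One</l><stage>Aside</stage></sp>
  <sp><l>Two</l></sp>
  <sp who="C"><p>Three</p></sp>
</div>"""

scene = ET.fromstring(SCENE)
speeches = list(scene.iter("sp"))

# Stand-in for //sp[not(stage)]: speeches with no child stage
no_stage = [sp for sp in speeches if sp.find("stage") is None]

# Stand-in for //sp[not(@who)]: speeches lacking a who attribute
no_who = [sp for sp in speeches if sp.get("who") is None]

print(len(no_stage), len(no_who))
```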
Any queries you're interested in trying out?
W3C XPath and XQuery Functions and Operators 3.1: https://www.w3.org/TR/xpath-functions-31/
Saxonica (XSLT Processor) XPath Reference: https://www.saxonica.com/documentation12/index.html#!expressions
DHSI "Code the X-Files" course: https://dhsi.org/on-campus-courses/