Introduction to XPath


Joey Takeda, SFU DHIL

March 29, 2023

Materials here: https://sfu.ca/~takeda/xpath

What is XPath

A syntax for navigating and querying the XML tree

The core syntax for the "X-languages" (XQuery, XSLT) to transform and manipulate XML documents

XPath can be used in JavaScript (much like CSS selectors)

XQuery (to create server-side web applications); XSLT (to transform XML into HTML, PDF, et cetera)

Used frequently in Python for Natural Language Processing and data analysis

Quick XML Refresher

XML = eXtensible Markup Language

XML is not a set language unto itself, but a grammar

There is nothing inherent about the function of XML

It is purely a structure--a way of organizing

Anyone can conceive of an XML dialect (i.e. it is extensible)

XML

XML is hierarchical

XML is a tree-like structure

And is often described in genealogical terms

XML Explained


 <shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
            

            

XML Explained

Each component of an XML tree is called a node

There are three main types of nodes:

  • Element nodes
  • Text nodes
  • Attribute nodes
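As a rough illustration of the three node types, the sketch below parses a one-book fragment with Python's standard xml.etree.ElementTree module. Python and the variable names are assumptions made for illustration (the workshop itself uses oXygen). ElementTree exposes text and attributes as properties of an element rather than as separate node objects, so this is only an approximation of the XPath view of the tree.

```python
import xml.etree.ElementTree as ET

# A one-book fragment based on the shelf example.
fragment = '<books><book author="André Alexis">Fifteen Dogs</book></books>'
books = ET.fromstring(fragment)
book = books[0]                 # the first child element of <books>

element_name = book.tag         # the element node's name
text_content = book.text        # the text node inside the element
attributes = book.attrib        # the attribute nodes, as a dictionary

print(element_name, text_content, attributes)
```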

XML Explained: Element Nodes

A tag name wrapped in angle brackets marks an element

E.g. <book> would be called the book element

All* elements have start and end tags

E.g. <book> is the start tag and </book> is the end tag

XML Explained: Text Nodes

Any text contained by an element

The <magazine> element has a single text node, "The New Yorker"

XML Explained: Text Nodes

How many text nodes?

                
<books>
      <book>Fifteen Dogs</book>
      <book>I Am A Cat</book>
</books>                    
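One way to check the count is to collect every text node by hand. The sketch below is an illustration in Python using the standard xml.etree.ElementTree module, which keeps whitespace-only text; the helper name text_nodes is hypothetical. In ElementTree, text directly after a start tag lives in .text and text after an end tag lives in the .tail of that element.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<books>
      <book>Fifteen Dogs</book>
      <book>I Am A Cat</book>
</books>"""

def text_nodes(element):
    """Collect the text nodes under an element, in document order."""
    found = []
    if element.text is not None:
        found.append(element.text)
    for child in element:
        found.extend(text_nodes(child))
        if child.tail is not None:
            found.append(child.tail)
    return found

nodes = text_nodes(ET.fromstring(BOOKS_XML))
for node in nodes:
    print(repr(node))
```

The two titles are joined by three whitespace-only text nodes (the line breaks and indentation between tags), so the list has five entries.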

XML Explained

Wait...what about <knickknack/> ?

A special shortcut, called a "self-closing element" or "empty" element

<knickknack/> === <knickknack></knickknack>
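The equivalence can be checked with a parser. This is a small sketch using Python's standard xml.etree.ElementTree (an assumption for illustration, not part of the workshop materials): both spellings parse to an element with no children and serialize identically.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<knickknack/>")
long_form = ET.fromstring("<knickknack></knickknack>")

# Both forms produce the same empty element.
same_name = short_form.tag == long_form.tag
same_serialization = ET.tostring(short_form) == ET.tostring(long_form)
child_counts = (len(short_form), len(long_form))

print(same_name, same_serialization, child_counts)
```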

XML Explained

Elements can also have attributes and each attribute must have a value

E.g. <book author="André Alexis"> has an author attribute with the value of "André Alexis"

(Think of attributes as you would in everyday life; people don't have "height" or "age" without a value)

XML Terminology

Elements nest and use genealogical terms

Root: The main wrapper element (<shelf>)

Child: Element directly within another element:
<book> is a child of <books>

Parent: The containing element:
<magazines> is the parent element of <magazine>

XML Terminology

Ancestor: A parent, the parent's parent, and so on up the tree

  • <shelf> is an ancestor of <magazine>
  • <books> is an ancestor of <book>

Descendant: Children and children's children

  • <magazine> is a descendant of <shelf>
  • <books> is a descendant of <shelf>

XML Terminology

Preceding sibling and following sibling: Children that share a parent, named by whether they come before or after in document order

  • <magazines> is a following sibling of <books>
  • <books> is the preceding sibling of <magazines> and a following sibling of <knickknack>
  • The <knickknack> has two following siblings and no preceding siblings
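The family terms can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under assumptions: ElementTree elements do not remember their parent, so the parent_of lookup and the ancestors helper below are hypothetical names built for this example.

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)

# Build a child -> parent lookup once.
parent_of = {child: parent for parent in shelf.iter() for child in parent}

def ancestors(element):
    """Names of all ancestors of an element, nearest first."""
    names = []
    while element in parent_of:
        element = parent_of[element]
        names.append(element.tag)
    return names

children = [child.tag for child in shelf]
magazine_ancestors = ancestors(shelf.find("./magazines/magazine"))

# Siblings of <books>, split by document order.
siblings = list(shelf)
books_index = siblings.index(shelf.find("./books"))
preceding = [el.tag for el in siblings[:books_index]]
following = [el.tag for el in siblings[books_index + 1:]]

print(children, magazine_ancestors, preceding, following)
```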

What is XPath

An XPath expression specifies the location of a node in the XML tree

Example: Finding the magazine

To navigate to the magazine element: /shelf/magazines/magazine


 <shelf>
   <knickknack/>
   <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
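The same path can be tried from Python. A minimal sketch, assuming the standard xml.etree.ElementTree module, which supports only a subset of XPath: paths are evaluated relative to the parsed root element, so /shelf/magazines/magazine is written as ./magazines/magazine.

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)   # the <shelf> root element

# Step from shelf to its magazines child, then to the magazine child.
matches = shelf.findall("./magazines/magazine")
titles = [m.text for m in matches]
print(titles)
```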

XPath Explained

The leading slash represents the document root

Each / represents a step in the tree

The / is also known as the child axis

XPath Explained

Think of "/" as meaning "direct child of"

So /shelf/magazines/magazine = "Magazine is a direct child of magazines, which is a direct child of shelf"

XPath Explained

An XPath statement will return all nodes that match a path

Text nodes are special and are called text()

Example: Finding books

/shelf/books/book


 <shelf>
   <knickknack/>
   <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>

Beyond elements

Attributes can be referred to using the "@" with their name

Example: Retrieving attributes

/shelf/books/book/@author


 <shelf>
   <knickknack/>
   <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
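For comparison, here is a sketch of the same lookup in Python. It assumes the standard xml.etree.ElementTree module, whose XPath subset cannot return attribute nodes directly, so the code selects the book elements and then reads the author attribute from each one.

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)

# Equivalent in spirit to /shelf/books/book/@author.
authors = [book.get("author") for book in shelf.findall("./books/book")]
print(authors)
```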

Beyond elements

Text nodes can be referred to using text()

Any element can be referred to using * (as a wildcard selector)

Any kind of node (text, element, and so on) can be referred to using node()

Example: All grandchild elements

/shelf/*/* = all grandchild elements of shelf


 <shelf>
   <knickknack/>
   <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
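A quick way to see what the wildcard matches is to run it in Python. This sketch assumes the standard xml.etree.ElementTree module, where the path is written relative to the root element (./*/* rather than /shelf/*/*).

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)

# Any child element of shelf, then any child element of that.
grandchildren = [el.tag for el in shelf.findall("./*/*")]
print(grandchildren)
```

The empty knickknack element contributes nothing, because it has no children of its own.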

Example: All child nodes

/shelf/books/node()


 <shelf>
   <knickknack/>
   <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>

Example: Finding book titles

/shelf/books/book/text()


 <shelf>
   <knickknack/>
   <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
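In Python's standard xml.etree.ElementTree module (an assumption for illustration), text() is not part of the supported path syntax. A rough equivalent is to select the book elements and read the .text property, which holds the text that directly follows each start tag.

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)

# Equivalent in spirit to /shelf/books/book/text().
book_titles = [book.text for book in shelf.findall("./books/book")]
print(book_titles)
```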

Question

What will /shelf/books/text() return?


 <shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>

Try it!

Go to the workshop page: https://sfu.ca/~takeda/xpath

Click the "Download Hamlet" button (and remember where you save it)

Open oXygen and go to File > Open and open the Hamlet file

In oXygen

Go to Window > Show View > XPath/XQuery Builder

Hamlet

<div> = a division (act or scene)

<sp> = speech

<speaker> = speech prefix (i.e. the name printed in the play)

<l> = a verse line

<p> = a paragraph

<stage> = a stage direction

Exercises

Retrieve all acts (direct children of body)

/body/div (5)

Retrieve all scenes (direct children of acts)

/body/div/div (20)

Okay, but...

Retrieve all stage directions (descendant of body)

/body/div/div/stage

But also /body/div/div/sp/stage

But also /body/div/div/sp/l/stage

But also /body/div/div/sp/speaker/stage

But also /body/div/div/sp/p/stage

+++++++

Axes!

Descendants = "//"

//stage = every stage element that is a descendant of the document root, at any depth
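The difference between spelling out each path and using // can be sketched in Python. The fragment below is invented for illustration: it borrows the workshop's element names but is not the Hamlet file, and it assumes the standard xml.etree.ElementTree module, where the descendant search is written .//stage relative to the root element.

```python
import xml.etree.ElementTree as ET

# A made-up fragment, not the real Hamlet file.
PLAY_XML = """
<body>
  <div type="act" n="1">
    <div type="scene" n="1">
      <stage>Enter A and B</stage>
      <sp who="A">
        <speaker>A</speaker>
        <l>A verse line <stage>aside</stage></l>
        <stage>Exit A</stage>
      </sp>
    </div>
  </div>
</body>
"""
body = ET.fromstring(PLAY_XML)

# Child steps only: finds stage elements directly inside a scene.
direct = body.findall("./div/div/stage")

# Descendant search: finds stage elements at any depth.
everywhere = body.findall(".//stage")

print(len(direct), len(everywhere))
```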

Try it!

Retrieve all stage directions

//stage (284)

Try it!

How many speeches (<sp>) are in Hamlet?

//sp (1138)

Try it!

Retrieve all stage directions contained in a speech

//sp/stage (134)

Axes

Axis               | Short | Full                 | Example                       | Note
Child              | /     | child::              | /body/div                     | /body/child::div
Descendant or self | //    | descendant-or-self:: | //sp                          | /*/descendant-or-self::sp
Descendant         |       | descendant::         | /body//sp                     | /body/descendant::sp
Parent             | ..    | parent::             | /body/div/div/sp/parent::div  | //div/..
Ancestor           |       | ancestor::           | //speaker/ancestor::div       |
Ancestor or self   |       | ancestor-or-self::   | //div/ancestor-or-self::div   |
Self               |       | self::               | //stage/parent::*/self::stage | This is mostly useful in conjunction with other axes
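To make the upward axes concrete, here is a sketch in Python on an invented fragment (the workshop's element names, not the Hamlet file). It assumes the standard xml.etree.ElementTree module, which understands the .. shorthand but has no ancestor axis, so the ancestors helper and parent_of lookup are hypothetical stand-ins written by hand.

```python
import xml.etree.ElementTree as ET

# A made-up fragment, not the real Hamlet file.
PLAY_XML = """
<body>
  <div type="act" n="1">
    <div type="scene" n="1">
      <stage>Enter A and B</stage>
      <sp who="A">
        <speaker>A</speaker>
        <l>A verse line <stage>aside</stage></l>
        <stage>Exit A</stage>
      </sp>
    </div>
  </div>
</body>
"""
body = ET.fromstring(PLAY_XML)

# Parent axis: like //stage/.. in XPath.
stage_parents = [el.tag for el in body.findall(".//stage/..")]

# Ancestor axis, emulated: like //speaker/ancestor::div.
parent_of = {child: parent for parent in body.iter() for child in parent}

def ancestors(element, name):
    """Ancestors of `element` with tag `name`, nearest first."""
    found = []
    while element in parent_of:
        element = parent_of[element]
        if element.tag == name:
            found.append(element)
    return found

speaker = body.find(".//speaker")
div_types = [div.get("type") for div in ancestors(speaker, "div")]

print(stage_parents, div_types)
```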

Axes

Axis              | Short | Full                | Example
Preceding-sibling |       | preceding-sibling:: | //stage/preceding-sibling::l
Following-sibling |       | following-sibling:: | //div/stage/following-sibling::sp
Preceding         |       | preceding::         | //stage/preceding::lb
Following         |       | following::         | //sp/following::stage

Try it!

Retrieve the parent speech (<sp>) from all stage directions

//stage/parent::sp (99)

Try it!

Retrieve all of the @who values for speech

//sp/@who (1137)

Try it!

How many verse lines (<l>) follow a stage direction in a speech?

//sp/stage/following-sibling::l (403)

Coffee Break!

Predicates

We've just learned how XPath helps us navigate the tree

XPath predicates help us filter these results by searching for parts of the tree so long as they satisfy a particular set of conditions.

What do they look like

Square brackets that follow the step they apply to

Think of the square brackets as the phrase “that is...” or “that has...”

//shelf[books] = all shelf elements that have a child books element

//div[descendant::stage] = all divs that have a descendant stage element

More examples

Predicates can be nested and can be chained

//sp[stage[stage]] = all speeches that have a child stage that itself has a child stage

//sp[stage][p] = all speeches that have a child stage AND a child paragraph

More examples

More formally, predicates resolve to a boolean value (true or false)

This means you can perform operations in the predicate

//div[@type='act'] = all divs that have a type attribute with the value "act"

//div[@type='scene'][stage] = all scenes that have a stage as a child
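Predicates like these can be tried in Python on a small invented fragment (the workshop's element names, not the Hamlet file). This sketch assumes the standard xml.etree.ElementTree module, whose XPath subset supports [tag] and [@attribute='value'] predicates, chained one after another, with paths written relative to the root element.

```python
import xml.etree.ElementTree as ET

# A made-up fragment, not the real Hamlet file.
PLAY_XML = """
<body>
  <div type="act" n="1">
    <div type="scene" n="1">
      <stage>Enter A</stage>
      <sp who="A"><speaker>A</speaker><p>Prose <stage>aside</stage></p></sp>
    </div>
    <div type="scene" n="2">
      <sp who="B"><speaker>B</speaker><l>Verse</l><stage>Exit B</stage></sp>
    </div>
  </div>
</body>
"""
body = ET.fromstring(PLAY_XML)

# All divs whose type attribute is "act".
acts = body.findall(".//div[@type='act']")

# All scenes that have a stage as a direct child.
scenes_with_stage = [d.get("n") for d in body.findall(".//div[@type='scene'][stage]")]

# All speeches that have a stage as a direct child.
speeches_with_stage = [sp.get("who") for sp in body.findall(".//sp[stage]")]

# Act 1, Scene 2, using chained attribute predicates on each step.
scene = body.findall(".//div[@type='act'][@n='1']/div[@type='scene'][@n='2']")

print(len(acts), scenes_with_stage, speeches_with_stage, len(scene))
```

Scene 2 and speech A are left out because their stage elements sit one level deeper, inside a speech and a paragraph respectively, rather than being direct children.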

Try it!

Write the XPath for getting Act 3, Scene 1

//div[@type='act'][@n='3']/div[@type='scene'][@n='1']

Try it!

How many stages contain another stage?

//stage[descendant::stage] (16)

Position in sequence

All nodes have an implicit position that can be used in a predicate

//books/book[2] = the second book in the sequence

You can also retrieve this using the position() function (we'll talk more about functions later)

//sp[position() = 3] = the speech that is third in a sequence

Positions example

//book[2]

                
 <shelf>
    <knickknack/>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
                
            

What about this?

//book[2]


 <shelf>
    <knickknack/>
    <books>
        <book author="Kate Beaton">Ducks</book>
    </books>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <books>
        <book author="Joy Kogawa">Obasan</book>
        <book author="Emily St. John Mandel">Station Eleven</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
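Running the query shows why more than one book can come back. This is a sketch assuming Python's standard xml.etree.ElementTree module, which supports integer position predicates and counts positions among same-named siblings under each parent; the path is written relative to the root element.

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="Kate Beaton">Ducks</book>
    </books>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <books>
        <book author="Joy Kogawa">Obasan</book>
        <book author="Emily St. John Mandel">Station Eleven</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)

# Like //book[2]: the second book within each parent.
second_books = [book.text for book in shelf.findall(".//book[2]")]
print(second_books)
```

The first books element holds only one book, so it contributes nothing; each of the other two contributes its second book.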

Positions continued

What do you get when you try //sp[3]?

Why would there be more than one?

The predicate picks the third <sp> among the children of each parent, not the third in the entire set of results

What about this?

(//book)[2]


 <shelf>
    <knickknack/>
    <books>
        <book author="Kate Beaton">Ducks</book>
    </books>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <books>
        <book author="Joy Kogawa">Obasan</book>
        <book author="Emily St. John Mandel">Station Eleven</book>
    </books>
    <magazines>
       <magazine>The New Yorker</magazine>
    </magazines>
 </shelf>
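The parentheses gather all matches first and only then apply the position. Python's standard xml.etree.ElementTree module (an assumption for illustration) has no parenthesised paths, so this sketch emulates the idea by collecting every book and indexing the resulting list. Python lists count from 0, while XPath positions count from 1.

```python
import xml.etree.ElementTree as ET

SHELF_XML = """
<shelf>
    <knickknack/>
    <books>
        <book author="Kate Beaton">Ducks</book>
    </books>
    <books>
        <book author="André Alexis">Fifteen Dogs</book>
        <book author="Natsume Sōseki">I Am A Cat</book>
    </books>
    <books>
        <book author="Joy Kogawa">Obasan</book>
        <book author="Emily St. John Mandel">Station Eleven</book>
    </books>
    <magazines>
        <magazine>The New Yorker</magazine>
    </magazines>
</shelf>
"""
shelf = ET.fromstring(SHELF_XML)

# Like (//book)[2]: all books in document order, then the second one.
all_books = shelf.findall(".//book")
second_overall = all_books[1].text   # index 1 is the second item
print(second_overall)
```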

Try it!

Challenge: Finish the XPath to determine how many stages are in Act 3, Scene 1

//stage[

//stage[ancestor::div[@type='scene'][@n='1']/parent::div[@type='act'][@n='3']] (10)

//stage[ancestor::div[@type='scene'][@n='1'][../self::div[@n='3']]]

Try it!

Challenge: How many prose speeches are in act 2, scene 1?

//div[@type='act'][@n='2']/div[@type='scene'][@n='1']/descendant::sp[p]

//div[2]/div[1]//sp[p]

//sp[p][ancestor::div[@type='scene'][@n='1']][ancestor::div[@type='act'][@n='2']]

If time: Functions

Functions take arguments (such as a set of nodes or a boolean) and return a value

position() is a function that returns the position of a node

not() is a function that negates a boolean

//sp[not(stage)] = all speeches that DO NOT have a child stage
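The effect of not() can be emulated in Python on an invented fragment (the workshop's element names, not the Hamlet file). The standard xml.etree.ElementTree module assumed here does not support XPath functions such as not(), so the negation is done with an ordinary Python condition.

```python
import xml.etree.ElementTree as ET

# A made-up fragment, not the real Hamlet file.
SCENE_XML = """
<div type="scene" n="1">
  <sp who="A"><speaker>A</speaker><l>Line</l><stage>Exit A</stage></sp>
  <sp who="B"><speaker>B</speaker><l>Line</l></sp>
  <sp><speaker>All</speaker><l>Line</l></sp>
</div>
"""
scene = ET.fromstring(SCENE_XML)

# Like //sp[not(stage)]: speeches with no child stage.
no_stage = [sp.find("speaker").text
            for sp in scene.iter("sp") if sp.find("stage") is None]

# Like //sp[not(@who)]: speeches with no who attribute.
no_who = [sp.find("speaker").text
          for sp in scene.iter("sp") if sp.get("who") is None]

print(no_stage, no_who)
```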

Try it!

Are there any speeches that do not have a @who value?

//sp[not(@who)] (1)

Brainstorm

Any queries you're interested in trying out?

Resources

W3C XPath and XQuery Functions and Operators 3.1: https://www.w3.org/TR/xpath-functions-31/

Saxonica (XSLT Processor) XPath Reference: https://www.saxonica.com/documentation12/index.html#!expressions

DHSI "Code the X-Files" course: https://dhsi.org/on-campus-courses/

Thanks!