TEI Summer Sessions

Encoding Language and the Language of XML

https://disa-dhil.github.io/tei-summer-sessions/

Joey Takeda

Digital Humanities Innovation Lab, SFU | Digital Scholarship in the Arts (DiSA), UBC

August 11, 2025

Today

10:00-12:00: Presentation and Workshop
12:00-12:30: Lunch
12:30-2:00: Debrief & Co-working

Before we begin...

This is not going to be an introduction to TEI
But completely OK if you aren't familiar!
But also: please feel free to interrupt, ask questions, etc

Today

Brief History of Unicode
@xml:lang
Encoding Foreign Language and Parallel Transcription

Character Encoding

1960s–1980s: Many different character sets (ASCII, ISO-8859, Shift-JIS…)
Limited support for languages beyond English (a single-byte character set holds at most 256 characters)
Encoding mismatches caused "mojibake": ��
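
As a minimal sketch of how an encoding mismatch produces mojibake, the Python snippet below writes "é" as UTF-8 and then reads the bytes back with the wrong character set. The variable names are illustrative only, and Latin-1 (ISO-8859-1) is assumed as the mismatched encoding.

```python
# Encode "é" as UTF-8, then (wrongly) decode those bytes as Latin-1.
original = "é"
utf8_bytes = original.encode("utf-8")      # two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("latin-1")     # each byte is read as its own character

# The damage is reversible if no bytes were lost along the way.
repaired = garbled.encode("latin-1").decode("utf-8")

# The opposite mismatch: a Latin-1 byte read as UTF-8 is invalid,
# so a lenient decoder substitutes U+FFFD, the replacement character.
replaced = "é".encode("latin-1").decode("utf-8", errors="replace")

print(garbled, repaired, replaced)
```

The first mismatch yields two wrong characters; the second yields the replacement character, which is often what a page full of question-mark diamonds indicates.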

Unicode (1991)

Universal character set for all writing systems (most often stored as UTF-8, which is one encoding of Unicode, not another name for it)
Each character = a unique number (a code point)
Written as U+ + hexadecimal (e.g., U+0041 = A)

Examples

A — U+0041
é — U+00E9
🙂 — U+1F642
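
The code points listed above can be checked in Python. This is a small sketch; the helper name `describe` is made up for illustration. It also shows the UTF-8 bytes for each character, since the code point (the number) and its stored bytes are two different things.

```python
def describe(char: str) -> str:
    """Return the U+ notation and the UTF-8 bytes for a single character."""
    code_point = ord(char)                            # the character's number
    utf8_hex = char.encode("utf-8").hex(" ").upper()  # how UTF-8 stores it
    return f"U+{code_point:04X} -> UTF-8 bytes {utf8_hex}"


for char in ("A", "\u00e9", "\U0001F642"):
    print(char, describe(char))
```

Note how the number of bytes grows from one to four as the code point gets larger.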

Glyphs and Characters

Unicode also provides sets of "combining characters": characters that are meant to modify another character and not stand on their own
Letters with accents and other diacritical marks usually come "precomposed" as a single code point
E.g. é — U+00E9
But they can also be decomposed into a base letter plus a combining mark (e + U+0301)

Example

NFC character	A	m	é		l	i	e
Composed code point	0041	006d	00e9		006c	0069	0065
Decomposed code point	0041	006d	0065	0301	006c	0069	0065
NFD character	A	m	e	◌́	l	i	e
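
The table above can be reproduced with Python's standard `unicodedata` module. This is a sketch; the helper `code_points` is a hypothetical name used only here.

```python
import unicodedata

name = "Am\u00e9lie"  # "Amélie" with a precomposed é

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form


def code_points(text: str) -> list[str]:
    """List each code point in the string as four-digit hexadecimal."""
    return [f"{ord(char):04x}" for char in text]


print(code_points(nfc))
print(code_points(nfd))
print(nfc == nfd)  # the two strings look the same but do not compare equal
```

This is why searching or comparing text usually calls for normalizing both sides to the same form first.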

Emoji Combinations: Skin Tone (Fitzpatrick Scale)

Base Emoji	U+1F3FB (Light)	U+1F3FC (Medium-Light)	U+1F3FD (Medium)	U+1F3FE (Medium-Dark)	U+1F3FF (Dark)
👍	👍🏻	👍🏼	👍🏽	👍🏾	👍🏿
🧑	🧑🏻	🧑🏼	🧑🏽	🧑🏾	🧑🏿
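
A skin-tone emoji is simply the base emoji followed by a modifier code point. The sketch below builds the "medium" thumbs-up from the table; the variable names are illustrative.

```python
import unicodedata

thumbs_up = "\U0001F44D"      # base emoji
medium_tone = "\U0001F3FD"    # skin tone modifier from the table (Medium)

combined = thumbs_up + medium_tone

print(combined)
print(len(combined))                  # number of code points, not glyphs
print(unicodedata.name(medium_tone))  # the modifier's official character name
```

A font that supports the sequence draws one glyph; one that does not may draw the thumb and a coloured swatch side by side.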

Other Emoji Combinations

Description	Emojis	Result
Occupation (Person + medical symbol)	🧑 + ⚕	🧑‍⚕
Family (Person + person + baby)	👩 + 👩 + 👧	👩‍👩‍👧
Objects (Flag + rainbow)	🏳 + 🌈	🏳️‍🌈
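
The combinations in this table are joined with an invisible character, the zero width joiner (U+200D). As a sketch, with illustrative variable names, the sequences can be assembled by hand:

```python
ZWJ = "\u200D"   # zero width joiner: glues emoji into one glyph
VS16 = "\uFE0F"  # variation selector-16: requests emoji presentation

health_worker = "\U0001F9D1" + ZWJ + "\u2695"                   # person + medical symbol
family = ZWJ.join(["\U0001F469", "\U0001F469", "\U0001F467"])   # woman + woman + girl
rainbow_flag = "\U0001F3F3" + VS16 + ZWJ + "\U0001F308"         # white flag + rainbow

for sequence in (health_worker, family, rainbow_flag):
    print(sequence, [f"U+{ord(char):04X}" for char in sequence])
```

Splitting any of these strings on the joiner recovers the individual emoji, which is also what a font without support for the sequence will display.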

Unicode Support

How all of these are drawn depends on the font, not on Unicode
292,531 assigned code points (including private-use code points)
Not all fonts support all character ranges
Character can exist in Unicode, but not in a font
This creates "tofu": 􏿮
Google's Noto family of fonts ("no tofu") is meant to resolve this problem

Summary

Nearly all modern text is encoded in UTF-8
Every character corresponds to a code point; a glyph is how a font draws it
Code points can combine to make what readers see as a single character
Most rendering issues are a font issue, not a character encoding one (but that's always useful to check)

Questions?

@xml:lang and Language Encoding in TEI

In TEI (and all XML), the @xml:lang attribute allows encoders to indicate the language of the content of any element
This is usually declared on the root element (e.g. <TEI>) and is inherited by all descendants unless overridden

The root TEI element has an @xml:lang="en"

We don't need an @xml:lang on the first title, since it inherits English from the root

But we do need to say that the second title is in French

And these can nest

The content of the text will be declared to be in English (because of the root @xml:lang)

But individual segments (or the whole body itself) could have a new @xml:lang value
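
The points above can be tried out on a small sample. The TEI fragment below is invented for illustration (it is not the example from the slides and is not schema-valid), and `effective_languages` is a hypothetical helper. The sketch walks the tree and reports the language each element ends up with, whether declared or inherited.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample: an English document with a French title and nested overrides.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A Sample Title</title>
        <title xml:lang="fr">Un titre d'exemple</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>An English paragraph with <seg xml:lang="fr">une phrase</seg>.</p>
      <div xml:lang="de"><p>Ein deutscher Absatz.</p></div>
    </body>
  </text>
</TEI>
"""


def effective_languages(element, inherited=None):
    """Yield (local element name, effective language) for an element and its descendants."""
    lang = element.get(XML_LANG, inherited)  # own value wins, else the ancestor's
    yield element.tag.replace(TEI_NS, ""), lang
    for child in element:
        yield from effective_languages(child, lang)


root = ET.fromstring(SAMPLE)
results = list(effective_languages(root))
for name, lang in results:
    print(name, lang)
```

The first title reports "en" without carrying any attribute, while the paragraph inside the German division reports "de" because the nearest ancestor with a declaration wins.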

Language tags

Values use standardized subtags from the IANA Language Subtag Registry (the BCP 47 format)
Basic structure: primary language subtag (e.g. en), optional script subtag (e.g. Latn), optional region subtag (e.g. CA)

IANA Language Tag Examples

Language Tag	Language
`en`	English
`en-CA`	Canadian English
`es`	Spanish
`ru`	Russian
`de`	German
`zh`	Chinese
`ja-Latn`	Japanese in Latin script (romaji)
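
To make the language, script, and region structure concrete, here is a deliberately simplified sketch that splits a tag into those three parts. The function name `split_language_tag` is hypothetical. It assumes a four-letter subtag is a script and a two-letter subtag is a region, and it ignores everything else the full format allows (numeric regions, variants, extensions), so it is not a validator.

```python
def split_language_tag(tag: str) -> dict:
    """Split a simple language tag into language, script, and region subtags."""
    subtags = tag.split("-")
    parsed = {"language": subtags[0].lower(), "script": None, "region": None}
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            parsed["script"] = subtag.title()   # scripts are written like "Latn"
        elif len(subtag) == 2 and subtag.isalpha():
            parsed["region"] = subtag.upper()   # regions are written like "CA"
    return parsed


for tag in ("en", "en-CA", "ja-Latn"):
    print(tag, split_language_tag(tag))
```

For real projects, looking subtags up in the registry remains the reliable way to confirm that a tag means what you intend.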

Code switching

@xml:lang is allowed on every element, but it is often useful to mark up linguistically distinct words and phrases explicitly
The TEI provides two elements for this purpose: <foreign> and <distinct>
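
As a sketch of what this markup makes possible, the invented paragraph below tags one French phrase with <foreign> and one dialect word with <distinct>; the Python then lists every foreign phrase with its language. The sentence and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample sentence with one foreign phrase and one dialect word.
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">She had a certain
  <foreign xml:lang="fr">je ne sais quoi</foreign>, and a
  <distinct type="dialect">wee</distinct> bit of luck.</p>
"""

root = ET.fromstring(SAMPLE)

# Collect (language, text) for every <foreign> element in the paragraph.
foreign_phrases = [
    (element.get(XML_LANG), "".join(element.itertext()))
    for element in root.iter(TEI_NS + "foreign")
]
distinct_words = ["".join(element.itertext()) for element in root.iter(TEI_NS + "distinct")]

print(foreign_phrases)
print(distinct_words)
```

Once phrases are tagged this way, questions such as "which languages appear in this text, and where?" become simple queries.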

Exercise

Use the IANA Subtag Registry Lookup Tool: https://r12a.github.io/app-subtags/ to find the language codes for:

Portuguese
Gaelic
Russian
Irish

Open Question: Dialect?

Summary

Every element can bear the xml:lang attribute
@xml:lang is inherited
Best practice: put an xml:lang on the root TEI element
<foreign> is useful when a phrase has no other element around it, but it isn't necessary if a better container can carry the @xml:lang (unless there's a good reason to use it)

Parallel Texts and Translations

Aligning parallel texts can be tricky in TEI
Requires making links between the "main" text and translations

Option 1: Embedded Note

Option 2: Structural, implicit

Option 3: Structural, aligned

Option 4: Link Groups
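
A minimal sketch of the link group approach, under stated assumptions: the two-line text, its translation, and the @xml:id values are invented, and each <link> is assumed to point at one source line and one translated line through @target. The Python resolves those pointers into aligned pairs.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample: a French text, its English translation, and a link group.
SAMPLE = """
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <div xml:lang="fr">
      <l xml:id="fr1">Premier vers</l>
      <l xml:id="fr2">Second vers</l>
    </div>
    <div xml:lang="en" type="translation">
      <l xml:id="en1">First line</l>
      <l xml:id="en2">Second line</l>
    </div>
    <linkGrp type="translation">
      <link target="#fr1 #en1"/>
      <link target="#fr2 #en2"/>
    </linkGrp>
  </body>
</text>
"""

root = ET.fromstring(SAMPLE)

# Index every element that has an xml:id so pointers can be resolved.
by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

aligned_pairs = []
for link in root.iter(TEI_NS + "link"):
    # Each target is a space-separated list of "#id" pointers.
    ids = [pointer.lstrip("#") for pointer in link.get("target").split()]
    aligned_pairs.append(tuple("".join(by_id[i].itertext()) for i in ids))

print(aligned_pairs)
```

Because the links live apart from the text, neither the source nor the translation has to be restructured to record the alignment.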

Questions?

Lunch