Encoding Language and the Language of XML

https://disa-dhil.github.io/tei-summer-sessions/

Joey Takeda

Digital Humanities Innovation Lab, SFU | Digital Scholarship in the Arts (DiSA), UBC

August 11, 2025

Today

  1. 10:00-12:00: Presentation and Workshop
  2. 12:00-12:30: Lunch
  3. 12:30-2:00: Debrief & Co-working

Before we begin...

  • This is not going to be an introduction to TEI
  • But completely OK if you aren't familiar!
  • But also: please feel free to interrupt, ask questions, etc

Today

  1. Brief History of Unicode
  2. @xml:lang
  3. Encoding Foreign Language and Parallel Transcription

Character Encoding

  • 1960s–1980s: Many different character sets (ASCII, ISO-8859, Shift-JIS…)
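  • Limited support for languages beyond English: ASCII defines 128 characters, and single-byte sets such as ISO-8859 at most 256
  • Encoding mismatches caused "mojibake": garbled output such as Ã© where é was intended, or ��� where bytes could not be decoded at all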
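
A minimal sketch of how such a mismatch arises, assuming text that was saved as UTF-8 and then read back as Latin-1 (ISO-8859-1). The sample name is only an illustration.

```python
text = "Am\u00e9lie"              # "Amélie", with é as U+00E9

data = text.encode("utf-8")       # the bytes a UTF-8 editor would write
wrong = data.decode("latin-1")    # the same bytes read with the wrong encoding
right = data.decode("utf-8")      # the same bytes read with the right encoding

print(wrong)   # the é shows up as two unrelated characters
print(right)
```

The bytes never change; only the interpretation does, which is why declaring the encoding matters.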

Unicode (1991)

  • Universal character set for all writing systems (most often stored as UTF-8, which is one of its encodings)
  • Each character = a unique number (a code point)
  • Written as U+ followed by hexadecimal digits (e.g., U+0041 = A)

Examples

  • A — U+0041
  • é — U+00E9
  • 🙂 — U+1F642
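
A minimal Python sketch of the mapping between characters and code points. The helper name code_point is made up for illustration; ord(), chr() and str.encode() are standard built-ins. The last lines also show that UTF-8 is one way of storing a code point as bytes, not the code point itself.

```python
def code_point(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


# Escapes are used so the exact code points are unambiguous: A, é, 🙂
for ch in ["A", "\u00e9", "\U0001F642"]:
    print(code_point(ch))

# chr() goes the other way, from number to character.
smiley = chr(0x1F642)

# The single code point U+00E9 is stored as two bytes in UTF-8.
e_acute_utf8 = "\u00e9".encode("utf-8")
print(e_acute_utf8)
```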

Glyphs and Characters

  • Unicode also provides "combining characters": characters that are meant to modify another character and not stand on their own
  • Letters with accents and other diacritical marks are usually "precomposed" as a single code point
  • E.g. é is U+00E9
  • But they can also be decomposed into a base letter plus a combining mark (e + U+0301)

Example

NFC character           A     m     é           l     i     e
Composed code point     0041  006d  00e9        006c  0069  0065
Decomposed code point   0041  006d  0065 0301   006c  0069  0065
NFD character           A     m     e ◌́         l     i     e
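
The same comparison can be sketched with Python's standard unicodedata module. The helper code_points is a made-up name for illustration; the two strings are written with escapes so it is clear which form each one uses.

```python
import unicodedata

composed = "Am\u00e9lie"      # é as the single code point U+00E9
decomposed = "Ame\u0301lie"   # e followed by U+0301 COMBINING ACUTE ACCENT


def code_points(s: str) -> str:
    """List the code points of a string as hex (hypothetical helper)."""
    return " ".join(f"{ord(c):04x}" for c in s)


print(code_points(composed))
print(code_points(decomposed))

# The two strings look identical on screen but are not equal as data.
same_raw = composed == decomposed

# Normalizing both to one form (here NFC) makes them comparable.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))
print(same_raw, same_nfc)
```

This matters for searching and sorting transcriptions: a search for the composed form will miss the decomposed one unless both are normalized first.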

Emoji Combinations: Skin Tone (Fitzpatrick Scale)

Modifier                   Thumbs up   Person
Base emoji (no modifier)   👍          🧑
U+1F3FB (Light)            👍🏻          🧑🏻
U+1F3FC (Medium-Light)     👍🏼          🧑🏼
U+1F3FD (Medium)           👍🏽          🧑🏽
U+1F3FE (Medium-Dark)      👍🏾          🧑🏾
U+1F3FF (Dark)             👍🏿          🧑🏿

Other Emoji Combinations

Description                              Emojis           Result
Occupation (Person + medical symbol)     🧑 + ⚕           🧑‍⚕
Family (Person + person + baby)          👩 + 👩 + 👧      👩‍👩‍👧
Objects (Flag + rainbow)                 🏳 + 🌈           🏳️‍🌈
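
A short sketch of what these combinations look like as code points. The "+" in the tables hides two invisible characters that the standard sequences use: the zero-width joiner (U+200D) and, for some symbols, variation selector-16 (U+FE0F). The variable names are only illustrative.

```python
ZWJ = "\u200d"    # ZERO WIDTH JOINER
VS16 = "\ufe0f"   # VARIATION SELECTOR-16, requests emoji presentation

thumbs_medium = "\U0001F44D" + "\U0001F3FD"                     # thumbs up + medium skin tone
health_worker = "\U0001F9D1" + ZWJ + "\u2695" + VS16            # person + medical symbol
family = ZWJ.join(["\U0001F469", "\U0001F469", "\U0001F467"])   # woman, woman, girl
rainbow_flag = "\U0001F3F3" + VS16 + ZWJ + "\U0001F308"         # white flag + rainbow

for label, seq in [("thumbs", thumbs_medium), ("health worker", health_worker),
                   ("family", family), ("rainbow flag", rainbow_flag)]:
    hexes = " ".join(f"U+{ord(c):04X}" for c in seq)
    print(f"{label}: {len(seq)} code points: {hexes}")
```

Each result is a single visible emoji only if the font knows how to draw the sequence; otherwise the parts appear side by side.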

Unicode Support

  • How all of these are drawn is font-specific, not set by Unicode
  • 292,531 assigned code points as of Unicode 16.0 (about 155,000 characters; the rest are private-use code points)
  • Not all fonts support all character ranges
  • A character can exist in Unicode but not in a given font
  • This creates "tofu": 􏿮
  • Google's Noto ("no tofu") font family is meant to resolve this problem

Summary

  • Nearly everything today is encoded in UTF-8
  • Every character corresponds to a code point
  • Code points can combine to make characters
  • Most rendering issues are a font issue, not a character encoding one (but that's always useful to check)

Questions?

@xml:lang and Language Encoding in TEI

  • In TEI (and all XML), the @xml:lang attribute allows encoders to indicate the language of the content of any element
  • This is usually declared on the root element (e.g. TEI) and is inherited by all descendant elements



The root TEI element has @xml:lang="en"

We don't need an @xml:lang on the first title, since it inherits English from the root

But we do need to say that the second title is in French

And these can nest

The content of the text will be declared to be in English (because of the root @xml:lang)

But individual segments (or the whole body itself) could have a new @xml:lang value
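
The annotations above can be illustrated with an abbreviated sample, invented for this sketch and not schema-valid TEI. The Python code walks the tree and reports the effective language of each element, carrying the inherited value down until an element declares its own. The function name effective_langs is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented, abbreviated sample: a real TEI file needs a fuller header.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Document</title>
        <title xml:lang="fr">Un document d'exemple</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>An English paragraph with <seg xml:lang="fr">quelques mots</seg> in French.</p>
    </body>
  </text>
</TEI>"""


def effective_langs(elem, inherited=None):
    """Yield (local name, effective language) for elem and its descendants."""
    lang = elem.get(XML_LANG, inherited)   # own value, else the inherited one
    yield elem.tag.replace(TEI_NS, ""), lang
    for child in elem:
        yield from effective_langs(child, lang)


langs = list(effective_langs(ET.fromstring(SAMPLE)))
for name, lang in langs:
    print(f"{name}: {lang}")
```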

Language tags

  • Uses standardized tags from the IANA Language Subtag Registry
  • Basic structure: primary language subtag (e.g. en), optional script subtag (e.g. Latn), optional region subtag (e.g. CA)

IANA Language Tag Examples

Language Tag   Language
en             English
en-CA          Canadian English
es             Spanish
ru             Russian
de             German
zh             Chinese
ja-Latn        Japanese in Latin script (rōmaji)
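
The language, script, region structure can be sketched with a deliberately simplified splitter. The function parse_tag is hypothetical: it only recognizes the three subtag kinds named above by their shape, ignores variants and extensions, and does not check anything against the IANA registry.

```python
def parse_tag(tag: str) -> dict:
    """Split a tag into language, script and region (simplified sketch)."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and result["script"] is None:
            result["script"] = part.title()    # scripts are conventionally Title case
        elif result["region"] is None and (
            (len(part) == 2 and part.isalpha())
            or (len(part) == 3 and part.isdigit())
        ):
            result["region"] = part.upper()    # regions are conventionally UPPER case
    return result


for tag in ["en", "en-CA", "ja-Latn", "zh-Hant-TW"]:
    print(tag, parse_tag(tag))
```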

Code switching

  • @xml:lang is allowed on every element, but it is often useful to mark up linguistically distinct phrases explicitly
  • The TEI provides two elements for this purpose: <foreign> and <distinct>
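
As a sketch, assuming an invented sentence: <foreign> wraps a phrase in a different language from its surroundings, while <distinct> wraps a phrase that is linguistically marked within the same language (archaic, technical, dialectal and so on). The Python code below only pulls both kinds of phrase out of the sample.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample paragraph, for illustration only.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">She had a certain
<foreign xml:lang="fr">je ne sais quoi</foreign>, and greeted everyone with
<distinct type="dialect">howdy</distinct>.</p>"""

root = ET.fromstring(SAMPLE)

# Phrases in another language, with the language they declare.
foreign = [(el.get(XML_LANG), el.text) for el in root.iter(TEI_NS + "foreign")]

# Phrases that are distinct within the surrounding language.
distinct = [(el.get("type"), el.text) for el in root.iter(TEI_NS + "distinct")]

print(foreign)
print(distinct)
```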






Exercise

Use the IANA Subtag Registry Lookup Tool: https://r12a.github.io/app-subtags/ to find the language codes for:

  • Portuguese
  • Gaelic
  • Russian
  • Irish

Open Question: Dialect?




Summary

  • Every element can bear the @xml:lang attribute
  • @xml:lang is inherited by descendant elements
  • Best practice: put an @xml:lang on the root TEI element
  • <foreign> is useful when no other element wraps the phrase; if a better container already exists, put @xml:lang on that instead (unless there's a good reason to add <foreign>)

Parallel Texts and Translations

  • Aligning parallel texts can be tricky in TEI
  • Requires making links between the "main" text and translations

Option 1: Embedded Note
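
One way to read this option, as a sketch: the translation sits inside the source line as a <note>. The sample (the opening of the Aeneid) and the type="translation" value are assumptions made for illustration. The Python code pairs each line with its embedded translation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample; type="translation" is an assumed convention.
SAMPLE = """<lg xmlns="http://www.tei-c.org/ns/1.0" xml:lang="la">
  <l>Arma virumque cano
    <note type="translation" xml:lang="en">I sing of arms and the man</note></l>
</lg>"""

root = ET.fromstring(SAMPLE)
pairs = []
for line in root.iter(TEI_NS + "l"):
    note = line.find(TEI_NS + "note")
    # line.text holds only the text before the first child element.
    pairs.append((line.text.strip(), note.text.strip()))

print(pairs)
```

The link is implicit in the nesting, so nothing can drift out of alignment, but the translation cannot be read as a continuous text of its own.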




Option 2: Structural, implicit
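
One way to read this option, as a sketch: source and translation live in parallel divisions with the same internal structure, and lines are matched purely by position. The sample and the type values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample; the div types are assumed conventions.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="original" xml:lang="la">
    <l>Arma virumque cano,</l>
    <l>Troiae qui primus ab oris</l>
  </div>
  <div type="translation" xml:lang="en">
    <l>I sing of arms and the man,</l>
    <l>who first from the shores of Troy</l>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)
original, translation = root.findall(TEI_NS + "div")

# Alignment is by position only: first line with first line, and so on.
pairs = [(a.text, b.text)
         for a, b in zip(original.iter(TEI_NS + "l"),
                         translation.iter(TEI_NS + "l"))]
print(pairs)
```

This is light to encode, but zip() stops silently at the shorter list, so a missing or extra line shifts every pairing after it.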




Option 3: Structural, aligned
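
One way to read this option, as a sketch: the same parallel divisions, but each translated line points at its source line explicitly, here through @corresp and @xml:id. The sample and the identifier scheme are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample; the ids la1, la2 are an assumed naming scheme.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="original" xml:lang="la">
    <l xml:id="la1">Arma virumque cano,</l>
    <l xml:id="la2">Troiae qui primus ab oris</l>
  </div>
  <div type="translation" xml:lang="en">
    <l corresp="#la1">I sing of arms and the man,</l>
    <l corresp="#la2">who first from the shores of Troy</l>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)

# Index every element that carries an xml:id.
by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

pairs = []
for el in root.iter(TEI_NS + "l"):
    target = el.get("corresp")
    if target:
        source = by_id[target.lstrip("#")]   # "#la1" points at xml:id="la1"
        pairs.append((source.text, el.text))

print(pairs)
```

Because the pointers are explicit, the order and number of lines in the translation no longer have to mirror the source.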




Option 4: Link Groups
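
One way to read this option, as a sketch: both texts carry @xml:id values and the alignment is recorded separately, in a <linkGrp> of <link> elements whose @target lists the ids that belong together. The sample, the ids and the placement of the <linkGrp> are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample; ids and linkGrp placement are assumptions.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="original" xml:lang="la">
    <l xml:id="la1">Arma virumque cano,</l>
    <l xml:id="la2">Troiae qui primus ab oris</l>
  </div>
  <div type="translation" xml:lang="en">
    <l xml:id="en1">I sing of arms and the man,</l>
    <l xml:id="en2">who first from the shores of Troy</l>
  </div>
  <linkGrp type="translation">
    <link target="#la1 #en1"/>
    <link target="#la2 #en2"/>
  </linkGrp>
</body>"""

root = ET.fromstring(SAMPLE)
by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

pairs = []
for link in root.iter(TEI_NS + "link"):
    # @target is a whitespace-separated list of pointers such as "#la1 #en1".
    ids = [pointer.lstrip("#") for pointer in link.get("target").split()]
    pairs.append(tuple(by_id[i].text for i in ids))

print(pairs)
```

Keeping the links apart from the texts leaves both transcriptions untouched and allows more than two witnesses per link, at the cost of maintaining ids on everything that is aligned.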




Questions?

Lunch