Details¶
This page describes how TRACE works under the hood — the tokenizer pipeline, the scoring model, and the alignment DP. Read this if you want to extend a language pack, change scoring tiers, or debug an unexpected alignment.
Architectural overview¶
text (str)
│
▼
┌─────────────────────────────────┐
│ tokenize/plaintext.pretokenize │ language-agnostic
│ - NFC │
│ - editorial bracket scan │
│ - whitespace + punctuation split │
│ - marker reattach │
└─────────────────────────────────┘
│ list[RawToken]
▼
┌─────────────────────────────────┐
│ pack.post_tokenize │ language-specific
│ (Hebrew: split maqqef compounds)│
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ pack.normalize │ language-specific
│ (Hebrew: niqqud strip, skeleton, │
│ detect abbreviation candidates) │
└─────────────────────────────────┘
│ list[Token]
▼
↪ public API tracealign.tokenize()
overrides id / position with
sequence-index keyed by seq_label
Two Token sequences then feed into the aligner:
seq_a, seq_b
│
▼
┌─────────────────────────────────┐
│ align/needleman_wunsch │
│ - tiered_score per cell │
│ - Gotoh DP (M / X / Y) │
│ - abbreviation-span lookahead │
│ - semi-global edges │
│ - backtrace + continuation │
└─────────────────────────────────┘
│
▼
AlignmentResult
Token model¶
Token is a Pydantic model with extra="forbid". Fields:
Field |
Meaning |
|---|---|
|
Stable identifier |
|
Index in the token sequence (0-based). |
|
Original surface form, niqqud and editorial markers preserved. |
|
Normalized form (niqqud stripped for Hebrew). |
|
Dict of alternative forms — Hebrew adds |
|
Free-form set of strings ( |
|
|
|
Dict for I/O-specific extras ( |
flags is deliberately a set[str], not an enum — language packs and project-specific tooling can attach their own labels without needing a core change.
The tokenizer pipeline¶
Stage 1 — NFC normalization¶
Every input is unicodedata.normalize("NFC", text) before anything else. This canonicalises composed vs decomposed forms so a niqqud-bearing Hebrew letter compares equal regardless of how the source encoded it.
Stage 2 — Editorial-marker scan¶
The default EditorialBracketRules matches:
Surface |
Resulting flag |
|---|---|
|
|
|
|
|
|
|
|
|
|
The brackets themselves are removed from the surface text; the inner content tokenizes normally but each resulting token inherits the bracket’s flag.
You can override the rules:
from tracealign.tokenize.base import EditorialBracketRules
my_rules = EditorialBracketRules(
pairs=[("{", "}", "deletion"), ("⟨", "⟩", "insertion")],
lacuna_markers=["...", "…"],
)
tokens = tracealign.tokenize(text, lang="hbo", rules=my_rules)
Stage 3 — Whitespace / punctuation split¶
Splits on whitespace and on a default punctuation set (,.;:!?،؛). A language pack contributes mid_word_chars that are not split — Hebrew adds gershayim, geresh, and maqqef. A character is otherwise considered word-internal if its Unicode category starts with L, N, or M.
Stage 4 — Marker reattach¶
The editorial flags from stage 2 are attached to whichever tokens overlap the corresponding bracket span.
Stage 5 — Language-pack post_tokenize¶
The Hebrew pack splits maqqef compounds into separate RawToken parts here, each tagged with compound_part.
Stage 6 — Language-pack normalize¶
Turns each RawToken into a Token. The Hebrew pack:
strips niqqud / te’amim from
rawto producetext;computes the
skeletonrepresentation (text with yod / waw also removed) — used for plene/defective matching;flags tokens containing gershayim as
abbreviation;looks up
textin the lexicon’s abbreviation table and attaches the expansions tometadata["abbrev_candidates"].
Scoring¶
Scoring is tiered: tiers are tried in order, and the first one that returns a non-None TierResult wins. The result becomes a Match(reason, score, details).
The Hebrew pack ships seven tiers in this order:
Order |
Reason |
When it fires |
Score |
|---|---|---|---|
1 |
|
|
1.00 |
2 |
|
|
0.95 |
3 |
|
same skeleton, differ by exactly one yod/waw |
0.85 |
4 |
|
1:1 abbreviation expansion ( |
0.85 |
5 |
|
rapidfuzz ratio ≥ 0.6 |
ratio × 0.9 |
— |
|
none of the above |
0.00 |
INSERTION and OMISSION are not scoring tiers — they’re emitted by the aligner when it inserts a gap in one of the sequences. SCRIPT_VARIANT is reserved for future cross-script use (e.g. Hebrew ↔ Greek transliteration).
Multi-token abbreviation matches (ר"י ↔ רבי ישמעאל) are not a tier — they’re a separate transition in the DP (see below).
The alignment DP¶
TRACE uses a semi-global Needleman–Wunsch with affine gap penalties (Gotoh) plus a fourth optional transition for multi-token abbreviation lookahead. Three NumPy matrices store the running scores:
Matrix |
Meaning |
|---|---|
|
best score ending with |
|
best score ending with a gap in B ( |
|
best score ending with a gap in A ( |
gap_open is the penalty for opening a gap; gap_extend is the per-position cost of continuing one. Defaults: gap_open=-2.0, gap_extend=-0.5.
Semi-global edges: by default the first and last column / row are free (X[i, 0] = 0, Y[0, j] = 0). This means leading and trailing gaps don’t cost anything — important for fragment alignment where one witness is a strict subset of the other.
Abbreviation lookahead (the fourth transition, A): when a[i-1] is flagged as an abbreviation and a span b[j-k..j] matches one of its known expansions, the cell can transition directly from (i-1, j-k) to (i, j) with the ABBREVIATION-tier score (0.85). The traceback emits one primary Match plus k − 1 continuation matches so the output stays 1:1 to consumed tokens. Continuations carry details["role"] = "continuation" and a primary_index pointer.
The traceback chooses the best of M[m, n], X[m, n], Y[m, n] (or, in semi-global mode, the best value along the bottom row / right column). Tokens left over after the chosen traceback start are appended as INSERTION (rim of A) or OMISSION (rim of B) without DP penalty.
Reproducibility¶
AlignmentResult.params contains:
the input config (
gap_open,gap_extend,abbrev_lookahead,abbrev_max_span,semi_global_a/b);the language pack code and its
versionstring;trace_version(the installed package version).
This is sufficient to reproduce any alignment from the same inputs.
Package layout¶
src/tracealign/
__init__.py # public API: tokenize, align, get_language, list_languages, register_language
model.py # Token, Match, AlignmentResult, Lexica, Reason
tokenize/
base.py # RawToken, EditorialBracketRules
plaintext.py # the 4-stage language-agnostic pretokenizer
lang/
base.py # LanguagePack ABC, ScoringTier, TierResult
registry.py # register_language, get_language, list_languages
hebrew/
pack.py # HebrewLanguagePack
tokenize.py # gershayim mid-word, maqqef split
normalize.py # niqqud strip, skeleton
scoring.py # the five tier predicates
data/ # seed JSON lexica
score/
tiered.py # generic tier-walk function
align/
needleman_wunsch.py # Gotoh DP + abbreviation lookahead
io/
result.py # AlignmentResult JSON dump/load
escriptorium.py # eScriptorium JSON importer
tei.py # TEI XML importer
Multi-witness alignment (v0.2)¶
align_multi extends the pairwise aligner to N witnesses. The pipeline is three-phase:
Phase 1 — Pairwise distances¶
Every pair of witnesses is aligned with tracealign.align() (the v0.1 pairwise aligner) and the distance is computed as 1 − total_score. The result is a symmetric N × N distance matrix; the diagonal is zero. Witness ids are sorted lexicographically before computing, making the matrix independent of dict insertion order.
Phase 2 — UPGMA guide tree¶
A binary guide tree is built from the distance matrix using UPGMA (Unweighted Pair Group Method with Arithmetic Mean). At every iteration the closest cluster pair is merged. Ties are broken on the canonical (min, max) lexicographic order of cluster members, guaranteeing determinism. The tree’s height field carries the cumulative UPGMA distance — a starting point for later stemmatic work.
Phase 3 — Progressive POA-based merge¶
The guide tree is walked in post-order to produce a canonical merge sequence (closely-related witnesses are merged first). The first witness seeds the graph as a linear chain. Each subsequent witness is aligned to the current graph via partial-order alignment (POA) — a DP over the topologically sorted graph nodes. Three transitions:
Transition |
Effect on graph |
|---|---|
Match |
Merge the new token into an existing node’s |
Insertion in sequence (gap in graph) |
Add a new node holding only this witness’s token; new edge |
Deletion (skip graph node) |
The new witness’s path bypasses this node — recorded by an edge that skips it. |
node_match_score aggregates the per-constituent tiered score across the witnesses already in the target node. The default mode "max" is permissive (CollateX-aligned); "mean" and "min" are configurable.
Correctness guarantees¶
Two properties are pinned by tests:
Lossless reconstruction. For every input witness
w, the path through the result graph yields exactly the original token sequence.Permutation invariance. The same set of witnesses in any input dict order produces the same alignment (same witness paths, same variant loci).
Data flow¶
align_multi(witnesses, lang, config)
│
▼
pairwise_distances — Phase 1: O(N²/2) pairwise alignments
│
▼
build_upgma — Phase 2: deterministic binary tree
│
▼
progressive_merge — Phase 3: post-order POA-based merge
│
▼
VariantGraph
│
├──► AlignedTable — derived view, re-anchorable
└──► MultiAlignmentResult (graph + table + guide_tree + summary + params)