Details¶

This page describes how TRACE works under the hood — the tokenizer pipeline, the scoring model, and the alignment DP. Read this if you want to extend a language pack, change scoring tiers, or debug an unexpected alignment.

Architectural overview¶

text (str)
   │
   ▼
┌─────────────────────────────────┐
│ tokenize/plaintext.pretokenize  │   language-agnostic
│  - NFC                          │
│  - editorial bracket scan       │
│  - whitespace + punctuation split │
│  - marker reattach              │
└─────────────────────────────────┘
   │ list[RawToken]
   ▼
┌─────────────────────────────────┐
│ pack.post_tokenize              │   language-specific
│  (Hebrew: split maqqef compounds)│
└─────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────┐
│ pack.normalize                  │   language-specific
│  (Hebrew: niqqud strip, skeleton, │
│   detect abbreviation candidates) │
└─────────────────────────────────┘
   │ list[Token]
   ▼
   ↪ public API tracealign.tokenize()
     overrides id / position with
     sequence-index keyed by seq_label

Two Token sequences then feed into the aligner:

seq_a, seq_b
   │
   ▼
┌─────────────────────────────────┐
│ align/needleman_wunsch          │
│  - tiered_score per cell        │
│  - Gotoh DP (M / X / Y)         │
│  - abbreviation-span lookahead   │
│  - semi-global edges            │
│  - backtrace + continuation     │
└─────────────────────────────────┘
   │
   ▼
AlignmentResult

Token model¶

Token is a Pydantic model with extra="forbid". Fields:

Field	Meaning
`id`	Stable identifier `{seq_label}:{position:06d}`, e.g. `W1:000003`.
`position`	Index in the token sequence (0-based).
`raw`	Original surface form, niqqud and editorial markers preserved.
`text`	Normalized form (niqqud stripped for Hebrew).
`representations`	Dict of alternative forms — Hebrew adds `"skeleton"` (yod/waw also removed).
`flags`	Free-form set of strings (`abbreviation`, `compound_part`, `lacuna`, editorial markers like `reconstructed`/`deletion`/`insertion`/`abbreviation_expanded`).
`source_span`	`(start, end)` byte offsets into the original text — for round-tripping back to scans / TEI markup.
`metadata`	Dict for I/O-specific extras (`witness_id`, `surface_label`, `line_pk`, `bbox`, `abbrev_candidates`, …).

flags is deliberately a set[str], not an enum — language packs and project-specific tooling can attach their own labels without needing a core change.

The tokenizer pipeline¶

Stage 1 — NFC normalization¶

Every input is unicodedata.normalize("NFC", text) before anything else. This canonicalises composed vs decomposed forms so a niqqud-bearing Hebrew letter compares equal regardless of how the source encoded it.

Stage 2 — Editorial-marker scan¶

The default EditorialBracketRules matches:

Surface	Resulting flag
`[content]`	`reconstructed`
`⟦content⟧`	`deletion`
`〈content〉`	`insertion`
`(content)`	`abbreviation_expanded`
`…`, `[…]`	`lacuna` (as a zero-content token)

The brackets themselves are removed from the surface text; the inner content tokenizes normally but each resulting token inherits the bracket’s flag.

You can override the rules:

from tracealign.tokenize.base import EditorialBracketRules

my_rules = EditorialBracketRules(
    pairs=[("{", "}", "deletion"), ("⟨", "⟩", "insertion")],
    lacuna_markers=["...", "…"],
)
tokens = tracealign.tokenize(text, lang="hbo", rules=my_rules)

Stage 3 — Whitespace / punctuation split¶

Splits on whitespace and on a default punctuation set (,.;:!?،؛). A language pack contributes mid_word_chars that are not split — Hebrew adds gershayim, geresh, and maqqef. A character is otherwise considered word-internal if its Unicode category starts with L, N, or M.

Stage 4 — Marker reattach¶

The editorial flags from stage 2 are attached to whichever tokens overlap the corresponding bracket span.

Stage 5 — Language-pack `post_tokenize`¶

The Hebrew pack splits maqqef compounds into separate RawToken parts here, each tagged with compound_part.

Stage 6 — Language-pack `normalize`¶

Turns each RawToken into a Token. The Hebrew pack:

strips niqqud / te’amim from raw to produce text;
computes the skeleton representation (text with yod / waw also removed) — used for plene/defective matching;
flags tokens containing gershayim as abbreviation;
looks up text in the lexicon’s abbreviation table and attaches the expansions to metadata["abbrev_candidates"].

Scoring¶

Scoring is tiered: tiers are tried in order, and the first one that returns a non-None TierResult wins. The result becomes a Match(reason, score, details).

The Hebrew pack ships seven tiers in this order:

Order	Reason	When it fires	Score
1	`EXACT`	`a.raw == b.raw`	1.00
2	`NIQQUD_STRIPPED`	`a.text == b.text` but `a.raw != b.raw`	0.95
3	`PLENE_DEFECTIVE`	same skeleton, differ by exactly one yod/waw	0.85
4	`ABBREVIATION`	1:1 abbreviation expansion (`ר"ש` ↔ `שמעון` style)	0.85
5	`ORTHOGRAPHIC`	rapidfuzz ratio ≥ 0.6	ratio × 0.9
—	`NO_MATCH`	none of the above	0.00

INSERTION and OMISSION are not scoring tiers — they’re emitted by the aligner when it inserts a gap in one of the sequences. SCRIPT_VARIANT is reserved for future cross-script use (e.g. Hebrew ↔ Greek transliteration).

Multi-token abbreviation matches (ר"י ↔ רבי ישמעאל) are not a tier — they’re a separate transition in the DP (see below).

The alignment DP¶

TRACE uses a semi-global Needleman–Wunsch with affine gap penalties (Gotoh) plus a fourth optional transition for multi-token abbreviation lookahead. Three NumPy matrices store the running scores:

Matrix	Meaning
`M[i, j]`	best score ending with `a[i-1]` aligned to `b[j-1]`
`X[i, j]`	best score ending with a gap in B (`a[i-1]` consumed against a gap)
`Y[i, j]`	best score ending with a gap in A (`b[j-1]` consumed against a gap)

gap_open is the penalty for opening a gap; gap_extend is the per-position cost of continuing one. Defaults: gap_open=-2.0, gap_extend=-0.5.

Semi-global edges: by default the first and last column / row are free (X[i, 0] = 0, Y[0, j] = 0). This means leading and trailing gaps don’t cost anything — important for fragment alignment where one witness is a strict subset of the other.

Abbreviation lookahead (the fourth transition, A): when a[i-1] is flagged as an abbreviation and a span b[j-k..j] matches one of its known expansions, the cell can transition directly from (i-1, j-k) to (i, j) with the ABBREVIATION-tier score (0.85). The traceback emits one primary Match plus k − 1 continuation matches so the output stays 1:1 to consumed tokens. Continuations carry details["role"] = "continuation" and a primary_index pointer.

The traceback chooses the best of M[m, n], X[m, n], Y[m, n] (or, in semi-global mode, the best value along the bottom row / right column). Tokens left over after the chosen traceback start are appended as INSERTION (rim of A) or OMISSION (rim of B) without DP penalty.

Reproducibility¶

AlignmentResult.params contains:

the input config (gap_open, gap_extend, abbrev_lookahead, abbrev_max_span, semi_global_a/b);
the language pack code and its version string;
trace_version (the installed package version).

This is sufficient to reproduce any alignment from the same inputs.

Package layout¶

src/tracealign/
  __init__.py            # public API: tokenize, align, get_language, list_languages, register_language
  model.py               # Token, Match, AlignmentResult, Lexica, Reason
  tokenize/
    base.py              # RawToken, EditorialBracketRules
    plaintext.py         # the 4-stage language-agnostic pretokenizer
  lang/
    base.py              # LanguagePack ABC, ScoringTier, TierResult
    registry.py          # register_language, get_language, list_languages
    hebrew/
      pack.py            # HebrewLanguagePack
      tokenize.py        # gershayim mid-word, maqqef split
      normalize.py       # niqqud strip, skeleton
      scoring.py         # the five tier predicates
      data/              # seed JSON lexica
  score/
    tiered.py            # generic tier-walk function
  align/
    needleman_wunsch.py  # Gotoh DP + abbreviation lookahead
  io/
    result.py            # AlignmentResult JSON dump/load
    escriptorium.py      # eScriptorium JSON importer
    tei.py               # TEI XML importer

Multi-witness alignment (v0.2)¶

align_multi extends the pairwise aligner to N witnesses. The pipeline is three-phase:

Phase 1 — Pairwise distances¶

Every pair of witnesses is aligned with tracealign.align() (the v0.1 pairwise aligner) and the distance is computed as 1 − total_score. The result is a symmetric N × N distance matrix; the diagonal is zero. Witness ids are sorted lexicographically before computing, making the matrix independent of dict insertion order.

Phase 2 — UPGMA guide tree¶

A binary guide tree is built from the distance matrix using UPGMA (Unweighted Pair Group Method with Arithmetic Mean). At every iteration the closest cluster pair is merged. Ties are broken on the canonical (min, max) lexicographic order of cluster members, guaranteeing determinism. The tree’s height field carries the cumulative UPGMA distance — a starting point for later stemmatic work.

Phase 3 — Progressive POA-based merge¶

The guide tree is walked in post-order to produce a canonical merge sequence (closely-related witnesses are merged first). The first witness seeds the graph as a linear chain. Each subsequent witness is aligned to the current graph via partial-order alignment (POA) — a DP over the topologically sorted graph nodes. Three transitions:

Transition	Effect on graph
Match	Merge the new token into an existing node’s `tokens[witness_id]`; extend the witness set on the incoming edge.
Insertion in sequence (gap in graph)	Add a new node holding only this witness’s token; new edge `prev → new`.
Deletion (skip graph node)	The new witness’s path bypasses this node — recorded by an edge that skips it.

node_match_score aggregates the per-constituent tiered score across the witnesses already in the target node. The default mode "max" is permissive (CollateX-aligned); "mean" and "min" are configurable.

Correctness guarantees¶

Two properties are pinned by tests:

Lossless reconstruction. For every input witness w, the path through the result graph yields exactly the original token sequence.
Permutation invariance. The same set of witnesses in any input dict order produces the same alignment (same witness paths, same variant loci).

Data flow¶

align_multi(witnesses, lang, config)
   │
   ▼
pairwise_distances        — Phase 1: O(N²/2) pairwise alignments
   │
   ▼
build_upgma               — Phase 2: deterministic binary tree
   │
   ▼
progressive_merge         — Phase 3: post-order POA-based merge
   │
   ▼
VariantGraph
   │
   ├──► AlignedTable      — derived view, re-anchorable
   └──► MultiAlignmentResult (graph + table + guide_tree + summary + params)