# Details

This page describes how TRACE works under the hood — the tokenizer pipeline, the scoring model, and the alignment DP. Read this if you want to extend a language pack, change scoring tiers, or debug an unexpected alignment.

## Architectural overview

```
text (str)
   │
   ▼
┌─────────────────────────────────┐
│ tokenize/plaintext.pretokenize  │   language-agnostic
│  - NFC                          │
│  - editorial bracket scan       │
│  - whitespace + punctuation split │
│  - marker reattach              │
└─────────────────────────────────┘
   │ list[RawToken]
   ▼
┌─────────────────────────────────┐
│ pack.post_tokenize              │   language-specific
│  (Hebrew: split maqqef compounds)│
└─────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────┐
│ pack.normalize                  │   language-specific
│  (Hebrew: niqqud strip, skeleton, │
│   detect abbreviation candidates) │
└─────────────────────────────────┘
   │ list[Token]
   ▼
   ↪ public API tracealign.tokenize()
     overrides id / position with
     sequence-index keyed by seq_label
```

Two Token sequences then feed into the aligner:

```
seq_a, seq_b
   │
   ▼
┌─────────────────────────────────┐
│ align/needleman_wunsch          │
│  - tiered_score per cell        │
│  - Gotoh DP (M / X / Y)         │
│  - abbreviation-span lookahead   │
│  - semi-global edges            │
│  - backtrace + continuation     │
└─────────────────────────────────┘
   │
   ▼
AlignmentResult
```

## Token model

`Token` is a Pydantic model with `extra="forbid"`. Fields:

| Field | Meaning |
|---|---|
| `id` | Stable identifier `{seq_label}:{position:06d}`, e.g. `W1:000003`. |
| `position` | Index in the token sequence (0-based). |
| `raw` | Original surface form, niqqud and editorial markers preserved. |
| `text` | Normalized form (niqqud stripped for Hebrew). |
| `representations` | Dict of alternative forms — Hebrew adds `"skeleton"` (yod/waw also removed). |
| `flags` | Free-form set of strings (`abbreviation`, `compound_part`, `lacuna`, editorial markers like `reconstructed`/`deletion`/`insertion`/`abbreviation_expanded`). |
| `source_span` | `(start, end)` byte offsets into the original text — for round-tripping back to scans / TEI markup. |
| `metadata` | Dict for I/O-specific extras (`witness_id`, `surface_label`, `line_pk`, `bbox`, `abbrev_candidates`, …). |

`flags` is deliberately a `set[str]`, not an enum — language packs and project-specific tooling can attach their own labels without needing a core change.

## The tokenizer pipeline

### Stage 1 — NFC normalization

Every input is `unicodedata.normalize("NFC", text)` before anything else. This canonicalises composed vs decomposed forms so a niqqud-bearing Hebrew letter compares equal regardless of how the source encoded it.

### Stage 2 — Editorial-marker scan

The default `EditorialBracketRules` matches:

| Surface | Resulting flag |
|---|---|
| `[content]` | `reconstructed` |
| `⟦content⟧` | `deletion` |
| `〈content〉` | `insertion` |
| `(content)` | `abbreviation_expanded` |
| `…`, `[…]` | `lacuna` (as a zero-content token) |

The brackets themselves are removed from the surface text; the inner content tokenizes normally but each resulting token inherits the bracket's flag.

You can override the rules:

```python
from tracealign.tokenize.base import EditorialBracketRules

my_rules = EditorialBracketRules(
    pairs=[("{", "}", "deletion"), ("⟨", "⟩", "insertion")],
    lacuna_markers=["...", "…"],
)
tokens = tracealign.tokenize(text, lang="hbo", rules=my_rules)
```

### Stage 3 — Whitespace / punctuation split

Splits on whitespace and on a default punctuation set (`,.;:!?،؛`). A language pack contributes `mid_word_chars` that are *not* split — Hebrew adds gershayim, geresh, and maqqef. A character is otherwise considered word-internal if its Unicode category starts with `L`, `N`, or `M`.

### Stage 4 — Marker reattach

The editorial flags from stage 2 are attached to whichever tokens overlap the corresponding bracket span.

### Stage 5 — Language-pack `post_tokenize`

The Hebrew pack splits maqqef compounds into separate `RawToken` parts here, each tagged with `compound_part`.

### Stage 6 — Language-pack `normalize`

Turns each `RawToken` into a `Token`. The Hebrew pack:

- strips niqqud / te'amim from `raw` to produce `text`;
- computes the `skeleton` representation (text with yod / waw also removed) — used for plene/defective matching;
- flags tokens containing gershayim as `abbreviation`;
- looks up `text` in the lexicon's abbreviation table and attaches the expansions to `metadata["abbrev_candidates"]`.

## Scoring

Scoring is **tiered**: tiers are tried in order, and the first one that returns a non-`None` `TierResult` wins. The result becomes a `Match(reason, score, details)`.

The Hebrew pack ships seven tiers in this order:

| Order | Reason | When it fires | Score |
|---|---|---|---|
| 1 | `EXACT` | `a.raw == b.raw` | 1.00 |
| 2 | `NIQQUD_STRIPPED` | `a.text == b.text` but `a.raw != b.raw` | 0.95 |
| 3 | `PLENE_DEFECTIVE` | same skeleton, differ by exactly one yod/waw | 0.85 |
| 4 | `ABBREVIATION` | 1:1 abbreviation expansion (`ר"ש` ↔ `שמעון` style) | 0.85 |
| 5 | `ORTHOGRAPHIC` | rapidfuzz ratio ≥ 0.6 | ratio × 0.9 |
| — | `NO_MATCH` | none of the above | 0.00 |

`INSERTION` and `OMISSION` are not scoring tiers — they're emitted by the aligner when it inserts a gap in one of the sequences. `SCRIPT_VARIANT` is reserved for future cross-script use (e.g. Hebrew ↔ Greek transliteration).

Multi-token abbreviation matches (`ר"י` ↔ `רבי ישמעאל`) are *not* a tier — they're a separate transition in the DP (see below).

## The alignment DP

TRACE uses a **semi-global Needleman–Wunsch with affine gap penalties (Gotoh)** plus a fourth optional transition for multi-token abbreviation lookahead. Three NumPy matrices store the running scores:

| Matrix | Meaning |
|---|---|
| `M[i, j]` | best score ending with `a[i-1]` aligned to `b[j-1]` |
| `X[i, j]` | best score ending with a gap in B (`a[i-1]` consumed against a gap) |
| `Y[i, j]` | best score ending with a gap in A (`b[j-1]` consumed against a gap) |

`gap_open` is the penalty for *opening* a gap; `gap_extend` is the per-position cost of continuing one. Defaults: `gap_open=-2.0`, `gap_extend=-0.5`.

**Semi-global edges**: by default the first and last column / row are free (`X[i, 0] = 0`, `Y[0, j] = 0`). This means leading and trailing gaps don't cost anything — important for fragment alignment where one witness is a strict subset of the other.

**Abbreviation lookahead** (the fourth transition, `A`): when `a[i-1]` is flagged as an `abbreviation` and a span `b[j-k..j]` matches one of its known expansions, the cell can transition directly from `(i-1, j-k)` to `(i, j)` with the `ABBREVIATION`-tier score (0.85). The traceback emits one primary `Match` plus *k − 1* continuation matches so the output stays 1:1 to consumed tokens. Continuations carry `details["role"] = "continuation"` and a `primary_index` pointer.

The traceback chooses the best of M[m, n], X[m, n], Y[m, n] (or, in semi-global mode, the best value along the bottom row / right column). Tokens left over after the chosen traceback start are appended as `INSERTION` (rim of A) or `OMISSION` (rim of B) without DP penalty.

## Reproducibility

`AlignmentResult.params` contains:

- the input config (`gap_open`, `gap_extend`, `abbrev_lookahead`, `abbrev_max_span`, `semi_global_a/b`);
- the language pack code and its `version` string;
- `trace_version` (the installed package version).

This is sufficient to reproduce any alignment from the same inputs.

## Package layout

```
src/tracealign/
  __init__.py            # public API: tokenize, align, get_language, list_languages, register_language
  model.py               # Token, Match, AlignmentResult, Lexica, Reason
  tokenize/
    base.py              # RawToken, EditorialBracketRules
    plaintext.py         # the 4-stage language-agnostic pretokenizer
  lang/
    base.py              # LanguagePack ABC, ScoringTier, TierResult
    registry.py          # register_language, get_language, list_languages
    hebrew/
      pack.py            # HebrewLanguagePack
      tokenize.py        # gershayim mid-word, maqqef split
      normalize.py       # niqqud strip, skeleton
      scoring.py         # the five tier predicates
      data/              # seed JSON lexica
  score/
    tiered.py            # generic tier-walk function
  align/
    needleman_wunsch.py  # Gotoh DP + abbreviation lookahead
  io/
    result.py            # AlignmentResult JSON dump/load
    escriptorium.py      # eScriptorium JSON importer
    tei.py               # TEI XML importer
```

## Multi-witness alignment (v0.2)

`align_multi` extends the pairwise aligner to N witnesses. The pipeline is three-phase:

### Phase 1 — Pairwise distances

Every pair of witnesses is aligned with `tracealign.align()` (the v0.1 pairwise aligner) and the distance is computed as `1 − total_score`. The result is a symmetric `N × N` distance matrix; the diagonal is zero. Witness ids are sorted lexicographically before computing, making the matrix independent of dict insertion order.

### Phase 2 — UPGMA guide tree

A binary guide tree is built from the distance matrix using **UPGMA** (Unweighted Pair Group Method with Arithmetic Mean). At every iteration the closest cluster pair is merged. Ties are broken on the canonical `(min, max)` lexicographic order of cluster members, guaranteeing determinism. The tree's `height` field carries the cumulative UPGMA distance — a starting point for later stemmatic work.

### Phase 3 — Progressive POA-based merge

The guide tree is walked in post-order to produce a canonical merge sequence (closely-related witnesses are merged first). The first witness seeds the graph as a linear chain. Each subsequent witness is aligned to the current graph via **partial-order alignment (POA)** — a DP over the topologically sorted graph nodes. Three transitions:

| Transition | Effect on graph |
|---|---|
| Match | Merge the new token into an existing node's `tokens[witness_id]`; extend the witness set on the incoming edge. |
| Insertion in sequence (gap in graph) | Add a new node holding only this witness's token; new edge `prev → new`. |
| Deletion (skip graph node) | The new witness's path bypasses this node — recorded by an edge that skips it. |

`node_match_score` aggregates the per-constituent tiered score across the witnesses already in the target node. The default mode `"max"` is permissive (CollateX-aligned); `"mean"` and `"min"` are configurable.

### Correctness guarantees

Two properties are pinned by tests:

- **Lossless reconstruction.** For every input witness `w`, the path through the result graph yields exactly the original token sequence.
- **Permutation invariance.** The same set of witnesses in any input dict order produces the same alignment (same witness paths, same variant loci).

### Data flow

```
align_multi(witnesses, lang, config)
   │
   ▼
pairwise_distances        — Phase 1: O(N²/2) pairwise alignments
   │
   ▼
build_upgma               — Phase 2: deterministic binary tree
   │
   ▼
progressive_merge         — Phase 3: post-order POA-based merge
   │
   ▼
VariantGraph
   │
   ├──► AlignedTable      — derived view, re-anchorable
   └──► MultiAlignmentResult (graph + table + guide_tree + summary + params)
```