Usage

A minimal example

import tracealign

w1 = tracealign.tokenize("שלום עולם", lang="hbo", seq_label="W1")
w2 = tracealign.tokenize("שלום עולם", lang="hbo", seq_label="W2")

result = tracealign.align(w1, w2, lang="hbo")

print(result.total_score)   # 1.0
print(dict(result.summary)) # {EXACT: 2}

Tokenizing

tracealign.tokenize(text, lang, seq_label=...) runs the full pipeline (NFC normalization → editorial-marker scan → whitespace/punctuation split → language-pack post-processing → normalization) and returns a list[Token].

The seq_label argument is what distinguishes two witnesses: the resulting token IDs start with that label (e.g. W1:000000). When you align two sequences, they must have different labels, or their token IDs will collide.
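To make the collision concrete, here is a minimal sketch that does not call tracealign at all. It assumes IDs consist of the label, a colon, and a zero-padded six-digit index, which is inferred only from the W1:000000 example above; make_ids is a hypothetical helper, not part of the library.

```python
def make_ids(seq_label: str, n_tokens: int) -> list[str]:
    """Hypothetical helper: build IDs shaped like the W1:000000 example."""
    return [f"{seq_label}:{i:06d}" for i in range(n_tokens)]


ids_a = make_ids("W1", 2)
ids_b_same = make_ids("W1", 2)   # second witness reuses the same label
ids_b_diff = make_ids("W2", 2)   # second witness gets its own label

print(ids_a)                          # ['W1:000000', 'W1:000001']
print(set(ids_a) & set(ids_b_same))   # every ID is shared between the two
print(set(ids_a) & set(ids_b_diff))   # set()
```

With a shared label, an ID no longer says which witness a token came from, so anything keyed by token ID becomes ambiguous.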

tokens = tracealign.tokenize(
    "שלום עולם רַבִּי דויד ר\"י אמר",
    lang="hbo",
    seq_label="W1",
)
for t in tokens:
    print(t.id, t.text, t.flags)

Note

Niqqud (vowel points) and te’amim (cantillation marks) survive into Token.raw but are stripped from Token.text and from the skeleton form. This lets EXACT match the raw form while NIQQUD_STRIPPED can fall back to the stripped form.
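The relationship between the raw and stripped forms can be approximated with the standard library alone. The sketch below drops every nonspacing combining mark (Unicode category Mn), which covers Hebrew points and accents; strip_marks is a hypothetical helper, and tracealign's actual stripping rules may differ.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop nonspacing combining marks (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# resh, patah, bet, hiriq, dagesh, yod (the pointed word from the example above)
raw = "\u05e8\u05b7\u05d1\u05b4\u05bc\u05d9"
stripped = strip_marks(raw)

print(len(raw), len(stripped))            # 6 3
print(stripped == "\u05e8\u05d1\u05d9")   # True
```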

Aligning

result = tracealign.align(seq_a, seq_b, lang="hbo")

The returned AlignmentResult exposes:

| Field | Description |
| --- | --- |
| matches | List of Match objects, 1:1 to consumed tokens. Multi-token abbreviation expansions emit one primary match plus k − 1 continuation matches. |
| summary | dict[Reason, int]: count of each Reason. Continuations do not inflate ABBREVIATION. |
| total_score | Normalized in [0, 1]. Computed as the sum of non-continuation match scores divided by max(len(seq_a), len(seq_b)). |
| seq_a_meta, seq_b_meta | Free-form metadata you pass when calling align(). |
| params | Snapshot of the config (gap penalties, abbrev settings) plus trace_version and language_pack_version for reproducibility. |
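The summary and total_score rules can be worked through by hand. The sketch below uses plain tuples rather than tracealign's Match objects, and the scores are invented for illustration; it only applies the two rules stated above (continuations are not counted, and the sum is divided by the longer sequence length).

```python
from collections import Counter

# (reason, role, score) triples. Scenario: a two-token witness whose first
# token is an abbreviation, aligned against a three-token witness.
matches = [
    ("ABBREVIATION", "primary", 0.5),       # score is made up
    ("ABBREVIATION", "continuation", 0.5),  # excluded from both tallies
    ("EXACT", None, 1.0),
]
len_a, len_b = 2, 3

counted = [m for m in matches if m[1] != "continuation"]
summary = Counter(reason for reason, _, _ in counted)
total_score = sum(score for _, _, score in counted) / max(len_a, len_b)

print(dict(summary))   # {'ABBREVIATION': 1, 'EXACT': 1}
print(total_score)     # 0.5
```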

Inspecting matches

for m in result.matches:
    a = m.token_a.text if m.token_a else "—"
    b = m.token_b.text if m.token_b else "—"
    print(f"{a:>10} | {b:<10}  {m.reason.value:<18} {m.score:.2f}")

m.details carries Reason-specific extra information. For ABBREVIATION matches that’s role: "primary" or "continuation", the expansion string (e.g. "רבי ישמעאל"), and span_size. For ORTHOGRAPHIC matches it’s the rapidfuzz ratio.

Multi-witness alignment

tracealign.align_multi(witnesses, lang, config=None) aligns N witness sequences simultaneously and returns a canonical variant graph plus a derived aligned table.

import tracealign

witnesses = {
    "W1": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W1"),
    "W2": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W2"),
    "W3": tracealign.tokenize("שלום עולם ר\"י אמר", lang="hbo", seq_label="W3"),
}

result = tracealign.align_multi(witnesses, lang="hbo")

print(result.guide_tree.format_text())
print(result.table.format_text())

for node in result.graph.variants():
    readings = {wid: t.text for wid, t in node.tokens.items()}
    print(node.id, readings)

The result exposes:

| Attribute | Description |
| --- | --- |
| result.graph | The canonical VariantGraph (DAG). Use witness_path(w) to get one witness's trail; variants() to iterate variant loci. |
| result.table | The derived AlignedTable. Use re_anchor(witness_id) to render with any witness as the reference column. |
| result.guide_tree | The UPGMA GuideTree. Carries the original distance matrix for downstream use. |
| result.witness_ids | List of witness ids, sorted lexicographically. |
| result.summary | Aggregated Reason counts (may be empty in 0.2.0; richer aggregation in later patches). |
| result.params | Configuration snapshot plus trace_version and language_pack_version. |
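For readers unfamiliar with UPGMA, the following is a minimal, generic sketch of the clustering step behind a guide tree: repeatedly merge the two closest clusters and give the merged cluster the size-weighted average distance to every other cluster. The distances are invented, the upgma function is hypothetical, and how tracealign derives its distance matrix or represents its GuideTree is not shown here.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return a nested-tuple tree; dist maps frozenset({a, b}) -> distance."""
    clusters = {label: 1 for label in labels}   # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Invented pairwise distances: W1 and W2 are identical, W3 differs from both.
distances = {
    frozenset(("W1", "W2")): 0.0,
    frozenset(("W1", "W3")): 0.4,
    frozenset(("W2", "W3")): 0.4,
}
tree = upgma(["W1", "W2", "W3"], distances)
print(tree)   # ('W3', ('W1', 'W2'))
```

The two closest witnesses are joined first, and the more divergent one attaches last.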

Configuration

from tracealign import MultiAlignerConfig
from tracealign.align import AlignerConfig

cfg = MultiAlignerConfig(
    pairwise=AlignerConfig(gap_open=-2.5),
    node_match="max",                    # also "mean" or "min"
    guide_tree_method="upgma",
    gap_penalty_multi=-2.0,
)
result = tracealign.align_multi(witnesses, lang="hbo", config=cfg)

Persistence

from tracealign.io import multi_result as mr_io

mr_io.dump(result, "alignment.json")
restored = mr_io.load("alignment.json")

JSON round-trip preserves the entire result including the guide tree’s distance matrix.

I/O

JSON round-trip

from tracealign.io import result as result_io

result_io.dump(result, "out.json")
restored = result_io.load("out.json")

dumps(result) -> str and loads(payload) -> AlignmentResult are also available for in-memory round-trips.

eScriptorium JSON exports

from tracealign.io.escriptorium import load as load_escr

tokens = load_escr("witness1.json", lang="hbo")

Expects an export with a top-level witness_id, a regions array whose entries have label and a lines array, each line carrying content plus optional line_pk and bbox. The eScriptorium-specific fields are preserved on each Token.metadata so you can map alignment matches back to scan coordinates.
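As an illustration of the expected shape, the sketch below builds a minimal export with the fields named above and serializes it. All values are hypothetical, and the types chosen for line_pk (an integer) and bbox (a list of four numbers) are assumptions, not something this page specifies.

```python
import json

export = {
    "witness_id": "W1",                       # hypothetical value
    "regions": [
        {
            "label": "main",                  # hypothetical region label
            "lines": [
                # A line carrying both optional fields (types assumed).
                {"content": "שלום עולם", "line_pk": 101, "bbox": [10, 20, 300, 60]},
                # A line with content only.
                {"content": "רבי דוד אמר"},
            ],
        }
    ],
}

payload = json.dumps(export, ensure_ascii=False, indent=2)
print(payload)
```

Writing payload to a file such as witness1.json would give a document of the shape described above.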

TEI XML

from tracealign.io.tei import load as load_tei

a = load_tei("W1.xml", lang="hbo", seq_label="W1")
b = load_tei("W2.xml", lang="hbo", seq_label="W2")

If the TEI body contains <tei:w> elements, each <w> is treated as one token. Otherwise, the body's flow text is tokenized through the standard plaintext pipeline.
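The branching rule can be sketched with the standard library. surface_tokens is a hypothetical helper that only mirrors the decision described above; a plain whitespace split stands in for the real plaintext pipeline, which does considerably more.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def surface_tokens(xml_text: str) -> list[str]:
    """One token per <w> if any exist, else whitespace-split flow text."""
    root = ET.fromstring(xml_text)
    body = root.find(".//tei:body", TEI_NS)
    words = body.findall(".//tei:w", TEI_NS)
    if words:
        return ["".join(w.itertext()).strip() for w in words]
    return "".join(body.itertext()).split()


with_w = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    "<w>שלום</w> <w>עולם</w>"
    "</p></body></text></TEI>"
)
without_w = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>שלום עולם</p>"
    "</body></text></TEI>"
)

print(surface_tokens(with_w))
print(surface_tokens(without_w))
```

Both documents yield the same two surface tokens here; they would diverge once a <w> element contains internal whitespace or punctuation.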

Custom lexica

The Hebrew pack ships with a seed lexicon (six rabbinic abbreviations, two plene/defective pairs). You almost certainly want to extend it with project-specific entries:

from tracealign.lang.hebrew.pack import HebrewLanguagePack
from tracealign.lang.registry import register_language
from tracealign.model import Lexica

extra = Lexica(
    abbreviations={"רמב\"ם": ["רבי משה בן מימון"]},
    plene_defective_pairs=[("ירושלים", "ירושלם")],
)
pack = HebrewLanguagePack()
pack.lexica = pack.lexica.merge(extra)
register_language(pack)  # replaces the auto-registered one

Lexica.merge() is union-on-conflict: entries present on both sides are combined rather than overwritten, with order-preserving deduplication for both abbreviation expansions and plene/defective pairs.
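What a union with order-preserving deduplication looks like can be shown on plain dicts and lists. The two functions below are hypothetical stand-ins written for this illustration; they are one reading of the merge semantics described above, not the Lexica implementation.

```python
YISHMAEL = "רבי ישמעאל"
YEHUDA = "רבי יהודה"
RAMBAM = "רבי משה בן מימון"


def merge_abbreviations(base, extra):
    """Union expansions per abbreviation, keeping first-seen order."""
    merged = {abbr: list(exps) for abbr, exps in base.items()}
    for abbr, exps in extra.items():
        target = merged.setdefault(abbr, [])
        for exp in exps:
            if exp not in target:
                target.append(exp)
    return merged


def merge_pairs(base, extra):
    """Concatenate pair lists and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(tuple(p) for p in [*base, *extra]))


abbrevs = merge_abbreviations(
    {'ר"י': [YISHMAEL]},
    {'ר"י': [YEHUDA, YISHMAEL], 'רמב"ם': [RAMBAM]},
)
pairs = merge_pairs([("דויד", "דוד")], [("דויד", "דוד"), ("ירושלים", "ירושלם")])

print(abbrevs)
print(pairs)
```

The shared abbreviation ends up with both expansions, the original one first, and the repeated pair appears only once.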

You can also load lexica from JSON files:

lex = Lexica.load({
    "abbreviations": "my_abbrev.json",
    "plene_defective_pairs": "my_plene.json",
})

JSON shapes (the abbreviations file first, then the plene/defective pairs file):

{"ר\"י": ["רבי ישמעאל", "רבי יהודה"]}
[["דויד", "דוד"], ["משיח", "מאשיח"]]

Configuring the aligner

from tracealign.align import AlignerConfig

cfg = AlignerConfig(
    gap_open=-2.5,
    gap_extend=-0.4,
    abbrev_lookahead=True,
    abbrev_max_span=5,
)
result = tracealign.align(a, b, lang="hbo", config=cfg)

Defaults (gap_open=-2.0, gap_extend=-0.5, abbrev_max_span=4, semi-global on both sides) work well for the Hebrew pack out of the box. Adjust them if you see persistent gap misplacements or if your texts use abbreviations that expand to more than four tokens.
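To get a feel for how gap_open and gap_extend interact, the sketch below scores a gap of a given length under one common affine convention: the first gapped token costs gap_open and each further one costs gap_extend. That convention is an assumption; this page does not state which formula tracealign uses, and gap_cost is a hypothetical helper.

```python
def gap_cost(length: int, gap_open: float = -2.0, gap_extend: float = -0.5) -> float:
    """Affine gap score: gap_open once, then gap_extend per extra token."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# Compare the defaults with the custom values from the example above.
for n in (1, 2, 4):
    default = gap_cost(n)
    custom = gap_cost(n, gap_open=-2.5, gap_extend=-0.4)
    print(f"gap of {n}: default {default:.1f}, custom {custom:.1f}")
```

Under this convention, a more negative gap_open discourages starting gaps, while a less negative gap_extend makes long gaps comparatively cheaper.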