Usage¶
A minimal example¶
import tracealign
w1 = tracealign.tokenize("שלום עולם", lang="hbo", seq_label="W1")
w2 = tracealign.tokenize("שלום עולם", lang="hbo", seq_label="W2")
result = tracealign.align(w1, w2, lang="hbo")
print(result.total_score) # 1.0
print(dict(result.summary)) # {EXACT: 2}
Tokenizing¶
tracealign.tokenize(text, lang, seq_label=...) runs the full pipeline (NFC normalization → editorial-marker scan → whitespace/punctuation split → language-pack post-processing → normalization) and returns a list[Token].
The seq_label argument is what distinguishes two witnesses: the resulting token IDs start with that label (e.g. W1:000000). When you align two sequences, they must have different labels; otherwise their token IDs will collide.
tokens = tracealign.tokenize(
    "שלום עולם רַבִּי דויד ר\"י אמר",
    lang="hbo",
    seq_label="W1",
)
for t in tokens:
    print(t.id, t.text, t.flags)
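To see why two witnesses need distinct labels, here is a minimal sketch of the ID scheme. It assumes an ID is the label, a colon, and a zero-padded six-digit position, which is inferred from the W1:000000 example and not taken from the library's source. make_ids is a hypothetical helper and is not part of tracealign.

```python
def make_ids(seq_label: str, n_tokens: int) -> list[str]:
    """Hypothetical helper: build IDs shaped like the W1:000000 example."""
    return [f"{seq_label}:{i:06d}" for i in range(n_tokens)]


ids_a = make_ids("W1", 2)
ids_b = make_ids("W2", 2)
ids_clash = make_ids("W1", 2)  # second witness reusing the first label

print(ids_a)
print(set(ids_a) & set(ids_b))      # distinct labels: no shared IDs
print(set(ids_a) & set(ids_clash))  # same label: every ID is shared
```

With a shared label, nothing in the ID says which witness a token came from.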
Note
Niqqud (vowel points) and te’amim (cantillation marks) survive into Token.raw but are stripped from Token.text and from the skeleton form. This lets EXACT match the raw form while NIQQUD_STRIPPED can fall back to the stripped form.
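The raw/stripped distinction can be sketched with the standard library alone. This is an illustration, not tracealign's implementation: it assumes "stripped" means dropping every nonspacing combining mark (Unicode category Mn) after NFC normalization, which covers the Hebrew points and cantillation marks. The library's own rule may be narrower or broader. strip_marks is a hypothetical name.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """NFC-normalize, then drop nonspacing combining marks (category Mn)."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Mn")


# resh + patah, bet + hiriq + dagesh, yod: the pointed word from the example above
raw = "\u05e8\u05b7\u05d1\u05b4\u05bc\u05d9"
stripped = strip_marks(raw)
print(raw, "->", stripped)

# NFC also puts combining marks into canonical order, so two witnesses that
# typed the same points in a different order compare equal after normalization.
order_a = unicodedata.normalize("NFC", "\u05d1\u05bc\u05b4")  # dagesh, then hiriq
order_b = unicodedata.normalize("NFC", "\u05d1\u05b4\u05bc")  # hiriq, then dagesh
print(order_a == order_b)
```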
Aligning¶
result = tracealign.align(seq_a, seq_b, lang="hbo")
The returned AlignmentResult exposes:
| Field | Description |
|---|---|
| | List of |
| | |
| | Normalized in |
| | Free-form metadata you pass when calling |
| | Snapshot of the config (gap penalties, abbrev settings) plus |
Inspecting matches¶
for m in result.matches:
    a = m.token_a.text if m.token_a else "—"
    b = m.token_b.text if m.token_b else "—"
    print(f"{a:>10} ↔ {b:<10} {m.reason.value:<18} {m.score:.2f}")
m.details carries Reason-specific extra information. For ABBREVIATION matches that’s role: "primary" or "continuation", the expansion string (e.g. "רבי ישמעאל"), and span_size. For ORTHOGRAPHIC matches it’s the rapidfuzz ratio.
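For a feel of what the ORTHOGRAPHIC number means, here is a pure-Python sketch. It assumes the ratio in question is rapidfuzz.fuzz.ratio, which is documented as a normalized Indel similarity on a 0 to 100 scale. The code reimplements that formula via the longest common subsequence and is not rapidfuzz's own code. Whether m.details stores the value on a 0 to 100 or a 0 to 1 scale is not stated here, so check before thresholding on it.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


plene = "דויד"      # four letters
defective = "דוד"   # three letters, all shared with the plene spelling
score = indel_ratio(plene, defective)
print(round(score, 2))  # 85.71
```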
Multi-witness alignment¶
tracealign.align_multi(witnesses, lang, config=None) aligns N witness sequences simultaneously and returns a canonical variant graph plus a derived aligned table.
import tracealign
witnesses = {
    "W1": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W1"),
    "W2": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W2"),
    "W3": tracealign.tokenize("שלום עולם ר\"י אמר", lang="hbo", seq_label="W3"),
}
result = tracealign.align_multi(witnesses, lang="hbo")
print(result.guide_tree.format_text())
print(result.table.format_text())
for node in result.graph.variants():
    readings = {wid: t.text for wid, t in node.tokens.items()}
    print(node.id, readings)
The result exposes:
| Attribute | Description |
|---|---|
| | The canonical |
| | The derived |
| | The UPGMA |
| | List of witness ids, sorted lexicographically. |
| | Aggregated Reason counts (may be empty in 0.2.0; richer aggregation in later patches). |
| | Configuration snapshot plus |
Configuration¶
from tracealign import MultiAlignerConfig
from tracealign.align import AlignerConfig
cfg = MultiAlignerConfig(
    pairwise=AlignerConfig(gap_open=-2.5),
    node_match="max",  # also "mean" or "min"
    guide_tree_method="upgma",
    gap_penalty_multi=-2.0,
)
result = tracealign.align_multi(witnesses, lang="hbo", config=cfg)
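If UPGMA is unfamiliar: it builds a guide tree by repeatedly merging the two clusters with the smallest average pairwise distance. The sketch below is a generic textbook version in plain Python, not tracealign's implementation. The distance values are made up for illustration, and how the library derives distances from pairwise alignments is not covered here.

```python
from itertools import combinations


def cluster_distance(a, b, dist):
    """Mean distance over all leaf pairs drawn from clusters a and b."""
    pairs = [(x, y) for x in a for y in b]
    return sum(dist[frozenset((x, y))] for x, y in pairs) / len(pairs)


def upgma(dist):
    """Return a nested-tuple tree from a {frozenset({a, b}): distance} mapping."""
    leaves = sorted({leaf for pair in dist for leaf in pair})
    # Maps each subtree built so far to the tuple of leaves it contains.
    clusters = {leaf: (leaf,) for leaf in leaves}
    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        left, right = min(
            combinations(clusters, 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], dist),
        )
        members = clusters.pop(left) + clusters.pop(right)
        clusters[(left, right)] = members
    return next(iter(clusters))


# Illustrative distances only: W1 and W2 are identical, W3 differs from both.
distances = {
    frozenset(("W1", "W2")): 0.0,
    frozenset(("W1", "W3")): 0.4,
    frozenset(("W2", "W3")): 0.4,
}
tree = upgma(distances)
print(tree)
```

The two identical witnesses are joined first, and the divergent one attaches last.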
Persistence¶
from tracealign.io import multi_result as mr_io
mr_io.dump(result, "alignment.json")
restored = mr_io.load("alignment.json")
The JSON round-trip preserves the entire result, including the guide tree’s distance matrix.
I/O¶
JSON round-trip¶
from tracealign.io import result as result_io
result_io.dump(result, "out.json")
restored = result_io.load("out.json")
dumps(result) -> str and loads(payload) -> AlignmentResult are also available for in-memory round-trips.
eScriptorium JSON exports¶
from tracealign.io.escriptorium import load as load_escr
tokens = load_escr("witness1.json", lang="hbo")
The loader expects an export with a top-level witness_id and a regions array. Each region has a label and a lines array, and each line carries content plus optional line_pk and bbox. The eScriptorium-specific fields are preserved in each token's Token.metadata, so you can map alignment matches back to scan coordinates.
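A small document in the shape just described may help when preparing or debugging an export. The key names come from the description above. The value types (an integer line_pk, a four-number bbox, the label "main") are assumptions for illustration, and check_export is a hypothetical helper, not part of tracealign.

```python
import json

export = {
    "witness_id": "W1",
    "regions": [
        {
            "label": "main",
            "lines": [
                {"content": "שלום עולם", "line_pk": 101, "bbox": [10, 20, 300, 60]},
                {"content": "רבי דוד אמר"},  # line_pk and bbox are optional
            ],
        }
    ],
}


def check_export(doc: dict) -> list[str]:
    """List the required keys that are missing from an export-shaped dict."""
    problems = []
    if "witness_id" not in doc:
        problems.append("missing witness_id")
    for r, region in enumerate(doc.get("regions", [])):
        if "label" not in region:
            problems.append(f"region {r}: missing label")
        for n, line in enumerate(region.get("lines", [])):
            if "content" not in line:
                problems.append(f"region {r}, line {n}: missing content")
    return problems


payload = json.dumps(export, ensure_ascii=False, indent=2)
print(payload)
print(check_export(json.loads(payload)))
```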
TEI XML¶
from tracealign.io.tei import load as load_tei
a = load_tei("W1.xml", lang="hbo", seq_label="W1")
b = load_tei("W2.xml", lang="hbo", seq_label="W2")
If the TEI body contains <tei:w> elements, each <w> element is treated as one token. If it does not, the body’s flow text is tokenized through the standard plaintext pipeline.
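The two TEI cases can be illustrated with xml.etree.ElementTree. This is a sketch of the branching logic only, not tracealign's loader: body_tokens is a hypothetical function, and a plain whitespace split stands in for the plaintext pipeline.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

WITH_W = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><w>שלום</w> <w>עולם</w></p></body></text>
</TEI>"""

WITHOUT_W = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>שלום עולם</p></body></text>
</TEI>"""


def body_tokens(xml_text: str) -> list[str]:
    """Return one string per <w> element, or split the flow text if there are none."""
    root = ET.fromstring(xml_text)
    body = root.find(".//tei:body", TEI_NS)
    words = body.findall(".//tei:w", TEI_NS)
    if words:
        # Pre-segmented witness: trust the editor's word markup.
        return ["".join(w.itertext()).strip() for w in words]
    # No <w> markup: fall back to the running text (stand-in whitespace split).
    return "".join(body.itertext()).split()


print(body_tokens(WITH_W))
print(body_tokens(WITHOUT_W))
```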
Custom lexica¶
The Hebrew pack ships with a seed lexicon (six rabbinic abbreviations, two plene/defective pairs). You almost certainly want to extend it with project-specific entries:
from tracealign.lang.hebrew.pack import HebrewLanguagePack
from tracealign.lang.registry import register_language
from tracealign.model import Lexica
extra = Lexica(
    abbreviations={"רמב\"ם": ["רבי משה בן מימון"]},
    plene_defective_pairs=[("ירושלים", "ירושלם")],
)
pack = HebrewLanguagePack()
pack.lexica = pack.lexica.merge(extra)
register_language(pack) # replaces the auto-registered one
Lexica.merge() is union-on-conflict: entries from both sides are kept, with order-preserving deduplication for both abbreviation expansions and plene/defective pairs.
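One way to picture these merge semantics, using plain dicts and lists in place of Lexica. This is a sketch of one reading of "union-on-conflict with order-preserving deduplication", not the library's code, and both helper names are hypothetical.

```python
def merge_unique(first: list, second: list) -> list:
    """Concatenate two lists, keeping only the first occurrence of each item."""
    return list(dict.fromkeys(first + second))


def merge_abbreviations(base: dict, extra: dict) -> dict:
    """Union the expansion lists key by key; neither side overwrites the other."""
    merged = {key: list(values) for key, values in base.items()}
    for key, values in extra.items():
        merged[key] = merge_unique(merged.get(key, []), values)
    return merged


base = {'ר"י': ["רבי ישמעאל"]}
extra = {'ר"י': ["רבי יהודה", "רבי ישמעאל"], 'רמב"ם': ["רבי משה בן מימון"]}
merged = merge_abbreviations(base, extra)
print(merged)

pairs = merge_unique(
    [("דויד", "דוד")],
    [("ירושלים", "ירושלם"), ("דויד", "דוד")],
)
print(pairs)
```

Under this reading, an expansion that appears on both sides is kept once, in the position where it first occurred.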
You can also load lexica from JSON files:
lex = Lexica.load({
    "abbreviations": "my_abbrev.json",
    "plene_defective_pairs": "my_plene.json",
})
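JSON shapes (first the abbreviations file, an object mapping each abbreviation to its expansions; then the plene/defective file, an array of two-element pairs):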
{"ר\"י": ["רבי ישמעאל", "רבי יהודה"]}
[["דויד", "דוד"], ["משיח", "מאשיח"]]
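If you build these files from Python, the standard json module is enough. The sketch assumes the object shape belongs to the abbreviations file and the array-of-pairs shape to the plene/defective file, as the key names suggest. It writes to a temporary directory so it can be run anywhere; point the paths at your project instead.

```python
import json
import tempfile
from pathlib import Path

abbreviations = {'ר"י': ["רבי ישמעאל", "רבי יהודה"]}
plene_defective_pairs = [["דויד", "דוד"], ["משיח", "מאשיח"]]

out_dir = Path(tempfile.mkdtemp())
abbrev_path = out_dir / "my_abbrev.json"
plene_path = out_dir / "my_plene.json"

# ensure_ascii=False keeps the Hebrew readable instead of \uXXXX escapes.
abbrev_path.write_text(
    json.dumps(abbreviations, ensure_ascii=False, indent=2), encoding="utf-8"
)
plene_path.write_text(
    json.dumps(plene_defective_pairs, ensure_ascii=False, indent=2), encoding="utf-8"
)

print(abbrev_path.read_text(encoding="utf-8"))
print(plene_path.read_text(encoding="utf-8"))
```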
Configuring the aligner¶
from tracealign.align import AlignerConfig
cfg = AlignerConfig(
    gap_open=-2.5,
    gap_extend=-0.4,
    abbrev_lookahead=True,
    abbrev_max_span=5,
)
result = tracealign.align(a, b, lang="hbo", config=cfg)
The defaults (gap_open=-2.0, gap_extend=-0.5, abbrev_max_span=4, semi-global on both sides) work well for the Hebrew pack out of the box. Adjust them if you see persistent gap misplacements, or raise abbrev_max_span if your texts use abbreviations that expand to more than four tokens.
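To get a feel for how gap_open and gap_extend trade off, here is a small sketch of an affine gap score. It assumes the common convention in which the first gap position costs gap_open and each further position costs gap_extend. Whether tracealign uses exactly this convention is not stated here, so treat the numbers as a way to compare settings and not as the scores the aligner will report. affine_gap_score is a hypothetical helper.

```python
def affine_gap_score(length: int, gap_open: float = -2.0, gap_extend: float = -0.5) -> float:
    """Score a run of `length` consecutive gap positions.

    Assumed convention: the first position costs gap_open,
    each additional position costs gap_extend.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# Compare the defaults with the custom values from the example above.
for n in (1, 2, 4):
    default = affine_gap_score(n)
    custom = affine_gap_score(n, gap_open=-2.5, gap_extend=-0.4)
    print(f"gap length {n}: default {default:.1f}, custom {custom:.1f}")
```

Under this convention, a more negative gap_open discourages starting new gaps, while a less negative gap_extend makes long runs comparatively cheaper.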