# Usage

## A minimal example

```python
import tracealign

w1 = tracealign.tokenize("שלום עולם", lang="hbo", seq_label="W1")
w2 = tracealign.tokenize("שלום עולם", lang="hbo", seq_label="W2")

result = tracealign.align(w1, w2, lang="hbo")
print(result.total_score)    # 1.0
print(dict(result.summary))  # {EXACT: 2}
```

## Tokenizing

`tracealign.tokenize(text, lang, seq_label=...)` runs the full pipeline (NFC normalization → editorial-marker scan → whitespace/punctuation split → language-pack post-processing → normalization) and returns a `list[Token]`.

The `seq_label` argument is what distinguishes two witnesses: the resulting token IDs start with that label (e.g. `W1:000000`). Two sequences that you align must have different labels, otherwise their token IDs will collide.

```python
tokens = tracealign.tokenize(
    "שלום עולם רַבִּי דויד ר\"י אמר",
    lang="hbo",
    seq_label="W1",
)
for t in tokens:
    print(t.id, t.text, t.flags)
```

```{note}
Niqqud (vowel points) and te'amim (cantillation marks) survive into `Token.raw` but are stripped from `Token.text` and from the skeleton form. This lets EXACT match the raw form while NIQQUD_STRIPPED can fall back to the stripped form.
```

## Aligning

```python
result = tracealign.align(seq_a, seq_b, lang="hbo")
```

The returned `AlignmentResult` exposes:

| Field | Description |
|---|---|
| `matches` | List of `Match` objects, 1:1 to consumed tokens. Multi-token abbreviation expansions emit one primary match plus *k − 1* continuation matches. |
| `summary` | `dict[Reason, int]`: count of each Reason. Continuations do not inflate ABBREVIATION. |
| `total_score` | Normalized in `[0, 1]`. Computed as the sum of non-continuation match scores divided by `max(len(seq_a), len(seq_b))`. |
| `seq_a_meta`, `seq_b_meta` | Free-form metadata you pass when calling `align()`. |
| `params` | Snapshot of the config (gap penalties, abbrev settings) plus `trace_version` and `language_pack_version` for reproducibility. |

## Inspecting matches

```python
for m in result.matches:
    a = m.token_a.text if m.token_a else "—"
    b = m.token_b.text if m.token_b else "—"
    print(f"{a:>10} ↔ {b:<10} {m.reason.value:<18} {m.score:.2f}")
```

`m.details` carries Reason-specific extra information. For ABBREVIATION matches that is `role` (`"primary"` or `"continuation"`), the `expansion` string (e.g. `"רבי ישמעאל"`), and `span_size`. For ORTHOGRAPHIC matches it is the rapidfuzz ratio.

## Multi-witness alignment

`tracealign.align_multi(witnesses, lang, config=None)` aligns N witness sequences simultaneously and returns a canonical variant graph plus a derived aligned table.

```python
import tracealign

witnesses = {
    "W1": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W1"),
    "W2": tracealign.tokenize("שלום עולם רבי דוד אמר", lang="hbo", seq_label="W2"),
    "W3": tracealign.tokenize("שלום עולם ר\"י אמר", lang="hbo", seq_label="W3"),
}

result = tracealign.align_multi(witnesses, lang="hbo")

print(result.guide_tree.format_text())
print(result.table.format_text())

for node in result.graph.variants():
    readings = {wid: t.text for wid, t in node.tokens.items()}
    print(node.id, readings)
```

The result exposes:

| Attribute | Description |
|---|---|
| `result.graph` | The canonical `VariantGraph` (DAG). Use `witness_path(w)` to get one witness's trail and `variants()` to iterate over variant loci. |
| `result.table` | The derived `AlignedTable`. Use `re_anchor(witness_id)` to render with any witness as the reference column. |
| `result.guide_tree` | The UPGMA `GuideTree`. Carries the original distance matrix for downstream use. |
| `result.witness_ids` | List of witness IDs, sorted lexicographically. |
| `result.summary` | Aggregated Reason counts (may be empty in 0.2.0; later patches add richer aggregation). |
| `result.params` | Configuration snapshot plus `trace_version` and `language_pack_version`. |

### Configuration

```python
import tracealign
from tracealign import MultiAlignerConfig
from tracealign.align import AlignerConfig

cfg = MultiAlignerConfig(
    pairwise=AlignerConfig(gap_open=-2.5),
    node_match="max",  # also "mean" or "min"
    guide_tree_method="upgma",
    gap_penalty_multi=-2.0,
)
result = tracealign.align_multi(witnesses, lang="hbo", config=cfg)
```

### Persistence

```python
from tracealign.io import multi_result as mr_io

mr_io.dump(result, "alignment.json")
restored = mr_io.load("alignment.json")
```

The JSON round-trip preserves the entire result, including the guide tree's distance matrix.

## I/O

### JSON round-trip

```python
from tracealign.io import result as result_io

result_io.dump(result, "out.json")
restored = result_io.load("out.json")
```

`dumps(result) -> str` and `loads(payload) -> AlignmentResult` are also available for in-memory round-trips.

### eScriptorium JSON exports

```python
from tracealign.io.escriptorium import load as load_escr

tokens = load_escr("witness1.json", lang="hbo")
```

The loader expects an export with a top-level `witness_id` and a `regions` array. Each region has a `label` and a `lines` array, and each line carries `content` plus optional `line_pk` and `bbox`. The eScriptorium-specific fields are preserved on each `Token.metadata`, so you can map alignment matches back to scan coordinates.

### TEI XML

```python
from tracealign.io.tei import load as load_tei

a = load_tei("W1.xml", lang="hbo", seq_label="W1")
b = load_tei("W2.xml", lang="hbo", seq_label="W2")
```

If the TEI body contains `` elements, each `` is treated as one token boundary. If it does not, the body's flow text is tokenized through the standard plaintext pipeline.

## Custom lexica

The Hebrew pack ships with a seed lexicon (six rabbinic abbreviations, two plene/defective pairs). You almost certainly want to extend it with project-specific entries:

```python
from tracealign.lang.hebrew.pack import HebrewLanguagePack
from tracealign.lang.registry import register_language
from tracealign.model import Lexica

extra = Lexica(
    abbreviations={"רמב\"ם": ["רבי משה בן מימון"]},
    plene_defective_pairs=[("ירושלים", "ירושלם")],
)

pack = HebrewLanguagePack()
pack.lexica = pack.lexica.merge(extra)
register_language(pack)  # replaces the auto-registered one
```

`Lexica.merge()` is union-on-conflict: it applies order-preserving deduplication to both abbreviation expansions and plene/defective pairs.

You can also load lexica from JSON files:

```python
from tracealign.model import Lexica

lex = Lexica.load({
    "abbreviations": "my_abbrev.json",
    "plene_defective_pairs": "my_plene.json",
})
```

JSON shapes, first for abbreviations and then for plene/defective pairs:

```json
{"ר\"י": ["רבי ישמעאל", "רבי יהודה"]}
```

```json
[["דויד", "דוד"], ["משיח", "מאשיח"]]
```

## Configuring the aligner

```python
import tracealign
from tracealign.align import AlignerConfig

cfg = AlignerConfig(
    gap_open=-2.5,
    gap_extend=-0.4,
    abbrev_lookahead=True,
    abbrev_max_span=5,
)
result = tracealign.align(a, b, lang="hbo", config=cfg)
```

The defaults (`gap_open=-2.0`, `gap_extend=-0.5`, `abbrev_max_span=4`, semi-global on both sides) work well for the Hebrew pack out of the box. Adjust them if you see persistent gap misplacements or if your texts use abbreviations that expand to more than four tokens.
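
To get a feel for how `gap_open` and `gap_extend` trade off, the sketch below tabulates gap costs under a common affine-gap convention. It is not tracealign code: `gap_cost` is a hypothetical helper, and the convention it uses (the first skipped token costs `gap_open`, each further one costs `gap_extend`) is an assumption that this page does not confirm for tracealign.

```python
def gap_cost(length: int, gap_open: float = -2.0, gap_extend: float = -0.5) -> float:
    """Hypothetical helper: affine penalty for a run of `length` skipped tokens.

    Assumed convention (not confirmed for tracealign): the first skipped token
    costs `gap_open`, every further one costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# Compare the documented defaults with the override values used in the
# AlignerConfig example (gap_open=-2.5, gap_extend=-0.4).
for k in (1, 2, 4, 6, 8):
    default = gap_cost(k)
    tuned = gap_cost(k, gap_open=-2.5, gap_extend=-0.4)
    print(f"gap of {k}: default {default:.1f}, tuned {tuned:.1f}")
```

Under this assumed convention, the override makes opening a gap more expensive and extending it cheaper, so the aligner would tolerate fewer but longer gaps; the two settings break even at a gap of six tokens.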