Books — pre-computed per-book glossary for context-correct word lookup

The book reader's word lookup used DictionaryService, a verb-conjugation index plus ~200 hand-typed words: ordinary nouns like "taza" returned nothing, and homographs always lost (tapping "como" in "como siempre" gave the verb "comer" because the verb index is checked first).

Add a glossary phase to the books pipeline (build_glossary.py): every distinct Spanish word is translated once, in its sentence context, by the same Claude-Code-subagent LLM step the pipeline already uses for chapter translation. English front matter is excluded by an ES==EN paragraph-ratio heuristic. The glossary is bundled into book_<slug>.json and is now part of the pipeline for every book.

In the app, Book carries the decoded glossary and BookReaderView resolves each tap automatically through cache -> glossary -> DictionaryService -> on-device LLM, citing which source answered so a curated glossary hit reads differently from a best-effort AI guess.

book_olly-vol2.json regenerated with a 3,658-word glossary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:44:32 -05:00
parent d0582c4ce7
commit 3ee1563cb0
10 changed files with 18669 additions and 24 deletions
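
The lookup order named in the commit message (cache -> glossary -> DictionaryService -> on-device LLM) can be illustrated with a small Python sketch. The app code itself is Swift and is not shown in this diff, so every name here (`resolve_tap`, the two lookup callables, the source labels) is hypothetical and only models the fallback order and the idea of reporting which source answered.

```python
from typing import Callable, Optional

Entry = dict[str, str]
Lookup = Callable[[str], Optional[Entry]]


def resolve_tap(
    word: str,
    cache: dict[str, tuple[Entry, str]],
    glossary: dict[str, Entry],
    dictionary_lookup: Lookup,  # hypothetical stand-in for DictionaryService
    llm_lookup: Lookup,         # hypothetical stand-in for the on-device LLM
) -> Optional[tuple[Entry, str]]:
    """Return (entry, source) from the first layer that answers, else None."""
    if word in cache:
        return cache[word]
    layers: list[tuple[str, Lookup]] = [
        ("glossary", glossary.get),
        ("dictionary", dictionary_lookup),
        ("on-device LLM", llm_lookup),
    ]
    for source, lookup in layers:
        entry = lookup(word)
        if entry is not None:
            # Cache the hit together with its source label so a repeated tap
            # still reports where the answer originally came from.
            cache[word] = (entry, source)
            return cache[word]
    return None
```

Because the glossary is consulted before the verb index, a tap on "como" returns the in-context sense when the glossary has one, and the source label lets the UI distinguish that from a fallback answer.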
@@ -17,10 +17,13 @@ This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells yo
 |---|---|---|---|
 | 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
 | 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
-| 2.5 | Claude Code subagents | Read each job's `.input.json`, translate Spanish→English, write `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/jobs/<jobid>.output.json` |
-| 3 | `bundle_book.py` | Merge `chapters.json` + every `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |
+| 2b | `build_glossary.py` | Tokenize every Spanish paragraph the same way the app does, collect the distinct words with example sentences, split into ~150-word glossary batches. **Resumable** the same way. | `build/<slug>/glossary/<jobid>.input.json` + `_pending.txt` |
+| 2.5 | Claude Code subagents | Drain **both** manifests: translate the chapter jobs *and* the glossary jobs, writing each job's `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/{jobs,glossary}/<jobid>.output.json` |
+| 3 | `bundle_book.py` | Merge `chapters.json` + every translation `*.output.json` + every glossary `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |

-`run.sh` chains 1 → 2 → 3. If Phase 2 produces pending jobs, Phase 3 still runs but bundles with empty `paragraphsEN` placeholders so you can preview app structure before translation completes. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
+`run.sh` chains 1 → 2 → 2b → 3. If Phase 2 or 2b produces pending jobs, Phase 3 still runs but bundles with placeholders so you can preview app structure before the LLM passes complete. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
+
+The glossary is the book reader's primary word-lookup source: every distinct word translated once, in context, so taps are instant, cover the whole book, and don't mis-resolve homographs (e.g. "como" as the conjunction vs. the verb *comer*). This phase is a permanent part of the pipeline — every book imported this way gets a glossary.

 ## Adding a new book

@@ -34,7 +37,11 @@ This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells yo

 3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:

-   For each pending job ID listed in `build/<slug>/jobs/_pending.txt`, hand a subagent the prompt at `build/<slug>/jobs/_prompt_template.md` with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, translates, and writes the output. Resumable — interrupted runs just leave the missing job IDs in `_pending.txt`.
+   There are **two** manifests to drain — translation and glossary:
+   - `build/<slug>/jobs/_pending.txt` with prompt `build/<slug>/jobs/_prompt_template.md`
+   - `build/<slug>/glossary/_pending.txt` with prompt `build/<slug>/glossary/_prompt_template.md`
+
+   For each pending job ID, hand a subagent the matching prompt with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, produces the translation/glossary, and writes the output. Resumable — interrupted runs just leave the missing job IDs in `_pending.txt`.

   Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.

@@ -56,16 +63,23 @@ This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells yo
 Conjuga/Scripts/books/
 ├── extract_epub.py        # Phase 1
 ├── translate_chapters.py  # Phase 2
+├── build_glossary.py      # Phase 2b
 ├── bundle_book.py         # Phase 3
 ├── run.sh                 # Orchestrator
 └── build/                 # gitignored
    └── <slug>/
        ├── chapters.json
-        └── jobs/
+        ├── jobs/                    # translation jobs
+        │   ├── _pending.txt
+        │   ├── _prompt_template.md
+        │   ├── ch01_b00.input.json
+        │   ├── ch01_b00.output.json
+        │   └── ...
+        └── glossary/                # glossary jobs (Phase 2b)
            ├── _pending.txt
            ├── _prompt_template.md
-            ├── ch01_b00.input.json
-            ├── ch01_b00.output.json
+            ├── gloss_b00.input.json
+            ├── gloss_b00.output.json
            └── ...
 ```

@@ -81,5 +95,8 @@ The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json

 - OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
 - Exercise extraction (textbook pipeline).
-- Pre-computed per-word annotations (the app uses `DictionaryService.lookup()` at runtime).
+- Per-occurrence word sense disambiguation. The glossary has one entry per
+  distinct word, translated in context; a word genuinely used in two senses in
+  the same book gets its dominant sense. The runtime `DictionaryService` + the
+  on-device LLM remain as fallbacks for anything the glossary misses.
 - Cover image extraction (covers are derived from a color hash in the app for now).
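
The README's advice to drain both manifests in agent batches of roughly 5 to 10 jobs could be planned with a helper along these lines. This is a sketch, not part of the commit: `read_pending`, `agent_batches`, `plan`, and the default of 8 jobs per agent are assumptions. Translation and glossary jobs stay in separate groups because each manifest has its own prompt template.

```python
from pathlib import Path


def read_pending(manifest: Path) -> list[str]:
    """Job IDs listed in a _pending.txt manifest, one per line ([] if absent)."""
    if not manifest.exists():
        return []
    lines = manifest.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def agent_batches(job_ids: list[str], per_agent: int = 8) -> list[list[str]]:
    """Split job IDs into consecutive groups of at most `per_agent` jobs."""
    return [job_ids[i : i + per_agent] for i in range(0, len(job_ids), per_agent)]


def plan(build_dir: Path, slug: str, per_agent: int = 8) -> dict[str, list[list[str]]]:
    """Agent batches per manifest kind: 'jobs' (translation) and 'glossary'."""
    base = build_dir / slug
    return {
        kind: agent_batches(read_pending(base / kind / "_pending.txt"), per_agent)
        for kind in ("jobs", "glossary")
    }
```

Calling `plan(Path("build"), "<slug>")` from `Conjuga/Scripts/books/` would give one list of agent batches per manifest, each of which can be handed to a subagent with the matching `_prompt_template.md`.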
@@ -0,0 +1,200 @@
+#!/usr/bin/env python3
+"""Phase 2b — build a per-book glossary job manifest.
+
+Scans chapters.json, tokenizes every Spanish paragraph the SAME way the iOS app
+does (whitespace split, lowercase, strip leading/trailing punctuation), collects
+the distinct words with a few example sentences each, and writes batched
+glossary jobs that Claude Code subagents can translate in parallel. Resumable:
+jobs whose output file already exists are skipped.
+
+Usage:
+    python3 build_glossary.py <slug> [--batch-size N] [--max-examples N]
+                                     [--build BUILD_DIR]
+
+Inputs:
+    BUILD_DIR/<slug>/chapters.json            (from extract_epub.py)
+
+Outputs:
+    BUILD_DIR/<slug>/glossary/<jobid>.input.json   (one per batch — read by subagents)
+    BUILD_DIR/<slug>/glossary/_pending.txt          (job IDs still missing output)
+    BUILD_DIR/<slug>/glossary/_prompt_template.md   (prompt for each subagent)
+
+Job input shape (.input.json):
+    {"jobId": "gloss_b00",
+     "words": [{"word": "taza", "examples": ["...", "..."]}, ...]}
+
+Subagents must write <jobid>.output.json with shape:
+    {"jobId": "gloss_b00",
+     "entries": [{"word": "taza", "baseForm": "taza",
+                  "english": "cup", "partOfSpeech": "noun"}, ...]}
+
+`entries` must contain exactly one object per input word.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import unicodedata
+from pathlib import Path
+
+
+PROMPT_TEMPLATE = """\
+You are building a Spanish->English glossary for a language-learning app.
+
+Input file: {input_path}
+Output file: {output_path}
+
+Read the input file. It contains a JSON object with a `words` array; each item
+has a `word` (a lowercase Spanish word exactly as it appears in a book) and
+`examples` (sentences from the book that use that word).
+
+For EACH word, produce one entry:
+- baseForm: the dictionary base form -- infinitive for verbs, masculine
+  singular for nouns/adjectives, the word itself for invariant words.
+- english: a concise English translation (1-4 words). Use the sense the word
+  carries in the example sentences. Many Spanish words are both a verb form
+  AND a function word -- e.g. "como" is "I eat" (verb) and "as/like"
+  (conjunction). Choose the meaning shown in the examples, not the most common
+  dictionary sense.
+- partOfSpeech: one of verb, noun, adjective, adverb, pronoun, preposition,
+  conjunction, article, interjection, numeral, proper noun, other.
+
+Write the output file as JSON with this exact shape:
+    {{"jobId": "<the jobId from the input>", "entries": [
+        {{"word": "...", "baseForm": "...", "english": "...", "partOfSpeech": "..."}}
+    ]}}
+
+`entries` MUST contain exactly one object per input word, cover every word, and
+echo each `word` back verbatim. Write nothing else to disk and produce no other
+output.
+"""
+
+SENTENCE_SPLIT = re.compile(r"(?<=[.!?…])\s+")
+
+
+def is_punct(ch: str) -> bool:
+    """True for any Unicode punctuation — matches Swift's .punctuationCharacters."""
+    return unicodedata.category(ch).startswith("P")
+
+
+def clean_word(token: str) -> str:
+    """Mirror BookReaderView.cleanWord: lowercase, strip leading/trailing
+    punctuation, trim whitespace. Accents are preserved (no folding)."""
+    t = token.lower()
+    start, end = 0, len(t)
+    while start < end and is_punct(t[start]):
+        start += 1
+    while end > start and is_punct(t[end - 1]):
+        end -= 1
+    return t[start:end].strip()
+
+
+def has_letter(s: str) -> bool:
+    return any(c.isalpha() for c in s)
+
+
+def split_sentences(paragraph: str) -> list[str]:
+    parts = SENTENCE_SPLIT.split(paragraph.strip())
+    return [p.strip() for p in parts if p.strip()]
+
+
+def is_english_front_matter(chapter: dict, threshold: float = 0.5) -> bool:
+    """True when most of a chapter's paragraphs are untranslated — i.e. it is
+    English front matter (Preface, reading guide, …) rather than Spanish story
+    content. Story chapters still have *some* identical lines (verbatim
+    `word = meaning` vocab entries), so a majority threshold separates them:
+    front matter runs ~70-100% identical, stories ~25-35%. Only detectable once
+    paragraphsEN is populated; raw extracted chapters carry none, so nothing is
+    skipped on a fresh book's first pass."""
+    es = [p.strip() for p in chapter.get("paragraphsES", [])]
+    en = [p.strip() for p in chapter.get("paragraphsEN", [])]
+    if not en or len(en) != len(es) or not es:
+        return False
+    identical = sum(1 for a, b in zip(es, en) if a == b)
+    return identical / len(es) > threshold
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("slug")
+    parser.add_argument("--batch-size", type=int, default=150)
+    parser.add_argument("--max-examples", type=int, default=3)
+    parser.add_argument("--build", type=Path, default=Path("build"))
+    args = parser.parse_args()
+
+    base = args.build / args.slug
+    chapters = json.loads((base / "chapters.json").read_text(encoding="utf-8"))
+    gloss_dir = base / "glossary"
+    gloss_dir.mkdir(parents=True, exist_ok=True)
+
+    examples: dict[str, list[str]] = {}
+    first_seen: dict[str, int] = {}
+    order = 0
+
+    skipped_front_matter = 0
+    for ch in chapters["chapters"]:
+        if is_english_front_matter(ch):
+            skipped_front_matter += 1
+            continue
+        for paragraph in ch.get("paragraphsES", []):
+            for sentence in split_sentences(paragraph):
+                cleaned = dict.fromkeys(clean_word(tok) for tok in sentence.split())  # ordered dedupe; a set's order varies per run
+                for w in cleaned:
+                    if not w or not has_letter(w):
+                        continue
+                    if w not in first_seen:
+                        first_seen[w] = order
+                        order += 1
+                        examples[w] = []
+                    bucket = examples[w]
+                    if len(bucket) < args.max_examples and sentence not in bucket:
+                        bucket.append(sentence)
+
+    words = sorted(examples.keys(), key=lambda w: first_seen[w])
+
+    pending: list[str] = []
+    completed: list[str] = []
+    total_jobs = 0
+
+    for offset in range(0, len(words), args.batch_size):
+        chunk = words[offset : offset + args.batch_size]
+        job_id = f"gloss_b{offset // args.batch_size:02d}"
+        input_path = gloss_dir / f"{job_id}.input.json"
+        output_path = gloss_dir / f"{job_id}.output.json"
+        input_path.write_text(
+            json.dumps(
+                {
+                    "jobId": job_id,
+                    "words": [{"word": w, "examples": examples[w]} for w in chunk],
+                },
+                ensure_ascii=False,
+                indent=2,
+            ),
+            encoding="utf-8",
+        )
+        total_jobs += 1
+        (completed if output_path.exists() else pending).append(job_id)
+
+    (gloss_dir / "_pending.txt").write_text(
+        "\n".join(pending) + ("\n" if pending else ""), encoding="utf-8"
+    )
+    (gloss_dir / "_prompt_template.md").write_text(
+        PROMPT_TEMPLATE.format(
+            input_path="<JOB_INPUT_PATH>", output_path="<JOB_OUTPUT_PATH>"
+        ),
+        encoding="utf-8",
+    )
+
+    print(f"Skipped front matter:  {skipped_front_matter} chapter(s)")
+    print(f"Distinct words:        {len(words)}")
+    print(f"Total glossary jobs:   {total_jobs}")
+    print(f"  Completed:           {len(completed)}")
+    print(f"  Pending:             {len(pending)}")
+    print(f"Manifest at:           {gloss_dir / '_pending.txt'}")
+    print(f"Prompt template at:    {gloss_dir / '_prompt_template.md'}")
+
+
+if __name__ == "__main__":
+    main()
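
The contract stated in the docstring and prompt above (exactly one entry per input word, each `word` echoed verbatim) could be checked before bundling with something like the following. This is a sketch and not part of the commit: `check_glossary_output` and `REQUIRED_FIELDS` are hypothetical names, and treating empty fields as problems is stricter than `bundle_book.py`, which tolerates them.

```python
from collections import Counter

REQUIRED_FIELDS = ("baseForm", "english", "partOfSpeech")


def check_glossary_output(job: dict, output: dict) -> list[str]:
    """Return human-readable problems; an empty list means the output passes."""
    problems: list[str] = []
    if output.get("jobId") != job.get("jobId"):
        problems.append(
            f"jobId mismatch: {output.get('jobId')!r} != {job.get('jobId')!r}"
        )

    expected = [item["word"] for item in job.get("words", [])]
    expected_set = set(expected)
    entries = output.get("entries", [])
    counts = Counter(entry.get("word") for entry in entries)

    # Every input word must come back exactly once.
    for word in expected:
        if counts[word] == 0:
            problems.append(f"missing entry for {word!r}")
        elif counts[word] > 1:
            problems.append(f"duplicate entries for {word!r}")

    # Nothing may come back that was not asked for (e.g. a re-cased word).
    for word in counts:
        if word not in expected_set:
            problems.append(f"unexpected entry {word!r}")

    # Assumption: empty fields are worth flagging even though bundling allows them.
    for entry in entries:
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{entry.get('word')!r} has empty {field}")
    return problems
```

Running this over each `<jobid>.input.json` / `<jobid>.output.json` pair would surface dropped or altered words before they silently become glossary misses in the app.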
@@ -7,7 +7,8 @@ Usage:

 Inputs:
    BUILD_DIR/<slug>/chapters.json
-    BUILD_DIR/<slug>/jobs/*.output.json   (from translation subagents)
+    BUILD_DIR/<slug>/jobs/*.output.json       (from translation subagents)
+    BUILD_DIR/<slug>/glossary/*.output.json   (from glossary subagents, Phase 2b)

 Output:
    DEST_DIR/book_<slug>.json
@@ -21,11 +22,16 @@ Output:
             "paragraphsES": ["...", ...],
             "paragraphsEN": ["...", ...]},
            ...
-          ]
+          ],
+          "glossary": {
+            "taza": {"baseForm": "taza", "english": "cup", "partOfSpeech": "noun"},
+            ...
+          }
        }

-If --require-all is passed, the script fails if any job is missing its output.
-Otherwise it fills missing translations with empty strings and warns.
+If --require-all is passed, the script fails if any translation OR glossary job
+is missing its output. Otherwise it fills missing translations with empty
+strings, leaves missing glossary entries out, and warns.
 """

 from __future__ import annotations
@@ -86,6 +92,35 @@ def main() -> None:
            sys.exit(1)
        print(f"WARN: {msg} — using empty strings for those paragraphs.", file=sys.stderr)

+    # Glossary (Phase 2b) — merge every glossary job's entries into one map
+    # keyed by the cleaned word the app looks up.
+    glossary_dir = base / "glossary"
+    glossary: dict[str, dict] = {}
+    glossary_missing: list[str] = []
+    if glossary_dir.exists():
+        for input_path in sorted(glossary_dir.glob("*.input.json")):
+            job_id = input_path.stem.removesuffix(".input")
+            output_path = glossary_dir / f"{job_id}.output.json"
+            if not output_path.exists():
+                glossary_missing.append(job_id)
+                continue
+            output_data = json.loads(output_path.read_text(encoding="utf-8"))
+            for entry in output_data.get("entries", []):
+                word = (entry.get("word") or "").strip()
+                if not word:
+                    continue
+                glossary[word] = {
+                    "baseForm": entry.get("baseForm") or word,
+                    "english": entry.get("english") or "",
+                    "partOfSpeech": entry.get("partOfSpeech") or "",
+                }
+    if glossary_missing:
+        msg = f"{len(glossary_missing)} glossary job(s) missing output: {glossary_missing[:5]}{'...' if len(glossary_missing) > 5 else ''}"
+        if args.require_all:
+            print(f"ERROR: {msg}", file=sys.stderr)
+            sys.exit(1)
+        print(f"WARN: {msg} — glossary will be incomplete.", file=sys.stderr)
+
    bundled_chapters: list[dict] = []
    for ch in chapters["chapters"]:
        translations = sorted(chapter_translations.get(ch["number"], []))
@@ -113,6 +148,7 @@ def main() -> None:
        "author": chapters["author"],
        "language": chapters["language"],
        "chapters": bundled_chapters,
+        "glossary": glossary,
    }

    dest_dir = (args.dest or DEFAULT_DEST).resolve()
@@ -122,6 +158,7 @@ def main() -> None:
    print(f"Wrote {out_path}")
    print(f"  Chapters:        {len(bundled_chapters)}")
    print(f"  Translated jobs: {sum(len(v) for v in chapter_translations.values())} / {sum(len(v) for v in chapter_translations.values()) + len(missing)}")
+    print(f"  Glossary words:  {len(glossary)}")


 if __name__ == "__main__":
@@ -23,11 +23,13 @@ fi
 EPUB="$1"; shift
 SLUG=""
 BATCH_SIZE="30"
+GLOSSARY_BATCH_SIZE="150"

 while [[ $# -gt 0 ]]; do
    case "$1" in
        --slug) SLUG="$2"; shift 2 ;;
        --batch-size) BATCH_SIZE="$2"; shift 2 ;;
+        --glossary-batch-size) GLOSSARY_BATCH_SIZE="$2"; shift 2 ;;
        *) echo "unknown option: $1" >&2; exit 2 ;;
    esac
 done
@@ -53,12 +55,22 @@ python3 translate_chapters.py "$SLUG" --batch-size "$BATCH_SIZE"
 PENDING_FILE="build/$SLUG/jobs/_pending.txt"
 PENDING_COUNT=$(wc -l < "$PENDING_FILE" | tr -d ' ')

+echo
+echo "=== Phase 2b: build_glossary.py ==="
+python3 build_glossary.py "$SLUG" --batch-size "$GLOSSARY_BATCH_SIZE"
+
+GLOSS_PENDING_FILE="build/$SLUG/glossary/_pending.txt"
+GLOSS_PENDING_COUNT=$(wc -l < "$GLOSS_PENDING_FILE" | tr -d ' ')
+TOTAL_PENDING=$((PENDING_COUNT + GLOSS_PENDING_COUNT))
+
 echo
 echo "=== Phase 3: bundle_book.py ==="
-if [[ "$PENDING_COUNT" -gt 0 ]]; then
-    echo "  $PENDING_COUNT translation job(s) still pending."
-    echo "  Run the Claude Code subagent translation step (see README.md), then re-run this script."
-    echo "  Bundling with empty placeholders so you can preview app structure now."
+if [[ "$TOTAL_PENDING" -gt 0 ]]; then
+    echo "  $PENDING_COUNT translation job(s) and $GLOSS_PENDING_COUNT glossary job(s) still pending."
+    echo "  Run the Claude Code subagent step (see README.md) for BOTH manifests:"
+    echo "    build/$SLUG/jobs/_pending.txt       (translation)"
+    echo "    build/$SLUG/glossary/_pending.txt   (glossary)"
+    echo "  then re-run this script. Bundling with placeholders so you can preview now."
    python3 bundle_book.py "$SLUG"
 else
    python3 bundle_book.py "$SLUG" --require-all