Add Books — read EPUB-imported books in Practice with tap-to-define

New "Books" row in the Practice tab opens a library of bundled bilingual books. Each chapter renders Spanish paragraph-by-paragraph; tap any word for a definition sheet (DictionaryService with on-device AI fallback), or toggle the toolbar button to swap to the pre-computed English translation inline.

Local-only Book + BookChapter SwiftData models added to the local container schema (reset version bumped to 5). DataLoader.seedBooks walks the bundle for `book_*.json` resources, so future books drop in without touching app code: just bundle a new JSON and bump bookDataVersion.

First book: Olly Richards' "Spanish Short Stories For Beginners Vol 2" (13 chapters, 2,646 paragraphs, bilingual).

Scripts/books/ is the repeatable pipeline for future EPUBs: extract_epub.py → translate_chapters.py (per-chapter resumable jobs) → bundle_book.py. Translation is done by parallel Claude Code subagents reading per-job input files and writing output files, so no API key is required, matching the pattern used for the textbook vocab vision pass. See Scripts/books/README.md for the full how-to.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 09:21:44 -05:00
parent ade091f108
commit 09e49bda2c
17 changed files with 6782 additions and 1 deletions
@@ -0,0 +1 @@
+build/
@@ -0,0 +1,85 @@
+# Books pipeline
+
+Turns an EPUB that ships a usable `toc.ncx` into a chapter-structured JSON file that the app bundles and reads.
+
+## TL;DR
+
+```bash
+cd Conjuga/Scripts/books
+./run.sh /path/to/book.epub --slug my-book-slug
+```
+
+This runs Phase 1 (extract) and Phase 2 (job manifest), reports how many translation jobs are pending, and bundles a preview with empty English placeholders. Run the pending jobs via Claude Code subagents (Phase 2.5 below), then re-run `./run.sh` to bundle the final file.
+
+## Phases
+
+| Phase | Script | What it does | Output |
+|---|---|---|---|
+| 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
+| 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
+| 2.5 | Claude Code subagents | Read each job's `.input.json`, translate Spanish→English, write `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/jobs/<jobid>.output.json` |
+| 3 | `bundle_book.py` | Merge `chapters.json` + every `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |
+
+`run.sh` chains 1 → 2 → 3. If Phase 2 produces pending jobs, Phase 3 still runs but bundles with empty `paragraphsEN` placeholders so you can preview app structure before translation completes. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
+
+## Adding a new book
+
+1. **Drop the EPUB** anywhere on disk.
+2. **Run Phase 1+2**:
+   ```bash
+   cd Conjuga/Scripts/books
+   ./run.sh /path/to/book.epub --slug my-book
+   ```
+   Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable `toc.ncx`), `extract_epub.py` will need a fallback heuristic — see "Open assumptions" below.
+
+3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:
+
+   For each pending job ID listed in `build/<slug>/jobs/_pending.txt`, hand a subagent the prompt at `build/<slug>/jobs/_prompt_template.md` with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, translates, and writes the output. The step is resumable: after an interrupted run, re-running `translate_chapters.py` (or `./run.sh`) regenerates `_pending.txt` with only the job IDs that still lack an output file.
+
+   Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.
+
+4. **Bundle**:
+   ```bash
+   ./run.sh /path/to/book.epub --slug my-book   # re-running pulls in the new outputs
+   # or directly:
+   python3 bundle_book.py my-book --require-all
+   ```
+   `--require-all` will fail loudly if any job is still missing.
+
+5. **Bump `bookDataVersion`** in `DataLoader.swift` so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).
+
+6. **Verify the file is bundled** in `Conjuga.xcodeproj`. The script writes `book_<slug>.json` into `Conjuga/Conjuga/` (the app target root); if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the `xcodeproj` ruby gem.
+
+## File layout
+
+```
+Conjuga/Scripts/books/
+├── extract_epub.py        # Phase 1
+├── translate_chapters.py  # Phase 2
+├── bundle_book.py         # Phase 3
+├── run.sh                 # Orchestrator
+└── build/                 # gitignored
+    └── <slug>/
+        ├── chapters.json
+        └── jobs/
+            ├── _pending.txt
+            ├── _prompt_template.md
+            ├── ch01_b00.input.json
+            ├── ch01_b00.output.json
+            └── ...
+```
+
+The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json` so the iOS app bundle includes it. (Existing `textbook_data.json` / `conjuga_data.json` use the same layout — files in the app target root rather than a Resources subgroup.)
+
+## Open assumptions
+
+- **TOC drives chapter boundaries.** If an EPUB ships without a usable `toc.ncx`, or the navMap is too granular (e.g. one navPoint per page), `extract_epub.py` will need a fallback that groups by `<h1>` headings in spine order.
+- **Spanish bold tags = inline emphasis.** The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
+- **Translation is per-paragraph 1:1.** Subagents must preserve paragraph count and order. `bundle_book.py` will warn + pad/truncate if a job's output array length doesn't match its input — but that's a sign the subagent misbehaved.
+
+## Out of scope (intentional)
+
+- OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
+- Exercise extraction (textbook pipeline).
+- Pre-computed per-word annotations (the app uses `DictionaryService.lookup()` at runtime).
+- Cover image extraction (covers are derived from a color hash in the app for now).
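The 1:1 paragraph rule under "Open assumptions" can be checked before bundling. The following is a minimal sketch, not part of the commit: `find_bad_jobs` is a hypothetical helper name, and it assumes only the `<jobid>.input.json` / `<jobid>.output.json` layout and the `paragraphsES` / `paragraphsEN` keys described in the README.

```python
import json
from pathlib import Path


def find_bad_jobs(jobs_dir: Path) -> dict[str, str]:
    """Hypothetical helper: return {job_id: problem} for every job whose
    output file is missing or whose paragraphsEN length differs from the
    paragraphsES length of its input."""
    problems: dict[str, str] = {}
    for input_path in sorted(jobs_dir.glob("*.input.json")):
        job_id = input_path.name[: -len(".input.json")]
        output_path = jobs_dir / f"{job_id}.output.json"
        if not output_path.exists():
            problems[job_id] = "missing output"
            continue
        input_data = json.loads(input_path.read_text(encoding="utf-8"))
        output_data = json.loads(output_path.read_text(encoding="utf-8"))
        expected = len(input_data["paragraphsES"])
        got = len(output_data.get("paragraphsEN", []))
        if got != expected:
            problems[job_id] = f"expected {expected} paragraphs, got {got}"
    return problems
```

Calling `find_bad_jobs(Path("build/my-book/jobs"))` after a translation round would list the jobs worth re-running before relying on the pad/truncate fallback.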
@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+"""Merge chapters.json + per-job translation outputs into the final bundled
+book_<slug>.json that the iOS app reads from its bundle.
+
+Usage:
+    python3 bundle_book.py <slug> [--build BUILD_DIR] [--dest DEST_DIR] [--require-all]
+
+Inputs:
+    BUILD_DIR/<slug>/chapters.json
+    BUILD_DIR/<slug>/jobs/*.output.json   (from translation subagents)
+
+Output:
+    DEST_DIR/book_<slug>.json
+        {
+          "slug": "...",
+          "title": "...",
+          "author": "...",
+          "language": "...",
+          "chapters": [
+            {"id": "ch1", "number": 1, "title": "Preface",
+             "paragraphsES": ["...", ...],
+             "paragraphsEN": ["...", ...]},
+            ...
+          ]
+        }
+
+If --require-all is passed, the script fails if any job is missing its output.
+Otherwise it fills missing translations with empty strings and warns.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+
+DEFAULT_DEST = Path("../../Conjuga")
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("slug")
+    parser.add_argument("--build", type=Path, default=Path("build"))
+    parser.add_argument("--dest", type=Path, default=None)
+    parser.add_argument("--require-all", action="store_true")
+    args = parser.parse_args()
+
+    base = args.build / args.slug
+    chapters = json.loads((base / "chapters.json").read_text(encoding="utf-8"))
+    jobs_dir = base / "jobs"
+
+    # Index translation jobs by chapter -> ordered (offset, paragraphsEN).
+    chapter_translations: dict[int, list[tuple[int, list[str]]]] = {}
+    missing: list[str] = []
+
+    for input_path in sorted(jobs_dir.glob("*.input.json")):
+        job_id = input_path.stem.removesuffix(".input")
+        input_data = json.loads(input_path.read_text(encoding="utf-8"))
+        output_path = jobs_dir / f"{job_id}.output.json"
+        if not output_path.exists():
+            missing.append(job_id)
+            continue
+        output_data = json.loads(output_path.read_text(encoding="utf-8"))
+        paragraphs_en = output_data.get("paragraphsEN", [])
+        expected = len(input_data["paragraphsES"])
+        if len(paragraphs_en) != expected:
+            print(
+                f"WARN: {job_id} length mismatch — got {len(paragraphs_en)}, "
+                f"expected {expected}. Padding/truncating.",
+                file=sys.stderr,
+            )
+            if len(paragraphs_en) < expected:
+                paragraphs_en = paragraphs_en + [""] * (expected - len(paragraphs_en))
+            else:
+                paragraphs_en = paragraphs_en[:expected]
+        chapter_translations.setdefault(input_data["chapter"], []).append(
+            (input_data["rangeStart"], paragraphs_en)
+        )
+
+    if missing:
+        msg = f"{len(missing)} translation job(s) missing output: {missing[:5]}{'...' if len(missing) > 5 else ''}"
+        if args.require_all:
+            print(f"ERROR: {msg}", file=sys.stderr)
+            sys.exit(1)
+        print(f"WARN: {msg} — using empty strings for those paragraphs.", file=sys.stderr)
+
+    bundled_chapters: list[dict] = []
+    for ch in chapters["chapters"]:
+        # Place each chunk at its rangeStart so a missing job leaves blank
+        # slots in the right positions instead of shifting later translations.
+        es_count = len(ch["paragraphsES"])
+        paragraphs_en: list[str] = [""] * es_count
+        for offset, en_chunk in sorted(chapter_translations.get(ch["number"], [])):
+            paragraphs_en[offset : offset + len(en_chunk)] = en_chunk
+        # Guard against chunks that run past the end of the chapter.
+        paragraphs_en = paragraphs_en[:es_count]
+        bundled_chapters.append(
+            {
+                "id": ch["id"],
+                "number": ch["number"],
+                "title": ch["title"],
+                "paragraphsES": ch["paragraphsES"],
+                "paragraphsEN": paragraphs_en,
+            }
+        )
+
+    payload = {
+        "slug": chapters["slug"],
+        "title": chapters["title"],
+        "author": chapters["author"],
+        "language": chapters["language"],
+        "chapters": bundled_chapters,
+    }
+
+    dest_dir = (args.dest or DEFAULT_DEST).resolve()
+    dest_dir.mkdir(parents=True, exist_ok=True)
+    out_path = dest_dir / f"book_{args.slug}.json"
+    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
+    print(f"Wrote {out_path}")
+    print(f"  Chapters:        {len(bundled_chapters)}")
+    print(f"  Translated jobs: {sum(len(v) for v in chapter_translations.values())} / {sum(len(v) for v in chapter_translations.values()) + len(missing)}")
+
+
+if __name__ == "__main__":
+    main()
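When some jobs are still pending, the order in which chunks are merged matters: appending chunks back to back would shift later translations into earlier slots. One way to keep English paragraphs aligned with their Spanish counterparts is to place each chunk at its `rangeStart`. The sketch below illustrates that idea in isolation; `merge_chunks` is a hypothetical name and the sample data is invented.

```python
def merge_chunks(total: int, chunks: list[tuple[int, list[str]]]) -> list[str]:
    """Hypothetical helper: place each (rangeStart, paragraphsEN) chunk at
    its offset in a list of `total` slots. Slots with no translation stay
    as empty strings; text past the end of the chapter is ignored."""
    merged = [""] * total
    for offset, paragraphs in sorted(chunks):
        for i, text in enumerate(paragraphs):
            if 0 <= offset + i < total:
                merged[offset + i] = text
    return merged


# A 5-paragraph chapter split into batches of 2, with the middle batch missing.
print(merge_chunks(5, [(0, ["a", "b"]), (4, ["e"])]))
# → ['a', 'b', '', '', 'e']
```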
@@ -0,0 +1,258 @@
+#!/usr/bin/env python3
+"""Parse an EPUB into chapters.json for the in-app Books feature.
+
+Usage:
+    python3 extract_epub.py <epub_path> [--slug SLUG] [--out OUT_DIR]
+
+Defaults:
+    SLUG    derived from the EPUB title in content.opf (ASCII-folded, lowercased, dashed)
+    OUT_DIR ./build/<slug>
+
+Output:
+    OUT_DIR/chapters.json
+        {
+          "title": "...",
+          "author": "...",
+          "language": "...",
+          "slug": "...",
+          "chapters": [
+            {"id": "ch1", "number": 1, "title": "Preface",
+             "paragraphsES": ["...", "..."]},
+            ...
+          ]
+        }
+
+How chapter grouping works:
+    1. Read content.opf manifest (id -> href) and spine (ordered idrefs).
+    2. Read toc.ncx navMap to get the ordered list of chapter (title, first-href).
+    3. For each chapter, claim every spine file from its first href up to (but
+       not including) the next chapter's first href.
+    4. For each file in the chapter's range, parse <p> elements, strip tags,
+       normalise whitespace and punctuation spacing, drop empties.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import unicodedata
+import warnings
+import zipfile
+from pathlib import Path
+from xml.etree import ElementTree as ET
+
+from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning
+
+warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
+
+
+NS = {
+    "opf": "http://www.idpf.org/2007/opf",
+    "dc": "http://purl.org/dc/elements/1.1/",
+    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
+    "xhtml": "http://www.w3.org/1999/xhtml",
+}
+
+
+def _slugify(s: str) -> str:
+    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
+    s = re.sub(r"[^a-zA-Z0-9]+", "-", s).strip("-").lower()
+    return s or "book"
+
+
+def _normalise(text: str) -> str:
+    # Replace non-breaking spaces, collapse runs of whitespace, and remove
+    # stray spaces around punctuation.
+    text = text.replace("\u00a0", " ")
+    text = re.sub(r"\s+", " ", text).strip()
+    text = re.sub(r"\s+([.,;:!?…])", r"\1", text)
+    text = re.sub(r"([¡¿])\s+", r"\1", text)
+    return text
+
+
+def _read_zip_text(zf: zipfile.ZipFile, path: str) -> str:
+    return zf.read(path).decode("utf-8")
+
+
+def _container_root(zf: zipfile.ZipFile) -> str:
+    container = ET.fromstring(_read_zip_text(zf, "META-INF/container.xml"))
+    rootfile = container.find(".//{urn:oasis:names:tc:opendocument:xmlns:container}rootfile")
+    if rootfile is None:
+        raise RuntimeError("Missing rootfile entry in META-INF/container.xml")
+    return rootfile.attrib["full-path"]
+
+
+def _parse_opf(zf: zipfile.ZipFile, opf_path: str):
+    text = _read_zip_text(zf, opf_path)
+    root = ET.fromstring(text)
+
+    title = (root.findtext(".//dc:title", default="", namespaces=NS) or "").strip()
+    author = (root.findtext(".//dc:creator", default="", namespaces=NS) or "").strip()
+    language = (root.findtext(".//dc:language", default="", namespaces=NS) or "").strip()
+
+    manifest: dict[str, str] = {}
+    for item in root.findall("opf:manifest/opf:item", NS):
+        manifest[item.attrib["id"]] = item.attrib["href"]
+
+    spine: list[str] = []
+    for itemref in root.findall("opf:spine/opf:itemref", NS):
+        spine.append(itemref.attrib["idref"])
+
+    spine_el = root.find("opf:spine", NS)
+    ncx_id = spine_el.attrib.get("toc") if spine_el is not None else None
+    ncx_href = manifest.get(ncx_id) if ncx_id else None
+
+    return {
+        "title": title,
+        "author": author,
+        "language": language,
+        "manifest": manifest,
+        "spine": spine,
+        "ncx_href": ncx_href,
+        "opf_dir": str(Path(opf_path).parent) if "/" in opf_path else "",
+    }
+
+
+def _parse_ncx(zf: zipfile.ZipFile, ncx_path: str) -> list[dict]:
+    text = _read_zip_text(zf, ncx_path)
+    root = ET.fromstring(text)
+    chapters: list[dict] = []
+    for nav in root.findall("ncx:navMap/ncx:navPoint", NS):
+        title = (nav.findtext("ncx:navLabel/ncx:text", default="", namespaces=NS) or "").strip()
+        content = nav.find("ncx:content", NS)
+        src = content.attrib.get("src", "") if content is not None else ""
+        # Strip the anchor — we want the file path only.
+        href = src.split("#", 1)[0]
+        chapters.append({"title": title, "href": href})
+    return chapters
+
+
+def _resolve_zip_path(base_dir: str, href: str) -> str:
+    if not base_dir:
+        return href
+    return f"{base_dir}/{href}".lstrip("/")
+
+
+def _extract_paragraphs(zf: zipfile.ZipFile, zip_path: str) -> list[str]:
+    try:
+        html = _read_zip_text(zf, zip_path)
+    except KeyError:
+        return []
+    soup = BeautifulSoup(html, "lxml")
+    paragraphs: list[str] = []
+    for p in soup.find_all("p"):
+        # Drop nav-anchor wrappers that contain no real text.
+        text = _normalise(p.get_text(" ", strip=True))
+        if not text:
+            continue
+        # Paragraphs that merely echo the chapter title are dropped later in
+        # main(), where the TOC title is known. Keep everything else here.
+        paragraphs.append(text)
+    return paragraphs
+
+
+def _chapter_files(
+    spine_files: list[str], chapter_hrefs: list[str]
+) -> list[list[str]]:
+    """Slice the spine into one list of files per chapter, using the chapter's
+    first href as the chapter boundary. Files before the first chapter (e.g.
+    cover, titlepage) are dropped."""
+    boundaries: list[int] = []
+    for href in chapter_hrefs:
+        try:
+            idx = spine_files.index(href)
+        except ValueError:
+            boundaries.append(-1)
+            continue
+        boundaries.append(idx)
+
+    ranges: list[list[str]] = []
+    for i, start in enumerate(boundaries):
+        if start < 0:
+            ranges.append([])
+            continue
+        end = len(spine_files)
+        for next_start in boundaries[i + 1:]:
+            if next_start >= 0:
+                end = next_start
+                break
+        ranges.append(spine_files[start:end])
+    return ranges
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("epub", type=Path)
+    parser.add_argument("--slug", default=None)
+    parser.add_argument("--out", type=Path, default=None)
+    args = parser.parse_args()
+
+    if not args.epub.exists():
+        print(f"EPUB not found: {args.epub}", file=sys.stderr)
+        sys.exit(2)
+
+    with zipfile.ZipFile(args.epub) as zf:
+        opf_path = _container_root(zf)
+        opf = _parse_opf(zf, opf_path)
+
+        if not opf["ncx_href"]:
+            print("No NCX found in spine; cannot derive chapter structure.", file=sys.stderr)
+            sys.exit(3)
+
+        ncx_path = _resolve_zip_path(opf["opf_dir"], opf["ncx_href"])
+        toc = _parse_ncx(zf, ncx_path)
+
+        spine_files = [
+            _resolve_zip_path(opf["opf_dir"], opf["manifest"].get(idref, ""))
+            for idref in opf["spine"]
+        ]
+        chapter_hrefs = [_resolve_zip_path(opf["opf_dir"], c["href"]) for c in toc]
+        chapter_file_ranges = _chapter_files(spine_files, chapter_hrefs)
+
+        chapters_out: list[dict] = []
+        for i, (meta, files) in enumerate(zip(toc, chapter_file_ranges), start=1):
+            paragraphs: list[str] = []
+            for f in files:
+                paragraphs.extend(_extract_paragraphs(zf, f))
+            # Drop leading paragraph(s) that just echo the chapter title — the
+            # title is already stored separately.
+            title_norm = _normalise(meta["title"]).lower()
+            while paragraphs and _normalise(paragraphs[0]).lower() == title_norm:
+                paragraphs.pop(0)
+            chapters_out.append(
+                {
+                    "id": f"ch{i}",
+                    "number": i,
+                    "title": meta["title"],
+                    "paragraphsES": paragraphs,
+                }
+            )
+
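+    # _slugify never returns an empty string, so fall back to the filename
+    # before slugifying when the OPF has no title.
+    slug = args.slug or _slugify(opf["title"] or args.epub.stem)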
+    out_dir = args.out or (Path("build") / slug)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    out_path = out_dir / "chapters.json"
+
+    payload = {
+        "title": opf["title"],
+        "author": opf["author"],
+        "language": opf["language"],
+        "slug": slug,
+        "chapters": chapters_out,
+    }
+    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
+
+    total_paragraphs = sum(len(c["paragraphsES"]) for c in chapters_out)
+    print(f"Wrote {out_path}")
+    print(f"  Title:      {opf['title']}")
+    print(f"  Author:     {opf['author']}")
+    print(f"  Chapters:   {len(chapters_out)}")
+    print(f"  Paragraphs: {total_paragraphs}")
+    for ch in chapters_out:
+        print(f"    ch{ch['number']:02d}  {len(ch['paragraphsES']):4d} ¶  {ch['title']}")
+
+
+if __name__ == "__main__":
+    main()
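The chapter-grouping rule in the docstring (each chapter claims the spine files from its first href up to the next chapter's first href) is easiest to see on a toy spine. This is a condensed sketch of the same rule with invented file names; `slice_spine` is a hypothetical stand-in, not the script's own function.

```python
def slice_spine(spine: list[str], starts: list[str]) -> list[list[str]]:
    """Hypothetical condensed version of the grouping rule: one list of
    spine files per chapter start; unknown starts yield an empty list."""
    indices = [spine.index(href) if href in spine else -1 for href in starts]
    groups: list[list[str]] = []
    for i, start in enumerate(indices):
        if start < 0:
            groups.append([])
            continue
        # The chapter ends where the next resolvable chapter begins.
        end = next((n for n in indices[i + 1:] if n >= 0), len(spine))
        groups.append(spine[start:end])
    return groups


spine = ["cover.xhtml", "ch1.xhtml", "ch1b.xhtml", "ch2.xhtml"]
print(slice_spine(spine, ["ch1.xhtml", "missing.xhtml", "ch2.xhtml"]))
# → [['ch1.xhtml', 'ch1b.xhtml'], [], ['ch2.xhtml']]
```

Note that `cover.xhtml` precedes the first chapter and is dropped, and that a TOC entry pointing at a file outside the spine produces an empty chapter rather than an error.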
@@ -0,0 +1,65 @@
+#!/usr/bin/env bash
+# Orchestrate the books pipeline: EPUB -> chapters.json -> per-chapter job
+# manifest -> (translation by Claude Code subagents) -> bundled book_<slug>.json.
+#
+# This script DOES NOT run the LLM translation pass. After Phase 2 it prints
+# how many jobs are pending and bundles a placeholder preview. Use Claude Code
+# subagents (per the README) to fill in build/<slug>/jobs/*.output.json, then
+# re-run this script to bundle the final file via Phase 3.
+#
+# Usage:
+#     ./run.sh <epub_path> [--slug SLUG] [--batch-size N]
+
+set -euo pipefail
+
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$HERE"
+
+if [[ $# -lt 1 ]]; then
+    echo "usage: $0 <epub_path> [--slug SLUG] [--batch-size N]" >&2
+    exit 2
+fi
+
+EPUB="$1"; shift
+SLUG=""
+BATCH_SIZE="30"
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --slug) SLUG="$2"; shift 2 ;;
+        --batch-size) BATCH_SIZE="$2"; shift 2 ;;
+        *) echo "unknown option: $1" >&2; exit 2 ;;
+    esac
+done
+
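+# Resolve the EPUB path against the caller's directory ($OLDPWD, set by the
+# `cd "$HERE"` above) so relative paths keep working.
+EPUB_ABS="$(cd "$OLDPWD" && cd "$(dirname "$EPUB")" && pwd)/$(basename "$EPUB")"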
+
+echo "=== Phase 1: extract_epub.py ==="
+if [[ -n "$SLUG" ]]; then
+    python3 extract_epub.py "$EPUB_ABS" --slug "$SLUG"
+else
+    python3 extract_epub.py "$EPUB_ABS"
+fi
+
+# If --slug wasn't passed, recover the slug from the chapters file just written.
+if [[ -z "$SLUG" ]]; then
+    SLUG=$(python3 -c "import json,glob; p=sorted(glob.glob('build/*/chapters.json'), key=lambda x: -__import__('os').path.getmtime(x))[0]; print(json.load(open(p))['slug'])")
+fi
+
+echo
+echo "=== Phase 2: translate_chapters.py ==="
+python3 translate_chapters.py "$SLUG" --batch-size "$BATCH_SIZE"
+
+PENDING_FILE="build/$SLUG/jobs/_pending.txt"
+PENDING_COUNT=$(wc -l < "$PENDING_FILE" | tr -d ' ')
+
+echo
+echo "=== Phase 3: bundle_book.py ==="
+if [[ "$PENDING_COUNT" -gt 0 ]]; then
+    echo "  $PENDING_COUNT translation job(s) still pending."
+    echo "  Run the Claude Code subagent translation step (see README.md), then re-run this script."
+    echo "  Bundling with empty placeholders so you can preview app structure now."
+    python3 bundle_book.py "$SLUG"
+else
+    python3 bundle_book.py "$SLUG" --require-all
+fi
@@ -0,0 +1,136 @@
+#!/usr/bin/env python3
+"""Split chapters.json into translation jobs that Claude Code subagents can
+process in parallel. Resumable: jobs whose output file already exists are
+skipped.
+
+Usage:
+    python3 translate_chapters.py <slug> [--batch-size N] [--build BUILD_DIR]
+
+Inputs:
+    BUILD_DIR/<slug>/chapters.json  (from extract_epub.py)
+
+Outputs:
+    BUILD_DIR/<slug>/jobs/<jobid>.input.json    (one per batch — read by subagents)
+    BUILD_DIR/<slug>/jobs/_pending.txt           (list of job IDs still missing output)
+    BUILD_DIR/<slug>/jobs/_prompt_template.md    (prompt the orchestrator hands each subagent)
+
+Job layout (.input.json):
+    {
+      "jobId": "ch06_b00",
+      "chapter": 6,
+      "chapterTitle": "1. El Castillo",
+      "rangeStart": 0,
+      "rangeEnd": 30,
+      "paragraphsES": ["...", "..."]
+    }
+
+Subagents must write `<jobid>.output.json` with shape:
+    {"jobId": "ch06_b00", "paragraphsEN": ["...", "..."]}
+
+The output array MUST have the same length as paragraphsES, in the same order.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+
+PROMPT_TEMPLATE = """\
+You are translating a chunk of a Spanish-language book into English for a
+language-learning app.
+
+Input file: {input_path}
+Output file: {output_path}
+
+Read the input file. It contains a JSON object with a `paragraphsES` array.
+Translate each paragraph into natural English. Preserve meaning, tone, and
+dialogue markers (—, –, ¡, ¿) as appropriate for the English output. Keep
+the same number of paragraphs in the same order.
+
+Notes for translation quality:
+- This is a beginner Spanish reader, so prefer plain natural English over
+  literary flourish.
+- Preserve proper nouns (character names, place names) verbatim.
+- Convert Spanish dialogue dashes (–, —) to English-style quotation marks
+  ONLY if it reads more naturally; otherwise keep them as em-dashes.
+- Do NOT add explanatory parentheticals; the in-app dictionary handles
+  per-word lookup.
+
+Write the output as JSON with shape:
+    {{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}
+
+The `paragraphsEN` array MUST be the same length and order as `paragraphsES`
+in the input. Write nothing else to disk and produce no other output.
+"""
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("slug")
+    parser.add_argument("--batch-size", type=int, default=30)
+    parser.add_argument("--build", type=Path, default=Path("build"))
+    args = parser.parse_args()
+
+    base = args.build / args.slug
+    chapters_path = base / "chapters.json"
+    jobs_dir = base / "jobs"
+    jobs_dir.mkdir(parents=True, exist_ok=True)
+
+    data = json.loads(chapters_path.read_text(encoding="utf-8"))
+
+    pending: list[str] = []
+    completed: list[str] = []
+    total_jobs = 0
+
+    for ch in data["chapters"]:
+        paragraphs = ch["paragraphsES"]
+        if not paragraphs:
+            continue
+        for offset in range(0, len(paragraphs), args.batch_size):
+            chunk = paragraphs[offset : offset + args.batch_size]
+            job_id = f"ch{ch['number']:02d}_b{offset // args.batch_size:02d}"
+            input_path = jobs_dir / f"{job_id}.input.json"
+            output_path = jobs_dir / f"{job_id}.output.json"
+
+            input_path.write_text(
+                json.dumps(
+                    {
+                        "jobId": job_id,
+                        "chapter": ch["number"],
+                        "chapterTitle": ch["title"],
+                        "rangeStart": offset,
+                        "rangeEnd": offset + len(chunk),
+                        "paragraphsES": chunk,
+                    },
+                    ensure_ascii=False,
+                    indent=2,
+                ),
+                encoding="utf-8",
+            )
+            total_jobs += 1
+            if output_path.exists():
+                completed.append(job_id)
+            else:
+                pending.append(job_id)
+
+    (jobs_dir / "_pending.txt").write_text(
+        "\n".join(pending) + ("\n" if pending else ""), encoding="utf-8"
+    )
+
+    (jobs_dir / "_prompt_template.md").write_text(
+        PROMPT_TEMPLATE.format(
+            input_path="<JOB_INPUT_PATH>",
+            output_path="<JOB_OUTPUT_PATH>",
+        ),
+        encoding="utf-8",
+    )
+
+    print(f"Total translation jobs: {total_jobs}")
+    print(f"  Completed:            {len(completed)}")
+    print(f"  Pending:              {len(pending)}")
+    print(f"Manifest at:            {jobs_dir / '_pending.txt'}")
+    print(f"Prompt template at:     {jobs_dir / '_prompt_template.md'}")
+
+
+if __name__ == "__main__":
+    main()
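To make the job naming and range fields concrete, here is a small sketch of how a chapter's paragraphs map onto job IDs under the `ch{NN}_b{NN}` scheme with the default batch size of 30. `plan_jobs` is a hypothetical helper that only computes the plan; it writes no files, and the 65-paragraph chapter is an invented example.

```python
def plan_jobs(chapter: int, n_paragraphs: int, batch_size: int = 30) -> list[tuple[str, int, int]]:
    """Hypothetical helper: return (jobId, rangeStart, rangeEnd) for each
    batch of a chapter. rangeEnd is exclusive, matching Python slicing."""
    jobs: list[tuple[str, int, int]] = []
    for offset in range(0, n_paragraphs, batch_size):
        job_id = f"ch{chapter:02d}_b{offset // batch_size:02d}"
        jobs.append((job_id, offset, min(offset + batch_size, n_paragraphs)))
    return jobs


print(plan_jobs(6, 65))
# → [('ch06_b00', 0, 30), ('ch06_b01', 30, 60), ('ch06_b02', 60, 65)]
```

Because job IDs depend on the batch size, changing `--batch-size` between runs produces a different set of IDs; clearing the old `jobs/` directory first would avoid mixing the two sets.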