Add Books — read EPUB-imported books in Practice with tap-to-define

New "Books" row in the Practice tab opens a library of bundled bilingual
books. Each chapter renders Spanish paragraph-by-paragraph; tap any
word for a definition sheet (DictionaryService with on-device AI
fallback), or toggle the toolbar button to swap to the pre-computed
English translation inline.

Local-only Book + BookChapter SwiftData models added to the local
container schema (reset version bumped to 5). DataLoader.seedBooks
walks the bundle for `book_*.json` resources, so future books drop in
without touching app code — just bundle a new JSON and bump
bookDataVersion.

First book: Olly Richards' "Spanish Short Stories For Beginners
Vol 2" — 13 chapters, 2,646 paragraphs, bilingual.

Scripts/books/ is the repeatable pipeline for future EPUBs:
extract_epub.py → translate_chapters.py (per-chapter resumable jobs) →
bundle_book.py. Translation is done by parallel Claude Code subagents
reading per-job input files and writing output files — no API key
required, matching the pattern used for the textbook vocab vision
pass. See Scripts/books/README.md for the full how-to.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey T
2026-05-11 09:21:44 -05:00
parent ade091f108
commit 09e49bda2c
17 changed files with 6782 additions and 1 deletions
+1
@@ -0,0 +1 @@
build/
+85
@@ -0,0 +1,85 @@
# Books pipeline
Turns any EPUB into a chapter-structured JSON file the app bundles and reads.
## TL;DR
```bash
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug
```
This runs Phase 1 (extract) and Phase 2 (job manifest), reports how many translation jobs are pending, and bundles a preview file with empty English placeholders. Run the pending jobs via Claude Code subagents (Phase 2.5 below), then re-run `./run.sh` to bundle the final file.
## Phases
| Phase | Script | What it does | Output |
|---|---|---|---|
| 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
| 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
| 2.5 | Claude Code subagents | Read each job's `.input.json`, translate Spanish→English, write `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/jobs/<jobid>.output.json` |
| 3 | `bundle_book.py` | Merge `chapters.json` + every `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |
`run.sh` chains 1 → 2 → 3. If Phase 2 produces pending jobs, Phase 3 still runs but bundles with empty `paragraphsEN` placeholders so you can preview app structure before translation completes. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
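
As a rough illustration of the Phase 2 batching described in the table above, the sketch below shows how one chapter's paragraph count could map onto job IDs. It follows the `ch01_b00`-style naming used under `build/<slug>/jobs/`; the helper name `job_ids` is hypothetical and is not part of the pipeline scripts.

```python
from __future__ import annotations


def job_ids(chapter_number: int, paragraph_count: int, batch_size: int = 30) -> list[str]:
    """Return the job IDs that one chapter is split into, in order.

    Hypothetical helper for illustration only: one job per batch of
    `batch_size` paragraphs, numbered from b00 within the chapter.
    """
    return [
        f"ch{chapter_number:02d}_b{offset // batch_size:02d}"
        for offset in range(0, paragraph_count, batch_size)
    ]


# A 75-paragraph chapter 6 yields three jobs: two full batches and one of 15.
print(job_ids(6, 75))  # ['ch06_b00', 'ch06_b01', 'ch06_b02']
```

A chapter with no paragraphs produces no jobs, so empty front-matter entries never reach the translation step.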
## Adding a new book
1. **Drop the EPUB** anywhere on disk.
2. **Run Phase 1+2**:
   ```bash
   cd Conjuga/Scripts/books
   ./run.sh /path/to/book.epub --slug my-book
   ```
   Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable `toc.ncx`), `extract_epub.py` will need a fallback heuristic; see "Open assumptions" below.
3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:

   For each pending job ID listed in `build/<slug>/jobs/_pending.txt`, hand a subagent the prompt at `build/<slug>/jobs/_prompt_template.md` with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, translates, and writes the output. The step is resumable: after an interrupted run, re-running `translate_chapters.py` (or `run.sh`) rewrites `_pending.txt` with only the job IDs still missing output.

   Cluster jobs into agent batches of ~5-10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.
4. **Bundle**:
   ```bash
   ./run.sh /path/to/book.epub --slug my-book # re-running pulls in the new outputs
   # or directly:
   python3 bundle_book.py my-book --require-all
   ```
   `--require-all` will fail loudly if any job is still missing.
5. **Bump `bookDataVersion`** in `DataLoader.swift` so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).
6. **Verify the file is bundled** in `Conjuga.xcodeproj`. The script writes `book_<slug>.json` into `Conjuga/Conjuga/`; if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the `xcodeproj` ruby gem.
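
The clustering advice in step 3 can be sketched as below. This is only one possible way to group pending jobs before handing them to subagents: `load_pending`, `cluster_jobs`, and the default of 8 jobs per agent are assumptions made for illustration, not part of the pipeline scripts.

```python
from __future__ import annotations

from pathlib import Path


def load_pending(pending_file: Path) -> list[str]:
    """Read job IDs from a _pending.txt file, one per line, skipping blanks."""
    lines = pending_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def cluster_jobs(job_ids: list[str], jobs_per_agent: int = 8) -> list[list[str]]:
    """Split job IDs into consecutive groups of at most `jobs_per_agent`."""
    if jobs_per_agent < 1:
        raise ValueError("jobs_per_agent must be at least 1")
    return [
        job_ids[i : i + jobs_per_agent]
        for i in range(0, len(job_ids), jobs_per_agent)
    ]


# Small inline demo; in practice the IDs would come from load_pending().
demo = ["ch01_b00", "ch01_b01", "ch02_b00"]
for n, group in enumerate(cluster_jobs(demo, jobs_per_agent=2), start=1):
    print(f"agent {n}: {', '.join(group)}")
```

Keeping groups consecutive means each agent mostly works within one chapter, which may help it keep names and tone consistent across batches.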
## File layout
```
Conjuga/Scripts/books/
├── extract_epub.py          # Phase 1
├── translate_chapters.py    # Phase 2
├── bundle_book.py           # Phase 3
├── run.sh                   # Orchestrator
└── build/                   # gitignored
    └── <slug>/
        ├── chapters.json
        └── jobs/
            ├── _pending.txt
            ├── _prompt_template.md
            ├── ch01_b00.input.json
            ├── ch01_b00.output.json
            └── ...
```
The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json` so the iOS app bundle includes it. (Existing `textbook_data.json` / `conjuga_data.json` use the same layout — files in the app target root rather than a Resources subgroup.)
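
Before bumping the data version it can be useful to confirm that the bundled file has no leftover placeholders. The following is a minimal sketch, assuming the bundled JSON has the `chapters` / `paragraphsES` / `paragraphsEN` shape that `bundle_book.py` documents; `untranslated_counts` is a hypothetical helper name.

```python
from __future__ import annotations


def untranslated_counts(book: dict) -> dict[str, int]:
    """Map chapter id -> number of empty English paragraphs.

    Chapters that are fully translated are omitted from the result.
    Raises ValueError if a chapter's two paragraph arrays differ in length.
    """
    counts: dict[str, int] = {}
    for ch in book["chapters"]:
        es, en = ch["paragraphsES"], ch["paragraphsEN"]
        if len(es) != len(en):
            raise ValueError(f"{ch['id']}: {len(es)} ES vs {len(en)} EN paragraphs")
        empty = sum(1 for paragraph in en if not paragraph.strip())
        if empty:
            counts[ch["id"]] = empty
    return counts


# Inline example: chapter 1 still has one placeholder, chapter 2 is complete.
example = {
    "chapters": [
        {"id": "ch1", "paragraphsES": ["Hola.", "Adiós."], "paragraphsEN": ["Hello.", ""]},
        {"id": "ch2", "paragraphsES": ["Gracias."], "paragraphsEN": ["Thank you."]},
    ]
}
print(untranslated_counts(example))  # {'ch1': 1}
```

To check a real file, load it with `json.loads(Path(...).read_text(encoding="utf-8"))` and pass the result in; an empty dict means every paragraph has a translation.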
## Open assumptions
- **TOC drives chapter boundaries.** If an EPUB ships without a usable `toc.ncx`, or the navMap is too granular (e.g. one navPoint per page), `extract_epub.py` will need a fallback that groups by `<h1>` headings in spine order.
- **Spanish bold tags = inline emphasis.** The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
- **Translation is per-paragraph 1:1.** Subagents must preserve paragraph count and order. `bundle_book.py` will warn + pad/truncate if a job's output array length doesn't match its input — but that's a sign the subagent misbehaved.
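
The pad/truncate behaviour mentioned in the last assumption can be sketched as follows. This mirrors the rule described for `bundle_book.py` (pad short outputs with empty strings, cut long ones), but `align_translations` is a hypothetical standalone helper written for illustration.

```python
from __future__ import annotations


def align_translations(paragraphs_en: list[str], expected: int) -> list[str]:
    """Force a job's English output to exactly `expected` paragraphs.

    Short outputs are padded with empty strings; long outputs are truncated.
    Either case means the subagent broke the 1:1 rule, so the result should
    be treated as a stopgap rather than a real translation.
    """
    if len(paragraphs_en) < expected:
        return paragraphs_en + [""] * (expected - len(paragraphs_en))
    return paragraphs_en[:expected]


print(align_translations(["One.", "Two."], 3))          # ['One.', 'Two.', '']
print(align_translations(["One.", "Two.", "Three."], 2))  # ['One.', 'Two.']
```

Padding keeps the app's paragraph indices valid, but it cannot tell which paragraph the subagent skipped or merged, so a mismatched job is usually worth deleting and re-running.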
## Out of scope (intentional)
- OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
- Exercise extraction (textbook pipeline).
- Pre-computed per-word annotations (the app uses `DictionaryService.lookup()` at runtime).
- Cover image extraction (covers are derived from a color hash in the app for now).
+128
@@ -0,0 +1,128 @@
#!/usr/bin/env python3
"""Merge chapters.json + per-job translation outputs into the final bundled
book_<slug>.json that the iOS app reads from its bundle.

Usage:
    python3 bundle_book.py <slug> [--build BUILD_DIR] [--dest DEST_DIR] [--require-all]

Inputs:
    BUILD_DIR/<slug>/chapters.json
    BUILD_DIR/<slug>/jobs/*.output.json (from translation subagents)

Output:
    DEST_DIR/book_<slug>.json
    {
      "slug": "...",
      "title": "...",
      "author": "...",
      "language": "...",
      "chapters": [
        {"id": "ch1", "number": 1, "title": "Preface",
         "paragraphsES": ["...", ...],
         "paragraphsEN": ["...", ...]},
        ...
      ]
    }

If --require-all is passed, the script fails if any job is missing its output.
Otherwise it fills missing translations with empty strings and warns.
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

DEFAULT_DEST = Path("../../Conjuga")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("slug")
    parser.add_argument("--build", type=Path, default=Path("build"))
    parser.add_argument("--dest", type=Path, default=None)
    parser.add_argument("--require-all", action="store_true")
    args = parser.parse_args()

    base = args.build / args.slug
    chapters = json.loads((base / "chapters.json").read_text(encoding="utf-8"))
    jobs_dir = base / "jobs"

    # Index translation jobs by chapter -> list of (offset, paragraphsEN).
    chapter_translations: dict[int, list[tuple[int, list[str]]]] = {}
    missing: list[str] = []

    for input_path in sorted(jobs_dir.glob("*.input.json")):
        job_id = input_path.stem.removesuffix(".input")
        input_data = json.loads(input_path.read_text(encoding="utf-8"))
        output_path = jobs_dir / f"{job_id}.output.json"
        if not output_path.exists():
            missing.append(job_id)
            continue
        output_data = json.loads(output_path.read_text(encoding="utf-8"))
        paragraphs_en = output_data.get("paragraphsEN", [])
        expected = len(input_data["paragraphsES"])
        if len(paragraphs_en) != expected:
            print(
                f"WARN: {job_id} length mismatch — got {len(paragraphs_en)}, "
                f"expected {expected}. Padding/truncating.",
                file=sys.stderr,
            )
            if len(paragraphs_en) < expected:
                paragraphs_en = paragraphs_en + [""] * (expected - len(paragraphs_en))
            else:
                paragraphs_en = paragraphs_en[:expected]
        chapter_translations.setdefault(input_data["chapter"], []).append(
            (input_data["rangeStart"], paragraphs_en)
        )

    if missing:
        msg = f"{len(missing)} translation job(s) missing output: {missing[:5]}{'...' if len(missing) > 5 else ''}"
        if args.require_all:
            print(f"ERROR: {msg}", file=sys.stderr)
            sys.exit(1)
        print(f"WARN: {msg} — using empty strings for those paragraphs.", file=sys.stderr)

    bundled_chapters: list[dict] = []
    for ch in chapters["chapters"]:
        total = len(ch["paragraphsES"])
        # Start from empty placeholders and write each translated chunk at its
        # own offset, so a missing job leaves a gap instead of shifting later
        # translations onto the wrong Spanish paragraphs.
        paragraphs_en: list[str] = [""] * total
        for offset, en_chunk in sorted(chapter_translations.get(ch["number"], [])):
            if offset >= total:
                continue
            end = min(offset + len(en_chunk), total)
            paragraphs_en[offset:end] = en_chunk[: end - offset]
        bundled_chapters.append(
            {
                "id": ch["id"],
                "number": ch["number"],
                "title": ch["title"],
                "paragraphsES": ch["paragraphsES"],
                "paragraphsEN": paragraphs_en,
            }
        )

    payload = {
        "slug": chapters["slug"],
        "title": chapters["title"],
        "author": chapters["author"],
        "language": chapters["language"],
        "chapters": bundled_chapters,
    }

    dest_dir = (args.dest or DEFAULT_DEST).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    out_path = dest_dir / f"book_{args.slug}.json"
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    translated = sum(len(v) for v in chapter_translations.values())
    print(f"Wrote {out_path}")
    print(f" Chapters: {len(bundled_chapters)}")
    print(f" Translated jobs: {translated} / {translated + len(missing)}")


if __name__ == "__main__":
    main()
+258
@@ -0,0 +1,258 @@
#!/usr/bin/env python3
"""Parse an EPUB into chapters.json for the in-app Books feature.

Usage:
    python3 extract_epub.py <epub_path> [--slug SLUG] [--out OUT_DIR]

Defaults:
    SLUG     derived from the EPUB's title metadata (lowercased, dashed)
    OUT_DIR  ./build/<slug>

Output:
    OUT_DIR/chapters.json
    {
      "title": "...",
      "author": "...",
      "language": "...",
      "slug": "...",
      "chapters": [
        {"id": "ch1", "number": 1, "title": "Preface",
         "paragraphsES": ["...", "..."]},
        ...
      ]
    }

How chapter grouping works:
    1. Read content.opf manifest (id -> href) and spine (ordered idrefs).
    2. Read toc.ncx navMap to get the ordered list of chapter (title, first-href).
    3. For each chapter, claim every spine file from its first href up to (but
       not including) the next chapter's first href.
    4. For each file in the chapter's range, parse <p> elements, strip tags,
       normalise whitespace and punctuation spacing, drop empties.
"""

from __future__ import annotations

import argparse
import json
import re
import sys
import unicodedata
import warnings
import zipfile
from pathlib import Path
from xml.etree import ElementTree as ET

from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning

warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def _slugify(s: str) -> str:
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    s = re.sub(r"[^a-zA-Z0-9]+", "-", s).strip("-").lower()
    return s or "book"


def _normalise(text: str) -> str:
    # Replace non-breaking spaces, collapse runs of whitespace, and remove
    # stray spaces before closing punctuation and after opening ¡ / ¿.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([.,;:!?…])", r"\1", text)
    text = re.sub(r"([¡¿])\s+", r"\1", text)
    return text


def _read_zip_text(zf: zipfile.ZipFile, path: str) -> str:
    return zf.read(path).decode("utf-8")


def _container_root(zf: zipfile.ZipFile) -> str:
    container = ET.fromstring(_read_zip_text(zf, "META-INF/container.xml"))
    rootfile = container.find(".//{urn:oasis:names:tc:opendocument:xmlns:container}rootfile")
    if rootfile is None:
        raise RuntimeError("Missing rootfile entry in META-INF/container.xml")
    return rootfile.attrib["full-path"]


def _parse_opf(zf: zipfile.ZipFile, opf_path: str):
    text = _read_zip_text(zf, opf_path)
    root = ET.fromstring(text)
    title = (root.findtext(".//dc:title", default="", namespaces=NS) or "").strip()
    author = (root.findtext(".//dc:creator", default="", namespaces=NS) or "").strip()
    language = (root.findtext(".//dc:language", default="", namespaces=NS) or "").strip()

    manifest: dict[str, str] = {}
    for item in root.findall("opf:manifest/opf:item", NS):
        manifest[item.attrib["id"]] = item.attrib["href"]

    spine: list[str] = []
    for itemref in root.findall("opf:spine/opf:itemref", NS):
        spine.append(itemref.attrib["idref"])

    # The spine's `toc` attribute names the manifest item holding the NCX.
    spine_el = root.find("opf:spine", NS)
    ncx_id = spine_el.attrib.get("toc") if spine_el is not None else None
    ncx_href = manifest.get(ncx_id) if ncx_id else None

    return {
        "title": title,
        "author": author,
        "language": language,
        "manifest": manifest,
        "spine": spine,
        "ncx_href": ncx_href,
        "opf_dir": str(Path(opf_path).parent) if "/" in opf_path else "",
    }


def _parse_ncx(zf: zipfile.ZipFile, ncx_path: str) -> list[dict]:
    text = _read_zip_text(zf, ncx_path)
    root = ET.fromstring(text)
    chapters: list[dict] = []
    for nav in root.findall("ncx:navMap/ncx:navPoint", NS):
        title = (nav.findtext("ncx:navLabel/ncx:text", default="", namespaces=NS) or "").strip()
        content = nav.find("ncx:content", NS)
        src = content.attrib.get("src", "") if content is not None else ""
        # Strip the anchor; we want the file path only.
        href = src.split("#", 1)[0]
        chapters.append({"title": title, "href": href})
    return chapters


def _resolve_zip_path(base_dir: str, href: str) -> str:
    if not base_dir:
        return href
    return f"{base_dir}/{href}".lstrip("/")


def _extract_paragraphs(zf: zipfile.ZipFile, zip_path: str) -> list[str]:
    try:
        html = _read_zip_text(zf, zip_path)
    except KeyError:
        # The spine references a file that is not present in the archive.
        return []
    soup = BeautifulSoup(html, "lxml")
    paragraphs: list[str] = []
    for p in soup.find_all("p"):
        text = _normalise(p.get_text(" ", strip=True))
        # Drop empty paragraphs, e.g. nav-anchor wrappers with no real text.
        if not text:
            continue
        # Paragraphs that only echo the chapter title are removed later in
        # main(), where the TOC title is available. Keep everything else.
        paragraphs.append(text)
    return paragraphs


def _chapter_files(
    spine_files: list[str], chapter_hrefs: list[str]
) -> list[list[str]]:
    """Slice the spine into one list of files per chapter, using the chapter's
    first href as the chapter boundary. Files before the first chapter (e.g.
    cover, titlepage) are dropped."""
    boundaries: list[int] = []
    for href in chapter_hrefs:
        try:
            idx = spine_files.index(href)
        except ValueError:
            # TOC entry points at a file that is not in the spine.
            boundaries.append(-1)
            continue
        boundaries.append(idx)

    ranges: list[list[str]] = []
    for i, start in enumerate(boundaries):
        if start < 0:
            ranges.append([])
            continue
        # The chapter ends where the next resolvable chapter begins.
        end = len(spine_files)
        for next_start in boundaries[i + 1:]:
            if next_start >= 0:
                end = next_start
                break
        ranges.append(spine_files[start:end])
    return ranges


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("epub", type=Path)
    parser.add_argument("--slug", default=None)
    parser.add_argument("--out", type=Path, default=None)
    args = parser.parse_args()

    if not args.epub.exists():
        print(f"EPUB not found: {args.epub}", file=sys.stderr)
        sys.exit(2)

    with zipfile.ZipFile(args.epub) as zf:
        opf_path = _container_root(zf)
        opf = _parse_opf(zf, opf_path)
        if not opf["ncx_href"]:
            print("No NCX found in spine; cannot derive chapter structure.", file=sys.stderr)
            sys.exit(3)
        ncx_path = _resolve_zip_path(opf["opf_dir"], opf["ncx_href"])
        toc = _parse_ncx(zf, ncx_path)

        spine_files = [
            _resolve_zip_path(opf["opf_dir"], opf["manifest"].get(idref, ""))
            for idref in opf["spine"]
        ]
        chapter_hrefs = [_resolve_zip_path(opf["opf_dir"], c["href"]) for c in toc]
        chapter_file_ranges = _chapter_files(spine_files, chapter_hrefs)

        chapters_out: list[dict] = []
        for i, (meta, files) in enumerate(zip(toc, chapter_file_ranges), start=1):
            paragraphs: list[str] = []
            for f in files:
                paragraphs.extend(_extract_paragraphs(zf, f))
            # Drop leading paragraph(s) that just echo the chapter title; the
            # title is already stored separately.
            title_norm = _normalise(meta["title"]).lower()
            while paragraphs and _normalise(paragraphs[0]).lower() == title_norm:
                paragraphs.pop(0)
            chapters_out.append(
                {
                    "id": f"ch{i}",
                    "number": i,
                    "title": meta["title"],
                    "paragraphsES": paragraphs,
                }
            )

    # _slugify never returns an empty string, so fall back to the filename
    # before slugifying when the OPF carries no title.
    slug = args.slug or _slugify(opf["title"] or args.epub.stem)
    out_dir = args.out or (Path("build") / slug)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "chapters.json"

    payload = {
        "title": opf["title"],
        "author": opf["author"],
        "language": opf["language"],
        "slug": slug,
        "chapters": chapters_out,
    }
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    total_paragraphs = sum(len(c["paragraphsES"]) for c in chapters_out)
    print(f"Wrote {out_path}")
    print(f" Title: {opf['title']}")
    print(f" Author: {opf['author']}")
    print(f" Chapters: {len(chapters_out)}")
    print(f" Paragraphs: {total_paragraphs}")
    for ch in chapters_out:
        print(f" ch{ch['number']:02d} {len(ch['paragraphsES']):4d}  {ch['title']}")


if __name__ == "__main__":
    main()
+65
@@ -0,0 +1,65 @@
#!/usr/bin/env bash
# Orchestrate the books pipeline: EPUB -> chapters.json -> per-chapter job
# manifest -> (translation by Claude Code subagents) -> bundled book_<slug>.json.
#
# This script DOES NOT run the LLM translation pass. After Phase 2 it prints
# how many jobs are pending, then bundles with empty English placeholders so
# the app structure can be previewed. Use Claude Code subagents (or a fresh
# session per the README) to fill in build/<slug>/jobs/*.output.json, then
# re-run this script; once no jobs are pending, Phase 3 bundles the final file.
#
# Usage:
#   ./run.sh <epub_path> [--slug SLUG] [--batch-size N]

set -euo pipefail

HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

if [[ $# -lt 1 ]]; then
  echo "usage: $0 <epub_path> [--slug SLUG] [--batch-size N]"
  exit 2
fi

EPUB="$1"; shift
SLUG=""
BATCH_SIZE="30"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --slug) SLUG="$2"; shift 2 ;;
    --batch-size) BATCH_SIZE="$2"; shift 2 ;;
    *) echo "unknown option: $1" >&2; exit 2 ;;
  esac
done

# Resolve the EPUB path against the caller's working directory before
# switching to the script directory, so relative paths keep working.
EPUB_ABS="$(cd "$(dirname "$EPUB")" && pwd)/$(basename "$EPUB")"
cd "$HERE"

echo "=== Phase 1: extract_epub.py ==="
if [[ -n "$SLUG" ]]; then
  python3 extract_epub.py "$EPUB_ABS" --slug "$SLUG"
else
  python3 extract_epub.py "$EPUB_ABS"
fi

# If --slug wasn't passed, recover the slug from the chapters file just written.
if [[ -z "$SLUG" ]]; then
  SLUG=$(python3 -c "import json,glob; p=sorted(glob.glob('build/*/chapters.json'), key=lambda x: -__import__('os').path.getmtime(x))[0]; print(json.load(open(p))['slug'])")
fi

echo
echo "=== Phase 2: translate_chapters.py ==="
python3 translate_chapters.py "$SLUG" --batch-size "$BATCH_SIZE"

PENDING_FILE="build/$SLUG/jobs/_pending.txt"
PENDING_COUNT=$(wc -l < "$PENDING_FILE" | tr -d ' ')

echo
echo "=== Phase 3: bundle_book.py ==="
if [[ "$PENDING_COUNT" -gt 0 ]]; then
  echo " $PENDING_COUNT translation job(s) still pending."
  echo " Run the Claude Code subagent translation step (see README.md), then re-run this script."
  echo " Bundling with empty placeholders so you can preview app structure now."
  python3 bundle_book.py "$SLUG"
else
  python3 bundle_book.py "$SLUG" --require-all
fi
+136
@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""Split chapters.json into translation jobs that Claude Code subagents can
process in parallel. Resumable: jobs whose output file already exists are
skipped.

Usage:
    python3 translate_chapters.py <slug> [--batch-size N] [--build BUILD_DIR]

Inputs:
    BUILD_DIR/<slug>/chapters.json (from extract_epub.py)

Outputs:
    BUILD_DIR/<slug>/jobs/<jobid>.input.json (one per batch, read by subagents)
    BUILD_DIR/<slug>/jobs/_pending.txt (list of job IDs still missing output)
    BUILD_DIR/<slug>/jobs/_prompt_template.md (prompt the orchestrator hands each subagent)

Job layout (.input.json):
    {
      "jobId": "ch06_b00",
      "chapter": 6,
      "chapterTitle": "1. El Castillo",
      "rangeStart": 0,
      "rangeEnd": 30,
      "paragraphsES": ["...", "..."]
    }

Subagents must write `<jobid>.output.json` with shape:
    {"jobId": "ch06_b00", "paragraphsEN": ["...", "..."]}

The output array MUST have the same length as paragraphsES, in the same order.
"""

from __future__ import annotations

import argparse
import json
from pathlib import Path

PROMPT_TEMPLATE = """\
You are translating a chunk of a Spanish-language book into English for a
language-learning app.

Input file: {input_path}
Output file: {output_path}

Read the input file. It contains a JSON object with a `paragraphsES` array.
Translate each paragraph into natural English. Preserve meaning, tone, and
dialogue markers (—, –, ¡, ¿) as appropriate for the English output. Keep
the same number of paragraphs in the same order.

Notes for translation quality:
- This is a beginner Spanish reader, so prefer plain natural English over
  literary flourish.
- Preserve proper nouns (character names, place names) verbatim.
- Convert Spanish dialogue dashes (–, —) to English-style quotation marks
  ONLY if it reads more naturally; otherwise keep them as em-dashes.
- Do NOT add explanatory parentheticals; the in-app dictionary handles
  per-word lookup.

Write the output as JSON with shape:
{{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}

The `paragraphsEN` array MUST be the same length and order as `paragraphsES`
in the input. Write nothing else to disk and produce no other output.
"""


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("slug")
    parser.add_argument("--batch-size", type=int, default=30)
    parser.add_argument("--build", type=Path, default=Path("build"))
    args = parser.parse_args()

    base = args.build / args.slug
    chapters_path = base / "chapters.json"
    jobs_dir = base / "jobs"
    jobs_dir.mkdir(parents=True, exist_ok=True)

    data = json.loads(chapters_path.read_text(encoding="utf-8"))

    pending: list[str] = []
    completed: list[str] = []
    total_jobs = 0

    for ch in data["chapters"]:
        paragraphs = ch["paragraphsES"]
        if not paragraphs:
            continue
        # One job per batch of paragraphs; the job ID encodes chapter + batch.
        for offset in range(0, len(paragraphs), args.batch_size):
            chunk = paragraphs[offset : offset + args.batch_size]
            job_id = f"ch{ch['number']:02d}_b{offset // args.batch_size:02d}"
            input_path = jobs_dir / f"{job_id}.input.json"
            output_path = jobs_dir / f"{job_id}.output.json"
            input_path.write_text(
                json.dumps(
                    {
                        "jobId": job_id,
                        "chapter": ch["number"],
                        "chapterTitle": ch["title"],
                        "rangeStart": offset,
                        "rangeEnd": offset + len(chunk),
                        "paragraphsES": chunk,
                    },
                    ensure_ascii=False,
                    indent=2,
                ),
                encoding="utf-8",
            )
            total_jobs += 1
            # A job counts as done as soon as its output file exists.
            if output_path.exists():
                completed.append(job_id)
            else:
                pending.append(job_id)

    (jobs_dir / "_pending.txt").write_text("\n".join(pending) + ("\n" if pending else ""))
    (jobs_dir / "_prompt_template.md").write_text(
        PROMPT_TEMPLATE.format(
            input_path="<JOB_INPUT_PATH>",
            output_path="<JOB_OUTPUT_PATH>",
        ),
        encoding="utf-8",
    )

    print(f"Total translation jobs: {total_jobs}")
    print(f" Completed: {len(completed)}")
    print(f" Pending: {len(pending)}")
    print(f"Manifest at: {jobs_dir / '_pending.txt'}")
    print(f"Prompt template at: {jobs_dir / '_prompt_template.md'}")


if __name__ == "__main__":
    main()