Add Books — read EPUB-imported books in Practice with tap-to-define
New "Books" row in the Practice tab opens a library of bundled bilingual books. Each chapter renders Spanish paragraph-by-paragraph; tap any word for a definition sheet (DictionaryService with on-device AI fallback), or toggle the toolbar button to swap to the pre-computed English translation inline.

Local-only Book + BookChapter SwiftData models added to the local container schema (reset version bumped to 5). DataLoader.seedBooks walks the bundle for `book_*.json` resources, so future books drop in without touching app code: just bundle a new JSON and bump bookDataVersion.

First book: Olly Richards' "Spanish Short Stories For Beginners Vol 2" (13 chapters, 2,646 paragraphs, bilingual).

Scripts/books/ is the repeatable pipeline for future EPUBs: extract_epub.py → translate_chapters.py (per-chapter resumable jobs) → bundle_book.py. Translation is done by parallel Claude Code subagents reading per-job input files and writing output files, so no API key is required, matching the pattern used for the textbook vocab vision pass. See Scripts/books/README.md for the full how-to.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1 @@
|
||||
build/
|
||||
@@ -0,0 +1,85 @@
# Books pipeline

Turns any EPUB into a chapter-structured JSON file the app bundles and reads.

## TL;DR

```bash
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug
```

This runs Phase 1 (extract) and Phase 2 (job manifest), reports how many translation jobs are pending, and bundles a preview file with empty English placeholders. Run the pending jobs via Claude Code subagents (Phase 2.5 below), then re-run `./run.sh` to bundle the final file.

## Phases

| Phase | Script | What it does | Output |
|---|---|---|---|
| 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
| 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
| 2.5 | Claude Code subagents | Read each job's `.input.json`, translate Spanish→English, write `<jobid>.output.json`. See step 3 of "Adding a new book" below. | `build/<slug>/jobs/<jobid>.output.json` |
| 3 | `bundle_book.py` | Merge `chapters.json` + every `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |

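
As a rough illustration of the Phase 2 row, the sketch below splits one chapter's paragraphs into fixed-size batches and names each job. The `chNN_bNN` ID pattern and the `rangeStart` / `rangeEnd` fields follow `translate_chapters.py`; the helper name `plan_jobs` is hypothetical, and nothing is written to disk.

```python
from __future__ import annotations


def plan_jobs(chapter_number: int, paragraphs: list[str], batch_size: int = 30) -> list[dict]:
    """Return one job description per batch of `batch_size` paragraphs."""
    jobs = []
    for offset in range(0, len(paragraphs), batch_size):
        chunk = paragraphs[offset : offset + batch_size]
        jobs.append(
            {
                # Two-digit chapter number, two-digit batch index.
                "jobId": f"ch{chapter_number:02d}_b{offset // batch_size:02d}",
                "chapter": chapter_number,
                "rangeStart": offset,
                "rangeEnd": offset + len(chunk),
                "paragraphsES": chunk,
            }
        )
    return jobs


if __name__ == "__main__":
    # 70 paragraphs with the default batch size give three jobs.
    for job in plan_jobs(6, [f"párrafo {i}" for i in range(70)]):
        print(job["jobId"], job["rangeStart"], job["rangeEnd"])
```

With 70 paragraphs this yields `ch06_b00`, `ch06_b01`, and `ch06_b02`, the last one covering paragraphs 60 to 70.
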
`run.sh` chains 1 → 2 → 3. If Phase 2 produces pending jobs, Phase 3 still runs but bundles with empty `paragraphsEN` placeholders so you can preview app structure before translation completes. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.

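
To make the placeholder behaviour concrete, here is a minimal sketch of one way to assemble a chapter's English paragraphs from whichever job outputs exist. It assumes each chunk is placed at its `rangeStart` offset; the function name `merge_chunks` is hypothetical, and `bundle_book.py` may differ in detail.

```python
from __future__ import annotations


def merge_chunks(es_count: int, chunks: list[tuple[int, list[str]]]) -> list[str]:
    """Place each (rangeStart, paragraphsEN) chunk at its offset.

    Positions not covered by any chunk stay as empty-string placeholders.
    """
    merged = [""] * es_count
    for offset, chunk in sorted(chunks):
        # Clip chunks that would run past the end of the chapter.
        clipped = chunk[: max(0, es_count - offset)]
        merged[offset : offset + len(clipped)] = clipped
    return merged


if __name__ == "__main__":
    # The job covering paragraphs 1 and 2 is still pending here.
    print(merge_chunks(5, [(3, ["four", "five"]), (0, ["one"])]))
```

Placing by offset keeps the English paragraphs lined up with their Spanish counterparts even when a job in the middle of a chapter has not finished yet.
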
## Adding a new book

1. **Drop the EPUB** anywhere on disk.
2. **Run Phase 1+2**:
   ```bash
   cd Conjuga/Scripts/books
   ./run.sh /path/to/book.epub --slug my-book
   ```
   Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable `toc.ncx`), `extract_epub.py` will need a fallback heuristic; see "Open assumptions" below.

3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:

   For each pending job ID listed in `build/<slug>/jobs/_pending.txt`, hand a subagent the prompt at `build/<slug>/jobs/_prompt_template.md` with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, translates, and writes the output. The step is resumable: after an interrupted run, re-running `translate_chapters.py` (or `./run.sh`) rewrites `_pending.txt` with only the job IDs that still lack an output file.

   Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.

4. **Bundle**:
   ```bash
   ./run.sh /path/to/book.epub --slug my-book   # re-running pulls in the new outputs
   # or directly:
   python3 bundle_book.py my-book --require-all
   ```
   `--require-all` will fail loudly if any job is still missing.

5. **Bump `bookDataVersion`** in `DataLoader.swift` so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).

6. **Verify the file is bundled** in `Conjuga.xcodeproj`. The script writes `book_<slug>.json` into `Conjuga/Conjuga/` (the app target root); if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the `xcodeproj` ruby gem.

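
The hand-off in step 3 could be prepared along these lines. This is a sketch under stated assumptions: the file names and the `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` placeholders come from this README, while the helper names, the default of 8 jobs per agent, and the `my-book` slug are illustrative. Spawning the subagents themselves is not shown.

```python
from __future__ import annotations

from pathlib import Path


def load_pending(jobs_dir: Path) -> list[str]:
    """Read job IDs from _pending.txt, one per line, skipping blanks."""
    text = (jobs_dir / "_pending.txt").read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def cluster(job_ids: list[str], per_agent: int = 8) -> list[list[str]]:
    """Group job IDs into consecutive batches, one batch per subagent."""
    return [job_ids[i : i + per_agent] for i in range(0, len(job_ids), per_agent)]


def fill_prompt(template: str, jobs_dir: Path, job_id: str) -> str:
    """Substitute the two path placeholders for a single job."""
    return template.replace(
        "<JOB_INPUT_PATH>", str(jobs_dir / f"{job_id}.input.json")
    ).replace("<JOB_OUTPUT_PATH>", str(jobs_dir / f"{job_id}.output.json"))


if __name__ == "__main__":
    jobs_dir = Path("build/my-book/jobs")  # hypothetical slug
    if (jobs_dir / "_pending.txt").exists():
        template = (jobs_dir / "_prompt_template.md").read_text(encoding="utf-8")
        for n, batch in enumerate(cluster(load_pending(jobs_dir)), start=1):
            print(f"agent {n}: {len(batch)} job(s)")
            for job_id in batch:
                prompt = fill_prompt(template, jobs_dir, job_id)
                # Hand `prompt` to a subagent here; that step is not shown.
```
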
## File layout

```
Conjuga/Scripts/books/
├── extract_epub.py          # Phase 1
├── translate_chapters.py    # Phase 2
├── bundle_book.py           # Phase 3
├── run.sh                   # Orchestrator
└── build/                   # gitignored
    └── <slug>/
        ├── chapters.json
        └── jobs/
            ├── _pending.txt
            ├── _prompt_template.md
            ├── ch01_b00.input.json
            ├── ch01_b00.output.json
            └── ...
```

The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json` so the iOS app bundle includes it. (Existing `textbook_data.json` / `conjuga_data.json` use the same layout: files in the app target root rather than a Resources subgroup.)

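
Before bumping `bookDataVersion`, a quick check of the bundled file can show how much of each chapter is translated. The sketch below assumes the `chapters` / `paragraphsES` / `paragraphsEN` field names documented in `bundle_book.py`; the `coverage` helper and the `my-book` path are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def coverage(book: dict) -> list[tuple[str, int, int]]:
    """Return (chapter id, translated paragraphs, total paragraphs) per chapter."""
    rows = []
    for ch in book["chapters"]:
        es, en = ch["paragraphsES"], ch["paragraphsEN"]
        if len(es) != len(en):
            raise ValueError(f"{ch['id']}: {len(en)} EN vs {len(es)} ES paragraphs")
        # An empty string is the placeholder for a paragraph not yet translated.
        translated = sum(1 for p in en if p.strip())
        rows.append((ch["id"], translated, len(es)))
    return rows


if __name__ == "__main__":
    path = Path("../../Conjuga/book_my-book.json")  # hypothetical slug
    if path.exists():
        for ch_id, done, total in coverage(json.loads(path.read_text(encoding="utf-8"))):
            print(f"{ch_id}: {done}/{total} translated")
```
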
## Open assumptions

- **TOC drives chapter boundaries.** If an EPUB ships without a usable `toc.ncx`, or the navMap is too granular (e.g. one navPoint per page), `extract_epub.py` will need a fallback that groups by `<h1>` headings in spine order.
- **Spanish bold tags = inline emphasis.** The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
- **Translation is per-paragraph 1:1.** Subagents must preserve paragraph count and order. `bundle_book.py` will warn + pad/truncate if a job's output array length doesn't match its input, but treat that as a sign the subagent misbehaved.

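
The 1:1 assumption can be checked per job before bundling. The following is a sketch, not part of the pipeline: the `jobId`, `paragraphsES`, and `paragraphsEN` fields match the job files described in `translate_chapters.py`, while `check_job` and its messages are hypothetical.

```python
from __future__ import annotations


def check_job(input_data: dict, output_data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the job looks fine."""
    problems: list[str] = []
    if output_data.get("jobId") != input_data["jobId"]:
        problems.append("jobId does not match the input file")
    es = input_data["paragraphsES"]
    en = output_data.get("paragraphsEN")
    if not isinstance(en, list):
        problems.append("paragraphsEN is missing or not a list")
        return problems
    if len(en) != len(es):
        problems.append(f"expected {len(es)} paragraphs, got {len(en)}")
    # Blank or non-string entries usually mean a paragraph was skipped.
    blanks = [i for i, p in enumerate(en) if not isinstance(p, str) or not p.strip()]
    if blanks:
        problems.append(f"blank or non-string paragraphs at indexes {blanks}")
    return problems


if __name__ == "__main__":
    job_in = {"jobId": "ch01_b00", "paragraphsES": ["Hola.", "Adiós."]}
    job_out = {"jobId": "ch01_b00", "paragraphsEN": ["Hello."]}
    print(check_job(job_in, job_out))
```

Running a check like this on every `*.output.json` catches a misbehaving subagent earlier than the pad/truncate warning at bundle time.
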
## Out of scope (intentional)

- OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
- Exercise extraction (textbook pipeline).
- Pre-computed per-word annotations (the app uses `DictionaryService.lookup()` at runtime).
- Cover image extraction (covers are derived from a color hash in the app for now).
@@ -0,0 +1,128 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Merge chapters.json + per-job translation outputs into the final bundled
|
||||
book_<slug>.json that the iOS app reads from its bundle.
|
||||
|
||||
Usage:
|
||||
python3 bundle_book.py <slug> [--build BUILD_DIR] [--dest DEST_DIR] [--require-all]
|
||||
|
||||
Inputs:
|
||||
BUILD_DIR/<slug>/chapters.json
|
||||
BUILD_DIR/<slug>/jobs/*.output.json (from translation subagents)
|
||||
|
||||
Output:
|
||||
DEST_DIR/book_<slug>.json
|
||||
{
|
||||
"slug": "...",
|
||||
"title": "...",
|
||||
"author": "...",
|
||||
"language": "...",
|
||||
"chapters": [
|
||||
{"id": "ch1", "number": 1, "title": "Preface",
|
||||
"paragraphsES": ["...", ...],
|
||||
"paragraphsEN": ["...", ...]},
|
||||
...
|
||||
]
|
||||
}
|
||||
|
||||
If --require-all is passed, the script fails if any job is missing its output.
|
||||
Otherwise it fills missing translations with empty strings and warns.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
DEFAULT_DEST = Path("../../Conjuga")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("slug")
|
||||
parser.add_argument("--build", type=Path, default=Path("build"))
|
||||
parser.add_argument("--dest", type=Path, default=None)
|
||||
parser.add_argument("--require-all", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
base = args.build / args.slug
|
||||
chapters = json.loads((base / "chapters.json").read_text(encoding="utf-8"))
|
||||
jobs_dir = base / "jobs"
|
||||
|
||||
# Index translation jobs by chapter -> ordered (offset, paragraphsEN).
|
||||
chapter_translations: dict[int, list[tuple[int, list[str]]]] = {}
|
||||
missing: list[str] = []
|
||||
|
||||
for input_path in sorted(jobs_dir.glob("*.input.json")):
|
||||
job_id = input_path.stem.removesuffix(".input")
|
||||
input_data = json.loads(input_path.read_text(encoding="utf-8"))
|
||||
output_path = jobs_dir / f"{job_id}.output.json"
|
||||
if not output_path.exists():
|
||||
missing.append(job_id)
|
||||
continue
|
||||
output_data = json.loads(output_path.read_text(encoding="utf-8"))
|
||||
paragraphs_en = output_data.get("paragraphsEN", [])
|
||||
expected = len(input_data["paragraphsES"])
|
||||
if len(paragraphs_en) != expected:
|
||||
print(
|
||||
f"WARN: {job_id} length mismatch — got {len(paragraphs_en)}, "
|
||||
f"expected {expected}. Padding/truncating.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
if len(paragraphs_en) < expected:
|
||||
paragraphs_en = paragraphs_en + [""] * (expected - len(paragraphs_en))
|
||||
else:
|
||||
paragraphs_en = paragraphs_en[:expected]
|
||||
chapter_translations.setdefault(input_data["chapter"], []).append(
|
||||
(input_data["rangeStart"], paragraphs_en)
|
||||
)
|
||||
|
||||
if missing:
|
||||
msg = f"{len(missing)} translation job(s) missing output: {missing[:5]}{'...' if len(missing) > 5 else ''}"
|
||||
if args.require_all:
|
||||
print(f"ERROR: {msg}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
print(f"WARN: {msg} — using empty strings for those paragraphs.", file=sys.stderr)
|
||||
|
||||
    bundled_chapters: list[dict] = []
    for ch in chapters["chapters"]:
        es_count = len(ch["paragraphsES"])
        # Start from empty placeholders and drop each chunk in at its own
        # rangeStart, so a missing job leaves a gap instead of shifting the
        # chunks that follow it.
        paragraphs_en: list[str] = [""] * es_count
        for offset, en_chunk in sorted(chapter_translations.get(ch["number"], [])):
            # Clip chunks that would run past the end of the chapter.
            chunk = en_chunk[: max(0, es_count - offset)]
            paragraphs_en[offset : offset + len(chunk)] = chunk
        bundled_chapters.append(
            {
                "id": ch["id"],
                "number": ch["number"],
                "title": ch["title"],
                "paragraphsES": ch["paragraphsES"],
                "paragraphsEN": paragraphs_en,
            }
        )
|
||||
|
||||
payload = {
|
||||
"slug": chapters["slug"],
|
||||
"title": chapters["title"],
|
||||
"author": chapters["author"],
|
||||
"language": chapters["language"],
|
||||
"chapters": bundled_chapters,
|
||||
}
|
||||
|
||||
dest_dir = (args.dest or DEFAULT_DEST).resolve()
|
||||
dest_dir.mkdir(parents=True, exist_ok=True)
|
||||
out_path = dest_dir / f"book_{args.slug}.json"
|
||||
out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
|
||||
print(f"Wrote {out_path}")
|
||||
print(f" Chapters: {len(bundled_chapters)}")
|
||||
print(f" Translated jobs: {sum(len(v) for v in chapter_translations.values())} / {sum(len(v) for v in chapter_translations.values()) + len(missing)}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,258 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Parse an EPUB into chapters.json for the in-app Books feature.
|
||||
|
||||
Usage:
|
||||
python3 extract_epub.py <epub_path> [--slug SLUG] [--out OUT_DIR]
|
||||
|
||||
Defaults:
|
||||
SLUG derived from the EPUB's title metadata (lowercased, dashed)
|
||||
OUT_DIR ./build/<slug>
|
||||
|
||||
Output:
|
||||
OUT_DIR/chapters.json
|
||||
{
|
||||
"title": "...",
|
||||
"author": "...",
|
||||
"language": "...",
|
||||
"slug": "...",
|
||||
"chapters": [
|
||||
{"id": "ch1", "number": 1, "title": "Preface",
|
||||
"paragraphsES": ["...", "..."]},
|
||||
...
|
||||
]
|
||||
}
|
||||
|
||||
How chapter grouping works:
|
||||
1. Read content.opf manifest (id -> href) and spine (ordered idrefs).
|
||||
2. Read toc.ncx navMap to get the ordered list of chapter (title, first-href).
|
||||
3. For each chapter, claim every spine file from its first href up to (but
|
||||
not including) the next chapter's first href.
|
||||
4. For each file in the chapter's range, parse <p> elements, strip tags,
|
||||
normalise whitespace and punctuation spacing, drop empties.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import unicodedata
|
||||
import warnings
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
from xml.etree import ElementTree as ET
|
||||
|
||||
from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning
|
||||
|
||||
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
|
||||
|
||||
|
||||
NS = {
|
||||
"opf": "http://www.idpf.org/2007/opf",
|
||||
"dc": "http://purl.org/dc/elements/1.1/",
|
||||
"ncx": "http://www.daisy.org/z3986/2005/ncx/",
|
||||
"xhtml": "http://www.w3.org/1999/xhtml",
|
||||
}
|
||||
|
||||
|
||||
def _slugify(s: str) -> str:
|
||||
s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
|
||||
s = re.sub(r"[^a-zA-Z0-9]+", "-", s).strip("-").lower()
|
||||
return s or "book"
|
||||
|
||||
|
||||
def _normalise(text: str) -> str:
    # Replace non-breaking spaces, collapse runs of whitespace, and tidy the
    # spacing around punctuation. Quote characters are left untouched.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([.,;:!?…])", r"\1", text)
    text = re.sub(r"([¡¿])\s+", r"\1", text)
    return text
|
||||
|
||||
|
||||
def _read_zip_text(zf: zipfile.ZipFile, path: str) -> str:
|
||||
return zf.read(path).decode("utf-8")
|
||||
|
||||
|
||||
def _container_root(zf: zipfile.ZipFile) -> str:
|
||||
container = ET.fromstring(_read_zip_text(zf, "META-INF/container.xml"))
|
||||
rootfile = container.find(".//{urn:oasis:names:tc:opendocument:xmlns:container}rootfile")
|
||||
if rootfile is None:
|
||||
raise RuntimeError("Missing rootfile entry in META-INF/container.xml")
|
||||
return rootfile.attrib["full-path"]
|
||||
|
||||
|
||||
def _parse_opf(zf: zipfile.ZipFile, opf_path: str):
|
||||
text = _read_zip_text(zf, opf_path)
|
||||
root = ET.fromstring(text)
|
||||
|
||||
title = (root.findtext(".//dc:title", default="", namespaces=NS) or "").strip()
|
||||
author = (root.findtext(".//dc:creator", default="", namespaces=NS) or "").strip()
|
||||
language = (root.findtext(".//dc:language", default="", namespaces=NS) or "").strip()
|
||||
|
||||
manifest: dict[str, str] = {}
|
||||
for item in root.findall("opf:manifest/opf:item", NS):
|
||||
manifest[item.attrib["id"]] = item.attrib["href"]
|
||||
|
||||
spine: list[str] = []
|
||||
for itemref in root.findall("opf:spine/opf:itemref", NS):
|
||||
spine.append(itemref.attrib["idref"])
|
||||
|
||||
ncx_id = root.find("opf:spine", NS).attrib.get("toc") if root.find("opf:spine", NS) is not None else None
|
||||
ncx_href = manifest.get(ncx_id) if ncx_id else None
|
||||
|
||||
return {
|
||||
"title": title,
|
||||
"author": author,
|
||||
"language": language,
|
||||
"manifest": manifest,
|
||||
"spine": spine,
|
||||
"ncx_href": ncx_href,
|
||||
"opf_dir": str(Path(opf_path).parent) if "/" in opf_path else "",
|
||||
}
|
||||
|
||||
|
||||
def _parse_ncx(zf: zipfile.ZipFile, ncx_path: str) -> list[dict]:
|
||||
text = _read_zip_text(zf, ncx_path)
|
||||
root = ET.fromstring(text)
|
||||
chapters: list[dict] = []
|
||||
for nav in root.findall("ncx:navMap/ncx:navPoint", NS):
|
||||
title = (nav.findtext("ncx:navLabel/ncx:text", default="", namespaces=NS) or "").strip()
|
||||
content = nav.find("ncx:content", NS)
|
||||
src = content.attrib.get("src", "") if content is not None else ""
|
||||
# Strip the anchor — we want the file path only.
|
||||
href = src.split("#", 1)[0]
|
||||
chapters.append({"title": title, "href": href})
|
||||
return chapters
|
||||
|
||||
|
||||
def _resolve_zip_path(base_dir: str, href: str) -> str:
|
||||
if not base_dir:
|
||||
return href
|
||||
return f"{base_dir}/{href}".lstrip("/")
|
||||
|
||||
|
||||
def _extract_paragraphs(zf: zipfile.ZipFile, zip_path: str) -> list[str]:
    try:
        html = _read_zip_text(zf, zip_path)
    except KeyError:
        # The spine entry points at a file that is not in the archive.
        return []
    soup = BeautifulSoup(html, "lxml")
    paragraphs: list[str] = []
    for p in soup.find_all("p"):
        text = _normalise(p.get_text(" ", strip=True))
        # Drop <p> elements with no real text (e.g. nav-anchor wrappers).
        if not text:
            continue
        # Paragraphs that merely echo the chapter title are not filtered
        # here; main() strips them from the start of each chapter.
        paragraphs.append(text)
    return paragraphs
|
||||
|
||||
|
||||
def _chapter_files(
|
||||
spine_files: list[str], chapter_hrefs: list[str]
|
||||
) -> list[list[str]]:
|
||||
"""Slice the spine into one list of files per chapter, using the chapter's
|
||||
first href as the chapter boundary. Files before the first chapter (e.g.
|
||||
cover, titlepage) are dropped."""
|
||||
boundaries: list[int] = []
|
||||
for href in chapter_hrefs:
|
||||
try:
|
||||
idx = spine_files.index(href)
|
||||
except ValueError:
|
||||
boundaries.append(-1)
|
||||
continue
|
||||
boundaries.append(idx)
|
||||
|
||||
ranges: list[list[str]] = []
|
||||
for i, start in enumerate(boundaries):
|
||||
if start < 0:
|
||||
ranges.append([])
|
||||
continue
|
||||
end = len(spine_files)
|
||||
for next_start in boundaries[i + 1:]:
|
||||
if next_start >= 0:
|
||||
end = next_start
|
||||
break
|
||||
ranges.append(spine_files[start:end])
|
||||
return ranges
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("epub", type=Path)
|
||||
parser.add_argument("--slug", default=None)
|
||||
parser.add_argument("--out", type=Path, default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.epub.exists():
|
||||
print(f"EPUB not found: {args.epub}", file=sys.stderr)
|
||||
sys.exit(2)
|
||||
|
||||
with zipfile.ZipFile(args.epub) as zf:
|
||||
opf_path = _container_root(zf)
|
||||
opf = _parse_opf(zf, opf_path)
|
||||
|
||||
if not opf["ncx_href"]:
|
||||
print("No NCX found in spine; cannot derive chapter structure.", file=sys.stderr)
|
||||
sys.exit(3)
|
||||
|
||||
ncx_path = _resolve_zip_path(opf["opf_dir"], opf["ncx_href"])
|
||||
toc = _parse_ncx(zf, ncx_path)
|
||||
|
||||
spine_files = [
|
||||
_resolve_zip_path(opf["opf_dir"], opf["manifest"].get(idref, ""))
|
||||
for idref in opf["spine"]
|
||||
]
|
||||
chapter_hrefs = [_resolve_zip_path(opf["opf_dir"], c["href"]) for c in toc]
|
||||
chapter_file_ranges = _chapter_files(spine_files, chapter_hrefs)
|
||||
|
||||
chapters_out: list[dict] = []
|
||||
for i, (meta, files) in enumerate(zip(toc, chapter_file_ranges), start=1):
|
||||
paragraphs: list[str] = []
|
||||
for f in files:
|
||||
paragraphs.extend(_extract_paragraphs(zf, f))
|
||||
# Drop leading paragraph(s) that just echo the chapter title — the
|
||||
# title is already stored separately.
|
||||
title_norm = _normalise(meta["title"]).lower()
|
||||
while paragraphs and _normalise(paragraphs[0]).lower() == title_norm:
|
||||
paragraphs.pop(0)
|
||||
chapters_out.append(
|
||||
{
|
||||
"id": f"ch{i}",
|
||||
"number": i,
|
||||
"title": meta["title"],
|
||||
"paragraphsES": paragraphs,
|
||||
}
|
||||
)
|
||||
|
||||
slug = args.slug or _slugify(opf["title"]) or args.epub.stem
|
||||
out_dir = args.out or (Path("build") / slug)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
out_path = out_dir / "chapters.json"
|
||||
|
||||
payload = {
|
||||
"title": opf["title"],
|
||||
"author": opf["author"],
|
||||
"language": opf["language"],
|
||||
"slug": slug,
|
||||
"chapters": chapters_out,
|
||||
}
|
||||
out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
|
||||
|
||||
total_paragraphs = sum(len(c["paragraphsES"]) for c in chapters_out)
|
||||
print(f"Wrote {out_path}")
|
||||
print(f" Title: {opf['title']}")
|
||||
print(f" Author: {opf['author']}")
|
||||
print(f" Chapters: {len(chapters_out)}")
|
||||
print(f" Paragraphs: {total_paragraphs}")
|
||||
for ch in chapters_out:
|
||||
print(f" ch{ch['number']:02d} {len(ch['paragraphsES']):4d} ¶ {ch['title']}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Executable
+65
@@ -0,0 +1,65 @@
|
||||
#!/usr/bin/env bash
|
||||
# Orchestrate the books pipeline: EPUB -> chapters.json -> per-chapter job
|
||||
# manifest -> (translation by Claude Code subagents) -> bundled book_<slug>.json.
|
||||
#
|
||||
# This script DOES NOT run the LLM translation pass. After Phase 2 it prints
# how many jobs are pending and, if any remain, bundles with empty English
# placeholders. Use Claude Code subagents (or a fresh session per the README)
# to fill in build/<slug>/jobs/*.output.json, then re-run this script: it
# skips finished jobs and bundles the real translations in Phase 3.
|
||||
#
|
||||
# Usage:
|
||||
# ./run.sh <epub_path> [--slug SLUG] [--batch-size N]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
cd "$HERE"
|
||||
|
||||
if [[ $# -lt 1 ]]; then
|
||||
echo "usage: $0 <epub_path> [--slug SLUG] [--batch-size N]"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
EPUB="$1"; shift
|
||||
SLUG=""
|
||||
BATCH_SIZE="30"
|
||||
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--slug) SLUG="$2"; shift 2 ;;
|
||||
--batch-size) BATCH_SIZE="$2"; shift 2 ;;
|
||||
*) echo "unknown option: $1" >&2; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
EPUB_ABS="$(cd "$(dirname "$EPUB")" && pwd)/$(basename "$EPUB")"
|
||||
|
||||
echo "=== Phase 1: extract_epub.py ==="
|
||||
if [[ -n "$SLUG" ]]; then
|
||||
python3 extract_epub.py "$EPUB_ABS" --slug "$SLUG"
|
||||
else
|
||||
python3 extract_epub.py "$EPUB_ABS"
|
||||
fi
|
||||
|
||||
# If --slug wasn't passed, recover the slug from the chapters file just written.
|
||||
if [[ -z "$SLUG" ]]; then
|
||||
SLUG=$(python3 -c "import json,glob; p=sorted(glob.glob('build/*/chapters.json'), key=lambda x: -__import__('os').path.getmtime(x))[0]; print(json.load(open(p))['slug'])")
|
||||
fi
|
||||
|
||||
echo
|
||||
echo "=== Phase 2: translate_chapters.py ==="
|
||||
python3 translate_chapters.py "$SLUG" --batch-size "$BATCH_SIZE"
|
||||
|
||||
PENDING_FILE="build/$SLUG/jobs/_pending.txt"
|
||||
PENDING_COUNT=$(wc -l < "$PENDING_FILE" | tr -d ' ')
|
||||
|
||||
echo
|
||||
echo "=== Phase 3: bundle_book.py ==="
|
||||
if [[ "$PENDING_COUNT" -gt 0 ]]; then
|
||||
echo " $PENDING_COUNT translation job(s) still pending."
|
||||
echo " Run the Claude Code subagent translation step (see README.md), then re-run this script."
|
||||
echo " Bundling with empty placeholders so you can preview app structure now."
|
||||
python3 bundle_book.py "$SLUG"
|
||||
else
|
||||
python3 bundle_book.py "$SLUG" --require-all
|
||||
fi
|
||||
@@ -0,0 +1,136 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Split chapters.json into translation jobs that Claude Code subagents can
|
||||
process in parallel. Resumable: jobs whose output file already exists are
|
||||
skipped.
|
||||
|
||||
Usage:
|
||||
python3 translate_chapters.py <slug> [--batch-size N] [--build BUILD_DIR]
|
||||
|
||||
Inputs:
|
||||
BUILD_DIR/<slug>/chapters.json (from extract_epub.py)
|
||||
|
||||
Outputs:
|
||||
BUILD_DIR/<slug>/jobs/<jobid>.input.json (one per batch — read by subagents)
|
||||
BUILD_DIR/<slug>/jobs/_pending.txt (list of job IDs still missing output)
|
||||
BUILD_DIR/<slug>/jobs/_prompt_template.md (prompt the orchestrator hands each subagent)
|
||||
|
||||
Job layout (.input.json):
|
||||
{
|
||||
"jobId": "ch06_b00",
|
||||
"chapter": 6,
|
||||
"chapterTitle": "1. El Castillo",
|
||||
"rangeStart": 0,
|
||||
"rangeEnd": 30,
|
||||
"paragraphsES": ["...", "..."]
|
||||
}
|
||||
|
||||
Subagents must write `<jobid>.output.json` with shape:
|
||||
{"jobId": "ch06_b00", "paragraphsEN": ["...", "..."]}
|
||||
|
||||
The output array MUST have the same length as paragraphsES, in the same order.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
PROMPT_TEMPLATE = """\
|
||||
You are translating a chunk of a Spanish-language book into English for a
|
||||
language-learning app.
|
||||
|
||||
Input file: {input_path}
|
||||
Output file: {output_path}
|
||||
|
||||
Read the input file. It contains a JSON object with a `paragraphsES` array.
|
||||
Translate each paragraph into natural English. Preserve meaning, tone, and
|
||||
dialogue markers (—, –, ¡, ¿) as appropriate for the English output. Keep
|
||||
the same number of paragraphs in the same order.
|
||||
|
||||
Notes for translation quality:
|
||||
- This is a beginner Spanish reader, so prefer plain natural English over
|
||||
literary flourish.
|
||||
- Preserve proper nouns (character names, place names) verbatim.
|
||||
- Convert Spanish dialogue dashes (–, —) to English-style quotation marks
|
||||
ONLY if it reads more naturally; otherwise keep them as em-dashes.
|
||||
- Do NOT add explanatory parentheticals; the in-app dictionary handles
|
||||
per-word lookup.
|
||||
|
||||
Write the output as JSON with shape:
|
||||
{{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}
|
||||
|
||||
The `paragraphsEN` array MUST be the same length and order as `paragraphsES`
|
||||
in the input. Write nothing else to disk and produce no other output.
|
||||
"""
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("slug")
|
||||
parser.add_argument("--batch-size", type=int, default=30)
|
||||
parser.add_argument("--build", type=Path, default=Path("build"))
|
||||
args = parser.parse_args()
|
||||
|
||||
base = args.build / args.slug
|
||||
chapters_path = base / "chapters.json"
|
||||
jobs_dir = base / "jobs"
|
||||
jobs_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
data = json.loads(chapters_path.read_text(encoding="utf-8"))
|
||||
|
||||
pending: list[str] = []
|
||||
completed: list[str] = []
|
||||
total_jobs = 0
|
||||
|
||||
for ch in data["chapters"]:
|
||||
paragraphs = ch["paragraphsES"]
|
||||
if not paragraphs:
|
||||
continue
|
||||
for offset in range(0, len(paragraphs), args.batch_size):
|
||||
chunk = paragraphs[offset : offset + args.batch_size]
|
||||
job_id = f"ch{ch['number']:02d}_b{offset // args.batch_size:02d}"
|
||||
input_path = jobs_dir / f"{job_id}.input.json"
|
||||
output_path = jobs_dir / f"{job_id}.output.json"
|
||||
|
||||
input_path.write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"jobId": job_id,
|
||||
"chapter": ch["number"],
|
||||
"chapterTitle": ch["title"],
|
||||
"rangeStart": offset,
|
||||
"rangeEnd": offset + len(chunk),
|
||||
"paragraphsES": chunk,
|
||||
},
|
||||
ensure_ascii=False,
|
||||
indent=2,
|
||||
),
|
||||
encoding="utf-8",
|
||||
)
|
||||
total_jobs += 1
|
||||
if output_path.exists():
|
||||
completed.append(job_id)
|
||||
else:
|
||||
pending.append(job_id)
|
||||
|
||||
(jobs_dir / "_pending.txt").write_text("\n".join(pending) + ("\n" if pending else ""))
|
||||
|
||||
(jobs_dir / "_prompt_template.md").write_text(
|
||||
PROMPT_TEMPLATE.format(
|
||||
input_path="<JOB_INPUT_PATH>",
|
||||
output_path="<JOB_OUTPUT_PATH>",
|
||||
),
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
print(f"Total translation jobs: {total_jobs}")
|
||||
print(f" Completed: {len(completed)}")
|
||||
print(f" Pending: {len(pending)}")
|
||||
print(f"Manifest at: {jobs_dir / '_pending.txt'}")
|
||||
print(f"Prompt template at: {jobs_dir / '_prompt_template.md'}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||