Spanish/.gitignore
Trey T f368c24ad6 Fixes #32 — LLM vision pass for vocab pairs, fixes scrambled English/Spanish
The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the
chapter 7 "Other Idioms" image (issue #32) being the most visible.
Three failure modes were collapsing the data:
  1) classifier blind to subject pronouns ("yo", "I", etc.)
  2) right-then-left OCR reads on 2-col tables
  3) Y-cluster drift on multi-line cells in 4-col layouts

Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image with three classifications:
  - pair_table       (extract all Spanish↔English pairs)
  - reference_only   (Spanish-only conjugation tables — no pairs, UI shows
                      the flat OCR lines as a reference list instead)
  - hybrid           (some header pairs + reference content beneath; only
                      the genuine pairs become cards)
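
A minimal sketch of how the three classifications could map to cards. The
record shape is an assumption: the keys "classification" and "pairs" and the
function name cards_for_image are hypothetical and are not taken from the
actual paired_vocab_llm.json schema.

```python
from typing import Dict, List

# Classification labels as named in the commit message.
PAIR_TABLE = "pair_table"
REFERENCE_ONLY = "reference_only"
HYBRID = "hybrid"


def cards_for_image(entry: Dict) -> List[Dict[str, str]]:
    """Return the Spanish/English pairs that should become cards.

    `entry` is a hypothetical per-image record; its keys are assumed.
    """
    kind = entry["classification"]
    if kind == REFERENCE_ONLY:
        # Spanish-only tables yield no cards; the UI shows OCR lines instead.
        return []
    if kind in (PAIR_TABLE, HYBRID):
        # For hybrid images, "pairs" is assumed to hold only the genuine pairs.
        return list(entry.get("pairs", []))
    raise ValueError(f"unknown classification: {kind!r}")
```

Raising on an unknown label, rather than returning an empty list, is a design
choice in this sketch so that a misclassified image is noticed, not dropped.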

merge_pdf_into_book.py now picks pair source by priority:
  llm-vision → bounding-box OCR → block-alternation heuristic.
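
The priority order above can be sketched as a first-non-empty fallback. This
is an illustration under assumptions, not the code in merge_pdf_into_book.py:
the function name pick_pairs, the dictionary keyed by source name, and the
(spanish, english) tuple shape are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (spanish, english); assumed shape

# Highest-priority source first, in the order given in the commit message.
SOURCE_PRIORITY = (
    "llm-vision",
    "bounding-box OCR",
    "block-alternation heuristic",
)


def pick_pairs(candidates: Dict[str, List[Pair]]) -> Tuple[Optional[str], List[Pair]]:
    """Return (source, pairs) from the first source that produced any pairs."""
    for source in SOURCE_PRIORITY:
        pairs = candidates.get(source)
        if pairs:  # skip sources that are missing or returned nothing
            return source, pairs
    return None, []
```

For example, if the vision pass produced nothing for an image but the
bounding-box OCR did, pick_pairs returns the OCR pairs and names that source.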

Numbers (across the whole book):
  - mis-oriented tables: 114 → 5
  - quarantined cards:   250 → 2
  - extracted pairs:     2832 → 4569

textbookDataVersion bumped to 13. Per-batch agent outputs are gitignored
under Conjuga/Scripts/textbook/paired_vocab_llm/; only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:48:04 -05:00

57 lines, 1.1 KiB, Plaintext

# Xcode
build/
DerivedData/
*.xcodeproj/xcuserdata/
*.xcworkspace/xcuserdata/
*.xcodeproj/project.xcworkspace/xcshareddata/IDEWorkspaceChecks.plist
xcuserdata/
# Swift Package Manager
.build/
.swiftpm/
Packages/
# macOS
.DS_Store
*.swp
*~
# CocoaPods
Pods/
# Secrets / env
.env
*.p12
*.mobileprovision
# Archives
*.xcarchive
# Claude
.claude/
# Reference/research docs (not part of the app)
screens/
conjugato/
conjuu-es/
# Video scraping pipeline (kept locally for reruns, not committed)
scrape/
*.webm
*.mp4
*.mkv
# Third-party textbook sources (not redistributable)
*.pdf
*.epub
epub_extract/
# Textbook extraction artifacts — regenerate locally via run_pipeline.sh.
# Scripts are committed; their generated outputs are not.
Conjuga/Scripts/textbook/*.json
Conjuga/Scripts/textbook/review.html
Conjuga/Scripts/textbook/paired_vocab_llm/
# Note: the app-bundle copies (Conjuga/Conjuga/textbook_{data,vocab}.json)
# ARE committed so `xcodebuild` works on a fresh clone without first running
# the pipeline. They're regenerated from the scripts when content changes.