# Vocab catalog build

`build_lexemes.py` produces `Conjuga/vocab_lexemes.json`, the bundled catalog of frequency-ranked Spanish nouns and adjectives that powers the Noun / Adjective flashcard study modes.

## Run

```sh
python3 build_lexemes.py
```

Downloads `frequency.csv` + `es-en.data` from a pinned commit of [`doozan/spanish_data`](https://github.com/doozan/spanish_data), caches them under `.cache//`, joins them, and writes the JSON. Re-running is fast: only the join step happens after the first download.

Override defaults:

```sh
python3 build_lexemes.py --max-nouns 3000 --max-adjectives 1000
python3 build_lexemes.py --output /tmp/vocab.json
```

## Data sources & attribution

All datasets are CC-licensed; the bundled catalog inherits CC-BY-SA. Credit in the app's About screen must read:

> Vocabulary data: Wiktionary (CC-BY-SA), OpenSubtitles via FrequencyWords
> (CC-BY-SA 3.0).

- **`frequency.csv`**: derived from [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords) (OpenSubtitles corpus), packaged by doozan. License: CC-BY-SA 3.0.
- **`es-en.data`**: Spanish→English Wiktionary export in the [`enwiktionary_wordlist`](https://github.com/doozan/enwiktionary_wordlist) format. License: CC-BY-SA.

The pinned doozan commit is at the top of `build_lexemes.py` (`DOOZAN_COMMIT`). Bump it to refresh; the cache key includes the commit so old data is auto-replaced.

## Output shape

```json
[
  {
    "baseForm": "casa",
    "english": "house",
    "partOfSpeech": "noun",
    "gender": "f",
    "frequencyRank": 142,
    "exampleES": "La casa es grande",
    "exampleEN": "The house is big"
  },
  ...
]
```

Sorted by `frequencyRank` ascending so the fresh-card path in `LexemePool` surfaces the most useful words first.
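
A quick sanity check after a rebuild can confirm that the file matches the shape and ordering described above. The following is a minimal sketch, not part of `build_lexemes.py`: the helper names `load_catalog` and `check_catalog` are hypothetical, and treating `baseForm`, `english`, `partOfSpeech`, and `frequencyRank` as required on every entry is an assumption based on the sample entry.

```python
import json
from pathlib import Path

# Assumption: these four fields appear on every entry. The sample entry also
# shows gender / exampleES / exampleEN, which this sketch treats as optional.
REQUIRED_KEYS = {"baseForm", "english", "partOfSpeech", "frequencyRank"}


def load_catalog(path):
    """Read the catalog JSON (a list of entry objects) from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_catalog(entries):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
    # The catalog is documented as sorted by frequencyRank ascending.
    ranks = [e["frequencyRank"] for e in entries if "frequencyRank" in e]
    if ranks != sorted(ranks):
        problems.append("entries are not sorted by frequencyRank ascending")
    return problems


# Usage (path taken from the top of this README):
#   for problem in check_catalog(load_catalog("Conjuga/vocab_lexemes.json")):
#       print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that changed after bumping `DOOZAN_COMMIT`.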