OFApp/server
Trey T 4ba88d96f4 forum scraper: dedup downloads by content hash, not just filename
Mirror of a627388, applied to the forum image path. The same image is
often re-uploaded under different filenames across pages and posts, so an
existsSync check on the target name can't catch content duplicates. After
fetching the buffer, take the md5 of its first 64KB and compare it against
existing same-size files in the target folder (the same md5+size
signature the gallery's duplicate scanner uses). Confirmed against a
known dani-speegle-2 pair:

  skip IMG_79695f8914f20ce38b07.jpg — same content as
       72759c89-7e53-4976-839a-7d952c444579.jpg
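
A minimal sketch of the check described above, assuming Node.js. The
helper names partialMd5 and findContentDuplicate are hypothetical and are
not taken from the repository; only the md5-of-first-64KB plus size
signature comes from this commit message.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

const HASH_BYTES = 64 * 1024; // only the first 64KB is hashed

// md5 of the first 64KB of a buffer (hypothetical helper).
function partialMd5(buffer) {
  return crypto
    .createHash('md5')
    .update(buffer.subarray(0, HASH_BYTES))
    .digest('hex');
}

// Returns the name of an existing file in `dir` whose size and partial md5
// both match `buffer`, or null if there is no such file.
function findContentDuplicate(dir, buffer) {
  const wanted = partialMd5(buffer);
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    const stat = fs.statSync(full);
    // Cheap filter first: different size means different content.
    if (!stat.isFile() || stat.size !== buffer.length) continue;
    // Read only the prefix that the hash needs.
    const fd = fs.openSync(full, 'r');
    try {
      const head = Buffer.alloc(Math.min(HASH_BYTES, stat.size));
      fs.readSync(fd, head, 0, head.length, 0);
      if (partialMd5(head) === wanted) return name;
    } finally {
      fs.closeSync(fd);
    }
  }
  return null;
}
```

Because only the first 64KB is hashed, two same-size files that differ
only beyond that prefix would be treated as duplicates; that trade-off is
inherent in the signature.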

The size index (buildSizeIndex) is built once per job in runForumScrape
and threaded through scrapeForumPage → downloadImage, so the hash cache
is amortized across all pages in the job.
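
One way the once-per-job index and its hash cache could fit together,
sketched under assumptions: buildSizeIndex is named in this commit, but
its return shape here (dir, bySize, hashCache) and the helpers
hashOnDisk and saveUnlessDuplicate are hypothetical.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

const HASH_BYTES = 64 * 1024;

function partialMd5(buffer) {
  return crypto
    .createHash('md5')
    .update(buffer.subarray(0, HASH_BYTES))
    .digest('hex');
}

// Built once per job: maps file size -> names of files with that size.
// Hashes are not computed here; they are filled in lazily and cached.
function buildSizeIndex(dir) {
  const bySize = new Map();
  for (const name of fs.readdirSync(dir)) {
    const stat = fs.statSync(path.join(dir, name));
    if (!stat.isFile()) continue;
    if (!bySize.has(stat.size)) bySize.set(stat.size, []);
    bySize.get(stat.size).push(name);
  }
  return { dir, bySize, hashCache: new Map() };
}

// Partial md5 of a file already on disk, computed at most once per job.
function hashOnDisk(index, name) {
  let hash = index.hashCache.get(name);
  if (hash === undefined) {
    const full = path.join(index.dir, name);
    const fd = fs.openSync(full, 'r');
    try {
      const size = fs.fstatSync(fd).size;
      const head = Buffer.alloc(Math.min(HASH_BYTES, size));
      fs.readSync(fd, head, 0, head.length, 0);
      hash = partialMd5(head);
    } finally {
      fs.closeSync(fd);
    }
    index.hashCache.set(name, hash);
  }
  return hash;
}

// Writes `buffer` as `filename` unless a same-size, same-hash file exists.
function saveUnlessDuplicate(index, filename, buffer) {
  const wanted = partialMd5(buffer);
  const candidates = index.bySize.get(buffer.length) || [];
  const dup = candidates.find((name) => hashOnDisk(index, name) === wanted);
  if (dup !== undefined) return { saved: false, duplicateOf: dup };
  fs.writeFileSync(path.join(index.dir, filename), buffer);
  // Keep the index current so later pages in the same job see this file.
  index.bySize.set(buffer.length, [...candidates, filename]);
  index.hashCache.set(filename, wanted);
  return { saved: true, duplicateOf: null };
}
```

In this sketch the directory is listed only once, and each existing file
is hashed at most once no matter how many pages the job visits.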

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 10:14:23 -05:00