DiffSeeker is a cross-volume file comparison tool that scans, catalogs, and queries files by content (hash + size) rather than path, enabling reliable duplicate detection and difference analysis across disks, archives, and filesystems.
Wesley R. Elsberry 80f3ecff3a Initial Makefile 2025-12-17 04:47:09 +00:00

DiffSeeker

DiffSeeker scans directory trees, records file metadata plus content hashes, and supports cross-volume comparison for:

  • duplicates (same hash + size) across volumes
  • missing files (present on one volume, absent on others by hash+size)
  • suspicious divergences (same name, different size)
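
The three comparisons above can be sketched in plain Python. This is only an illustration and not DiffSeeker's actual implementation: the function name `compare_volumes` and the sample records are hypothetical, and the dictionary keys simply reuse field names from the Data model section.

```python
from collections import defaultdict

# Hypothetical sample records; keys reuse field names from the data model.
records = [
    {"name": "a.txt", "size": 10, "hash_value": "h1", "volume_name": "VOL_A"},
    {"name": "a.txt", "size": 10, "hash_value": "h1", "volume_name": "VOL_B"},
    {"name": "b.txt", "size": 20, "hash_value": "h2", "volume_name": "VOL_A"},
    {"name": "c.txt", "size": 30, "hash_value": "h3", "volume_name": "VOL_A"},
    {"name": "c.txt", "size": 31, "hash_value": "h4", "volume_name": "VOL_B"},
]


def compare_volumes(records):
    """Return (duplicates, missing, divergent) for a list of file records."""
    all_volumes = {r["volume_name"] for r in records}
    by_content = defaultdict(set)  # (hash, size) -> volumes holding that content
    by_name = defaultdict(set)     # file name -> distinct sizes seen
    for r in records:
        by_content[(r["hash_value"], r["size"])].add(r["volume_name"])
        by_name[r["name"]].add(r["size"])

    # Same hash + size present on more than one volume.
    duplicates = {k: sorted(v) for k, v in by_content.items() if len(v) > 1}
    # Content absent from at least one volume; value lists the volumes lacking it.
    missing = {k: sorted(all_volumes - v)
               for k, v in by_content.items() if v != all_volumes}
    # Same name seen with more than one size.
    divergent = {n: sorted(s) for n, s in by_name.items() if len(s) > 1}
    return duplicates, missing, divergent


duplicates, missing, divergent = compare_volumes(records)
```

Keying on the `(hash_value, size)` pair rather than on the path is what lets the comparison ignore renames and relocations between volumes.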

Python CLI (mpchunkcfa compatible)

Install (editable dev install):

pip install -e .

Scan a directory and emit CSV:

mpchunkcfa --walk /path/to/root -V "VOL_A" -c vol_a.csv

Scan and ingest into SQLite:

mpchunkcfa --walk /path/to/root -V "VOL_A" --db diffseeker.db
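
Once records are in SQLite, cross-volume duplicates could be found with a grouped query. The sketch below is an assumption-heavy illustration: the table name `files` and its column layout are hypothetical (this README does not document the schema that `--db` creates), and it uses an in-memory database so it does not touch a real `diffseeker.db`.

```python
import sqlite3

# Hypothetical schema: the real table created by --db may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    "name TEXT, relative_path TEXT, size INTEGER, "
    "hash_value TEXT, volume_name TEXT)"
)
rows = [
    ("a.txt", "a.txt", 10, "h1", "VOL_A"),
    ("copy.txt", "bak/copy.txt", 10, "h1", "VOL_B"),
    ("b.txt", "b.txt", 20, "h2", "VOL_A"),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", rows)

# Content (hash + size) that appears on more than one volume.
query = """
SELECT hash_value, size, COUNT(DISTINCT volume_name) AS n_volumes
FROM files
GROUP BY hash_value, size
HAVING COUNT(DISTINCT volume_name) > 1
ORDER BY hash_value
"""
dupes = conn.execute(query).fetchall()
conn.close()
```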

Exclude directory elements (repeatable):

mpchunkcfa --walk . -V "VOL_A" --exclude .git --exclude .svn
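
Exclusion matches whole path elements, not substrings. A minimal sketch of that idea follows; the helper `is_excluded` is hypothetical and not taken from the DiffSeeker source, and it uses `PurePosixPath` so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def is_excluded(relative_path, excluded):
    """Return True if any element of the path exactly equals an excluded name."""
    return any(part in excluded for part in PurePosixPath(relative_path).parts)


excluded = {".git", ".svn"}
results = {
    ".git/config": is_excluded(".git/config", excluded),
    "src/.svn/entries": is_excluded("src/.svn/entries", excluded),
    ".gitignore": is_excluded(".gitignore", excluded),
    "src/main.py": is_excluded("src/main.py", excluded),
}
```

Because the comparison is per element, `.gitignore` is kept even though its name starts with `.git`.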

Data model

Each file record includes: name, relative_path, extension, size, creation_date, modified_date, hash_value, file_type, number_of_files, volume_name.
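
As an illustration of how such a record might be assembled, here is a minimal sketch. It is not the tool's actual code: the function `sketch_record` is hypothetical, SHA-256 is an assumed hash algorithm (the README does not say which one is used), and `creation_date`, `file_type`, and `number_of_files` are left out because their exact semantics are not specified here.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def sketch_record(root, relative_path, volume_name, chunk_size=1 << 20):
    """Build a partial file record; the hash algorithm (SHA-256) is an assumption."""
    full_path = os.path.join(root, relative_path)
    st = os.stat(full_path)
    digest = hashlib.sha256()
    with open(full_path, "rb") as fh:
        # Read in chunks so large files are never loaded into memory whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    name = os.path.basename(relative_path)
    return {
        "name": name,
        "relative_path": relative_path,
        "extension": os.path.splitext(name)[1],
        "size": st.st_size,
        "modified_date": datetime.fromtimestamp(
            st.st_mtime, tz=timezone.utc
        ).isoformat(),
        "hash_value": digest.hexdigest(),
        "volume_name": volume_name,
    }


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "hello.txt"), "wb") as fh:
        fh.write(b"hello")
    record = sketch_record(root, "hello.txt", "VOL_A")
```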

Notes on compatibility and correctness

  • An earlier runtime error (root undefined in the worker) is fixed by passing root and directory into _compute_record.
  • The exclusion logic is path-element based, so .gitignore is not excluded when excluding .git.
  • creation_date has OS-dependent semantics: on Unix it is ctime (the time of the last inode change), not the file's birth time. Portable “birth time” support would require explicit per-platform normalization.
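
One possible shape for per-platform normalization of creation_date is sketched below. It is an illustration only, not part of DiffSeeker: the function name `creation_or_change_time` is hypothetical. It relies on `os.stat_result.st_birthtime`, which Python exposes on macOS and the BSDs (and on Windows from Python 3.12), and falls back to `st_ctime` elsewhere while labeling what that value actually means.

```python
import os
import tempfile


def creation_or_change_time(path):
    """Return (timestamp, kind) where kind is "birth" or "ctime"."""
    st = os.stat(path)
    birth = getattr(st, "st_birthtime", None)
    if birth is not None:
        return birth, "birth"        # true creation time where the OS reports it
    if os.name == "nt":
        return st.st_ctime, "birth"  # on Windows, st_ctime is the creation time
    return st.st_ctime, "ctime"      # on other Unix: last inode change, not creation


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile() as tmp:
    timestamp, kind = creation_or_change_time(tmp.name)
```

Recording the `kind` alongside the timestamp keeps records from different platforms from being compared as if they meant the same thing.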