Dedup API
find_duplicates()
def find_duplicates(
    paths: Sequence[str | PathLike[str]],
    *,
    recursive: bool = True,
    min_size: int = 1,
    algorithm: Literal["sha256", "blake3"] = "blake3",
    partial_hash_size: int = 4096,
    max_workers: int | None = None,
    progress_callback: Callable[[str, int, int], None] | None = None,
) -> list[DuplicateGroup]
Find duplicate files using a staged pipeline.
Identifies duplicates in three stages, each eliminating non-duplicates before the next, more expensive step:
- Size grouping: files with a unique size are eliminated.
- Partial hash: the first and last partial_hash_size bytes of each remaining file are compared.
- Full hash: remaining candidates are fully hashed to confirm.
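
The three stages above can be sketched in plain Python. This is a minimal illustration of the general technique, not pyfs_watcher's implementation: it uses SHA-256 from hashlib only, runs single-threaded, and the names staged_duplicates, _regroup, _partial_digest, and _full_digest are hypothetical. How the library treats files shorter than twice partial_hash_size is not stated here; the sketch simply lets the head and tail reads overlap.

```python
import hashlib
import os
from collections import defaultdict


def _partial_digest(path, chunk):
    """Hash the first and last `chunk` bytes of a file (assumed behaviour)."""
    h = hashlib.sha256()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        h.update(f.read(chunk))
        if size > chunk:
            # Jump to the tail; head and tail may overlap for small files.
            f.seek(size - chunk)
            h.update(f.read(chunk))
    return h.hexdigest()


def _full_digest(path, block=1 << 20):
    """Hash the whole file in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while data := f.read(block):
            h.update(data)
    return h.hexdigest()


def _regroup(groups, key):
    """Split each candidate group by `key`; drop groups left with one file."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        out.extend(b for b in buckets.values() if len(b) > 1)
    return out


def staged_duplicates(files, partial_hash_size=4096):
    # Stage 1: size grouping (cheap metadata lookup only).
    candidates = _regroup([list(files)], os.path.getsize)
    # Stage 2: partial hash of head and tail.
    candidates = _regroup(
        candidates, lambda p: _partial_digest(p, partial_hash_size)
    )
    # Stage 3: full hash confirms the survivors.
    return _regroup(candidates, _full_digest)
```

Each stage only sees the files that survived the previous one, so the full read in stage 3 is paid only for files that already match in size and in both ends.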
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | Sequence[str \| PathLike[str]] | required | Directories or files to scan |
| recursive | bool | True | Whether to recurse into subdirectories |
| min_size | int | 1 | Ignore files smaller than this many bytes |
| algorithm | Literal["sha256", "blake3"] | "blake3" | Hash algorithm |
| partial_hash_size | int | 4096 | Bytes to read from head and tail for partial hashing |
| max_workers | int \| None | None | Max parallel threads (None = all cores) |
| progress_callback | Callable[[str, int, int], None] \| None | None | (stage, processed, total) callback |
Progress Callback
The callback receives three arguments:
| Argument | Type | Description |
|---|---|---|
| stage | str | "collecting", "partial_hash", or "full_hash" |
| processed | int | Items processed so far in this stage |
| total | int | Total items in this stage |
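
A callback does not have to be a lambda; any callable that accepts (stage, processed, total) fits the signature. The sketch below uses a hypothetical ProgressTracker class that remembers the latest counts per stage. The calls at the bottom are simulated by hand and stand in for what find_duplicates() would report.

```python
class ProgressTracker:
    """Callable matching the (stage, processed, total) callback signature."""

    def __init__(self):
        # Maps stage name -> (processed, total) from the most recent call.
        self.latest = {}

    def __call__(self, stage: str, processed: int, total: int) -> None:
        self.latest[stage] = (processed, total)

    def percent(self, stage: str) -> float:
        """Completion percentage for a stage; 0.0 if not seen or empty."""
        processed, total = self.latest.get(stage, (0, 0))
        return 100.0 * processed / total if total else 0.0


tracker = ProgressTracker()
# Simulated invocations (hypothetical values, for illustration only).
tracker("collecting", 50, 200)
tracker("partial_hash", 3, 12)
```

An instance would be passed as progress_callback=tracker. Whether the callback is invoked from worker threads is not stated here; if it is, shared state like the dictionary above may need a lock.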
Returns
A list of DuplicateGroup objects sorted by wasted_bytes descending.
Raises
- HashError: If hashing fails for any file.
Example
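import pyfs_watcher

# Scan both trees, skipping files under 1 KiB, and print progress per stage.
groups = pyfs_watcher.find_duplicates(
    ["/photos", "/backup"],
    min_size=1024,
    progress_callback=lambda stage, done, total: print(f"{stage}: {done}/{total}"),
)
# Groups arrive sorted by wasted_bytes, largest first.
for g in groups:
    print(f"{g.file_size}B x {len(g.paths)} copies = {g.wasted_bytes}B wasted")
    for path in g.paths:
        print(f"  {path}")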
DuplicateGroup
A group of files that share identical content. Returned by find_duplicates(). Groups are sorted by wasted_bytes in descending order.
Properties
| Property | Type | Description |
|---|---|---|
| hash_hex | str | Hex-encoded hash digest shared by all files |
| file_size | int | Size of each file in bytes |
| paths | list[str] | Absolute paths of the duplicate files |
| wasted_bytes | int | file_size * (count - 1), where count is the number of files in the group |
Protocols
- __len__() -> int: Number of duplicate files in this group
- __repr__() -> str
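
To make the wasted_bytes formula and the descending sort order concrete, here is a small worked example. GroupSketch is a hypothetical stand-in that mirrors the documented properties; it is not the library's DuplicateGroup class, and the paths and sizes are made up.

```python
from dataclasses import dataclass


@dataclass
class GroupSketch:
    """Hypothetical stand-in for DuplicateGroup, for illustration only."""

    hash_hex: str
    file_size: int
    paths: list

    @property
    def wasted_bytes(self) -> int:
        # Every copy beyond the first is redundant.
        return self.file_size * (len(self.paths) - 1)

    def __len__(self) -> int:
        return len(self.paths)


groups = [
    GroupSketch("aa", 1_000, ["/x/1", "/y/1"]),                  # 1000 * 1
    GroupSketch("bb", 400, ["/x/2", "/y/2", "/z/2", "/w/2"]),    # 400 * 3
]
# Same ordering the documentation describes: largest waste first.
groups.sort(key=lambda g: g.wasted_bytes, reverse=True)
```

The group of four 400-byte files sorts ahead of the pair of 1000-byte files because it wastes 1200 bytes against 1000, so a larger file size alone does not put a group first.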