Skip to content

Hash

Hash files using BLAKE3 or SHA-256 with automatic memory-mapped I/O and parallel batch processing.

Algorithms

Algorithm Speed Output Use Case
blake3 (default) ~10x faster than SHA-256 64-char hex Deduplication, integrity checks, caching
sha256 Standard speed 64-char hex Interoperability, compliance, verification

BLAKE3 is the default because it is cryptographically secure and extremely fast — it leverages SIMD and multithreading internally.


Single File Hashing

import pyfs_watcher

result = pyfs_watcher.hash_file("large.iso")
print(result.hash_hex)       # "d74981efa70a0c880b..."
print(result.algorithm)      # "blake3"
print(result.file_size)      # 4294967296

Choosing an algorithm

# BLAKE3 (default, fastest)
result = pyfs_watcher.hash_file("data.bin")

# SHA-256 (for compatibility)
result = pyfs_watcher.hash_file("data.bin", algorithm="sha256")

Custom chunk size

# Smaller chunks for memory-constrained environments
result = pyfs_watcher.hash_file("data.bin", chunk_size=65536)

The default chunk size is 1 MB (1_048_576 bytes). Files larger than 4 MB automatically use memory-mapped I/O regardless of chunk size.


Parallel Batch Hashing

Hash many files at once using all available CPU cores:

paths = ["file1.bin", "file2.bin", "file3.bin"]
results = pyfs_watcher.hash_files(paths, algorithm="blake3")

for r in results:
    print(f"{r.path}: {r.hash_hex}")

Note

The order of results may differ from the input order because files are processed in parallel.

Progress callback

def on_hash(result):
    print(f"Hashed: {result.path} ({result.file_size} bytes)")

results = pyfs_watcher.hash_files(paths, callback=on_hash)

Limiting workers

# Use at most 4 threads
results = pyfs_watcher.hash_files(paths, max_workers=4)

By default, max_workers=None uses all available CPU cores via Rayon.


Memory-Mapped I/O

For files larger than 4 MB, pyfs-watcher automatically uses memory-mapped I/O (mmap) instead of buffered reads. This allows the OS to manage page caching efficiently, which is particularly beneficial when hashing large files.

You don't need to configure this — it happens automatically.


HashResult as Dict Key

HashResult supports equality comparison and hashing based on the digest and algorithm, so you can use instances in sets and as dictionary keys:

results = pyfs_watcher.hash_files(paths)

# Group files by hash
by_hash = {}
for r in results:
    by_hash.setdefault(r.hash_hex, []).append(r.path)

# Find duplicates
duplicates = {h: paths for h, paths in by_hash.items() if len(paths) > 1}

Tip

For full duplicate detection with a three-stage pipeline, use find_duplicates() instead.


Error Handling

try:
    result = pyfs_watcher.hash_file("/nonexistent")
except FileNotFoundError:
    print("File does not exist")
except pyfs_watcher.HashError as e:
    print(f"Hashing failed: {e}")

When using hash_files(), individual file errors are included in the raised HashError rather than silently skipped.


Performance Tips

  • Use BLAKE3 (the default) for maximum throughput.
  • Use hash_files() for batch operations — parallel processing saturates disk I/O better than sequential calls.
  • The 4 MB mmap threshold is tuned for modern SSDs. No configuration needed.
  • Limit max_workers if you want to leave CPU headroom for other tasks.