Architecture¶

pyfs-watcher is a Python package with a Rust core, using PyO3 to bridge the two languages. This page explains the internal design decisions.

System Overview¶

graph TB
    subgraph Python
        A[pyfs_watcher] --> B[_core.pyi<br/>Type Stubs]
        A --> C[watch.py<br/>async_watch]
    end

    subgraph PyO3 Boundary
        D[pyo3 Extension Module<br/>pyfs_watcher._core]
    end

    subgraph Rust
        E[walk module<br/>jwalk + rayon]
        F[hash module<br/>blake3 + sha2 + mmap]
        G[copy module<br/>chunked I/O]
        H[watch module<br/>notify + debouncer]
        I[dedup module<br/>staged pipeline]
        J[search module<br/>regex + rayon]
        K[diff module<br/>walk + hash]
        L[sync module<br/>walk + hash + copy]
        M[snapshot module<br/>walk + hash + serde]
        N[du module<br/>jwalk]
        O[rename module<br/>regex + walk]
    end

    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    D --> K
    D --> L
    D --> M
    D --> N
    D --> O

GIL Management¶

The Global Interpreter Lock (GIL) is Python's mechanism for thread safety. Since Rust operations don't need the GIL, pyfs-watcher releases it during heavy computation using py.allow_threads():

sequenceDiagram
    participant Python
    participant PyO3
    participant Rust

    Python->>PyO3: hash_files(paths)
    PyO3->>PyO3: Release GIL
    PyO3->>Rust: Parallel hashing (rayon)
    Note right of Rust: GIL released —<br/>Python threads can run
    Rust-->>PyO3: Results
    PyO3->>PyO3: Acquire GIL
    PyO3-->>Python: list[HashResult]

This means other Python threads can run while Rust is working, making pyfs-watcher a good citizen in multi-threaded Python applications.

Exception: The streaming walk() iterator acquires the GIL for each yielded item. For maximum throughput when you need all results, use walk_collect() which only acquires the GIL once.

Streaming Walk Architecture¶

The walk() function uses a crossbeam channel to stream results from the parallel traversal engine:

graph LR
    subgraph Rust Threads
        A[jwalk<br/>Thread Pool] -- WalkEntry --> B[crossbeam<br/>channel]
    end

    subgraph Python Thread
        B -- GIL acquire per item --> C[WalkIter<br/>__next__]
    end

jwalk uses Rayon to traverse directories in parallel
Entries are sent through a bounded crossbeam channel
The Python WalkIter.__next__() receives one entry at a time, acquiring the GIL for each
Early termination (breaking from the loop) drops the channel, cleanly stopping the traversal

Memory-Mapped I/O¶

For file hashing, pyfs-watcher uses a size-based strategy:

File Size	Strategy	Why
< 4 MB	Buffered reads	Lower overhead for small files
>= 4 MB	Memory-mapped I/O	OS handles page caching efficiently

The 4 MB threshold is hardcoded based on benchmarking across SSDs and HDDs. Memory mapping avoids an extra copy from kernel space to user space, which becomes significant for large files.

Dedup Pipeline¶

The deduplication pipeline is designed to minimize I/O by eliminating non-duplicates early:

graph TD
    A[All Files] --> B[Group by Size]
    B -- Unique sizes eliminated --> C[Partial Hash<br/>First + last 4KB]
    C -- Unique partial hashes eliminated --> D[Full Hash<br/>Entire file content]
    D --> E[Duplicate Groups]

Stage 1 — Size grouping: Files are grouped by size. Any file with a unique size is immediately excluded. This is a pure metadata operation (no I/O).

Stage 2 — Partial hash: For each size group, the first and last partial_hash_size bytes (default 4096) are read and hashed. Files with unique partial hashes are eliminated. This reads at most 8 KB per file.

Stage 3 — Full hash: Remaining candidates are fully hashed using the selected algorithm. Files with matching full hashes are confirmed duplicates.

Each stage uses Rayon for parallel processing.

Rayon Thread Pools¶

CPU-bound operations (hashing, dedup) use Rayon thread pools:

Default: Uses all available CPU cores
Configurable: max_workers parameter limits the pool size
Per-call pools: Each function call creates its own thread pool to avoid contention
Rayon handles work-stealing and load balancing automatically

Error Mapping¶

Rust errors are mapped to Python exceptions at the PyO3 boundary:

Rust Error	Python Exception
`FsError::Walk(...)`	`WalkError`
`FsError::Hash(...)`	`HashError`
`FsError::Copy(...)`	`CopyError`
`FsError::Watch(...)`	`WatchError`
`FsError::Search(...)`	`SearchError`
`FsError::DirDiff(...)`	`DirDiffError`
`FsError::Sync(...)`	`SyncError`
`FsError::Snapshot(...)`	`SnapshotError`
`FsError::DiskUsage(...)`	`DiskUsageError`
`FsError::Rename(...)`	`RenameError`
`std::io::ErrorKind::NotFound`	`FileNotFoundError`
`std::io::ErrorKind::PermissionDenied`	`PermissionError`

All custom exceptions inherit from FsWatcherError, which inherits from Python's Exception.

Logging Bridge¶

pyfs-watcher uses pyo3-log to bridge Rust's log crate to Python's logging module. Enable debug logging to see internal operations:

import logging
logging.basicConfig(level=logging.DEBUG)

import pyfs_watcher
# Rust-level log messages now appear in Python's logging output

Key Dependencies¶

Crate	Purpose
pyo3	Python ↔ Rust bindings
jwalk	Parallel directory traversal
blake3	BLAKE3 hashing
sha2	SHA-256 hashing
rayon	Data parallelism
notify	Cross-platform filesystem events
notify-debouncer-full	Event debouncing
crossbeam-channel	Lock-free channels
memmap2	Memory-mapped file I/O
pyo3-log	Logging bridge
regex	Content search and batch rename
serde	Snapshot serialization
serde_json	JSON snapshot format
chrono	Timestamp generation