Walk
Recursively traverse directory trees with parallel I/O powered by jwalk and Rayon.
Streaming vs Collecting
pyfs-watcher offers two walk functions, each suited to different use cases:
| Function | Returns | Best For |
|---|---|---|
| `walk()` | Streaming `WalkIter` | Large trees, early termination, low memory |
| `walk_collect()` | `list[WalkEntry]` | Full result set, maximum throughput |
walk_collect() is faster when you need all results because it avoids per-item GIL overhead by collecting everything in Rust before returning to Python.
Basic Usage
Streaming iteration
import pyfs_watcher

for entry in pyfs_watcher.walk("/data"):
    print(entry.path, entry.is_file, entry.file_size)
The iterator yields WalkEntry objects as the parallel traversal engine discovers them. You can break out of the loop early without waiting for the full scan to complete.
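Early termination can be sketched as a small helper that returns at the first match. The names `first_match` and `fake_walk` are hypothetical and introduced only for illustration; the stand-in generator replaces `pyfs_watcher.walk(...)` so the snippet runs without the library installed.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Optional


def first_match(entries: Iterable, predicate: Callable) -> Optional[object]:
    """Return the first entry satisfying predicate, or None.

    Returning from inside the loop abandons the iterator, so a streaming
    source is not consumed past the match.
    """
    for entry in entries:
        if predicate(entry):
            return entry
    return None


def fake_walk(consumed: list):
    """Stand-in generator (not part of pyfs_watcher) that records what it yields."""
    samples = [("/data/a.txt", 10), ("/data/big.bin", 5_000), ("/data/c.txt", 20)]
    for path, size in samples:
        consumed.append(path)
        yield SimpleNamespace(path=path, is_file=True, file_size=size)


consumed: list = []
hit = first_match(fake_walk(consumed), lambda e: e.is_file and e.file_size > 1_000)
print(hit.path)   # /data/big.bin
print(consumed)   # ['/data/a.txt', '/data/big.bin']; the third entry was never produced
```

With the library installed, the same helper would presumably be called as `first_match(pyfs_watcher.walk("/data"), ...)`.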
Bulk collection
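Because `walk_collect()` returns a plain list, ordinary list operations apply to the result. The sketch below is an illustration, not the library's own example: `summarize` is a hypothetical helper, and the stand-in entries only mimic the `is_file` and `file_size` properties so the snippet runs without the library.

```python
from types import SimpleNamespace
from typing import Iterable


def summarize(entries: Iterable) -> tuple[int, int]:
    """Return (file_count, total_bytes) for a collected walk result."""
    file_count = 0
    total_bytes = 0
    for entry in entries:
        if entry.is_file:
            file_count += 1
            total_bytes += entry.file_size
    return file_count, total_bytes


# Stand-in for `pyfs_watcher.walk_collect("/data")`.
fake_entries = [
    SimpleNamespace(is_file=True, file_size=10),
    SimpleNamespace(is_file=False, file_size=0),
    SimpleNamespace(is_file=True, file_size=5),
]
print(len(fake_entries), summarize(fake_entries))  # 3 (2, 15)
```

In real use the list would come from `entries = pyfs_watcher.walk_collect("/data")`.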
Filtering
Both functions accept the same filtering parameters:
By file type
# Only files
files = pyfs_watcher.walk_collect("/src", file_type="file")
# Only directories
dirs = pyfs_watcher.walk_collect("/src", file_type="dir")
By glob pattern
# Only Python files
for entry in pyfs_watcher.walk("/project", file_type="file", glob_pattern="*.py"):
    print(entry.path)
The glob pattern matches against the filename only, not the full path.
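The filename-only rule can be illustrated with the standard library. This is a sketch under the assumption that `fnmatch`-style matching is a reasonable stand-in; the glob dialect the library actually uses may differ, and `name_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def name_matches(path: str, pattern: str) -> bool:
    """Match the pattern against the final path component only."""
    return fnmatchcase(PurePosixPath(path).name, pattern)


print(name_matches("/project/src/app.py", "*.py"))      # True
print(name_matches("/project/src/app.py", "src/*.py"))  # False: directory parts are not seen
print(name_matches("/project/py/notes.md", "*.py"))     # False: "py" is only a parent directory
```

Under this reading, a pattern containing a directory component would never match, so patterns should describe the filename alone.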
By depth
# Only top-level contents (depth 1)
entries = pyfs_watcher.walk_collect("/data", max_depth=1)
# Up to 3 levels deep
entries = pyfs_watcher.walk_collect("/data", max_depth=3)
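The depth numbering can be reproduced with `pathlib` for reasoning about which entries a given `max_depth` would keep. This sketch assumes that `max_depth=N` keeps entries whose depth is at most `N`, which is consistent with the examples above but not confirmed here; `relative_depth` and `within_depth` are hypothetical helpers.

```python
from pathlib import PurePosixPath


def relative_depth(root: str, path: str) -> int:
    """Number of path components between root and path (direct children = 1)."""
    return len(PurePosixPath(path).relative_to(PurePosixPath(root)).parts)


def within_depth(root: str, path: str, max_depth: int) -> bool:
    """Assumed rule: keep an entry when its depth does not exceed max_depth."""
    return relative_depth(root, path) <= max_depth


print(relative_depth("/data", "/data/a.txt"))        # 1
print(relative_depth("/data", "/data/x/y/z.txt"))    # 3
print(within_depth("/data", "/data/x/y/z.txt", 1))   # False
```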
Skip hidden files
With hidden-file filtering enabled, entries whose name starts with a dot (.git, .env, etc.) are excluded.
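The naming rule itself is simple to express. The sketch below checks only an entry's own name; whether the library also prunes everything beneath a hidden directory is not stated in this section, and `is_hidden` is a hypothetical helper rather than part of the library.

```python
from pathlib import PurePosixPath


def is_hidden(path: str) -> bool:
    """True when the final path component starts with a dot."""
    return PurePosixPath(path).name.startswith(".")


print(is_hidden("/repo/.git"))         # True
print(is_hidden("/repo/.env"))         # True
print(is_hidden("/repo/src/main.py"))  # False
```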
Sorting
When sort=True, entries within each directory are sorted by path. This makes output deterministic but adds overhead.
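Since `sort=True` orders entries within each directory, a single ordering across the whole result may still need a sort on the Python side. A minimal sketch, assuming a collected list of entries with a `path` property; `sort_by_path` is a hypothetical helper and the entries are stand-ins.

```python
from types import SimpleNamespace


def sort_by_path(entries):
    """Return a new list ordered by full path string."""
    return sorted(entries, key=lambda entry: entry.path)


fake_entries = [
    SimpleNamespace(path=p)
    for p in ["/data/b/2.txt", "/data/a.txt", "/data/b/1.txt"]
]
ordered = [entry.path for entry in sort_by_path(fake_entries)]
print(ordered)  # ['/data/a.txt', '/data/b/1.txt', '/data/b/2.txt']
```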
Symlinks
By default, symlinks are not followed to avoid infinite loops.
WalkEntry Properties
Each entry provides:
| Property | Type | Description |
|---|---|---|
| `path` | `str` | Absolute path |
| `is_file` | `bool` | Regular file? |
| `is_dir` | `bool` | Directory? |
| `is_symlink` | `bool` | Symbolic link? |
| `depth` | `int` | Depth relative to root (children = 1) |
| `file_size` | `int` | Size in bytes (0 for directories) |
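For type-checked code, the listed properties can be mirrored in a structural type. This is a sketch only: `EntryLike` and `bytes_per_depth` are hypothetical names, not part of the library, and the sample entries are stand-ins.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Iterable, Protocol


class EntryLike(Protocol):
    """Structural type mirroring the documented WalkEntry properties."""

    path: str
    is_file: bool
    is_dir: bool
    is_symlink: bool
    depth: int
    file_size: int


def bytes_per_depth(entries: Iterable[EntryLike]) -> dict[int, int]:
    """Sum file sizes per depth level, ignoring anything that is not a file."""
    totals: dict[int, int] = defaultdict(int)
    for entry in entries:
        if entry.is_file:
            totals[entry.depth] += entry.file_size
    return dict(totals)


def fake(path, is_file, depth, size):
    """Build a stand-in entry with the same attribute names."""
    return SimpleNamespace(path=path, is_file=is_file, is_dir=not is_file,
                           is_symlink=False, depth=depth, file_size=size)


sample = [
    fake("/data/a.txt", True, 1, 10),
    fake("/data/sub", False, 1, 0),
    fake("/data/sub/b.txt", True, 2, 35),
]
result = bytes_per_depth(sample)
print(result)  # {1: 10, 2: 35}
```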
Comparison with os.walk
# os.walk: single-threaded, yields (dirpath, dirnames, filenames)
import os

for root, dirs, files in os.walk("/data"):
    for f in files:
        print(os.path.join(root, f))

# pyfs_watcher.walk: parallel, yields WalkEntry with metadata
import pyfs_watcher

for entry in pyfs_watcher.walk("/data", file_type="file"):
    print(entry.path, entry.file_size)
Key differences:
- Speed: pyfs-watcher uses parallel I/O across multiple threads, which is significantly faster than single-threaded traversal on SSDs and network filesystems.
- Metadata: `WalkEntry` includes `file_size`, `is_symlink`, and `depth` without extra `os.stat()` calls.
- Filtering: Built-in glob, depth, type, and hidden-file filtering happens in Rust before crossing the Python boundary.
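To make the metadata point concrete, here is what collecting file sizes looks like with the standard library alone. The sketch uses only `os.walk` and `os.stat`; `walk_with_sizes` is a hypothetical helper, and the temporary directory exists just to keep the example self-contained.

```python
import os
import tempfile


def walk_with_sizes(root: str):
    """Yield (path, size_in_bytes) for files using only the standard library."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # os.walk yields names only, so each size needs its own stat call.
            yield full, os.stat(full).st_size


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "a.bin"), "wb") as fh:
        fh.write(b"x" * 12)
    sizes = {os.path.basename(p): s for p, s in walk_with_sizes(tmp)}

print(sizes)  # {'a.bin': 12}
```

The per-file `os.stat()` call in the inner loop is the step that `entry.file_size` makes unnecessary.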
Error Handling
import pyfs_watcher

try:
    entries = pyfs_watcher.walk_collect("/nonexistent")
except pyfs_watcher.WalkError as e:
    print(f"Walk failed: {e}")
Individual unreadable subdirectories are silently skipped rather than raising exceptions. Only root-level errors raise WalkError.
Performance Tips
- Use `walk_collect()` when you need all results; it avoids per-entry GIL acquisition.
- Use `file_type` and `glob_pattern` to filter in Rust rather than in Python.
- Avoid `sort=True` unless you need deterministic ordering.
- Set `max_depth` when you only need shallow results.
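How much these tips matter depends on the tree and the storage, so it can be worth measuring on your own data. The timing helper below is a generic sketch using only the standard library; `time_call` is a hypothetical name, and the workload shown is a placeholder so the snippet runs anywhere.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder workload; swap in a walk call to measure real traversals.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

With the library installed, one could compare `time_call(lambda: pyfs_watcher.walk_collect("/data"))` against `time_call(lambda: list(pyfs_watcher.walk("/data")))`. Filesystem caches warm up after the first pass, so repeated runs tend to be faster than a cold scan.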