Skip to content

Walk

Recursively traverse directory trees with parallel I/O powered by jwalk and Rayon.

Streaming vs Collecting

pyfs-watcher offers two walk functions, each suited to different use cases:

Function Returns Best For
walk() Streaming WalkIter Large trees, early termination, low memory
walk_collect() list[WalkEntry] Full result set, maximum throughput

walk_collect() is faster when you need all results because it avoids per-item GIL overhead by collecting everything in Rust before returning to Python.


Basic Usage

Streaming iteration

import pyfs_watcher

for entry in pyfs_watcher.walk("/data"):
    print(entry.path, entry.is_file, entry.file_size)

The iterator yields WalkEntry objects as the parallel traversal engine discovers them. You can break out of the loop early without waiting for the full scan to complete.

Bulk collection

entries = pyfs_watcher.walk_collect("/data", sort=True)
print(f"Found {len(entries)} entries")

Filtering

Both functions accept the same filtering parameters:

By file type

# Only files
files = pyfs_watcher.walk_collect("/src", file_type="file")

# Only directories
dirs = pyfs_watcher.walk_collect("/src", file_type="dir")

By glob pattern

# Only Python files
for entry in pyfs_watcher.walk("/project", file_type="file", glob_pattern="*.py"):
    print(entry.path)

The glob pattern matches against the filename only, not the full path.

By depth

# Only top-level contents (depth 1)
entries = pyfs_watcher.walk_collect("/data", max_depth=1)

# Up to 3 levels deep
entries = pyfs_watcher.walk_collect("/data", max_depth=3)

Skip hidden files

entries = pyfs_watcher.walk_collect("/home/user", skip_hidden=True)

Entries whose name starts with a dot (.git, .env, etc.) are excluded.

Sorting

entries = pyfs_watcher.walk_collect("/data", sort=True)

When sort=True, entries within each directory are sorted by path. This makes output deterministic but adds overhead.

entries = pyfs_watcher.walk_collect("/data", follow_symlinks=True)

By default, symlinks are not followed to avoid infinite loops.


WalkEntry Properties

Each entry provides:

Property Type Description
path str Absolute path
is_file bool Regular file?
is_dir bool Directory?
is_symlink bool Symbolic link?
depth int Depth relative to root (children = 1)
file_size int Size in bytes (0 for directories)

Comparison with os.walk

# os.walk — single-threaded, yields (dirpath, dirnames, filenames)
import os
for root, dirs, files in os.walk("/data"):
    for f in files:
        print(os.path.join(root, f))

# pyfs_watcher.walk — parallel, yields WalkEntry with metadata
import pyfs_watcher
for entry in pyfs_watcher.walk("/data", file_type="file"):
    print(entry.path, entry.file_size)

Key differences:

  • Speed: pyfs-watcher uses parallel I/O across multiple threads, which is significantly faster on SSDs and network filesystems.
  • Metadata: WalkEntry includes file_size, is_symlink, and depth without extra os.stat() calls.
  • Filtering: Built-in glob, depth, type, and hidden-file filtering happens in Rust before crossing the Python boundary.

Error Handling

try:
    entries = pyfs_watcher.walk_collect("/nonexistent")
except pyfs_watcher.WalkError as e:
    print(f"Walk failed: {e}")

Individual unreadable subdirectories are silently skipped rather than raising exceptions. Only root-level errors raise WalkError.


Performance Tips

  • Use walk_collect() when you need all results — it avoids per-entry GIL acquisition.
  • Use file_type and glob_pattern to filter in Rust rather than in Python.
  • Avoid sort=True unless you need deterministic ordering.
  • Set max_depth when you only need shallow results.