Content-Hash File Cache Pattern

Overview

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this survives file moves/renames and auto-invalidates when content changes.

Core Pattern

1. Content-Hash Based Cache Key

import hashlib
from pathlib import Path

def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 64 KiB at a time so large files never sit fully in memory.
        while chunk := f.read(65536):
            sha256.update(chunk)
    return sha256.hexdigest()

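Why content hash? A renamed or moved file still produces the same key, so it is a cache hit. Changed content produces a different key, so the stale entry is simply never read again (it is orphaned, not deleted). No index file is needed.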
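
The rename and edit behaviour can be checked directly. The sketch below is illustrative only: the file names and the helper demo_rename_and_edit are hypothetical, and it assumes Python 3.9 or later.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents, read in 64 KiB chunks."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(65536):
            sha256.update(chunk)
    return sha256.hexdigest()


def demo_rename_and_edit() -> tuple[str, str, str]:
    """Return hashes (original, after rename, after edit) for a throwaway file."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.txt"  # hypothetical file name
        original.write_bytes(b"quarterly numbers")
        before = compute_file_hash(original)

        # Renaming changes the path but not the bytes, so the key is unchanged.
        renamed = original.rename(Path(tmp) / "report-final.txt")
        after_rename = compute_file_hash(renamed)

        # Editing the bytes yields a new key; the old entry is never looked up again.
        renamed.write_bytes(b"quarterly numbers, revised")
        after_edit = compute_file_hash(renamed)
    return before, after_rename, after_edit
```

The first two hashes match and the third differs, which is the whole invalidation mechanism: nothing is deleted, the old key just stops being requested.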

2. File-Based Cache Storage

Each entry is stored as {hash}.json in the cache directory, so a lookup is a single file open by name (O(1)), with no index to maintain.
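
A minimal sketch of the storage helpers, under stated assumptions: the CacheEntry field names (source_path, document), the plain-string document type, and the write-to-temp-then-rename step are illustrative choices, not something this pattern prescribes. Corrupt or missing entries return None, matching the design decision listed further down.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class CacheEntry:
    # Field order mirrors CacheEntry(file_hash, str(file_path), doc) in the wrapper.
    file_hash: str
    source_path: str  # assumed name; informational only, never used for lookup
    document: str     # assumed to be plain text for this sketch


def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    """Store one entry as {hash}.json, creating the directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "file_hash": entry.file_hash,
        "source_path": entry.source_path,
        "document": entry.document,
    }
    target = cache_dir / f"{entry.file_hash}.json"
    # Extra precaution (an assumption of this sketch): write to a temp file,
    # then rename, so a crash cannot leave a half-written entry behind.
    tmp = cache_dir / f"{entry.file_hash}.json.tmp"
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(target)


def read_cache(cache_dir: Path, file_hash: str) -> Optional[CacheEntry]:
    """Return the cached entry, or None on a miss or a corrupt file."""
    target = cache_dir / f"{file_hash}.json"
    try:
        data = json.loads(target.read_text(encoding="utf-8"))
        return CacheEntry(data["file_hash"], data["source_path"], data["document"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing, unreadable, or malformed entries are all treated as misses.
        return None
```

Because the file name is derived from the hash alone, there is no directory scan and no separate index that could drift out of sync with the entries.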

3. Service Layer Wrapper (SRP)

Keep processing functions pure. Add caching as a separate service layer:

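def extract_with_cache(file_path: Path, *, cache_enabled: bool = True, cache_dir: Path = Path(".cache")):
    # extract_text, read_cache, write_cache and CacheEntry are assumed to be defined elsewhere.
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge
    file_hash = compute_file_hash(file_path)
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        return cached.document  # Hit: skip the expensive extraction
    # Miss (or corrupt entry): process the file and store the result under its hash.
    doc = extract_text(file_path)
    write_cache(cache_dir, CacheEntry(file_hash, str(file_path), doc))
    return doc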

Key Design Decisions

Decision	Rationale
SHA-256 content hash	Path-independent, auto-invalidates on change
`{hash}.json` file naming	O(1) lookup, no index needed
Service layer wrapper	SRP: extraction stays pure, cache is separate
Corruption returns `None`	Graceful degradation, re-processes on next run

Best Practices

Hash content, not paths — paths change, content identity doesn't
Chunk large files when hashing — avoid loading entire files into memory
Keep processing functions pure — they should know nothing about caching
Log cache hit/miss with truncated hashes for debugging
Handle corruption gracefully — treat invalid cache entries as misses
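
The logging and corruption practices above can be sketched together. This is one possible shape, not a reference implementation: process_with_logging, the logger name, the 12-character prefix, and the {"document": ...} payload are all assumptions made for the example.

```python
import hashlib
import json
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("file_cache")  # hypothetical logger name


def _hash_file(path: Path) -> str:
    """SHA-256 of file contents, read in 64 KiB chunks."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(65536):
            sha256.update(chunk)
    return sha256.hexdigest()


def process_with_logging(path: Path, process: Callable[[Path], str], cache_dir: Path) -> str:
    """Run `process` through the cache, logging hit or miss with a short hash."""
    file_hash = _hash_file(path)
    short = file_hash[:12]  # truncated for log lines; the full hash is still the key
    entry = cache_dir / f"{file_hash}.json"
    try:
        result = json.loads(entry.read_text(encoding="utf-8"))["document"]
    except (OSError, ValueError, KeyError, TypeError):
        # Absent and corrupt entries look the same to the caller: a miss.
        logger.debug("cache miss %s (%s)", short, path.name)
    else:
        logger.debug("cache hit %s (%s)", short, path.name)
        return result
    result = process(path)  # `process` stays pure and knows nothing about caching
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps({"document": result}), encoding="utf-8")
    return result
```

A corrupt entry is overwritten by the fresh result on the same call, so the cache repairs itself without any explicit cleanup step.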

Anti-Patterns

Path-based caching (breaks on file move/rename)
Adding cache logic inside processing functions (SRP violation)
Using dataclasses.asdict() with nested frozen dataclasses (it flattens them into plain dicts and has no inverse for rebuilding typed objects on read; use manual serialization)
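
For the last anti-pattern, manual serialization might look like the sketch below. Page and Document are hypothetical result types invented for the example; the point is only that each nested type gets an explicit to-dict and from-dict path, so a cache read returns the same typed, frozen objects that were written.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:  # hypothetical nested type
    number: int
    text: str


@dataclass(frozen=True)
class Document:  # hypothetical result type
    title: str
    pages: tuple[Page, ...]


def document_to_dict(doc: Document) -> dict:
    """Explicit, JSON-safe representation of a Document."""
    return {
        "title": doc.title,
        "pages": [{"number": p.number, "text": p.text} for p in doc.pages],
    }


def document_from_dict(data: dict) -> Document:
    """Rebuild the typed, frozen objects from the JSON-safe form."""
    pages = tuple(Page(number=p["number"], text=p["text"]) for p in data["pages"])
    return Document(title=data["title"], pages=pages)


def roundtrip(doc: Document) -> Document:
    """Serialize to a JSON string and back, as a cache write and read would."""
    return document_from_dict(json.loads(json.dumps(document_to_dict(doc))))
```

The explicit from-dict function is also a natural place to raise KeyError or TypeError on malformed data, which the cache reader can then treat as a miss.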

When to Use / When NOT to Use

Use for: File processing pipelines, CLI tools with --cache/--no-cache, batch processing with repeated files.

Avoid for: Real-time data that must always be fresh, extremely large cache entries (use streaming), results depending on parameters beyond file content.
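
For the CLI case listed under "Use for", a --cache/--no-cache pair can be sketched with argparse. The tool name, the --cache-dir option, and build_parser are assumptions for illustration; argparse.BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing --cache / --no-cache and a cache directory option."""
    parser = argparse.ArgumentParser(prog="extract")  # hypothetical tool name
    parser.add_argument("file", type=Path, help="file to process")
    # BooleanOptionalAction generates both --cache and --no-cache from one option.
    parser.add_argument(
        "--cache",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="reuse cached results keyed by content hash",
    )
    parser.add_argument(
        "--cache-dir",  # hypothetical option
        type=Path,
        default=Path(".cache"),
        help="directory holding {hash}.json entries",
    )
    return parser


# The parsed values map directly onto the wrapper's keyword arguments, e.g.
# extract_with_cache(args.file, cache_enabled=args.cache, cache_dir=args.cache_dir)
```

Keeping the flag at the CLI boundary means the processing function never sees it, which preserves the separation described in the service layer section.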
