Regex vs LLM Decision Framework for Structured Text

Regex handles 95-98% of structured text cheaply and deterministically; reserve LLM calls for the remaining edge cases.

Overview

A practical decision framework for parsing structured text. The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.

Decision Tree

Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
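The tree above can be written as a small function. This is a minimal sketch: `choose_strategy` and its parameter names are hypothetical, and the two inputs are assumed to be fractions you have measured on a sample of your own data.

```python
from typing import Optional


def choose_strategy(pattern_share: float, regex_coverage: Optional[float] = None) -> str:
    """Map the decision tree onto a function.

    pattern_share:  fraction of the text that follows a repeating pattern (0..1).
    regex_coverage: fraction of items the regex parses correctly, once measured.
    """
    if pattern_share <= 0.90:
        return "llm"        # free-form, highly variable text
    if regex_coverage is None or regex_coverage >= 0.95:
        return "regex"      # start with regex; stay there if it covers 95%+
    return "regex+llm"      # regex first, LLM for the edge cases only
```

Calling it without `regex_coverage` returns the starting point for consistent text ("regex"); once coverage has been measured, the same call tells you whether the LLM step is worth adding.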

Architecture

Source Text → [Regex Parser] → [Confidence Scorer]
                                    │
                                    ├── High confidence (≥0.95) → Direct output
                                    └── Low confidence (<0.95) → [LLM Validator] → Output

Implementation Pattern

1. Regex Parser (handles 95-98%)

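import re

def parse_structured_text(content: str) -> list[ParsedItem]:
    # ParsedItem is assumed to be a frozen dataclass with id, text, choices, answer fields.
    pattern = re.compile(
        r"^(?P<id>\d+)\.\s*(?P<text>.+)\n"   # "1. Question text"
        r"(?P<choices>(?:[A-D]\..+\n)+)"     # one or more "A. choice" lines
        r"Answer:\s*(?P<answer>[A-D])",
        # ^ needs MULTILINE; no DOTALL, so an item missing "Answer:" cannot swallow the next one
        re.MULTILINE,
    )
    # Extract matches into frozen dataclasses
    return [
        ParsedItem(int(m["id"]), m["text"].strip(), tuple(m["choices"].splitlines()), m["answer"])
        for m in pattern.finditer(content)
    ]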
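A self-contained version of the parser with sample input may help when testing the pattern. This is a sketch: the `ParsedItem` field types and the sample quiz text are assumptions, and the pattern is anchored per line without `re.DOTALL` so that an item lacking an "Answer:" line is skipped instead of absorbing the next item.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    # Field names mirror the regex group names; the types are an assumption.
    id: int
    text: str
    choices: tuple[str, ...]
    answer: str


ITEM_PATTERN = re.compile(
    r"^(?P<id>\d+)\.\s*(?P<text>.+)\n"   # "1. Question text"
    r"(?P<choices>(?:[A-D]\..+\n)+)"     # one or more "A. choice" lines
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE,
)


def parse_structured_text(content: str) -> list[ParsedItem]:
    return [
        ParsedItem(
            id=int(m["id"]),
            text=m["text"].strip(),
            choices=tuple(m["choices"].splitlines()),
            answer=m["answer"],
        )
        for m in ITEM_PATTERN.finditer(content)
    ]


# Hypothetical sample input in the format the pattern expects.
SAMPLE = (
    "1. What is 2 + 2?\n"
    "A. 3\n"
    "B. 4\n"
    "Answer: B\n"
    "2. Which is a prime number?\n"
    "A. 4\n"
    "B. 6\n"
    "C. 7\n"
    "D. 9\n"
    "Answer: C\n"
)

items = parse_structured_text(SAMPLE)
```

Items the pattern cannot match are simply absent from the result, so comparing `len(items)` with the expected item count is a cheap first health check.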

2. Confidence Scoring

Flag items that may need LLM review based on heuristics (few choices, missing answers, short text).
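One way to turn these heuristics into a score is to start at 1.0 and subtract a penalty per triggered check. The sketch below is illustrative only: the penalty weights, the "fewer than 3 choices" and "shorter than 10 characters" cutoffs, and the convention that a missing answer is an empty string are all assumptions to tune against your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: int
    text: str
    choices: tuple[str, ...]
    answer: str  # empty string when no answer was extracted (assumption)


def score_confidence(item: ParsedItem) -> float:
    """Start at 1.0 and subtract a penalty for each triggered heuristic."""
    score = 1.0
    if len(item.choices) < 3:   # few choices
        score -= 0.3
    if not item.answer:         # missing answer
        score -= 0.5
    if len(item.text) < 10:     # short text
        score -= 0.2
    return max(score, 0.0)


def identify_low_confidence(items, threshold=0.95):
    """Return the items whose score falls below the threshold."""
    return [item for item in items if score_confidence(item) < threshold]
```

With a 0.95 threshold, any single triggered heuristic is enough to flag an item; lower the threshold or shrink the penalties if that sends too many items to the LLM.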

3. LLM Validator (edge cases only)

Use the cheapest available model (Haiku-class) to fix low-confidence extractions.
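A validator can stay independent of any particular SDK by depending on a tiny interface. In this sketch `LLMClient`, its `complete` method, the prompt wording and `validate_with_llm` are all hypothetical; wrap whatever client you actually use to fit. A real validator would probably also send the raw source span, which `ParsedItem` does not carry here.

```python
import json
from dataclasses import dataclass, replace
from typing import Protocol


@dataclass(frozen=True)
class ParsedItem:
    id: int
    text: str
    choices: tuple[str, ...]
    answer: str


class LLMClient(Protocol):
    """Hypothetical interface; adapt it to the SDK you actually use."""

    def complete(self, prompt: str) -> str: ...


PROMPT_TEMPLATE = (
    "Fix this extracted quiz item. Reply with JSON containing the keys "
    '"text", "choices" (list of strings) and "answer" (one letter A-D).\n\n'
    "Current extraction:\n{current}"
)


def validate_with_llm(item: ParsedItem, client: LLMClient) -> ParsedItem:
    """Ask the model for a corrected item; fall back to the regex result."""
    current = json.dumps(
        {"text": item.text, "choices": list(item.choices), "answer": item.answer}
    )
    try:
        data = json.loads(client.complete(PROMPT_TEMPLATE.format(current=current)))
        # Build a new instance instead of mutating the frozen original.
        return replace(
            item,
            text=str(data["text"]),
            choices=tuple(data["choices"]),
            answer=str(data["answer"]),
        )
    except (ValueError, KeyError, TypeError):
        return item  # unusable reply: keep the regex extraction as-is
```

Falling back to the original item on a malformed reply keeps the pipeline deterministic: the LLM can only improve an extraction, never drop it.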

4. Hybrid Pipeline

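def process_document(content, *, llm_client=None, confidence_threshold=0.95):
    items = parse_structured_text(content)          # Step 1: Regex
    low_confidence = identify_low_confidence(items, confidence_threshold)  # Step 2: Score (threshold argument is an assumption)
    if llm_client is None or not low_confidence:
        return items                                # No LLM configured, or nothing flagged
    # Step 3: LLM only for flagged items (validate_with_llm is an assumed helper)
    fixed = {item.id: validate_with_llm(item, llm_client) for item in low_confidence}
    return [fixed.get(item.id, item) for item in items]  # new list; originals untouched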

Production Metrics (410 items)

Metric                    Value
Regex success rate        98.0%
Low confidence items      8 (2.0%)
LLM calls needed          ~5
Cost savings vs all-LLM   ~95%
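The rates in the table can be recomputed from raw counts. The sketch below assumes that "regex success" means "not flagged as low confidence" and that the all-LLM baseline is one call per item; `PipelineMetrics` is a hypothetical name. It reproduces the share of calls avoided, not the ~95% cost figure, which also depends on per-call pricing and token counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineMetrics:
    total_items: int
    low_confidence_items: int
    llm_calls: int

    @property
    def regex_success_rate(self) -> float:
        # Assumption: an item counts as a regex success when it was not flagged.
        return (self.total_items - self.low_confidence_items) / self.total_items

    @property
    def llm_call_reduction(self) -> float:
        # Assumption: the all-LLM baseline makes one call per item.
        return 1 - self.llm_calls / self.total_items


metrics = PipelineMetrics(total_items=410, low_confidence_items=8, llm_calls=5)
print(f"regex success rate: {metrics.regex_success_rate:.1%}")  # 98.0%
print(f"LLM calls avoided:  {metrics.llm_call_reduction:.1%}")  # 98.8%
```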

Best Practices

  • Start with regex even if imperfect — it provides a baseline to improve
  • Use confidence scoring to programmatically identify what needs LLM help
  • Use the cheapest LLM for validation (Haiku-class models suffice)
  • Never mutate parsed items — return new instances
  • TDD works well for parsers — test known patterns first, then edge cases
  • Log metrics (regex success rate, LLM call count) for pipeline health
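The "never mutate" and TDD points combine naturally: frozen dataclasses reject in-place changes, `dataclasses.replace` builds the corrected copy, and both behaviours are easy to pin down in tests. A minimal sketch, with `with_answer` and the test names as hypothetical examples:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ParsedItem:
    id: int
    text: str
    choices: tuple[str, ...]
    answer: str


def with_answer(item: ParsedItem, answer: str) -> ParsedItem:
    """Return a corrected copy; the original item is left untouched."""
    return replace(item, answer=answer)


def test_replace_returns_new_instance():
    original = ParsedItem(1, "Which is a prime number?", ("A. 4", "B. 7"), "")
    corrected = with_answer(original, "B")
    assert corrected is not original
    assert corrected.answer == "B"
    assert original.answer == ""


def test_frozen_item_rejects_mutation():
    item = ParsedItem(1, "Which is a prime number?", ("A. 4", "B. 7"), "B")
    try:
        item.answer = "A"
    except FrozenInstanceError:
        pass
    else:
        raise AssertionError("frozen dataclass allowed mutation")
```

The test functions follow the pytest naming convention but use only plain `assert`, so they also run when called directly.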

Anti-Patterns

  • Sending all text to an LLM when regex handles 95%+ (expensive and slow)
  • Using regex for free-form, highly variable text (LLM is better)
  • Skipping confidence scoring (nothing then identifies which items need LLM review)
  • Mutating parsed objects during validation
