The Terabyte Log Scanner (Mainframe Telemetry)

Bridging Static and Dynamic Analysis

The Terabyte Log Scanner (terabyte_log_scanner.py) is a specialized, high-speed parser for very large SMF (System Management Facility) logs from mainframe environments, after those logs have been translated to ASCII text.

Rather than just searching for random text, it acts as a dynamic validation engine. It takes the "dead code" hypotheses generated by GitGalaxy's static analysis and cross-references them against live execution logs to prove whether a program is actually running in production.

The Input Handshake (Dynamic Targeting)

Scanning a multi-terabyte log for every possible word is computationally impractical. The Scanner instead uses an "Input Handshake" to restrict the search to a known set of targets:

  • Manual Keywords: Security engineers can pass specific targets manually via the -k flag to hunt for specific events.
  • Automated State Ingestion: The scanner can directly ingest an ir_state.json file generated by the broader GitGalaxy pipeline. It automatically extracts the list of known_programs and uses them as its search targets, allowing the engine to autonomously hunt for execution proof of thousands of files simultaneously.
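The two input paths above can be sketched as a single target-loading step. This is a minimal illustration, not the scanner's actual code: the function name load_targets is hypothetical, and it assumes known_programs is a top-level list of program names inside ir_state.json, which the source does not specify.

```python
import json
from pathlib import Path


def load_targets(keywords=(), state_path=None):
    """Merge manual keywords with program names from a state file.

    Hypothetical helper. Assumes the state file is a JSON object with a
    top-level "known_programs" list; the real layout may differ.
    """
    # Manual targets, e.g. the values passed via the -k flag.
    targets = {k.strip() for k in keywords}

    if state_path is not None:
        state = json.loads(Path(state_path).read_text(encoding="utf-8"))
        targets.update(str(name).strip() for name in state.get("known_programs", []))

    # Drop blanks and return a stable, de-duplicated order.
    targets.discard("")
    return sorted(targets)
```

De-duplicating before the scan keeps the eventual regex alternation as small as possible when both inputs name the same program.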

Mainframe Execution Physics (SMF/JCL Traps)

Simply searching a log for the word "AUTH" can return millions of false positives, because the same string also appears in text that has nothing to do with program execution. To improve accuracy, the Scanner wraps each search target in a strict mainframe execution regex.

It automatically prefixes targets with common SMF and JCL (Job Control Language) execution markers, specifically hunting for patterns like PGM=AUTH or STARTED AUTH. A hit is therefore far more likely to represent an actual program execution than an incidental text string.
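As a rough sketch of this wrapping step, the function below builds one bytes regex from a list of targets. Only the PGM= and STARTED markers come from the text above; the function name, the single-alternation layout, and the trailing lookahead (which rejects longer names such as AUTHX) are assumptions made for illustration.

```python
import re


def compile_execution_pattern(targets):
    """Compile a bytes regex that matches targets only after an execution marker.

    Hypothetical helper; the real scanner may use more markers or one
    pattern per target.
    """
    if not targets:
        raise ValueError("at least one target is required")

    # Longest names first so the alternation prefers the most specific match.
    names = b"|".join(
        re.escape(t.encode("ascii"))
        for t in sorted(targets, key=len, reverse=True)
    )

    # The lookahead rejects names that continue with another valid
    # program-name character (letters, digits, $, #, @).
    return re.compile(rb"(?:PGM=|STARTED\s+)(" + names + rb")(?![A-Z0-9$#@])")


pattern = compile_execution_pattern(["AUTH"])
print(pattern.search(b"//STEP1 EXEC PGM=AUTH,REGION=4M") is not None)  # execution marker present
print(pattern.search(b"USER AUTH FAILED") is not None)                 # plain text mention only
```

Capturing the program name in group 1 lets a later step attribute each hit to its target without a second search.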

The Memory Shield (Binary Velocity)

Like the PII Leak Hunter, the Terabyte Scanner employs a strict Memory Shield to achieve massive processing velocity:

  1. Binary Compilation: The SMF/JCL regex patterns are compiled directly as raw bytes.
  2. Lazy Decoding: The engine streams the multi-terabyte file in binary mode (rb). It decodes a line into UTF-8 text only when the compiled byte pattern matches that line. This avoids the CPU cost of decoding the vast majority of log data, which is irrelevant to the search.
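The two steps above can be sketched as a generator that reads the file in binary mode and decodes only matching lines. This is an illustrative sketch under stated assumptions: iter_hits and DEMO_PATTERN are hypothetical names, the pattern is hard-coded for two example programs, and undecodable bytes are replaced rather than raised on.

```python
import re

# Assumed demo pattern, compiled as bytes so it can run on raw file data.
DEMO_PATTERN = re.compile(rb"(?:PGM=|STARTED\s+)(AUTH|PAYROLL)\b")


def iter_hits(path, pattern):
    """Yield (program, decoded_line) for each line that matches the pattern.

    The file is streamed line by line in binary mode, so memory use stays
    flat regardless of file size, and non-matching lines are never decoded.
    """
    with open(path, "rb") as log:
        for raw in log:
            match = pattern.search(raw)
            if match is None:
                continue  # Skip the decode cost entirely for misses.
            program = match.group(1).decode("ascii", errors="replace")
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            yield program, line
```

Because this is a generator, a caller can count hits or bucket timestamps as lines arrive instead of holding every match in memory.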

Time-Series Histograms & UX Constraints

As the scanner rips through the logs, it uses a timestamp regex to extract the time of every execution hit. It aggregates these timestamps into time buckets and renders them as ASCII histograms directly in the terminal.

  • The Top 15 Filter (UX Fix): Rendering a full histogram for a log spanning six months would flood the terminal. The engine ranks the time buckets by hit count and displays only the 15 highest-volume buckets, re-sorted chronologically for readability.
  • Anomaly Highlighting: It calculates the average hit rate and automatically flags any time block whose count exceeds three times that average with an explicit <-- ANOMALY SPIKE warning.
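The bucketing, Top 15 filter, and anomaly flag could look roughly like the sketch below. Several details are assumptions not stated in the source: the timestamp format (YYYY-MM-DD HH:MM:SS), hourly buckets, the average being taken over non-empty buckets only, and the function names themselves.

```python
import re
from collections import Counter

# Assumed timestamp layout; real SMF extracts may use a different format.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}")


def bucket_hits(lines):
    """Count hits per hour, ignoring lines without a recognizable timestamp."""
    buckets = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            buckets[f"{match.group(1)} {match.group(2)}:00"] += 1
    return buckets


def render_histogram(buckets, top_n=15, threshold=3.0, width=40):
    """Return histogram rows for the top_n busiest buckets, in time order."""
    if not buckets:
        return []
    average = sum(buckets.values()) / len(buckets)
    # Rank by volume, keep the busiest top_n, then restore chronological order.
    top = sorted(buckets.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    peak = top[0][1]
    rows = []
    for bucket, count in sorted(top):
        bar = "#" * max(1, round(width * count / peak))
        flag = "  <-- ANOMALY SPIKE" if count > threshold * average else ""
        rows.append(f"{bucket} | {bar} {count}{flag}")
    return rows
```

Sorting the surviving buckets by their key works here because the assumed timestamp format sorts lexicographically in time order; a different format would need an explicit date parse.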

The Output Handshake (JSON Sidecar)

The Scanner does not confine its findings to the terminal. Once the scan completes, it generates a dynamic_telemetry.json sidecar payload.

This JSON file maps the exact execution counts back to the program targets. This sidecar can then be re-ingested by the GitGalaxy Orchestrator to officially mark high-hit files as "Hot Producers" or flag 0-hit files as confirmed "Dead Code" that is ready for safe decommissioning.
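A minimal sketch of writing such a sidecar follows. Only the file name dynamic_telemetry.json comes from the text above; the function name and the execution_counts and zero_hit_programs keys are hypothetical, since the source does not document the payload schema or the hit count at which a file becomes a "Hot Producer".

```python
import json
from pathlib import Path


def write_sidecar(targets, hits, out_path="dynamic_telemetry.json"):
    """Write per-program execution counts to a JSON sidecar file.

    Hypothetical schema. Every target is listed, so programs that were
    never seen appear explicitly with a count of 0.
    """
    counts = {name: int(hits.get(name, 0)) for name in sorted(targets)}
    payload = {
        "execution_counts": counts,
        # Zero-hit programs are candidates for the "Dead Code" label.
        "zero_hit_programs": [name for name, n in counts.items() if n == 0],
    }
    Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload
```

Listing zero-hit targets explicitly, rather than omitting them, lets a downstream consumer distinguish "scanned and never executed" from "never scanned".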




🌌 Powered by the blAST Engine

This documentation is part of the GitGalaxy Ecosystem, an AST-free, LLM-free heuristic knowledge graph engine.