Cookbook: Data Lineage & DAG Architecture via Deterministic RAG Pipelines
1. Architecture of a Deterministic RAG Engine
In enterprise legacy modernization, migrating isolated COBOL programs is a relatively solved problem. The true operational risk lies in migrating the orchestration layer—the complex, intertwined execution graphs where Program A writes a dataset that Program B and Program C must read in a specific sequence.
When attempting to map these execution pipelines, standard Large Language Models (LLMs) are unreliable. They suffer from context window fragmentation and frequently hallucinate dependencies, linking programs based on semantic proximity rather than actual data flow. More dangerously, an LLM may read an OPEN OUTPUT statement inside a deprecated, unreachable block of COBOL (dead code) and falsely assert a dependency that has not existed in production for decades.
To resolve this, the GitGalaxy ecosystem relies on a deterministic function-level knowledge graph.
Instead of treating code as a probabilistic text-completion exercise, this engine uses the blAST (Bypassing LLMs and ASTs) paradigm. It scans the raw source text with fast syntactic heuristics to map internal file pointers to external physical datasets. It identifies where data is ingested, where it is mutated, and where it is written out.
In a Retrieval-Augmented Generation (RAG) pipeline, this deterministic graph serves as the ground truth. The RAG engine retrieves the Directed Acyclic Graph (DAG) of the application's data lineage, which is derived from the source rather than inferred, allowing automated migration tools to generate modern execution orchestrators (such as Apache Airflow or AWS Step Functions) with zero-trust structural integrity.
2. The Data Lineage DAG Architect
The cobol_dag_architect.py script is a specialized pipeline mapping tool within the GitGalaxy ecosystem. It parses legacy COBOL to extract I/O intents and computes a dependency-safe execution sequence by topological sort.
To prevent "Ghost Dependencies"—false positives generated by scanning dead code—this Architect integrates directly with the broader GitGalaxy knowledge graph. It systematically masks out unreachable paragraphs prior to intent extraction, ensuring that the resulting dependency graph represents only the active, living application state.
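The masking step can be sketched as follows. This is a minimal illustration, not the Architect's actual code: the function name mask_dead_paragraphs, the PARA_HEADER pattern, and the assumption that paragraph headers are unindented lines holding only a name and a period are all hypothetical simplifications (real fixed-format COBOL needs column-aware handling).

```python
import re

# Assumption: a paragraph header is an unindented line holding only a name
# and a period (e.g. "OLD-EXPORT."); statements are indented.
PARA_HEADER = re.compile(r"^([A-Z0-9][A-Z0-9-]*)\.[ \t]*$", re.MULTILINE)


def mask_dead_paragraphs(source: str, dead_paras: set[str]) -> str:
    """Replace the bodies of dead paragraphs with spaces.

    Newlines are preserved so that line numbers and character offsets in the
    masked text still line up with the original source.
    """
    headers = list(PARA_HEADER.finditer(source))
    chars = list(source)
    for i, header in enumerate(headers):
        if header.group(1).upper() not in dead_paras:
            continue
        # A paragraph body runs from the end of its header to the next
        # header, or to the end of the file for the last paragraph.
        start = header.end()
        end = headers[i + 1].start() if i + 1 < len(headers) else len(source)
        for pos in range(start, end):
            if chars[pos] != "\n":
                chars[pos] = " "
    return "".join(chars)
```

Because the masked text keeps its original length, any later regex match can still be reported against the original line numbers.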
2.1 Information Flow & Processing Pipeline
The pipeline first runs a deterministic extraction pass to identify which datasets each program reads and writes, then applies graph theory to resolve the execution sequence.
| Processing Stage | Syntactic Heuristic | Architectural Purpose | Legacy Modernization Value |
|---|---|---|---|
| Physical Boundary Mapping | SELECT\s+(.+)\s+ASSIGN | Links internal COBOL file variables to external physical dataset names (DDs). | Establishes the physical boundaries of the application, critical for mapping cloud storage buckets. |
| Ghost Deflection | String substitution | Replaces strings inside known dead paragraphs (supplied via IR context) with empty spaces. | Prevents the regex engine from parsing OPEN statements inside dead code, eliminating hallucinated data lineage. |
| Intent Extraction | OPEN\s+(INPUT\|OUTPUT) | Evaluates the PROCEDURE DIVISION to determine if a physical file is read (Input) or written/mutated (Output, I-O, Extend). | Establishes the directional edges of the knowledge graph (Producer vs. Consumer). |
| Dynamic Execution Auditing | CALL\s+(?!['"]) | Flags CALL statements executing dynamically assigned variables rather than static string targets. | Highlights severe anti-patterns that require manual engineering intervention prior to automated microservice slicing. |
| Topological Sorting | Kahn's Algorithm | Processes the ingested Producers and Consumers to calculate a strict, linear execution hierarchy. | Mathematically proves the safe batch execution order and instantly detects catastrophic cyclic deadlocks. |
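
To make two of the heuristics in the table concrete, here is a hedged sketch of Physical Boundary Mapping and Dynamic Execution Auditing. The patterns start from the table's regexes; the extra capture groups (the ASSIGN target and the CALL variable name), the leading word boundary, and the function names map_physical_boundaries and find_dynamic_calls are assumptions added for illustration.

```python
import re

# Table pattern "SELECT\s+(.+)\s+ASSIGN", extended (assumption) to also
# capture the token that follows ASSIGN [TO] as the physical dataset name.
SELECT_ASSIGN = re.compile(
    r"SELECT\s+(.+)\s+ASSIGN\s+(?:TO\s+)?(\S+)", re.IGNORECASE
)

# Table pattern "CALL\s+(?!['\"])", extended (assumption) with a word
# boundary and a capture group for the variable being called.
DYNAMIC_CALL = re.compile(r"\bCALL\s+(?!['\"])([A-Z0-9-]+)", re.IGNORECASE)


def map_physical_boundaries(source: str) -> dict[str, str]:
    """Map each internal file name to the external name it is assigned to."""
    mapping = {}
    for match in SELECT_ASSIGN.finditer(source):
        internal = match.group(1).strip().upper()
        # Drop the trailing period and any quotes around the dataset name.
        physical = match.group(2).strip(".'\"").upper()
        mapping[internal] = physical
    return mapping


def find_dynamic_calls(source: str) -> list[str]:
    """Return the variables used as CALL targets instead of string literals."""
    return [m.group(1).upper() for m in DYNAMIC_CALL.finditer(source)]
```

This sketch does not handle clauses such as SELECT OPTIONAL, which would end up inside the first capture group.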
3. Notable Structures & Execution Logic
The script operates on two primary structural pillars: Lineage Extraction and Graph Resolution.
Lineage Extraction (extract_lineage)
This function acts as the deterministic I/O scanner. Rather than scanning the entire file indiscriminately, it takes a dead_paras set (generated by the Graveyard Reaper) and masks out the unreachable code blocks first. Once the code is sanitized, it maps the SELECT statements to their physical targets and scans the remaining active PROCEDURE DIVISION for OPEN commands, sorting the files into separate inputs and outputs sets.
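A rough, line-based sketch of this flow is shown below. The real extract_lineage signature and return type are not given in this document, so the parameters and the returned dictionary are assumptions. The sketch also skips lines inside dead paragraphs instead of blanking them, which has the same effect in this simplified form, and it extends the OPEN pattern to I-O and EXTEND to match the description above.

```python
import re

SELECT_RE = re.compile(r"SELECT\s+(.+)\s+ASSIGN\s+(?:TO\s+)?(\S+)")
# Assumption: one file per OPEN clause; I-O and EXTEND count as outputs.
OPEN_RE = re.compile(r"OPEN\s+(INPUT|OUTPUT|I-O|EXTEND)\s+([A-Z0-9-]+)")
# Assumption: paragraph headers are unindented "NAME." lines.
PARA_RE = re.compile(r"^([A-Z0-9][A-Z0-9-]*)\.\s*$")


def extract_lineage(source: str, dead_paras=frozenset()) -> dict:
    """Return {'inputs': set, 'outputs': set} of physical dataset names."""
    select_map = {}
    inputs, outputs = set(), set()
    in_procedure = False
    current_para = None
    for line in source.splitlines():
        text = line.upper()
        if "PROCEDURE DIVISION" in text:
            in_procedure = True
            continue
        if not in_procedure:
            # Before the PROCEDURE DIVISION: collect SELECT ... ASSIGN pairs.
            match = SELECT_RE.search(text)
            if match:
                select_map[match.group(1).strip()] = match.group(2).strip(".'\"")
            continue
        header = PARA_RE.match(text)
        if header:
            current_para = header.group(1)
            continue
        if current_para in dead_paras:
            continue  # Ghost Deflection: ignore statements in dead code
        for mode, internal in OPEN_RE.findall(text):
            physical = select_map.get(internal, internal)
            (inputs if mode == "INPUT" else outputs).add(physical)
    return {"inputs": inputs, "outputs": outputs}
```

Passing an empty dead_paras set disables the deflection, which makes it easy to compare the lineage with and without dead code.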
Graph Resolution (Kahn's Algorithm)
This segment translates the extracted lineage into a queryable Directed Acyclic Graph (DAG). It first maps which programs create which files (file_creators). It then iterates through the programs to find the consumers that read those generated files, recording the producer-to-consumer edges (dependencies) and, for each program, the number of upstream producers it still waits on (in_degree). Finally, it applies Kahn's Algorithm: it repeatedly removes a node whose in-degree is zero, appends it to the sorted list, and decrements the in-degree of that node's consumers. If the final sorted list contains fewer entries than the total number of programs, the leftover programs must form a cyclic dependency (e.g., Program A waits on Program B, which waits on Program A), and the script locks the pipeline.
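The graph resolution described above can be sketched as follows. The names file_creators, dependencies, and in_degree come from the text; the function name resolve_execution_order, the shape of the lineage argument, the choice to raise ValueError on a cycle, and the decision to ignore a program reading its own output are assumptions made for this illustration.

```python
from collections import deque


def resolve_execution_order(lineage: dict) -> list[str]:
    """Topologically sort programs with Kahn's Algorithm.

    lineage maps program name -> {"inputs": set, "outputs": set}.
    """
    # Which programs write each dataset.
    file_creators = {}
    for program, io in lineage.items():
        for dataset in io["outputs"]:
            file_creators.setdefault(dataset, set()).add(program)

    # Edges run producer -> consumer; in_degree counts distinct producers.
    dependencies = {program: set() for program in lineage}
    in_degree = {program: 0 for program in lineage}
    for consumer, io in lineage.items():
        for dataset in io["inputs"]:
            for producer in file_creators.get(dataset, ()):
                # Assumption: a program reading its own output is not an edge.
                if producer != consumer and consumer not in dependencies[producer]:
                    dependencies[producer].add(consumer)
                    in_degree[consumer] += 1

    # Start from programs that wait on nothing; sorting keeps runs repeatable.
    queue = deque(sorted(p for p, degree in in_degree.items() if degree == 0))
    order = []
    while queue:
        program = queue.popleft()
        order.append(program)
        for consumer in sorted(dependencies[program]):
            in_degree[consumer] -= 1
            if in_degree[consumer] == 0:
                queue.append(consumer)

    if len(order) != len(lineage):
        stuck = sorted(set(lineage) - set(order))
        raise ValueError(f"Cyclic dependency detected among: {stuck}")
    return order
```

Any program left out of the sorted list never reached an in-degree of zero, which is only possible if it sits on a cycle or downstream of one, so the length check is sufficient to detect the deadlock.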
4. Execution Interface
The architect is executed via a headless CLI, designed to output execution sequences for CI/CD orchestrators or cloud migration planning.
# Execute a bulk lineage scan and topological sort across a legacy domain
python3 cobol_dag_architect.py src/legacy/batch_programs/
Sample Output:
==========================================================
⚡ ZERO-TRUST EXECUTION PIPELINE (TOPOLOGICAL SORT)
==========================================================
STEP 01: Run [GL-EXTRACT]
↳ Reads : None
↳ Writes: GL-MASTER-OUT
----------------------------------------------------------
STEP 02: Run [GL-TRANSFORM]
↳ Reads : GL-MASTER-OUT
↳ Writes: GL-REPORT-DATA
----------------------------------------------------------
5. Recommended Next Steps (Refactoring for Enterprise Scale)
To scale this pipeline mapping tool for enterprise-wide cloud orchestration, implement the following architectural enhancements:
- Airflow DAG Generation: The script currently outputs standard text to the console. Refactor the output layer to automatically generate native Python Apache Airflow DAG files, instantly translating COBOL batch sequences into modern cloud orchestration artifacts.
- Persistent Sub-Graph Injection: Rather than passing dead_paras as an optional argument, integrate a direct SQLite query to the GitGalaxy core database. The Architect should autonomously fetch the dead code map for every evaluated program to guarantee Ghost Deflection is active across all pipeline runs.
- Graphing Dynamic Memory Resolvers: The current implementation flags dynamic CALL variables but leaves them unresolved. Enhance the heuristic engine to trace the data flow of the CALL variable backward to its last MOVE or INITIALIZE assignment, allowing the engine to mathematically resolve dynamic execution targets and connect previously fragmented execution graphs.
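
As a starting point for the dynamic CALL enhancement, the backward trace might look like the sketch below. It is only an approximation: it walks backward in textual order rather than following control flow, it resolves only MOVE statements with a quoted literal source, and the function name resolve_dynamic_calls and its return shape are hypothetical.

```python
import re

DYNAMIC_CALL = re.compile(r"\bCALL\s+(?!['\"])([A-Z0-9-]+)", re.IGNORECASE)


def resolve_dynamic_calls(source: str) -> list[tuple]:
    """Return (line_number, variable, target) for each dynamic CALL.

    target is the literal from the nearest preceding MOVE '<literal>' TO
    <variable>, or None when no such MOVE is found above the CALL.
    """
    lines = source.splitlines()
    results = []
    for idx, line in enumerate(lines):
        call = DYNAMIC_CALL.search(line)
        if not call:
            continue
        variable = call.group(1).upper()
        # Match MOVE 'PGM' TO <variable>, refusing longer names that merely
        # start with the same characters.
        move = re.compile(
            r"\bMOVE\s+['\"]([^'\"]+)['\"]\s+TO\s+"
            + re.escape(variable)
            + r"(?![A-Z0-9-])",
            re.IGNORECASE,
        )
        target = None
        for prior in reversed(lines[:idx]):
            found = move.search(prior)
            if found:
                target = found.group(1).upper()
                break
        results.append((idx + 1, variable, target))
    return results
```

Calls that come back with a target of None still need manual review, for example when the variable is populated from another variable or from input data.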
🌌 Powered by the blAST Engine
This documentation is part of the GitGalaxy Ecosystem, an AST-free, LLM-free heuristic knowledge graph engine.
🪐 Explore the GitHub Repository for code, tools, and updates.
🔭 Visualize your own repository at GitGalaxy.io using our interactive 3D WebGPU dashboard.