Introduction

Validating the security consistency of the Matter protocol across its large and heterogeneous multi-language ecosystem—from the rigorously engineered official C++ reference implementation to community-driven Rust, TypeScript, and Python variants—is not merely a test of human effort. It is fundamentally a challenge in designing scalable and effective automated audit architectures.

While building our automated audit system, matter-auditor, we took it through a substantial architectural evolution. The system progressed from an initial rule-based static filtering approach to a neuro-symbolic architecture that combines Abstract Syntax Trees (ASTs) with the ReAct paradigm. This was not a ground-up rewrite, but a precise refactoring driven by a clearer understanding of LLM capability boundaries, and by a deliberate separation between the system’s perception layer and reasoning layer.

In this article, we revisit the technical decisions behind that evolution and explain how insights from recent academic and industrial work helped us address core challenges such as cross-language context loss and logical hallucination.

First Exploration: A Rule-Based “Dual-Track” Design and Its Limits

In the early stages of the project, we adopted an intentionally pragmatic design philosophy. We started from the assumption that LLMs already possessed strong code-understanding capabilities, and that the primary bottleneck lay in context-window limitations. From this perspective, the most pressing need was an efficient funnel that could reduce massive codebases into small, high-value fragments suitable for LLM consumption.

Based on this assumption, we designed a classic dual-track architecture: deterministic rule-based filtering on one track, LLM review on the other. Guided by the Matter specification, we decomposed the audit surface into independent domains such as Secure Channel, TLV, and ACL, and applied regular expressions and keyword matching as a first-stage filter. At the time, this appeared to be an engineering-level local optimum: strong, manually curated rules constrained the LLM’s attention and enabled rapid discovery of explicit code defects.
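As a sketch, that first-stage filter amounted to little more than the following. The domain names come from the text above; the patterns themselves are illustrative, not the actual matter-auditor rule set:

```python
import re

# Illustrative first-stage filter: map Matter audit domains to keyword
# patterns and keep only the source lines that match, as candidate
# fragments for LLM review.
DOMAIN_PATTERNS = {
    "secure_channel": re.compile(r"\b(CASE|PASE|SessionEstablishment)\b"),
    "tlv":            re.compile(r"\bTLV(Reader|Writer)?\b"),
    "acl":            re.compile(r"\b(AccessControl|ACL|Privilege)\b"),
}

def first_stage_filter(source: str) -> dict:
    """Return, per domain, the (line_no, line) pairs worth sending onward."""
    hits = {domain: [] for domain in DOMAIN_PATTERNS}
    for n, line in enumerate(source.splitlines(), start=1):
        for domain, pattern in DOMAIN_PATTERNS.items():
            if pattern.search(line):
                hits[domain].append((n, line.strip()))
    return hits
```

Fast and transparent, but exactly as brittle as the next section describes: anything not spelled the way the pattern expects is invisible.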

As the audit progressed, however, we found this approach gradually reaching its ceiling. We ran into a classic “streetlight effect”: vulnerabilities could only be found in areas illuminated by predefined keywords. Implementations that relied on dynamic routing decorators, complex type aliases, or unconventional naming were almost entirely invisible to text-based matching.

Context fragmentation proved even more problematic. When an isolated function was extracted and passed to an LLM, its connection to surrounding logic was often severed. The model could see the entry point, but without access to callee definitions it was forced to speculate. This hallucinated reasoning generated large numbers of false positives and made one conclusion unavoidable: text processing alone cannot solve code logic problems; deeper structural understanding is required.

Turning Point: Learning from State-of-the-Art Approaches

To break through this limitation, we turned to mature approaches from both academia and industry.

GitHub’s CodeQL reinforced the idea that code should be treated as a graph rather than a linear text stream. Meanwhile, the OOPSLA 2024 paper LLift empirically validated the intuition that a hybrid approach—lightweight static analysis for recall combined with LLM-based semantic filtering for precision—is both practical and effective. Its core insight, namely using static analysis tools to surface potential issues (despite high false-positive rates) and then leveraging LLMs to prune those false positives, became the theoretical foundation of the V2.0 architecture.

From this analysis, we converged on a neuro-symbolic technical direction: using symbolic AST representations to solve precise localization, and neural LLMs to perform semantic reasoning.

Evolution: Building an Agent That Understands Code

With this theoretical guidance, we refactored the system along three dimensions, transforming a collection of loosely coupled scripts into a cohesive intelligent audit system.

The first change was an upgrade to the perception layer, enabling the system to understand structure rather than text.

We abandoned brittle regular-expression matching and introduced Tree-sitter as the parsing engine. The system no longer searches for specific strings, but instead traverses syntax trees to identify AST substructures with defined properties, such as classes that inherit from BaseReader and contain exception-throwing logic in their methods. To preserve engineering robustness, we implemented a graceful degradation strategy: AST-based matching is used by default, but in rare edge cases where exotic syntax defeats parsing, the system automatically falls back to keyword-based heuristics to maintain baseline coverage.
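The idea can be sketched with Python’s standard ast module standing in for Tree-sitter (the real system parses several languages; BaseReader is the example from the text, and the fallback keywords are illustrative):

```python
import ast

def find_raising_readers(source: str) -> list:
    """Structural match: classes deriving from BaseReader that contain a
    `raise` anywhere in their body. Falls back to keyword heuristics when
    the source does not parse (graceful degradation)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Degraded mode: keep baseline coverage with plain text matching.
        return ["<keyword-hit>"] if "BaseReader" in source and "raise" in source else []
    matches = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            bases = {b.id for b in node.bases if isinstance(b, ast.Name)}
            if "BaseReader" in bases and any(
                isinstance(n, ast.Raise) for n in ast.walk(node)
            ):
                matches.append(node.name)
    return matches
```

The match is on structure (inheritance plus a Raise node), so renamed variables or unusual formatting no longer defeat it, which is precisely what the regex track could not offer.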

The second challenge was eliminating contextual isolation.

Code logic is never self-contained. To provide LLMs with intact semantic environments, we developed a dynamic context construction mechanism inspired by interprocedural analysis. It resolves TypeScript tsconfig path mappings and Python import paths, crosses file boundaries, and stitches together function definitions scattered across the codebase into a coherent execution chain. As a result, we now have complete logical paths rather than fragmented snippets.
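A toy version of the TypeScript side of that resolution step might look like this; the tsconfig content and the alias are invented for illustration:

```python
from pathlib import PurePosixPath

def resolve_ts_import(specifier: str, tsconfig: dict):
    """Map an import specifier through tsconfig `paths`, the way the
    context builder locates a definition before stitching it into the
    execution chain. Returns a repo-relative file path, or None."""
    opts = tsconfig.get("compilerOptions", {})
    base = PurePosixPath(opts.get("baseUrl", "."))
    for pattern, targets in opts.get("paths", {}).items():
        if pattern.endswith("*"):
            prefix = pattern[:-1]
            if specifier.startswith(prefix):
                rest = specifier[len(prefix):]
                # tsc tries targets in order; this sketch takes the first.
                return str(base / (targets[0].rstrip("*") + rest)) + ".ts"
        elif pattern == specifier:
            return str(base / targets[0])
    return None  # bare module specifier: left to node_modules resolution
```

Once an alias resolves to a concrete file, the builder can pull the callee’s definition out of that file and append it to the fragment handed to the model.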

The final and most consequential shift was upgrading the LLM from a passive reader to an active investigator.

In complex protocol audits, predefined context is never sufficient. We therefore adopted the ReAct (Reasoning + Acting) paradigm. By embedding explicit reasoning-action constraints into system prompts, we force the agent into a loop of reasoning, action, and observation. When the agent encounters an unknown constant or unresolved reference, it no longer guesses. Instead, it actively invokes our semantic search tools to query the codebase. This transition from “being fed” to “self-service” dramatically reduces hallucination rates.
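Schematically, the loop reduces to the following skeleton; the llm callable and the search tool are stubs, while the real agent drives a model against our semantic search index:

```python
def react_audit(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Minimal ReAct skeleton: each turn the model must emit either a tool
    call ('ACTION: <tool> <arg>') or a verdict ('FINAL: ...'). Tool output
    is appended to the transcript as an observation before the next turn,
    so the agent reasons over what it actually retrieved instead of guessing."""
    transcript = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        transcript.append(reply)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("ACTION:"):
            name, _, arg = reply[len("ACTION:"):].strip().partition(" ")
            # e.g. name == "search": query the semantic code-search index.
            transcript.append(f"OBSERVATION: {tools[name](arg)}")
    return "inconclusive: step budget exhausted"
```

The step budget matters in practice: it bounds cost and prevents an agent from looping on a reference it cannot resolve.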

To identify vulnerabilities caused by missing logic rather than incorrect logic, we additionally introduced a golden reference comparison mechanism. The officially validated C++ implementation serves as the baseline. When auditing other language implementations, we perform blind comparisons to determine whether critical checks or state transitions are absent. From this, we can reliably conclude whether a vulnerability stems from incorrect code or missing code.
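In spirit, the comparison reduces to a set difference over normalized check descriptors extracted from each implementation; the descriptor names below are invented for illustration:

```python
import re

def normalize(check: str) -> str:
    """Collapse naming-convention differences (camelCase vs snake_case)
    so checks extracted from C++ and Python implementations line up."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", check).lower()

def missing_checks(golden: set, target: set) -> set:
    """Checks present in the C++ baseline but absent from the audited
    implementation: candidates for 'missing code' vulnerabilities."""
    return {normalize(c) for c in golden} - {normalize(c) for c in target}
```

The hard part in the real system is extracting comparable descriptors from two codebases at all; once that is done, the verdict “this check exists in the reference and nowhere in the target” falls out of a set operation.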

Results: From Scanning Code to Understanding Intent

The benefits of this architectural upgrade were immediate and substantial.

In recent audits, the system has surfaced findings beyond the reach of traditional static analysis tools. It flagged semantic traps such as Set.has(undefined) in a TypeScript implementation: syntactically valid yet logically flawed, because a membership test with an undefined key never throws; it quietly returns false and masks the upstream failure. Through reference comparison, it identified a high-severity vulnerability in a Python implementation where certificate-chain signature verification was entirely missing during the CASE handshake phase. This is a concrete example of detecting absent code, which demands a higher-order understanding of protocol semantics.

In the Rust implementation, the system distinguishes conditional-compilation (cfg) contexts, separating intentionally safe test placeholders from silently unsafe build configurations, and thereby extends the audit into build-time security.
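A rough sketch of that classification pass over Rust sources (the attribute patterns and verdict labels are illustrative, not the tool’s actual rules):

```python
import re

# Match a single-line Rust attribute like #[cfg(test)] or
# #[cfg(not(feature = "crypto"))], capturing the condition inside.
CFG_ATTR = re.compile(r"#\[cfg\((.*?)\)\]")

def classify_cfg(rust_source: str) -> list:
    """Tag each #[cfg(...)] attribute: test-only code is treated as a
    benign placeholder, while disabling a security feature in non-test
    builds is flagged for human review."""
    findings = []
    for m in CFG_ATTR.finditer(rust_source):
        cond = m.group(1)
        if cond == "test":
            findings.append((cond, "benign: test-only placeholder"))
        elif "not(feature" in cond and "crypto" in cond:
            findings.append((cond, "review: crypto disabled in some builds"))
        else:
            findings.append((cond, "unclassified"))
    return findings
```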

Conclusion

In retrospect, the V1.0 regex-based approach was sufficient to validate feasibility and surface shallow issues early. The V2.0 AST + ReAct architecture was a necessary evolution to address deeper semantic flaws. From this experience, we can draw a clear lesson: in domain-specific code auditing, LLMs should not be treated as omniscient “brains,” but as semantic reasoning engines embedded within a structured static analysis toolchain.

Only by combining precise symbolic structure with powerful neural semantic understanding can we fully unlock AI’s potential in security auditing. Looking forward, we plan to integrate data-flow analysis with AI-driven fuzzing, extending audit boundaries from static discovery to dynamic validation.


References

  1. CodeQL: GitHub’s semantic code analysis engine. https://codeql.github.com/
  2. LLift (OOPSLA ‘24): Li, H. et al. “Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach”. Proceedings of the ACM on Programming Languages (PACMPL), 2024. https://2024.splashcon.org/details/splash-2024-oopsla/18/Enhancing-Static-Analysis-for-Practical-Bug-Detection-An-LLM-Integrated-Approach
  3. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. https://github.com/GreyDGL/PentestGPT