Measures

Tip

Every program evaluated by Topos is measured along three independent Quality Pillars. These pillars are the generators for the Quality Medals you can earn. Topos never collapses these into a single number — you always see which pillar is the problem.

1. The SIMPLE Pillar (Code Complexity)

Evaluates the internal quality of the code by analyzing the Control Flow Graph (CFG) and Abstract Syntax Tree (AST). The SIMPLE pillar always runs and maps to the SIMPLE badge outcome.

  • Cyclomatic Complexity (cfg.cyclomatic) Measures the number of linearly independent paths through the code. Branches, loops, and conditionals increase complexity. Higher values negatively impact the SIMPLE score.

  • Essential Complexity (cfg.essential) Counts “structured” vs. unstructured control flow. Complex nested conditions reduce this metric.

  • Nesting Depth (cfg.nesting_depth) Maximum nesting level of control structures. Deeper nesting is harder to reason about.

  • Longest Path (cfg.longest_path) Longest acyclic execution path through the CFG. Long paths correlate with high cognitive load.

  • Entropy (ast.entropy) A Kolmogorov-complexity proxy using compression ratios. It measures how predictable the code is. Very low entropy suggests excessive boilerplate; very high entropy signals chaotic or highly unusual structure (often seen in hallucinated code). The healthy range sits around 0.5.

2. The COMPOSABLE Pillar (Module Coupling)

Evaluates how a file fits into the broader repository by analyzing the module dependency graph. (Requires GitNexus) The COMPOSABLE pillar maps to the COMPOSABLE badge outcome.

  • Coupling (mdg.coupling) The total number of afferent (incoming) and efferent (outgoing) dependencies. High total coupling negatively impacts the COMPOSABLE score.

  • Instability (mdg.instability) Calculated as Efferent / (Afferent + Efferent).

    • Near 0: The module is a rigid dependency for many others and is hard to change safely.

    • Near 1: The module is highly unstable because it depends on many other parts of the system.

    • A balanced range (0.3 – 0.7) helps achieve a higher COMPOSABLE score.

  • Fan-in / Fan-out (mdg.fan_in, mdg.fan_out) Diagnostic metrics tracking explicit call edges. These are visible in detailed inspections but don’t strictly set the final verdict.

  • Dependency Depth (mdg.dep_depth) The longest dependency chain from this module. Shallow chains are easier to understand and refactor.

3. The SECURE Pillar (Vulnerability Analysis)

Evaluates whether the code flow can reach dangerous operations or untrusted data. Computed from the Code Property Graph (CPG) — derived intrinsically from the UAST, no external tooling required. The SECURE pillar maps to the SECURE badge outcome.

  • Dangerous Calls (cpg.dangerous_calls) Count of reachable call sites matching a per-language registry of dangerous APIs (Python: eval, exec, pickle.loads, …; C++: gets, strcpy, …). Lower counts improve the SECURE score.

  • Taint Flows (cpg.taint_flows) Source→sink data-flow paths along the CPG’s data-dependence edges, from untrusted sources (e.g. input, request.args) to dangerous sinks. Longer taint chains increase risk.

Scoring and Manager Priorities

Topos produces a continuous normalized score [0.0, 1.0] for each pillar. A pillar is achieved if its score meets or exceeds its calibrated threshold. These thresholds are tuned against real-world corpora (Experiment 4) to ensure the “Quality Medals” reflect empirical software engineering standards.

Pillar

Threshold

Raw Requirement (Policy Φᵢ)

SIMPLE

0.40

cyclomatic <= 15 AND max_func <= 10 AND entropy in [0.2, 0.8]

COMPOSABLE

0.60

instability in [0.3, 0.7] AND fan_in <= 15 AND fan_out <= 15

SECURE

1.00

Zero dangerous_calls AND zero taint_flows

Scores are reported as percentages (0–100%) in all CLI and MCP output. Note that while the thresholds are used for score-floor aggregation, the authoritative achievement of a pillar is determined by the independent AND of the raw metric requirements defined in each generator’s policy.

The weights (w_*) for each pillar’s internal components are controlled by the Priority (part of the Preference Ranking):

Priority

simple

composable

secure

Effect

simple

0.7

0.15

0.15

Upweights SIMPLE; rewards low-complexity code

composable

0.15

0.7

0.15

Upweights COMPOSABLE; rewards tightly-bounded modules

secure

0.15

0.15

0.7

Upweights SECURE; rewards low-risk data flows

Changing the priority does not change what is measured — it changes the weights within each generator’s scoring function.

Verdicts

The per-pillar scores map to an 8-valued Heyting algebra (free lattice on 3 generators), representing the Quality Medals:

  • SLOP (❌): No pillars achieved (all scores below threshold) or syntax error. No medal awarded.

  • SIMPLE: Only SIMPLE achieved (🥉 BRONZE).

  • COMPOSABLE: Only COMPOSABLE achieved (🥉 BRONZE; requires GitNexus; unreachable from SIMPLE alone).

  • SECURE: Only SECURE achieved (🥉 BRONZE).

  • SIMPLE_COMPOSABLE: Both SIMPLE and COMPOSABLE achieved (🥈 SILVER).

  • SIMPLE_SECURE: Both SIMPLE and SECURE achieved (🥈 SILVER).

  • COMPOSABLE_SECURE: Both COMPOSABLE and SECURE achieved (🥈 SILVER).

  • IDEAL (🥇): All three pillars achieved. Perfectly simple, composable, and secure. GOLD medal awarded.

The three pillars SIMPLE, COMPOSABLE, and SECURE are pairwise incomparable — a file can achieve any subset of them independently. The overall lattice_element in the response is determined by which combination of pillars scored ≥ their calibrated thresholds:

SIMPLE = 1, COMPOSABLE = 1, SECURE = 1  → IDEAL
SIMPLE = 1, COMPOSABLE = 1, SECURE = 0  → SIMPLE_COMPOSABLE
SIMPLE = 1, COMPOSABLE = 0, SECURE = 1  → SIMPLE_SECURE
SIMPLE = 0, COMPOSABLE = 1, SECURE = 1  → COMPOSABLE_SECURE
SIMPLE = 1, COMPOSABLE = 0, SECURE = 0  → SIMPLE
SIMPLE = 0, COMPOSABLE = 1, SECURE = 0  → COMPOSABLE
SIMPLE = 0, COMPOSABLE = 0, SECURE = 1  → SECURE
SIMPLE = 0, COMPOSABLE = 0, SECURE = 0  → SLOP

Comparing Programs (Profunctors)

While the three quality pillars define a program’s absolute placement on the evaluation lattice (the characteristic morphism), Topos also provides relational tools to measure the “distance” or “overlap” between two programs. In our category-theoretic model, these are Profunctors.

Note

Important: Profunctors are comparative metrics. They are highly useful for agent workflows (e.g., “did this refactor actually change the structure?”) but they do not influence the Quality Badges or the evaluation lattice.

Topos supports several relational metrics across its different graph representations:

  • CFG Comparison: Measures changes in cyclomatic complexity and edge distribution. (e.g., detecting if an agent added a new conditional branch).

  • CPG Comparison: Measures changes in dangerous API usage and taint flows, as well as general node-type overlap (Jaccard similarity).

  • MDG Comparison: Measures changes in coupling, fan-in/fan-out, and dependency depth.

  • PDG Comparison: Computes the Jaccard similarity of control and data dependencies between two versions of a function.

  • AST Edit Distance: Measures the topological drift between two programs using UAST edit distance.

Structural Test Coverage

Topos uses Declaration-level Bipartite Coverage to estimate how much of a program-under-test (PUT) appears in a test suite at the level of normalized UAST structure.

Unlike line or branch coverage, this method does not require code execution. It answers: does the test code contain similar structural shapes (kinds, control-flow nodes, kind paths) as the declarations in the PUT?

The CLI command is:

topos structural-test-coverage --tests tests/test_mod.py src/mod.py

How it works

  1. Extraction: Every FunctionDecl and MethodDecl is extracted from both the PUT and the test suite.

  2. Fingerprinting: Each declaration is fingerprinted by the multiset of UAST kinds (excluding the root declaration kind itself) in its body.

  3. Bipartite Matching: Each PUT declaration is matched against the best-matching declaration in the test suite using multiset recall.

  4. Scoring: - Mean Declaration Coverage: The average best-match recall across all

    PUT declarations.

    • F2 Score: A harmonic mean that combines declaration recall with test precision, biased heavily toward recall (F2). This penalizes bloated test suites that contain large amounts of code unrelated to the PUT.

    • Uncovered Declarations: The tool identifies specific locations in the source code that lack corresponding structural representation in the tests.

Interpretation

  • Higher mean coverage indicates more of the PUT’s structural declarations have matches in the test suite.

  • An F2 score significantly lower than mean coverage indicates a bloated test suite.

  • A low score suggests tests may be missing classes of syntax present in the PUT.