.. _measures: ======== Measures ======== .. tip:: Every program evaluated by Topos is measured along three independent **Quality Pillars**. These pillars are the generators for the **Quality Medals** you can earn. Topos never collapses these into a single number — you always see which pillar is the problem. 1. The SIMPLE Pillar (Code Complexity) ------------------------------------------ Evaluates the internal quality of the code by analyzing the Control Flow Graph (CFG) and Abstract Syntax Tree (AST). The SIMPLE pillar always runs and maps to the ``SIMPLE`` badge outcome. * **Cyclomatic Complexity** (``cfg.cyclomatic``) Measures the number of linearly independent paths through the code. Branches, loops, and conditionals increase complexity. Higher values negatively impact the SIMPLE score. * **Essential Complexity** (``cfg.essential``) Counts "structured" vs. unstructured control flow. Complex nested conditions reduce this metric. * **Nesting Depth** (``cfg.nesting_depth``) Maximum nesting level of control structures. Deeper nesting is harder to reason about. * **Longest Path** (``cfg.longest_path``) Longest acyclic execution path through the CFG. Long paths correlate with high cognitive load. * **Entropy** (``ast.entropy``) A Kolmogorov-complexity proxy using compression ratios. It measures how predictable the code is. Very low entropy suggests excessive boilerplate; very high entropy signals chaotic or highly unusual structure (often seen in hallucinated code). The healthy range sits around 0.5. 2. The COMPOSABLE Pillar (Module Coupling) ---------------------------------------------- Evaluates how a file fits into the broader repository by analyzing the module dependency graph. *(Requires GitNexus)* The COMPOSABLE pillar maps to the ``COMPOSABLE`` badge outcome. * **Coupling** (``mdg.coupling``) The total number of afferent (incoming) and efferent (outgoing) dependencies. High total coupling negatively impacts the COMPOSABLE score. * **Instability** (``mdg.instability``) Calculated as ``Efferent / (Afferent + Efferent)``. - Near 0: The module is a rigid dependency for many others and is hard to change safely. - Near 1: The module is highly unstable because it depends on many other parts of the system. - A balanced range (0.3 – 0.7) helps achieve a higher COMPOSABLE score. * **Fan-in / Fan-out** (``mdg.fan_in``, ``mdg.fan_out``) Diagnostic metrics tracking explicit call edges. These are visible in detailed inspections but don't strictly set the final verdict. * **Dependency Depth** (``mdg.dep_depth``) The longest dependency chain from this module. Shallow chains are easier to understand and refactor. 3. The SECURE Pillar (Vulnerability Analysis) ------------------------------------------------- Evaluates whether the code flow can reach dangerous operations or untrusted data. Computed from the Code Property Graph (CPG) — derived intrinsically from the UAST, no external tooling required. The SECURE pillar maps to the ``SECURE`` badge outcome. * **Dangerous Calls** (``cpg.dangerous_calls``) Count of reachable call sites matching a per-language registry of dangerous APIs (Python: ``eval``, ``exec``, ``pickle.loads``, …; C++: ``gets``, ``strcpy``, …). Lower counts improve the SECURE score. * **Taint Flows** (``cpg.taint_flows``) Source→sink data-flow paths along the CPG's data-dependence edges, from untrusted sources (e.g. ``input``, ``request.args``) to dangerous sinks. Longer taint chains increase risk. Scoring and Manager Priorities ------------------------------ Topos produces a continuous normalized score ``[0.0, 1.0]`` for each pillar. A pillar is **achieved** if its score meets or exceeds its **calibrated threshold**. These thresholds are tuned against real-world corpora (Experiment 4) to ensure the "Quality Medals" reflect empirical software engineering standards. .. list-table:: :widths: 20 20 60 :header-rows: 1 * - Pillar - Threshold - Raw Requirement (Policy Φᵢ) * - **SIMPLE** - ``0.40`` - ``cyclomatic <= 15`` AND ``max_func <= 10`` AND ``entropy in [0.2, 0.8]`` * - **COMPOSABLE** - ``0.60`` - ``instability in [0.3, 0.7]`` AND ``fan_in <= 15`` AND ``fan_out <= 15`` * - **SECURE** - ``1.00`` - Zero ``dangerous_calls`` AND zero ``taint_flows`` Scores are reported as percentages (0–100%) in all CLI and MCP output. Note that while the thresholds are used for score-floor aggregation, the authoritative achievement of a pillar is determined by the independent AND of the raw metric requirements defined in each generator's policy. The weights (``w_*``) for each pillar's internal components are controlled by the **Priority** (part of the **Preference Ranking**): .. list-table:: :widths: 15 15 15 15 40 :header-rows: 1 * - Priority - ``simple`` - ``composable`` - ``secure`` - Effect * - ``simple`` - 0.7 - 0.15 - 0.15 - Upweights SIMPLE; rewards low-complexity code * - ``composable`` - 0.15 - 0.7 - 0.15 - Upweights COMPOSABLE; rewards tightly-bounded modules * - ``secure`` - 0.15 - 0.15 - 0.7 - Upweights SECURE; rewards low-risk data flows Changing the priority does not change what is measured — it changes the weights within each generator's scoring function. Verdicts -------- The per-pillar scores map to an 8-valued Heyting algebra (free lattice on 3 generators), representing the **Quality Medals**: * ``SLOP`` (❌): No pillars achieved (all scores below threshold) or syntax error. No medal awarded. * ``SIMPLE``: Only SIMPLE achieved (🥉 BRONZE). * ``COMPOSABLE``: Only COMPOSABLE achieved (🥉 BRONZE; requires GitNexus; unreachable from SIMPLE alone). * ``SECURE``: Only SECURE achieved (🥉 BRONZE). * ``SIMPLE_COMPOSABLE``: Both SIMPLE and COMPOSABLE achieved (🥈 SILVER). * ``SIMPLE_SECURE``: Both SIMPLE and SECURE achieved (🥈 SILVER). * ``COMPOSABLE_SECURE``: Both COMPOSABLE and SECURE achieved (🥈 SILVER). * ``IDEAL`` (🥇): All three pillars achieved. Perfectly simple, composable, and secure. GOLD medal awarded. The three pillars ``SIMPLE``, ``COMPOSABLE``, and ``SECURE`` are **pairwise incomparable** — a file can achieve any subset of them independently. The overall ``lattice_element`` in the response is determined by which combination of pillars scored ≥ their calibrated thresholds: .. code-block:: text SIMPLE = 1, COMPOSABLE = 1, SECURE = 1 → IDEAL SIMPLE = 1, COMPOSABLE = 1, SECURE = 0 → SIMPLE_COMPOSABLE SIMPLE = 1, COMPOSABLE = 0, SECURE = 1 → SIMPLE_SECURE SIMPLE = 0, COMPOSABLE = 1, SECURE = 1 → COMPOSABLE_SECURE SIMPLE = 1, COMPOSABLE = 0, SECURE = 0 → SIMPLE SIMPLE = 0, COMPOSABLE = 1, SECURE = 0 → COMPOSABLE SIMPLE = 0, COMPOSABLE = 0, SECURE = 1 → SECURE SIMPLE = 0, COMPOSABLE = 0, SECURE = 0 → SLOP Comparing Programs (Profunctors) -------------------------------- While the three quality pillars define a program's absolute placement on the evaluation lattice (the characteristic morphism), Topos also provides relational tools to measure the "distance" or "overlap" between two programs. In our category-theoretic model, these are **Profunctors**. .. note:: **Important:** Profunctors are comparative metrics. They are highly useful for agent workflows (e.g., "did this refactor actually change the structure?") but they **do not** influence the Quality Badges or the evaluation lattice. Topos supports several relational metrics across its different graph representations: * **CFG Comparison:** Measures changes in cyclomatic complexity and edge distribution. (e.g., detecting if an agent added a new conditional branch). * **CPG Comparison:** Measures changes in dangerous API usage and taint flows, as well as general node-type overlap (Jaccard similarity). * **MDG Comparison:** Measures changes in coupling, fan-in/fan-out, and dependency depth. * **PDG Comparison:** Computes the Jaccard similarity of control and data dependencies between two versions of a function. * **AST Edit Distance:** Measures the topological drift between two programs using UAST edit distance. Structural Test Coverage ~~~~~~~~~~~~~~~~~~~~~~~~ Topos uses **Declaration-level Bipartite Coverage** to estimate how much of a **program-under-test (PUT)** appears in a **test suite** at the level of normalized UAST structure. Unlike line or branch coverage, this method does not require code execution. It answers: *does the test code contain similar structural shapes (kinds, control-flow nodes, kind paths) as the declarations in the PUT?* The CLI command is: .. code-block:: bash topos structural-test-coverage --tests tests/test_mod.py src/mod.py **How it works** 1. **Extraction:** Every ``FunctionDecl`` and ``MethodDecl`` is extracted from both the PUT and the test suite. 2. **Fingerprinting:** Each declaration is fingerprinted by the multiset of UAST kinds (excluding the root declaration kind itself) in its body. 3. **Bipartite Matching:** Each PUT declaration is matched against the best-matching declaration in the test suite using multiset recall. 4. **Scoring:** - **Mean Declaration Coverage:** The average best-match recall across all PUT declarations. - **F2 Score:** A harmonic mean that combines declaration recall with **test precision**, biased heavily toward recall (F2). This penalizes bloated test suites that contain large amounts of code unrelated to the PUT. - **Uncovered Declarations:** The tool identifies specific locations in the source code that lack corresponding structural representation in the tests. **Interpretation** - Higher mean coverage indicates more of the PUT’s structural declarations have matches in the test suite. - An F2 score significantly lower than mean coverage indicates a bloated test suite. - A **low** score suggests tests may be missing classes of syntax present in the PUT.