topos.functors.profunctors.ast.compare¶
Distance Module¶
Provides two structural distance metrics between program ASTs.
1. Tree Edit Distance (TED) — ``calculate_ast_distance``
- Mathematical Inspiration:
In topology, distance measures how ‘far apart’ two points are in a space. For programs, we define distance over their AST structures—two programs are ‘close’ if their syntax trees are similar.
This metric is useful for: 1. Detecting code clones (near-zero distance) 2. Measuring refactoring impact (how much structure changed) 3. Comparing LLM outputs to reference implementations
We implement a Tree Edit Distance (TED) algorithm that counts the minimum number of node insertions, deletions, and relabelings needed to transform one tree into another. The implementation uses the Wagner-Fischer algorithm on node-type sequences (DFS order), which is an approximation of true structural TED optimized for speed.
2. Gromov-Wasserstein Distance (GHW) — ``calculate_ghw_distance``
- Mathematical Inspiration:
The Gromov-Hausdorff distance measures how far two metric spaces are from being isometric — that is, how much distortion is needed to embed one space into the other. The Wasserstein (optimal transport) variant equips each space with a probability measure and finds the coupling that minimizes expected pairwise distance distortion.
For ASTs this captures structural topology that TED misses: two programs with identical node-type sequences but different tree shapes will have low TED but high GHW distance.
- Algorithm (Frank-Wolfe with Sinkhorn projection):
Model each AST as a metric measure space (X, d_X, μ_X): - X = set of AST nodes - d_X(u, v) = number of edges on the unique tree path u → v - μ_X = uniform probability measure over X
- Find the coupling T ∈ Π(μ_X, μ_Y) minimizing the GW cost:
GW = Σ_{i,j,k,l} (d_X(i,k) − d_Y(j,l))² T[i,j] T[k,l]
Iterate: compute gradient M, Sinkhorn-project to update T.
Normalize by the sum of within-tree second moments to yield [0, 1].
- Potential speedups (not yet implemented):
Linearized GW (node eccentricity signatures): represent each node by its sorted distance vector to all others, build cost matrix C[i,j] = ‖sig(u) − sig(v)‖, run a single Sinkhorn pass. ~10× faster; no outer loop needed.
1D Wasserstein on distance distributions: compute empirical distributions of all pairwise tree-path distances, then take the 1D Wasserstein (sort + mean absolute difference). O(n² log n), no coupling matrix; good for large-scale clone screening.
Landmark subsampling: instead of DFS-prefix subsampling, select semantically important nodes (function/class definitions) as landmarks, reducing n without losing structural signal.
Depth-based LCA approximation: replace full BFS distance matrices with d(u,v) ≈ depth(u) + depth(v) − 2·depth(LCA(u,v)) via a sparse LCA structure, eliminating the O(n²) BFS pass.
- class topos.functors.profunctors.ast.compare.DistanceResult(raw_distance, normalized_distance, operations)[source]
Bases:
objectThe result of computing AST edit distance.
- raw_distance
The absolute edit distance (number of operations).
- Type:
- normalized_distance
Distance normalized by tree sizes (0-1).
- Type:
- raw_distance
- normalized_distance
- operations
- topos.functors.profunctors.ast.compare.calculate_ast_distance(source, target)[source]
Compute the tree edit distance between two program ASTs.
Uses a simplified tree edit distance algorithm based on node type comparison and structural alignment.
- Parameters:
source – The source ProgramObject.
target – The target ProgramObject.
- Returns:
A DistanceResult containing raw and normalized distances.
Note
This is an approximation of true tree edit distance, optimized for speed over exactness. For small trees, it’s quite accurate; for large trees, it provides a reasonable upper bound.
- topos.functors.profunctors.ast.compare.calculate_similarity(source, target)[source]
Compute structural similarity between two programs.
Similarity = 1 - normalized_distance
- Parameters:
source – The source ProgramObject.
target – The target ProgramObject.
- Returns:
A similarity score in [0, 1], where 1 means identical structure.
- topos.functors.profunctors.ast.compare.structural_distance(source, target)[source]
Normalized AST edit distance between two program morphisms.
Convenience wrapper around
calculate_ast_distance(): extracts the AST from each morphism and returns the normalized result in[0, 1]. Returns1.0if either morphism is unparseable.
- class topos.functors.profunctors.ast.compare.GHWDistanceResult(gw_distance, raw_gw_cost, n_nodes_source, n_nodes_target, n_iterations, converged)[source]
Bases:
objectThe result of computing Gromov-Wasserstein distance between two ASTs.
- gw_distance
Normalized GW cost in [0, 1]. Zero means the trees are isometric under the uniform measure; one means no structural correspondence was found.
- Type:
- raw_gw_cost
Unnormalized GW cost, useful for comparisons at a fixed scale (e.g. when both trees have the same number of nodes).
- Type:
- n_nodes_source
Number of nodes used from the source tree (after any subsampling).
- Type:
- n_nodes_target
Number of nodes used from the target tree.
- Type:
- n_iterations
Number of outer GW iterations executed.
- Type:
- converged
True if the coupling change fell below
tolbeforen_iterwas exhausted.- Type:
- gw_distance
- raw_gw_cost
- n_nodes_source
- n_nodes_target
- n_iterations
- converged
- topos.functors.profunctors.ast.compare.calculate_ghw_distance(source, target, max_nodes=200, n_iter=50, reg=0.5, tol=1e-06, return_coupling=False)[source]
Compute the Gromov-Wasserstein distance between two program ASTs.
Models each AST as a metric measure space: nodes as points, tree-path length (edge hops) as the metric, and a uniform probability measure over nodes. Finds the optimal coupling T ∈ Π(μ, ν) that minimizes the GW cost:
GW = Σ_{i,j,k,l} (d_X(i,k) − d_Y(j,l))² T[i,j] T[k,l]
via a Frank-Wolfe loop with entropic (Sinkhorn) projection.
When to use this over
calculate_ast_distance: The TED implementation compares sequences of node types and ignores tree topology. Two programs with identical node-type multisets but different nesting structure will score near-zero TED but high GHW distance. Use GHW when structural shape — depth, branching, subtree distribution — matters.- Normalization:
gw_distance = raw_gw_cost / (μᵀD1²μ + νᵀD2²ν)
The denominator is the maximum possible GW cost (when the cross-term vanishes), so
gw_distancelies in [0, 1] with 0 meaning the trees are metrically isometric under the uniform measure.- Parameters:
source – The source ProgramObject.
target – The target ProgramObject.
max_nodes – Subsampling cap. If a tree has more than
max_nodesnodes, the firstmax_nodesin DFS order are used. Structural (non-leaf) nodes appear early in DFS, so this retains the most topologically informative portion of the tree.n_iter – Maximum outer GW iterations (Frank-Wolfe steps).
reg – Sinkhorn entropy regularization. Smaller values yield sharper couplings but may require more Sinkhorn iterations to converge.
tol – Convergence tolerance on ‖T_new − T‖_F. Iteration stops early when the coupling update is below this threshold.
return_coupling – If True, the optimal coupling matrix T (shape n×m, dtype float64) is attached as
result.coupling. Off by default; only needed for inspection or visualization.
- Returns:
GHWDistanceResult with
gw_distancein [0, 1].
- Performance note:
This is the full GW formulation: O(n²) for distance matrices via BFS, and O(n²m + nm²) per Frank-Wolfe iteration. For large codebases or latency-sensitive pipelines, consider these faster approximations (not yet implemented):
Linearized GW (node eccentricity signatures): represent each node by its sorted distance vector to all others and build cost matrix C[i,j] = ‖sig(u) − sig(v)‖. A single Sinkhorn pass then gives an approximate coupling without an outer loop (~10× faster).
1D Wasserstein on distance distributions: collect all pairwise tree-path distances as an empirical distribution per tree and compute their 1D Wasserstein distance (sort + mean absolute difference). O(n² log n), no coupling matrix; suitable for large-scale clone screening.
Landmark subsampling: replace DFS-prefix subsampling with semantically motivated landmarks (e.g. function / class definition nodes), reducing n while preserving structural signal.
Depth-based LCA approximation: avoid the O(n²) full BFS pass by using d(u,v) ≈ depth(u) + depth(v) − 2·depth(LCA(u,v)) via a sparse LCA structure.