topos.functors.profunctors.ast.compare¶

Distance Module¶

Provides two structural distance metrics between program ASTs.

1. Tree Edit Distance (TED) — ``calculate_ast_distance``

Mathematical Inspiration:
In topology, distance measures how ‘far apart’ two points are in a space. For programs, we define distance over their AST structures—two programs are ‘close’ if their syntax trees are similar.

This metric is useful for: 1. Detecting code clones (near-zero distance) 2. Measuring refactoring impact (how much structure changed) 3. Comparing LLM outputs to reference implementations

We implement a Tree Edit Distance (TED) algorithm that counts the minimum number of node insertions, deletions, and relabelings needed to transform one tree into another. The implementation uses the Wagner-Fischer algorithm on node-type sequences (DFS order), which is an approximation of true structural TED optimized for speed.

2. Gromov-Wasserstein Distance (GHW) — ``calculate_ghw_distance``

Mathematical Inspiration:
The Gromov-Hausdorff distance measures how far two metric spaces are from being isometric — that is, how much distortion is needed to embed one space into the other. The Wasserstein (optimal transport) variant equips each space with a probability measure and finds the coupling that minimizes expected pairwise distance distortion.

For ASTs this captures structural topology that TED misses: two programs with identical node-type sequences but different tree shapes will have low TED but high GHW distance.

Algorithm (Frank-Wolfe with Sinkhorn projection):

Model each AST as a metric measure space (X, d_X, μ_X): - X = set of AST nodes - d_X(u, v) = number of edges on the unique tree path u → v - μ_X = uniform probability measure over X

Find the coupling T ∈ Π(μ_X, μ_Y) minimizing the GW cost:
GW = Σ_{i,j,k,l} (d_X(i,k) − d_Y(j,l))² T[i,j] T[k,l]

Iterate: compute gradient M, Sinkhorn-project to update T.

Normalize by the sum of within-tree second moments to yield [0, 1].

Potential speedups (not yet implemented):

Linearized GW (node eccentricity signatures): represent each node by its sorted distance vector to all others, build cost matrix C[i,j] = ‖sig(u) − sig(v)‖, run a single Sinkhorn pass. ~10× faster; no outer loop needed.

1D Wasserstein on distance distributions: compute empirical distributions of all pairwise tree-path distances, then take the 1D Wasserstein (sort + mean absolute difference). O(n² log n), no coupling matrix; good for large-scale clone screening.

Landmark subsampling: instead of DFS-prefix subsampling, select semantically important nodes (function/class definitions) as landmarks, reducing n without losing structural signal.

Depth-based LCA approximation: replace full BFS distance matrices with d(u,v) ≈ depth(u) + depth(v) − 2·depth(LCA(u,v)) via a sparse LCA structure, eliminating the O(n²) BFS pass.

class topos.functors.profunctors.ast.compare.DistanceResult(raw_distance, normalized_distance, operations)[source]

Bases: object

The result of computing AST edit distance.

raw_distance

The absolute edit distance (number of operations).

Type:: int

normalized_distance

Distance normalized by tree sizes (0-1).

Type:: float

operations

Breakdown of edit operations.

Type:: dict[str, int]

raw_distance

normalized_distance

operations

topos.functors.profunctors.ast.compare.calculate_ast_distance(source, target)[source]

Compute the tree edit distance between two program ASTs.

Uses a simplified tree edit distance algorithm based on node type comparison and structural alignment.

Parameters:

source – The source ProgramObject.
target – The target ProgramObject.

Returns:

A DistanceResult containing raw and normalized distances.

Note

This is an approximation of true tree edit distance, optimized for speed over exactness. For small trees, it’s quite accurate; for large trees, it provides a reasonable upper bound.

topos.functors.profunctors.ast.compare.calculate_similarity(source, target)[source]

Compute structural similarity between two programs.

Similarity = 1 - normalized_distance

Parameters:

source – The source ProgramObject.
target – The target ProgramObject.

Returns:

A similarity score in [0, 1], where 1 means identical structure.

topos.functors.profunctors.ast.compare.structural_distance(source, target)[source]

Normalized AST edit distance between two program morphisms.

Convenience wrapper around calculate_ast_distance(): extracts the AST from each morphism and returns the normalized result in [0, 1]. Returns 1.0 if either morphism is unparseable.

class topos.functors.profunctors.ast.compare.GHWDistanceResult(gw_distance, raw_gw_cost, n_nodes_source, n_nodes_target, n_iterations, converged)[source]

Bases: object

The result of computing Gromov-Wasserstein distance between two ASTs.

gw_distance

Normalized GW cost in [0, 1]. Zero means the trees are isometric under the uniform measure; one means no structural correspondence was found.

Type:: float

raw_gw_cost

Unnormalized GW cost, useful for comparisons at a fixed scale (e.g. when both trees have the same number of nodes).

Type:: float

n_nodes_source

Number of nodes used from the source tree (after any subsampling).

Type:: int

n_nodes_target

Number of nodes used from the target tree.

Type:: int

n_iterations

Number of outer GW iterations executed.

Type:: int

converged

True if the coupling change fell below tol before n_iter was exhausted.

Type:: bool

gw_distance

raw_gw_cost

n_nodes_source

n_nodes_target

n_iterations

converged

topos.functors.profunctors.ast.compare.calculate_ghw_distance(source, target, max_nodes=200, n_iter=50, reg=0.5, tol=1e-06, return_coupling=False)[source]

Compute the Gromov-Wasserstein distance between two program ASTs.

Models each AST as a metric measure space: nodes as points, tree-path length (edge hops) as the metric, and a uniform probability measure over nodes. Finds the optimal coupling T ∈ Π(μ, ν) that minimizes the GW cost:

GW = Σ_{i,j,k,l} (d_X(i,k) − d_Y(j,l))² T[i,j] T[k,l]

via a Frank-Wolfe loop with entropic (Sinkhorn) projection.

When to use this over calculate_ast_distance: The TED implementation compares sequences of node types and ignores tree topology. Two programs with identical node-type multisets but different nesting structure will score near-zero TED but high GHW distance. Use GHW when structural shape — depth, branching, subtree distribution — matters.

Normalization:: gw_distance = raw_gw_cost / (μᵀD1²μ + νᵀD2²ν)

The denominator is the maximum possible GW cost (when the cross-term vanishes), so gw_distance lies in [0, 1] with 0 meaning the trees are metrically isometric under the uniform measure.

Parameters:

source – The source ProgramObject.
target – The target ProgramObject.
max_nodes – Subsampling cap. If a tree has more than max_nodes nodes, the first max_nodes in DFS order are used. Structural (non-leaf) nodes appear early in DFS, so this retains the most topologically informative portion of the tree.
n_iter – Maximum outer GW iterations (Frank-Wolfe steps).
reg – Sinkhorn entropy regularization. Smaller values yield sharper couplings but may require more Sinkhorn iterations to converge.
tol – Convergence tolerance on ‖T_new − T‖_F. Iteration stops early when the coupling update is below this threshold.
return_coupling – If True, the optimal coupling matrix T (shape n×m, dtype float64) is attached as result.coupling. Off by default; only needed for inspection or visualization.

Returns:

GHWDistanceResult with gw_distance in [0, 1].

Performance note:

This is the full GW formulation: O(n²) for distance matrices via BFS, and O(n²m + nm²) per Frank-Wolfe iteration. For large codebases or latency-sensitive pipelines, consider these faster approximations (not yet implemented):

Linearized GW (node eccentricity signatures): represent each node by its sorted distance vector to all others and build cost matrix C[i,j] = ‖sig(u) − sig(v)‖. A single Sinkhorn pass then gives an approximate coupling without an outer loop (~10× faster).
1D Wasserstein on distance distributions: collect all pairwise tree-path distances as an empirical distribution per tree and compute their 1D Wasserstein distance (sort + mean absolute difference). O(n² log n), no coupling matrix; suitable for large-scale clone screening.
Landmark subsampling: replace DFS-prefix subsampling with semantically motivated landmarks (e.g. function / class definition nodes), reducing n while preserving structural signal.
Depth-based LCA approximation: avoid the O(n²) full BFS pass by using d(u,v) ≈ depth(u) + depth(v) − 2·depth(LCA(u,v)) via a sparse LCA structure.