Overview

Phil provides representation-guided imputation: instead of choosing a single imputation strategy, Phil explores many possibilities and uses topological descriptors to select the most representative result.

The Problem

Imputation is a critical step in data preprocessing, but the choice of strategy significantly impacts downstream analysis:

  • Mean imputation shrinks variance and weakens correlations between features

  • KNN imputation depends on distance metrics and neighborhood size

  • Iterative imputation depends on model choice and convergence criteria

How do you know which imputation is “correct”? Phil answers this by treating imputation as an ensemble selection problem.

Phil’s Approach

graph TB
   subgraph "1. Generate Candidates"
      A[DataFrame with missing values]
      B1["Mean imputation"]
      B2["KNN imputation"]
      B3["Iterative imputation"]
      B4["Custom strategy"]
   end

   subgraph "2. Compute Descriptors"
      C["ECT computation"]
      D1["Descriptor 1"]
      D2["Descriptor 2"]
      D3["Descriptor 3"]
      D4["Descriptor 4"]
   end

   subgraph "3. Select Representative"
      E["Distance matrix"]
      F["Centroid selection"]
      G["Representative dataset"]
   end

   A --> B1
   A --> B2
   A --> B3
   A --> B4

   B1 --> C
   B2 --> C
   B3 --> C
   B4 --> C

   C --> D1
   C --> D2
   C --> D3
   C --> D4

   D1 --> E
   D2 --> E
   D3 --> E
   D4 --> E

   E --> F
   F --> G

   style A fill:#f9f9f9,stroke:#999
   style C fill:#D9EDF7,stroke:#31708F,stroke-width:2px
   style F fill:#D9EDF7,stroke:#31708F,stroke-width:2px
   style G fill:#DFF0D8,stroke:#3C763D,stroke-width:2px
    

Key Concepts

Candidate Generation

Phil uses scikit-learn’s imputation methods to generate multiple completed datasets from the same input. Each candidate is a different plausible completion of the same data.
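To make the idea concrete, here is a minimal sketch of candidate generation written directly against scikit-learn. It is not Phil's internal code: the helper name `generate_candidates`, the particular set of imputers, and their parameters are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# IterativeImputer is experimental in scikit-learn and needs this enabling import first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def generate_candidates(df, random_state=0):
    """Return {strategy name: completed DataFrame}. Hypothetical helper, not Phil's API."""
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "knn_3": KNNImputer(n_neighbors=3),
        "iterative": IterativeImputer(max_iter=10, random_state=random_state),
    }
    candidates = {}
    for name, imputer in imputers.items():
        filled = imputer.fit_transform(df)
        # Restore column names and index, which fit_transform drops by default.
        candidates[name] = pd.DataFrame(filled, columns=df.columns, index=df.index)
    return candidates


# Synthetic data with roughly 15% of the entries removed at random.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])
df = df.mask(rng.random(df.shape) < 0.15)

candidates = generate_candidates(df)
```

Every candidate agrees with the input on the observed entries and differs only in the cells that were missing, which is what makes the candidates comparable.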

ECT Descriptors

The Euler Characteristic Transform (ECT) captures the topological structure of each candidate. This provides a principled way to compare imputations by their geometric properties rather than by the filled values alone.
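The following is a simplified sketch of an ECT-style descriptor, assuming each completed dataset is treated as a bare point cloud (vertices only, no edges or higher simplices). Under that assumption the Euler characteristic of a sublevel set is just the number of points below a height threshold in a given direction. Phil's actual ECT computation may build a richer complex and sample directions differently; the function name `point_cloud_ect` and its parameters are hypothetical.

```python
import numpy as np


def point_cloud_ect(X, n_directions=16, n_thresholds=32, seed=0):
    """ECT-style descriptor of a point cloud; returns shape (n_directions, n_thresholds)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Sample random unit directions (same seed gives the same directions for every candidate).
    directions = rng.normal(size=(n_directions, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Center the cloud and scale it into the unit ball so thresholds can span [-1, 1].
    centered = X - X.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    if radius > 0:
        centered = centered / radius

    thresholds = np.linspace(-1.0, 1.0, n_thresholds)
    heights = centered @ directions.T  # shape (n_points, n_directions)

    # With vertices only, the Euler characteristic of the sublevel set at height t
    # is the count of points whose height along the direction is <= t.
    return (heights[:, :, None] <= thresholds[None, None, :]).sum(axis=0)


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
ect = point_cloud_ect(X)
ect_shuffled = point_cloud_ect(rng.permutation(X))  # same points, rows reordered
```

Because the descriptor only counts points, reordering the rows leaves it unchanged, and each curve rises monotonically from 0 to the number of points.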

Representative Selection

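Phil computes pairwise distances between candidate descriptors and selects the candidate whose descriptor is closest to the ensemble mean. This is analogous to selecting a “medoid” in clustering: the result is always one of the generated candidates, never a synthetic average.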
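A minimal sketch of this selection step, assuming descriptors are flattened to vectors and compared with Euclidean distance (the metric Phil actually uses is not stated here). The helper `select_representative` is hypothetical. It also computes the classical medoid so the two notions can be compared.

```python
import numpy as np


def select_representative(descriptors):
    """Return (index closest to the mean descriptor, medoid index, distance matrix)."""
    D = np.stack([np.asarray(d, dtype=float).ravel() for d in descriptors])

    # Pairwise Euclidean distances between all candidate descriptors.
    dist_matrix = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)

    # Candidate whose descriptor lies closest to the ensemble mean descriptor.
    dist_to_mean = np.linalg.norm(D - D.mean(axis=0), axis=1)
    closest_to_mean = int(np.argmin(dist_to_mean))

    # Classical medoid: candidate with the smallest total distance to all others.
    medoid = int(np.argmin(dist_matrix.sum(axis=1)))

    return closest_to_mean, medoid, dist_matrix


# Five toy one-dimensional descriptors; the last one is an outlier.
toy = [[0.0], [1.0], [2.0], [3.0], [20.0]]
closest_to_mean, medoid, dist_matrix = select_representative(toy)
print(closest_to_mean, medoid)  # prints "3 2"
```

In this toy case the outlier pulls the mean to 5.2, so the candidate nearest the mean (index 3) differs from the medoid (index 2). The two criteria are closely related but not identical, and they diverge most when one candidate is far from the rest.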

Why Topological Descriptors?

Traditional comparison methods (e.g., comparing filled values directly) are sensitive to:

  • Scale of individual features

  • Arbitrary ordering of rows

  • Local perturbations

ECT descriptors are:

  • Invariant to permutations of data points

  • Robust to small perturbations

  • Informative about global structure

Integration Options

Phil supports two integration patterns:

Standalone Usage

from phil import Phil

completed_df = Phil(samples=25).fit(df_with_missing)

sklearn Pipeline

from sklearn.pipeline import Pipeline

from phil.transformers import PhilTransformer

pipe = Pipeline([
    ("impute", PhilTransformer(samples=25)),
    ("model", YourModel()),  # placeholder: substitute any scikit-learn estimator
])

Next Steps