Overview¶
Phil provides representation-guided imputation: instead of choosing a single imputation strategy, Phil explores many possibilities and uses topological descriptors to select the most representative result.
The Problem¶
Imputation is a critical step in data preprocessing, but the choice of strategy significantly impacts downstream analysis:
Mean imputation may distort variance
KNN imputation depends on distance metrics and neighborhood size
Iterative imputation depends on model choice and convergence criteria
How do you know which imputation is “correct”? Phil answers this by treating imputation as an ensemble selection problem.
Phil’s Approach¶
graph TB
subgraph "1. Generate Candidates"
A[DataFrame with missing values]
B1["Mean imputation"]
B2["KNN imputation"]
B3["Iterative imputation"]
B4["Custom strategy"]
end
subgraph "2. Compute Descriptors"
C["ECT computation"]
D1["Descriptor 1"]
D2["Descriptor 2"]
D3["Descriptor 3"]
D4["Descriptor 4"]
end
subgraph "3. Select Representative"
E["Distance matrix"]
F["Centroid selection"]
G["Representative dataset"]
end
A --> B1
A --> B2
A --> B3
A --> B4
B1 --> C
B2 --> C
B3 --> C
B4 --> C
C --> D1
C --> D2
C --> D3
C --> D4
D1 --> E
D2 --> E
D3 --> E
D4 --> E
E --> F
F --> G
style A fill:#f9f9f9,stroke:#999
style C fill:#D9EDF7,stroke:#31708F,stroke-width:2px
style F fill:#D9EDF7,stroke:#31708F,stroke-width:2px
style G fill:#DFF0D8,stroke:#3C763D,stroke-width:2px
Key Concepts¶
- Candidate Generation
Phil uses sklearn’s imputation methods to generate multiple completed datasets from the same input. Each candidate represents a different “version” of the data.
- ECT Descriptors
The Euler Characteristic Transform captures topological structure of each candidate. This provides a principled way to compare imputations based on their geometric properties rather than arbitrary metrics.
- Representative Selection
Phil computes pairwise distances between candidate descriptors and selects the one closest to the ensemble mean. This is analogous to selecting a “medoid” in clustering.
Why Topological Descriptors?¶
Traditional comparison methods (e.g., comparing filled values directly) are sensitive to:
Scale of individual features
Arbitrary ordering of rows
Local perturbations
ECT descriptors are:
Invariant to permutations of data points
Robust to small perturbations
Informative about global structure
Integration Options¶
Phil supports two integration patterns:
Standalone Usage
from phil import Phil
completed_df = Phil(samples=25).fit(df_with_missing)
sklearn Pipeline
from phil.transformers import PhilTransformer
pipe = Pipeline([
("impute", PhilTransformer(samples=25)),
("model", YourModel()),
])
Next Steps¶
Quickstart - Get started quickly
User Guide - Detailed configuration
API Reference - Full documentation