Quickstart

This guide gets you from zero to an imputed dataset in under 5 minutes.

Prerequisites

  • Python 3.10+

  • Phil installed (pip install phil)

Basic Usage

Phil generates multiple candidate imputations and selects the most representative one using topological descriptors.

Step 1: Create a DataFrame with missing values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 45.0, np.nan],
    "income": [50000, np.nan, 75000, 80000, 65000],
    "category": ["A", "B", "A", np.nan, "B"],
})
print(f"Missing values: {df.isna().sum().sum()}")

Step 2: Fit Phil and get the representative imputation

from phil import Phil

imputer = Phil(samples=25, random_state=42)
completed = imputer.fit(df)

print(completed)
print(f"Missing values after: {completed.isna().sum().sum()}")

That’s it! Phil generates 25 candidate imputations, computes Euler Characteristic Transform (ECT) descriptors for each, and returns the candidate closest to the ensemble’s center.
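To build intuition for the "closest to the ensemble’s center" step, here is a minimal sketch of that selection idea in plain NumPy. It is not Phil's implementation: the descriptor used here (per-column mean and standard deviation) is a deliberately simple stand-in for the ECT, and select_representative and the toy candidates are hypothetical names made up for illustration.

```python
import numpy as np


def select_representative(candidates):
    """Return the index of the candidate whose descriptor is nearest the ensemble mean."""
    # Stand-in descriptor (NOT the ECT): per-column mean and standard deviation.
    descriptors = np.array(
        [np.concatenate([c.mean(axis=0), c.std(axis=0)]) for c in candidates]
    )
    # The "center" of the ensemble is the average descriptor.
    center = descriptors.mean(axis=0)
    # Euclidean distance from each candidate's descriptor to the center.
    distances = np.linalg.norm(descriptors - center, axis=1)
    return int(np.argmin(distances))


rng = np.random.default_rng(42)
observed = np.array([[25.0, 50000.0], [30.0, 60000.0], [45.0, 80000.0]])

# Hypothetical candidates: identical observed rows plus one imputed row that varies.
candidates = [
    np.vstack([observed, [rng.normal(35, 5), rng.normal(65000, 5000)]])
    for _ in range(25)
]

idx = select_representative(candidates)
print(f"Selected candidate index: {idx}")
```

Because the descriptor here is unscaled, the income column dominates the distance. A real descriptor would need to handle feature scale, which this sketch leaves out.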

Understanding the Output

Phil returns a fully imputed DataFrame. You do not pick an imputation strategy yourself: the result is whichever candidate best represents the ensemble. After fitting, the imputer exposes metadata about that selection:

# Access metadata about the selection
print(f"Number of candidates generated: {imputer.n_candidates_}")
print(f"Selected candidate index: {imputer.selected_idx_}")

Pipeline Integration

Use PhilTransformer for sklearn pipelines:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from phil.transformers import PhilTransformer

pipe = Pipeline([
    ("impute", PhilTransformer(samples=25, random_state=42)),
    ("clf", RandomForestClassifier()),
])

# Fit on data with missing values
# (X_train, y_train, and X_test are assumed to be defined already)
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
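If you have no dataset at hand, the sketch below builds toy training and test arrays with missing values so the pipeline pattern can be tried end to end. It is only an illustration: scikit-learn's SimpleImputer stands in for PhilTransformer so the snippet runs without Phil installed, and the 10% missingness rate is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy classification data with roughly 10% of entries knocked out at random.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    # Stand-in imputer (assumption); swap in PhilTransformer(...) when Phil is available.
    ("impute", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(random_state=42)),
])

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(f"Predictions shape: {predictions.shape}")
```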

Configuring Imputation Strategies

By default, Phil explores multiple imputation strategies. You can customize which strategies to include:

imputer = Phil(
    samples=50,
    strategies=["mean", "median", "knn", "iterative"],
    random_state=42,
)
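To see what exploring several strategies can look like, the sketch below produces one candidate per strategy with scikit-learn imputers on the numeric columns from Step 1. This is an assumption-laden illustration, not a description of how Phil maps its strategy names to imputers internally. The "iterative" strategy is omitted to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Numeric columns only; categorical handling is out of scope for this sketch.
df_num = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 45.0, np.nan],
    "income": [50000.0, np.nan, 75000.0, 80000.0, 65000.0],
})

# Hypothetical mapping from strategy name to a scikit-learn imputer.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
}

# One fully imputed candidate DataFrame per strategy.
candidates = {
    name: pd.DataFrame(imp.fit_transform(df_num), columns=df_num.columns)
    for name, imp in imputers.items()
}

for name, frame in candidates.items():
    print(name, frame["age"].round(1).tolist())
```

Each strategy fills the same gaps with different values, which is what gives an ensemble-based selection step something to choose between.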

Next Steps