Quickstart¶
This guide gets you from zero to imputed dataset in under 5 minutes.
Prerequisites¶
Python 3.10+
Phil installed (
pip install phil)
Basic Usage¶
Phil generates multiple candidate imputations and selects the most representative one using topological descriptors.
Step 1: Create a DataFrame with missing values
import pandas as pd
import numpy as np
df = pd.DataFrame({
"age": [25.0, 30.0, np.nan, 45.0, np.nan],
"income": [50000, np.nan, 75000, 80000, 65000],
"category": ["A", "B", "A", np.nan, "B"],
})
print(f"Missing values: {df.isna().sum().sum()}")
Step 2: Fit Phil and get the representative imputation
from phil import Phil
imputer = Phil(samples=25, random_state=42)
completed = imputer.fit(df)
print(completed)
print(f"Missing values after: {completed.isna().sum().sum()}")
That’s it! Phil generates 25 candidate imputations, computes ECT descriptors for each, and returns the candidate closest to the ensemble’s center.
Understanding the Output¶
Phil returns a fully imputed DataFrame. The imputation strategy is chosen automatically based on which candidate best represents the ensemble:
# Access metadata about the selection
print(f"Number of candidates generated: {imputer.n_candidates_}")
print(f"Selected candidate index: {imputer.selected_idx_}")
Pipeline Integration¶
Use PhilTransformer for sklearn pipelines:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from phil.transformers import PhilTransformer
pipe = Pipeline([
("impute", PhilTransformer(samples=25, random_state=42)),
("clf", RandomForestClassifier()),
])
# Fit on data with missing values
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
Configuring Imputation Strategies¶
By default, Phil explores multiple imputation strategies. You can customize which strategies to include:
imputer = Phil(
samples=50,
strategies=["mean", "median", "knn", "iterative"],
random_state=42,
)
Next Steps¶
Programmatic - Detailed API usage
Intermediate - Tuning and configuration options
Advanced - Custom strategies and descriptor options
API Reference - Full documentation