MCP Server¶
No Code. Just a DataFrame and a Question.
The Phil MCP (Model Context Protocol) server lets AI clients — Claude, Gemini, Cursor, and others — run topology-guided imputation sweeps on your pandas or polars dataframes without you writing any Python. Point the agent at a CSV or Parquet file, ask for the best imputation, and let it pick the candidate that best represents the ensemble.
This guide is for practitioners who know their data has missing values and want a reproducible imputation rather than a single hand-picked strategy.
Workflow Comparison¶
| Approach | You Do | AI Does | Speed |
|---|---|---|---|
| Programmatic (Python) | Write Python, configure grids | (nothing) | Depends on grid size |
| PhilTransformer (sklearn) | Wire into a Pipeline | (nothing) | Depends on grid size |
| MCP + Agent (recommended) | Hand AI a file path, ask question | Entire sweep workflow | ~5–60s (automated tuning) |
The Value Prop¶
Traditional imputation:
- You guess a strategy (mean, KNN, iterative regressors)
- You run it once, hope it’s “good enough”
- No principled way to compare alternatives

Phil via the MCP:
- The agent generates a grid of candidate imputations
- Each candidate is scored with an ECT topological descriptor
- The candidate closest to the ensemble centroid is selected
- The agent can iterate on the grid if descriptor spread looks degenerate
- You read a structured summary, not a chart of distance metrics
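The selection rule ("closest to the ensemble centroid") can be illustrated with a minimal sketch. It assumes each candidate has already been reduced to a fixed-length descriptor vector and that closeness means Euclidean distance; the function name `select_representative` and the random stand-in vectors are hypothetical and are not Phil's actual implementation.

```python
import math
import random

def select_representative(descriptors):
    """Return the index of the descriptor closest (L2) to the ensemble mean."""
    n_dims = len(descriptors[0])
    # Component-wise mean of all candidate descriptors.
    centroid = [
        sum(d[j] for d in descriptors) / len(descriptors) for j in range(n_dims)
    ]
    distances = [math.dist(d, centroid) for d in descriptors]
    return min(range(len(descriptors)), key=distances.__getitem__)

# Stand-in descriptors: random vectors, NOT real ECT output.
random.seed(0)
descriptors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(20)]
selected_index = select_representative(descriptors)
print(selected_index)
```

Picking the candidate nearest the mean favors a "typical" imputation over an extreme one, which is why no single strategy has to be chosen up front.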
Setup¶
Phil ships an MCP server entry point (phil-mcp) via the mcp extra
of the published philler package. You do not need to clone the
repo — uvx (or pipx) can
launch it directly from PyPI.
Note
Phil works with any MCP-capable client, including Cursor and Gemini CLI, where you can add Phil as an MCP server/tool.
Open ~/Library/Application Support/Claude/claude_desktop_config.json
(macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows)
and add:
{
  "mcpServers": {
    "phil": {
      "command": "uvx",
      "args": ["--from", "philler[mcp]", "phil-mcp"]
    }
  }
}
Restart Claude Desktop. A hammer icon in new chats confirms the tools loaded.
Note
GUI-launched apps on macOS often don’t inherit your shell PATH. If Claude can’t find uvx, replace "command": "uvx" with its absolute path (find it with which uvx, e.g. /Users/yourname/.local/bin/uvx).
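As a convenience, the absolute-path advice in the note can be scripted. The sketch below resolves uvx with `shutil.which` (the Python equivalent of `which uvx`) and prints a server entry that uses the resolved path. The helper name `build_phil_entry` is hypothetical, and the output is meant to be pasted into the config by hand rather than written automatically.

```python
import json
import shutil

def build_phil_entry(uvx_path):
    """Build an mcpServers entry that launches phil-mcp via an explicit uvx path."""
    return {
        "mcpServers": {
            "phil": {
                "command": uvx_path,
                "args": ["--from", "philler[mcp]", "phil-mcp"],
            }
        }
    }

# Fall back to the bare name if uvx is not on this process's PATH.
uvx_path = shutil.which("uvx") or "uvx"
print(json.dumps(build_phil_entry(uvx_path), indent=2))
```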
gemini mcp add phil uv tool run --from "philler[mcp]" phil-mcp
claude mcp add phil -- uv tool run --from "philler[mcp]" phil-mcp
Open Settings → Features → MCP → Add new MCP server:
Name: phil
Type: command
Command: uv tool run --from "philler[mcp]" phil-mcp
Alternative install methods¶
If you prefer a persistent install over ephemeral uv invocations:
pipx install "philler[mcp]" # then use command: phil-mcp
# or
pip install "philler[mcp]" # in any venv
Developing against a local clone¶
Contributors working on Phil’s source can launch the server from a checkout instead:
uv sync --group mcp
uv run phil-mcp
Point your MCP client at uv run --group mcp phil-mcp (with cwd
set to the clone) for live-edit development.
Workflow¶
Once connected, give the agent a goal rather than instructions. The agent already knows the technical steps.
The recommended prompt:
“I have a dataset with missing values at path/to/data.csv. Use Phil to run an imputation sweep and pick the most representative completed dataset. Export it to path/to/imputed.csv.”
Under the hood the agent will:
1. Ingest: register the file as a stable dataset_id handle
2. Characterize: summarize missingness, dtypes, and unique counts
3. List grids: pick the named GridGallery strategy that fits the data (default, finance, healthcare, …)
4. Create a config: generate canonical YAML with sensible defaults
5. Validate: confirm the config parses and the dataset is reachable
6. Run the sweep: fit each candidate imputation and score with ECT
7. Diagnose: inspect descriptor spread and method counts; iterate if the grid collapsed
8. Export: write the selected imputed DataFrame to disk
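To get a feel for what the Characterize step reports, a rough local equivalent can be written in a few lines of pandas. This is only a sketch of the kind of summary described (dtype, unique counts, missing percentage), built on a small hypothetical sample; it does not call the MCP server and may differ from the tool's actual output.

```python
import io

import pandas as pd

# Hypothetical four-row sample; empty fields are read as missing values.
csv_text = "age,income,category\n25,50000,A\n30,,B\n,75000,A\n45,80000,\n"
df = pd.read_csv(io.StringIO(csv_text))

summary = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),  # distinct non-missing values per column
        "missing_pct": df.isna().mean() * 100,
    }
)
print(summary)
print(f"rows={len(df)}, columns={df.shape[1]}")
```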
Pandas and polars¶
Phil’s pipeline runs on pandas internally, but the MCP server accepts any file format pandas or pyarrow can read. To use a polars frame, write it to Parquet first and ingest the path:
import polars as pl
df = pl.read_csv("raw.csv")
df.write_parquet("for_phil.parquet")
# then in your agent chat:
# "Run a Phil sweep on /abs/path/to/for_phil.parquet"
Available MCP Tools¶
The server exposes these tools to the AI client. The agent automatically chains them together:
| Tool | What It Does |
|---|---|
| get_workflow_guide | Returns the opinionated, phase-by-phase Phil workflow as markdown. Opt-in: agents that prefer their own plan can ignore it. |
| get_runtime_context | Reports the server cache directory, session id, and path-visibility guidance, which is useful when the agent needs to ferry sandboxed files into a host-readable location. |
| ingest_dataset | Registers a CSV or Parquet path and returns a stable |
| begin_dataset_upload / append_dataset_chunk / finalize_dataset_upload | Chunked base64 upload pipeline for clients that cannot share a filesystem with the server. Use only when path-based ingest is impossible. |
| characterize_dataset | Sparse per-column schema: dtype, n_unique, missing percent, plus aggregate row/column counts. Cheap and safe to call on wide datasets. |
| probe_columns | Deep per-column inspection for up to 20 columns at a time: sample values, top frequencies, basic numeric statistics. |
| list_grids | Enumerates the named |
| create_config | Materializes a canonical YAML config tailored to a |
| validate_config | Validates and normalizes a config YAML, returning structured issues if anything is off. Rejects fenced Markdown blocks to prevent silent parse failures. |
| refine_config / refine_active_config | Apply dotted-path overrides (e.g. |
| get_active_config | Returns the in-session config YAML, useful for inspection before running a sweep. |
| run_imputation_sweep | The headline tool: fits the candidate grid, scores each with the ECT magic method, selects the representative, and persists a |
| diagnose_sweep | Inspects a saved run’s descriptor spread, selected index, and per-method candidate counts. Use to decide whether to broaden the grid or raise |
| get_candidate_descriptors | Returns the top-k candidates ranked by closeness to the mean descriptor, including the selected index. |
| compare_sweeps | Side-by-side comparison of two persisted runs by config and descriptor statistics. |
| get_experiment_history | Markdown table of every sweep run in the current session, handy for telling the story of how the agent iterated. |
| get_sweep_summary | Returns the full persisted |
| export_imputed_data | Writes the selected imputed DataFrame to disk; CSV / Parquet / Feather inferred from the extension. |
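The "dotted-path overrides" used by the refine tools follow a common pattern: a key such as `a.b.c` addresses a nested field in the config. The sketch below shows that general pattern on a plain dictionary. The function `apply_overrides` and the keys `sweep.samples` and `grid.name` are hypothetical, chosen only for illustration, and may not match Phil's real config schema.

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of config with each dotted-path key set to its value."""
    result = copy.deepcopy(config)
    for dotted_path, value in overrides.items():
        *parents, leaf = dotted_path.split(".")
        node = result
        for key in parents:
            # Create intermediate mappings when a path segment is missing.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result

# Hypothetical config layout, for illustration only.
config = {"sweep": {"samples": 20}, "grid": {"name": "default"}}
updated = apply_overrides(config, {"sweep.samples": 50, "grid.name": "finance"})
print(updated)
```

Working on a deep copy keeps the original config intact, so a rejected refinement can simply be discarded.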
Example: A Mixed-Type Frame with Missing Values¶
Suppose demo.csv has 10 rows, two numeric columns, and one categorical
column, with about 20% missingness:
age,income,category
25,50000,A
30,,B
,75000,A
45,80000,
...
A successful agent dialog looks like:
1. ingest_dataset("/data/demo.csv") → dataset_id="ds_abc123"
2. characterize_dataset("ds_abc123") → reports 4 missing in income, 1 missing in category, etc.
3. list_grids() → agent picks default
4. create_config("ds_abc123", grid="default", samples=20)
5. run_imputation_sweep → returns selected_index=7, descriptor_stats.mean_pairwise_l2=0.14
6. export_imputed_data("/data/demo_imputed.csv")
The resulting CSV is the candidate Phil considered most representative of the imputation ensemble — not the highest-likelihood, not the lowest-loss, but the one closest to the centroid of the ECT descriptor cloud.
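The `mean_pairwise_l2` statistic in the dialog above reads like an average Euclidean distance between every pair of candidate descriptors. Assuming that interpretation (the source does not define it), a sketch of the computation looks like this. The toy vectors are placeholders, not real ECT descriptors.

```python
import itertools
import math

def mean_pairwise_l2(descriptors):
    """Average Euclidean distance over all unordered pairs of descriptors."""
    pairs = list(itertools.combinations(descriptors, 2))
    if not pairs:
        raise ValueError("need at least two descriptors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Toy vectors standing in for per-candidate descriptors.
spread_out = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
collapsed = [[1.0, 2.0]] * 3  # every candidate produced the same descriptor

print(mean_pairwise_l2(spread_out))
print(mean_pairwise_l2(collapsed))
```

Under this reading, a value near zero means the candidates are nearly indistinguishable, which is the "collapsed grid" case where broadening the grid is worth considering.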
Bringing Your Own Data¶
1. Ensure your CSV or Parquet file is accessible on the machine running the MCP server.
2. Connect the server using the setup steps above.
3. Ask: “Run a Phil imputation sweep on my_data.csv and export the chosen imputation. Use the finance grid.”
The agent handles missingness analysis, grid selection, descriptor scoring, and selection. Your job is to interpret the resulting imputed dataset using domain knowledge.
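When the file cannot be made visible to the server, the tools table lists a chunked base64 upload path as the fallback. The agent drives that exchange itself, but the underlying idea is simple: split the bytes into pieces and base64-encode each piece. The sketch below shows only that encoding step. The helper `iter_base64_chunks` and the chunk size are assumptions, and the actual tool parameters are not shown here.

```python
import base64

def iter_base64_chunks(data, chunk_size=256 * 1024):
    """Yield base64-encoded text chunks that cover data in order."""
    for start in range(0, len(data), chunk_size):
        yield base64.b64encode(data[start:start + chunk_size]).decode("ascii")

# Small in-memory payload; a real file would be read with open(path, "rb").
payload = b"age,income,category\n25,50000,A\n30,,B\n"
chunks = list(iter_base64_chunks(payload, chunk_size=16))

# Decoding each chunk independently and concatenating restores the bytes.
restored = b"".join(base64.b64decode(chunk) for chunk in chunks)
print(len(chunks), restored == payload)
```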
Local Medical Demo¶
For a concrete local workflow (including medical CSV generation and MCP test
prompts), see demos/medical/README.md in the repository root.
See also
Configuration — programmatic YAML/config reference
Programmatic — equivalent Python workflow