MCP Server

No Code. Just a DataFrame and a Question.

The Phil MCP (Model Context Protocol) server lets AI clients — Claude, Gemini, Cursor, and others — run topology-guided imputation sweeps on your pandas or polars dataframes without you writing any Python. Point the agent at a CSV or Parquet file, ask for the best imputation, and let it pick the candidate that best represents the ensemble.

This guide is for practitioners who know their data has missing values and want a reproducible imputation rather than a single hand-picked strategy.

Workflow Comparison

Approach	You Do	AI Does	Speed
Programmatic (Python)	Write Python, configure grids	(nothing)	Depends on grid size
PhilTransformer (sklearn)	Wire into a Pipeline	(nothing)	Depends on grid size
MCP + Agent (recommended)	Hand the AI a file path, ask a question	Entire sweep workflow	~5–60s (automated tuning)

The Value Prop

Traditional imputation:

- You guess a strategy (mean, KNN, iterative regressors)
- You run it once, hope it’s “good enough”
- No principled way to compare alternatives

Phil via the MCP:

- The agent generates a grid of candidate imputations
- Each candidate is scored with an ECT (Euler Characteristic Transform) topological descriptor
- The candidate closest to the ensemble centroid is selected
- The agent can iterate on the grid if the descriptor spread looks degenerate
- You read a structured summary, not a chart of distance metrics
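The selection rule can be illustrated with a minimal sketch. It assumes each candidate imputation has already been reduced to a fixed-length descriptor vector. The function name `select_representative` and the toy vectors are hypothetical, and Phil's actual ECT scoring and distance metric may differ.

```python
import math


def select_representative(descriptors):
    """Return the index of the descriptor closest (L2) to the ensemble mean.

    Illustrative only: this is not Phil's implementation.
    """
    if not descriptors:
        raise ValueError("need at least one candidate descriptor")
    dim = len(descriptors[0])
    # Component-wise mean of all descriptors: the ensemble centroid.
    centroid = [sum(d[i] for d in descriptors) / len(descriptors) for i in range(dim)]
    # Euclidean distance from each candidate to the centroid.
    distances = [math.dist(d, centroid) for d in descriptors]
    return min(range(len(descriptors)), key=distances.__getitem__)


# Toy descriptors for four hypothetical candidates.
candidates = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
selected = select_representative(candidates)
```

In this toy case the outlier at `[10.0, 10.0]` pulls the centroid upward, so the third candidate (index 2) sits nearest to it. The chosen candidate is the most typical one, not the one that optimizes any loss.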

Setup

Phil ships an MCP server entry point (phil-mcp) via the mcp extra of the published philler package. You do not need to clone the repo — uvx (or pipx) can launch it directly from PyPI.

Note

Phil works with any MCP-capable client, including Cursor and Gemini CLI, where you can add Phil as an MCP server/tool.

Claude Desktop

Open ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) and add:

{
  "mcpServers": {
    "phil": {
      "command": "uvx",
      "args": ["--from", "philler[mcp]", "phil-mcp"]
    }
  }
}

Restart Claude Desktop. A hammer icon in new chats confirms the tools loaded.

Note

GUI-launched apps on macOS often don’t inherit your shell PATH. If Claude can’t find uvx, set "command" to its absolute path (find it with which uvx, e.g. /Users/yourname/.local/bin/uvx).
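If you would rather script the edit than do it by hand, the following is one possible sketch. The helper names `add_phil_server` and `patch_config_file` are hypothetical and not part of Phil. The entry it writes assumes the `uvx` launcher described above, and existing servers in the file are preserved.

```python
import json
import shutil
from pathlib import Path


def add_phil_server(config, command="uvx"):
    """Return a copy of a Claude Desktop config dict with a `phil` entry added."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers["phil"] = {
        "command": command,
        "args": ["--from", "philler[mcp]", "phil-mcp"],
    }
    updated["mcpServers"] = servers
    return updated


def patch_config_file(path, command=None):
    """Read, update, and rewrite the config file at `path`."""
    path = Path(path)
    # Prefer an absolute path to uvx so GUI-launched apps can find it.
    command = command or shutil.which("uvx") or "uvx"
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_phil_server(config, command), indent=2))
```

On macOS you might call `patch_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json")` and then restart Claude Desktop. Back up the file first, since the sketch rewrites it in place.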

Gemini CLI

gemini mcp add phil uv tool run --from "philler[mcp]" phil-mcp

Claude Code

claude mcp add phil -- uv tool run --from "philler[mcp]" phil-mcp

Cursor / Windsurf

Open Settings → Features → MCP → Add new MCP server:

Name: phil
Type: command
Command: uv tool run --from "philler[mcp]" phil-mcp

Alternative install methods

If you prefer a persistent install over ephemeral uv invocations:

pipx install "philler[mcp]"      # then use command: phil-mcp
# or
pip install "philler[mcp]"       # in any venv

Developing against a local clone

Contributors working on Phil’s source can launch the server from a checkout instead:

uv sync --group mcp
uv run phil-mcp

Point your MCP client at uv run --group mcp phil-mcp (with cwd set to the clone) for live-edit development.

Workflow

Once connected, give the agent a goal rather than instructions. The agent already knows the technical steps.

The recommended prompt:

“I have a dataset with missing values at path/to/data.csv. Use Phil to run an imputation sweep and pick the most representative completed dataset. Export it to path/to/imputed.csv.”

Under the hood the agent will:

1. Ingest: register the file as a stable dataset_id handle
2. Characterize: summarize missingness, dtypes, and unique counts
3. List grids: pick the named GridGallery strategy that fits the data (default, finance, healthcare, …)
4. Create a config: generate canonical YAML with sensible defaults
5. Validate: confirm the config parses and the dataset is reachable
6. Run the sweep: fit each candidate imputation and score with ECT
7. Diagnose: inspect descriptor spread and method counts; iterate if the grid collapsed
8. Export: write the selected imputed DataFrame to disk
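The chain above can be sketched as ordinary Python, which may help when reading an agent's tool-call trace. The tool names come from this guide, but every argument key and result field below is a hypothetical placeholder, as is the `call_tool` callable standing in for a real MCP client. Check each tool's published schema before relying on any of them.

```python
def run_phil_workflow(call_tool, data_path, export_path, grid="default"):
    """Chain the Phil tools in the documented order.

    `call_tool(name, arguments)` forwards to an MCP client and returns a dict.
    Argument keys and result fields here are assumptions for illustration.
    """
    dataset_id = call_tool("ingest_dataset", {"path": data_path})["dataset_id"]
    call_tool("characterize_dataset", {"dataset_id": dataset_id})
    call_tool("list_grids", {})
    call_tool("create_config", {"dataset_id": dataset_id, "grid": grid})
    call_tool("validate_config", {})
    run = call_tool("run_imputation_sweep", {"dataset_id": dataset_id})
    call_tool("diagnose_sweep", {"run_id": run["run_id"]})
    return call_tool(
        "export_imputed_data", {"run_id": run["run_id"], "path": export_path}
    )


# Dry run with a stand-in client that only records the call order.
calls = []


def fake_call_tool(name, arguments):
    calls.append(name)
    return {"dataset_id": "ds_demo", "run_id": "run_demo"}


run_phil_workflow(fake_call_tool, "data.csv", "imputed.csv")
```

A real agent does not follow this script rigidly. It may probe columns, refine the config, or rerun the sweep between the diagnose and export steps.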

Pandas and polars

Phil’s pipeline runs on pandas internally, but the MCP server accepts any file format pandas or pyarrow can read. To use a polars frame, write it to Parquet first and ingest the path:

import polars as pl

df = pl.read_csv("raw.csv")
df.write_parquet("for_phil.parquet")
# then in your agent chat:
#   "Run a Phil sweep on /abs/path/to/for_phil.parquet"

Available MCP Tools

The server exposes these tools to the AI client. The agent automatically chains them together:

Tool	What It Does
get_workflow_guide	Returns the opinionated, phase-by-phase Phil workflow as markdown. Opt-in — agents that prefer their own plan can ignore it.
get_runtime_context	Reports the server cache directory, session id, and path-visibility guidance — useful when the agent needs to ferry sandboxed files into a host-readable location.
ingest_dataset	Registers a CSV or Parquet path and returns a stable `dataset_id` handle. Pass that handle to every downstream tool.
begin_dataset_upload / append_dataset_chunk / finalize_dataset_upload	Chunked base64 upload pipeline for clients that cannot share a filesystem with the server. Use only when path-based ingest is impossible.
characterize_dataset	Sparse per-column schema: dtype, n_unique, missing percent, plus aggregate row/column counts. Cheap and safe to call on wide datasets.
probe_columns	Deep per-column inspection for up to 20 columns at a time: sample values, top frequencies, basic numeric statistics.
list_grids	Enumerates the named `GridGallery` entries (`default`, `sampling`, `finance`, `healthcare`, `marketing`, `engineering`) with method lists and intent blurbs.
create_config	Materializes a canonical YAML config tailored to a `dataset_id` and grid choice. Stores it on the session so subsequent `refine_active_config` / `run_imputation_sweep` calls can omit it.
validate_config	Validates and normalizes a config YAML, returning structured issues if anything is off. Rejects fenced Markdown blocks to prevent silent parse failures.
refine_config / refine_active_config	Apply dotted-path overrides (e.g. `imputation.samples=50`) to an explicit or session config. Unknown keys raise structured errors with the valid path list.
get_active_config	Returns the in-session config YAML, useful for inspection before running a sweep.
run_imputation_sweep	The headline tool: fits the candidate grid, scores each with the ECT magic method, selects the representative, and persists a `RunRecord` plus a markdown diff against the previous run.
diagnose_sweep	Inspects a saved run’s descriptor spread, selected index, and per-method candidate counts. Use to decide whether to broaden the grid or raise `samples`.
get_candidate_descriptors	Returns the top-k candidates ranked by closeness to the mean descriptor, including the selected index.
compare_sweeps	Side-by-side comparison of two persisted runs by config and descriptor statistics.
get_experiment_history	Markdown table of every sweep run in the current session — handy for telling the story of how the agent iterated.
get_sweep_summary	Returns the full persisted `RunRecord` for a given `run_id`.
export_imputed_data	Writes the selected imputed DataFrame to disk; CSV / Parquet / Feather inferred from the extension.
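The dotted-path overrides accepted by refine_config can be pictured as a walk through a nested mapping. The sketch below is an assumption about the general technique, not the server's code. `apply_override` and the `grid` key are hypothetical, and only `imputation.samples` is taken from the table above.

```python
import copy


def apply_override(config, assignment):
    """Apply one `a.b.c=value` override to a nested dict and return a copy."""
    dotted, _, raw = assignment.partition("=")
    keys = dotted.split(".")
    updated = copy.deepcopy(config)
    node = updated
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            # Report the keys that would have been valid at this level.
            valid = sorted(node) if isinstance(node, dict) else []
            prefix = ".".join(keys[:depth])
            raise KeyError(
                f"unknown key {dotted!r}; valid keys under {prefix!r}: {valid}"
            )
        if depth == len(keys) - 1:
            # Keep integers as integers; leave everything else as a string.
            node[key] = int(raw) if raw.lstrip("-").isdigit() else raw
        else:
            node = node[key]
    return updated


config = {"imputation": {"samples": 20, "grid": "default"}}
refined = apply_override(config, "imputation.samples=50")
```

Rejecting unknown paths, rather than silently creating them, is what lets a typo such as `imputation.sample=50` surface as an error instead of a sweep that quietly ignores the override.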

Example: A Mixed-Type Frame with Missing Values

Suppose demo.csv has 10 rows, two numeric columns, and one categorical column, with about 20% missingness:

age,income,category
25,50000,A
30,,B
,75000,A
45,80000,
...
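To get a feel for what a characterization step reports, here is a rough standard-library sketch that computes per-column missing percent and distinct counts. It uses only the four rows shown above (the rest of the file is elided), and `characterize` is a hypothetical helper, not the server's characterize_dataset tool.

```python
import csv
import io

# Only the four rows shown in the excerpt; the remaining rows are elided.
SAMPLE = """age,income,category
25,50000,A
30,,B
,75000,A
45,80000,
"""


def characterize(text):
    """Per-column missing percent and distinct non-missing value count."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        # Treat empty strings as missing, matching the blank CSV cells.
        present = [v for v in values if v != ""]
        summary[column] = {
            "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
            "n_unique": len(set(present)),
        }
    return summary


summary = characterize(SAMPLE)
```

On these four rows each column has exactly one blank cell, so every `missing_pct` is 25.0. The figures for the full 10-row file will differ.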

A successful agent dialog looks like:

ingest_dataset("/data/demo.csv") → dataset_id="ds_abc123"
characterize_dataset("ds_abc123") → reports 4 missing in income, 1 missing in category, etc.
list_grids() → agent picks default
create_config("ds_abc123", grid="default", samples=20)
run_imputation_sweep → returns selected_index=7, descriptor_stats.mean_pairwise_l2=0.14
export_imputed_data("/data/demo_imputed.csv")

The resulting CSV is the candidate Phil considered most representative of the imputation ensemble — not the highest-likelihood, not the lowest-loss, but the one closest to the centroid of the ECT descriptor cloud.
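The field name `descriptor_stats.mean_pairwise_l2` suggests the average Euclidean distance between every pair of candidate descriptors. That reading is an assumption based on the name alone. Under it, the statistic could be computed as in this sketch, where `mean_pairwise_l2` is a hypothetical helper and the vectors are toy data.

```python
import itertools
import math


def mean_pairwise_l2(descriptors):
    """Average Euclidean distance over all unordered pairs of descriptors."""
    pairs = list(itertools.combinations(descriptors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Three spread-out toy descriptors versus two identical ones.
spread = mean_pairwise_l2([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
collapsed = mean_pairwise_l2([[1.0, 2.0], [1.0, 2.0]])
```

A value at or near zero, as in `collapsed`, would mean the candidates are nearly indistinguishable. That is the degenerate case where broadening the grid or raising `samples` is worth trying.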

Bringing Your Own Data

1. Ensure your CSV or Parquet file is accessible on the machine running the MCP server.
2. Connect the server using the setup steps above.
3. Ask: “Run a Phil imputation sweep on my_data.csv and export the chosen imputation. Use the finance grid.”

The agent handles missingness analysis, grid selection, descriptor scoring, and selection. Your job is to interpret the resulting imputed dataset using domain knowledge.
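If the file lives somewhere the server cannot read, the chunked base64 upload tools listed earlier are the fallback. The client-side half of that idea can be sketched as follows. The chunk size and the helper name `iter_base64_chunks` are assumptions, and the actual payload format expected by append_dataset_chunk is not specified in this guide.

```python
import base64


def iter_base64_chunks(data, chunk_size=64 * 1024):
    """Yield base64-encoded text chunks covering `data` in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        # Each slice is encoded independently so it can be decoded on arrival.
        yield base64.b64encode(data[start:start + chunk_size]).decode("ascii")


payload = b"age,income,category\n25,50000,A\n"
chunks = list(iter_base64_chunks(payload, chunk_size=8))

# Decoding the chunks in order reconstructs the original bytes.
restored = b"".join(base64.b64decode(chunk) for chunk in chunks)
```

Encoding each slice separately keeps every chunk independently decodable, at the cost of roughly a one-third size overhead that base64 always carries.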

Local Medical Demo

For a concrete local workflow (including medical CSV generation and MCP test prompts), see demos/medical/README.md in the repository root.