.. _demos:
=================================
Demos
=================================
Pulsar shines when you have real data and real questions. Below are five production demos that showcase different aspects of topological data analysis — from recovering hidden biology to revealing benchmark structure to analyzing clinical trajectories.
Each demo is self-contained and runnable in minutes. Pick one that matches your domain and see the insights Pulsar reveals.
----
1. Palmer Penguins: Recovering Biology Without Labels
=====================================================
**The Hook**: Can topology rediscover penguin species without looking at species labels? Or discover that habitat and sex are equally important structurally?
**The Data**
The `Palmer Penguins `_ dataset contains 333 penguins from three species (Adelie, Chinstrap, Gentoo), with four morphological measurements (bill length, bill depth, flipper length, body mass) plus island and sex. It's the ideal educational dataset: real biology, small, and universally understood.
**The Discovery**
After dropping species labels entirely and letting Pulsar discover structure in the remaining 5-dimensional feature space:
- **The Gentoos**: Completely isolated on Biscoe Island, then perfectly separated by sex. (They are chunky birds with distinctive morphology.)
- **The Adelies**: Fragmented by island of origin — the structural variation within the species is as important as the species itself.
- **The Chinstraps**: Indistinguishable from Dream Island Adelies. They share the same morphological envelope, so the math doesn't lie.
**Key Insight**: Topology reveals that **habitat and biological sex are as structurally important as species itself**. Traditional clustering (K-means) would force three spherical clusters; topology shows the actual complexity.
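The label-dropping step itself is simple enough to sketch directly. A minimal, hypothetical version (the column names are illustrative and may not match ``demos/penguins/penguins.csv`` exactly):

```python
import numpy as np

columns = ["species", "bill_length_mm", "bill_depth_mm",
           "flipper_length_mm", "body_mass_g"]
rows = [
    ["Adelie",    39.1, 18.7, 181.0, 3750.0],
    ["Gentoo",    46.1, 13.2, 211.0, 4500.0],
    ["Chinstrap", 48.7, 17.3, 195.0, 3775.0],
]

# Drop the species label so the topology cannot "cheat" by seeing it.
feature_idx = [i for i, c in enumerate(columns) if c != "species"]
X = np.array([[row[i] for i in feature_idx] for row in rows], dtype=float)

# Z-score each feature so body mass (grams) does not dominate
# bill depth (millimetres) in the distance computation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.shape)  # → (3, 4)
```

The scaling matters: without it, any distance-based structure discovery would effectively cluster on body mass alone.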
**Try It Now**
This is the fastest way to see Pulsar in action. No dataset to download.
.. code-block:: bash

   # Option 1: Use Pulsar with Claude AI (recommended)
   # Install Pulsar MCP server in Claude Desktop (see :ref:`mcp` guide)
   # Then ask Claude: "Use Pulsar to analyze the penguin data at demos/penguins/penguins.csv"

   # Option 2: Run directly with Python
   cd /path/to/pulsar
   uv sync
   uv run maturin develop --release
   uv run python -c "
   from pulsar.pipeline import ThemaRS
   config = {'run': {'name': 'penguins', 'data': 'demos/penguins/penguins.csv'}}
   model = ThemaRS.from_dict(config)
   model.fit()
   print(f'Discovered {len(model.cosmic_graph.nodes())} nodes and {len(model.cosmic_graph.edges())} edges')
   "
**Deep Dive**
- Full walkthrough: `demos/penguins/README.md `_
- YAML configuration: `demos/penguins/params.yaml `_
- Notebook: The penguins example is also the starting point in the :ref:`mcp` guide
----
2. MMLU Benchmark Topology: 57 Subjects, 12 True Clusters
==========================================================
**The Hook**: MMLU is the standard LLM benchmark: 57 subjects, one leaderboard number. What if the real structure doesn't match those labels?
**The Data**
MMLU consists of ~14,000 test questions across 57 administrative subjects (professional medicine, history, chemistry, law, etc.). We embed all questions using ``bge-small-en-v1.5`` (384-dimensional sentence embeddings) and run Pulsar's topological sweep.
**The Discovery**
The geometric structure in embedding space reveals **12 distinct regions** that cut across subject boundaries:
.. list-table:: MMLU's Hidden Structure
   :widths: 10 50 40
   :header-rows: 1

   * - Region
     - Theme
     - Top Subjects
   * - 0
     - Psychology / Behavioral
     - professional_psychology, hs_psychology
   * - 1
     - Medicine / Health
     - professional_medicine, nutrition, clinical_knowledge
   * - 2
     - Mathematics / Quantitative
     - elementary_math, hs_math, hs_statistics
   * - 3
     - Moral Reasoning
     - **moral_scenarios (100% isolated)**
   * - 5
     - Law
     - **professional_law (87% of region)**
   * - 8
     - History
     - hs_world_history, hs_us_history
**Key Insights**:
- ``moral_scenarios`` forms a completely isolated island — structurally alien to the rest of MMLU
- ``professional_law`` is the tightest cluster (87% of Region 5)
- Psychology splits: behavioral questions in Region 0, philosophical in Region 7
- **Leaderboard blind spot**: Different LLMs have vastly different accuracy across regions. The single benchmark number hides this variation.
- Random sampling needs **3x more questions** than topology-aware sampling to cover all 12 regions
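The coverage claim can be made concrete with a toy sketch. Everything here is invented for illustration (the region sizes and the ``topology_aware_sample`` helper are hypothetical; Pulsar's actual sampler may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical region sizes for 1,200 questions across 12 regions,
# deliberately imbalanced: small regions are easy to miss at random.
sizes = [300, 250, 200, 150, 100, 80, 50, 30, 20, 10, 6, 4]
region = np.repeat(np.arange(12), sizes)

def topology_aware_sample(labels, per_region, rng):
    """Draw `per_region` items from every region: coverage by construction."""
    picks = []
    for r in np.unique(labels):
        idx = np.flatnonzero(labels == r)
        picks.extend(rng.choice(idx, size=min(per_region, idx.size), replace=False))
    return np.asarray(picks)

sample = topology_aware_sample(region, per_region=4, rng=rng)
print(np.unique(region[sample]).size)  # → 12: every region represented
```

A uniform random sample of the same size (48 questions) can easily miss the 4-question region entirely; stratifying by discovered region guarantees coverage.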
**Try It Now**
Jupyter notebook with full analysis and per-model evaluation:
.. code-block:: bash

   cd demos/mmlu
   uv sync --group demos
   uv run maturin develop --release
   jupyter notebook mmlu_topology_demo.ipynb
First run downloads and embeds ~14k questions (~2 min on Apple Silicon). Subsequent runs use cached data.
**Deep Dive**
- Full README with calibration details: `demos/mmlu/README.md `_
- Jupyter notebook: `mmlu_topology_demo.ipynb `_
- Configuration: `mmlu_params.yaml `_
----
3. Clinical Trajectories: PhysioNet ICU Vitals Over Time
========================================================
**The Hook**: Two patients with identical vital signs right now might have completely different futures. Can topology reveal their trajectory archetypes?
**The Data**
The demo simulates 500 ICU patients over 72 hours with 8 vital signs: heart rate, systolic/diastolic BP, MAP, respiratory rate, temperature, SpO₂, lactate, glucose. Five distinct clinical archetypes are embedded in the synthetic trajectories (sepsis progression, recovery, decline, stable, recovery-plateau).
This demonstrates **TemporalCosmicGraph** — a 3D tensor approach (patient × feature × time) that captures patient-level temporal patterns, not just snapshots.
**The Discovery**
- Patients cluster by **trajectory type**, not current state. A recovering patient and a declining patient may have identical vitals right now but opposite futures.
- Multiple aggregation modes reveal different groupings:

  - **Persistence** → stable vs. volatile patients
  - **Trend** → improving vs. worsening trajectories
  - **Volatility** → high-risk vs. stable
  - **Change point** → when trajectory shifts occur
- Early warning signals emerge from trajectory clustering, not from any single vital.
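A rough sketch of what a (patient × feature × time) tensor and two of those aggregation modes look like. The ``trend`` and ``volatility`` helpers here are illustrative reimplementations, not TemporalCosmicGraph's API:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical (patient x feature x time) tensor: 4 patients, 2 vitals,
# 12 hourly readings. Patient 0 worsens steadily; patient 1 improves.
hours = np.arange(12, dtype=float)
X = rng.normal(0.0, 0.1, size=(4, 2, 12))
X[0] += 0.5 * hours
X[1] -= 0.5 * hours

def trend(tensor):
    """Least-squares slope over the time axis, per (patient, feature)."""
    t = np.arange(tensor.shape[-1])
    tc = t - t.mean()
    return (tensor * tc).sum(axis=-1) / (tc ** 2).sum()

def volatility(tensor):
    """Standard deviation over the time axis, per (patient, feature)."""
    return tensor.std(axis=-1)

slopes = trend(X)
print(slopes[0, 0] > 0, slopes[1, 0] < 0)  # → True True
```

Clustering on ``slopes`` separates patients 0 and 1 even at hours where their raw vitals momentarily coincide, which is the point of trajectory-level analysis.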
**Try It Now**
With synthetic data (no real PHI):
.. code-block:: bash

   cd /path/to/pulsar
   uv sync
   uv run maturin develop --release
   uv run python demos/ehr/physionet.py --synthetic --n-patients 500

With real eICU data (if you have access via PhysioNet):

.. code-block:: bash

   # First download eICU from https://physionet.org
   uv run python demos/ehr/physionet.py --data /path/to/eicu.csv
**Deep Dive**
- Script: `demos/ehr/physionet.py `_
- Configuration: `physionet_params.yaml `_
----
4. ECG Arrhythmia Classification via Feature Extraction
=======================================================
**The Hook**: 60,000 raw ECG samples per patient. Can a compact feature vector capture enough to cluster arrhythmias?
**The Data**
ECG (electrocardiogram) signals from the `PhysioNet Arrhythmia Database `_: 12-lead recordings at 500 Hz, 10-second windows = 5,000 samples per lead, per patient. The demo extracts ~80 summary features per ECG:
- Statistical: mean, std, min, max, median, skewness, kurtosis
- Frequency: FFT peaks, power spectral density
- Morphological: zero crossings, rate of change statistics
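A sketch of the kinds of features listed above, computed on a synthetic stand-in signal (the demo's exact feature set and names may differ):

```python
import numpy as np

fs = 500                        # sampling rate in Hz, matching the demo data
t = np.arange(10 * fs) / fs     # one 10-second window = 5,000 samples
# Hypothetical stand-in for one ECG lead: a 1.2 Hz "beat" plus noise.
rng = np.random.default_rng(1)
sig = np.sin(2 * np.pi * 1.2 * t) + 0.05 * rng.normal(size=t.size)

def extract_features(x, fs):
    """A small illustrative subset of the ~80 summary features."""
    centered = x - x.mean()
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "skewness": float((centered ** 3).mean() / x.std() ** 3),
        "zero_crossings": int((np.diff(np.signbit(centered).astype(int)) != 0).sum()),
        "dominant_freq_hz": float(freqs[spectrum.argmax()]),
    }

feats = extract_features(sig, fs)
print(round(feats["dominant_freq_hz"], 1))  # → 1.2, the injected beat frequency
```

The idea is the same at scale: each 5,000-sample lead collapses to a short, fixed-length vector, and topology then works in that compact feature space instead of the raw samples.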
**The Discovery**
- Topology reveals clusters that **align with SNOMED-CT arrhythmia diagnoses** better than K-means or traditional clustering
- Different leads emphasize different diagnostic features — combining all 12 leads captures the full arrhythmia signature
- Trade-off: extracting summary features is computationally much cheaper than true temporal modeling (TemporalCosmicGraph), with minimal loss in structure discovery
**Try It Now**
With synthetic ECG patterns:
.. code-block:: bash

   uv run python demos/ehr/ecg_arrhythmia.py --synthetic

With real PhysioNet data:

.. code-block:: bash

   # Download from https://physionet.org (requires registration)
   uv run python demos/ehr/ecg_arrhythmia.py --data /path/to/ecg_data
**Deep Dive**
- Script: `demos/ehr/ecg_arrhythmia.py `_
- Configuration: Hardcoded in the script; adjust PCA dimensions and epsilon range as needed
----
5. US Coal Plants: Production-Scale Grid Sweep
==============================================
**The Hook**: Real infrastructure data at scale. How do operational coal plants cluster when you account for location, capacity, age, emissions, and status?
**The Data**
147 US coal power plants with features: latitude, longitude, capacity (MW), age, emissions (CO₂, NOx, SO₂), operational status, retire year (if planned). Dataset is automatically downloaded from the `retire `_ project. Real-world, production-scale problem.
**The Discovery**
- Plants cluster by **operational region** and **capacity tier**, not administrative ownership
- Age and emissions profiles separate active vs. retiring cohorts
- Geographic clustering aligns with grid topology and energy markets
- The full sweep (4 PCA dims × 8 seeds × 50 epsilons = 1,600 ball maps) approximates the cosmic graph from the original `Pulsar Nature paper `_

**Try It Now**

Automatic dataset download, grid search, and timing report:

.. code-block:: bash

   uv run python demos/energy/coal.py

The demo prints per-stage wall-clock timings (preprocessing, PCA, Ball Mapper, graph accumulation, thresholding) and the final graph size. On a modern machine the full 1,600-map sweep takes roughly 2–5 seconds.
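For intuition, the core Ball Mapper step (cover the point cloud with ε-balls around greedily chosen landmarks, then connect overlapping balls) can be sketched in a few lines. This is an illustrative reimplementation on synthetic data, not Pulsar's Rust implementation:

```python
import numpy as np

def ball_mapper(points, eps, rng):
    """Greedy eps-net cover: landmarks, their balls, and overlap edges."""
    order = rng.permutation(len(points))
    landmarks = []
    for i in order:  # keep a point as a landmark only if nothing covers it yet
        if all(np.linalg.norm(points[i] - points[l]) > eps for l in landmarks):
            landmarks.append(int(i))
    # Each ball collects every point within eps of its landmark ...
    balls = [set(np.flatnonzero(np.linalg.norm(points - points[l], axis=1) <= eps))
             for l in landmarks]
    # ... and two balls are connected when they share at least one point.
    edges = [(a, b) for a in range(len(balls)) for b in range(a + 1, len(balls))
             if balls[a] & balls[b]]
    return landmarks, edges

rng = np.random.default_rng(7)
# Two well-separated synthetic "plant clusters" in a 2-D feature space.
pts = np.vstack([rng.normal(0.0, 0.3, (60, 2)), rng.normal(5.0, 0.3, (60, 2))])
landmarks, edges = ball_mapper(pts, eps=1.0, rng=rng)
# No ball can straddle the gap, so no edge bridges the two clusters.
```

The grid sweep repeats this for many (PCA dim, seed, epsilon) combinations and accumulates the resulting graphs, which is why the per-map cost has to stay small.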
**Deep Dive**
- Script: `demos/energy/coal.py `_
- Configuration: `coal_params.yaml `_
- Data: automatically downloaded on first run from `retire project `_
----
Choosing Your Demo
==================
.. list-table::
   :widths: 25 25 50
   :header-rows: 1

   * - Domain
     - Demo
     - Why Choose It
   * - Education / Getting Started
     - Palmer Penguins
     - Fastest, most intuitive
   * - Research / Benchmarks
     - MMLU
     - Reveals hidden structure
   * - Healthcare / Trajectories
     - PhysioNet (Clinical)
     - Time-series aware
   * - Healthcare / Signals
     - ECG Arrhythmia
     - Feature engineering
   * - Infrastructure / Scale
     - Coal Plants
     - Real-world, production-ready
----
Next Steps
==========
Once you've explored a demo:
1. **Use with Claude AI**: Set up the :ref:`mcp` server and point Claude at your own data. The AI will handle parameter tuning and generate statistical dossiers.
2. **Adapt for your data**: Copy the nearest demo's YAML config and adjust for your feature scales and desired PCA dimensions.
3. **Deep dive on parameters**: See :ref:`intermediate` for guidance on tuning epsilon ranges and dimension selection.
4. **Deploy to production**: The coal demo shows how to instrument timing and validation. See :ref:`intermediate` for configuration and parameter guidance.