Pipeline

From a benchmark to a diagnosis, in three independent stages — profile, evaluate, analyze.

An exploratory-QA task is a tuple τ = (Q, D, D_gold, T, A★, B): a question Q, a data lake D, the minimal gold sources D_gold sufficient to answer it, a tool set T (dataset search f(q,k) and data analysis g(c,d)), the gold answer A★, and a tool-call budget B (we use B = 30). SANA runs as three phases you can use independently:
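The tuple can be pictured as a small record type. The sketch below is illustrative only: the names `EQATask` and `BudgetedTools`, the field names, and the tool signatures are assumptions made for this page, not SANA's actual schema. Only the tuple components and B = 30 come from the definition above.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures for the two tools in T (names are assumptions):
#   search(q, k)  -> identifiers of up to k candidate sources for query q
#   analyze(c, d) -> result of running code c against source d
SearchFn = Callable[[str, int], List[str]]
AnalyzeFn = Callable[[str, str], str]


@dataclass(frozen=True)
class EQATask:
    question: str            # Q
    data_lake: List[str]     # D, represented here by source identifiers
    gold_sources: List[str]  # D_gold, a minimal sufficient subset of D
    gold_answer: str         # A★
    budget: int = 30         # B, the tool-call budget


class BudgetedTools:
    """Wraps the tool set T and counts every call against the budget B."""

    def __init__(self, task: EQATask, search: SearchFn, analyze: AnalyzeFn):
        self._search, self._analyze = search, analyze
        self.remaining = task.budget

    def _spend(self) -> None:
        if self.remaining <= 0:
            raise RuntimeError("tool-call budget exhausted")
        self.remaining -= 1

    def search(self, q: str, k: int) -> List[str]:
        self._spend()
        return self._search(q, k)

    def analyze(self, c: str, d: str) -> str:
        self._spend()
        return self._analyze(c, d)
```

Charging search and analysis calls to one shared counter is a design choice of this sketch; it reflects the single budget B in the tuple.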

1 sana-profiling

Profiling mirrors a benchmark task into a LakeQA-style EQA task and a matching runtime profile, the artifact that powers every idealized tool. A profile records the source sequence, the sanitized subquestions, and per-node execution records with verified answers.

Figure: A task (top) mirrored into the SANA profile it produces (bottom).
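To make the three parts of a profile concrete, here is a minimal sketch of how they might be held in memory. The class and field names are hypothetical, and the `leaks` check is a crude stand-in for whatever sanitization the profiling phase actually performs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionRecord:
    node_id: str          # which step of the reasoning chain this is
    code: str             # the SQL/Python recorded for the step
    verified_answer: str  # the answer checked for this step


@dataclass
class RuntimeProfile:
    source_sequence: List[str]  # gold sources in the order they are needed
    subquestions: List[str]     # sanitized: should not reveal the final answer
    records: List[ExecutionRecord] = field(default_factory=list)

    def leaks(self, gold_answer: str) -> List[str]:
        """Return subquestions that contain the gold answer verbatim."""
        needle = gold_answer.strip().lower()
        return [s for s in self.subquestions if needle and needle in s.lower()]
```

A verbatim substring check would miss paraphrased leaks, so treat it as a smoke test at most.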

The profile is what each ideal mode draws on at evaluation time:

--profile ideal

Exposes the gold reasoning chain as an explicit, answer-safe decomposition.

--search ideal

Returns exactly the gold sources each query needs from the source sequence — nothing irrelevant.

--compute ideal

Runs the profile's ideal SQL/Python — no implementation error, exactly as the agent intended.

LakeQA and KramaBench ship ready-made tasks, profiles, and artifacts. Converting a new benchmark is report-first: sample examples, scaffold a transform skill, then run it.

2 sana-evaluation

Evaluation runs a fixed agent runtime against the profiles while you choose which axis to ablate. A diagnostic run takes one axis off ideal and holds the rest idealized, so a drop in score can be traced to that axis rather than to several interacting bottlenecks. Each axis is a CLI flag:

Flag | Modes | What it controls
--profile | naive · standard · ideal | How much runtime-profile guidance (the decomposition) the agent is given.
--search | naive · standard · ideal · preloaded | BM25 (naive), hybrid RRF + LLM descriptions (standard), gold sources (ideal), or D_gold placed in context (preloaded).
--results | naive · ideal | Live retrieved results vs. profile-backed ideal result payloads.
--compute | standard · ideal | Ordinary execution tools vs. the profile's ideal SQL/Python.
--skills | on · off | The AgentSkills plugin.

Each axis is chosen per run, alongside the benchmark, model, search-result limit, and task scope — a lightweight smoke sample or the full maintained task set.
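The one-axis-at-a-time design can be sketched as a small generator of flag lists. This is an illustration, not the maintained CLI: the flag names and modes are copied from the table above, the entry-point command is deliberately left out because it is not given on this page, and `--skills` is omitted because it has no ideal mode.

```python
from typing import Dict, List

# Modes per axis, copied from the flag table (axes that have an ideal mode).
AXES: Dict[str, List[str]] = {
    "profile": ["naive", "standard", "ideal"],
    "search": ["naive", "standard", "ideal", "preloaded"],
    "results": ["naive", "ideal"],
    "compute": ["standard", "ideal"],
}


def diagnostic_runs(axis: str) -> List[List[str]]:
    """Flag lists that vary `axis` over its non-ideal modes, rest held at ideal."""
    if axis not in AXES:
        raise KeyError(f"unknown axis: {axis}")
    runs = []
    for mode in AXES[axis]:
        if mode == "ideal":
            continue
        flags: List[str] = []
        for name in AXES:
            flags += [f"--{name}", mode if name == axis else "ideal"]
        runs.append(flags)
    return runs


for flags in diagnostic_runs("search"):
    print(" ".join(flags))
```

Each printed line differs from the all-ideal configuration in exactly one flag, which is what lets a score change be pinned on that axis.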

Bring your own tool or variant

Every phase is replaceable: swap in your own data lake, retrieval backend, or benchmark artifacts and keep SANA's artifact contracts. Measuring a new retriever or executor is just a run with that axis set to its mode, compared against the idealized upper bound and the realistic baselines.
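As a rough picture of what a drop-in retrieval backend could look like, the sketch below defines a search interface shaped like the dataset-search tool f(q,k). The `SearchBackend` protocol and the toy `KeywordOverlapSearch` class are hypothetical; SANA's real artifact contracts live in the eval repo and may differ.

```python
from typing import Dict, List, Protocol


class SearchBackend(Protocol):
    """Hypothetical shape of a dataset-search tool f(q, k)."""

    def search(self, query: str, k: int) -> List[str]:
        ...


class KeywordOverlapSearch:
    """Toy backend: rank sources by the number of words shared with the query."""

    def __init__(self, descriptions: Dict[str, str]):
        self.descriptions = descriptions  # source id -> description text

    def search(self, query: str, k: int) -> List[str]:
        words = set(query.lower().split())
        scored = [
            (len(words & set(text.lower().split())), source_id)
            for source_id, text in self.descriptions.items()
        ]
        # Highest overlap first; ties broken by source id for determinism.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [source_id for score, source_id in ranked[:k] if score > 0]
```

Any class with a matching `search` method satisfies the protocol, so a new retriever can be compared against the naive, standard, and ideal modes without changing the rest of the run.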

3 sana-analysis

Analysis scores the runs and reproduces the paper's diagnosis. The primary metric is semantic match (SM) — an LLM-as-judge crediting answers equivalent to the gold answer up to formatting, aliases, units, or phrasing — alongside two discovery-recall metrics over the gold sources:

SM

Semantic match against the gold answer (↑), the headline metric.

D_ret

Retrieval recall: gold sources surfaced by the search tools (↑).

D_acc

Access recall: gold sources actually opened / queried (↑).
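Both discovery metrics are recalls over the gold sources, which can be sketched per task as a set overlap. The function name and the file names below are made up for illustration, and how SANA aggregates the metric across tasks is not specified here.

```python
from typing import Iterable, Set


def recall(gold: Iterable[str], observed: Iterable[str]) -> float:
    """Fraction of gold sources that appear in `observed`."""
    gold_set: Set[str] = set(gold)
    if not gold_set:
        raise ValueError("gold source set must be non-empty")
    return len(gold_set & set(observed)) / len(gold_set)


gold = ["census.csv", "income.csv"]
surfaced = ["census.csv", "income.csv", "weather.csv"]  # returned by search
accessed = ["census.csv"]                               # opened or queried

d_ret = recall(gold, surfaced)  # 1.0: both gold sources were retrieved
d_acc = recall(gold, accessed)  # 0.5: only one of them was actually used
```

A gap between D_ret and D_acc, as in this toy case, would indicate that the agent saw a gold source in its search results but never opened it.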

From the scored runs, analysis produces:

If idealizing an axis closes a large gap, that component is the bottleneck. The gap that remains when all three are idealized is attributable to the agent's policy — evidence tracking, source commitment, intermediate validation, and stopping criteria.
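The attribution logic can be written down in a few lines. This is a sketch under assumptions: SM is taken to lie in [0, 1], each per-axis gap is measured as the all-ideal score minus the score with only that axis off ideal, and the numbers in the example are invented, not results from the paper.

```python
from typing import Dict


def attribute_gaps(sm_all_ideal: float, sm_one_off: Dict[str, float]) -> Dict[str, float]:
    """Split lost SM into per-axis gaps plus the residual policy gap.

    sm_all_ideal: SM with every axis idealized.
    sm_one_off:   SM when only the named axis is taken off ideal.
    """
    gaps = {axis: sm_all_ideal - sm for axis, sm in sm_one_off.items()}
    gaps["policy"] = 1.0 - sm_all_ideal  # what idealized tools cannot fix
    return gaps


# Invented scores, for illustration only.
gaps = attribute_gaps(0.75, {"search": 0.5, "compute": 0.625})
bottleneck = max((a for a in gaps if a != "policy"), key=lambda a: gaps[a])
print(gaps)        # {'search': 0.25, 'compute': 0.125, 'policy': 0.25}
print(bottleneck)  # search
```

In this made-up case search is the largest component gap, while a quarter of the tasks still fail with every tool idealized and would be attributed to the agent's policy.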

Cite

@inproceedings{sana2026,
  title     = {SANA: What Matters for QA Agents over Massive Data Lakes?},
  author    = {Wijaya, Austin Senna and Liu, Jiaxiang and Wang, Haonan and Wu, Eugene},
  booktitle = {VLDB 2026 Workshop on Systems for Data-centric Agents with Human-in-the-loop (DASHSys)},
  year      = {2026}
}

See the eval repo for the maintained CLI, the full flag set, and runnable artifacts.