Pipeline — SANA

An exploratory-QA (EQA) task is a tuple τ = (Q, D, D_gold, T, A★, B): a question Q, a data lake D, the minimal set of gold sources D_gold sufficient to answer it, a tool set T (dataset search f(q,k) and data analysis g(c,d)), the gold answer A★, and a tool-call budget B (we use B = 30). SANA runs as three phases that you can use independently:

1 sana-profiling → 2 sana-evaluation → 3 sana-analysis
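
As a rough illustration of the task tuple defined above, the sketch below models τ as a Python dataclass and wraps the two tools so that every call counts against the budget B. The class names, field names, and the reading of f(q,k) as (query text, result limit) and g(c,d) as (code, dataset id) are assumptions made for this example; they are not SANA's actual data model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical signatures for the two tools in T (argument meanings are assumed).
SearchTool = Callable[[str, int], list[str]]   # f(q, k) -> dataset ids
AnalysisTool = Callable[[str, str], str]       # g(c, d) -> result text


@dataclass(frozen=True)
class EQATask:
    question: str                 # Q
    data_lake: Sequence[str]      # D, represented here by dataset ids only
    gold_sources: frozenset[str]  # D_gold
    search: SearchTool            # f in T
    analyze: AnalysisTool         # g in T
    gold_answer: str              # A★
    budget: int = 30              # B, the tool-call budget


class BudgetExceeded(RuntimeError):
    """Raised when a tool call would go past the task's budget."""


class BudgetedTools:
    """Wraps a task's tools and counts every call against the budget B."""

    def __init__(self, task: EQATask) -> None:
        self.task = task
        self.calls = 0

    def _spend(self) -> None:
        if self.calls >= self.task.budget:
            raise BudgetExceeded(f"budget of {self.task.budget} tool calls exhausted")
        self.calls += 1

    def search(self, q: str, k: int) -> list[str]:
        self._spend()
        return self.task.search(q, k)

    def analyze(self, c: str, d: str) -> str:
        self._spend()
        return self.task.analyze(c, d)
```

Counting calls in a wrapper, rather than inside each tool, keeps the budget rule in one place regardless of which search or analysis backend is plugged in.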

1 sana-profiling

Profiling mirrors a benchmark task into a LakeQA-style EQA task and a matching runtime profile — the artifact that powers every idealized tool. A profile records:

source_sequence / dataset_sequence — the ordered gold sources each step needs.
reasoning_chain_text — the answer-safe decomposition (sub-questions that don't leak which dataset to fetch).
ideal_query / ideal_code — per node: the intent, the verified SQL or Python, and the gold intermediate answer.
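
To make the shape of a profile more concrete, here is a minimal sketch of the fields listed above as Python dataclasses. The layout (one list entry per step, a `ProfileNode` per node) and the `leaks_sources` helper are assumptions for illustration; the real profile schema lives in the eval repo and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileNode:
    """One ideal_query / ideal_code entry (hypothetical layout)."""
    intent: str        # what the step is trying to find out
    code: str          # the verified SQL or Python for the step
    language: str      # "sql" or "python"
    gold_answer: str   # the gold intermediate answer


@dataclass
class RuntimeProfile:
    source_sequence: list[list[str]]   # gold sources needed at each step, in order
    reasoning_chain_text: list[str]    # answer-safe sub-questions
    nodes: list[ProfileNode] = field(default_factory=list)

    def leaks_sources(self) -> list[str]:
        """Return sub-questions that mention a gold source name verbatim.

        A crude stand-in for the 'answer-safe' property: a decomposition
        should not tell the agent which dataset to fetch.
        """
        names = {s.lower() for step in self.source_sequence for s in step}
        return [q for q in self.reasoning_chain_text
                if any(name in q.lower() for name in names)]
```

A substring check like this only catches verbatim leaks; paraphrased hints would need a stronger check.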

Figure: A task (top) mirrored into the SANA profile it produces (bottom).

The profile is what each ideal mode draws on at evaluation time:

`--profile ideal`

Exposes the gold reasoning chain as an explicit, answer-safe decomposition.

`--search ideal`

Returns exactly the gold sources each query needs from the source sequence — nothing irrelevant.

`--compute ideal`

Runs the profile's ideal SQL/Python — no implementation error, exactly as the agent intended.
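
As a hedged sketch of how a profile can back an idealized tool, the function below answers a search call from the profile's source sequence instead of from a retriever. The function name and the assumption that the caller already knows which step (0-based) the query belongs to are illustrative; how SANA aligns a live query with a step is not described here.

```python
def ideal_search(source_sequence: list[list[str]], step: int) -> list[str]:
    """Return only the gold sources recorded for the given step.

    `source_sequence` holds one list of gold source ids per step.
    An out-of-range step yields no results rather than an error.
    """
    if not 0 <= step < len(source_sequence):
        return []
    # Copy so callers cannot mutate the profile through the result.
    return list(source_sequence[step])
```

For example, with `source_sequence = [["a.csv"], ["b.csv", "c.csv"]]`, calling `ideal_search(source_sequence, 1)` returns `["b.csv", "c.csv"]` and nothing else from the lake.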

LakeQA and KramaBench ship ready-made tasks, profiles, and artifacts. Converting a new benchmark is report-first: sample examples, scaffold a transform skill, then run it.

2 sana-evaluation

Evaluation runs a fixed agent runtime against the profiles while you choose which axis to ablate. A diagnostic run sets one axis to a non-ideal mode and holds the others at ideal, so any change in score can be attributed to that axis rather than to interacting bottlenecks. Each axis is a CLI flag:

| Flag | Modes | What it controls |
| --- | --- | --- |
| `--profile` | naive · standard · ideal | How much runtime-profile guidance (the decomposition) the agent is given. |
| `--search` | naive · standard · ideal · preloaded | BM25 (naive), hybrid RRF + LLM descriptions (standard), gold sources (ideal), or `D_gold` placed in context (preloaded). |
| `--results` | naive · ideal | Live retrieved results vs. profile-backed ideal result payloads. |
| `--compute` | standard · ideal | Ordinary execution tools vs. the profile's ideal SQL/Python. |
| `--skills` | on · off | The AgentSkills plugin. |
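
The standard search mode mentions hybrid RRF. For readers unfamiliar with it, reciprocal rank fusion merges several ranked lists by giving each item a score of 1 / (k + rank) per list and summing. The sketch below shows the general technique only; the constant `k = 60` is the commonly used default, and which rankers SANA fuses and with what constant are not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score ranks first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Two hypothetical rankers that disagree on the top result.
lexical = ["a", "b", "c"]
dense = ["b", "c", "a"]
fused = reciprocal_rank_fusion([lexical, dense])  # "b" ranks first
```

Because only ranks are used, the fused lists do not need comparable raw scores, which is why the method suits combining lexical and embedding-based retrieval.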

Each axis is chosen per run, alongside the benchmark, model, search-result limit, and task scope — a lightweight smoke sample or the full maintained task set.
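
The one-axis-at-a-time design can be enumerated mechanically. The sketch below builds the set of diagnostic configurations from the flag table: start from all-ideal and move a single axis to each of its non-ideal modes. It is an illustration under stated assumptions, not the maintained CLI: the helper names are made up, `preloaded` and `--skills` are left out because they are not plain non-ideal alternatives, and the command that consumes the flags is deliberately not shown.

```python
# All-ideal reference configuration (axes taken from the flag table).
IDEAL = {"profile": "ideal", "search": "ideal", "results": "ideal", "compute": "ideal"}

# Non-ideal modes per axis, as listed in the flag table.
NON_IDEAL = {
    "profile": ["naive", "standard"],
    "search": ["naive", "standard"],
    "results": ["naive"],
    "compute": ["standard"],
}


def diagnostic_runs() -> list[dict[str, str]]:
    """One configuration per (axis, non-ideal mode), all other axes ideal."""
    runs = []
    for axis, modes in NON_IDEAL.items():
        for mode in modes:
            cfg = dict(IDEAL)
            cfg[axis] = mode
            runs.append(cfg)
    return runs


def to_flags(cfg: dict[str, str]) -> list[str]:
    """Render a configuration as flag tokens, e.g. ['--profile', 'ideal', ...]."""
    return [tok for axis, mode in cfg.items() for tok in (f"--{axis}", mode)]
```

With the modes above this yields six diagnostic runs, each differing from the all-ideal reference in exactly one flag.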

Bring your own tool or variant

Every phase is replaceable: swap in your own data lake, retrieval backend, or benchmark artifacts and keep SANA's artifact contracts. Measuring a new retriever or executor is just a run with that axis set to its mode, compared against the idealized upper bound and the realistic baselines.

3 sana-analysis

Analysis scores the runs and reproduces the paper's diagnosis. The primary metric is semantic match (SM) — an LLM-as-judge crediting answers equivalent to the gold answer up to formatting, aliases, units, or phrasing — alongside two discovery-recall metrics over the gold sources:

SM

Semantic match against the gold answer (↑), the headline metric.

D_ret

Retrieval recall: gold sources surfaced by the search tools (↑).

D_acc

Access recall: gold sources actually opened / queried (↑).
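
Both discovery metrics are recalls over the gold sources, so one helper covers them. The sketch below assumes per-task computation over sets of source ids; the function name is hypothetical, and how scores are aggregated across tasks is not specified here.

```python
def gold_recall(gold: set[str], observed: set[str]) -> float:
    """Fraction of gold sources that appear in `observed`."""
    if not gold:
        raise ValueError("a task needs at least one gold source")
    return len(gold & observed) / len(gold)


# Hypothetical trace of one run.
gold = {"a.csv", "b.csv"}
retrieved = {"a.csv", "b.csv", "c.csv"}   # surfaced by the search tools
accessed = {"a.csv"}                      # actually opened or queried

d_ret = gold_recall(gold, retrieved)  # 1.0
d_acc = gold_recall(gold, accessed)   # 0.5
```

A run like this one, where D_ret is high but D_acc is low, indicates that the right sources were found but never used.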

From the scored runs, analysis produces:

Per-component deltas — the SM gain from idealizing each axis (charted on the home page).
End-to-end comparison — Naive vs. Standard vs. all-Ideal runs.
Failure taxonomy — where incorrect runs break (source targeting, computation, finalization, …).
Plan-trajectory audit — whether the agent actually follows the plan it wrote.

If idealizing an axis closes a large gap, that component is the bottleneck. The gap that remains when all three of profile, search, and compute are idealized is attributable to the agent's policy: evidence tracking, source commitment, intermediate validation, and stopping criteria.
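
The bottleneck reasoning above reduces to simple arithmetic on SM scores. The sketch below assumes SM is a fraction in [0, 1] and that each per-axis delta is the all-ideal SM minus the SM of the run with only that axis set to a non-ideal mode; both the helper names and this choice of baseline are assumptions. The numbers are placeholders chosen for the example, not results from the paper.

```python
def component_deltas(sm_all_ideal: float, sm_axis_off: dict[str, float]) -> dict[str, float]:
    """SM gained by idealizing each axis, relative to the run with that axis off ideal."""
    return {axis: sm_all_ideal - sm for axis, sm in sm_axis_off.items()}


def bottleneck(deltas: dict[str, float]) -> str:
    """The axis whose idealization closes the largest gap."""
    return max(deltas, key=deltas.get)


def policy_gap(sm_all_ideal: float) -> float:
    """Gap left after idealizing every axis, attributed to the agent's policy."""
    return 1.0 - sm_all_ideal


# Placeholder scores for illustration only.
sm_all_ideal = 0.75
sm_axis_off = {"profile": 0.625, "search": 0.5, "compute": 0.625}

deltas = component_deltas(sm_all_ideal, sm_axis_off)
# deltas == {"profile": 0.125, "search": 0.25, "compute": 0.125}
worst = bottleneck(deltas)          # "search"
residual = policy_gap(sm_all_ideal)  # 0.25
```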

Cite

@inproceedings{sana2026,
  title     = {SANA: What Matters for QA Agents over Massive Data Lakes?},
  author    = {Wijaya, Austin Senna and Liu, Jiaxiang and Wang, Haonan and Wu, Eugene},
  booktitle = {VLDB 2026 Workshop on Systems for Data-centric Agents with Human-in-the-loop (DASHSys)},
  year      = {2026}
}

See the eval repo for the maintained CLI, the full flag set, and runnable artifacts.