SANA

How SANA works

When an LLM agent fails at Exploratory QA over a data lake, which part of its runtime is to blame? An agent answers in three stages: planning, search, and data analysis. SANA replaces one stage at a time with an oracle built from the task's ground truth; the accuracy gained from each swap pinpoints the bottleneck.

1. Planning

Breaks the question into an ordered list of sub-questions.

Oracle: hands over the correct sub-questions directly.

2. Search

Hunts a ~40M-file lake for the datasets each sub-question needs.

Oracle: returns only the datasets the task actually requires.

3. Data Analysis

Writes and runs SQL/Python to compute each intermediate answer.

Oracle: executes the agent's intent with zero implementation bugs.

Swap any oracle for your own implementation and watch the delta move — see the Pipeline.
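
The one-stage-at-a-time swap can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SANA's actual interface: `Task`, `Pipeline`, `ablate`, the toy stages, and the toy oracles are all hypothetical names, exact string match stands in for semantic match, and the four tasks are made up. In particular, the analysis oracle here is only a stand-in for "zero implementation bugs": it succeeds whenever the datasets it receives cover what the task needs.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Every name below is hypothetical and chosen for illustration only.

@dataclass(frozen=True)
class Task:
    question: str
    gold_subquestions: list[str]
    gold_datasets: list[str]
    gold_answer: str

@dataclass(frozen=True)
class Pipeline:
    plan: Callable[[Task], list[str]]
    search: Callable[[Task, list[str]], list[str]]
    analyze: Callable[[Task, list[str], list[str]], str]

    def run(self, task: Task) -> str:
        subqs = self.plan(task)
        datasets = self.search(task, subqs)
        return self.analyze(task, subqs, datasets)

# Toy agent stages with deliberate weaknesses.
def toy_plan(task):                      # never decomposes the question
    return [task.question]

def toy_search(task, subqs):             # finds one dataset per sub-question
    return task.gold_datasets[:len(subqs)]

def toy_analyze(task, subqs, datasets):  # simulated bug on "median" questions
    covered = set(task.gold_datasets) <= set(datasets)
    return task.gold_answer if covered and "median" not in task.question else "wrong"

# Toy oracles built from each task's ground truth.
def oracle_plan(task):
    return task.gold_subquestions

def oracle_search(task, subqs):
    return task.gold_datasets

def oracle_analyze(task, subqs, datasets):
    # Stand-in for bug-free execution: correct whenever the inputs suffice.
    return task.gold_answer if set(task.gold_datasets) <= set(datasets) else "wrong"

def accuracy(pipeline, tasks):
    # Exact match is an assumption; it stands in for semantic match.
    return sum(pipeline.run(t) == t.gold_answer for t in tasks) / len(tasks)

def ablate(agent, oracles, tasks):
    """Swap in one oracle at a time and report the gain over the baseline."""
    base = accuracy(agent, tasks)
    gains = {stage: accuracy(replace(agent, **{stage: oracle}), tasks) - base
             for stage, oracle in oracles.items()}
    return base, gains

# Made-up tasks, not drawn from any benchmark.
tasks = [
    Task("total sales in 2020", ["total sales in 2020"], ["sales"], "120"),
    Task("compare sales and returns", ["total sales", "total returns"],
         ["sales", "returns"], "sales higher"),
    Task("median listing price", ["median listing price"], ["listings"], "350"),
    Task("average rating", ["average rating"], ["reviews"], "4.2"),
]
agent = Pipeline(plan=toy_plan, search=toy_search, analyze=toy_analyze)
oracles = {"plan": oracle_plan, "search": oracle_search, "analyze": oracle_analyze}
base, gains = ablate(agent, oracles, tasks)
print(base, gains)
```

Keeping each stage behind a plain callable means an oracle and a real implementation are interchangeable, so the only thing that changes between runs is the stage under test.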

Paper (arXiv) Code (GitHub)

The paper link is a placeholder pending the arXiv release.

Citation

To appear at the VLDB 2026 Workshop on Systems for Data-centric Agents with Human-in-the-loop (DASHSys).

@inproceedings{sana2026,
  title     = {SANA: What Matters for QA Agents over Massive Data Lakes?},
  author    = {Wijaya, Austin Senna and Liu, Jiaxiang and Wang, Haonan and Wu, Eugene},
  booktitle = {VLDB 2026 Workshop on Systems for Data-centric Agents with Human-in-the-loop (DASHSys)},
  year      = {2026}
}

Ablation deltas

Semantic-match gain from idealizing each component (baseline = first bar of each group).
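
To make the caption's convention concrete, here is a small sketch of how each delta is read off a group of bars: the first bar is the baseline, and every other bar is reported as its score minus that baseline. The function name, group names, bar labels, and scores are all placeholders invented for this example; they are not results from the paper.

```python
def deltas_vs_first(groups):
    """For each group, subtract the first bar's score from every later bar."""
    out = {}
    for group, bars in groups.items():
        (_, base_score), *rest = bars  # first bar of the group is the baseline
        out[group] = {label: score - base_score for label, score in rest}
    return out

# Placeholder scores only; NOT numbers from the paper.
groups = {
    "benchmark_a": [("baseline", 0.5), ("oracle search", 0.75), ("oracle analysis", 0.625)],
    "benchmark_b": [("baseline", 0.25), ("oracle search", 0.3125), ("oracle analysis", 0.5)],
}
deltas = deltas_vs_first(groups)
for group, gains in deltas.items():
    print(group, gains)
```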

What we found

Data analysis is the consistent bottleneck

Idealizing execution gives large gains on both benchmarks (up to +24.1%), even when sources are already found.

Search dominates on the large lake

On LakeQA's ~40M-file lake, ideal search beats BM25 by +13–14%; on the small KramaBench it matters far less.

Plans are written, not followed

Agents produce near-gold decompositions (~78–82% match) yet follow them only ~28–57% of the time.