Testing AI-Powered Analytics: A TEA-Driven Approach for Ameya Analyst

TEA Quality Principles: what additional principles analytics AI demands

Deterministic software testing standards fall apart when applied to analytics AI. You need a new approach to build production trust.

Ameya Analyst, our AI-powered analytics platform, translates natural language questions into data-driven insights. It's more than just a query interface; it's a multi-step reasoning engine. It combines NL to SQL generation, query execution, insight derivation, anomaly detection, and narrative generation in a single user interaction. I'll walk through how the TEA Quality Principles (Deterministic, Isolated, Explicit, Focused, Fast) apply to Ameya Analyst, where they break down, and what additional principles analytics AI demands.

Understanding Ameya Analyst's Capabilities

Before writing tests, you must understand the system's capability layers. Each layer has different testability, failure modes, and stakes.

| Capability | Description | Testability |
| --- | --- | --- |
| Natural Language → Query | Translates user questions into SQL or API calls | High: output is structured |
| Query Execution | Runs generated queries against data sources | High: result is deterministic given frozen data |
| Insight Generation | Derives meaning from result sets | Medium: structured output, but reasoning varies |
| Anomaly Detection | Identifies spikes, outliers, trend breaks | Medium: precision/recall measurable |
| Predictive Analytics | Forecasting, churn signals, demand | Low: probabilistic outputs require tolerance-based assertions |
| Data Storytelling | Narrative summaries of dashboards | Low: semantic evaluation required |
| Agentic Analysis | Multi-step: drill down → correlate → conclude | Very Low: compound non-determinism |

This understanding is the basis for any testing strategy.

The Silent Failure Problem in Analytics AI

Analytics AI has a unique risk: the silent failure. In document AI, a wrong invoice total is obvious. In analytics AI, wrong answers are credible.

Imagine this scenario: "Your top customer segment grew 12% YoY, driven by strong performance in the enterprise tier." This could be mathematically wrong, reference the wrong time window, exclude a key filter, or hallucinate a trend. It sounds confident, yet it is incorrect. This credible wrongness is the central risk in Ameya Analyst testing. Every principle is in service of catching these silent failures before they reach users.

Adapting TEA Quality Principles

Here's how we adapt the TEA principles for Ameya Analyst:

1. Deterministic → Probabilistic Acceptance with Frozen Data

Traditional principle: Same result every run.

Ameya Analyst has three sources of non-determinism: LLM variance, data variance, and reasoning variance.

To adapt, first freeze test data. Analytics tests must never run against live data. Use versioned dataset snapshots as test fixtures, like code. Version-control and never mutate them.

# Bad: test depends on live data state
result = ameya.query("Top vendors by spend last quarter")
assert result.top_vendor == "Vendor A"  # Will break when data changes

# Good: test runs against frozen fixture
with frozen_dataset("fixtures/vendor_spend_Q3_2024.parquet"):
    result = ameya.query("Top vendors by spend last quarter")
    assert result.top_vendor == "Vendor A"

For LLM variance, use temperature=0 for query generation. Shift assertions from exact outputs to structural and semantic properties:

# Don't assert on the generated text
# assert insight.narrative == "Revenue grew 12% in Q3"  ← Fragile

# Assert on the structured insight properties
assert insight.metric == "revenue"
assert insight.direction == "increase"
assert insight.magnitude_pct == approx(12.0, tolerance=0.5)
assert insight.period == "Q3-2024"
assert insight.data_source == "sales_transactions"

2. Isolated → Layer-by-Layer Boundary Testing

Traditional principle: No dependencies on other tests.

Ameya Analyst is a pipeline. One user question touches multiple systems: data warehouse, query engine, LLM reasoning, visualization, and external data. Testing the full pipeline in every test is fragile, slow, and expensive.

To adapt, test each layer in isolation:

┌────────────────────────────────────────────────────────────┐
│  Layer 1: NL → Query Generation                            │
│  Input:   Natural language + schema context                │
│  Mock:    Database (don't execute queries)                 │
│  Assert:  Is the generated SQL semantically correct?       │
├────────────────────────────────────────────────────────────┤
│  Layer 2: Query → Result Set                               │
│  Input:   SQL against frozen dataset snapshot              │
│  Mock:    LLM (feed pre-generated query)                   │
│  Assert:  Does the result match expected rows/aggregates?  │
├────────────────────────────────────────────────────────────┤
│  Layer 3: Result Set → Insight Generation                  │
│  Input:   Fixed structured data result                     │
│  Mock:    Query layer (inject canned result sets)          │
│  Assert:  Does insight correctly interpret the numbers?    │
├────────────────────────────────────────────────────────────┤
│  Layer 4: Insight → Narrative                              │
│  Input:   Structured insight object                        │
│  Mock:    Upstream reasoning                               │
│  Assert:  Is the narrative faithful to the insight object? │
├────────────────────────────────────────────────────────────┤
│  Layer 5: End-to-End                                       │
│  Input:   Full question against seeded dataset             │
│  Mock:    Nothing                                          │
│  Assert:  Complete user journey (run sparingly)            │
└────────────────────────────────────────────────────────────┘

The golden rule: A test must never depend on the data's current state.
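As a concrete example, here is a Layer 3 test in isolation: the query layer is bypassed entirely by injecting a canned result set. The derive_insight function below is a toy stand-in for Ameya Analyst's real insight step, shown only to make the isolation pattern runnable:

```python
# Layer 3 in isolation: feed a canned result set to the insight step and
# assert on the interpretation, never touching the warehouse or NL layer.

def derive_insight(rows):
    """Toy stand-in: compares two periods in a canned result set."""
    prior, current = rows[0]["revenue"], rows[1]["revenue"]
    change = (current - prior) / prior
    return {
        "metric": "revenue",
        "direction": "increase" if change > 0 else "decrease",
        "magnitude_pct": round(change * 100, 1),
    }

def test_insight_interprets_canned_result():
    # Canned fixture: revenue grew exactly 12% year over year.
    canned = [
        {"period": "Q3-2023", "revenue": 1_000_000},
        {"period": "Q3-2024", "revenue": 1_120_000},
    ]
    insight = derive_insight(canned)
    assert insight["metric"] == "revenue"
    assert insight["direction"] == "increase"
    assert abs(insight["magnitude_pct"] - 12.0) < 0.5

test_insight_interprets_canned_result()
```

Because the input rows are fixed in the test body, a failure here can only mean the insight step misread the numbers, not that the data shifted.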

3. Explicit: A New Assertion Vocabulary

Traditional principle: Assertions visible in the test body.

For generative analytics narratives, you cannot write assert output == "Revenue grew 12%". Standard assertions are insufficient.

To adapt, use three assertion categories:

Category A: Numerical Assertions with Tolerance

# Tolerance-based numerical assertion
assert_approx(result.total_spend, 4_250_000, tolerance_pct=1.0)

# Ranking assertion
assert result.rankings[0].vendor == "Vendor A"
assert result.rankings[0].rank == 1

# Trend assertions
assert result.trend.direction == "increasing"
assert result.trend.slope > 0
assert result.trend.r_squared > 0.85  # Is this a statistically real trend?
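The assert_approx helper used above is assumed; a minimal implementation of a percentage-tolerance assertion might look like this:

```python
def assert_approx(actual, expected, tolerance_pct=1.0):
    """Fail if actual deviates from expected by more than tolerance_pct percent."""
    allowed = abs(expected) * tolerance_pct / 100.0
    assert abs(actual - expected) <= allowed, (
        f"{actual} not within {tolerance_pct}% of {expected}"
    )

# 4,248,000 is within 1% of 4,250,000, so this passes.
assert_approx(4_248_000, 4_250_000, tolerance_pct=1.0)
```

Percentage tolerance scales with the magnitude of the metric, which matters when the same helper checks both row counts and multi-million spend totals.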

Category B: Semantic Assertions via LLM-as-Judge

judge_prompt_template = """
Given this underlying data: {actual_data}
And this generated insight: {generated_insight}

Evaluate the insight on these criteria:
1. Numerically accurate: does the text match the data?
2. Directionally correct: does the trend direction match?
3. Scope adherent: does it respect the requested filters and time window?
4. Hallucination-free: are all claims supported by the data?

Return JSON only:
{{
  "accurate": true/false,
  "issues": ["list of specific problems if any"]
}}
"""

# Braces in the literal JSON are doubled so str.format() leaves them intact.
prompt = judge_prompt_template.format(
    actual_data=actual_data,
    generated_insight=generated_insight,
)
evaluation = judge_model.evaluate(prompt)
assert evaluation["accurate"], f"Insight failed: {evaluation['issues']}"
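Judge models return strings, not guaranteed JSON; a defensive parser keeps the eval from flaking on wrapped or malformed output. This is a sketch with fail-closed defaults, not Ameya Analyst's actual parsing code:

```python
import json

def parse_judge_verdict(raw_text):
    """Judge models sometimes wrap JSON in prose or code fences; extract
    and validate the verdict defensively, failing closed on garbage."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end == -1:
        return {"accurate": False, "issues": ["judge returned no JSON"]}
    try:
        verdict = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        return {"accurate": False, "issues": [f"unparseable judge JSON: {exc}"]}
    if not isinstance(verdict.get("accurate"), bool):
        return {"accurate": False, "issues": ["missing boolean 'accurate' field"]}
    verdict.setdefault("issues", [])
    return verdict

ok = parse_judge_verdict('Here is my verdict:\n{"accurate": true, "issues": []}')
assert ok["accurate"] is True
```

Failing closed matters: a judge that returns prose should register as a failed eval, never as a silent pass.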

Category C: Provenance Assertions

# Every insight object should carry full provenance
insight = {
    "claim": "Revenue grew 12% YoY",
    "value": 0.12,
    "source_query": "SELECT SUM(revenue) FROM sales WHERE ...",
    "source_row_count": 1842,
    "calculation_method": "((current - prior) / prior)",
    "time_window": {
        "current": "Q3-2024",
        "prior":   "Q3-2023"
    }
}

# Test: independently verify the claim from source
assert verify_from_source(insight)
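What verify_from_source does is left implicit above. One way to sketch it, shown here taking the source rows explicitly as a second argument so the example is self-contained; the row schema and half-percent tolerance are assumptions:

```python
def verify_from_source(insight, rows):
    """Recompute the claimed value from raw source rows using the stated
    time windows, instead of trusting the model's arithmetic."""
    def total(period):
        return sum(r["revenue"] for r in rows if r["period"] == period)

    current = total(insight["time_window"]["current"])
    prior = total(insight["time_window"]["prior"])
    recomputed = (current - prior) / prior
    return abs(recomputed - insight["value"]) < 0.005  # half-percent tolerance

rows = [
    {"period": "Q3-2023", "revenue": 500_000},
    {"period": "Q3-2023", "revenue": 500_000},
    {"period": "Q3-2024", "revenue": 1_120_000},
]
claim = {"value": 0.12, "time_window": {"current": "Q3-2024", "prior": "Q3-2023"}}
assert verify_from_source(claim, rows)
```

The key property: the verifier never reads the model's calculation, only its provenance fields, so an arithmetic hallucination cannot verify itself.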

4. Focused: Natural Layer Boundaries

Traditional principle: Single responsibility, appropriate size.

Ameya Analyst has clear boundaries that map to focused tests. Resist the temptation to test everything in one scenario.

| Test Focus | What It Tests | What It Explicitly Does NOT Test |
| --- | --- | --- |
| Query Accuracy | NL → correct SQL semantics | Database performance |
| Calculation Accuracy | Math correctness with right base period | Narrative quality |
| Insight Relevance | Right insight for the question asked | Calculation method |
| Anomaly Precision | Seeded anomalies correctly flagged | Narrative explanation |
| Anomaly Recall | No seeded anomalies missed | Visualization output |
| Narrative Faithfulness | Text accurately reflects numbers | Whether numbers are correct |
| Hallucination Detection | No invented claims or metrics | Whether claims are insightful |
| Scope Adherence | Filters and date ranges respected | Downstream formatting |

Define and enforce these boundaries in your test architecture.

5. Fast: The Compound Latency Problem

Traditional principle: Execute in reasonable time.

Analytics AI has stacked latency:

NL → Query Generation:        2–5s    (LLM inference)
Query Execution:              1–30s   (depends on data volume and warehouse)
Insight Generation:           3–8s    (LLM inference)
Narrative Generation:         2–5s    (LLM inference)
Multi-step agentic chain:     × N     (for each reasoning step)
────────────────────────────────────
Single E2E test, worst case:  ~60s+

A full test suite of 200 end-to-end scenarios could take hours.

To adapt, use a three-tier execution model:

┌────────────────────────────────────────────────────────────┐
│  TIER 1: Every Commit (~30s total)                         │
│  • SQL generation correctness (no DB execution, parse only)│
│  • Schema validation of insight JSON                       │
│  • Null and type checks on structured outputs              │
│  • Fast: no LLM calls if possible, mock everything         │
├────────────────────────────────────────────────────────────┤
│  TIER 2: Every Pull Request (~5 minutes total)             │
│  • 20–30 NL→Query→Result tests against in-memory SQLite    │
│  • Insight accuracy on 15–20 canned result sets            │
│  • Hallucination detection on known-answer questions       │
│  • Numerical correctness on critical metric calculations   │
├────────────────────────────────────────────────────────────┤
│  TIER 3: Nightly / Pre-Release (30–60 minutes)             │
│  • Full eval suite: 200+ questions across all analytics    │
│  • Anomaly detection precision/recall on seeded datasets   │
│  • Multi-step agentic analysis end-to-end                  │
│  • Adversarial inputs (ambiguous, trick, edge-case)        │
│  • Regression: insight quality vs. previous model version  │
└────────────────────────────────────────────────────────────┘
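In pytest, tiers are typically expressed as markers selected with `-m tier1` at commit time. To keep the example self-contained, here is the same idea as a plain-Python tier registry; the registry and test body are illustrative:

```python
# Minimal tier registry: tag tests with a tier, run only that tier per
# pipeline stage. (pytest markers + `pytest -m tier1` achieve the same.)
TIERS = {"tier1": [], "tier2": [], "tier3": []}

def tier(name):
    def register(fn):
        TIERS[name].append(fn)
        return fn
    return register

@tier("tier1")
def test_insight_json_has_required_fields():
    # Commit-time check: structural validation only, no LLM call.
    insight = {"metric": "revenue", "direction": "increase", "magnitude_pct": 12.0}
    assert {"metric", "direction", "magnitude_pct"} <= insight.keys()

def run_tier(name):
    for test in TIERS[name]:
        test()
    return len(TIERS[name])

assert run_tier("tier1") == 1  # the commit gate runs only the fast tier
```

The point of the registry is that test selection is data, so CI can dial cost per trigger without touching test bodies.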

Also consider these fast-test strategies:

  • Use in-memory SQLite with seeded data for query execution tests
  • Cache model outputs for known inputs during development iteration
  • Use a smaller/faster model for regression checks
  • Track latency and cost as first-class test metrics. An analysis taking 45 seconds is a failing test.
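The caching strategy above can be sketched as a small memoizing decorator keyed on prompt and model version. The fake_model below is a stand-in for a real inference call; names are illustrative:

```python
import functools
import hashlib
import json

def cached_llm(call_model):
    """Memoize model outputs by a hash of (prompt, model_version) so
    repeated dev-loop runs skip inference entirely."""
    cache = {}

    @functools.wraps(call_model)
    def wrapper(prompt, model_version="v1"):
        key = hashlib.sha256(
            json.dumps([prompt, model_version]).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = call_model(prompt, model_version)
        return cache[key]

    wrapper.cache = cache
    return wrapper

calls = []

@cached_llm
def fake_model(prompt, model_version):
    calls.append(prompt)  # record real inference calls
    return f"answer to: {prompt}"

fake_model("Top vendors by spend")
fake_model("Top vendors by spend")  # second call is served from cache
assert len(calls) == 1
```

Keying on model version matters: bumping the model must invalidate the cache, or regression runs silently compare stale outputs.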

Additional Principles for Ameya Analyst

TEA's five principles aren't enough for analytics AI. You need three more.

6. Traceable

An analytics insight like "Vendor concentration risk increased 23% this quarter" must be traceable to:

  • The exact dataset snapshot used
  • The exact query generated and executed
  • The model and prompt version
  • The calculation method and intermediate values

Without traceability, debugging is impossible.

Implementation: Every test result and production insight should carry a trace ID linking back to all of the above. Tools like LangSmith, Braintrust, and Weave are purpose-built for this.
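A minimal sketch of such a provenance record, with a content-addressed trace ID. Field names and the example values are illustrative, not Ameya Analyst's actual schema:

```python
import dataclasses
import hashlib
import json

@dataclasses.dataclass(frozen=True)
class Trace:
    """Provenance record attached to every test result and insight."""
    dataset_snapshot: str
    query: str
    model_version: str
    prompt_version: str

    @property
    def trace_id(self):
        # Content-addressed: identical provenance always yields the same ID.
        payload = json.dumps(dataclasses.asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]

t = Trace(
    dataset_snapshot="fixtures/vendor_spend_Q3_2024.parquet",
    query="SELECT vendor, SUM(spend) FROM spend GROUP BY vendor",
    model_version="model-v12",
    prompt_version="insight-prompt-v12",
)
assert len(t.trace_id) == 16
```

Because the ID is derived from the provenance fields themselves, two runs with identical inputs share a trace ID, which makes regressions diffable across runs.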

7. Grounded

Grounding is the analytics equivalent of citation. Every claim must be independently verifiable from source data.

def test_insight_is_fully_grounded(insight, source_data):
    for claim in insight.parsed_claims:
        # Recompute each claimed value from source; don't trust the model's math
        recomputed = source_data.compute(claim.metric, claim.filters)
        assert claim.value == approx(recomputed), \
            f"Claim '{claim.text}' is not grounded in source data"

Grounding tests are high-value for Ameya Analyst.

8. Adversarially Tested

Analytics AI faces adversarial inputs that expose systemic reasoning failures.

| Adversarial Input Type | Example | Risk |
| --- | --- | --- |
| Ambiguous metric | "Show me performance" | Wrong metric chosen silently |
| Conflicting time filters | "Q3 vs last year Q3" | Wrong date math |
| Missing data segment | Asking about a category with no data | Hallucinated answer |
| Leading questions | "Why did sales drop?" (when they rose) | Confabulated explanation |
| Large number edge cases | Values in billions with 8+ decimal precision | Rounding/formatting errors |
| Non-existent schema references | Asking about a column that doesn't exist | Hallucination risk |
| Double-negation queries | "Show non-enterprise vendors excluding inactive" | Filter logic errors |

Build a dedicated adversarial test suite for these scenarios.
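One way to structure such a suite is as data, so new failure modes become one-line additions. The expected_behavior labels and the stub_system stand-in below are illustrative, not Ameya Analyst's API:

```python
# Adversarial cases as data: (question, behavior the system should exhibit).
ADVERSARIAL_CASES = [
    ("Show me performance", "asks_for_clarification"),
    ("Why did sales drop?", "verifies_premise_before_explaining"),
    ("Top vendors in the 'unicorn' category", "reports_no_data"),
]

def run_adversarial_case(question, expected_behavior, system):
    response = system(question)
    assert response["behavior"] == expected_behavior, (
        f"{question!r}: expected {expected_behavior}, got {response['behavior']}"
    )

def stub_system(question):
    """Stand-in for the real analytics endpoint, wired to pass the cases."""
    table = {
        "Show me performance": "asks_for_clarification",
        "Why did sales drop?": "verifies_premise_before_explaining",
    }
    return {"behavior": table.get(question, "reports_no_data")}

for question, expected in ADVERSARIAL_CASES:
    run_adversarial_case(question, expected, stub_system)
```

Note the asserted property is a behavior, not an answer: for adversarial inputs the correct outcome is often refusing, clarifying, or reporting absence of data.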

Complete Testing Principles

| Principle | Origin | Description |
| --- | --- | --- |
| Deterministic | TEA | Use frozen data; assert on properties, not values |
| Isolated | TEA | Test each pipeline layer independently |
| Explicit | TEA | Use numerical, semantic, and provenance assertions |
| Focused | TEA | One test, one layer, one responsibility |
| Fast | TEA | Three-tier execution model; in-memory fixtures |
| Traceable | Extended | Every result links to model, prompt, data, and query versions |
| Grounded | Extended | Every claim verifiable from source data |
| Adversarially Tested | Extended | Explicit coverage of ambiguous, misleading, and edge-case inputs |

Does TEA Work?

| TEA Principle | Applicability | Gap and Mitigation |
| --- | --- | --- |
| Deterministic | Low natively | Freeze data + structured output assertions + temperature=0 |
| Isolated | Medium | Achievable with layer architecture but requires upfront design |
| Explicit | Medium | Requires new assertion vocabulary: numerical, semantic, provenance |
| Focused | High | Natural layer boundaries make this achievable |
| Fast | Low natively | Requires three-tier execution and in-memory fixtures |

The Mindset Shift

Traditional testing asks: "Is it correct?"

Ameya Analyst testing asks: "Can we prove it's correct, and do we know when it isn't?"

The second leads to measurement systems: distributions, thresholds, trends, and provenance chains.

Analytics AI requires frozen data, layered isolation, provenance tracking, semantic evaluation, and adversarial coverage working together. That is the bar Ameya Analyst needs to be trustworthy. TEA, extended, is the framework to get there.

Quick Reference Checklist

□ Are all tests running against frozen, versioned dataset snapshots?
□ Is each test scoped to a single pipeline layer?
□ Are numerical assertions using tolerance-based comparisons?
□ Do insight objects carry full provenance (query, method, time window)?
□ Is there a grounding check that verifies claims against source data?
□ Is there an LLM-as-judge eval for narrative outputs?
□ Are tests tiered across commit / PR / nightly execution?
□ Is there an adversarial test suite covering ambiguous and edge-case inputs?
□ Are model version, prompt version, and dataset version logged with every test run?
□ Are latency and cost tracked as first-class test metrics?

This post is part of the Ameya AI Engineering series on building trustworthy AI systems. For questions on the testing framework, reach out to the QA & AI Engineering team.

I'm Gangadhar Neeli from Ameya - Engineering. I hope this helps you on your AI journey. We've seen these principles reduce processing time by 40% internally. If you're interested in how Ameya Extract, Ameya AI Agents, or Ameya Analyst can help your organization, please contact us.

Gangadhar Neeli

Ameya - Engineering

Visionary technology leader with 26+ years of experience driving strategic initiatives across Enterprise IT, with deep expertise in application rationalization, AI-led modernization, and enterprise platform architecture.

I've seen firsthand how challenging it is to test AI analytics. If you're struggling to build trust in your AI insights, I'd be happy to share how Ameya Analyst can help. Let's book a demo and discuss your specific needs.
