Testing AI-Powered Analytics: A TEA-Driven Approach for Ameya Analyst

TEA Quality Principles: what additional principles analytics AI demands

Deterministic software testing standards fall apart when applied to analytics AI. You need a new approach to build production trust.

Ameya Analyst, our AI-powered analytics platform, translates natural language questions into data-driven insights. It's more than just a query interface; it's a multi-step reasoning engine. It combines NL to SQL generation, query execution, insight derivation, anomaly detection, and narrative generation in a single user interaction. I'll walk through how the TEA Quality Principles (Deterministic, Isolated, Explicit, Focused, Fast) apply to Ameya Analyst, where they break down, and what additional principles analytics AI demands.

Understanding Ameya Analyst's Capabilities

Before writing tests, you must understand the system's capability layers. Each layer has different testability, failure modes, and stakes.

| Capability | Description | Testability |
| --- | --- | --- |
| Natural Language → Query | Translates user questions into SQL or API calls | High: output is structured |
| Query Execution | Runs generated queries against data sources | High: result is deterministic given frozen data |
| Insight Generation | Derives meaning from result sets | Medium: structured output, but reasoning varies |
| Anomaly Detection | Identifies spikes, outliers, trend breaks | Medium: precision/recall measurable |
| Predictive Analytics | Forecasting, churn signals, demand | Low: probabilistic outputs require tolerance-based assertions |
| Data Storytelling | Narrative summaries of dashboards | Low: semantic evaluation required |
| Agentic Analysis | Multi-step: drill down → correlate → conclude | Very Low: compound non-determinism |

This understanding is the basis for any testing strategy.

The Silent Failure Problem in Analytics AI

Analytics AI has a unique risk: the silent failure. In document AI, a wrong invoice total is obvious. In analytics AI, wrong answers are credible.

Imagine this scenario: "Your top customer segment grew 12% YoY, driven by strong performance in the enterprise tier." This could be mathematically wrong, reference the wrong time window, exclude a key filter, or hallucinate a trend. It sounds confident, yet it is incorrect. This credible wrongness is the central risk in Ameya Analyst testing. Every principle is in service of catching these silent failures before they reach users.

Adapting TEA Quality Principles

Here's how we adapt the TEA principles for Ameya Analyst:

1. Deterministic → Probabilistic Acceptance with Frozen Data

Traditional principle: Same result every run.

Ameya Analyst has three sources of non-determinism: LLM variance, data variance, and reasoning variance.

To adapt, first freeze test data. Analytics tests must never run against live data. Use versioned dataset snapshots as test fixtures, like code. Version-control and never mutate them.

# Bad: test depends on live data state
result = ameya.query("Top vendors by spend last quarter")
assert result.top_vendor == "Vendor A"  # Will break when data changes

# Good: test runs against frozen fixture
with frozen_dataset("fixtures/vendor_spend_Q3_2024.parquet"):
    result = ameya.query("Top vendors by spend last quarter")
    assert result.top_vendor == "Vendor A"

For LLM variance, use temperature=0 for query generation. Shift assertions from exact outputs to structural and semantic properties:

# Don't assert on the generated text
# assert insight.narrative == "Revenue grew 12% in Q3"  ← Fragile

# Assert on the structured insight properties
assert insight.metric == "revenue"
assert insight.direction == "increase"
assert insight.magnitude_pct == approx(12.0, tolerance=0.5)
assert insight.period == "Q3-2024"
assert insight.data_source == "sales_transactions"

2. Isolated → Layer-by-Layer Boundary Testing

Traditional principle: No dependencies on other tests.

Ameya Analyst is a pipeline. One user question touches multiple systems: data warehouse, query engine, LLM reasoning, visualization, and external data. Testing the full pipeline in every test is fragile, slow, and expensive.

To adapt, test each layer in isolation:

┌────────────────────────────────────────────────────────────┐
│  Layer 1: NL → Query Generation                            │
│  Input:   Natural language + schema context                │
│  Mock:    Database (don't execute queries)                 │
│  Assert:  Is the generated SQL semantically correct?       │
├────────────────────────────────────────────────────────────┤
│  Layer 2: Query → Result Set                               │
│  Input:   SQL against frozen dataset snapshot              │
│  Mock:    LLM (feed pre-generated query)                   │
│  Assert:  Does the result match expected rows/aggregates?  │
├────────────────────────────────────────────────────────────┤
│  Layer 3: Result Set → Insight Generation                  │
│  Input:   Fixed structured data result                     │
│  Mock:    Query layer (inject canned result sets)          │
│  Assert:  Does insight correctly interpret the numbers?    │
├────────────────────────────────────────────────────────────┤
│  Layer 4: Insight → Narrative                              │
│  Input:   Structured insight object                        │
│  Mock:    Upstream reasoning                               │
│  Assert:  Is the narrative faithful to the insight object? │
├────────────────────────────────────────────────────────────┤
│  Layer 5: End-to-End                                       │
│  Input:   Full question against seeded dataset             │
│  Mock:    Nothing                                          │
│  Assert:  Complete user journey (run sparingly)            │
└────────────────────────────────────────────────────────────┘

The golden rule: A test must never depend on the data's current state.
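As a concrete example, here is a Layer 3 test in isolation: the query layer is bypassed entirely by injecting a canned result set. The derive_insight function below is a toy stand-in for Ameya Analyst's real insight step, shown only to make the isolation pattern runnable:

```python
# Layer 3 in isolation: feed a canned result set to the insight step and
# assert on the interpretation, never touching the warehouse or NL layer.

def derive_insight(rows):
    """Toy stand-in: compares two periods in a canned result set."""
    prior, current = rows[0]["revenue"], rows[1]["revenue"]
    change = (current - prior) / prior
    return {
        "metric": "revenue",
        "direction": "increase" if change > 0 else "decrease",
        "magnitude_pct": round(change * 100, 1),
    }

def test_insight_interprets_canned_result():
    # Canned fixture: revenue grew exactly 12% year over year.
    canned = [
        {"period": "Q3-2023", "revenue": 1_000_000},
        {"period": "Q3-2024", "revenue": 1_120_000},
    ]
    insight = derive_insight(canned)
    assert insight["metric"] == "revenue"
    assert insight["direction"] == "increase"
    assert abs(insight["magnitude_pct"] - 12.0) < 0.5

test_insight_interprets_canned_result()
```

Because the input rows are fixed in the test body, a failure here can only mean the insight step misread the numbers, not that the data shifted.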

3. Explicit: A New Assertion Vocabulary

Traditional principle: Assertions visible in the test body.

For generative analytics narratives, you cannot write assert output == "Revenue grew 12%". Standard assertions are insufficient.

To adapt, use three assertion categories:

Category A: Numerical Assertions with Tolerance

# Tolerance-based numerical assertion
assert_approx(result.total_spend, 4_250_000, tolerance_pct=1.0)

# Ranking assertion
assert result.rankings[0].vendor == "Vendor A"
assert result.rankings[0].rank == 1

# Trend assertions
assert result.trend.direction == "increasing"
assert result.trend.slope > 0
assert result.trend.r_squared > 0.85  # Is this a statistically real trend?
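The assert_approx helper used above is assumed; a minimal implementation of a percentage-tolerance assertion might look like this:

```python
def assert_approx(actual, expected, tolerance_pct=1.0):
    """Fail if actual deviates from expected by more than tolerance_pct percent."""
    allowed = abs(expected) * tolerance_pct / 100.0
    assert abs(actual - expected) <= allowed, (
        f"{actual} not within {tolerance_pct}% of {expected}"
    )

# 4,248,000 is within 1% of 4,250,000, so this passes.
assert_approx(4_248_000, 4_250_000, tolerance_pct=1.0)
```

Percentage tolerance scales with the magnitude of the metric, which matters when the same helper checks both row counts and multi-million spend totals.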

Category B: Semantic Assertions via LLM-as-Judge

judge_prompt_template = """
Given this underlying data: {actual_data}
And this generated insight: {generated_insight}

Evaluate the insight on these criteria:
1. Numerically accurate: does the text match the data?
2. Directionally correct: does the trend direction match?
3. Scope adherent: does it respect the requested filters and time window?
4. Hallucination-free: are all claims supported by the data?

Return JSON only:
{{
  "accurate": true/false,
  "issues": ["list of specific problems if any"]
}}
"""

# Braces in the literal JSON are doubled so str.format() leaves them intact.
prompt = judge_prompt_template.format(
    actual_data=actual_data,
    generated_insight=generated_insight,
)
evaluation = judge_model.evaluate(prompt)
assert evaluation["accurate"], f"Insight failed: {evaluation['issues']}"
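Judge models return strings, not guaranteed JSON; a defensive parser keeps the eval from flaking on wrapped or malformed output. This is a sketch with fail-closed defaults, not Ameya Analyst's actual parsing code:

```python
import json

def parse_judge_verdict(raw_text):
    """Judge models sometimes wrap JSON in prose or code fences; extract
    and validate the verdict defensively, failing closed on garbage."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end == -1:
        return {"accurate": False, "issues": ["judge returned no JSON"]}
    try:
        verdict = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        return {"accurate": False, "issues": [f"unparseable judge JSON: {exc}"]}
    if not isinstance(verdict.get("accurate"), bool):
        return {"accurate": False, "issues": ["missing boolean 'accurate' field"]}
    verdict.setdefault("issues", [])
    return verdict

ok = parse_judge_verdict('Here is my verdict:\n{"accurate": true, "issues": []}')
assert ok["accurate"] is True
```

Failing closed matters: a judge that returns prose should register as a failed eval, never as a silent pass.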

Category C: Provenance Assertions

# Every insight object should carry full provenance
insight = {
    "claim": "Revenue grew 12% YoY",
    "value": 0.12,
    "source_query": "SELECT SUM(revenue) FROM sales WHERE ...",
    "source_row_count": 1842,
    "calculation_method": "((current - prior) / prior)",
    "time_window": {
        "current": "Q3-2024",
        "prior":   "Q3-2023"
    }
}

# Test: independently verify the claim from source
assert verify_from_source(insight)
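What verify_from_source does is left implicit above. One way to sketch it, shown here taking the source rows explicitly as a second argument so the example is self-contained; the row schema and half-percent tolerance are assumptions:

```python
def verify_from_source(insight, rows):
    """Recompute the claimed value from raw source rows using the stated
    time windows, instead of trusting the model's arithmetic."""
    def total(period):
        return sum(r["revenue"] for r in rows if r["period"] == period)

    current = total(insight["time_window"]["current"])
    prior = total(insight["time_window"]["prior"])
    recomputed = (current - prior) / prior
    return abs(recomputed - insight["value"]) < 0.005  # half-percent tolerance

rows = [
    {"period": "Q3-2023", "revenue": 500_000},
    {"period": "Q3-2023", "revenue": 500_000},
    {"period": "Q3-2024", "revenue": 1_120_000},
]
claim = {"value": 0.12, "time_window": {"current": "Q3-2024", "prior": "Q3-2023"}}
assert verify_from_source(claim, rows)
```

The key property: the verifier never reads the model's calculation, only its provenance fields, so an arithmetic hallucination cannot verify itself.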

4. Focused: Natural Layer Boundaries

Traditional principle: Single responsibility, appropriate size.

Ameya Analyst has clear boundaries that map to focused tests. Resist the temptation to test everything in one scenario.

| Test Focus | What It Tests | What It Explicitly Does NOT Test |
| --- | --- | --- |
| Query Accuracy | NL → correct SQL semantics | Database performance |
| Calculation Accuracy | Math correctness with right base period | Narrative quality |
| Insight Relevance | Right insight for the question asked | Calculation method |
| Anomaly Precision | Seeded anomalies correctly flagged | Narrative explanation |
| Anomaly Recall | No seeded anomalies missed | Visualization output |
| Narrative Faithfulness | Text accurately reflects numbers | Whether numbers are correct |
| Hallucination Detection | No invented claims or metrics | Whether claims are insightful |
| Scope Adherence | Filters and date ranges respected | Downstream formatting |

Define and enforce these boundaries in your test architecture.

5. Fast: The Compound Latency Problem

Traditional principle: Execute in reasonable time.

Analytics AI has stacked latency:

NL → Query Generation:        2–5s    (LLM inference)
Query Execution:              1–30s   (depends on data volume and warehouse)
Insight Generation:           3–8s    (LLM inference)
Narrative Generation:         2–5s    (LLM inference)
Multi-step agentic chain:     × N     (for each reasoning step)
────────────────────────────────────
Single E2E test, worst case:  ~60s+

A full test suite of 200 end-to-end scenarios could take hours.

To adapt, use a three-tier execution model:

┌────────────────────────────────────────────────────────────┐
│  TIER 1: Every Commit (~30s total)                         │
│  • SQL generation correctness (no DB execution, parse only)│
│  • Schema validation of insight JSON                       │
│  • Null and type checks on structured outputs              │
│  • Fast: no LLM calls if possible, mock everything         │
├────────────────────────────────────────────────────────────┤
│  TIER 2: Every Pull Request (~5 minutes total)             │
│  • 20–30 NL→Query→Result tests against in-memory SQLite    │
│  • Insight accuracy on 15–20 canned result sets            │
│  • Hallucination detection on known-answer questions       │
│  • Numerical correctness on critical metric calculations   │
├────────────────────────────────────────────────────────────┤
│  TIER 3: Nightly / Pre-Release (30–60 minutes)             │
│  • Full eval suite: 200+ questions across all analytics    │
│  • Anomaly detection precision/recall on seeded datasets   │
│  • Multi-step agentic analysis end-to-end                  │
│  • Adversarial inputs (ambiguous, trick, edge-case)        │
│  • Regression: insight quality vs. previous model version  │
└────────────────────────────────────────────────────────────┘
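In pytest, tiers are typically expressed as markers selected with `-m tier1` at commit time. To keep the example self-contained, here is the same idea as a plain-Python tier registry; the registry and test body are illustrative:

```python
# Minimal tier registry: tag tests with a tier, run only that tier per
# pipeline stage. (pytest markers + `pytest -m tier1` achieve the same.)
TIERS = {"tier1": [], "tier2": [], "tier3": []}

def tier(name):
    def register(fn):
        TIERS[name].append(fn)
        return fn
    return register

@tier("tier1")
def test_insight_json_has_required_fields():
    # Commit-time check: structural validation only, no LLM call.
    insight = {"metric": "revenue", "direction": "increase", "magnitude_pct": 12.0}
    assert {"metric", "direction", "magnitude_pct"} <= insight.keys()

def run_tier(name):
    for test in TIERS[name]:
        test()
    return len(TIERS[name])

assert run_tier("tier1") == 1  # the commit gate runs only the fast tier
```

The point of the registry is that test selection is data, so CI can dial cost per trigger without touching test bodies.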

Also consider these fast-test strategies:

  • Use in-memory SQLite with seeded data for query execution tests
  • Cache model outputs for known inputs during development iteration
  • Use a smaller/faster model for regression checks
  • Track latency and cost as first-class test metrics. An analysis taking 45 seconds is a failing test.
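The caching strategy above can be sketched as a small memoizing decorator keyed on prompt and model version. The fake_model below is a stand-in for a real inference call; names are illustrative:

```python
import functools
import hashlib
import json

def cached_llm(call_model):
    """Memoize model outputs by a hash of (prompt, model_version) so
    repeated dev-loop runs skip inference entirely."""
    cache = {}

    @functools.wraps(call_model)
    def wrapper(prompt, model_version="v1"):
        key = hashlib.sha256(
            json.dumps([prompt, model_version]).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = call_model(prompt, model_version)
        return cache[key]

    wrapper.cache = cache
    return wrapper

calls = []

@cached_llm
def fake_model(prompt, model_version):
    calls.append(prompt)  # record real inference calls
    return f"answer to: {prompt}"

fake_model("Top vendors by spend")
fake_model("Top vendors by spend")  # second call is served from cache
assert len(calls) == 1
```

Keying on model version matters: bumping the model must invalidate the cache, or regression runs silently compare stale outputs.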

Additional Principles for Ameya Analyst

TEA's five principles aren't enough for analytics AI. You need three more.

6. Traceable

An analytics insight like "Vendor concentration risk increased 23% this quarter" must be traceable to:

  • The exact dataset snapshot used
  • The exact query generated and executed
  • The model and prompt version
  • The calculation method and intermediate values

Without traceability, debugging is impossible.

Implementation: Every test result and production insight should carry a trace ID linking back to all of the above. Tools like LangSmith, Braintrust, and Weave are purpose-built for this.
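A minimal sketch of such a provenance record, with a content-addressed trace ID. Field names and the example values are illustrative, not Ameya Analyst's actual schema:

```python
import dataclasses
import hashlib
import json

@dataclasses.dataclass(frozen=True)
class Trace:
    """Provenance record attached to every test result and insight."""
    dataset_snapshot: str
    query: str
    model_version: str
    prompt_version: str

    @property
    def trace_id(self):
        # Content-addressed: identical provenance always yields the same ID.
        payload = json.dumps(dataclasses.asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]

t = Trace(
    dataset_snapshot="fixtures/vendor_spend_Q3_2024.parquet",
    query="SELECT vendor, SUM(spend) FROM spend GROUP BY vendor",
    model_version="model-v12",
    prompt_version="insight-prompt-v12",
)
assert len(t.trace_id) == 16
```

Because the ID is derived from the provenance fields themselves, two runs with identical inputs share a trace ID, which makes regressions diffable across runs.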

7. Grounded

Grounding is the analytics equivalent of citation. Every claim must be independently verifiable from source data.

def test_insight_is_fully_grounded(insight, source_data):
    for claim in insight.parsed_claims:
        # Recompute each claimed value from source; don't trust the model's math
        recomputed = source_data.compute(claim.metric, claim.filters)
        assert claim.value == approx(recomputed), \
            f"Claim '{claim.text}' is not grounded in source data"

Grounding tests are high-value for Ameya Analyst.

8. Adversarially Tested

Analytics AI faces adversarial inputs that expose systemic reasoning failures.

| Adversarial Input Type | Example | Risk |
| --- | --- | --- |
| Ambiguous metric | "Show me performance" | Wrong metric chosen silently |
| Conflicting time filters | "Q3 vs last year Q3" | Wrong date math |
| Missing data segment | Asking about a category with no data | Hallucinated answer |
| Leading questions | "Why did sales drop?" (when they rose) | Confabulated explanation |
| Large number edge cases | Values in billions with 8+ decimal precision | Rounding/formatting errors |
| Non-existent schema references | Asking about a column that doesn't exist | Hallucination risk |
| Double-negation queries | "Show non-enterprise vendors excluding inactive" | Filter logic errors |

Build a dedicated adversarial test suite for these scenarios.
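One way to structure such a suite is as data, so new failure modes become one-line additions. The expected_behavior labels and the stub_system stand-in below are illustrative, not Ameya Analyst's API:

```python
# Adversarial cases as data: (question, behavior the system should exhibit).
ADVERSARIAL_CASES = [
    ("Show me performance", "asks_for_clarification"),
    ("Why did sales drop?", "verifies_premise_before_explaining"),
    ("Top vendors in the 'unicorn' category", "reports_no_data"),
]

def run_adversarial_case(question, expected_behavior, system):
    response = system(question)
    assert response["behavior"] == expected_behavior, (
        f"{question!r}: expected {expected_behavior}, got {response['behavior']}"
    )

def stub_system(question):
    """Stand-in for the real analytics endpoint, wired to pass the cases."""
    table = {
        "Show me performance": "asks_for_clarification",
        "Why did sales drop?": "verifies_premise_before_explaining",
    }
    return {"behavior": table.get(question, "reports_no_data")}

for question, expected in ADVERSARIAL_CASES:
    run_adversarial_case(question, expected, stub_system)
```

Note the asserted property is a behavior, not an answer: for adversarial inputs the correct outcome is often refusing, clarifying, or reporting absence of data.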

Complete Testing Principles

| Principle | Origin | Description |
| --- | --- | --- |
| Deterministic | TEA | Use frozen data; assert on properties, not values |
| Isolated | TEA | Test each pipeline layer independently |
| Explicit | TEA | Use numerical, semantic, and provenance assertions |
| Focused | TEA | One test, one layer, one responsibility |
| Fast | TEA | Three-tier execution model; in-memory fixtures |
| Traceable | Extended | Every result links to model, prompt, data, and query versions |
| Grounded | Extended | Every claim verifiable from source data |
| Adversarially Tested | Extended | Explicit coverage of ambiguous, misleading, and edge-case inputs |

Does TEA Work?

| TEA Principle | Applicability | Gap and Mitigation |
| --- | --- | --- |
| Deterministic | Low natively | Freeze data + structured output assertions + temperature=0 |
| Isolated | Medium | Achievable with layer architecture but requires upfront design |
| Explicit | Medium | Requires new assertion vocabulary: numerical, semantic, provenance |
| Focused | High | Natural layer boundaries make this achievable |
| Fast | Low natively | Requires three-tier execution and in-memory fixtures |

The Mindset Shift

Traditional testing asks: "Is it correct?"

Ameya Analyst testing asks: "Can we prove it's correct, and do we know when it isn't?"

The second leads to measurement systems: distributions, thresholds, trends, and provenance chains.

Analytics AI requires frozen data, layered isolation, provenance tracking, semantic evaluation, and adversarial coverage working together. That is the bar Ameya Analyst needs to be trustworthy. TEA, extended, is the framework to get there.

Quick Reference Checklist

□ Are all tests running against frozen, versioned dataset snapshots?
□ Is each test scoped to a single pipeline layer?
□ Are numerical assertions using tolerance-based comparisons?
□ Do insight objects carry full provenance (query, method, time window)?
□ Is there a grounding check that verifies claims against source data?
□ Is there an LLM-as-judge eval for narrative outputs?
□ Are tests tiered across commit / PR / nightly execution?
□ Is there an adversarial test suite covering ambiguous and edge-case inputs?
□ Are model version, prompt version, and dataset version logged with every test run?
□ Are latency and cost tracked as first-class test metrics?

This post is part of the Ameya AI Engineering series on building trustworthy AI systems. For questions on the testing framework, reach out to the QA & AI Engineering team.

I'm Gangadhar Neeli from Ameya - Engineering. I hope this helps you on your AI journey. We've seen these principles reduce processing time by 40% internally. If you're interested in how Ameya Extract, Ameya AI Agents, or Ameya Analyst can help your organization, please contact us.

Gangadhar Neeli

Ameya - Engineering

Visionary technology leader with 26+ years of experience driving strategic initiatives across Enterprise IT, with deep expertise in application rationalization, AI-led modernization, and enterprise platform architecture.

I've seen firsthand how challenging it is to test AI analytics. If you're struggling to build trust in your AI insights, I'd be happy to share how Ameya Analyst can help. Let's book a demo and discuss your specific needs.
