The Veterinary AI Transparency Gap: What a New Audit Actually Found

Seventy-one commercially available veterinary AI products. A standardized 25-point transparency framework adapted from FDA and Good Machine Learning Practice guidelines. A mean transparency score of 6.4 percent. Nearly two-thirds of vendors failed to disclose a single validation metric.

Those are the headline numbers from a new systematic audit published in Frontiers in Veterinary Science on March 5, 2026, by David Brundage at the University of Wisconsin-Madison School of Veterinary Medicine (DOI: 10.3389/fvets.2026.1761038). It is, to my knowledge, the first systematic audit of transparency in the commercial veterinary AI market. The results deserve the attention of every veterinarian currently using or evaluating these tools.

What the audit actually measured

Brundage built the Veterinary AI Transparency Index (VATI), a 25-point scorecard adapted from human medicine frameworks including the FDA's Good Machine Learning Practice guiding principles, CONSORT-AI, and the CHAI Model Card framework. The instrument evaluates four domains: data provenance and composition, performance and validation, safety and risk management, and usability and documentation.

The search was deliberately narrow and reproducible. Over a seven-day window in November 2025, Brundage queried conference exhibitor lists from VMX, WVC, and AVMA Convention 2025, business aggregators like Crunchbase and LinkedIn, mobile app stores, and structured web search with terms excluding academic sources. From 1,353 initial records, 71 commercial products met inclusion criteria. A forensic web archiving protocol captured time-stamped offline mirrors of every vendor site to prevent stealth editing during the review period.

Vendors were stratified into three functional domains: Diagnostic Imaging AI (pixel-based computer vision), Generative and Ambient AI (NLP and large language models), and Specialized tools. Generative and Ambient dominated the market at 66.2 percent of products, followed by Diagnostic Imaging at 26.8 percent.
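The percentages that follow are easier to read with their denominators in mind. A minimal sketch, assuming the reported shares are simple fractions of the 71 included products, backs out the approximate group sizes. The counts are inferred here from the published percentages, not quoted from the audit, and the variable names are my own.

```python
# Back out approximate product counts from the reported market shares.
# Assumption: each share is a simple fraction of the 71 included products.
TOTAL_PRODUCTS = 71

reported_shares_pct = {
    "Generative and Ambient": 66.2,
    "Diagnostic Imaging": 26.8,
}

# Round each share to the nearest whole product.
inferred_counts = {
    name: round(TOTAL_PRODUCTS * pct / 100)
    for name, pct in reported_shares_pct.items()
}

# Whatever remains is attributed to the third domain, Specialized tools.
inferred_counts["Specialized"] = TOTAL_PRODUCTS - sum(inferred_counts.values())

for name, count in inferred_counts.items():
    share = 100 * count / TOTAL_PRODUCTS
    print(f"{name}: {count} products ({share:.1f}%)")
```

With groups this small, a single vendor moves a percentage by several points in the imaging group and by about two points in the generative group, which is worth remembering when comparing the figures below.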

The Validation Gap between imaging and generative tools

The sharpest divide in the audit is between modalities. Diagnostic Imaging vendors achieved a mean risk-weighted transparency score of 13.1 percent. Generative and Ambient vendors averaged 1.8 percent. The disparity was statistically significant under Kruskal-Wallis testing. The same split shows up in the evidence vendors publish: 36.8 percent of imaging vendors provided peer-reviewed evidence or specific accuracy metrics, compared with only 2.1 percent of generative vendors.

Confidence intervals, which matter enormously for clinical decision-making under uncertainty, were reported by 15.8 percent of imaging vendors and zero percent of generative vendors. Independent test sets were disclosed by 26.3 percent of imaging vendors and 2.1 percent of generative vendors. Across the entire cohort of 71 products, exactly one vendor disclosed the signalment distribution of its training data. One.
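To see why a missing confidence interval matters, consider how much uncertainty hides behind a single headline accuracy figure. The sketch below uses the standard Wilson score interval for a binomial proportion. The test-set sizes and the 90 percent accuracy are hypothetical numbers chosen for illustration, not figures from the audit or from any vendor.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: the same 90% headline accuracy on two test-set sizes.
small_lo, small_hi = wilson_interval(45, 50)
large_lo, large_hi = wilson_interval(450, 500)

print(f"n=50:  {small_lo:.3f} to {small_hi:.3f}")
print(f"n=500: {large_lo:.3f} to {large_hi:.3f}")
```

Under these assumptions, the 50-case interval spans roughly 17 percentage points while the 500-case interval spans about 5. A vendor that reports "90 percent accurate" without the interval or the sample size leaves the reader unable to tell which situation applies.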

The Stage-4-on-Stage-1 problem

Brundage's most pointed critique borrows a staged-maturation framework from human primary care AI integration. In that framework, technology must prove reliable at Stage 1 automated documentation, then Stage 2 administrative workflows, before attempting Stage 3 reactive or Stage 4 proactive clinical decision support. Recent human primary care evaluations have documented a 70 percent error rate in AI-generated draft notes and frequent omissions in social history and critical details, suggesting Stage 1 scribes are not yet consistently reliable even in well-regulated environments.

Brundage argues the veterinary market has inverted this progression. Veterinary vendors routinely ship Stage 4 "Proactive Consultant" features on top of Stage 1 "Passive Scribe" technology that human medicine has deemed insufficiently reliable for unmonitored use. Automated differential diagnosis generation, triage urgency scoring, and treatment plan recommendations are widely marketed without corresponding validation evidence. The audit describes this as leapfrogging the validation phase and embedding unquantified risks into complex care delivery systems.

Why this matters for practitioners, not just researchers

The regulatory vacuum here is genuine. In North America there are currently no federal pre-market approval requirements for AI tools used in veterinary medicine. The accountability shifts to the licensed practitioner, who is expected to validate the tools they use but is structurally denied the data required to do so. The audit's language is direct: veterinarians bear the legal and ethical burden of validation without access to necessary performance data.

This aligns with a second Frontiers paper, from February 2026. Li and colleagues surveyed 455 Chinese veterinary professionals and documented an "adoption paradox" (DOI: 10.3389/fvets.2026.1727001): 71.0 percent of respondents had already incorporated AI into their clinical workflow, yet 44.6 percent of those active users reported low familiarity with the technology they were relying on. Concern about AI reliability was the top barrier to adoption at 54.3 percent, and 93.8 percent supported mandatory regulatory oversight. Veterinarians know something is off. They are using the tools anyway.

What I would ask before deploying a veterinary AI tool

Three practical questions, drawn directly from the VATI framework and the kind of due diligence I would apply to my own practice:

First, is there a published, independent test set? A metric on training data is not a performance claim. The Brundage audit shows that only about one in four imaging vendors and one in fifty generative vendors disclose this.

Second, what is the signalment distribution of the training data? A model trained predominantly on adult Golden Retrievers is not the same model for a geriatric Chihuahua or for Pancake, Gigi, and Roger. One vendor in 71 discloses this, which renders independent assessment of algorithmic bias effectively impossible across the current market.

Third, what does the vendor disclose about failure modes and out-of-distribution handling? A system that cannot tell you when it does not know is a system that will confidently mislead you at the exact moments that matter most.

The counterexample worth noting

Not every vendor fits this pattern. Sonus Health, a Cambridge-based startup from Decorte Future Industries that launched pet cardiac screening in January 2026, published its validation data on arXiv as Jose et al., 2601.13593: 91.63 percent mean heart rate accuracy on sixty-second smartphone recordings, with 38 of the recordings annotated by board-certified veterinary cardiologists. That is not a perfect design, and the company bundles board-certified cardiologist review into every full report for exactly that reason. But it does what Brundage's audit says most vendors do not: publish the numbers, describe the dataset, name the failure modes.

The Transparency Gap is not unfixable. It is a choice by a market whose customers have not yet pressed hard enough to change it.