A new peer-reviewed paper just landed that is worth careful attention from anyone evaluating commercial veterinary AI for dermatology workflows. It is one of the cleaner examples I have seen of an academic-industry collaboration shipping with both the methodology and the per-lesion performance numbers in the same publication, which is exactly the kind of release the radiology side of veterinary AI has been pushed toward for the past year.

The paper is Kang et al., "Artificial Intelligence-Based Identification of Common Canine Skin Lesions From Clinical Images," published in Veterinary Dermatology on May 1, 2026 (DOI: 10.1111/vde.70083, PMID 42067986). The work was a joint effort between the Laboratory of Veterinary Dermatology at the Seoul National University College of Veterinary Medicine, AIFORPET Corporation (a Korean veterinary AI company), and clinicians in Korea and India.

What the study did

The team trained four independent convolutional neural networks, one per lesion type, using the EfficientNet architecture. The four lesion classes are the bread and butter of small-animal dermatology: erythema, lichenification, alopecia, and erosion or ulcer. Clinical images were collected from dogs presented to a veterinary medical teaching hospital, labeled by veterinary surgeons, and used as the ground truth.

Each model was scored on six metrics: accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score. Per-lesion performance from the abstract:

The alopecia model performed best, with 98.12 percent accuracy and an F1 score of 98.18 percent. The erythema and erosion-or-ulcer models showed balanced performance across all six metrics, with accuracy above 90 percent. The lichenification model had the weakest sensitivity at 87.02 percent and an F1 score of 89.88 percent, which the authors flag explicitly as the model that needs more work.

All four models cleared the 90 percent accuracy bar.
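For readers who want the six metrics spelled out, here is a minimal sketch of how each one falls out of a single binary confusion matrix. The counts are invented for illustration and are not taken from the paper; the names `Confusion` and `binary_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from comparing one binary lesion model against expert labels."""
    tp: int  # lesion present, model says present
    fp: int  # lesion absent, model says present
    tn: int  # lesion absent, model says absent
    fn: int  # lesion present, model says absent


def binary_metrics(c: Confusion) -> dict[str, float]:
    """Return the six metrics named in the article for one lesion model."""
    sensitivity = c.tp / (c.tp + c.fn)   # share of true lesions caught
    specificity = c.tn / (c.tn + c.fp)   # share of lesion-free images cleared
    ppv = c.tp / (c.tp + c.fp)           # how often a positive call is right
    npv = c.tn / (c.tn + c.fn)           # how often a negative call is right
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Made-up counts for a 200-image test set, purely illustrative.
example = Confusion(tp=90, fp=5, tn=95, fn=10)
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The point of the sketch is that a model can post a high accuracy while its sensitivity lags, which is the pattern the authors flag for lichenification.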

Why the architecture choice matters

EfficientNet is a known-quantity backbone for image classification that scales depth, width, and resolution together. It is a sensible default for veterinary imaging because it tends to produce strong results without the parameter count and training cost of larger transformer-based models. The Kang group used a separate model per lesion class rather than a single multi-class model. That choice has tradeoffs. A per-class architecture is simpler to validate and easier to update one lesion at a time, but it can miss correlations between lesions that often co-occur clinically (an atopic dog might present with erythema and lichenification together, for instance). The abstract does not appear to report joint-lesion performance, which is something I would want to see in any follow-up.
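To make the tradeoff concrete, here is a minimal sketch of what a per-lesion setup looks like at prediction time, and of the co-occurrence information it does not model on its own. This is not the authors' code. The scores, the 0.5 threshold, and the function names are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

LESIONS = ("erythema", "lichenification", "alopecia", "erosion_or_ulcer")


def predict_labels(scores: dict[str, float], threshold: float = 0.5) -> set[str]:
    """Threshold each lesion's score independently.

    Each score is assumed to come from that lesion's own binary model,
    so no model sees what the other three predicted.
    """
    return {name for name in LESIONS if scores[name] >= threshold}


def co_occurrence(label_sets: list[set[str]]) -> Counter:
    """Count how often each pair of lesions is predicted on the same image."""
    counts: Counter = Counter()
    for labels in label_sets:
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts


# Hypothetical per-model scores for three images, not data from the paper.
cases = [
    {"erythema": 0.91, "lichenification": 0.72, "alopecia": 0.10, "erosion_or_ulcer": 0.05},
    {"erythema": 0.88, "lichenification": 0.40, "alopecia": 0.65, "erosion_or_ulcer": 0.02},
    {"erythema": 0.95, "lichenification": 0.81, "alopecia": 0.20, "erosion_or_ulcer": 0.30},
]

predicted = [predict_labels(c) for c in cases]
pairs = co_occurrence(predicted)
print(pairs.most_common())
```

A unified multi-label model could learn from pair frequencies like these during training. With four independent models, they only show up after the fact, in a tally like this one.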

Where this fits in the broader picture

For context, a January 2026 audit by Brundage at the University of Wisconsin-Madison of 71 commercially available veterinary AI products found a mean transparency score of 6.4 percent, with nearly two-thirds of vendors failing to disclose a single validation metric. The Kang et al. paper is the kind of disclosure pattern that audit was implicitly asking for: model architecture named, per-lesion accuracy tables published, weakest model identified by the authors themselves, and the data source described.

Vetology Innovations followed a similar transparency approach in January 2026 when they released sensitivity and specificity numbers for all 89-plus classifiers across their imaging platform. The Kang paper is a different mechanism, peer-reviewed publication versus public dashboard, but the direction is the same.

What this does not yet tell us

Three limitations are worth flagging before any clinic considers this kind of tool for a live workflow.

First, the paper evaluates lesion-level identification on clinical images rather than primary diagnosis. A model that correctly labels alopecia at 98 percent does not, by itself, tell you whether the alopecia is endocrine, allergic, parasitic, or psychogenic. The authors are clear about this. The framing is decision support for objective lesion documentation and treatment monitoring, not autonomous diagnosis.

Second, breed and skin-color distribution in the training set will matter for generalization. The Seoul collection is likely weighted toward small-to-medium breeds favored in Asian markets, which echoes the prior Ryu et al. atopic dermatitis distribution study from the same lab. Performance on lichenification in pigmented or coated breeds outside that distribution is not yet established.

Third, the validation cohort size and prospective external validation status are not visible in the abstract. Retrospective single-site training plus same-site testing is the starting point, not the finish line, for any tool that might end up in clinical workflow.
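Cohort size matters because the same headline percentage carries very different uncertainty at different sample sizes. The sketch below computes a 95 percent Wilson score interval for a sensitivity of 87 percent at two sample sizes. Both sample sizes are hypothetical, since the abstract does not give the real one, and `wilson_interval` is an illustrative helper rather than anything from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95 percent)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed sensitivity (87 percent), two hypothetical cohort sizes.
small = wilson_interval(successes=87, n=100)
large = wilson_interval(successes=870, n=1000)

for label, (low, high) in (("n=100", small), ("n=1000", large)):
    print(f"{label}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

The interval at the smaller cohort is several times wider. A vendor quoting a point estimate without the number of positive cases behind it is giving you only half of the result.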

A practical note from my own observations

My senior dog Roger has chronic skin issues. Over the past two years I have learned that lesion documentation is one of the things even attentive owners get wrong, because what looks like the same red patch over six weeks often is not. A reliable, consistent, labeled photo trail would have changed two of his treatment cycles. That is not a hypothetical use case for this kind of model. It is the use case that brings the most measurable value to a real clinic in 2026: longitudinal lesion tracking that does not depend on one tech remembering how the lesion looked last visit.

What I am watching going into the rest of 2026: whether any veterinary AI dermatology tool publishes prospective external validation, whether tools that pass that bar earn integration with the major practice information management systems, and whether the per-lesion model approach the Kang group used or a multi-task unified model wins on real-world deployment metrics. The Kang paper is a meaningful evidence point. It is not yet a clinical product story.

For clinicians evaluating commercial tools that claim dermatology AI capability this quarter, the right questions to ask the vendor are the same ones Brundage's audit framework uses: what are the per-condition sensitivity and specificity numbers, what is the training-set composition, and what is the prospective validation status? Vendors that can answer all three deserve a pilot. Vendors that cannot, do not.

If your clinic's "before and after" photo files are six different angles, three different lighting setups, and one cat that escaped twice, you are exactly the audience this kind of model was built for.
