When a "Working" AI Model Fails in Production

This case study is based on a real engagement. I have explicit permission to share the lessons learned on the condition that the organisation remains anonymous. For that reason, identifying details have been deliberately obscured.

The client and I agreed that the value of this case lay not in attribution, but in transparency. The findings are not unusual. In fact, they are uncomfortably common and align closely with widely reported AI initiative failure rates — often cited at around 70–80% in industry research, including studies published by firms such as McKinsey.

This case study was presented last week at the Investigo Group AI Failure event (see previous post), where the discussion focused on a central theme: AI projects rarely fail because of model performance. They fail because the enterprise system — governance, architecture, operating model, behavioural adoption — was not designed to sustain probabilistic technologies.

If the lessons resonate, that is precisely the point.

Syndicate 4827, a mid-sized Lloyd’s market player specialising in Marine Cargo, Specialty Property and Political Risk, faced a structural challenge familiar across the London Market.

An experienced underwriting team with deep broker relationships
Decision-making often driven by trust built over decades
Increasing market pressure to underwrite complex risks within minutes
Rising use of AI by competitors to price and triage faster

Underwriters were sceptical of AI — some feared replacement. Yet their bonus structure, tied directly to underwriting profit, created a paradox: if AI improved loss ratios and productivity, it could materially increase personal reward.

This tension set the stage for Project Argus.

The Ambition

Argus was designed to:

Reduce underwriting decision time by 30%
Improve risk selection accuracy
Improve portfolio loss ratio by 2–3%
Arrest the gradual erosion of underwriting returns caused by adverse selection, as London Market competitors used more advanced analytics and AI-enabled underwriting to select risks better

A vendor-developed machine learning model showed promising results during pilot:

AUC: 0.84
12% improvement over legacy rating heuristics
Back-tested across five years of historical data

It “worked in the notebook.” Proof of Concept was achieved.

Executive approval for production followed.

Phase 1 – Pilot Success (In a Safe Bubble)

The pilot environment was controlled:

Cleaned historical data
Manual feature engineering
Batch scoring
Offline evaluation
Narrow product scope

Underwriters were impressed. Bonus projections were quietly discussed.

But what worked in a static analytical environment was about to meet a live insurance estate.

Phase 2 – Production Reality

Deployment was IT-led.

The model was:

Containerised
Integrated into the underwriting platform (Guidewire, on-premise)
Connected via a hard-coded API endpoint
Fed by nightly ETL extracts

Critically:

No architectural redesign occurred.
No operating model evolved.
No governance model changed.

The AI was inserted into a deterministic IT estate that had never supported probabilistic systems.

Failure Symptoms

1 – Data Pipeline Fragility

Feature engineering relied on:

Hard-coded schemas
Static field mappings
Version-bound transformations

A minor schema update in the policy administration system (new endorsement fields) triggered:

Feature misalignment
Silent null inflation
Input distribution drift

The model did not crash.

It degraded.

No drift monitoring existed.
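
A guard of the following kind, run on each nightly extract before scoring, is one way such a failure could have been surfaced rather than absorbed silently. This is a minimal sketch, not the syndicate's pipeline: the feature names, the tolerated null rates, and the function check_batch are all hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Mapping

# Hypothetical feature contract: column name -> maximum tolerated null rate.
EXPECTED_FEATURES = {
    "sum_insured": 0.01,
    "voyage_region": 0.05,
    "cargo_class": 0.05,
}


def check_batch(rows: Iterable[Mapping[str, object]]) -> list[str]:
    """Return human-readable contract violations for one scoring batch."""
    rows = list(rows)
    if not rows:
        return ["empty batch"]

    problems = []
    seen_columns = set().union(*(row.keys() for row in rows))

    # Schema check: surface dropped, renamed, or newly added fields.
    for missing in sorted(set(EXPECTED_FEATURES) - seen_columns):
        problems.append(f"missing column: {missing}")
    for extra in sorted(seen_columns - set(EXPECTED_FEATURES)):
        problems.append(f"unexpected column: {extra}")

    # Null-rate check: catch silent null inflation before the model scores it.
    for column, max_null_rate in EXPECTED_FEATURES.items():
        if column not in seen_columns:
            continue
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(
                f"null rate {rate:.0%} in {column} exceeds {max_null_rate:.0%}"
            )
    return problems
```

A non-empty result would block scoring or raise an alert, so a new endorsement field in the source system becomes a visible event instead of a quiet degradation.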

2 – No Ownership Model

Ambiguity paralysed response:

IT viewed it as a business tool
Underwriting viewed it as an IT service
Data Science assumed IT would monitor performance

There was:

No L2 model support function
No defined retraining triggers
No performance threshold governance
No Change Advisory Board (CAB) classification for model updates

The model had no accountable owner.
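
Retraining triggers and performance thresholds can be written down as an explicit policy with a named owner. The sketch below shows one possible shape. The class ModelPolicy, the function retraining_reasons, and every threshold value are illustrative assumptions, not figures from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPolicy:
    """Hypothetical governance record for one production model."""
    owner: str                # accountable role, named explicitly
    min_auc: float            # performance floor on recent labelled outcomes
    max_psi: float            # tolerated input drift (Population Stability Index)
    max_override_rate: float  # share of recommendations underwriters overrule


def retraining_reasons(policy: ModelPolicy, auc: float, psi: float,
                       override_rate: float) -> list:
    """Return the breached thresholds; an empty list means no trigger fired."""
    reasons = []
    if auc < policy.min_auc:
        reasons.append(f"AUC {auc:.2f} below floor {policy.min_auc:.2f}")
    if psi > policy.max_psi:
        reasons.append(f"PSI {psi:.2f} above limit {policy.max_psi:.2f}")
    if override_rate > policy.max_override_rate:
        reasons.append(
            f"override rate {override_rate:.0%} above limit "
            f"{policy.max_override_rate:.0%}"
        )
    return reasons


# Illustrative values only.
policy = ModelPolicy(owner="Head of Underwriting Analytics",
                     min_auc=0.78, max_psi=0.25, max_override_rate=0.30)
print(retraining_reasons(policy, auc=0.74, psi=0.31, override_rate=0.12))
```

The numbers matter less than the fact that someone is named against them and that a breach produces a defined action.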

3 – SLA ≠ Model Integrity

IT monitored:

API uptime
Latency
Infrastructure health

All green.

But:

Predictive power declined
False positives increased
Underwriters began overriding recommendations

Technically operational.
Commercially failing.
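
The gap between the two dashboards can be narrowed by tracking model-integrity measures next to uptime and latency. The following sketch computes two such measures in plain Python: AUC on recently settled outcomes and the underwriter override rate. The function names and the input shapes are assumptions made for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Pairwise (Mann-Whitney) form: simple and exact, fine for small samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative outcome")
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def override_rate(decisions):
    """Share of cases where the underwriter's decision differs from the model's.

    decisions: iterable of (model_recommendation, underwriter_decision) pairs.
    """
    decisions = list(decisions)
    if not decisions:
        return 0.0
    overridden = sum(1 for recommended, final in decisions if recommended != final)
    return overridden / len(decisions)
```

Charted weekly alongside the infrastructure metrics, a falling AUC or a rising override rate would have turned "all green" into an early warning.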

4 – Behavioural Reversion

Trust eroded.

Underwriters reverted to:

Manual heuristics
Spreadsheet shadow models
Relationship-led fast-tracking

The AI score became background noise.

Shadow processes re-emerged — a classic failure signal in transformation programmes.

5 – CAB Friction

When retraining was proposed:

It was classified as a major system change
Full regression testing required
Deployment tied to monthly CAB cycles

AI iteration speed collapsed.

Continuous model evolution became impossible.
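
A lighter pathway needs a rule for deciding which model changes genuinely require the full cycle. One possible rule is sketched below: a retrain on the same inputs and architecture that does not regress against the live model is treated as a standard change, and everything else as major. The class ModelChange, the function classify, and the tolerance value are hypothetical and do not describe the syndicate's actual CAB process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChange:
    """Hypothetical description of a proposed model update."""
    input_schema_changed: bool   # new, removed, or re-typed features
    architecture_changed: bool   # different algorithm or model structure
    validation_auc: float        # candidate model on the holdout set
    champion_auc: float          # live model on the same holdout set


def classify(change: ModelChange, tolerance: float = 0.01) -> str:
    """Return 'standard' (pre-approved route) or 'major' (full CAB review)."""
    if change.input_schema_changed or change.architecture_changed:
        return "major"
    # A retrain that performs materially worse than the live model is not routine.
    if change.validation_auc + tolerance < change.champion_auc:
        return "major"
    return "standard"
```

Under a rule like this, routine retraining would no longer wait for the monthly cycle, while structural changes would still receive full scrutiny.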

Root Causes (Not Model Quality)

The model was not the failure.

The estate was.

Structural Deficiencies

No feature store abstraction
No drift detection (PSI, KS statistics, performance monitoring)
No MLOps pipeline (manual retraining process)
No defined L2 model observability capability
Hard-coded integration between schema and endpoint
No AI-specific change classification
No data governance council, domain ownership, or stewardship model

AI was deployed into a platform with no absorptive capacity.
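
Of the missing controls, drift detection is the easiest to illustrate. The Population Stability Index (PSI) compares the binned distribution of a feature or score at training time with its distribution in production. Below is a minimal sketch in plain Python. Equal-width bins and the small floor eps are simplifying assumptions, and the commonly quoted rule of thumb that a PSI above roughly 0.25 signals a significant shift is a convention, not a standard.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bins are equal-width over the baseline's range; current values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant baseline: everything lands in the first bin

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # illustrative training-time values
shifted = [0.5 + i / 200 for i in range(100)]   # illustrative production values
print(psi(baseline, baseline), psi(baseline, shifted))
```

Run per feature on each extract, a check like this makes input drift a measured quantity rather than something inferred months later from underwriter behaviour.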

Business Impact (12 Months Later)

Model usage dropped below 25%
Underwriters informally ignored recommendations
Projected ROI evaporated
Executive confidence in AI initiatives weakened

The post-implementation review concluded:

“The model was technically sound but operationally unsustainable.”

Strategic Diagnosis

Syndicate 4827 attempted:

Technical augmentation
Without platform modernisation
Without operating model redesign
Without governance evolution

They inserted a probabilistic system into a deterministic IT change framework.

The result was inevitable.

The Intervention

Argus was withdrawn.

A structured AI Readiness Strategy and Roadmap was commissioned before any reintroduction.

Key components included:

Dedicated AI platform layer
Feature store abstraction
Automated drift detection
Defined retraining governance
AI-specific CAB pathway
L2 model observability function
Data Council with domain ownership and stewardship
Behavioural adoption programme for underwriters

External data sources were also introduced strategically:

Moody’s RMS analytics
Experian credit risk data
Bloomberg political risk scoring
Curated internet risk intelligence feeds

This significantly increased the projected cost of delivering a scalable, sustainable, and fully operationalised AI risk rating capability. As a result, active underwriters and members of the managing agency initially moved to halt the restart of Argus.

This was the inflection point, and the framing changed.

Rather than positioning Argus as a single model redevelopment effort, the scope was expanded across all lead classes of business. The initiative evolved into a structured portfolio of AI-enabled transformation programmes, embedded within a broader syndicate-wide business transformation strategy.

AI was no longer presented as a discretionary optimisation opportunity. It was repositioned as a strategic enabler — potentially critical to protecting underwriting discipline, restoring competitive positioning, and safeguarding the future of a syndicate with over 100 years of market heritage.

The conversation shifted from cost containment to long-term survival and return on capital resilience.

The question shifted from:

“Does the model work?”

To:

“Can our enterprise absorb and sustain probabilistic systems?”

The Real Lesson

AI failure in insurance is rarely algorithmic.

It is architectural.
It is governance-based.
It is behavioural.

Project Argus was not a model failure.

It was a transformation design failure.

If you are introducing AI into underwriting, pricing, or portfolio optimisation:

The first question is not model accuracy.

It is:

Is your operating model designed for AI systems?

Without that foundation, even a high-performing model will quietly degrade into irrelevance.
