If you're building an AI-powered digital health product for the NHS, Standard 4 of the NICE Evidence Standards Framework is the one most likely to expose gaps in how your product was built, not just how it's documented. Most founders read it as a box to tick. That reading is wrong, and the consequences are both legal and commercial.
Here's the argument this post makes: Standard 4 isn't a values statement from NICE about diversity. It's an engineering requirement with legal force, and the companies that treat it as paperwork are building products that will underperform for the patients who need them most — and will be found out when they try to enter the NHS market.
What Standard 4 Actually Requires
Standard 4 sits in the Design Factors group of the NICE Evidence Standards Framework — Standards 1 to 9 — and applies across all tiers. The first thing founders miss is that it has two distinct layers.
The base layer applies to every DHT, regardless of whether it uses AI. Every manufacturer must describe how health and care inequalities were considered during design, and what steps were taken to ensure the product doesn't worsen existing disparities. This isn't about having a diversity policy. It's about documenting design decisions: who the product was built for, at what digital literacy level, in which language, on what devices, and whether the clinical logic of the product works equally across the populations it claims to serve.
The second layer applies to any DHT incorporating data-driven algorithms or machine learning. Here, the requirement sharpens considerably. Manufacturers must describe what active steps were taken to detect and mitigate algorithmic bias across different demographic groups. That means showing how training data was assembled, how bias was tested across protected characteristic subgroups, and what governance processes exist to monitor for emerging disparities after deployment. The legal anchor is the Equality Act 2010, which defines nine protected characteristics (age, disability, race, sex, and five others), together with the Public Sector Equality Duty it places on NHS bodies. That duty obligates NHS commissioners to actively advance equality of opportunity, not merely avoid discrimination.
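As a rough illustration of what "showing how training data was assembled" can mean in practice, the sketch below summarises the demographic composition of a dataset, counting missing values explicitly so that gaps are visible rather than silently dropped. The field names and toy records are hypothetical, and this is one possible starting point rather than a format NICE prescribes.

```python
from collections import Counter


def composition(records, field):
    """Return {category: share of records} for one demographic field.

    Missing values are reported under the key "missing" instead of being
    dropped, because an undocumented gap is itself a finding.
    """
    counts = Counter(
        r.get(field) if r.get(field) is not None else "missing" for r in records
    )
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Toy records for illustration only; field names are assumptions.
records = [
    {"sex": "female", "age_band": "18-39"},
    {"sex": "male", "age_band": "40-64"},
    {"sex": "male", "age_band": "65+"},
    {"sex": None, "age_band": "65+"},
]

for field in ("sex", "age_band"):
    print(field, composition(records, field))
```

The same summary can be run per protected characteristic and per deprivation band, and the shares compared with the population the product claims to serve.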
The framing matters: the NHS cannot knowingly procure a tool that delivers inferior care to already-disadvantaged populations. Standard 4 is the mechanism through which that commitment becomes your problem as a founder, not just a commissioner's.
Why the Risk Is Bigger Than Most Founders Realise
The scale of health inequality in England is the context that gives Standard 4 its urgency. Men in the most deprived areas of England have a life expectancy 10.4 years shorter than those in the least deprived areas; for healthy life expectancy — years lived in good health — that gap stretches to 19.3 years for males and 20.1 years for females, according to ONS data. Digital health has the potential to narrow that gap, or to widen it.
The risk of widening it is most acute in AI, and the evidence base for this is now extensive. A landmark study published in the New England Journal of Medicine examined pulse oximetry — the fingertip device used in almost every remote monitoring pathway — and found that Black patients had nearly three times the rate of occult hypoxemia (dangerously low oxygen saturation that the device failed to detect) compared to White patients. This was not a new product with untested technology. It was a device that had been in clinical use for decades, with a bias that went undetected because nobody had looked for it systematically.
Algorithmic bias in clinical decision support tools follows the same pattern. A widely cited study published in Science found that a health risk algorithm used to identify patients who would benefit from additional care — affecting millions of patients — systematically underestimated illness severity in Black patients. The consequence: at a given risk score, Black patients were considerably sicker than White patients. Fixing the disparity would have increased the percentage of Black patients receiving additional help from 17.7% to 46.5%. The algorithm wasn't designed to be discriminatory. It was trained on healthcare utilisation data, which reflected a healthcare system that had already provided Black patients with less care. The AI learned the bias; the bias was laundered into clinical practice as an objective score.
In dermatology AI — one of the most commercially active areas of clinical AI development — a study published in Science Advances found diagnostic performance was measurably lower in patients with darker skin tones (Fitzpatrick IV–VI: AUROC 0.82) compared to lighter skin tones (I–III: AUROC 0.89). And the pipeline feeding these models isn't improving fast enough: a separate analysis of 106,000 clinical dermatology images found that only 11 represented patients with darker skin from African, African-Caribbean, or South Asian populations.
The Five Ways Companies Get This Wrong
- Treating the Equality Impact Assessment as a procurement document.
Many companies run an EqIA at the point of NHS procurement because a commissioner asks for one, by which point the design is fixed and the findings can't change anything. Standard 4 requires inequalities consideration at the design stage, where it can actually influence decisions. An EqIA that can't change the product is a compliance document, not a compliance act.
- Separating user diversity from clinical diversity.
A company might recruit ethnically diverse participants for usability testing and legitimately claim they considered user acceptability (Standard 2). But whether ethnically diverse users find the interface accessible is a different question from whether the clinical algorithms underpinning the product work equally well across those demographics. These are different problems requiring different methods, and founders regularly conflate them.
- Not knowing what's in their training data.
For AI products, the most common gap encountered in evaluations is that companies can describe the volume of records in their training dataset but not the demographic composition. They know they have 500,000 patient records; they can't tell you what proportion are female, or what age distribution they represent, or whether any deprivation index mapping was conducted. Without this, any claim of bias mitigation is foundationally weak.
- Missing proxy variable bias.
Algorithmic bias rarely travels through protected characteristic variables directly. It hides in proxies: postcode, prior healthcare utilisation, language of clinical notes, insurance status in any international datasets. These variables correlate with protected characteristics and can introduce or amplify bias without a demographic flag ever appearing in the model. Bias audits that only check for direct demographic disparities miss the mechanism entirely.
- Treating bias testing as a launch activity, not an ongoing one.
Algorithmic drift, where model performance degrades over time as real-world data patterns shift away from the training distribution, can introduce new biases post-deployment. Standard 4 has an ongoing monitoring requirement, and it connects directly to Standard 16 (performance monitoring). Companies whose bias governance ends at launch are out of compliance before they've finished onboarding their first NHS trust.
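The proxy mechanism in the list above can be screened for with a very simple check: ask how well a single input feature predicts a protected characteristic compared with always guessing the most common group. The sketch below is a crude screening heuristic under stated assumptions, not a full bias audit; the function name, field names, and toy rows are all hypothetical.

```python
from collections import Counter, defaultdict


def proxy_strength(rows, feature, protected):
    """Return (baseline, accuracy) for predicting `protected` from `feature`.

    baseline: accuracy of always guessing the most common protected group.
    accuracy: accuracy of guessing the most common protected group within
              each value of the feature.
    A large gap between the two suggests the feature acts as a proxy.
    """
    overall = Counter(r[protected] for r in rows)
    baseline = max(overall.values()) / len(rows)

    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[feature]][r[protected]] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return baseline, correct / len(rows)


# Toy data: postcode area "A" is mostly group X, area "B" mostly group Y.
rows = (
    [{"postcode_area": "A", "ethnic_group": "X"}] * 4
    + [{"postcode_area": "A", "ethnic_group": "Y"}] * 1
    + [{"postcode_area": "B", "ethnic_group": "X"}] * 1
    + [{"postcode_area": "B", "ethnic_group": "Y"}] * 4
)

baseline, accuracy = proxy_strength(rows, "postcode_area", "ethnic_group")
print(f"baseline={baseline:.2f} accuracy_from_feature={accuracy:.2f}")
```

A feature that lifts accuracy well above the baseline deserves a closer look even when no demographic field is passed to the model. What counts as "well above" is a judgment for the team to set and document.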
What Good Looks Like in Practice
Strong Standard 4 submissions treat equity as an engineering discipline, not a values statement. The outputs are measurable, documented, and defensible.
At the design stage, this means an explicit specification of the intended population — including populations at elevated risk of health inequality — and design decisions that address their needs specifically. For AI products, it means producing a training data card that documents dataset composition, known demographic gaps, and the steps taken to address underrepresentation. It means calculating subgroup performance metrics across protected characteristics before release, not only aggregate accuracy measures.
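To make "subgroup performance metrics, not only aggregate accuracy" concrete, here is a minimal sketch that computes AUROC separately for each subgroup using the pairwise definition: the share of positive/negative pairs in which the positive case gets the higher score, with ties counted as half. The group labels and scores are toy values invented for illustration, not figures from any study, and the pairwise loop is quadratic, so a library implementation would be preferable on real data.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Pairwise AUROC; returns None if either class is absent."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUROC is undefined without both classes
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def auroc_by_group(rows):
    """rows: iterable of (group, label, score). Returns {group: AUROC}."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in rows:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auroc(labels, scores) for g, (labels, scores) in grouped.items()}


# Toy evaluation set: (subgroup, true label, model score).
rows = [
    ("I-III", 1, 0.9), ("I-III", 1, 0.8), ("I-III", 0, 0.3), ("I-III", 0, 0.2),
    ("IV-VI", 1, 0.9), ("IV-VI", 1, 0.4), ("IV-VI", 0, 0.6), ("IV-VI", 0, 0.2),
]

per_group = auroc_by_group(rows)
print(per_group)
```

Reporting the per-group values alongside the pooled figure, with the number of cases in each group, is what lets a reviewer see whether a strong aggregate hides a weak subgroup.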
For ongoing governance, it means a defined process for monitoring performance across protected characteristic groups post-deployment, with predefined thresholds for intervention. The question "what happens if we detect performance degradation for a specific demographic group six months post-launch?" should have a documented answer before you go live, not after.
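One way to give that question a documented answer is to encode the intervention thresholds as a check that runs on every monitoring cycle. The sketch below is illustrative only: the function name, the two thresholds (`max_drop`, `max_gap`), and their default values are assumptions, and the right values depend on the metric and the clinical context.

```python
def check_subgroup_drift(baseline, current, max_drop=0.05, max_gap=0.05):
    """Return a list of alert strings when subgroup metrics breach thresholds.

    baseline / current: {group: metric value}, e.g. sensitivity or AUROC.
    max_drop: largest tolerated fall from a group's own baseline.
    max_gap:  largest tolerated spread between best and worst group.
    Both thresholds are placeholder values, not recommendations.
    """
    alerts = []
    for group, base_value in baseline.items():
        value = current.get(group)
        if value is None:
            alerts.append(f"{group}: no current measurement")
        elif base_value - value > max_drop:
            alerts.append(f"{group}: dropped {base_value - value:.2f} from baseline")

    measured = [v for v in current.values() if v is not None]
    if measured and max(measured) - min(measured) > max_gap:
        gap = max(measured) - min(measured)
        alerts.append(f"gap between best and worst subgroup is {gap:.2f}")
    return alerts


# Toy values: group B has degraded since release, group A has not.
baseline = {"A": 0.90, "B": 0.88}
current = {"A": 0.89, "B": 0.80}

alerts = check_subgroup_drift(baseline, current)
for alert in alerts:
    print(alert)
```

Each alert should map to a named owner and a predefined action, so that the response is decided before launch rather than improvised afterwards.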
The most credible submissions also include third-party bias auditing — independent technical review of the model's fairness properties. This is emerging as best practice in regulated sectors. For high-risk clinical AI, it will likely become a procurement expectation for NHS commissioners within the next few years, and companies that have already built this into their governance process will move faster than those that haven't.
The MHRA and NICE are increasingly aligned with the WHO's Ethics and Governance of AI for Health framework, which treats fairness documentation and bias monitoring as core, not optional. The UK AI Assurance Roadmap points in the same direction. The regulatory environment is tightening; the question is whether your product governance is already ahead of it or running behind.
The Commercial Argument for Getting This Right
Standard 4 is sometimes framed internally as a compliance cost. That's the wrong frame. A product that works well across the full demographic diversity of the NHS patient population is a better product — with a stronger adoption case — than one optimised for a narrow demographic. The NHS serves one of the most demographically diverse patient populations in the world. A tool that demonstrably performs equitably across that diversity is a tool commissioners can procure with confidence. One that can't demonstrate equity is one that generates procurement risk.
There's also a growing legal dimension that founders should understand. The Equality and Human Rights Commission has explicitly identified AI bias as a priority enforcement area and is developing formal guidance. Under the Equality Act 2010, NHS commissioners who knowingly procure DHTs delivering systematically inferior care to protected characteristic groups face potential discrimination liability. As AI becomes more central to clinical decision-making, the question of whether algorithmic outputs are discriminatory will move from academic discussion to legal proceedings.
The companies that will navigate this transition well are not the ones that respond to Standard 4 with the minimum required documentation. They're the ones that have made algorithmic fairness an engineering discipline — with training data cards, subgroup performance metrics, bias monitoring dashboards, and third-party audit on their roadmaps. That infrastructure doesn't just satisfy NICE. It's also the infrastructure you'll need when the Equality and Human Rights Commission asks questions, when a major NHS trust requests a bias audit as part of procurement, or when a journalist investigates whether your dermatology AI works equally well for all skin tones.
Standard 4 is not just a compliance requirement. It's also a product quality requirement.
A Practical Compliance Checklist
The following applies to any DHT with AI/ML components submitting against the NICE ESF:
Where to Go Next
Standard 4 doesn't sit alone. The standards it connects to most directly are Standard 2 (user acceptability — inclusive research at the design stage), Standard 5 (data practices — how training data is assembled and governed), Standard 14 (effectiveness evidence — subgroup performance as part of clinical validation), and Standard 16 (performance monitoring — ongoing equity surveillance post-deployment). Understanding how these standards form a coherent system is the difference between meeting each one in isolation and building a product that holds up under real NHS scrutiny.
If you're working through the NICE ESF and want to understand where your current submission is strong and where it has gaps, get in touch. This is precisely the kind of analysis that Healthonomix is built for.



