About

Methodology

A structured dataset of NICE Technology Appraisals — the UK's health technology assessments that determine NHS drug funding. 826 appraisals are indexed with 3,307 source documents; 555 of them have structured entity extraction from their Final Appraisal Documents (FADs).

What's in the graph

  • Interventions (572 drugs) — generic name, drug class, mechanism, route, duration type
  • Conditions (1,298) — name, therapeutic area, disease setting, biomarker
  • Methodological decisions (9,509 across 23 categories) — company position, ERG position, committee preference, ICER impact. Categories include survival extrapolation, utility source, comparator selection, crossover adjustment, model structure, and 18 others.
  • ICER bands (926) — below £20k, £20–30k, £30–50k, £50–100k, above £100k, dominant, confidential
  • Clinical trials (2,206) — design, phase, blinding, crossover, generalisability
  • Comparators (2,769) — type, established practice, committee preferred
  • Economic models (675) — type, time horizon, health states
  • Commercial arrangements (413) — PAS, MAA, CAA
  • Evidence gaps (3,658) — type and description
  • Cross-references (952) — TA-to-TA citations with relationship type
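
The typed tables make cross-entity queries cheap. A minimal sketch with Python's sqlite3 — the database filename and the table and column names below are illustrative assumptions, not the dataset's actual 14-table schema:

```python
import sqlite3

# Illustrative only: table and column names are assumptions,
# not the dataset's actual schema.
con = sqlite3.connect("nice_ta.db")  # hypothetical filename
rows = con.execute(
    """
    SELECT i.generic_name, c.name, b.band
    FROM interventions i
    JOIN conditions c ON c.appraisal_id = i.appraisal_id
    JOIN icer_bands b ON b.appraisal_id = i.appraisal_id
    WHERE b.band = 'above_100k'
    """
).fetchall()
for generic_name, condition, band in rows:
    print(f"{generic_name} · {condition} · {band}")
```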

Pipeline

  1. Scraped the NICE website for all TA listings via its Next.js JSON API (894 TAs indexed)
  2. Downloaded 4,522 PDFs (7.3 GB)
  3. Converted to markdown with page-level markers using pymupdf4llm (3,307 files, 380K pages)
  4. Chunked into 10-page windows with 2-page overlap (2,314 FAD chunks)
  5. Extracted entities using Claude Haiku 4.5 with tool calling against a purpose-built ontology. 2,314/2,314 chunks processed, 0 errors.
  6. Normalised enums, deduplicated entities across overlapping chunks, built 14 typed SQLite tables
  7. Built FTS5 full-text search index across all 3,307 documents
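
Step 4's windowing is simple to reproduce. A minimal sketch, assuming each document's pages arrive as a list of markdown strings:

```python
def chunk_pages(pages, window=10, overlap=2):
    """Yield 10-page windows that overlap by 2 pages (i.e. a step of 8)."""
    step = window - overlap
    for start in range(0, max(len(pages) - overlap, 1), step):
        yield pages[start:start + window]

# A 25-page FAD yields three chunks: pages 0-9, 8-17, 16-24.
chunks = list(chunk_pages([f"page {i}" for i in range(25)]))
```

The 2-page overlap means entities on shared pages are extracted twice, which is why step 6 deduplicates across overlapping chunks.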

Ontology development

The extraction schema was developed iteratively: two independent AI agents each proposed an ontology from 20 diverse TAs, then the proposals were merged and refined over five rounds on 50 additional TAs. The schema stabilised at 9 entity types, 23 methodological decision categories, and 8 ICER bands, with a 3.5% “other” rate for decision categories. See ontology/methods.md for details.
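
For a sense of what step 5 of the pipeline looks like in practice, here is a minimal tool-calling sketch with the Anthropic Python SDK. The tool schema is a toy stand-in for the real ontology, and the model ID and chunk filename are assumptions:

```python
import anthropic

chunk_markdown = open("ta000_fad_chunk_00.md").read()  # hypothetical chunk file

# Toy stand-in for the real ontology's extraction schema.
extraction_tool = {
    "name": "record_entities",
    "description": "Record structured entities found in a FAD chunk.",
    "input_schema": {
        "type": "object",
        "properties": {
            "interventions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "generic_name": {"type": "string"},
                        "route": {"type": "string"},
                    },
                },
            },
        },
        "required": ["interventions"],
    },
}

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID
    max_tokens=4096,
    tools=[extraction_tool],
    tool_choice={"type": "tool", "name": "record_entities"},
    messages=[{"role": "user", "content": chunk_markdown}],
)
entities = next(b.input for b in response.content if b.type == "tool_use")
```

Forcing the tool with `tool_choice` guarantees structured output on every chunk, which is how all 2,314 chunks can be processed against a fixed schema.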

Coverage and limitations

  • 826 TAs have source documents; 555 have structured extraction
  • Extraction targets Final Appraisal Documents only — not ERG reports, scope comments, or committee papers
  • Exact numerical results (ICERs, QALYs, costs, hazard ratios) are deliberately not extracted; ICERs are captured as categorical bands instead
  • Older TAs (pre-2006) have lower extraction quality due to different document formats
  • Extraction uses AI and is not perfect — verify against source documents for critical decisions

API

  • /llms.txt — Machine-readable project overview and API guide
  • /llms-full.txt — Full corpus index (all TAs with document counts)
  • /api/search?q=...&format=plain — Full-text search
  • /api/corpus/ta{N}/ — Document listing for a TA
  • /api/corpus/ta{N}/{doc}.md — Raw document markdown
  • POST /api/chat — Natural language → SQL → answer (SSE stream)
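
For example, with `requests` — the base URL, TA number, and document name below are placeholders:

```python
import requests

BASE = "https://example.org"  # assumption: substitute the site's own host

# Full-text search, plain-text results
r = requests.get(f"{BASE}/api/search",
                 params={"q": "crossover adjustment", "format": "plain"})
print(r.text)

# List a TA's documents, then fetch one as raw markdown
listing = requests.get(f"{BASE}/api/corpus/ta826/").text
doc = requests.get(f"{BASE}/api/corpus/ta826/fad.md").text  # hypothetical doc name
```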

Source

GitHub · Data from NICE · Built by Shoulders