University College London Hospitals NHS OMOP dataset
Population Size
1,200,000
People
Years
2019
Associated BioSamples
None/not available
Geographic coverage
United Kingdom
Lead time
Summary
Documentation
UCLH has an OMOP extraction system (omop_es) that connects our Electronic Health Record (EHR) to an architecture that delivers high-quality, standardised extracts conforming to the OMOP CDM. Our EHR contains records for 6 million patients, 13 million diagnoses and 50 million medication events. These derive from the UCLH patient population, which includes national referrals for tertiary and quaternary services (cancer, neurology, etc.) and general medical admissions from an inner-city teaching hospital that treats more than 1 million outpatients per year and has more than 100,000 inpatient admissions.
UCLH has invested effort and expertise in aligning international terminology systems (e.g. SNOMED CT, LOINC, UCUM) with NHS data standards, both during the EHR system build and after implementation. Our standardisation work has covered the following clinical domains: Diagnosis and past medical history, Surgical and Ambulatory procedures, Diagnostic Imaging, Cardiac Echo, Lab Medicine including Biochemistry, Haematology, Microbiology, Immunology, Virology, Allergens, Medications (including route of administration), and Demographic information such as Religion and Ethnicity. For some domains (e.g. diagnosis and surgical procedures) we have achieved 100% standardisation; for others the work is ongoing.
Our data pipeline, the OMOP-Extraction System (OMOP-ES), is a modular, reusable architecture written in over 20,000 lines of R. Extractions proceed through four stages:
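As an illustration of what a per-domain standardisation figure such as "100%" could mean, the sketch below computes the share of records in each domain that carry a mapped concept. It assumes coverage is defined as the fraction of records whose concept id is non-zero (in OMOP, concept id 0 denotes "no matching concept"); the function name and the sample records are hypothetical, and this is not a description of how UCLH measures coverage.

```python
from collections import defaultdict

def standardisation_coverage(records):
    """Return {domain: fraction of records with a non-zero concept id}.

    `records` is an iterable of (domain, concept_id) pairs. Treating
    concept id 0 as "unmapped" follows the OMOP convention; everything
    else here is an illustrative assumption.
    """
    totals = defaultdict(int)
    mapped = defaultdict(int)
    for domain, concept_id in records:
        totals[domain] += 1
        if concept_id != 0:
            mapped[domain] += 1
    return {domain: mapped[domain] / totals[domain] for domain in totals}

# Made-up records: every diagnosis is mapped, one of two lab results is not.
SAMPLE = [("diagnosis", 11), ("diagnosis", 12), ("lab", 0), ("lab", 13)]
coverage = standardisation_coverage(SAMPLE)
```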
- Standardisation - translates source data to OMOP concepts at full fidelity
- Projection - applies rules to redact, filter, transform & link
- Post-processing - allows linking of de-identified non-OMOP data
- Output - multiple formats & destinations incl. CSV, Parquet or SQLite for direct use or import in a TRE
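The four stages listed above can be pictured as a chain of small functions. The following is a minimal sketch in Python rather than the R used by OMOP-ES; every function name, field name, source code and concept id in it is a hypothetical placeholder, and it is not the OMOP-ES implementation.

```python
import sqlite3
from datetime import date

# Placeholder mapping from hypothetical source codes to made-up concept ids.
CONCEPT_MAP = {"DX-A": 1001, "DX-B": 1002}

def standardise(rows, concept_map):
    """Stage 1: attach a concept id to every row; unmapped codes get 0."""
    return [{**r, "condition_concept_id": concept_map.get(r["source_code"], 0)}
            for r in rows]

def project(rows, start_date, redact=("nhs_number",)):
    """Stage 2: keep rows on or after start_date and drop redacted fields."""
    return [{k: v for k, v in r.items() if k not in redact}
            for r in rows if r["condition_start_date"] >= start_date]

def post_process(rows, linked):
    """Stage 3: attach de-identified non-OMOP data keyed by person_id."""
    return [{**r, "linked": linked.get(r["person_id"])} for r in rows]

def write_output(rows, conn):
    """Stage 4: write the OMOP-shaped columns to a SQLite table."""
    conn.execute(
        "CREATE TABLE condition_occurrence ("
        "person_id INTEGER, condition_concept_id INTEGER, "
        "condition_start_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
        [(r["person_id"], r["condition_concept_id"],
          r["condition_start_date"].isoformat()) for r in rows],
    )
    conn.commit()

def run_pipeline(source_rows, conn, start_date, linked):
    """Run the four stages in order and return the final in-memory rows."""
    rows = standardise(source_rows, CONCEPT_MAP)
    rows = project(rows, start_date)
    rows = post_process(rows, linked)
    write_output(rows, conn)
    return rows

# Made-up source rows; the identifiers are dummy values.
SOURCE_ROWS = [
    {"person_id": 1, "nhs_number": "0000000000", "source_code": "DX-A",
     "condition_start_date": date(2018, 6, 1)},
    {"person_id": 2, "nhs_number": "0000000001", "source_code": "DX-B",
     "condition_start_date": date(2020, 1, 15)},
    {"person_id": 3, "nhs_number": "0000000002", "source_code": "DX-UNKNOWN",
     "condition_start_date": date(2021, 3, 2)},
]

conn = sqlite3.connect(":memory:")
result = run_pipeline(SOURCE_ROWS, conn, date(2019, 4, 1), {2: "imaging-ref-x"})
```

Keeping each stage as a separate function mirrors the modular design described here: standardisation can run once at full fidelity, while the projection rules vary from project to project.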
The system:
- is configurable to a variety of OMOP projects via a settings file
- is reproducible and automated
- queries EPIC EHR and other sources
- automates filtering of sensitive data, with safe defaults and the ability for Information Governance teams to inspect settings before and after running
- tests and reports the quality of standardisation
- is being extended both by the 'core' team and by other trusts in an inner-source fashion
- has a small mock database for system development and testing
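One way to picture "safe defaults" that a governance team can inspect is a settings object whose default values are the most restrictive ones, plus a helper that lists every value a project has changed. The sketch below is an assumption-laden illustration: the setting names (`include_free_text`, `keep_full_postcode`, `output_format`) and both helper functions are hypothetical and do not come from the OMOP-ES settings file.

```python
from dataclasses import asdict, dataclass, fields

@dataclass(frozen=True)
class ProjectSettings:
    # Hypothetical settings; the defaults are the most restrictive choices.
    include_free_text: bool = False
    keep_full_postcode: bool = False
    output_format: str = "parquet"

def load_settings(overrides):
    """Build settings from a dict of overrides, rejecting unknown keys."""
    known = {f.name for f in fields(ProjectSettings)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    return ProjectSettings(**overrides)

def deviations_from_defaults(settings):
    """Return only the settings that differ from the safe defaults."""
    defaults = asdict(ProjectSettings())
    return {k: v for k, v in asdict(settings).items() if v != defaults[k]}

# A project that only changes the output format.
settings = load_settings({"output_format": "csv"})
review = deviations_from_defaults(settings)
```

Listing only the deviations gives a reviewer a short summary to check before a run, and rejecting unknown keys stops a misspelt setting from silently falling back to a default.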
Dataset type
Dataset sub-type
Dataset population size
Associated media
Keywords
Observations
Observed Node | Disambiguating Description | Measured Value | Measured Property | Observation Date |
---|---|---|---|---|
Persons | | 1200000 | count | 30 Apr 2025 |
Provenance
Purpose of dataset collection
Source of data extraction
Collection source setting
Image contrast
Biological sample availability
Details
Publishing frequency
Version
Modified
27/05/2025
Coverage
Start date
01/04/2019
Time lag
Geographic coverage
Maximum age range
Accessibility
Language
Alignment with standardised data models
Controlled vocabulary
Format
Data Access Request
Dataset pipeline status
Access rights
Jurisdiction
Data use limitation
Data use requirements
Data Controller
Data Processor