
A synthetic dataset of 15,000 "patients" with Community Acquired Pneumonia (CAP)

Population Size

15,000

People

Years

2018 - 2021

Associated BioSamples

None/not available

Geographic coverage

United Kingdom

England

Lead time

1-2 months

Summary

CAP is common, has variable outcomes and a complex management pathway. Hospital-based decision support algorithms would be highly valuable. This is a diverse and realistic synthetic dataset of 15,000 “CAP patients” to facilitate algorithm development.

Documentation

Community Acquired Pneumonia (CAP) is the leading cause of infectious death and the third leading cause of death globally. Disease severity and outcomes are highly variable, dependent on host factors (such as age, smoking history, frailty and comorbidities), microbial factors (the causative organism) and the treatments given. Clinical decision pathways are complex and, despite guidelines, there is significant national variability in both guideline adherence and patient outcomes.

For clinicians treating pneumonia in the hospital setting, care of these patients can be challenging. Key decisions include the type of antibiotics (oral or intravenous), the appropriate place of care (home, hospital or intensive care), and when it is appropriate to stop antibiotics. Decision support tools to help inform clinical management would be highly valuable to the clinical community.

This dataset is synthetic, formed from statistical modelling using real patient data, and represents a population with significant diversity in terms of patient demography, socio-economic status, CAP severity, treatments and outcomes. It can be used to develop code for deployment on real data, train data analysts and increase familiarity with this disease and its management.

PIONEER geography: The West Midlands (WM) has a population of 5.9 million & includes a diverse ethnic & socio-economic mix.

EHR: University Hospitals Birmingham NHS Foundation Trust (UHB) is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & an expanded 250 ITU bed capacity during COVID. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”. This synthetic dataset has been modelled to reflect data collected from this EHR.

Scope: A synthetic dataset which has been statistically modelled on all hospitalised patients admitted to UHB with Community Acquired Pneumonia. The dataset includes highly granular patient demographics & co-morbidities taken from ICD-10 & SNOMED-CT codes. It also includes serial, structured data pertaining to the process of care, including timings, admissions, escalation of care to ITU, discharge outcomes, physiology readings (heart rate, blood pressure, AVPU score and others), blood results, and drug prescribing and administration.
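
As an illustration of how the physiology and blood-result fields might be used, the sketch below computes CURB-65, a pneumonia severity score that appears in this dataset's keywords. The record type and field names (`AdmissionObs`, `confusion`, `urea_mmol_l` and so on) are hypothetical and are not taken from the dataset's data dictionary; the thresholds follow the commonly published CURB-65 definition.

```python
from dataclasses import dataclass


@dataclass
class AdmissionObs:
    """Hypothetical admission record; field names are assumptions, not the dataset schema."""
    age_years: int
    confusion: bool          # new-onset confusion, as recorded at assessment
    urea_mmol_l: float       # blood urea
    resp_rate_per_min: int   # respiratory rate
    systolic_bp_mmhg: int
    diastolic_bp_mmhg: int


def curb65(obs: AdmissionObs) -> int:
    """Return the CURB-65 score (0 to 5): one point per criterion met."""
    criteria = [
        obs.confusion,
        obs.urea_mmol_l > 7.0,
        obs.resp_rate_per_min >= 30,
        obs.systolic_bp_mmhg < 90 or obs.diastolic_bp_mmhg <= 60,
        obs.age_years >= 65,
    ]
    return sum(criteria)


# Illustrative values only, not drawn from the dataset.
example = AdmissionObs(age_years=72, confusion=False, urea_mmol_l=8.1,
                       resp_rate_per_min=24, systolic_bp_mmhg=118,
                       diastolic_bp_mmhg=70)
print(curb65(example))  # urea and age criteria are met, so this prints 2
```

How confusion is actually recorded in the extract (for example, whether it can be derived from the AVPU score) would need to be confirmed against the data dictionary supplied with the dataset.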

Available supplementary data: Matched synthetic controls; ambulance, OMOP data, real patient CAP data.

Available supplementary support: Analytics, Model build, validation & refinement; A.I.; Data partner support for ETL (extract, transform & load) process, Clinical expertise, Patient & end-user access, Purchaser access, Regulatory requirements, Data-driven trials, “fast screen” services.

Dataset type
Health and disease, Treatments/Interventions
Dataset sub-type
Respiratory
Dataset population size
15000

Keywords

Synthetic data, age, Ethnicity, socioeconomic data, multi-morbidity, comorbidity, severity, CURB-65, Oxygenation, pulse, oxygen saturations, blood pressure, hospital length of stay, ward admission, ITU, treatments, antibiotics, Outcomes, Mortality, morbidity, machine learning, community acquired pneumonia, offline reinforcement learning, Critical Care, CAP, Artificial Intelligence, Healthcare, Algorithm, diagnosis, Hospitalisation, smoking status

Observations

Observed Node: Persons
Disambiguating Description: 15,000 synthetic admissions between 01/01/2018 and 08/06/2021
Measured Value: 15000
Measured Property: Count
Observation Date: 11 Sep 2021

Provenance

Purpose of dataset collection
Study
Source of data extraction
Other
Collection source setting
Other
Patient pathway description
Data is representative of the multi-ethnic population within the West Midlands (42% non-white). Data includes all patients admitted during this timeframe, with National Data Opt-Outs applied, and is therefore representative of admissions to secondary care. Data focuses on the in-patient hospital stay during the acute episode but can be supplemented on request to include previous and subsequent hospital contacts (including outpatient appointments) and ambulance, 111 and 999 data.
Image contrast
Not stated
Biological sample availability
None/not available

Structural Metadata

Details

Publishing frequency
Static
Version
1.0.0
Modified

08/10/2024

Distribution release date

09/12/2021


Coverage

Start date

01/01/2018

End date

07/06/2021

Time lag
Other
Geographic coverage
United Kingdom, England, West Midlands
Minimum age range
18
Maximum age range
110
Follow-up
0 - 6 Months
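
A quick sanity check on an extract is to confirm that each admission falls inside the coverage stated above. The sketch below is one way to do that, assuming dates arrive as DD/MM/YYYY strings (the style used in this listing; the actual CSV format is not stated here). The function and constant names are illustrative. The Observations entry quotes 08/06/2021 as the end of the admission window, one day later than the Coverage end date used here, so `COVERAGE_END` may need adjusting to match the extract.

```python
from datetime import date, datetime

# Bounds copied from the Coverage section of this listing.
COVERAGE_START = date(2018, 1, 1)
COVERAGE_END = date(2021, 6, 7)
MIN_AGE, MAX_AGE = 18, 110


def parse_uk_date(text: str) -> date:
    """Parse a DD/MM/YYYY string (assumed format, see the note above)."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def within_coverage(admission_date: str, age_years: int) -> bool:
    """True if an admission falls inside the stated date and age ranges."""
    admitted = parse_uk_date(admission_date)
    return (COVERAGE_START <= admitted <= COVERAGE_END
            and MIN_AGE <= age_years <= MAX_AGE)


# Illustrative values only, not drawn from the dataset.
print(within_coverage("15/03/2020", 67))
print(within_coverage("15/03/2020", 17))
```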

Accessibility

Language
en
Alignment with standardised data models
LOCAL
Controlled vocabulary
SNOMED CT, ICD10
Format
CSV

Data Access Request

Dataset pipeline status
Available
Time to dataset access
1-2 months
Access request cost
www.pioneerdatahub.co.uk/data/data-services-costs/
Access method category
TRE/SDE
Access service description

Trusted Research Environments (TREs) are built using Microsoft Azure services and hosted in the UK to provide research teams with a safe, secure and agile environment which allows users to quickly analyse, interpret and form an enriched view of primary care information through a range of integrated datasets.

Health data collated from multiple sources is ingested into a secure data lake, from which subsets of data can be made available to research teams on approval of a data request. Once a request is approved, a customer-specific TRE is made available with a standard set of leading analytical tools from Microsoft, including Azure Databricks, Azure Machine Learning, Azure SQL and Azure Synapse (for large-scale data warehouses). Specific tools can be provided at an additional cost over the standard platform data access charge, and the PIONEER team will work with you to determine your exact needs.

Access to the TRE is managed using the latest virtual desktop technology to provide a safe and secure end-user experience. By utilising leading-edge design, PIONEER is able to create TREs rapidly to service any customer requirement.

Jurisdiction
GB-ENG
Data use limitation
General research use, Commercial research use
Data use requirements
Project-specific restrictions
Data Controller
University Hospitals Birmingham NHS Foundation Trust
