Cohort Discovery

Learn about Cohort Discovery and gain access to the service


How Cohort Discovery keeps data secure

Designed for privacy and trust

The Cohort Discovery Service has been built with security and confidentiality at its core. It allows researchers to discover data cohorts safely, without ever accessing or exposing patient-level information.

How a query works

Build a query
Researchers use the query builder to define a cohort that describes the patient population required for their research.
The query is securely processed
Researchers’ queries are run securely in real time across multiple pseudonymised datasets.
The query runs locally
Queries are run inside each data custodian’s secure environment on a separate secure network. They run on a prepared de-identified subset of the dataset, not on patient identifiable data.
The results are returned
Only an aggregated, rounded count of matching individuals is returned. No individual-level data ever leaves any of the secure data environments.

How data is protected

No identifiable data is used
Personal details such as names, addresses, and exact dates are removed, so the data queried is pseudonymised. The service processes only a summary count of how many records match the search criteria, and no other data.
Rounding and low number suppression
Aggregate results are rounded, and low-number cohorts are suppressed, to help protect confidentiality.
Secure environments with separated network areas
All queries run within each data custodian’s secure infrastructure, on secure networks separate from where patient data is held. Data never moves or leaves the host organisation.
Outbound only connections
Software inside the secure data environment connects out to the service to check for available queries. No inbound connections are enabled, which keeps the network secure.
Standardised and comparable data
Datasets are harmonised using the OMOP Common Data Model so queries can run across multiple datasets comparably.
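
The outbound-only pattern described above can be sketched in a few lines of Python. This is an illustrative, in-memory model only, not the service's actual software: the class names (CentralService, CustodianAgent), the Query fields and the age-based criterion are all hypothetical. The point it shows is that the agent inside the secure environment always initiates contact, and the central service never calls in.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Query:
    query_id: int
    min_age: int  # hypothetical predefined field, for illustration only


class CentralService:
    """Stand-in for the query platform: holds queries and counts, never patient data."""

    def __init__(self):
        self._pending = deque()
        self.results = {}

    def submit(self, query):
        self._pending.append(query)

    def fetch_next(self):
        # Called BY the custodian's agent, i.e. an outbound request
        # from the secure environment. The service never calls the agent.
        return self._pending.popleft() if self._pending else None

    def post_result(self, query_id, count):
        self.results[query_id] = count


class CustodianAgent:
    """Runs inside the secure environment and only ever initiates connections."""

    def __init__(self, service, deidentified_ages):
        self._service = service
        self._ages = deidentified_ages  # prepared de-identified subset; stays local

    def poll_once(self):
        query = self._service.fetch_next()
        if query is None:
            return False  # nothing to do this time
        count = sum(1 for age in self._ages if age >= query.min_age)
        # Suppression and rounding, as described in the text, would be
        # applied here before anything is sent; omitted to keep the sketch short.
        self._service.post_result(query.query_id, count)
        return True
```

In this model the only things that cross the boundary are the query definition on the way in and a single number on the way out; the list of records never leaves the agent.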

Built on the Five Safes Framework
The Cohort Discovery Service follows the Five Safes Framework to ensure safe and secure access to information about the available data across a range of secure environments:

Safe People: Only approved users who have a verified institutional email address and an appropriate role within their organisation can access Cohort Discovery.
Safe Projects: The terms and conditions require acceptance that any use of Cohort Discovery will be for the sole purpose of understanding the size of populations relevant to a particular research question and that any resulting study or clinical trial will be for public benefit as defined by the National Data Guardian.
Safe Settings: Each secure system involved in the Cohort Discovery process complies with data protection regulations and undergoes regular security audits, assessments and accreditations. Queries using Cohort Discovery are sent securely and are built from a set of predefined fields.
Safe Data: The data queried by Cohort Discovery is a separate subset of the full patient data that has been pseudonymised. The data is held in isolated secure environments and not transferred outside of this environment.
Safe Outputs: Data Custodians retain full control of the patient data they hold, and only the minimum information required to answer a user query is securely transferred back to the user. Cohort Discovery only returns aggregate and rounded counts of how many records match the search criteria posed by the user (the cohort size). No data itself is ever shared or transferred outside the secure environment. To reduce the risk of identification, data custodians apply privacy controls including low number suppression and result rounding. In most cases, results are only returned when more than 10 people match the criteria and counts may be rounded to avoid revealing exact numbers. For example, a result may show 110 matches instead of the exact number of 107. All activity in Cohort Discovery is also logged through an audit trail to monitor usage and prevent misuse.
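
As a minimal sketch of the low number suppression and rounding described under Safe Outputs, the Python function below withholds small counts and rounds the rest. The threshold of 10 and the example of 107 becoming 110 come from the text above; rounding to the nearest 10 is an assumption, as are the function and constant names. Each data custodian sets its own rules, so this is not the service's actual implementation.

```python
from typing import Optional

SUPPRESSION_THRESHOLD = 10  # from the text: results need more than 10 matches
ROUNDING_BASE = 10          # assumed: round to the nearest 10


def protect_count(exact: int,
                  threshold: int = SUPPRESSION_THRESHOLD,
                  base: int = ROUNDING_BASE) -> Optional[int]:
    """Return a disclosure-controlled count, or None if the result is suppressed."""
    if exact <= threshold:
        return None  # too few people match: return no result at all
    # Integer arithmetic rounds halves upwards (105 -> 110) and avoids the
    # round-half-to-even behaviour of Python's built-in round().
    return ((exact + base // 2) // base) * base
```

With these assumed settings, protect_count(107) gives 110, matching the example above, and protect_count(8) gives None because the cohort is too small to report.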

Cohort Discovery was developed, and continues to be enhanced, with the ethical and secure use of sensitive data at the heart of its design. User queries are constructed and submitted in a platform that does not hold any sensitive data, and the queries can only be constructed from approved predefined fields. Once ready, the query is encrypted and sent securely to separate, isolated secure environments. The query is not run against the source patient data. Instead, it is run against a subset of data specially prepared for Cohort Discovery. This subset has been formatted and harmonised to make Cohort Discovery queries possible. It has also been de-identified, which removes personal information such as names, addresses and dates of birth, and pseudonymised, which replaces this personal information with unique identifiers that cannot be linked back to the individual the data belongs to. The data subset remains securely within a computer managed by the Data Custodian, and only a count of the number of records in the cohort (or group) that meet the criteria of the user’s query is returned.
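
To illustrate what "constructed from approved predefined fields" can mean in practice, here is a small allow-list check in Python. The field names, operators and the shape of a criterion are hypothetical and chosen only for the example; the real service's fields and query format are not described on this page.

```python
# Hypothetical allow-list: each permitted field maps to its permitted operators.
ALLOWED_FIELDS = {
    "sex": {"equals"},
    "age": {"at_least", "at_most"},
    "condition": {"equals"},
}


def validate_query(criteria):
    """Raise ValueError if any criterion uses a field or operator outside the allow-list."""
    for criterion in criteria:
        field = criterion.get("field")
        operator = criterion.get("operator")
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"field not permitted: {field!r}")
        if operator not in ALLOWED_FIELDS[field]:
            raise ValueError(f"operator {operator!r} not permitted for field {field!r}")
    return True
```

A query such as [{"field": "age", "operator": "at_least", "value": 65}] passes this check, while a criterion on a field like "postcode" is rejected before it is ever sent to a secure environment.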

Cohort Discovery follows the Five Safes principles to ensure safe and secure access to information about the available data across a range of research environments. The framework focusses on five key areas:

Safe People: Only registered users who have a verified institutional email address and an appropriate role within their organisation can access Cohort Discovery.

Safe Projects: The terms and conditions require acceptance that any use of Cohort Discovery will be for the sole purpose of understanding the size of populations relevant to a particular research question and that any resulting study or clinical trial will be for public benefit as defined by the National Data Guardian.

Safe Settings: Each secure system involved in the Cohort Discovery process complies with data protection regulations and undergoes regular security audits, assessments and accreditations. Queries using Cohort Discovery are sent securely and are built from a set of predefined fields.

Safe Data: The data that is queried by Cohort Discovery is a separate subset of the full patient data that has been pseudonymised. The data is held in isolated secure environments, and not transferred outside of this environment.

Safe Outputs: The Data Custodians retain full control over the sensitive patient data, and this process ensures only the minimum information required to answer the user’s query is securely transferred back to the user. Cohort Discovery only returns a count of the size of the population in a cohort (group) that matches the user’s criteria; no data itself is transferred. To protect against identification by a unique set of characteristics, each Data Custodian must set a threshold below which no results are returned. For most, this means only results with more than 10 people are returned. Also, to protect against someone asking multiple questions and subtracting counts to identify someone, each Data Custodian can enable rounding on results. This ensures results are never exact but are sufficiently precise to allow the researcher to know the scale of the dataset. For example, a result might say there are 110 people in the dataset instead of the exact number of 102. An audit trail further tracks usage to prevent misuse.
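
The "subtracting counts" risk mentioned under Safe Outputs can be shown with a short Python sketch. The two counts are made-up numbers for two nearly identical queries that differ by one person, and rounding to the nearest 10 is an assumption; a custodian's actual rounding rule may differ.

```python
def round_to_base(count, base=10):
    """Round to the nearest multiple of base (assumed rule, halves round up)."""
    return ((count + base // 2) // base) * base


def differences(count_a, count_b, base=10):
    """Return (exact difference, difference between the rounded counts)."""
    exact = count_a - count_b
    rounded = round_to_base(count_a, base) - round_to_base(count_b, base)
    return exact, rounded


# Hypothetical pair of queries: the second adds one extra exclusion
# that happens to remove a single person from the cohort.
exact_diff, rounded_diff = differences(107, 106)
```

Here exact_diff is 1, which would reveal that exactly one person matches the extra criterion, while rounded_diff is 0 because both counts are reported as 110. Rounding blurs such differences rather than eliminating them (counts of 105 and 104 would round to 110 and 100 under this rule), which is one reason it is combined with suppression thresholds and an audit trail.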

Cohort Discovery, originally developed as part of the CO-CONNECT programme, supported researchers to find and access COVID-19 data at pace, while ensuring people’s information was kept private and secure. If you would like to learn more about how CO-CONNECT was developed for rapid discovery of, and access to, COVID-19 data, please read the CO-CONNECT Hybrid Architecture paper.