Cohort Discovery
About Cohort Discovery
Cohort Discovery helps researchers and innovators to rapidly determine the size of the population which may be relevant to their research question within different datasets, without having to directly contact the individual organisations that hold the data.
Users can specify defined characteristics relevant to their proposed analysis (e.g. the number of female asthmatics between the ages of 25 and 35) through the Cohort Discovery user interface. These search terms are then sent as a real-time query to multiple pseudonymised datasets across multiple Data Custodians, with results returned in the form of a numerical count of individuals that meet those specific criteria.
Researchers can then understand whether a dataset contains a cohort (or group) of interest and if so, contact the Data Custodian to find out more or submit a Data Access Request. The Gateway enables researchers and innovators to easily submit standardised requests to multiple custodians identified through Cohort Discovery. This saves time and effort for both the researcher and the Data Custodian.
How you can request access to Cohort Discovery
We use a proportionate governance approach based on the Five Safes Framework. To access Cohort Discovery, you must demonstrate your Safe People status as a researcher, NHS analyst or equivalent. This will be assessed based on your Gateway registered user profile and, if relevant, your ORCID record.
Your application will confirm that your use of Cohort Discovery will be for the sole purpose of understanding the size of populations relevant to a particular research question (i.e. that it is a Safe Project) and that any resulting study or clinical trial will be for public benefit as defined by the National Data Guardian for health and social care.
If, after your application, your Safe People or Safe Project status is indeterminate, we will contact you for further information and reserve the right not to provide access.
How Cohort Discovery works and keeps data secure and confidential
A Cohort Discovery Data Custodian is an organisation that holds data and has enabled querying of that data for the purposes of Cohort Discovery. Within each Data Custodian environment, a secure network area is set up which is separate from where identifiable data is stored, but still within the Data Custodian’s secure environment. The Data Custodian creates a copy of relevant data and removes anything they deem to be sensitive, identifiable or surplus to requirements.
This is known as pseudonymous data. For example, information like names, addresses, and specific dates of birth, dates of testing or care are removed, and identifiers are converted into new pseudonymous codes.
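As a rough illustration of the step described above, the following Python sketch drops direct identifiers from a record and replaces the original identifier with a pseudonymous code. It is not the method any Data Custodian actually uses: the field names, the `pseudonymise` function, the key value and the choice of a keyed hash (HMAC-SHA256) are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical field names; each Data Custodian decides what counts as
# sensitive, identifiable or surplus to requirements.
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth", "test_date"}


def pseudonymise(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` without direct identifiers.

    The original identifier is replaced by a keyed hash, so the same person
    always maps to the same code, but the code cannot be reversed without
    the secret key (assumed to be held only by the Data Custodian).
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    digest = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    )
    cleaned["pseudo_id"] = digest.hexdigest()
    return cleaned


# Made-up source record, for illustration only.
source = {
    "patient_id": "12345",
    "name": "Example Person",
    "address": "1 Example Street",
    "date_of_birth": "1995-04-02",
    "test_date": "2023-01-15",
    "sex": "female",
    "condition": "asthma",
}

pseudonymised = pseudonymise(source, secret_key=b"custodian-held-secret")
print(sorted(pseudonymised))
```

Only the non-identifying fields and the new pseudonymous code remain in the copy; the source record itself is left unchanged.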
The data is converted into the OMOP data standard, so that data from one Data Custodian is comparable to another, thus enabling multiple Data Custodians to be queried simultaneously. The Data Custodian then transfers this OMOP pseudonymous data on to the secure network area that is separated from identifiable data locations.
Software within the secure network area of the Data Custodian sends a message out to the Cohort Discovery application within the Gateway to retrieve any questions that need to be run on the data. An example question could be “How many people in the dataset are female asthmatics between the ages of 25 and 35?” The query is constructed securely from a set of predefined fields. The user’s query is processed by the Cohort Discovery software within the Data Custodian secure network area. Based on the results, the user can then decide to contact the Data Custodians that hold cohorts of interest for more information, or submit a feasibility enquiry or a Data Access Request, all from within the Gateway interface.
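The idea of building a query only from predefined fields, and answering it with a count, can be sketched in Python as follows. This is a simplified illustration, not the Cohort Discovery software: the `ALLOWED_VALUES` table, the `CohortQuery` class, both functions and the sample records are hypothetical, the age range is assumed to be inclusive, and a real deployment queries data held in the OMOP standard rather than Python dictionaries.

```python
from dataclasses import dataclass

# Hypothetical set of approved, predefined fields and values.
ALLOWED_VALUES = {
    "sex": {"female", "male"},
    "condition": {"asthma", "diabetes"},
}


@dataclass(frozen=True)
class CohortQuery:
    sex: str
    condition: str
    min_age: int
    max_age: int


def build_query(sex: str, condition: str, min_age: int, max_age: int) -> CohortQuery:
    """Build a query, rejecting anything outside the predefined fields."""
    if sex not in ALLOWED_VALUES["sex"]:
        raise ValueError(f"sex must be one of {sorted(ALLOWED_VALUES['sex'])}")
    if condition not in ALLOWED_VALUES["condition"]:
        raise ValueError(f"condition must be one of {sorted(ALLOWED_VALUES['condition'])}")
    if not 0 <= min_age <= max_age:
        raise ValueError("age range must satisfy 0 <= min_age <= max_age")
    return CohortQuery(sex, condition, min_age, max_age)


def count_matches(records: list, query: CohortQuery) -> int:
    """Count the records that meet every criterion; no record is returned."""
    return sum(
        1
        for record in records
        if record["sex"] == query.sex
        and query.condition in record["conditions"]
        and query.min_age <= record["age"] <= query.max_age
    )


# Made-up pseudonymised records, for illustration only.
records = [
    {"sex": "female", "conditions": {"asthma"}, "age": 29},
    {"sex": "female", "conditions": {"asthma", "diabetes"}, "age": 35},
    {"sex": "male", "conditions": {"asthma"}, "age": 30},
    {"sex": "female", "conditions": {"diabetes"}, "age": 27},
    {"sex": "female", "conditions": {"asthma"}, "age": 41},
]

query = build_query("female", "asthma", 25, 35)
exact_count = count_matches(records, query)
print(exact_count)  # prints 2
```

Because the query can only be assembled from approved values, free-text input such as a name is rejected before anything is sent. In practice a count this small would be withheld by the threshold described under Safe Outputs.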
Cohort Discovery was developed, and continues to be enhanced, with the ethical and secure use of sensitive data at the heart of its design. User queries are constructed and submitted in a platform that does not hold any sensitive data, and the queries can only be constructed from approved predefined fields. Once ready, the query is encrypted and sent securely to separate, isolated secure environments. The query is not run against the source patient data; rather, it is run against a subset of data specially prepared for Cohort Discovery. Not only has the data been formatted and harmonised to make Cohort Discovery queries possible, but it has also been de-identified, to remove personal information such as names, addresses or dates of birth, and pseudonymised, which replaces this personal information with unique identifiers that cannot be linked back to the individual to whom the data belongs. This data subset remains securely within a computer managed by the Data Custodian; only a count of the number of records in that cohort (or group) that meet the criteria of the user’s query is returned.
Cohort Discovery follows the Five Safes principles to ensure safe and secure access to information about the available data across a range of research environments. The framework focusses on five key areas:
Safe People: Only registered users who have a verified institutional email address and an appropriate role within their organisation can access Cohort Discovery.
Safe Projects: The terms and conditions require acceptance that any use of Cohort Discovery will be for the sole purpose of understanding the size of populations relevant to a particular research question and that any resulting study or clinical trial will be for public benefit as defined by the National Data Guardian.
Safe Settings: Each secure system involved in the Cohort Discovery process complies with data protection regulations and is subject to regular security audits, assessments and accreditations. Queries using Cohort Discovery are sent securely and are built from a set of predefined fields.
Safe Data: The data that is queried by Cohort Discovery is a separate, pseudonymised subset of the full patient data. The data is held in isolated secure environments and is not transferred outside of those environments.
Safe Outputs: The Data Custodians retain full control over the sensitive patient data, and this process ensures that only the minimum information required to answer the user’s query is securely transferred back to the user. Cohort Discovery only returns a count of the size of the population in a cohort (group) that matches the user’s criteria; no data itself is transferred. To protect against identification by a unique set of characteristics, each Data Custodian must set a threshold below which no results are returned. For most, this means that only results with more than 10 people are returned. Also, to protect against someone asking multiple questions and subtracting counts to identify someone, each Data Custodian can enable rounding on results, so that the counts returned are never exact but are sufficiently precise to allow the researcher to know the scale of the dataset. For example, it might say there are 110 people in the dataset instead of the exact number of 102. An audit trail further tracks usage to prevent misuse.
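The two Safe Outputs protections, low-number suppression and rounding, can be illustrated with a short Python sketch. The function name `disclose_count` and its parameters are hypothetical, and the exact rounding rule is an assumption: the sketch rounds up to the next multiple of 10, which reproduces the 102 to 110 example above, but each Data Custodian configures its own threshold and rounding.

```python
from typing import Optional


def disclose_count(
    exact: int, threshold: int = 10, rounding_base: int = 10
) -> Optional[int]:
    """Return the count that may be released, or None if it is suppressed.

    Counts at or below `threshold` are withheld, so only results with more
    than `threshold` people are returned. Remaining counts are rounded up
    to the next multiple of `rounding_base` (assumed rule); a base of 1
    disables rounding.
    """
    if exact <= threshold:
        return None
    if rounding_base > 1:
        # Integer arithmetic for "round up to the next multiple".
        return ((exact + rounding_base - 1) // rounding_base) * rounding_base
    return exact


print(disclose_count(102))  # prints 110
print(disclose_count(7))    # prints None
```

Returning `None` rather than zero keeps "suppressed" distinct from "no matching records", which is one possible design choice rather than documented behaviour.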
The Gateway, originally developed as part of the CO-CONNECT programme, supported researchers to find and access COVID-19 data at pace, while ensuring people’s information was kept private and secure. If you would like to learn more about the CO-CONNECT development for rapid discovery of, and access to, COVID-19 data, please read the CO-CONNECT Hybrid Architecture paper.