Using Cohort Discovery


Documentation

What is Cohort Discovery? 

Cohort Discovery is a tool that helps researchers quickly assess whether datasets held by trusted Data Custodians contain the patient populations relevant to their research. It securely sends queries to pseudonymised datasets behind each custodian’s firewall and returns aggregate counts. This allows researchers to understand dataset suitability before submitting a formal access request or more specific feasibility queries. 

Log In 

To access Cohort Discovery, go to the Health Data Research UK website (the Gateway): 

From here, either hover over and click on the ‘Cohort Discovery’ button in the middle: 

You can also click on the ‘search’ button at the top to see a menu where you can click on Cohort Discovery: 

Either route takes you to the Gateway landing page for Cohort Discovery.

To begin, click the Access Cohort Discovery button. First-time users will be prompted to register for a Gateway account – just follow the on-screen instructions on the Gateway landing page. 

To be approved, you must log in using a professional email address linked to your organisation – such as an academic, NHS, institutional, or company-issued account (including Azure or OpenAthens). Personal email accounts (e.g. @gmail, @hotmail) are not permitted for access. 

Once your registration is approved, return to Cohort Discovery using the access button. On your first login, you will see synthetic datasets only. We aim to activate your access to additional collections within one working day, based on your user group (e.g. academic, industry, public sector).

If your access hasn’t been updated after this time, please raise a support ticket using the yellow ‘Need Support?’ button in the bottom right of any Gateway screen.   

Landing Page 

When you log into Cohort Discovery, you will see the Collections Tab. 

The ‘Collections’ tab gives you an overview of all data collections that are available to you, and the variables available for querying. For this demo, we are only using synthetic collections.

 

Create New Query 

Click the blue ‘Create New Query’ button at the top right-hand side of the page. On the new query page, you can start building your Cohort Discovery query:

 

 

Start by selecting the data collections you want to query. In Cohort Discovery, a collection is equivalent to a dataset. 

Click on the “Collections” tab to view available options. You can: 

  • Select all collections at once by clicking the blue toggle (see red arrow), or 
  • Individually select/deselect collections using the tick boxes (see green arrow).

Make sure at least one collection is selected before building your query. 

 You may also want to: 

  • ‘Select Genes’ – Cohort Discovery is actively working with the community to develop genomic querying capability. Once ready, authorised users will be able to query genomics data. Users not interested in genomics data may ignore this box.
  • Rename the query to something that reflects your search, such as ‘Asthma’ or ‘Impaired glucose tolerance’. This name will be displayed in the history on the landing page.

 Add Filter 

Either of the blue ‘Add filter’ buttons brings up the query builder: 

The query builder lets you search for condition, procedure, medication, measurement and observation details, as well as the gender, age or racial elements recorded on the health record.  

These are all terms from the OMOP common data model, which has been mapped to the vocabularies of the source data (e.g. ICD-10, SNOMED, READ codes). The first set of numbers (highlighted in red) indicates the number of terms that exist in each section, while the second set of numbers (highlighted in grey) indicates the number of data collections that contain a term (or set of terms).

Search terms can be found by using the dropdown menus or by entering them into the search bar. In this example, the term ‘impaired glucose tolerance’ is entered in the search bar, and the system has returned all condition and observation records that have the term in the OMOP concept description.

Note that the search bar also accepts clinical codes (ICD-10 or OMOP concept codes) as a search term. 

Select a search term to add it to your query: 

 

 Clicking ‘Add to new group’ will add this selection to Group 1 in your query: 

A group can contain any number of parameters, although they must all be combined with the same logical operator (i.e. ‘AND’ or ‘OR’) within the group.

Click away from the term mapping and then the ‘play’ button to run the query. The query will be run against all selected OMOP data collections, each of which is hosted by the Data Custodian of that collection in their secure network areas. 

The results of your query will be displayed once the federated query has finished running. This can take up to two minutes, depending on the number of collections you are running your query against and the complexity of your query.

 Query Results 

After your query has run, you will see a count of patients for each data collection that has data relevant to the search term. You can create and run new queries immediately after launching a previous one; you do not need to wait for the previous query to resolve, because queries are queued and processed in the order they are created.

 Count Details 

The count details results table shows the collection name, external URL, total, status (of the currently running query against each data collection) and count relative to all results. 

In this example, the query has returned total counts for four synthetic data collections that have different levels of low number suppression and rounding applied. The real count is 3735 (‘Synthetic Data – No Obfuscation’), while the result returned for ‘Synthetic Data – Min 150 and rounding’ has been rounded up to 3740, because rounding has been applied to that data collection. 

Where a data collection contains age and sex information in the underlying OMOP data, the age and sex distribution will also be shown: 

Note: This guide and any related demonstration videos published on the HDR UK website use synthetic datasets that contain artificial information. We do not demonstrate the live system, which is only accessible to validated bona fide researchers.

 Disclosure Control 

Data custodians who make their datasets discoverable on Cohort Discovery implement several obfuscation processes to ensure that individuals cannot be identified: 

  • Low number suppression – counts of less than 10 are returned as 0. 
  • Rounding – all counts are rounded up to the nearest 10. 

These are both configurable by the data custodian. In addition to the rounding of results, it is not possible to query the ID of an individual person, and all OMOP data has been pseudonymised. This means that all identifiable information, such as names and addresses, has been removed, and all potentially identifiable information, such as date of birth, has been pseudonymised to a level that protects privacy (e.g. month and day of birth are removed, while year of birth is retained to calculate age at a given event).
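The two obfuscation steps above can be sketched in a few lines of Python. This is only an illustration, not the custodians' actual implementation: the function name `obfuscate_count`, its parameter names, and the order of the steps (suppress first, then round) are assumptions. Reading ‘Min 150’ in a collection name as a suppression threshold of 150 is also an assumption.

```python
def obfuscate_count(true_count: int, threshold: int = 10, rounding: int = 10) -> int:
    """Illustrative sketch of low number suppression followed by rounding up.

    threshold and rounding stand in for the custodian-configurable settings;
    the names and the order of the two steps are assumptions.
    """
    # Low number suppression: counts below the threshold are returned as 0.
    if true_count < threshold:
        return 0
    # Rounding: round up to the nearest multiple of `rounding` (integer arithmetic).
    return ((true_count + rounding - 1) // rounding) * rounding


print(obfuscate_count(3735))                 # 3740
print(obfuscate_count(7))                    # 0
print(obfuscate_count(120, threshold=150))   # 0
```

With the default settings, the real count of 3735 from the earlier example is returned as 3740, while any count below 10 is returned as 0.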

 Query Complexity 

Question Groups 

You can define different question groups, and you can have as many questions in one group as you need. By default, all query terms inside one group are combined using the ‘AND’ operator, and multiple groups are combined using the ‘OR’ operator.

You can change the AND/OR logical operator within a group by clicking the operator name ‘AND’ or ‘OR’. Toggling it also toggles the logical operator that combines the groups, meaning you will always have one of the logical situations in this image:

 Hypothetical illustration of inter- and intra-group logical operators. The first row shows the default situations with Group operators, and the second row shows the situation after the intra-group operator has been toggled to ‘OR’. Your query may look different to these but will generally conform to this logical operator structure. 
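The operator structure described above can be modelled with Python sets. This is a conceptual sketch only: Cohort Discovery returns aggregate counts and never exposes patient IDs, so the ID sets, the function names `combine` and `evaluate_query`, and the example data are all hypothetical. It also assumes that toggling the intra-group operator always sets the inter-group operator to the opposite value.

```python
from functools import reduce


def combine(sets, operator):
    """Combine sets of (hypothetical) patient IDs with 'AND' or 'OR'."""
    if not sets:
        return set()
    if operator == "AND":
        return reduce(set.intersection, sets)
    if operator == "OR":
        return reduce(set.union, sets)
    raise ValueError(f"Unknown operator: {operator}")


def evaluate_query(groups, intra_operator="AND"):
    """Evaluate groups of terms; the inter-group operator is the opposite
    of the intra-group operator (assumed from the toggle behaviour)."""
    inter_operator = "OR" if intra_operator == "AND" else "AND"
    group_results = [combine(terms, intra_operator) for terms in groups]
    return combine(group_results, inter_operator)


# Hypothetical patient IDs matching each search term.
asthma = {1, 2, 3, 4}
impaired_glucose_tolerance = {3, 4, 5}
pharyngitis = {4, 6}

groups = [[asthma, impaired_glucose_tolerance], [pharyngitis]]

# Default: (asthma AND IGT) OR pharyngitis
print(len(evaluate_query(groups)))        # 3
# Toggled: (asthma OR IGT) AND pharyngitis
print(len(evaluate_query(groups, "OR")))  # 1
```

The same terms give very different counts depending on the operator pair, so it is worth checking which situation your query is in before running it.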

 Adding terms and groups 

After locating a suitable search term, click the term to open the term-type specific options. Clicking ‘Add to new group’ creates a new question group in your query and adds the term to that group. Any subsequent additions will give an option to add any term to a completely new question group, or to an existing one.  

 In our second example, we will add Asthma to our first group and then include/exclude Acute viral pharyngitis.  

After adding Asthma, we will search again for the term Acute viral pharyngitis. If you want to add it to the same group as our primary search, click the three blue dots, as shown, and select ‘Add to Group 1’:

Inclusions and Exclusions 

Within one query Group you can have ‘Inclusion’ and ‘Exclusion’ logical modifiers to your search terms. For example, you may want to have a search that excludes all individuals who have used a certain medication, or have had a certain diagnosis, in which case you would ‘exclude’ hits for that query from your group. 

 By default, all questions within a group are added as inclusions, and you can change that by clicking the inclusion-exclusion toggle icon as follows (circled in green): 

 Here our query will return all patients who have had asthma minus those who also had acute viral pharyngitis. 

 You cannot include or exclude a whole question group, only the individual questions within a group. 
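In set terms, an exclusion within a group behaves like a set difference. The sketch below illustrates this for a group that uses the ‘AND’ operator. It is a conceptual model only: the real system returns counts rather than patient IDs, and the function name `include_exclude` and the example IDs are hypothetical.

```python
def include_exclude(inclusions, exclusions):
    """Return hypothetical patient IDs that match every inclusion term
    and none of the exclusion terms (assumes an 'AND' group)."""
    matched = set.intersection(*inclusions) if inclusions else set()
    for excluded in exclusions:
        matched = matched - excluded
    return matched


# Hypothetical patient IDs for each term.
asthma = {101, 102, 103, 104}
acute_viral_pharyngitis = {103, 104, 105}

# Asthma included, acute viral pharyngitis excluded.
cohort = include_exclude([asthma], [acute_viral_pharyngitis])
print(len(cohort))  # 2
```

Patients 103 and 104 have both conditions, so they are removed, leaving two patients with asthma only.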

 Secondary Modifiers 

Some question types accept age and time modifiers to control the time scale for possible answers. You can define either an age or a time modifier for your question, but not both. Age and time modifiers must be applied to the event, not the person; they will not work when applied to the person.

 

The modifier can be added to the query either during the selection of the query term (in the Query Builder) or once the term has been added to the Group by clicking the ellipsis button for the Group and selecting ‘Edit’, like so: 

 

Here the age modifiers are set for each question within the Group. 
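To illustrate what applying an age modifier to the event (rather than the person) means, the sketch below counts people whose age at the time of a matching event falls within a range. It is a hypothetical model, not the system's implementation: the `Event` record, the function name, the inclusive age bounds, and the example data are all assumptions. Age is approximated as event year minus year of birth, since only the year of birth is retained in the pseudonymised data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified clinical event record."""
    person_id: int
    concept: str
    event_year: int


def count_with_age_modifier(events, birth_years, concept, min_age, max_age):
    """Count distinct people with a matching event that occurred while
    their approximate age was between min_age and max_age (inclusive)."""
    matching_people = set()
    for event in events:
        if event.concept != concept:
            continue
        # Approximate age at the event from year of birth only.
        age_at_event = event.event_year - birth_years[event.person_id]
        if min_age <= age_at_event <= max_age:
            matching_people.add(event.person_id)
    return len(matching_people)


birth_years = {1: 1980, 2: 2010, 3: 1950}
events = [
    Event(1, "Asthma", 2000),  # age 20
    Event(2, "Asthma", 2015),  # age 5
    Event(3, "Asthma", 2020),  # age 70
    Event(1, "Asthma", 2021),  # age 41, same person as the first event
]

print(count_with_age_modifier(events, birth_years, "Asthma", 18, 65))  # 1
```

Only person 1 has an asthma event between the ages of 18 and 65, and that person is counted once even though two of their events qualify.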

 History 

Click on the ‘History’ tab to see all your previous queries, including the query name, when it started, when it executed, its status, its owner and whether it has been saved.

 

The ellipsis symbols under Actions enable you to edit, rerun or remove a query.

New queries are created by hitting the ‘Create New Query’ blue button on the top right-hand side. As you build and run new queries, they will appear in the history tab as a sortable list. 

 Additional Guide 

This is a quick guide to get you started using Cohort Discovery on the Gateway. A related guide produced by BC Partners can be downloaded here.

An additional guide produced by BC Partners (January 2024) can be downloaded here. 


FAQs

Who can access Cohort Discovery?

Access is available to approved researchers from academia, public sector, industry, and international organisations. Your level of access may vary based on your organisation type and the permissions set by individual data custodians.

What data can I search using Cohort Discovery?

You can explore pseudonymised patient counts across datasets provided by participating data custodians. Each dataset reflects what has been made available for discovery based on the data partner's governance, format, and permissions.

Can I use Cohort Discovery to run live analyses or download data?

No. Cohort Discovery is for feasibility assessment only. It returns aggregate counts—not individual-level or downloadable data. If you wish to request access to the actual data, you must submit a formal request via the Gateway.

I’m based outside the UK—can I use Cohort Discovery?

Yes. Some data custodians allow access to international users, while others restrict access to UK-based researchers only. The system automatically filters what you can access based on your user group.

Still can’t find what you’re looking for?

The quickest way to get your issue solved is through the links above, but if you aren’t able to find a solution then contact us here:

Contact support