Quality control of the Our Future Health genotype data to support high quality release to the research community
Safe People
Genomics plc
Vincent Plagnol
Safe Projects
Much of the scientific community’s future work with Our Future Health (OFH) will rely on high-quality genetic data that has gone through thorough quality control. An important step in generating such research-appropriate data is imputation, a process that uses large sets of reference samples to provide a broader assessment of participants’ genetic profiles. Genomics plc collaborates with OFH to generate these high-quality data releases. Much of the preparation work can be done without accessing the OFH genotype data. However, final decisions on which samples and genetic variants to include are best informed by the data themselves, so that the normal ranges for quality metrics can be understood. The aim of this application is therefore to give Genomics plc access to the OFH genotype data so that the quality control and imputation steps can be finalised. As part of this application, the Genomics plc team will review the OFH data and compute and summarise standard quality metrics that will inform its main data processing pipelines. These choices will support the science community’s future work with OFH.
Quality control and careful data processing are essential to support OFH’s mission. These processing steps rely on defining the features of good-quality data, both at the level of the sample and for each genetic variant tested. Choosing the right quality metrics, and the acceptance cutoffs for them, is a key part of the quality control process. While scientists with domain expertise have a sense of the acceptable ranges, these values are often dataset-specific, because the technology used and the ancestry composition of the cohort affect the expected values of these metrics. As part of this project, the Genomics plc team will directly access the genotype data, compute key quality metrics and assess their acceptable ranges.
This work will identify data trends and potentially low-quality batches, and will inform the choices that eventually feed into the Genomics plc data processing pipelines. The result will be improved data processing and, eventually, better data delivery to support all research projects that rely on OFH genetic data.
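As an illustration only, the idea of computing a quality metric and deriving a dataset-specific acceptable range can be sketched as follows. The metric (call rate, the fraction of non-missing genotype calls), the cutoff rule (a number of median absolute deviations below the median), the `None` encoding of a missing call, the function names and the toy data are all assumptions made for this sketch. None of them is taken from the OFH or Genomics plc pipelines.

```python
from statistics import median

MISSING = None  # assumed encoding of a failed genotype call (hypothetical)


def call_rate(calls):
    """Fraction of non-missing genotype calls in a sequence."""
    return sum(c is not MISSING for c in calls) / len(calls)


def sample_call_rates(genotypes):
    """Per-sample call rates; `genotypes` holds one row per sample."""
    return [call_rate(row) for row in genotypes]


def variant_call_rates(genotypes):
    """Per-variant call rates, computed column-wise."""
    return [call_rate(col) for col in zip(*genotypes)]


def low_outliers(values, n_mads=3.0):
    """Indices of values more than `n_mads` median absolute deviations
    below the median, so the cutoff is derived from the data themselves."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    cutoff = med - n_mads * mad
    return [i for i, v in enumerate(values) if v < cutoff]


# Toy matrix: 6 samples (rows) by 4 variants (columns), genotypes coded 0/1/2.
genotypes = [
    [0, 1, 2, 0],
    [0, MISSING, 1, 0],
    [1, 1, 2, 0],
    [0, 1, MISSING, 1],
    [2, 0, 1, 0],
    [MISSING, MISSING, MISSING, MISSING],
]

rates = sample_call_rates(genotypes)
flagged = low_outliers(rates)
print("sample call rates:", rates)
print("flagged sample indices:", flagged)
```

Because the cutoff is computed from the observed distribution rather than fixed in advance, the same rule adapts to cohorts whose typical metric values differ, which is the dataset-specific behaviour the summary describes.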
This work will support the release of curated and imputed genetic data to the research community. Higher-quality data will facilitate downstream research, avoid misleading findings caused by technical issues, and make it easier and faster for researchers across the world to make discoveries based on the OFH data. The alternative, in which multiple independent research teams each define their own data processing, would be counterproductive and waste resources. This motivates the use of the OFH data to generate the most accurate genetic data releases in a centralised manner, which this project will facilitate.
Public Health Research
Safe Data
Our Future Health Genotype Array Data
Safe Setting