Multi-Sample Variants Format

If your source data is jointly called (for example, 1,000 Genomes), it will already be in “multi-sample variants” format when it is exported from the Variants API to Google BigQuery.

If your source data is in single-sample gVCF or Complete Genomics masterVar format, this page describes some ways to convert it to multi-sample variants format.

Overview

Suppose you have imported your single-sample files to the Variants API and exported them to BigQuery. Let’s refer to this original table as the “genome calls” table. It contains all reference calls and variant calls.

To facilitate variant-centric analysis like that shown for the BigQuery 1,000 Genomes samples, we can generate a second table, the “multi-sample variants” table. The multi-sample variants table resembles a multi-sample VCF file. In this table:

  • every variant record includes calls for all callsets
  • variants that contain only reference calls for all callsets are omitted

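The two properties above can be illustrated with a minimal Python sketch. It is not the transformation used by any of the tools on this page; the row layout, the field names (`callset`, `alt`, `genotype`), and the use of `(-1, -1)` for a callset with no overlapping record are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical, simplified "genome calls" rows: one row per callset per record.
# alt is None for reference-matching (non-variant) segments.
# Genotype values: 0 = reference allele, 1 = the alternate allele.
genome_calls = [
    {"callset": "S1", "reference_name": "chr13", "start": 33628137,
     "end": 33628138, "alt": "G", "genotype": (0, 1)},
    {"callset": "S2", "reference_name": "chr13", "start": 33628000,
     "end": 33629000, "alt": None, "genotype": (0, 0)},
    {"callset": "S1", "reference_name": "chr13", "start": 33630000,
     "end": 33631000, "alt": None, "genotype": (0, 0)},
    {"callset": "S2", "reference_name": "chr13", "start": 33630000,
     "end": 33631000, "alt": None, "genotype": (0, 0)},
]


def to_multi_sample_variants(rows):
    """Build one record per variant, holding a call for every callset."""
    callsets = sorted({r["callset"] for r in rows})
    variants = defaultdict(dict)

    # Seed one record per distinct variant with the calls that carry it.
    # Records made up only of reference calls never get a key, so they are omitted.
    for r in rows:
        if r["alt"] is not None:
            key = (r["reference_name"], r["start"], r["end"], r["alt"])
            variants[key][r["callset"]] = r["genotype"]

    for (name, start, end, _alt), calls in variants.items():
        # Add reference calls from non-variant segments that cover the variant.
        for r in rows:
            covers = (r["alt"] is None
                      and r["reference_name"] == name
                      and r["start"] <= start
                      and r["end"] >= end)
            if covers and r["callset"] not in calls:
                calls[r["callset"]] = r["genotype"]
        # Assumption: a callset with no overlapping record is stored as a no-call.
        for c in callsets:
            calls.setdefault(c, (-1, -1))

    return dict(variants)


multi_sample_variants = to_multi_sample_variants(genome_calls)
```

In this toy input, the single variant record ends up with calls for both `S1` and `S2`, and the segment starting at 33630000, which has only reference calls, produces no record.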
Motivation

Data from source files in genome VCF (gVCF) format or in Complete Genomics format can be challenging to query due to the presence of non-variant segment records.

For example, to look up rs9536314 in the Klotho gene, the WHERE clause

WHERE
  reference_name = 'chr13'
  AND start = 33628137

becomes

WHERE
  reference_name = 'chr13'
  AND start <= 33628137
  AND end >= 33628138

to capture not only that variant, but also any other records that overlap that genomic position.
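To see why the equality test is not enough, here is a small Python sketch of the same two predicates. The record layout and the function names are hypothetical, and the sketch assumes `start` is inclusive and `end` is exclusive, which is consistent with the WHERE clause above.

```python
def matches_exact_start(record, reference_name, position):
    """Equivalent of: reference_name = ... AND start = position."""
    return (record["reference_name"] == reference_name
            and record["start"] == position)


def overlaps_position(record, reference_name, position):
    """Equivalent of: reference_name = ... AND start <= position AND end >= position + 1."""
    return (record["reference_name"] == reference_name
            and record["start"] <= position
            and record["end"] >= position + 1)


# Hypothetical records: one variant call and one non-variant segment
# from a different sample that spans the same position.
records = [
    {"callset": "S1", "reference_name": "chr13",
     "start": 33628137, "end": 33628138},
    {"callset": "S2", "reference_name": "chr13",
     "start": 33628000, "end": 33629000},
]

exact = [r["callset"] for r in records
         if matches_exact_start(r, "chr13", 33628137)]
overlapping = [r["callset"] for r in records
               if overlaps_position(r, "chr13", 33628137)]
```

Here `exact` contains only `S1`, while `overlapping` contains both `S1` and `S2`, because the non-variant segment for `S2` starts before the position of interest.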

Suppose we want to calculate an aggregate for a particular variant, such as the number of samples with the variant on one or both alleles and the number of samples that match the reference. The WHERE clause above will do the trick. But what if we want to do this for all SNPs in our dataset?
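As a rough illustration of the per-variant aggregate described above, the following Python sketch counts samples by genotype category at a single site. The input mapping, the function name, and the treatment of negative genotype values as no-calls are assumptions for this example, not part of any table schema shown on this page.

```python
def summarize_site(calls):
    """Count samples by genotype category at one site.

    calls maps a sample name to its genotype, a tuple of allele indexes
    where 0 is the reference allele and values above 0 are alternates.
    """
    counts = {"variant": 0, "reference": 0, "no_call": 0}
    for genotype in calls.values():
        if any(allele > 0 for allele in genotype):
            # Variant on one or both alleles.
            counts["variant"] += 1
        elif all(allele == 0 for allele in genotype):
            # Every allele matches the reference.
            counts["reference"] += 1
        else:
            # Assumption: negative values mark alleles that were not called.
            counts["no_call"] += 1
    return counts


# Hypothetical calls at one site, already gathered with the overlap test.
site_calls = {
    "S1": (0, 1),
    "S2": (1, 1),
    "S3": (0, 0),
    "S4": (-1, -1),
}
summary = summarize_site(site_calls)
```

Computing this for one site is cheap once the overlapping records are gathered. Repeating the overlap lookup for every SNP against the genome calls table is what makes the dataset-wide version costly.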

Examples

There are a few ways to generate the multi-sample variants table for use in variant-centric analyses such as a genome-wide association study (GWAS).

A note about scaling: as the number of samples increases, so does the number of private and rare variants. Eventually the table consists largely of rows in which most genotypes are 0/0. We are experimenting with alternate transformations. Comment on this issue if you want a pointer to the most recent prototype.

Need more help? Please see https://cloud.google.com/genomics/support.