Reference Genomes

Reference genomes such as GRCh37, GRCh37lite, GRCh38, hg19, hs37d5, and b37 are available on Google Cloud Platform.

Google Cloud Platform data locations

Provenance

GRCh37

Genome Reference Consortium Human Build 37 includes data from 35 gzipped FASTA files.

More information on this source data can be found in this NCBI article and in the FTP README.

GRCh37lite

GRCh37lite is a subset of the full GRCh37 reference set plus the human mitochondrial genome reference sequence in one file: GRCH37-lite.fa.gz

More information on this source data can be found in the FTP README.

GRCh38

Genome Reference Consortium Human Build 38 includes data from 39 gzipped FASTA files.

More information on this source data can be found in this NCBI article and in the FTP README.

Verily’s GRCh38

Verily’s GRCh38 reference genome is fully compatible with any b38 genome on the autosomes.

Verily’s GRCh38:

  • excludes all patch sequences
  • omits alternate haplotype chromosomes
  • includes decoy sequences
  • masks out duplicate copies of centromeric regions

The base assembly is GRCh38_no_alt_plus_hs38d1. This assembly version was created specifically for analysis, with its rationale and exact genome modifications thoroughly documented in its README file.

Verily applied the following modifications to the base assembly:

  • Reference segment names are prefixed with “chr”. Many of the additional data files we use are provided by GENCODE, which uses the “chr” naming convention.
  • All 74 extended IUPAC codes are converted to the first matching base in alphabetical order, as recommended in the VCF 4.3 specification.

This release of the genome reference is named GRCh38_Verily_v1.
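
The two modifications described above can be illustrated with a short Python sketch. This is not Verily's actual tooling: the function names (normalize_header, normalize_bases, normalize_fasta) are hypothetical, and the sketch assumes that each sequence name is the text immediately after “>” on a FASTA header line and that soft-masked (lowercase) bases should keep their case. The mapping follows the rule in the VCF 4.3 specification that an ambiguous reference base is reduced to the concrete base that comes first alphabetically (for example, R becomes A). N is left unchanged.

```python
# Sketch only: hypothetical helpers, not the tooling used to build GRCh38_Verily_v1.

# Each extended IUPAC ambiguity code mapped to the first base, in
# alphabetical order, among the bases it can represent.
IUPAC_FIRST_BASE = {
    "R": "A",  # A or G
    "Y": "C",  # C or T
    "S": "C",  # C or G
    "W": "A",  # A or T
    "K": "G",  # G or T
    "M": "A",  # A or C
    "B": "C",  # C, G, or T
    "D": "A",  # A, G, or T
    "H": "A",  # A, C, or T
    "V": "A",  # A, C, or G
}

# Translation table covering upper- and lowercase codes (assumption:
# lowercase soft-masking is preserved). A, C, G, T, and N are untouched.
_TABLE = str.maketrans(
    {
        **IUPAC_FIRST_BASE,
        **{code.lower(): base.lower() for code, base in IUPAC_FIRST_BASE.items()},
    }
)


def normalize_header(line, prefix="chr"):
    """Prefix the sequence name in a FASTA header line with `prefix`."""
    name = line[1:]
    if name.startswith(prefix):
        return line  # already prefixed; leave as-is
    return ">" + prefix + name


def normalize_bases(line):
    """Replace extended IUPAC codes with their first alphabetical base."""
    return line.translate(_TABLE)


def normalize_fasta(lines):
    """Yield normalized FASTA lines (without trailing newlines)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield normalize_header(line)
        else:
            yield normalize_bases(line)
```

For instance, passing the lines “>1” and “ACGTRYN” through normalize_fasta yields “>chr1” and “ACGTACN”. The README of the base assembly remains the authoritative description of the sequence content.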

hg19

Similar to GRCh37, this is the February 2009 assembly of the human genome, with a different mitochondrial sequence and additional alternate haplotype assemblies. It includes data from all 93 gzipped FASTA files from the UCSC FTP site.

More information on this source data can be found in the FTP README.

hs37d5

Includes data from GRCh37, the rCRS mitochondrial sequence, Human herpesvirus 4 type 1, and the concatenated decoy sequences in one file: hs37d5.fa.gz

More information on this source data can be found in the FTP README.
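
Because hs37d5 combines several sources in a single file, it can be useful to list the sequence names it contains. The following is a minimal Python sketch, assuming a local copy of a gzip-compressed FASTA file. The function name fasta_sequence_names and the local path in the usage note are hypothetical, and the name of each sequence is assumed to be the first whitespace-delimited token on its header line.

```python
import gzip


def fasta_sequence_names(path):
    """Yield the name of each sequence in a gzip-compressed FASTA file.

    Assumption: the name is the first whitespace-delimited token that
    follows ">" on a header line.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                yield fields[0] if fields else ""
```

With a local copy of the file, list(fasta_sequence_names("hs37d5.fa.gz")) would return the names in file order, which makes it easy to check whether the mitochondrial, viral, and decoy sequences are present.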

b37

The reference genome included with some versions of the GATK software. It includes data from GRCh37, the rCRS mitochondrial sequence, and Human herpesvirus 4 type 1 in one file: Homo_sapiens_assembly19.fasta.

More information on this source data can be found in the GATK FAQs.


Need more help? Please see https://cloud.google.com/genomics/support.