Personal Genome Project Data
This dataset comprises roughly 180 Complete Genomics genomes. See the Personal Genome Project and the publication for full details.
Google Cloud Platform data locations
- Google Cloud Storage folder gs://pgp-harvard-data-public
- Google Genomics Dataset ID 9170389916365079788
- Google BigQuery Dataset IDs
Provenance
Google Genomics variant set for dataset pgp_20150205: 9170389916365079788 contains:
- the Complete Genomics datasets from gs://pgp-harvard-data-public/**/masterVar*bz2
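The wildcard in the path above can be handed to gsutil directly to see which files it matches. The following is a minimal sketch, assuming gsutil is installed and on your PATH; the function name list_mastervar and the BUCKET variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: list every masterVar archive in the public bucket.
# BUCKET is the bucket named in this document; the function name is an assumption.
BUCKET="gs://pgp-harvard-data-public"

list_mastervar() {
    # The URL is double-quoted so the local shell does not expand the
    # wildcards; gsutil interprets ** (any depth) and * itself.
    gsutil ls "${BUCKET}/**/masterVar*bz2"
}
```

Running list_mastervar should print one gs:// URL per matching file. The listing walks the whole bucket, so it may take a while.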
Appendix
Google is hosting a copy of the PGP Harvard data in Google Cloud Storage.
All of the data is in this bucket: gs://pgp-harvard-data-public
If you wish to browse the data you will need to install gsutil.
Once it is installed, you can run the ls command on the PGP bucket:
$ gsutil ls gs://pgp-harvard-data-public
gs://pgp-harvard-data-public/cgi_disk_20130601_00C68/
gs://pgp-harvard-data-public/hu011C57/
gs://pgp-harvard-data-public/hu016B28/
....lots more....
The subfolders are named by PGP ID, so you can ls a specific one:
$ gsutil ls gs://pgp-harvard-data-public/hu011C57/
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/
If you keep descending through the structure, you eventually reach the data files:
$ gsutil ls gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/dbSNPAnnotated-GS000015172-ASM.tsv.bz2
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/gene-GS000015172-ASM.tsv.bz2
... and more ...
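Rather than descending one level at a time, it may be easier to list a participant's folder recursively. The sketch below assumes gsutil's -r option for ls; the function name list_participant is hypothetical, and hu011C57 is simply the example ID used above.

```shell
# Hypothetical helper: recursively list everything stored for one PGP ID.
list_participant() {
    # $1 is a PGP ID such as hu011C57; refuse to run without one,
    # because an empty ID would list the entire bucket.
    if [ -z "$1" ]; then
        echo "usage: list_participant PGP_ID" >&2
        return 2
    fi
    gsutil ls -r "gs://pgp-harvard-data-public/${1}/"
}
```

For example, list_participant hu011C57 should print every object below that folder, including the files in the ASM directory shown above.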
Your genome data is located at: gs://pgp-harvard-data-public/{YOUR_PGP_ID}
If you do not see the data you are looking for, you should contact PGP directly through your web profile.
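To check quickly whether anything is stored under your ID, a small wrapper around gsutil ls can help. This is only a sketch: it assumes gsutil exits with a non-zero status when the URL matches nothing, and the function names pgp_data_url and check_pgp_data are hypothetical.

```shell
# Hypothetical helper: build the folder URL for a given PGP ID.
pgp_data_url() {
    printf 'gs://pgp-harvard-data-public/%s/\n' "$1"
}

# Hypothetical helper: report whether the bucket holds data for a PGP ID.
check_pgp_data() {
    url=$(pgp_data_url "$1")
    # Assumption: gsutil ls returns non-zero when no objects match the URL.
    if gsutil ls "$url" >/dev/null 2>&1; then
        echo "found: $url"
    else
        echo "no data at $url; contact PGP through your web profile"
        return 1
    fi
}
```

Call it as check_pgp_data followed by your own PGP ID.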