Run SAMtools to index BAM files in Cloud Storage

Suppose you have thousands of BAM files stored in Google Cloud Storage and you need to create index files (BAI) for them.

The Grid Computing Tools github repo provides code and instructions for running SAMtools many times in parallel to create those index files.

Overview

This tool takes advantage of two key technologies to process a large number of files:

Google Compute Engine provides virtual machines in the cloud. With sufficient quota in your Google Cloud project, you can start dozens or hundreds of instances concurrently. The more instances you add to your cluster, the more quickly you can process your files.

Grid Engine is used to distribute the file operation tasks across all of the instances, such that each instance takes responsibility for a single file at a time: download it, run the operation, and upload the result back to Cloud Storage.
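
Conceptually, each task in the array job picks one line from the input file list and processes just that file. The following is a simplified sketch of this pattern (a hypothetical task script, not the tool's actual code):

    # Grid Engine sets SGE_TASK_ID to a distinct value for each task
    INPUT_URL=$(sed -n "${SGE_TASK_ID}p" file_list.txt)   # this task's input
    gsutil cp "${INPUT_URL}" ./in/                        # download
    samtools index "./in/$(basename "${INPUT_URL}")"      # run the operation
    gsutil cp ./in/*.bai "${OUTPUT_PATH}/"                # upload the result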

Directory structure

To use the tool, you will need to download both the Grid Computing Tools github repo and the Elasticluster repo to your local workstation or laptop.

No specific relationship exists between these two repositories, but in the following instructions, it is assumed that the directories:

  • grid-computing-tools
  • elasticluster

are siblings under a workspace root (WS_ROOT) directory.
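
That is, a layout like:

    ${WS_ROOT}/
    ├── elasticluster/
    └── grid-computing-tools/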

Running the samples

The quickest way to get familiar with the samtools tool is to try the sample.

The sample provided here lists just 6 files to work on, and the instructions below demonstrate spreading the processing over 3 worker instances.

  1. Create a cluster of Compute Engine instances running Grid Engine

    In your current shell:

    1. cd ${WS_ROOT}
    2. Follow the instructions to Create a Grid Engine cluster on Compute Engine
  2. Download the grid-computing-tools repository (if you have not already done so)

    cd ${WS_ROOT}
    git clone https://github.com/googlegenomics/grid-computing-tools.git
    
  3. Upload the src and samples directories to the Grid Engine master instance:

cd grid-computing-tools

elasticluster sftp gridengine << 'EOF'
mkdir src
mkdir src/common
mkdir src/samtools
put src/common/* src/common/
put src/samtools/* src/samtools/
mkdir samples
mkdir samples/samtools
put samples/samtools/* samples/samtools/
EOF
  4. SSH to the master instance

    elasticluster ssh gridengine
    
  5. Set up the configuration files for the samples

    The syntax for running the sample is:

    ./src/samtools/launch_samtools.sh [config_file]
    

    The config_file lists two sets of key parameters:

    • What operation to perform
    • What the source and destination locations are


    The operation to perform is controlled by the following:

    • SAMTOOLS_OPERATION: index


    The locations are determined by:

    • INPUT_LIST_FILE: file containing a list of GCS paths to the input files to process
    • OUTPUT_PATH: GCS path indicating where to upload the output files. If set to source, the output will be written to the same path as the source file (with the extension .bai appended)
    • OUTPUT_LOG_PATH: (optional) GCS path indicating where to upload log files


    To use the samples, you must update OUTPUT_PATH and OUTPUT_LOG_PATH to contain a valid GCS bucket name. The sample config file sets placeholders for both, such as:

    export OUTPUT_PATH=gs://MY_BUCKET/output_path/samtools_index
    export OUTPUT_LOG_PATH=gs://MY_BUCKET/log_path/samtools_index
    

    You can do this manually with the editor of your choice, or you can change the config file with the command:

    sed --in-place -e 's#MY_BUCKET#your_bucket#' samples/samtools/*_config.sh
    

    Where your_bucket should be replaced with the name of a GCS bucket in your Cloud project to which you have write access.
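
    If you do not already have a GCS bucket, you can create one with gsutil mb (shown as a sketch; substitute your own project ID and bucket name):

    gsutil mb -p your-project-id gs://your_bucket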


  6. Run the sample:

    • Index a list of files using samtools index [ Estimated time to complete: 5 minutes ]
    ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
    

    When successfully launched, Grid Engine should emit a message such as:

    Your job-array 1.1-6:1 ("samtools") has been submitted
    

    This message tells you that the submitted job is a Grid Engine array job. The message indicates that the job id is 1, the tasks are numbered 1 through 6, and the job name is samtools.


  7. Monitor the status of your job

    Grid Engine provides the qstat command to get the status of the execution queue.

    While the job is in the queue, the state column will indicate the status of each task. Tasks not yet allocated to a compute node will be collapsed into a single row as in the following output:

    $ qstat
    job-ID  prior   name       user      state submit/start at     queue            slots ja-task-ID
    ------------------------------------------------------------------------------------------------
         1  0.00000 my-job     janedoe   qw    06/16/2015 18:03:32                      1 1-6:1
    

    The above output indicates that tasks 1-6 of job 1 are all in a qw (queue waiting) state.

    When tasks get allocated, the output will look something like:

    $ qstat
    job-ID  prior   name       user      state submit/start at     queue            slots ja-task-ID
    ------------------------------------------------------------------------------------------------
         1  0.50000 my-job     janedoe   r     06/16/2015 18:03:45 all.q@compute002     1 1
         1  0.50000 my-job     janedoe   r     06/16/2015 18:03:45 all.q@compute001     1 2
         1  0.50000 my-job     janedoe   r     06/16/2015 18:03:45 all.q@compute003     1 3
         1  0.00000 my-job     janedoe   qw    06/16/2015 18:03:32                      1 4-6:1
    

    which indicates tasks 1-3 are all in the r (running) state, while tasks 4-6 remain in a waiting state.

    When all tasks have completed, qstat will produce no output.
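
    For example, a simple way to wait for all tasks to finish (a minimal sketch, relying on qstat producing no output once the queue is empty):

    while [ -n "$(qstat)" ]; do sleep 30; done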


  8. Check the logging output of tasks

    Each gridengine task will write to an “output” file and an “error” file. These files will be located in the directory the job was launched from (the HOME directory). The files will be named respectively:

    • job_name.ojob_id.task_id (for example: my-job.o1.10)
    • job_name.ejob_id.task_id (for example: my-job.e1.10)


    The error file will contain any unexpected error output, and will also contain any download and upload logging from gsutil.
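
    For example, to scan all of a job's error files for problems (a sketch, assuming job id 1 and job name my-job as above):

    grep -il "error" my-job.e1.*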


  9. View the results of the jobs

    When tasks complete, the result files are uploaded to GCS. You can view the list of output files with gsutil ls, such as:

    gsutil ls OUTPUT_PATH
    

    Where the OUTPUT_PATH should be the value you specified in the job config file (step 5 above).


  10. View log files

    When tasks complete, the result log files are uploaded to GCS if OUTPUT_LOG_PATH was set in the job config file. The log files can be of value both to verify success/failure of all tasks, as well as to gather some performance statistics before starting a larger job.

    • Count number of successful tasks

      gsutil cat OUTPUT_LOG_PATH/* | grep SUCCESS | wc -l
      

    Where the OUTPUT_LOG_PATH should be the value you specified in the job config file (step 5 above).

    • Count number of failed tasks

      gsutil cat OUTPUT_LOG_PATH/* | grep FAILURE | wc -l
      

    Where the OUTPUT_LOG_PATH should be the value you specified in the job config file (step 5 above).

    • Compute total task time

      gsutil cat OUTPUT_LOG_PATH/* | \
        sed -n -e 's#^Task time.*: \([0-9]*\) seconds#\1#p' | \
        awk '{ sum += $1; } END { print sum " seconds" }'

    • Compute average task time

      gsutil cat OUTPUT_LOG_PATH/* | \
        sed -n -e 's#^Task time.*: \([0-9]*\) seconds#\1#p' | \
        awk '{ sum += $1; } END { print sum/NR " seconds" }'
      
  11. Destroy the cluster

    When you are finished running the samples, disconnect from the master instance and, from your workstation, shut down the Grid Engine cluster:

    elasticluster stop gridengine
    

Running your own job

Running your own job to index a list of BAM files requires the following steps:

  1. Create an input list file
  2. Create a job config file
  3. Create a gridengine cluster with sufficient disk space attached to each compute node
  4. Upload input list file, config file, and grid-computing-tools source to the gridengine cluster master
  5. Do a “dry run” (optional)
  6. Do a “test run” (optional)
  7. Launch the job

The following instructions provide guidance on each of these steps. It is recommended, though not a requirement, that you save your input list file and job config file to a directory outside the grid-computing-tools directory. For example, you might create a directory ${WS_ROOT}/my_jobs.
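
For example:

    mkdir -p ${WS_ROOT}/my_jobs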

  1. Create an input list file

    If all of your input files appear in a single directory, then the easiest way to generate a file list is with gsutil. For example:

    gsutil ls gs://MY_BUCKET/PATH/*.bam > ${WS_ROOT}/my_jobs/bam_indexing_list_file.txt
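
    If your input files live under more than one path, gsutil ls accepts multiple wildcards in a single call. For example (a sketch; substitute your own paths):

    gsutil ls gs://MY_BUCKET/PATH1/*.bam gs://MY_BUCKET/PATH2/*.bam \
      > ${WS_ROOT}/my_jobs/bam_indexing_list_file.txt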
    
  2. Create a job config file

    The easiest way to create a job config file is to base it off the sample and update:

    • INPUT_LIST_FILE
    • OUTPUT_PATH
    • OUTPUT_LOG_PATH

    To have the generated BAM index file written to the same location as the source BAM, set:

    OUTPUT_PATH=source
    

    Save the job config file to ${WS_ROOT}/my_jobs/.
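
    For reference, a complete minimal config might look like the following sketch (the variable names come from the sample config; the values here are illustrative):

    export SAMTOOLS_OPERATION=index
    # Path to the input list as it will appear on the Grid Engine master
    export INPUT_LIST_FILE=${HOME}/bam_indexing_list_file.txt
    # Write each BAI next to its source BAM
    export OUTPUT_PATH=source
    export OUTPUT_LOG_PATH=gs://MY_BUCKET/log_path/samtools_index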


  3. Create a gridengine cluster with sufficient disk space attached to each compute node

    1. Determine disk size requirements

      Each compute node will require sufficient disk space to hold the input and output files for its current task. Determine the largest file in your input list and estimate the total space you will need. It may be necessary to download that file and perform the operation manually to measure the maximum combined input and output size.
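
      One way to find the largest input file is with gsutil du (a sketch; substitute your own paths):

      gsutil du gs://MY_BUCKET/PATH/*.bam | sort -n | tail -1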

      Persistent disk performance also scales with the size of the volume. Independent of storage requirements, for consistent throughput on long running jobs, use a standard persistent disk of at least 1TB, or use SSD persistent disk. More documentation is available for selecting the right persistent disk.


    2. Verify or increase quota

      Your choice for number of nodes and disk size must take into account your Compute Engine resource quota for the region of your cluster.

      Quota limits and current usage can be viewed with gcloud compute:

      gcloud compute regions describe REGION
      

      or in the Cloud Platform Console.

      Important quota limits include CPUs, in-use IP addresses, and disk size.

      To request additional quota, submit the Compute Engine quota request form.


    3. Configure your cluster

      Instructions for setting the boot disk size for the compute nodes of your cluster can be found at Setting the boot disk size.

      You will likely want to set the number of compute nodes for your cluster to a number higher than the 3 specified in the example cluster setup instructions.

      Once configured, start your cluster.
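
      For example, assuming your Elasticluster cluster template is named gridengine as in the sample setup:

      elasticluster start gridengine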

  4. Upload input list file, config file, and grid-computing-tools source to the gridengine cluster master

cd ${WS_ROOT}/grid-computing-tools

elasticluster sftp gridengine << EOF
put ../my_jobs/*
mkdir src
mkdir src/common
mkdir src/samtools
put src/common/* src/common/
put src/samtools/* src/samtools/
EOF
  5. Do a “dry run” (optional)

    The tool supports the DRYRUN environment variable. Setting this value to 1 when launching your job will cause the queued job to execute without downloading or uploading any files.

    The local output files, however, will be populated with useful information about what files would be copied. This can be useful for ensuring your file list is valid and that the output path is correct.

    For example:

    $ DRYRUN=1 ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
    Your job-array 3.1-6:1 ("samtools") has been submitted
    

    Then after waiting for the job to complete, inspect:

    $ head -n 5 samtools.o3.1
    Task host: compute002
    Task start: 1
    Input list file: ./samples/samtools/samtools_index_file_list.txt
    Output path: gs://cookbook-bucket/output_path/samtools_index
    Output log path: gs://cookbook-bucket/log_path/samtools_index
    
    $ grep "^Will download:" samtools.o3.*
    samtools.o3.1:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom9.SOLID.bfast.CEU.high_coverage.20100125.bam to /scratch/samtools.3.1/in/
    samtools.o3.2:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom1.LS454.ssaha2.CEU.high_coverage.20100311.bam to /scratch/samtools.3.2/in/
    samtools.o3.3:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chrom11.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.3/in/
    samtools.o3.4:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chrom12.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.4/in/
    samtools.o3.5:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chrom10.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.5/in/
    samtools.o3.6:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chromX.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.6/in/
    
    $ grep "^Will upload:" samtools.o3.*
    samtools.o3.1:Will upload: /scratch/samtools.3.1/in/NA12878.chrom9.SOLID.bfast.CEU.high_coverage.20100125.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
    samtools.o3.2:Will upload: /scratch/samtools.3.2/in/NA12878.chrom1.LS454.ssaha2.CEU.high_coverage.20100311.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
    samtools.o3.3:Will upload: /scratch/samtools.3.3/in/NA12878.chrom11.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
    samtools.o3.4:Will upload: /scratch/samtools.3.4/in/NA12878.chrom12.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
    samtools.o3.5:Will upload: /scratch/samtools.3.5/in/NA12878.chrom10.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
    samtools.o3.6:Will upload: /scratch/samtools.3.6/in/NA12878.chromX.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
    
  6. Do a “test run” (optional)

    The tool supports environment variables to indicate which lines in the input list to run:

    • LAUNCH_MIN - lowest line number to process
    • LAUNCH_MAX - highest line number to process


    For example, to launch a Grid Engine array job that only processes line 1:

    $ LAUNCH_MIN=1 LAUNCH_MAX=1 ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
    Your job-array 5.1-1:1 ("samtools") has been submitted
    

    The LAUNCH_MIN and LAUNCH_MAX values can be used with the DRYRUN environment variable:

    $ DRYRUN=1 LAUNCH_MIN=1 LAUNCH_MAX=5 ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
    Your job-array 6.1-5:1 ("samtools") has been submitted
    
  7. Launch the job

On the master instance, run the launch script, passing in the config file:

./src/samtools/launch_samtools.sh my_job_config.sh

where my_job_config.sh is replaced by the name of your config file created in step 2.


Have feedback or corrections? All improvements to these docs are welcome! Please file an issue.

Need more help? Please see https://cloud.google.com/genomics/support.