One thing we do regularly at work is taking a look at aligned sequences of human DNA as generated by what is called “next-generation sequencing”. This data is stored in so-called .bam files, which can get pretty large. For example, the .bam file for an individual whose whole genome is sequenced at 12x coverage is approximately 60 GB.
To view these files, check the alignment, look at the coverage of a specific region, etc., people typically use graphical browsers like IGV or Savant. However, these require you either to run the tool on the server (which means relatively slow X-forwarding over SSH) or to copy the BAM file to your local machine, which also takes a lot of time, especially if you want to take a look at a single region for a bunch of people.
For jobs like that I’ve found the text-based viewer integrated in SAMtools to be very convenient. It’s a matter of running
samtools tview sample.bam /path/to/reference.genome.fasta
after which you get a view like this:
1000821 1000831 1000841 1000851 1000861 1000871 1000881 1000891 1000901
GGCCAGGCAGGGCTTCTGGGTGGAGTTCAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGCCCGATGGGAGCGCA
.....................................................................................................
.......... ......................A.......................T...............G........A........C
........... .....................................................
............ ..............................................
..........................................................C........... .......................A.
................................................................................... ..........
..........
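The view above is interactive. For the "single region for a bunch of people" case, a non-interactive dump may be more convenient. Here is a minimal sketch, assuming a samtools version whose tview accepts -p (start position) and -d T (plain-text output) and BAM files that are sorted and indexed. The function name tview_dump and the SAMTOOLS override variable are made up for this example.

```shell
# Print a plain-text tview dump of one region for each BAM file given.
# usage: tview_dump REGION REFERENCE BAM [BAM...]
# tview_dump and SAMTOOLS are hypothetical names used only in this sketch.
tview_dump() {
  local region=$1
  local ref=$2
  shift 2
  local bam
  for bam in "$@"; do
    # Header line so the dumps can be told apart
    printf '== %s %s ==\n' "$bam" "$region"
    # -d T selects text output, -p sets the position to display
    "${SAMTOOLS:-samtools}" tview -d T -p "$region" "$bam" "$ref"
  done
}

# Example call (file names are placeholders):
# tview_dump 1:1000821 /path/to/reference.genome.fasta sample1.bam sample2.bam
```

The output can be piped into less or redirected to a file, which keeps everything on the server and avoids both X-forwarding and copying the BAM files.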
Inside the viewer, pressing g followed by 1:23000000 jumps to the given position on the given chromosome. If 1:23000000 doesn’t work, check the header of the BAM file to see how the chromosome is specified (sometimes it is chr1:23000000, for example):
samtools view -H sample.bam
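The header can be long, so it may help to pull out just the chromosome names. These are the SN: fields of the @SQ lines in the SAM header. A small sketch follows; the helper name list_contigs is made up for this example.

```shell
# Print the reference sequence names (SN: fields of @SQ lines)
# from a SAM header read on stdin.
# list_contigs is a hypothetical helper name.
list_contigs() {
  awk -F'\t' '$1 == "@SQ" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SN:/) print substr($i, 4)   # drop the "SN:" prefix
  }'
}

# Usage, assuming samtools is on the PATH:
# samtools view -H sample.bam | list_contigs
```

Whatever name is printed for chromosome 1 (for example 1 or chr1) is the prefix to use before the colon when jumping to a position.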
In the example screen shown earlier, the dots indicate nucleotides that are identical to the reference (shown in the second line); the positions with letters indicate reads where a different base was read. In this example all of them are probably sequencing or alignment errors, because only one discordant read is observed at any position. If instead you find a column in which many of the reads carry the same letter, that suggests this position really is different from the reference. Also notice how the various reads are aligned and that in this case the coverage doesn’t seem to be very high.
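The "one discordant read versus a whole column of letters" check can be roughly automated on a text dump of the screen. The sketch below assumes the layout shown in this post: a coordinate row, a reference row and a consensus row, followed by one row per read, with spaces where no read covers a column. The function name mismatch_columns and the default threshold of 2 reads are my own choices, and it is no substitute for a real variant caller.

```shell
# List screen columns in which at least MIN_READS reads show a base letter
# instead of a dot. Output per line: column, discordant reads, covering reads.
# usage: mismatch_columns [MIN_READS] < tview_text_dump
# mismatch_columns is a hypothetical helper name; MIN_READS defaults to 2.
mismatch_columns() {
  awk -v min="${1:-2}" '
    NR <= 3 { next }   # skip coordinate, reference and consensus rows
    {
      n = length($0)
      if (n > width) width = n
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c != " ") depth[i]++               # a read covers this column
        if (c ~ /[ACGTNacgtn]/) mism[i]++      # the read disagrees with the reference
      }
    }
    END {
      for (i = 1; i <= width; i++)
        if (mism[i] >= min) print i, mism[i], depth[i]
    }'
}
```

The reported numbers are screen columns, not genomic coordinates, so they have to be read against the coordinate row at the top of the dump.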