What is Whole Genome Sequencing?

Andrew Magis, Arivale Senior Bioinformatics Scientist, PhD

As the name implies, whole genome sequencing (WGS) is the process by which an organism’s genome sequence is determined. However, you may be surprised to learn that it doesn’t actually measure the ‘whole’ genome with perfect fidelity.

The human genome is very big and very complex, and some areas of it are very difficult to measure accurately. A recent study determined that 84 percent of the human genome can be confidently determined using state-of-the-art methods similar to those we use at Arivale (Telenti et al., 2016). Most of the remaining sequence can still be measured, but with less confidence.

Let’s be clear: 84 percent is still a huge amount of information. By comparison, exome sequencing, which focuses only on gene regions, measures about one percent of the human genome. The largest single nucleotide polymorphism (SNP) genotyping chips measure less than 0.1 percent of the genome.

How does whole genome sequencing work?

Whole genome sequencing takes purified DNA as its input (DNA that has been separated from all the other biomolecules in your body). This DNA goes into an Illumina sequencer, and an immense amount of data comes out.

You might think the output would be the sequences of the 46 chromosomes (22 pairs of autosomes and one pair of sex chromosomes) that represent your whole genome, one base (i.e., letter) after another. However, no technology currently exists that can read human chromosomes in their entirety from start to finish. Human chromosome 1 alone is nearly 250 million bases long. Illumina sequencers read only a few hundred bases at a time, but they do this using a massively parallel approach.
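To get a feel for why the approach has to be massively parallel, here is a rough sketch of the arithmetic that links read count, read length, and how many times each base is read on average. The genome length, read length, and target depth below are illustrative assumptions, not Arivale’s actual sequencing settings, and the function names are made up for this example.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base: reads * length / genome size."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Assumed, illustrative values (not taken from the article):
GENOME_LENGTH = 3_100_000_000  # approximate number of bases in one copy of the human genome
READ_LENGTH = 150              # "a few hundred bases"; 150 is a hypothetical choice
TARGET_COVERAGE = 30           # hypothetical average depth

needed = reads_needed(TARGET_COVERAGE, READ_LENGTH, GENOME_LENGTH)
print(f"{needed:,} reads for {TARGET_COVERAGE}x average coverage")
```

Under these assumptions the answer is in the hundreds of millions of reads; shorter reads or deeper sequencing push the count higher.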

The sequence of your genome is determined by reading billions of randomly placed segments of your genome, each a few hundred bases long. These segments are called reads. The reads are compared against a reference genome (a careful compilation of an “average” human genome that is curated by many scientists) (Lander et al., 2001; Venter et al., 2001). By comparing the reads against the reference, we can see where you differ from the reference.
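The idea of comparing a read against the reference can be sketched in a few lines. This is a toy illustration only: it slides one short read along a tiny made-up reference and keeps the placement with the fewest mismatches. Real aligners use indexed, far more efficient algorithms and also handle insertions and deletions; the sequences and function names here are hypothetical.

```python
def best_alignment(read: str, reference: str) -> tuple[int, int]:
    """Return (offset, mismatches) for the placement with the fewest mismatches."""
    best_offset, best_mismatches = -1, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


def differences(read: str, reference: str, offset: int) -> list[tuple[int, str, str]]:
    """List (reference position, reference base, read base) wherever they differ."""
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if reference[offset + i] != base
    ]


reference = "ACGTTAGCCGATTACAGGCT"  # made-up reference sequence
read = "GCCGATAACAG"                # made-up read with one changed base

offset, mismatches = best_alignment(read, reference)
print(offset, mismatches)                    # 6 1
print(differences(read, reference, offset))  # [(12, 'T', 'A')]
```

The single difference reported at position 12 is the kind of signal that, once many reads agree on it, gets called as a variant.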


Figure 1: WGS reads aligned to the reference sequence. Each read is shown as a blue or orange horizontal rectangle with letters inside it, representing the read sequence. Each letter in the read represents a base. The double-stranded sequence of the human genome reference is at the bottom. The images illustrate three possible genotypes at the same position in the genome (red box). The positive strand of the human reference genome has a T at this specific position (blue T at the bottom). The left image shows a sample with no variant present in the red box. The individual is homozygous for the reference allele, T, and has genotype TT at this position. The middle image shows a sample with a variant (SNP) inherited from one parent in the red box. The yellow-highlighted A letters mark where this sample differs from the reference T at that position. As about half of the reads have an A and about half of the reads have a T, this individual’s genotype is heterozygous AT. The right image shows a sample with a variant (SNP) inherited from both parents; this individual’s genotype is homozygous AA. Note that the pink A at the bottom of all three images does not represent a variant; it is the reverse-strand complement of the T just above it (recall that DNA is double-stranded).
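The reasoning in the Figure 1 caption (all reads agree, or about half show one base and half another) can be written as a small sketch. The 20 percent cutoff used to ignore stray sequencing errors is an assumption chosen for illustration; production variant callers use statistical models rather than a fixed cutoff. The function names are hypothetical.

```python
from collections import Counter


def call_genotype(bases_at_position: str, min_fraction: float = 0.2) -> str:
    """Call a two-letter genotype from the bases that reads show at one position.

    min_fraction is an assumed cutoff: alleles seen in a smaller share of
    reads are treated as sequencing errors and ignored.
    """
    counts = Counter(bases_at_position)
    total = sum(counts.values())
    supported = sorted(b for b, n in counts.items() if n / total >= min_fraction)
    if len(supported) == 1:
        return supported[0] * 2      # homozygous, e.g. "TT"
    if len(supported) == 2:
        return "".join(supported)    # heterozygous, e.g. "AT"
    raise ValueError("cannot call a genotype from these reads")


def reverse_complement(sequence: str) -> str:
    """Sequence of the opposite DNA strand, read in its own direction."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


print(call_genotype("TTTTTTTTTT"))  # TT  (left image: no variant)
print(call_genotype("ATATTAATAT"))  # AT  (middle image: heterozygous)
print(call_genotype("AAAAAAAAAA"))  # AA  (right image: homozygous variant)
print(reverse_complement("T"))      # A   (the pink A under the reference T)
```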

Types of Genetic Variants

So what kind of differences in the genome do we observe? The most common are called single nucleotide polymorphisms (SNPs, pronounced snips).

SNPs are what most people think of when they think of a genetic variant: a single base that has been changed to another base (Figure 1). SNPs are the most common type of genetic variation, and they are very important, but they do not tell the whole story.

Sometimes a few bases can be deleted or inserted. This type of variation is called an indel. For example, the genetic disease cystic fibrosis is commonly caused by an indel in the gene CFTR (Mullaney et al., 2010). Sometimes large regions of the genome can be deleted or duplicated in what is called a copy number variation. In addition, small segments of the genome can be reversed, which is called an inversion. Sometimes an entire chromosome can break in two and re-attach itself to another chromosome. These large changes are called translocations. A form of chronic myelogenous leukemia (CML) is associated with a translocation (Rowley, 1973). The results of all these changes? Genes can break apart, combine with other genes, appear multiple times, or disappear entirely. See Figure 2 for an illustration of these types of changes.
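For the small changes described above, one common way to tell a SNP from an indel is to compare the reference sequence at a site with the sequence found in the sample. The sketch below assumes that representation; the example sequences and the function name are made up, and larger changes such as copy number variations, inversions, and translocations cannot be classified this simply.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and sample sequences."""
    if ref == alt:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"                      # one base changed to another
    if len(alt) > len(ref):
        return "insertion"                # sample has extra bases
    if len(ref) > len(alt):
        return "deletion"                 # sample is missing bases
    return "multi-base substitution"      # same length, several bases changed


print(classify_small_variant("T", "A"))     # SNP
print(classify_small_variant("A", "ACG"))   # insertion
print(classify_small_variant("ACTT", "A"))  # deletion (three bases removed)
```

Insertions and deletions together are what the text calls indels.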


Figure 2: Some of the variation that occurs in the human genome. The letters A, B, C, and D in the colored boxes represent adjacent genes on a chromosome. Modified from (Mullally and Ritz, 2007).
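Using the same A, B, C, D gene labels as the caption, the larger rearrangements can be mimicked with simple list operations. This is a simplified sketch rather than a reproduction of Figure 2: the function names are hypothetical, the second chromosome (genes W to Z) is invented for the example, and the translocation is modeled as a simple exchange of chromosome ends.

```python
Genes = list[str]


def delete(genes: Genes, start: int, end: int) -> Genes:
    """Remove the genes in positions start..end-1."""
    return genes[:start] + genes[end:]


def duplicate(genes: Genes, start: int, end: int) -> Genes:
    """Repeat the genes in positions start..end-1 right after themselves."""
    return genes[:end] + genes[start:end] + genes[end:]


def invert(genes: Genes, start: int, end: int) -> Genes:
    """Reverse the order of the genes in positions start..end-1."""
    return genes[:start] + genes[start:end][::-1] + genes[end:]


def translocate(chrom1: Genes, chrom2: Genes, break1: int, break2: int) -> tuple[Genes, Genes]:
    """Break both chromosomes and swap the pieces after the break points."""
    return chrom1[:break1] + chrom2[break2:], chrom2[:break2] + chrom1[break1:]


chromosome = ["A", "B", "C", "D"]
other = ["W", "X", "Y", "Z"]  # invented second chromosome

print(delete(chromosome, 2, 3))              # ['A', 'B', 'D']
print(duplicate(chromosome, 2, 3))           # ['A', 'B', 'C', 'C', 'D']
print(invert(chromosome, 1, 3))              # ['A', 'C', 'B', 'D']
print(translocate(chromosome, other, 2, 2))  # (['A', 'B', 'Y', 'Z'], ['W', 'X', 'C', 'D'])
```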

Currently, Arivale coaches on two gene deletions, which are fairly common in the human population. Glutathione S-transferase theta-1 (GSTT1) and glutathione S-transferase mu-1 (GSTM1) are both involved in metabolism of a variety of toxic compounds, and may play a role in risk for certain types of cancer (Gudmundsdottir et al., 2001; Setiawan et al., 2000).

Previously, Arivale measured the gene deletion of GSTT1 using a single SNP (rs1130990). By providing information on the entire gene, rather than a single position within that gene, we can get a much clearer and more accurate picture of the gene deletion (Figure 3). Arivale scientists are investigating the relationship between these types of large genomic variations and your health. As we curate new relationships we will make them available on your dashboard for you to discuss with your Arivale Coach.


Figure 3: Glutathione S-transferase theta-1 (GSTT1) represented in two different samples. The blue and orange horizontal bars are reads aligned to the human genome reference sequence. The top box (A) shows only four reads covering this gene, indicating the gene is not present in this sample. The bottom box (B) shows hundreds of reads covering this gene, indicating the gene is present. Previously, Arivale measured the presence or absence of this gene using a single SNP (rs1130990), which is indicated by the purple box.
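The logic behind Figure 3 can be sketched as counting the reads that land on a gene and comparing that count with a cutoff. Everything specific here is an assumption for illustration: the coordinates are invented, the cutoff of 20 reads is arbitrary, and the function names are hypothetical. Real methods compare depth over the gene with depth elsewhere in the same sample, and a deletion of only one of the two copies would show up as roughly half the usual depth rather than near zero.

```python
Interval = tuple[int, int]  # (start, end) of an aligned read, end exclusive


def reads_overlapping(reads: list[Interval], region_start: int, region_end: int) -> int:
    """Count reads that overlap the region [region_start, region_end)."""
    return sum(1 for start, end in reads if start < region_end and end > region_start)


def gene_appears_deleted(reads: list[Interval], gene_start: int, gene_end: int,
                         min_reads: int = 20) -> bool:
    """True when fewer than min_reads (an assumed cutoff) reads cover the gene."""
    return reads_overlapping(reads, gene_start, gene_end) < min_reads


GENE_START, GENE_END = 1000, 4000  # invented coordinates

# Sample A: only four reads on the gene, as in box (A) of the figure.
sample_a = [(1200, 1350), (2000, 2150), (2900, 3050), (3800, 3950)]
# Sample B: hundreds of reads tiled across the gene, as in box (B).
sample_b = [(1000 + 10 * i, 1150 + 10 * i) for i in range(300)]

print(gene_appears_deleted(sample_a, GENE_START, GENE_END))  # True
print(gene_appears_deleted(sample_b, GENE_START, GENE_END))  # False
```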

Rare and De Novo Variants

Scientists have discovered and validated more than 100 million SNPs and indels that occur in the human genome. These are stored in big online databases and curated by thousands of scientists. One of the major strengths of WGS is that it allows scientists to detect de novo variants. The term “de novo” is a Latin expression meaning “afresh” or “anew.” Scientists use this term to describe variants you did not inherit from your parents. Instead, they are almost certainly unique to you.

A recent study of 10,000 sequenced whole genomes revealed that each of us has an average of 8,579 de novo variants (Telenti et al., 2016). This means we don’t yet have a complete picture of how much variability exists in the human population. Technologies such as SNP genotyping (currently used by many consumer genetics companies) are unable to detect these de novo variants, because SNP genotyping can only measure a variant once we already know it exists. 
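The limitation of SNP genotyping comes down to a lookup: a chip can only report variants it was built to look for. The sketch below shows that idea with sets. The variant names are invented placeholders, not real identifiers, and the function names are hypothetical.

```python
def detected_by_chip(sample_variants: set[str], chip_probes: set[str]) -> set[str]:
    """Variants in the sample that the chip has a probe for."""
    return sample_variants & chip_probes


def missed_by_chip(sample_variants: set[str], chip_probes: set[str]) -> set[str]:
    """Variants in the sample that the chip cannot see."""
    return sample_variants - chip_probes


# Invented placeholder names for illustration only.
chip_probes = {"known_variant_1", "known_variant_2", "known_variant_3"}
sample_variants = {"known_variant_1", "known_variant_3", "previously_unseen_variant"}

print(sorted(detected_by_chip(sample_variants, chip_probes)))
# ['known_variant_1', 'known_variant_3']
print(sorted(missed_by_chip(sample_variants, chip_probes)))
# ['previously_unseen_variant']
```

Sequencing has no such probe list, which is why it can report a variant nobody has catalogued before.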

Most existing studies on the effects of genetic variability only use “common” variants in the human population. While it might seem odd, scientists define “common” variants as occurring in at least one percent of the human population. In contrast, rare variants occur in less than one percent of the human population (and often much less than that). Rare variants, which would include the de novo variants described above, are the subject of intense research (Cirulli and Goldstein, 2010). Because they are so rare, it is difficult to figure out what, if any, effect they have. Many scientists believe these rare variants are the key to understanding many types of disease risks and other heritable traits (Bodmer and Bonilla, 2008). Only by sequencing large numbers of genomes will scientists begin to understand the effects of these rare and de novo variants.
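The one percent dividing line can be expressed as a simple check. This is a sketch of the definition given above, applied to a hypothetical group of sequenced people; the counts and the function name are made up for illustration.

```python
def classify_by_frequency(carriers: int, population: int, threshold: float = 0.01) -> str:
    """Label a variant "common" if at least `threshold` of people carry it, else "rare"."""
    if population <= 0:
        raise ValueError("population must be positive")
    return "common" if carriers / population >= threshold else "rare"


# Hypothetical counts in a group of 10,000 sequenced people:
print(classify_by_frequency(2_500, 10_000))  # common (25 percent)
print(classify_by_frequency(100, 10_000))    # common (exactly 1 percent)
print(classify_by_frequency(1, 10_000))      # rare   (seen in a single person)
```

The last case shows why rare variants are hard to study: with only one carrier in 10,000 genomes, there is very little data from which to work out what the variant does.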

So what does all of this mean for you?

This is a very exciting time to have your whole genome sequenced. However, we’ll be the first to tell you—understanding the genome is not easy. It’s big and complicated, and the effects of any given variant are difficult to predict.

Today, the number of whole genome sequences in the world is probably around a few hundred thousand, meaning that fewer than 0.01 percent of people alive today have had their genome sequenced. In the next few years, this number will likely grow into the millions, and then tens of millions. As this number grows, we will gain more understanding of how to read the genome, particularly with rare and de novo variants, not to mention those larger changes shown in Figure 2.

Most of the time, when rare things become common, they lose value. The opposite is true for WGS data. The more sequenced genomes there are in the world, the more valuable each one becomes, both for the human population as a whole, and for you personally.



Bodmer, W., and Bonilla, C. (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet. 40, 695–701.

Cirulli, E.T., and Goldstein, D.B. (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 11, 415–425.

Gudmundsdottir, K., Tryggvadottir, L., and Eyfjord, J.E. (2001). GSTM1, GSTT1, and GSTP1 genotypes in relation to breast cancer risk and frequency of mutations in the p53 gene. Cancer Epidemiol. Biomarkers Prev. 10, 1169–1173.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Mullally, A., and Ritz, J. (2007). Beyond HLA: the significance of genomic variation for allogeneic hematopoietic stem cell transplantation. Blood 109, 1355–1362.

Mullaney, J.M., Mills, R.E., Pittard, W.S., and Devine, S.E. (2010). Small insertions and deletions (INDELs) in human genomes. Hum. Mol. Genet. 19, R131–R136.

Rowley, J.D. (1973). Letter: A new consistent chromosomal abnormality in chronic myelogenous leukemia identified by quinacrine fluorescence and Giemsa staining. Nature 243, 290–293.

Setiawan, V.W., Zhang, Z.F., Yu, G.P., Li, Y.L., Lu, M.L., Tsai, C.J., Cordova, D., Wang, M.R., Guo, C.H., Yu, S.Z., et al. (2000). GSTT1 and GSTM1 null genotypes and the risk of gastric cancer: a case-control study in a Chinese population. Cancer Epidemiol. Biomarkers Prev. 9, 73–80.

Telenti, A., Pierce, L.T., Biggs, W.H., di Iulio, J., Wong, E.H.M., Fabani, M.M., Kirkness, E.F., Moustafa, A., Shah, N., Xie, C., et al. (2016). Deep Sequencing of 10,000 Human Genomes.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. (2001). The sequence of the human genome. Science 291, 1304–1351.