Infographic: The Sequencing and Assembly of the Human Genome

After sequencing fragments of DNA to obtain reads, most genomic pipelines follow one of two paths. The reads can be assembled de novo to construct longer stretches called contigs from scratch, with overlapping sequences at the ends dictating which reads belong next to each other (below left). Alternatively, reads can be aligned to a reference genome to identify small genetic variations (below right). Where de novo assembly is like assembling a puzzle without the picture on the box, alignment is like piecing together a puzzle while looking at that picture. However, because a single reference genome fails to capture all of the genetic diversity across humans, some sections of DNA might not align well to the reference.

Infographic comparing assembly versus alignment
modified from ©, filo
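
To make the overlap idea concrete, here is a minimal Python sketch of greedy de novo assembly. It is a toy illustration, not how production assemblers work: the function names, the three example reads, and the minimum overlap of three bases are all assumptions chosen for the example, and the reads are taken to be error-free and from a single strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads that tile a 13-base stretch of DNA.
reads = ["ATGGCGT", "GCGTACC", "TACCGGA"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACCGGA']
```

Alignment, by contrast, would start from a known reference sequence and look up where each read fits best, which is why reads from regions missing in the reference can fail to place.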

The evolution of sequencing

Numerous sequencing modalities have been developed in the last quarter century, but major advances include Sanger sequencing, sequencing by synthesis, nanopore long-read sequencing from Oxford Nanopore, and, most recently, high-fidelity single-molecule real-time sequencing from PacBio. These differ in the length of the reads they generate, their efficiency, and their accuracy, with technologies generally evolving to support faster, cheaper, and more precise sequencing.

Illustration showing sanger sequencing


The first sequencing technology invented, and no longer used in modern projects, Sanger sequencing relies on tagging the ends of DNA fragments of various sizes with complementary fluorescent nucleotides. Fragments are then separated by size using gel electrophoresis, and the fluorescence of each fragment's final nucleotide is read by a laser. The full sequence is inferred by piecing together the end nucleotides of the different-sized fragments.

        YEARS IN USE: 1980–2010

        READ LENGTH: ~500–1,000 bases

        CONS: Low throughput, time intensive
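
The read-out step of Sanger sequencing can be sketched in a few lines of Python. This is an idealized illustration under stated assumptions: exactly one fragment per length, a perfectly read final base, and hypothetical function names; real traces are noisy and need base-calling software.

```python
import random


def sanger_fragments(sequence: str) -> list[tuple[int, str]]:
    """One terminated fragment per position: (fragment length, final base)."""
    return [(length, sequence[length - 1]) for length in range(1, len(sequence) + 1)]


def read_sequence(fragments: list[tuple[int, str]]) -> str:
    """Sort fragments by length (the job of electrophoresis), then read end bases."""
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)


fragments = sanger_fragments("GATTACA")
random.Random(0).shuffle(fragments)  # fragments start out mixed together
recovered = read_sequence(fragments)
print(recovered)  # GATTACA
```

Because every fragment contributes only its final base, the read length is capped by how well fragments differing by a single nucleotide can be separated by size.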

Illustration showing sequencing by synthesis


Sequencing by synthesis (SBS) is the most commonly used type of sequencing today. It relies on synthesizing complementary DNA strands using fluorescently tagged nucleotides and capturing the output signal on a high-resolution camera. Hundreds of thousands of DNA fragments can be read at once, but SBS is limited to short lengths of DNA, making it challenging to assemble whole genomes de novo.

        YEARS IN USE: 2002–today

        READ LENGTH: ~100–500 bases

        CONS: Limited to short reads

Illustration showing nanopore sequencing


Oxford Nanopore devices pull DNA through a bioengineered pore to produce electrical current fluctuations that are then translated into a sequence. This approach generates long reads that can be used for de novo genome assembly or to identify larger structural variations that short reads may not be able to detect, but it is less accurate than other sequencing technologies.

        YEARS IN USE: 2002–today

        READ LENGTH: ~10 kb–1 Mb

        CONS: Error-prone

Illustration showing high-fidelity sequencing


Only recently released by PacBio, high-fidelity (HiFi) single-molecule real-time (SMRT) sequencing relies on fluorescence strategies similar to those of SBS. Like nanopore sequencing, HiFi produces long reads that can be used for de novo genome assembly or to identify structural variants, but it achieves improved accuracy by circularizing a long DNA molecule so that it can be read dozens of times in a single run.

        YEARS IN USE: 2020–today

        READ LENGTH: ~10 kb

        CONS: Currently very expensive
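
Why repeated passes over the same circular molecule improve accuracy can be shown with a simple majority vote. The Python sketch below is an assumption-laden simplification, not PacBio's consensus algorithm: it supposes the passes are already aligned, have equal length, and contain only substitution errors, and the example passes are made up.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority vote at each position across repeated reads of one molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch needs at least one pass, all of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*passes)
    )


# Five hypothetical passes over the same molecule, three with one error each.
passes = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 3
    "ACGTTGCA",
    "TCGTTGCA",  # error at position 0
    "ACGTTGGA",  # error at position 6
]
hifi_read = consensus(passes)
print(hifi_read)  # ACGTTGCA
```

As long as errors fall at different positions in different passes, they are outvoted, so more passes give a more reliable consensus.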

Visualizing a pangenome

Unlike a linear reference genome, a graph genome allows a single region of the genome to take on a diverse set of sequences. For regions with high genetic diversity, a graph genome can better capture the many human DNA sequences that might exist.

Adapted from a graphic by Yohei Rosen
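
A graph genome can be pictured as a small data structure in which shared sequence is stored once and variable regions branch. The Python sketch below is a toy under stated assumptions: the node names, sequences, and edges are invented for illustration and do not follow any particular pangenome file format.

```python
from collections.abc import Iterator

# Each node holds a stretch of sequence; edges say which node may follow which.
nodes = {
    "start": "ACGT",
    "alt_a": "A",    # one version of a single-base variant
    "alt_g": "G",    # the other version
    "mid": "TTC",
    "ins": "GGG",    # an insertion that only some genomes carry
    "end": "AT",
}
edges = {
    "start": ["alt_a", "alt_g"],
    "alt_a": ["mid"],
    "alt_g": ["mid"],
    "mid": ["ins", "end"],
    "ins": ["end"],
    "end": [],
}


def spell_paths(node: str) -> Iterator[str]:
    """Yield every sequence spelled by a path from `node` to a dead end."""
    successors = edges[node]
    if not successors:
        yield nodes[node]
        return
    for successor in successors:
        for tail in spell_paths(successor):
            yield nodes[node] + tail


sequences = list(spell_paths("start"))
print(sequences)
# ['ACGTATTCGGGAT', 'ACGTATTCAT', 'ACGTGTTCGGGAT', 'ACGTGTTCAT']
```

A linear reference would record only one of these four sequences, whereas the graph represents all of them with six short nodes.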


