Jack Chen, Associate Professor

Department of Molecular Biology and Biochemistry
Simon Fraser University

Office: SSB8111
Phone: (778)782-4823
Email: chenn(at)sfu.ca

B.Sc., Fudan University, Shanghai
Ph.D., Chinese Academy of Sciences, Qingdao

Chapter 1 - Comparative Genomics: an Emerging Field

From genetics to genomics: highlights of a 100-year history

In the milestone paper "Initial sequencing and analysis of the human genome" (Lander et al., 2001), Eric Lander and colleagues summarized the scientific progress made in the quest to understand the nature and content of genetic information. They proposed that this progress falls into four main phases, corresponding roughly to the four quarters of the century between the rediscovery of Mendel's laws of genetics and the publication of the initial human genome sequence. As a consequence of the completion of the Human Genome Project, a new field - comparative genomics - emerged.

Gregor Mendel published his famous work on genetics in 1866. Mendel carefully crossbred thousands of pea plants, which led him to key insights, now called Mendel's Laws of Heredity, about how inherited traits are passed on from generation to generation. However, his publication was ignored by the scientific community until 1900, when three botanists independently rediscovered his work in the same year. Thus Mendel's work had little impact on the field of genetics before 1900. Although Mendel published his paper only a few years after Charles Darwin published On the Origin of Species (1859), Darwin was never aware of Mendel's discovery. One simple reason the work was overlooked, according to James Watson (DNA: The Secret of Life, 2003, Arrow Books), is that genetic mechanisms turn out to be complicated. Furthermore, early biologists failed to distinguish between two fundamentally different processes, heredity and development.

Gregor Mendel, 1822-1884, Austria-Hungary.
Wikipedia

Chromosome as the cellular basis of heredity

The rediscovery of Mendel's work on genetics coincided with independent microscopic observations of chromosomes by Walter Sutton and Theodor Boveri. It was obvious to them that the factors (or genes) Mendel had described were carried on chromosomes. Their work was later called the Sutton-Boveri chromosome theory of inheritance. Thomas Morgan, while working on the fruit fly Drosophila melanogaster, effectively proved the Sutton-Boveri theory. This breakthrough built on the discovery of a highly instructive mutant fruit fly with white eyes. In contrast, wild-type (normal) fruit flies have red eyes. Further observations of mutant fruit flies by Morgan and his colleagues showed that chromosomes can break apart and re-form during the production of sperm and egg cells. This observation of recombination enabled Morgan and his colleagues to map the positions of genes along each chromosome, resulting in genetic maps.

Thomas Morgan, 1866-1945, American.
Wikipedia

DNA double-helix structure

Chromosomes consist of proteins and DNA (deoxyribonucleic acid). Because proteins are made up of twenty different amino acids while DNA is made up of only four different nucleotides, it was widely believed that protein was the genetic substance. Most biochemists doubted that DNA was complex enough to be the genetic information carrier. This doubt persisted even after Oswald Avery published his transformation study in 1944. DNA was firmly established as the genetic information carrier by the experiment carried out by Alfred Hershey and Martha Chase in 1952, known as the Hershey-Chase experiment. Based on the X-ray diffraction images of DNA produced by Rosalind Franklin, James Watson and Francis Crick published the DNA double-helix structure in 1953. The structure immediately suggested to them how genetic information could be copied: the double helix is ideal for duplication of genetic information. Watson, Crick, and Maurice Wilkins were awarded the Nobel Prize in 1962.

James Watson (left), 1928-, American.
Wikipedia

Francis Crick (right), 1916-2004, British.
Wikipedia

DNA sequencing

Fred Sanger developed the dideoxy chain-termination method for sequencing DNA molecules, now known as the "Sanger method". Sanger won his second Nobel Prize in 1980, sharing it with Paul Berg and Walter Gilbert; Gilbert had independently developed the "Maxam-Gilbert sequencing" method. Sanger and colleagues sequenced the genome of bacteriophage φX174 (Sanger et al., 1977), arguably the first complete genome sequenced. The development of the Sanger method made it possible to initiate the Human Genome Project, and automated Sanger sequencing machines were used during its production phase. The Sanger method is now giving way to next-generation sequencing methods.

Fred Sanger, 1918-2013, British.
Wikipedia

The Human Genome Project officially started in 1990 and was completed in 2003. The project resulted in the sequencing of the 3 billion base pairs of the human genome. Scientists from six nations (United States of America, Great Britain, France, Germany, Japan, and China) took part. In parallel to the Human Genome Project (the "public" project), a company led by Craig Venter sequenced the human genome (the "private" project). Importantly, the Human Genome Project triggered the development of an array of key genomics technologies, including automated DNA sequencing and microarray technology, and set the stage for a growing number of large-scale projects, including the HapMap Project, the human ENCODE Project, the 1000 Genomes Project, and more recently, the Genome 10K Project.

The "Public" Human Genome (left)
The "Private" Human Genome (right)

Gene is a changing concept

The gene is a concept in motion. When Gregor Mendel first published his studies on pea plants in 1866, he did not use the term gene because it had not yet been defined. The term was coined by the Danish botanist Wilhelm Johannsen in 1909, while the physical basis of the gene remained unknown. In 1910, Thomas Morgan's work on fruit flies showed that genes sit on chromosomes, leading to the idea of genes as beads on a string. In 1941, George Beadle and Edward Tatum introduced the concept that one gene makes one enzyme. Shortly after, in 1944, Oswald Avery and colleagues found that genes are made of DNA. James Watson and Francis Crick published the structure of DNA in 1953, and Crick later articulated the central dogma of molecular biology (1958). The concept that a gene is a contiguous segment of DNA was broken when Richard Roberts and Phillip Sharp discovered that genes can be split into segments, leading to the idea that one gene can make several proteins. Genes do not have to code for proteins: RNA genes include rRNA and tRNA genes. Work by Victor Ambros and Gary Ruvkun on the nematode C. elegans led to the discovery of the first microRNA gene in 1993.

DNA is the genetic information carrier

The Human Genome Project is the hallmark of Genomics

Model organisms facilitate understanding of the human genome and molecular evolution

Model organisms serve as platforms for discovery and for testing new technologies. During the early phase of the Human Genome Project, key technologies including cloning and genetic mapping were first developed and perfected by working with the nematode Caenorhabditis elegans and the budding yeast Saccharomyces cerevisiae. Indeed, the genomes of C. elegans and S. cerevisiae were sequenced and reported before the completion of the Human Genome Project (2003). After genome sequencing, these models also play a critical role in annotating the human genome. Prior to the Human Genome Project, each model organism had been used extensively in research. For example, the fruit fly Drosophila melanogaster was used by Thomas Morgan and his colleagues to elucidate the relationship between genes and traits in the first quarter of the 20th century. As a result, a large volume of biological information had accumulated for each model organism before its genome was sequenced and analyzed. For effective access, publicly accessible organism-specific databases have been set up for each model organism.

A typical model organism database (MOD) has a web page for each gene, called a "gene page", which lists all results about the gene and links to internal and external databases for further details. For convenient access, each MOD provides a BLAST page that allows users to find the gene page of interest through similarity searches. For advanced users, a MOD makes it possible to download the entire gene set of the model organism, or even the entire database, through an FTP site. Many MODs also provide "data mining" servers that allow users to access data through scripts. For example, WormBase, the official database for the model organism C. elegans, has an associated data mining server.

Over the last few years, many close relatives of each well-established model organism have been sequenced, analyzed, and published. At least 12 Drosophila species and five Caenorhabditis species have now been sequenced. The sequencing of these closely related species has greatly increased the power of comparative genomics. These new genome sequences have been incorporated into the corresponding MODs, turning each MOD into a "clade-specific database" (Stein, 2005).

The budding yeast Saccharomyces cerevisiae

The budding yeast was the first eukaryote whose genome was sequenced in full (1996). The official database is SGD.


The nematode Caenorhabditis elegans

C. elegans was the first metazoan whose genome was sequenced in full (1998). The genome sequence, its annotation, genome-wide functional genomics results, and the biological data published by the C. elegans research community over the 40 years since C. elegans was first established as a model organism are hosted in the public database WormBase.


The fruit fly Drosophila melanogaster

The genome of the fruit fly D. melanogaster was sequenced through a joint effort between the "public" and "private" sectors. The official database is FlyBase. Genomes of 12 Drosophila species have now been sequenced for comparative studies.


The model higher plant Arabidopsis thaliana

The small flowering plant Arabidopsis thaliana was the first plant whose genome was sequenced in full. The official database is TAIR.

Alignment is a key step for comparative genomics

After DNA (or RNA or protein) sequences are obtained, the first step in comparative genomics is alignment: placing sequences side by side so that corresponding positions line up. A correct alignment of molecular sequences reveals the differences, or mutations, between them, for example between sequences taken from different human individuals.
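To make the idea of placing sequences side by side concrete, here is a minimal Python sketch of one classic approach, global alignment by dynamic programming (the Needleman-Wunsch algorithm). The chapter does not name a specific algorithm, so this choice is an assumption, as are the function name `global_align`, the scoring values (match +1, mismatch -1, gap -1), and the toy sequences.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences (Needleman-Wunsch sketch).

    Returns (aligned_a, aligned_b, score); '-' marks a gap.
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Toy example: the second sequence lacks one base.
top, bottom, total = global_align("ACGT", "AGT")
print(top)     # ACGT
print(bottom)  # A-GT
print(total)   # 2
```

Real genome-scale comparisons rely on optimized tools and biologically motivated scoring schemes; this sketch only illustrates how gaps are introduced so that matching positions line up.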

Below, alignment of a DNA sequence taken from a normal human individual and a DNA sequence taken from a sickle cell disease patient reveals a single base difference (i.e., a point mutation), which underlies the disease.

Point mutation (A->T) revealed in sickle cell disease patients
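Once two sequences are aligned, finding a point mutation amounts to scanning the alignment column by column. The Python sketch below shows this under stated assumptions: the function name `find_differences` is hypothetical, and the two short sequences are illustrative fragments modeled on the start of the beta-globin coding region, not a verified reference sequence.

```python
def find_differences(reference, sample):
    """List mismatches between two already-aligned sequences.

    Returns (position, reference_base, sample_base) tuples,
    with positions counted from 1.
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have equal length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Illustrative fragments (assumed, for demonstration only).
normal = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
sickle = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"

print(find_differences(normal, sample=sickle))  # [(20, 'A', 'T')]
```

In this toy pair the single A->T difference turns the codon GAG into GTG, which is the kind of change the figure caption above describes.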

Alignment can be done between DNA sequences, RNA sequences, or protein sequences. For identifying mutations, molecular sequences are usually compared against a publicly recognized reference sequence (such alignments are usually called pairwise alignments). For example, reference whole-genome sequences have been established for human and many model organisms. For inferring evolutionary relationships, many related sequences are aligned and their relative (pairwise) distances are estimated.
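As a rough illustration of estimating pairwise distances from aligned sequences, the Python sketch below computes the proportion of differing sites (often called the p-distance) for every pair. The chapter does not say which distance measure is used, so the p-distance is an assumption, as are the function names and the toy sequences; columns containing a gap are simply skipped.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differences / compared


def pairwise_distances(aligned):
    """Map each pair of sequence names to its p-distance."""
    return {
        (name_a, name_b): p_distance(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(aligned, 2)
    }


# Hypothetical aligned toy sequences.
aligned = {
    "seq_a": "ACGTACGT",
    "seq_b": "ACGTACGA",
    "seq_c": "ACGAAC-A",
}
for pair, distance in pairwise_distances(aligned).items():
    print(pair, round(distance, 3))
```

Smaller distances suggest closer relationships. Tree-building methods typically start from a distance matrix like this one, usually after correcting for multiple substitutions at the same site.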

Multiple sequence alignment of RFX DNA binding domains (DBD) reveals strong conservation of this family of proteins

Homologous genes share a common ancestor (orthology and paralogy)

Transposons are a driving force in evolution

Horizontal gene transfer is common in prokaryotes

Molecular evolution drives speciation

Evolution through cis-regulatory changes
