- From genetics to genomics: highlights of a 100-year history
In the milestone paper entitled "Initial sequencing and analysis of the human genome" (Lander
et al., 2001), Eric Lander and colleagues nicely summarized the scientific progress
made in the quest to understand the nature and content of genetic information. They
proposed that the scientific progress falls into four main phases, corresponding roughly
to four quarters of the century from the rediscovery of Mendel's laws of genetics
to the publication of the initial human genome. As a consequence of the completion
of the Human Genome Project, a new field - comparative genomics - emerged.
Gregor Mendel published his
famous work on genetics in 1866. Mendel carefully cross-bred thousands of
pea plants, which led him to key insights, now called Mendel's Laws of
Heredity, about how inherited traits are passed on from generation to
generation. However, his publication was ignored by the scientific
community until 1900, when three botanists independently rediscovered
it in the same year. Thus Mendel's work had little impact on the field of
genetics before 1900. Although Mendel published his paper only a few years after
Charles Darwin published his book The Origin of Species, Darwin was never aware of
Mendel's discovery. One simple reason for this neglect, according to James Watson
(DNA: The Secret of Life, 2003, Arrow Books), is that genetic mechanisms turn out to be
complicated. Furthermore, early biologists failed to distinguish between two
fundamentally different processes, heredity and development.
[Figure: Gregor Mendel, 1822-1884, Austria-Hungary. Source: Wikipedia]
Chromosome as the cellular basis of heredity. The rediscovery of Mendel's
work on genetics coincided with independent microscope observations of
chromosomes by Walter Sutton and Theodor Boveri. It was apparent to
them that the factors (or genes) Mendel had described are carried on chromosomes. Their
work was later called the Sutton-Boveri chromosome theory of inheritance. Thomas
Morgan, while working on the fruit fly Drosophila melanogaster,
effectively proved the Sutton-Boveri theory. This breakthrough built on the
discovery of one of the most instructive mutants, a fruit fly with white eyes. In
contrast, wild-type (normal) fruit flies have red eyes. Further observation
of mutant fruit flies by Morgan and his colleagues showed that chromosomes
can break apart and re-form during the production of sperm and egg cells. This
observation of recombination enabled Morgan and his colleagues to map
the positions of genes along each chromosome, resulting in genetic maps.
[Figure: Thomas Morgan, 1866-1945, American. Source: Wikipedia]
DNA double-helix structure. Chromosomes consist of proteins
and DNA (deoxyribonucleic acid). Because
proteins are made up of twenty different amino acids and DNA is made up of only
four different nucleotides, it was widely believed that protein was the genetic
substance. Most biochemists doubted that DNA was complex enough
to serve as the genetic information carrier. This doubt persisted even after
Oswald Avery published his transformation study in 1944. DNA was firmly
established as the genetic information carrier after the experiment
carried out by Alfred Hershey and Martha Chase, now called
the Hershey-Chase experiment. Based on the X-ray diffraction images of DNA
produced by
Rosalind Franklin, James Watson and Francis Crick published the
DNA double-helix structure in 1953, which immediately suggested
to them how DNA could serve as the genetic information carrier: the double-helix
structure is ideal for duplication of genetic information. Watson, Crick, and
Maurice Wilkins were awarded the Nobel Prize in 1962.
[Figure: James Watson (left), 1928-, American; Francis Crick (right), 1916-2004, British. Source: Wikipedia]
DNA sequencing. Fred Sanger developed the dideoxy chain-termination
method for sequencing DNA molecules, now known as the "Sanger method". Sanger
won his second Nobel Prize in 1980, which he shared with Walter Gilbert,
who independently developed the "Maxam-Gilbert sequencing" method.
Sanger and colleagues sequenced the genome of bacteriophage φX174 (Sanger
et al., 1977), arguably the first complete genome sequenced. The development of
the Sanger method made it possible to
initiate the Human Genome Project. During the production phase of the
Human Genome Project, automated Sanger sequencing machines were used.
Now, the Sanger sequencing method is giving way to next-generation
sequencing methods.
The Human Genome Project officially started in 1990 and was completed in
2003. The project resulted in the sequencing of the 3 billion base pairs of
the human genome. Scientists from six nations (United States of America, Great
Britain, France, Germany, Japan, and China) took part. In parallel to the Human Genome
Project (the "public" project), a company led by
Craig Venter sequenced the human genome (the "private" project). Importantly, the
Human Genome Project triggered the development of an array of
key genomics technologies, including automated DNA sequencing and microarray
technology, and set the stage for a growing number of large-scale
projects, including the HapMap
Project, the human ENCODE project,
the 1000 Genomes Project,
and more recently, the Genome 10K Project.
- Gene is a changing concept
Gene is a concept in motion. When Gregor Mendel first published his studies on
pea plants in 1866, he did not use the term gene because it had not yet been coined.
The term was introduced by the Danish botanist Wilhelm Johannsen in 1909, while
the physical basis of the gene remained unknown at that time. In 1910, Thomas
Morgan's work on fruit flies showed that genes sit on chromosomes, leading to
the idea of genes as beads on a string. In 1941, George Beadle and Edward Tatum
introduced the concept that one gene makes one enzyme. Shortly after, in 1944,
Oswald Avery and colleagues found that genes are made of DNA. James Watson and
Francis Crick published the double-helix structure of DNA in 1953, and Crick later
formulated the central dogma of molecular biology. The concept that a gene is a contiguous
segment of DNA was broken when Richard Roberts and Phillip Sharp discovered
that genes can be split into segments, leading to the idea that one gene can
make several proteins. Genes do not have to code for proteins: RNA genes
include rRNA and tRNA genes. Work by Victor Ambros and Gary Ruvkun on the nematode
C. elegans led to the discovery of the first microRNA gene in 1993.
- DNA is the genetic information carrier
- The Human Genome Project is the hallmark of Genomics
- Model organisms facilitate understanding of the human genome and molecular evolution
Model organisms serve as platforms for discovery and for testing new technologies.
During the early phase of the Human Genome Project, key technologies including cloning and
genetic mapping were first developed and perfected by working with the nematode Caenorhabditis
elegans and the budding yeast Saccharomyces cerevisiae. Indeed, the genomes of
C. elegans and S. cerevisiae were sequenced and reported before the completion
of the Human Genome Project (2003). After genome sequencing, these models also play
a critical role in annotating the human genome. Prior to the Human Genome Project,
each model organism had been used extensively in research. For example, the fruit
fly Drosophila melanogaster was used by Thomas Morgan and his colleagues
to elucidate the relationship between genes and traits in the first quarter of the 20th
century. As such, a large volume of biological information had accumulated
for each model organism before its genome was sequenced and analyzed. For
effective access, publicly accessible organism-specific databases have been set
up for each model organism.
A typical model organism database (a.k.a. MOD) has a web
page that describes each gene, called a "gene page", which lists all results about
the gene, as well as links to internal and external databases for further details.
For convenient access, each MOD provides a BLAST page that allows users to find the
gene page of interest through similarity searches. For advanced users, a MOD makes it
possible to download the entire gene set of the model organism or even the entire
database through an FTP site. Many MODs also provide "Data Mining" servers that
allow users to access data through scripts. For example,
WormBase, the official database for the
model organism C. elegans, has an associated data mining server.
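As a rough illustration of script-based access, the sketch below reads a downloaded gene set into a dictionary and reports sequence lengths. It assumes the gene set is distributed as a FASTA file, which the text does not specify. The record names (geneA, geneB) and sequences are hypothetical, and an in-memory string stands in for the downloaded file.

```python
import io


def read_fasta(handle):
    """Parse FASTA-formatted text into a dict mapping record ID -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""  # ID is the first word of the header
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical two-gene file; a real script would open the downloaded file instead.
example = io.StringIO(
    ">geneA hypothetical record\nATGGCC\nTTAG\n>geneB\nATGAAATGA\n"
)
genes = read_fasta(example)
lengths = {name: len(seq) for name, seq in genes.items()}
print(lengths)
```

For real work, an established parser such as the one in Biopython is usually a safer choice than a hand-written reader.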
Over the last few years, many closely related species of each of the
well-established model organisms have been sequenced, analyzed, and published. Now,
at least 12 Drosophila species and five Caenorhabditis species
have been sequenced. The sequencing of these closely related species has greatly
increased the power of comparative genomics. These new genome sequences of
related species have been incorporated into the corresponding MODs,
enabling each MOD to become a "clade-specific database"
(Stein, 2005).
The budding yeast Saccharomyces cerevisiae. The budding yeast is the
first eukaryote whose genome was subjected to whole-genome sequencing
(1996). The
official database is SGD.
The nematode Caenorhabditis elegans. C. elegans is the first
metazoan whose genome was subjected to whole-genome sequencing (1998). The genome sequence,
annotation, genome-wide functional genomics results, and biological data published
by the C. elegans research community over the 40 years since
C. elegans was first established as a model organism are hosted in the public
database WormBase.
The fruit fly Drosophila melanogaster. The genome of the fruit fly
D. melanogaster was sequenced by a joint effort of "public" and
"private" groups. The official database is FlyBase.
By now, genomes of 12 Drosophila species have been sequenced for
comparative studies.
The model higher plant Arabidopsis thaliana. The small
flowering plant Arabidopsis thaliana is the first plant whose genome was
subjected to whole-genome sequencing. The official database is
TAIR.
- Alignment is a key step for comparative genomics
After DNA (or RNA, protein) sequences are obtained, the first step in comparative
genomics is alignment, that is, putting sequences side by side so that corresponding
positions line up. Correct alignment of molecular
sequences reveals differences, or mutations, between sequences, which can be taken,
for example, from different individual humans.
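To make "putting sequences side by side" concrete, here is a minimal sketch of global pairwise alignment using the classic Needleman-Wunsch dynamic-programming algorithm. The text does not name an alignment method, so this choice is an assumption for illustration. The scoring values (match +1, mismatch -1, gap -1) and the example sequences are also arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(top)
print(bottom)
```

The two returned strings have equal length, with "-" marking positions where one sequence has no counterpart in the other. Practical tools use more refined scoring (substitution matrices, affine gap penalties) than this sketch.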
Below, alignment of a DNA sequence taken from a normal human individual and a DNA
sequence taken from a sickle cell disease patient reveals a single base difference
(i.e., a point mutation), which underlies the disease.
[Figure: Point mutation (A->T) revealed in sickle cell disease patients]
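The comparison described here can be sketched in a few lines: once two sequences are aligned without gaps, a position-by-position scan reports every difference. The two short fragments below are illustrative stand-ins modeled on the start of the beta-globin coding region. Their exact bases are an assumption, not data taken from the figure.

```python
def point_differences(reference, sample):
    """List (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and contain no gaps.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Illustrative fragments (assumed): the only change is A -> T in one codon (GAG -> GTG).
normal = "GTGCATCTGACTCCTGAGGAGAAGTCT"
sickle = "GTGCATCTGACTCCTGTGGAGAAGTCT"

differences = point_differences(normal, sickle)
print(differences)
```

Each tuple gives the position within the fragment, the base in the normal sequence, and the base in the patient sequence.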
Alignment can be done between DNA sequences, RNA sequences, as well as protein
sequences. For identifying mutations, molecular sequences are usually compared against a
publicly recognized reference sequence (such alignments are usually called pair-wise
alignments). For example, reference whole-genome
sequences of human and many model organisms have been established. For
inferring evolutionary relationships, many related sequences are aligned and their
relative (pair-wise) distances are estimated.
[Figure: Multiple sequence alignment of RFX DNA binding domains (DBD) reveals
strong conservation of this family of proteins]
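One simple way to estimate pair-wise distances from a multiple sequence alignment is the p-distance: the fraction of compared positions at which two sequences differ. The sketch below uses that measure as an assumption, since the text does not say which distance is meant. The three aligned sequences are hypothetical, and columns where either sequence has a gap are skipped.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for x, y in zip(seq1, seq2):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


# Hypothetical rows of a multiple sequence alignment.
aligned = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "AC-TTCGA",
}

distances = {
    (a, b): p_distance(aligned[a], aligned[b])
    for a, b in combinations(aligned, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

The p-distance underestimates divergence when the same site has changed more than once, so evolutionary analyses usually apply a substitution model on top of it.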
- Homologous genes share a common ancestor (orthology and paralogy)
- Transposon is a driving force in evolution
- Horizontal gene transfer is common in prokaryotes
- Molecular evolution drives speciation
- Evolution through cis-regulatory changes