Nansheng J. Chen

Jack Chen, Associate Professor

Department of Molecular Biology and Biochemistry
Simon Fraser University

Office: SSB8111
Phone: (778)782-4823
Email: chenn(at)sfu.ca

B.Sc., Fudan University, Shanghai
Ph.D., Chinese Academy of Sciences, Qingdao



Comparative Genomics (MBB461/MBB761): an outline

Overview: This new course aims to provide a comprehensive introduction to the emerging field of comparative genomics for upper-level undergraduate students and perhaps also graduate students. Since the completion of the Human Genome Project (1990-2003), the number of sequenced genomes has been increasing exponentially, driven by revolutionary developments in DNA sequencing technologies and, most importantly, by the demands of researchers in fields including medicine, fishery, forestry, agriculture, and evolution. Although genome sequences contain the ultimate information responsible for driving gene expression, development, and differentiation, how that information is encoded in the genome is largely unknown. How does a transcription factor drive unique phenotypes? How does novelty arise in genome evolution? How can a mutation cause a disease condition? Comparative genomics has been effective for addressing these questions. This course is designed to review how these questions have been tackled in comparative genomics over the last decade.

This course has five modules. Each module takes about three weeks. Each student will do a course project and give a presentation on a published work. There is no final exam. In each module, in addition to introducing key concepts, I will also describe a representative large project, which helps give students a perspective on what is going on in the field. The first module introduces the fundamentals of comparative genomics, including key concepts, sequencing methods, bioinformatics tools, and resources for comparative genomics. In the second module, I will focus on the various types of functional elements in the genome, starting with genes. I will also introduce the identification and function of ultraconserved elements, followed by cis-regulatory elements. The third module deals with the most active subfield of comparative genomics: variation within a species. The fourth module is devoted to comparing the genomes of different species. In the last module, students will present their projects.

Textbook: none

Prerequisite: TBD

Class size: 20 students

Topics | Descriptions
 
Module 1: Fundamentals
Chapter 1
Comparative genomics: an emerging field
Details
Key words & concepts
  1. Alignment: from small sequences to whole genome
  2. Homology: paralog vs. ortholog
  3. Synteny: chromosomal synteny vs. synteny block
  4. Phylogeny: gene phylogeny vs. organismal phylogeny
  5. Species & speciation
Overview

A genome is a linear molecular structure that carries genetic information. It consists of four nucleotides (A, T, C, and G). Genetic information is scattered throughout the genome and is encoded by these four "letters" in different orders and combinations. One fundamental question is how genetic information is encoded in the genome, much of which is sometimes called "dark matter".

A genome contains a large number of different types of parts: genes and regulatory elements, sometimes collectively called functional elements. Genes, in particular protein-coding genes, are relatively straightforward to identify. In contrast, other types of functional elements, such as enhancers and insulators, are harder to detect. Regulatory elements dictate gene expression. In particular, they determine where and when a gene is expressed. They also determine how much a gene is expressed.

Presently, an effective way to understand the genome is comparative genomics, in which two or more genomes are compared for similarities and differences. Comparative analysis can be done at the DNA level and at the protein level. Using bioinformatics programs, sequences are aligned and the alignments are examined for evolutionary relationships. Are the sequences homologous, that is, do they share a common ancestor? Comparative analysis can also be done for genomes at different evolutionary distances, ranging from genomes of different strains of one species to genomes of distantly related species. Differences between genomes (i.e., genotypes) can therefore be linked to functional consequences, or phenotypes.
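
As a minimal sketch of the sequence comparison described above, the following Python function computes a global alignment score using Needleman-Wunsch dynamic programming. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not parameters taken from this course or from any particular alignment program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Scoring values are hypothetical defaults chosen for illustration.
    """
    # prev[j] is the best score for aligning the processed prefix of a
    # with b[:j]; before any character of a is used, only gaps are possible.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Identical sequences score one match per position.
    print(global_alignment_score("ACGT", "ACGT"))
    # A single deletion costs one gap penalty.
    print(global_alignment_score("ACGT", "AGT"))
```

Only the score is returned here; recovering the alignment itself would require keeping the full matrix and tracing back through it. Production aligners use more elaborate scoring schemes and heuristics.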

Objectives

Understand why comparative genomics is important for studying the genotype-to-phenotype (G2P) relationship, and how comparative genomics is carried out.

Readings
  1. Primer: Comparative genomics (by Ross Hardison, 2003).
  2. Review: Comparative genomics (by Webb Miller et al., 2004).
Chapter 2
DNA sequencing technologies: the driving force
Details
Key words & concepts
  1. First-generation
    • Sanger sequencing
  2. Second-generation
    • 454 sequencing
    • Solexa sequencing
    • SOLiD sequencing
  3. Third-generation
    • Nanopore sequencing
Overview

The ability to determine the order of nucleotides in a DNA molecule is an essential first step toward understanding the composition and function of a genome. Over the last 40 years, many different methods have been developed. The Sanger sequencing method, invented by Fred Sanger, who won his second Nobel Prize for it, was the key method used in the Human Genome Project. Second-generation (or "next-generation") DNA sequencing methods were introduced a few years after the completion of the Human Genome Project. Because different sequencing methods produce reads of different lengths, and the reads can be either paired-end or single-end, the methods can be used to address different questions. Third-generation sequencing methods are on the horizon for production use.

Popular DNA sequencing methods will be described in this lesson, primarily because the reads produced by each method have distinct characteristics and therefore need separate methods for handling and analysis.

Objectives

Master the file formats produced by each sequencing method and how the files can be processed for further analysis.

Readings
  1. Review: Applications of new sequencing technologies for transcriptome analysis (by Morozova and Marra, 2009)
  2. Review: The potential and challenges of nanopore sequencing (by Branton et al., 2008)
  3. File format: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants (by Cock et al., 2009).
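
To make the file-format objective above concrete, here is a small, hedged Python sketch that reads FASTQ records and decodes their quality strings. It assumes the common four-line record layout (the original Sanger format also allows wrapped lines, which this sketch does not handle). The function names are hypothetical. The offsets reflect the variants covered in the FASTQ reading: Sanger FASTQ encodes Phred scores with an ASCII offset of 33, while Illumina 1.3+ files use an offset of 64. Older Solexa files use a different score scale and are not handled here.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into Phred scores (33: Sanger, 64: Illumina 1.3+)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Convert a Phred score Q into the estimated base-call error probability."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nII5!\n"  # made-up record for illustration
    for name, seq, qual in parse_fastq(example):
        print(name, seq, phred_scores(qual))
```

Choosing the wrong offset silently shifts every score by 31, so it is worth checking which FASTQ variant a file uses before filtering reads by quality.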
Chapter 3
Bioinformatics: the enabling force
Details
Key words & concepts
  1. Alignment
    • Short sequence alignment
    • Whole genome alignment
  2. Database searches (blast)
  3. Homolog identification
    • Ortholog
    • Paralog
  4. Synteny
    • Synteny block identification (orthoCluster)
    • Synteny breakpoint
  5. Genome visualization
    • dotplot
    • Circos
    • Genome Browser & Synteny browser (gbrowse & gbrowse_syn)
Overview

Bioinformatics is a sister field of genomics and emerged almost simultaneously with it. The amount of information generated in large genome sequencing projects and related functional genomics projects can no longer be recorded in the notebooks that traditional biologists are comfortable with. Computers, including both hardware and software, are needed for storing, managing, retrieving, analyzing, and reporting genome information.

In this lecture, I will give an overview of bioinformatics tools developed in the last 10 years or so for DNA alignment, for synteny identification, and for display using genome browsers.
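
Dot plots appear in the keyword list above as one way to visualize how two sequences relate. As a rough sketch of the idea, the Python code below marks every position pair at which two sequences share an exact word of length k. The function names and the exact-match rule are simplifying assumptions for illustration; they are not how any specific dot plot program is implemented.

```python
def dot_plot(a, b, k=3):
    """Return (i, j) pairs where a[i:i+k] exactly equals b[j:j+k]."""
    # Index every k-mer of b by its start positions for fast lookup.
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            hits.append((i, j))
    return hits


def render(hits, n_rows, n_cols):
    """Draw hits as '*' on a text grid (rows: positions in a, columns: in b)."""
    grid = [["."] * n_cols for _ in range(n_rows)]
    for i, j in hits:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    a = b = "ACGTAC"  # toy sequences; identical input gives a diagonal
    k = 3
    hits = dot_plot(a, b, k)
    print(render(hits, len(a) - k + 1, len(b) - k + 1))
```

Runs of hits along a diagonal indicate regions that are conserved in the same order, which is the visual signature that synteny analysis looks for at a much larger scale.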

Objectives

Be comfortable with selecting computer programs appropriate for different conditions.

Readings
  1. Bioinformatics: alive and kicking (by Stein, 2008)
  2. BWA (by Li et al., 2010)
  3. OrthoCluster (by Zeng et al., 2008) & OrthoClusterDB (by Ng et al., 2009)
  4. gbrowse (by Lincoln Stein, et al., 2010)
  5. Circos: An information aesthetic for comparative genomics (by Krzywinski et al., 2009)
Chapter 4
Resources for comparative genomics
Details
Key words & concepts
  1. Primary and clade-specific databases
    • NCBI genome resources
    • Ensembl genome resources
    • UCSC genome resources
  2. Model organism databases (MODs)
Overview

Sequence results of publicly funded genome projects are deposited in publicly accessible databases, which makes data retrieval convenient. In this lecture, three major public genome data resources will be described. Although the data stored in these databases are essentially identical, each database has its own unique bioinformatics tools, which make it useful for particular purposes.

Objectives

Be familiar with the architecture of each of the three public genome data resources and learn how to retrieve data effectively.

Readings
  1. Touring Ensembl: A practical guide to genome browsing (by Spudich and Fernandez-Suarez, 2010)
  2. Entrez Gene: gene-centered information at NCBI (by Maglott et al., 2007)
Chapter 5
The Human Genome Project
Details
  1. The Human Genome Project
    • The Public
    • The Private
 
Module 2: Functional elements: identification & function
Chapter 6
Gene
Details
Key words & concepts
  1. Gene definition
    • Protein-coding gene
      • Exon
      • Intron
      • 5' & 3' UTRs
      • Promoter
      • Alternative splicing & isoform
    • Non-coding gene
      • tRNA
      • rRNA
      • microRNA
  2. Gene prediction
    • Comparative prediction
    • Transcriptome-based prediction
  3. Genes in operons
    • Prokaryote
    • Eukaryote
  4. Gene duplication
  5. Gene birth and death
Overview

The gene is a concept in motion. When Gregor Mendel first published his studies on pea plants in 1866, he did not use the term gene because it had not yet been defined. The term was coined by the Danish botanist Wilhelm Johannsen in 1909, while the physical basis of the gene remained unknown. In 1910, Thomas Hunt Morgan's work on fruit flies showed that genes sit on chromosomes, leading to the idea of genes as beads on a string. In 1941, George Beadle and Edward Tatum introduced the concept that one gene makes one enzyme. Shortly after, in 1944, Oswald Avery and colleagues found that genes are made of DNA. In 1953, James Watson and Francis Crick published the double-helix structure of DNA; Crick later articulated the central dogma of molecular biology (1958). The concept that a gene is a contiguous segment of DNA was broken when Richard Roberts and Phillip Sharp discovered that genes can be split into segments, leading to the idea that one gene can make several proteins. Genes do not have to code for proteins. RNA genes include rRNA and tRNA genes. Work by Victor Ambros and Gary Ruvkun on the nematode C. elegans led to the discovery of the first microRNA gene in 1993.

The structure of a gene is dynamic. In a genome, new genes are born and existing genes can die. The birth and death of genes can be detected by comparing the genomes of closely related organisms.

Objectives

Readings
  1. Origins, evolution and phenotypic impact of new genes (by Kaessmann, Genome Research, 2010)
  2. What is a gene, post-ENCODE? History and updated definition (by Gerstein et al., Genome Research, 2007)
  3. The origin of new genes: glimpses from the young and old (by Long et al., Nature Reviews Genetics, 2003)
Chapter 7
Ultraconserved elements
Details
  1. Identification
  2. Function
Chapter 8
Functional elements: cis-regulatory elements
Details
  1. Transcription factor binding sites (TFBSs)
  2. Promoter
  3. Enhancer
  4. Insulator
  5. Finding motifs: ChIP-chip & ChIP-seq
Chapter 9
ENCODE & modENCODE projects
Details
  1. Pilot project (1% of the human genome)
  2. ENCODE: functional elements in humans
  3. modENCODE: functional elements in D. melanogaster and C. elegans
Chapter 10
Synteny blocks
Details
  1. Chromosomal synteny
  2. Synteny blocks
    • Perfect synteny blocks
    • Imperfect synteny blocks
    • "Ultraconserved" synteny blocks
  3. Synteny breakpoints
Chapter 11
Genome rearrangement events & genome evolution
Details
  1. Deletion
  2. Insertion
  3. Inversion
  4. Transposition
  5. Translocation
 
Module 3: Intra-species comparison
Chapter 12
Genome variations
Details
Shortest woman & tallest man

Shortest man & tallest woman
Key words & concepts
  1. Types of GVs
  2. Formation of GVs
    • Duplication
    • Non-homologous recombination
  3. GVs and disease conditions
  4. Personalized genomics & medicine
Readings
  1. New York Times: Adventures in Very Recent Evolution (by Nicholas Wade, 2010)
  2. New York Times: Scientists Cite Fastest Case of Human Evolution (by Nicholas Wade, 2010)
Chapter 13
From SNP to HapMap
Details
  1. Types of SNPs
  2. Density and genome distribution
  3. Impact on genes
    • Coding regions
    • Regulatory regions
  4. Haplotype
  5. The HapMap Project
    • Phase 1
    • Phase 2
    • Phase 3
Chapter 14
Structural variation (SV)
Details
Key words & concepts
  1. Comparative genomic hybridization (CGH)
  2. Copy number variation (CNV)
    • Deletion
    • Duplication
  3. Balanced rearrangement (BRE)
    • Inversion
    • Transposition and translocation
Readings
  1. Copy Number Variation in Human Health, Disease, and Evolution (by Zhang et al., Annual Review of Genomics and Human Genetics, 2009)
Chapter 15
Loss-of-function variations
Details
Key words & concepts
  1. Identification
  2. Validation
  3. Buffering of genetic variation
Readings
  1. Initial sequence of the chimpanzee genome and comparison with the human genome
    Note: Although this paper describes differences between two species (human and chimpanzee), differences of this kind also exist between human individuals. The paper reported many human disease genes in the chimpanzee genome.
  2. Principles for the Buffering of Genetic Variation (by Hartman et al., Science, 2001)
Chapter 16
GWAS (genome-wide association studies)
Details
Readings
  1. Genomewide Association Studies and Assessment of the Risk of Disease (by Manolio, NEJM, 2010)
Chapter 17
Personalized genomes & The 1000 Genomes Project
Details
  1. Personalized genomes
    • James Watson
    • Craig Venter
    • Yan Huang ("An Asian")
    • A Korean
    • Desmond Tutu
  2. The 1000 Genomes Project
 
 
Module 4: Inter-species comparison
Chapter 18
Gene family: contraction and expansion
Details
  1. Gene family classification
  2. Comparative gene family classification
  3. Stable gene family (e.g., ABC transporters)
  4. Dynamic gene family (e.g., chemosensory genes)
Chapter 19
Transcription factor and gene battery
Details
  1. Classification of transcription factors
  2. Example: RFX gene family
  3. Example: RFX gene battery
Chapter 20
Horizontal gene transfer
Details
Readings
  1. Nature Review Focus: Horizontal gene transfer (2005)
  2. Lateral gene transfer and the nature of bacterial innovation (by Ochman et al., Nature, 2000)
  3. Lateral gene transfer between Archaea and Bacteria (by Nelson et al., Nature, 1999)
Chapter 21
Virulence factors & drug targets
Details
Readings
  1. Carbon metabolism of intracellular bacterial pathogens and possible links to virulence (Eisenreich et al., Nature Reviews Microbiology, 2010)
Chapter 22
Metagenomics
Details
Key words & concepts
  1. Environmental
    • Hot spring
    • Ocean
    • Sludge
    • Soil
  2. Organismal
    • Gut
    • Skin
    • Feces
    • Lung
Readings
  1. Primer: Metagenomics
Chapter 23
What makes us human?
Details

Key words & concepts
  1. Human vs. animals
  2. Human vs. chimpanzee
  3. Human vs. Neandertal
  4. Human vs. human
Overview

Objectives

Readings
  1. A Draft Sequence of the Neandertal Genome (by Green et al., Science, 2010)
  2. An RNA gene expressed during cortical development evolved rapidly in humans (by Pollard et al., Nature, 2006)
  3. Evolution at two levels in humans and chimpanzees (by King and Wilson, Science, 1975)
    Note: This study has been regarded as the first contribution to comparative genomics (Sean Carroll, PLoS Biology, 2005).
Listenings
  1. Dr. Katherine Pollard: What makes us human?
Chapter 24
The Genome 10K Project
Details
Key words & concepts
  1. Ancestral state reconstruction
  2. Comparative genomics
  3. Molecular evolution
  4. Species conservation
  5. Vertebrate biology
Overview

This large-scale project was proposed in anticipation of a precipitous drop in costs and an increase in sequencing efficiency.

Objectives

Be aware of this large-scale project and the resources that will be available for comparative genomics during and after the completion of this project.

Readings
  1. Genome 10K: A Proposal to Obtain Whole-Genome Sequence for 10 000 Vertebrate Species (by Haussler et al., 2009)
 
Module 5: Student projects & presentations
There will be multiple presentation sessions.
Overview

Students will be divided into groups of two. Each group will propose a comparative genomics project at the beginning of the course and carry it out during the course. At the end of the course, each group will present its project. One student will focus on background and motivation, while the other will focus on results and interpretation.


Please send input to chenn@sfu.ca. Last updated: August 4, 2010