Jack Chen, Associate Professor

Department of Molecular Biology and Biochemistry
Simon Fraser University

Office: SSB8111
Phone: (778)782-4823
Email: chenn(at)sfu.ca

B.Sc., Fudan University, Shanghai
Ph.D., Chinese Academy of Sciences, Qingdao

Comparative Genomics (MBB461/MBB761): an outline

Overview: This new course aims to provide a comprehensive introduction to the emerging field of comparative genomics for upper-level undergraduate students and possibly graduate students. Since the completion of the Human Genome Project (1990-2003), the number of sequenced genomes has been increasing exponentially, driven by revolutionary developments in DNA sequencing technologies and, most importantly, by the demands of researchers in fields including medicine, fisheries, forestry, agriculture, and evolution. Although genome sequences contain the ultimate information that drives gene expression, development, and differentiation, how that information is encoded in the genome is largely unknown. How does a transcription factor drive unique phenotypes? How does novelty arise in genome evolution? How can a mutation cause a disease condition? Comparative genomics has been effective in addressing these questions. This course reviews how these questions have been tackled in comparative genomics over the last decade.

This course has five modules. Each module takes about three weeks. Each student will do a course project and give a presentation on a published work. There is no final exam. In each module, in addition to introducing key concepts, I will describe a representative large project, which helps give students a perspective on what is going on in the field. The first module introduces the fundamentals of comparative genomics, including key concepts, sequencing methods, bioinformatics tools, and resources for comparative genomics. In the second module, I will focus on the various types of functional elements in the genome, starting with genes. I will also introduce the identification and function of ultraconserved elements, followed by cis-regulatory elements. The third module deals with the most active subfield of comparative genomics: variation within a species. The fourth module is devoted to comparing the genomes of different species. In the last module, students will present their projects.

Textbook: none

Prerequisite: TBD

Class size: 20 students

Topics	Descriptions

Module 1: Fundamentals
Chapter 1 Comparative genomics: an emerging field
Key words & concepts: Alignment: from small sequences to whole genome; Homology: paralog vs. ortholog; Synteny: chromosomal synteny vs. synteny block; Phylogeny: gene phylogeny vs. organismal phylogeny; Species & speciation
Overview: A genome is a linear molecular structure that carries genetic information. It consists of four nucleotides (A, T, C, and G). Genetic information is scattered across the genome and is encoded in these four "letters" in different orders and combinations. One fundamental question is how genetic information is encoded in the genome, parts of which are sometimes called the "dark matter". A genome contains a large number of different types of parts: genes and regulatory elements, sometimes collectively called functional elements. Genes, in particular protein-coding genes, are relatively straightforward to identify. In contrast, other types of functional elements, such as enhancers and insulators, are harder to detect. Regulatory elements dictate gene expression. In particular, they determine where and when a gene is expressed, and how much of it is expressed. Presently, an effective way to understand a genome is comparative genomics, in which two or more genomes are compared for similarities and differences. Comparative analysis can be done at the DNA level and at the protein level. Using bioinformatics programs, sequences are aligned and the alignments are examined for evolutionary relationships. Are the sequences homologous, that is, do they share a common ancestor? Comparative analysis can also be done for genomes at different evolutionary distances, ranging from different strains of one species to distantly related species. Differences between genomes (i.e., genotypes) can therefore be linked to functional consequences, or phenotypes.
Objectives: Understand why comparative genomics is important for studying the G2P (genotype-to-phenotype) relationship, and how comparative genomics is carried out.
Readings: Primer: Comparative genomics (by Ross Hardison, 2003); Review: Comparative genomics (by Webb Miller et al., 2004)
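To make the alignment step mentioned in the Chapter 1 overview concrete, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch method), followed by a percent-identity calculation. The function names and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not part of the course material; production aligners use more elaborate scoring schemes.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b).

    The scoring parameters are illustrative assumptions.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences have the same residue."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


score, top, bottom = needleman_wunsch("ACGT", "ACT")
print(score, top, bottom, percent_identity(top, bottom))  # 1 ACGT AC-T 75.0
```

The same idea underlies both DNA-level and protein-level comparison; only the alphabet and the scoring scheme change.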
Chapter 2 DNA sequencing technologies: the driving force
Key words & concepts: First-generation: Sanger sequencing; Second-generation: 454 sequencing, Solexa sequencing, SOLiD sequencing; Third-generation: Nanopore sequencing
Overview: The ability to determine the order of nucleotides in a DNA molecule is an essential first step toward understanding the composition and function of a genome. Over the last 40 years, many different methods have been developed. The Sanger sequencing method, invented by Fred Sanger, who won his second Nobel Prize for it, was the key method used in the Human Genome Project. Second (or "next")-generation DNA sequencing methods were introduced a few years after the completion of the Human Genome Project. Because different sequencing methods produce reads of different lengths, which can be either paired-end or single-end, they can be used to address different questions. Third-generation sequencing methods are on the horizon for production use. Popular DNA sequencing methods will be described in this lesson, primarily because the reads from each method have distinct characteristics and therefore need separate methods for handling and analysis.
Objectives: Master the file format of each sequencing method and how the files can be processed for further analysis.
Readings: Review: Applications of new sequencing technologies for transcriptome analysis (by Morozova and Marra, 2009); Review: The potential and challenges of nanopore sequencing (by Branton et al., 2008); File format: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants (by Cock et al., 2009)
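As an illustration of the file-format objective above, the following is a minimal sketch of reading FASTQ records and decoding their quality strings. It assumes the simple four-lines-per-record layout and a Phred quality encoding with an ASCII offset of 33 (the Sanger variant) or 64 (a later Illumina variant); the function names are hypothetical, and real files with wrapped sequence lines would need a more careful parser.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string into integer Phred scores.

    offset=33 assumes Sanger-style FASTQ; offset=64 assumes the Illumina 1.3+ variant.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
for name, seq, qual in read_fastq(example):
    scores = phred_scores(qual)
    print(name, seq, scores[0], error_probability(scores[0]))
```

The offset matters: decoding a file with the wrong offset shifts every quality score by 31, which is one reason the variants have to be told apart before analysis.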
Chapter 3 Bioinformatics: the enabling force
Key words & concepts: Alignment: short sequence alignment, whole genome alignment, database searches (BLAST); Homolog identification: ortholog, paralog; Synteny: synteny block identification (OrthoCluster), synteny breakpoint; Genome visualization: dotplot, Circos, Genome Browser & Synteny browser (gbrowse & gbrowse_syn)
Overview: Bioinformatics is a sister field of genomics and emerged almost simultaneously with it. The amount of information generated by large genome sequencing and related functional genomics projects can no longer be recorded in the notebooks that traditional biologists are comfortable with. Computers, including both hardware and software, are needed for storing, managing, retrieving, analyzing, and reporting genome information. In this lecture, I will give an overview of the bioinformatics tools developed over roughly the last 10 years for DNA alignment, for synteny identification, and for display using genome browsers.
Objectives: Be comfortable selecting computer programs appropriate for different conditions.
Readings: Bioinformatics: alive and kicking (by Stein, 2008); BWA (by Li et al., 2010); OrthoCluster (by Zeng et al., 2008) & OrthoClusterDB (by Ng et al., 2009); gbrowse (by Lincoln Stein et al., 2010); Circos: An information aesthetic for comparative genomics (by Krzywinski et al., 2009)
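Of the visualization methods listed above, the dotplot is simple enough to sketch in a few lines. The example below marks every position where a short word from one sequence exactly matches a word in the other; runs of hits along a diagonal indicate similar regions. The word size and the function names are illustrative assumptions, and this is a toy version rather than the method used by any particular tool.

```python
def dotplot(seq_x, seq_y, word=3):
    """Return the set of (i, j) such that seq_x[i:i+word] == seq_y[j:j+word]."""
    # Index every word of seq_y by its start position.
    index = {}
    for j in range(len(seq_y) - word + 1):
        index.setdefault(seq_y[j:j + word], []).append(j)
    # Look up every word of seq_x in that index.
    hits = set()
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], []):
            hits.add((i, j))
    return hits


def render(hits, len_x, len_y, word=3):
    """Draw the dotplot as text: columns follow seq_x, rows follow seq_y."""
    rows = []
    for j in range(len_y - word + 1):
        rows.append("".join("*" if (i, j) in hits else "."
                            for i in range(len_x - word + 1)))
    return "\n".join(rows)


x, y = "ACGTACGT", "ACGT"
print(render(dotplot(x, y, word=4), len(x), len(y), word=4))  # *...*
```

Here the word ACGT occurs twice in the first sequence, so the single row shows two hits, the kind of pattern a tandem duplication produces.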
Chapter 4 Resources for comparative genomics
Key words & concepts: Primary and clade-specific databases: NCBI genome resources, Ensembl genome resources, UCSC genome resources; Model organism databases (MODs): Yeast: SGD, Arabidopsis: TAIR (TAIR funding crisis), C. elegans: WormBase, Drosophila melanogaster: FlyBase, Rat: RGD
Overview: Sequence results of publicly funded genome projects are deposited in publicly accessible databases, which makes data retrieval convenient. In this lecture, three major public genome data resources will be described. Although the data stored in these databases are essentially identical, each database has its own unique bioinformatics tools, which makes each of them useful for certain purposes.
Objectives: Be familiar with the architecture of each of the three public genome data resources and learn how to retrieve data effectively.
Readings: Touring Ensembl: A practical guide to genome browsing (by Spudich and Fernandez-Suarez, 2010); Entrez Gene: gene-centered information at NCBI (by Maglott et al., 2007)
Chapter 5 The Human Genome Project
The Human Genome Project: The Public; The Private

Module 2: Functional elements: identification & function
Chapter 6 Gene
Key words & concepts: Gene definition; Protein-coding gene: exon, intron, 5' & 3' UTRs, promoter, alternative splicing & isoform; Non-coding gene: tRNA, rRNA, microRNA; Gene prediction: comparative prediction, transcriptome-based prediction; Genes in operons: prokaryote, eukaryote; Gene duplication; Gene birth and death
Overview: The gene is a concept in motion. When Gregor Mendel first published his studies on pea plants in 1866, he did not use the term gene because it had not yet been defined. The term was coined by the Danish botanist Wilhelm Johannsen in 1909, when the physical basis of the gene was still unknown. In 1910, Thomas Morgan's work on fruit flies showed that genes sit on chromosomes, leading to the idea of genes as beads on a string. In 1941, George Beadle and Edward Tatum introduced the concept that one gene makes one enzyme. Shortly after, in 1944, Oswald Avery and colleagues found that genes are made of DNA. In 1953, James Watson and Francis Crick published the structure of DNA, and Crick later formulated the central dogma of molecular biology. The concept of a gene as a contiguous segment of DNA was broken when Richard Roberts and Phillip Sharp discovered that genes can be split into segments, leading to the idea that one gene can make several proteins. Genes do not have to code for proteins: RNA genes include rRNAs and tRNAs. Work by Victor Ambros and Gary Ruvkun on the nematode C. elegans led to the discovery of the first microRNA gene in 1993. The structure of genes is dynamic. In a genome, new genes are born and existing genes can die. The birth and death of genes can be detected by comparing the genomes of closely related organisms.
Objectives:
Readings: Origins, evolution and phenotypic impact of new genes (by Kaessmann, Genome Research, 2010); What is a gene, post-ENCODE? History and updated definition (by Gerstein et al., Genome Research, 2007); The origin of new genes: glimpses from the young and old (by Long et al., Nature Reviews Genetics, 2003)
Chapter 7 Ultraconserved elements
Identification; Function
Chapter 8 Functional elements: cis-regulatory elements
Transcription factor binding sites (TFBSs); Promoter; Enhancer; Insulator; Finding motifs: ChIP-chip & ChIP-seq
Chapter 9 ENCODE & modENCODE projects
Pilot project (1% of the human genome); ENCODE: functional elements in humans; modENCODE: functional elements in D. melanogaster and C. elegans
Chapter 10 Synteny blocks
Chromosomal synteny; Synteny blocks: perfect synteny blocks, imperfect synteny blocks, "ultraconserved" synteny blocks; Synteny breakpoints
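The notion of a synteny block and its breakpoints can be illustrated with a deliberately simplified sketch. It assumes that each genome is given as an ordered list of gene names on a single chromosome, that orthologs share the same name one-to-one, and that a "perfect" block is a run of genes that are adjacent in both genomes in the same or the inverted order. These definitions and the function name are assumptions made for illustration; strand information and mismatches are ignored, and this is not the OrthoCluster algorithm.

```python
def collinear_blocks(genes_a, genes_b, min_size=2):
    """Find runs of genes that are adjacent in both genomes (same or inverted order)."""
    pos_b = {gene: k for k, gene in enumerate(genes_b)}
    blocks = []
    current = []    # genes in the block currently being extended
    direction = 0   # +1 same order, -1 inverted order, 0 not yet known

    def flush():
        # Keep the current run only if it is long enough to count as a block.
        if len(current) >= min_size:
            blocks.append(list(current))

    for gene in genes_a:
        if gene in pos_b and current:
            step = pos_b[gene] - pos_b[current[-1]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
        # A breakpoint: close the current run and start a new one.
        flush()
        current = [gene] if gene in pos_b else []
        direction = 0
    flush()
    return blocks


genome_a = list("abcdefg")
genome_b = list("abcxfedg")   # x is inserted and d-e-f is inverted
print(collinear_blocks(genome_a, genome_b))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

In this toy case the breakpoints fall after c and after f, and the second block is recovered even though it is inverted in the second genome.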
Chapter 11 Genome rearrangement events & genome evolution
Deletion; Insertion; Inversion; Transposition; Translocation

Module 3: Intra-species comparison
Chapter 12 Genome variations
Shortest woman & tallest man; Shortest man & tallest woman
Key words & concepts: Types of GVs; Formation of GVs: duplication, non-homologous recombination; GVs and disease conditions; Personalized genomics & medicine
Readings: New York Times: Adventures in Very Recent Evolution (by Nicholas Wade, 2010); New York Times: Scientists Cite Fastest Case of Human Evolution (by Nicholas Wade, 2010)
Chapter 13 From SNP to HapMap
Types of SNPs; Density and genome distribution; Impact on genes: coding regions, regulatory regions; Haplotype; The HapMap Project: Phase 1, Phase 2, Phase 3
Chapter 14 Structural variation (SV)
Key words & concepts: Comparative genomic hybridization (CGH); Copy number variation (CNV): deletion, duplication; Balanced rearrangement (BRE): inversion, transposition and translocation
Readings: Copy Number Variation in Human Health, Disease, and Evolution (by Zhang et al., Annual Review of Genomics and Human Genetics, 2009)
Chapter 15 Loss-of-function variations
Key words & concepts: Identification; Validation; Buffering of genetic variation
Readings: Initial sequence of the chimpanzee genome and comparison with the human genome (Note: Although this paper describes differences between two species, human and chimpanzee, differences of this kind also exist between human individuals. It reports many human disease genes in the chimpanzee genome.); Principles for the Buffering of Genetic Variation (by Hartman et al., Science, 2001)
Chapter 16 GWAS (genome-wide association studies)
Readings: Genomewide Association Studies and Assessment of the Risk of Disease (by Manolio, NEJM, 2010)
Chapter 17 Personalized genomes & The 1000 Genomes Project
Personalized genomes: James Watson, Craig Venter, Yan Huang ("An Asian"), A Korean, Desmond Tutu; The 1000 Genomes Project

Module 4: Inter-species comparison
Chapter 18 Gene family: contraction and expansion
Gene family classification; Comparative gene family classification: stable gene family (e.g., ABC transporters), dynamic gene family (e.g., chemosensory genes)
Chapter 19 Transcription factor and gene battery
Classification of transcription factors; Example: RFX gene family; Example: RFX gene battery
Chapter 20 Horizontal gene transfer
Readings: Nature Review Focus: Horizontal gene transfer (2005); Lateral gene transfer and the nature of bacterial innovation (by Ochman et al., Nature, 2000); Lateral gene transfer between Archaea and Bacteria (by Nelson et al., Nature, 1999)
Chapter 21 Virulence factors & drug targets
Readings: Carbon metabolism of intracellular bacterial pathogens and possible links to virulence (by Eisenreich et al., Nature Reviews Microbiology, 2010)
Chapter 22 Metagenomics
Key words & concepts: Environmental: hot spring, ocean, sludge, soil; Organismal: gut, skin, feces, lung
Readings: Primer: Metagenomics
Chapter 23 What makes us human?
Key words & concepts: Human vs. animals; Human vs. chimpanzee; Human vs. Neandertal; Human vs. human
Overview:
Objectives:
Readings: A Draft Sequence of the Neandertal Genome (by Green et al., Science, 2010); An RNA gene expressed during cortical development evolved rapidly in humans (by Pollard et al., Nature, 2006); Evolution at two levels in humans and chimpanzees (by King and Wilson, Science, 1975) (Note: This study was regarded as the first contribution to comparative genomics (Sean Carroll, PLoS Biology, 2005).)
Listenings: Dr. Katherine Pollard: What makes us human?
Chapter 24 The Genome 10K Project
Key words & concepts: Ancestral state reconstruction; Comparative genomics; Molecular evolution; Species conservation; Vertebrate biology
Overview: This large-scale project was proposed in anticipation of a precipitous drop in sequencing costs and an increase in sequencing efficiency.
Objectives: Be aware of this large-scale project and the resources that will be available for comparative genomics during and after the completion of this project.
Readings: Genome 10K: A Proposal to Obtain Whole-Genome Sequence for 10 000 Vertebrate Species (by Haussler et al., 2009)

Module 5: Student projects & presentations
There will be multiple presentation sessions.
Overview: Students will be divided into groups of two. Each group will propose a comparative genomics project at the beginning of the course and carry it out during the course. At the end of the course, each group will present its project: one student will focus on background and motivation, and the other on results and interpretation.

Please send input to chenn@sfu.ca. Last updated: August 4, 2010
