Dr. Riley teaches both the undergraduate Bio-360 and graduate Bio-664 Bioinformatics courses. The undergraduate course uses popular online bioinformatics tools, while the graduate course introduces R programming with popular R Bioconductor packages. The undergraduate course also includes a 1-credit Bio-361 Biopython Lab.

 

 

 

 

SYLLABUS                     BIOLOGY 664

Old Title: Bioinformatics for Molecular Biologists
Potential New Title: Integrated Bioinformatics Using R for Both Wet and Dry Scientists
Weekly Schedule: 3:30 – 5:00 Tuesday & Thursday
Location: Wheatley Biology Conference Room W-3-022

Instructor: Todd Riley
Office Hours: Tuesdays and Thursdays from 10:30 to 12:00. Please email if you need to schedule a different time.
Office: ISC 4730 (4th floor)
Email: todd.riley@umb.edu
Phone: 617-287-3236

I. Overview: This course has changed quite a bit – starting with the initial changes made last year. Members of our department have agreed that our graduate students need to learn how to use R to analyze their biological data. Loaded with hundreds of packages, the R programming environment really has become the de facto standard for analyzing many different types of biological data. Although R is sometimes clumsy, it is always very powerful – with the latest and greatest statistical tools and also great graphing capability for producing publication-quality figures. In short, we think it’s important that our graduate students learn at least some R to help them analyze their data. So the course has been revamped to include R labs. Each class will begin with a lecture and then end with an R lab where we will apply what we’ve learned. We will be changing the name of the course to reflect its old and new goals:

Goal 1: Provide our students with both the computational foundation and the statistical foundation to competently analyze biological data using the R statistical programming environment.

Goal 2: Provide a graduate level bioinformatics course that is accessible and rewarding for both wet lab scientists (e.g. molecular biologists) and dry scientists (e.g. machine learning, data mining, statistics).

Goal 3: Provide the skillset necessary to design, execute, and analyze a basic research project using R and to highlight the analysis in a “publishable” paper that contains high-quality figures generated in R.

Getting the Most Out of This Class: What you get out of a class is highly dependent upon what you put into it. If you are serious about learning how to analyze biological data using the latest and greatest tools to find answers to important questions in biology, you’ve come to the right place! I highly recommend that you commit yourself to diligently studying all the material in each chapter – including the exercises at the end of each chapter. It is also important not to look at the solutions to the exercises until after you have implemented your own solution. The best way to learn how to approach a problem and implement a sound solution is to go through the iterative process yourself.

Note: This course analyzes data from molecular and cellular biology – including genomics and systems biology. A similar course with a focus on ecology is taught by Jarrett Byrnes. Here is the course info: http://jarrettbyrnes.info/biol697/. For those students who straddle both disciplines, I would suggest you take both courses! Also, this course serves as a possible follow-up to our Bio-360/361 undergraduate bioinformatics course and lab. In the Bio-361 lab, students learn bioinformatics skills using Biopython instead of R and Bioconductor. However, the undergraduate course is not a prerequisite for this course.

Prerequisites: An undergraduate course or a graduate course in molecular biology or genetics (Biol 370 or Biol 675 or permission of the instructor). You are also required to have a basic knowledge of algebra and introductory calculus (although no calculus will be used). Undergraduate courses in probability theory, computer science, and genetics are useful, but not required. Students who are new to programming should read chapter 1 of Adler before or during the first week of the course. Students who are new to genetics (or rusty) should read Schultz before or during the first week of the course.

Also, you must install the latest versions of R and RStudio on your laptop before the first day of class. If you are running Windows, please install Cygwin as well and choose “C:\” as your root directory during the installation. The default Cygwin settings for everything else will suffice. You will also need to add the “C:\bin” directory to your path. You can edit your path environment variable by going to Control Panel->System->Advanced system settings->Environment variables->System variables. (Cygwin is a full suite of Unix shells and utility programs ported to the Windows platform.) Please bring your laptop to all classes.
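
After installation, a quick check from the R console can confirm which R version is running and whether a directory is on your path. The sketch below is only an illustration: the helper name path_contains is hypothetical (it is not part of R), and it assumes the directory was added as “C:\bin”.

```r
# Hypothetical helper: TRUE if 'dir' is an entry in a PATH-style string.
# The comparison ignores case and trailing slashes.
path_contains <- function(dir, path_string = Sys.getenv("PATH"),
                          sep = .Platform$path.sep) {
  entries <- strsplit(path_string, sep, fixed = TRUE)[[1]]
  normalize <- function(p) tolower(sub("[/\\\\]+$", "", p))
  normalize(dir) %in% normalize(entries)
}

# Report the installed R version
print(R.version.string)

# On Windows, check whether C:\bin is on the path
print(path_contains("C:\\bin"))
```

If the last line prints FALSE on Windows, the path edit described above has probably not taken effect yet; restarting RStudio after changing environment variables is usually needed.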

 

II. Required Text: Applied Statistics for Bioinformatics using R – 2nd Edition (DRAFT), Wim P. Krijnen and Todd R. Riley

Chapters of the 2nd Edition are provided by the links below.

PLEASE NOTE: The new chapters below are updated often. After clicking on one of the links below, be sure to hit the “Refresh” button to make sure that you are getting the most recent version:

Ordering a Hard Copy from Campus Printing: http://www.umb.edu/quinn_graphics/quinngraphics

Instructor’s Editorial Comments: I think that this book is great technically and really appropriate for this course since it’s completely centered around using R. On the downside, the 1st edition has some typos, grammatical errors, and awkward English. Hopefully, the 2nd edition improves upon these weaknesses. I plan to go through most of the material in each chapter, and then finish with ChIP-seq and RNA-seq analysis in R, which are not covered in the book.

Highly Recommended Text: Adler, J. (2009) R in a Nutshell: A Desktop Quick Reference. O’Reilly.
Purchase: [amazon]

Instructor’s Editorial Comments: This is a great R Reference Manual. In fact, I use this book and I’ve been programming in R for years.

Recommended Text: Schultz, M. (2009) The Stuff of Life: A Graphic Guide to Genetics and DNA
Purchase: [amazon]

Instructor’s Editorial Comments: This book provides a great up-to-date, cartoon-style overview of genetics that is both highly informative and somewhat entertaining.

Useful Online References for R
Quick-R: http://www.statmethods.net/
R & Bioconductor Manual: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual
Apply Family in R: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-…
Apply Functions: http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm
Producing Simple Graphs with R: http://www.harding.edu/fmccown/R/
A Handbook of Statistical Analyses Using R – Brian S. Everitt and Torsten Hothorn (PDF)
Statistics Using R with Biological Examples – Kim Seefeld, Ernst Linder (PDF)
simpleR – John Verzani (PDF)
R Fundamentals and Programming Techniques – Thomas Lumley (PDF)
A list of tutorials in R from universities around the world: http://pairach.com/2012/02/26/r-tutorials-from-universities-around-the-world/

 

III. Assignments and Final Independent Project:
1. Problem Sets – 100% of your overall grade is determined by your grades on the problem sets
2. Independent Final Project – I also recommend submitting an extra-credit, final project on biological data of your choosing

1. Problem Sets: All problem sets will be done in an R script (*.r file). A major goal of this course is that each student learns how to write clear, concise, well-documented, reusable R code. Please follow the guidelines below:

  • The R scripts must be uploaded into Blackboard under “Course Materials” by the due date before the beginning of class.
  • Liberally comment your code with lines beginning with the “#” character. The purpose of commenting is to help others and yourself understand the logic behind your code later on. The more commenting you have, the more reusable your code will be later on!
  • Define column and row labels in all matrices and data.frames – which makes your data structures more readable and understandable.
  • Use named row and column referencing whenever possible – which will make your R code more reusable.
  • Use numbered indexing as little as possible. Instead, use named referencing, nrow(), and ncol().
  • The submitted R script must load any necessary libraries and/or datasets not included in the base installation of R.
  • All lines inside the submitted R script must run error free.
  • Use the following naming scheme for all your submitted R scripts: biology664.spring2014.hwN.firstName.lastName.r (replace N, firstName, and lastName with the homework number and your own name).
  • Make sure that your full name is at the top of the text inside the R script.
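
As a rough illustration of the guidelines above, here is a minimal sketch of a script that follows them. The name in the header, the gene names, and the expression values are all made up for this example; they do not come from any assignment.

```r
# Jane Doe (placeholder; put your own full name here)
# Biology 664 - illustrative example only, not an actual problem set

# Made-up expression values for three genes under two conditions.
# dimnames gives every row and column a readable label.
expr <- matrix(c(5.1, 7.3,
                 2.4, 2.2,
                 8.0, 3.5),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("control", "treated")))

# Named referencing: still correct if rows or columns are reordered later
gene_a_treated <- expr["geneA", "treated"]

# Per-gene difference using column names rather than column numbers
diff_by_gene <- expr[, "treated"] - expr[, "control"]

# Loop bounds come from nrow() instead of a hard-coded 3
for (i in seq_len(nrow(expr))) {
  cat(rownames(expr)[i], ": mean =", mean(expr[i, ]), "\n")
}

# The same labels carry over to a data.frame
expr_df <- as.data.frame(expr)

# Genes whose treated value exceeds the control value
print(expr_df[expr_df$treated > expr_df$control, ])
```

Because nothing refers to a row or column by position, the same code keeps working if more genes or conditions are added to the matrix.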

2. Independent Final Project: Each student can also submit an independent, extra credit project that should be designed and executed by the student. This project can be of considerable benefit to the student if it is closely related to his/her thesis research project or professional research at work. Three potential ideas are to study a gene, pathway, and/or phenotype closely related to your research. For a gene or pathway you can study their expression and/or regulation related to stimuli or phenotypes. For a phenotypic study you can use RNA, protein, and/or epigenetic data to find potential biomarkers. I’m open to ideas, so feel free to run them by me. You are welcome to use your own data, but may need to augment it with outside data if you don’t have enough. Of course, you need to use R for your analysis.

A. Project Proposal: If a student wishes to submit an independent project, he/she should prepare a brief proposal, 2-4 pages, describing the independent project and must submit this proposal no later than October 27. The proposal should be divided into four sections:

1.  Background and objectives: A description of the background of the biological system and the question(s) that you hope to answer.

2.  Computational methods: The computational methods that you intend to use to answer the question(s) in your proposal.

3.  Discussion: A brief description of how you plan to evaluate the biological significance of the results of your computer analysis. It’s very important in science to motivate your audience to care about your work with its “Impact” or “Significance”.

4.  Several references describing the background of your proposed project.

The proposal will not be graded, because its sole purpose is to determine whether the objectives of the project are reasonable and interesting.

Please note that the final project should be designed to test a biological hypothesis. I don’t consider projects that are purely technical, such as designing PCR primers, to be appropriate at the graduate level.

B. Final Report: The optional final report should be in the form of a scientific paper, divided into the following sections: (1) Abstract, (2) Background and objectives, (3) Computational methods, (4) Results and discussion, (5) Conclusions, (6) A brief description of how the conclusions of your analyses could be tested using biochemical or genetic techniques, (7) References.

References: Please follow the Cell Journal guidelines for references EXACTLY.  I highly recommend that you use a referencing and bibliography software package like EndNote, Zotero, etc. It will make your life much easier! References in the text should include the authors’ names and dates:

– One author: (Pearson, 1996)
– Two authors: (Smith and Waterman, 1981)
– Three or more authors: (Altschul et al., 1990)
– Multiple references: (Pearson, 1996; Smith and Waterman, 1981; Altschul et al., 1990)

The references in the bibliography should also adhere to the Cell Journal format:

– Journal article: Lipman, D.J., Pearson, W.R. (1985). Rapid and sensitive protein similarity searches.  Science 227, 1435-1441.
– Book chapter: Schuler, G.D. (1998). Sequence alignment and database searching.  In: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, AD Baxevanis and BFF Ouellette, eds.  Wiley Interscience, New York, NY.

Organization: Please try to organize the sequence information and the interpretations as clearly as possible.  It is unreasonable to expect the reader to hunt through large numbers of pages to find data supporting a specific conclusion.  There are two acceptable ways of organizing the figures.  First, the sequence data and text can be integrated into the body of the paper.  Second, the sequence data can be compiled into a series of clearly-labeled appendices.

Figures: Every figure should have a caption adequately describing the contents of the figure without having to resort to reading the main text.  There must be at least 4 figures created by the student, and at least 3 of them should be created in R.
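
For the figures created in R, one common approach is to write each plot directly to a PDF file from a script, so the figure can be regenerated whenever the analysis changes. The sketch below uses base R graphics only; the data, axis label, title, and file name are invented for illustration.

```r
# Made-up data purely for illustration
set.seed(1)
control <- rnorm(20, mean = 5)
treated <- rnorm(20, mean = 6)

# Open a PDF graphics device (width and height are in inches)
fig_file <- file.path(tempdir(), "figure1.pdf")
pdf(fig_file, width = 5, height = 4)

# Draw the figure with labeled axes and a title
boxplot(list(control = control, treated = treated),
        ylab = "Expression (arbitrary units)",
        main = "Hypothetical gene expression")

# Close the device so the file is written to disk
invisible(dev.off())
```

The caption itself still belongs in the report text beneath the figure; the script only needs to produce a clearly labeled plot.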

Length: The final report should be 10-15 pages double-spaced, not including computer output or references.

IV. Classroom Policy

Honesty: The homework assignments are intended to be done individually. You can talk with each other, but all submitted work should be done strictly on your own. All students are expected to follow the University’s Code of Student Conduct. (If you are caught cheating, whether you are the giver, receiver, or collaborator, the consequences can be dire.)

Accommodations: Section 504 of the Americans with Disabilities Act of 1990 offers guidelines for curriculum modifications and adaptations for students with documented disabilities. If applicable, students may obtain adaptation recommendations from the Ross Center for Disability Services, Campus Center 2nd Floor, 2100 Street, Room 2010, 617-287-7430. The student must present these recommendations and discuss them with each professor within a reasonable period, preferably by the end of the Add/Drop period.

 

V. Course Material

R Code: Unfortunately, some R code in the textbook drifts off the page and is lost. Also, solutions are missing for some exercises and don’t work for others. To solve these problems, I’ve provided all the R Code from each chapter (including solutions to the exercises) in R scripts separated by chapter. I’ve also reformatted some of the code to increase readability. Please use these R scripts in class with RStudio:

Chapter 1 R Script

Chapter 1 Solutions R Script

Chapter 2 R Script

Dr Kesseli’s Question R Script

Chapter 2 Solutions R Script

Chapter 2 Supplemental R Script

Chapter 3 R Script

Chapter 3 Solutions R Script

Chapter 3 Supplemental R Script

Chapter 4 R Script

Chapter 4 Solutions R Script

Chapter 4 Supplemental R Script

Chapter 5 R Script

Chapter 5 Solutions R Script

Chapter 6 R Script

Chapter 6 Solutions R Script

Chapter 7 R Script

Chapter 7 Solutions R Script

Chapter 8 R Script

Chapter 8 Supplemental R Script

Chapter 8 Solutions R Script

Chapter 9 R Script

Chapter 9 Solutions R Script

Chapter 10 R Script

Chapter 10 Solutions R Script

 

PDFs: Other course material including lecture slides and papers will be posted below in the Course Schedule:

 

VI. Course Schedule

Note: The following course schedule is tentative and may change depending on the needs and wishes of the students. We may spend more time on certain course material and skip other material based on student feedback.

Tuesday, September 5 – Chapter 1: Brief Introduction into Using R – programming in RStudio

Thursday, September 7 – Chapter 1: Brief Introduction into Using R – vectors, lists, matrices, data.frames

Tuesday, September 12 – Chapter 1: Brief Introduction into Using R – more matrices and data.frames

Thursday, September 14 – Chapter 2: Data Display and Descriptive Statistics – univariate data display

Tuesday, September 19 – Chapter 2: Data Display and Descriptive Statistics – descriptive statistics

Thursday, September 21 – Chapter 3: Important Distributions – binomial, Poisson, normal, cumulative

Tuesday, September 26 – Chapter 3: Important Distributions – χ², t, F, hypergeometric

Thursday, September 28 – Chapter 4: Estimation and Inference – Z-test, t-test, F-test, binomial

Tuesday, October 3 – Chapter 4: Estimation and Inference – χ²-test, Fisher’s exact test, normality tests, outliers, Wilcoxon rank-sum test

Thursday, October 5 – Chapter 5: Linear Models – lm, rlm, ANOVA

Tuesday, October 10 – Chapter 5: Linear Models – assumptions, robust test, R²

Thursday, October 12 – Chapter 5: Linear Models – applications, exercises

Tuesday, October 17 – Chapter 6: Microarray Analysis – preprocessing, filtering, linear models, annotating

Thursday, October 19 – Chapter 6: Microarray Analysis – GO analysis, interpreting

Tuesday, October 24 – Chapter 6: Microarray Analysis – exercises

Thursday, October 26 – Chapter 7: Cluster Analysis and Trees – distance, single linkage, k-means

Tuesday, October 31 – Chapter 7: Cluster Analysis and Trees – correlation coefficient, PCA

Thursday, November 2 – Chapter 8: Classification Methods – ROC curves, AUROC, trees, Random Forests

Tuesday, November 7 – Chapter 8: Classification Methods – SVMs, neural nets, logistic regression

Thursday, November 9 – Chapter 9: Analyzing Sequences – querying, pattern-matching, motif-finding, PWMs

Tuesday, November 14 – Chapter 9: Analyzing Sequences – local and global alignments, and Substitution Matrices

Thursday, November 16 – Chapter 9: Analyzing Sequences – dynamic programming, BLAST, MUSCLE

Tuesday, November 21 – RNA-seq Analysis, File Formats

Thursday, November 23 – Thanksgiving Vacation

Tuesday, November 28 – RNA-seq Analysis

Thursday, November 30 – Chapter 10: Markov Models – random sampling, transition matrix, stationary distribution

Tuesday, December 5 – Chapter 10: Markov Models – phylogenetic trees

Thursday, December 7 – Chapter 10: Markov Models – Hidden Markov Models, profile HMMs, PFAM

Tuesday, December 12 – Extra make-up class

Tuesday, December 19 – Optional independent project due by 5pm!

CONTACT

Dr. Todd Riley
Assistant Professor of Biology
University of Massachusetts Boston
100 Morrissey Blvd. | ISC Building Room 4730
Boston, Massachusetts 02125
Phone: (617) 287-3236
