Week 3: Sequence Alignment & Phylogenetic Reconstruction

Molb 4485/5485 – Computers in Biology

Nicolas Blouin and Vikram Chhatre

Wyoming INBRE Bioinformatics Core
Dept. of Molecular Biology
University of Wyoming
nblouin@uwyo.edu
vchhatre@uwyo.edu
http://molb4485.uwyo.online

Introduction

Nucleotide and amino acid (protein) sequences are commonly used to reconstruct the phylogenetic history of organisms. Sequences obtained by researchers are deposited in various online databases and are freely available for use by others. The analysis of sequence data frequently begins by obtaining an alignment. Because mutations such as insertions and deletions (indels) change the length of a sequence, gaps are added to sequences to maintain positional homology. Sequence data can also be obtained from unrecognized or undescribed organisms to provide information about their identity or their proper placement among groups of organisms whose identity is already known.

Analysis typically includes alignment of sequence information obtained for the same gene region in several organisms to see how homologous sites in the sequences compare.
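
To make the idea of comparing homologous sites concrete, here is a minimal sketch that counts the columns at which two already-aligned sequences carry the same base. The function name count_matches and the two toy sequences are made up for illustration; they are not part of the course data, and the sketch assumes both sequences have the same aligned length.

```shell
# Count alignment columns where two aligned sequences agree.
# Assumes both strings have the same length; a gap ("-") never counts as a match.
count_matches() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a); m = 0
    for (i = 1; i <= n; i++) {
      x = substr(a, i, 1); y = substr(b, i, 1)
      if (x == y && x != "-") m++
    }
    print m
  }'
}

# Toy example: the gap in the first sequence keeps the later columns lined up.
count_matches "ATG-CA" "ATGTCA"
```

Without the gap, every base after position 3 in the first sequence would be compared against the wrong site in the second.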

Goals For This Exercise:

Learning Outcomes From Hands-on Work:

1. Obtain Sequences for cytochrome b Genes

Use the NCBI website to locate and obtain at least one nucleotide (DNA) sequence for cytochrome b for each of the bears in the table below, and put them all into a single FASTA file on Teton. Your searches through NCBI in week two exposed you to all of the tools needed to quickly complete this portion of this week’s exercise.

1.1 Get onto Teton and prepare to input data

ssh <YourNetID>@teton.uwyo.edu
cd /project/inbre-train/<YourNetID>/

pwd
mkdir <your_last_name>_Week3    # example: Jones_Week3
cd Jones_Week3
vi KnownBears.fasta
/project/inbre-train/Week3Data/GiantPanda.fasta
>gi|38679207|gb|AY390359.1| Ailuropoda melanoleuca from Daili cytochrome b (cytb) gene, complete cds; mitochondrial
ATGATCAACATCCGAAAAACTCATCCATTAGTTAAAATTATCAACAACTCATTCATTGACCTTCCAACACCATCAAACATTTCAACATGATGGAACTTTGGGTCTCTGTTAGGAGTGTGTCTGATCTTGCAAATCTTAACAGGCTTATTTCTAGCCATACACTATACATCAGATACAGCTACAGCCTTTTCATCAGTCGCACACATTTGTCGAGACGTCAACTATGGTTGATTTATCCGATATATACATGCCAATGGGGCCTCTATATTTTTTATCTGCCTATTTATACACGTAGGGCGAGGCTTATACTATGGATCATACCTATTTCCAGAGACATGGAATATCGGAATTATTCTCCTACTTACAGTTATAGCCACAGCATTCATAGGGTATGTACTACCTTGAGGACAAATATCCTTCTGAGGAGCAACCGTCATTACTAACCTACTATCAGCAATTCCTTACATTGGCACTAATCTAGTGGAGTGAGTCTGAGGGGGTTTCTCCGTAGATAAAGCAACACTAACCCGATTTTTTGCTTTTCACTTTATCCTTCCATTTATCATCTCAGCACTAGCAATAGTCCATCTATTATTCCTTCACGAAACAGGATCTAATAACCCCTCCGGAATTCCATCTGACCCAGACAAAATCCCATTTTACCCCTATCATACAATTAAAGACATCCTAGGCGTCCTATTTCTTGTCCTCGCCTTAATAACCCTGGCTTTATTCTCACCAGACCTGTTAGGAGACCCTGATAACTATACCCCTGCAAATCCACTAAGTACCCCGCCACATATTAAGCCTGAATGGTACTTTCTATTTGCCTACGCTATCCTGCGATCTATTCCTAATAAACTAGGAGGGGTGCTAGCTCTAATCTTCTCTATTCTAATTCTAACTATTATTCCACTATTACATACATCCAAACAACGAAGCATGATATTCCGACCTCTAAGTCAATGCTTATTCTGACTCCTAGTAGCAGACCTACTCACACTAACATGAATTGGAGGACAGCCAGTAGAACACCCCTTCATTATTATTGGGCAATTGGCCTCTATTCTCTACTTTACAATTCTTCTAGTACTTATACCTATCACTAGCATTATTGAGAATAGCCTCTCAAAATGAAGA

Latin name                Common name
Ursus arctos              Brown Bear
Ursus spelaeus            Cave Bear
Ursus maritimus           Polar Bear
Helarctos malayanus       Sun Bear
Melursus ursinus          Sloth Bear
Ursus americanus          American Black Bear
Ursus thibetanus          Asiatic Black Bear
Tremarctos ornatus        Spectacled Bear
Ailuropoda melanoleuca    Giant Panda (given to you)
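
If you would rather paste a ready-made search term into the NCBI Nucleotide search box than type each one, the sketch below prints one candidate query per species. The helper name build_query is made up, and the choice of cytb as the gene name is an assumption: records may label the gene differently (for example "cyt b" or "CYTB"), so treat these strings only as a starting point.

```shell
# Species from the table above, one per line (the panda is already provided).
species_list='Ursus arctos
Ursus spelaeus
Ursus maritimus
Helarctos malayanus
Melursus ursinus
Ursus americanus
Ursus thibetanus
Tremarctos ornatus'

# Build one search string using the [Organism] and [Gene] field tags.
# "cytb" is an assumed gene label; adjust it if a search returns nothing useful.
build_query() {
  printf '"%s"[Organism] AND cytb[Gene]\n' "$1"
}

printf '%s\n' "$species_list" | while IFS= read -r sp; do
  build_query "$sp"
done
```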

cp KnownBears.fasta Backup.fasta
ls /project/inbre-train/Week3Data

GiantPanda.fasta  UnknownBear1.fasta  UnknownBear2.fasta  UnknownBear3.fasta
grep ">" KnownBears.fasta
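
Listing the DefLines is also a quick way to confirm that nothing is missing. As a small sketch, the function below (count_seqs is a made-up name) counts the records in a FASTA file; it is demonstrated on a two-record toy file with invented sequences rather than on your real data.

```shell
# Build a two-record toy FASTA file (made-up sequences) in a scratch directory.
demo_dir=$(mktemp -d)
printf '>toy_bear_1 example\nATGACC\n>toy_bear_2 example\nATGATC\n' > "$demo_dir/toy.fasta"

# Count records: every FASTA record starts with a line beginning with ">".
count_seqs() {
  grep -c '^>' "$1"
}

count_seqs "$demo_dir/toy.fasta"
```

Run against your own KnownBears.fasta, the count should be at least 9: one sequence for each bear in the table, including the panda.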

2. Sequence Alignment with MAFFT

Today we will use a program called MAFFT (http://mafft.cbrc.jp/alignment/software) to perform our multiple sequence alignments. Whenever you are new to a command line program, it is a good idea to get a handle on just “how” to go about executing the program properly. Nearly every program on the command line has a “help me” function.

This will look like Greek to you now, but trust us, it becomes second nature in a short period of time. Let’s look at the help page for MAFFT, but first we need to load the module or we will get an error. On Teton, MAFFT also needs the GCC (GNU Compiler Collection) module loaded:

module load swset gcc mafft
mafft --help
mafft --auto --thread 4 KnownBears.fasta > KnownBears.aln

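Where:

--auto              lets MAFFT choose an alignment strategy suited to the size of the data set
--thread 4          runs the alignment on 4 threads
KnownBears.fasta    is the input file of unaligned sequences
> KnownBears.aln    redirects the alignment, which MAFFT writes to standard output, into a new file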


ls
cat KnownBears.aln
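
Besides looking at the alignment by eye, you can check one property that every alignment should have: all rows are the same length once gaps are included. The sketch below is one way to do that with awk. The function name aligned_lengths and the toy alignment are made up for illustration; the sequence lines are deliberately wrapped, since aligned FASTA output is often split across several lines per record.

```shell
# Toy aligned FASTA (made-up sequences), with each sequence wrapped over two lines.
aln_dir=$(mktemp -d)
printf '>seqA\nATG-CA\nTT\n>seqB\nATGTCA\nT-\n' > "$aln_dir/toy.aln"

# Print "name<TAB>length" for each record, summing all wrapped sequence lines.
aligned_lengths() {
  awk '/^>/ { if (name != "") print name "\t" len
              name = substr($1, 2); len = 0; next }
       { len += length($0) }
       END { if (name != "") print name "\t" len }' "$1"
}

aligned_lengths "$aln_dir/toy.aln"

# Number of distinct lengths; anything other than 1 means the rows do not line up.
aligned_lengths "$aln_dir/toy.aln" | cut -f2 | sort -u | wc -l
```

The same function can be pointed at KnownBears.aln in place of the toy file.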

3. Phylogenetic Reconstruction with RAxML

RAxML (Randomized Axelerated Maximum Likelihood) is one of many programs that produce a phylogenetic tree from an alignment. Once again, let’s look at the help page for RAxML after we load the module. This time we only need to type -h, not --help:

module load raxml 

raxmlHPC-PTHREADS-AVX -h
mkdir BearTrees
#!/bin/bash
#SBATCH -J RAxML
#SBATCH -n 1
#SBATCH -t 30:00
#SBATCH --mem=20g
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>
#SBATCH --account=inbre-train

# Each echo command is just there to confirm that your script is running as expected
echo "Starting raxmlHPC run at $(date)"

module load swset gcc raxml

# The next command must be entered on a single line

raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -p 618 -N 50 -x 309 -s KnownBears.aln -n KnownBearTree -w <The_absolute_path_to_the_directory_you_just_Made/BearTrees>  -o GiantPanda

echo "RAxML has finished at $(date)"
sbatch MyFirstRAxML.sh
ls BearTrees
cat BearTrees/RAxML_bestTree.KnownBearTree
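
The tree file is written in Newick format, which is hard to read as raw text. As a rough sketch, the function below pulls just the tip (taxon) names out of a Newick string so you can confirm that every bear made it into the tree. The name list_tips and the toy tree are made up for illustration, and the approach assumes a tree without support values on internal nodes and without special characters in the taxon names.

```shell
# Toy Newick tree with made-up branch lengths.
newick='((BrownBear:0.01,PolarBear:0.02):0.03,GiantPanda:0.10);'

# Split on the structural characters, drop branch lengths, discard empty lines.
list_tips() {
  printf '%s\n' "$1" | tr '(),;' '\n' | sed 's/:.*//' | grep -v '^$'
}

list_tips "$newick"
```

To try it on a real result, pass the contents of the file instead, for example list_tips "$(cat BearTrees/RAxML_bestTree.KnownBearTree)".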

3.1 Task

Now that you have made a tree, let’s place 3 unknown bear samples into the context of your known bear phylogeny. There are 3 samples to examine, and their DefLines tell us where the samples came from. Each of these sequences is in this directory (where you got the panda sequence from):

ls -l /project/inbre-train/Week3Data

-rwxr-xr-x 1 nblouin inbre-train  1171 Oct 17 15:03 GiantPanda.fasta
-rwxr-xr-x 1 nblouin inbre-train  1214 Oct 17 13:44 UnknownBear1.fasta
-rwxr-xr-x 1 nblouin inbre-train  1200 Oct 17 13:45 UnknownBear2.fasta
-rwxr-xr-x 1 nblouin inbre-train  1217 Oct 17 15:41 UnknownBear3.fasta
grep ">" /project/inbre-train/Week3Data/UnknownBear*

UnknownBear1.fasta:>UnknownBear1|Bear Carcass (tissue only) Found Kenai, AK
UnknownBear2.fasta:>UnknownBear2|scat near salmon run Siberia
UnknownBear3.fasta:>UnknownBear3|Blood sample from poacher's encampment, NW India

#Hint: cat all of the unknown sequences (/project/inbre-train/Week3Data/UnknownBear*) and your known sequences (KnownBears.fasta) into a new file called AllBears.fasta
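
Plain cat works as long as every input file ends with a newline. If one does not, the next file's DefLine gets glued onto the end of the previous sequence. The sketch below shows a slightly more defensive variant that uses awk instead of cat. The function name merge_fasta and the toy files are made up for illustration; swap in your real file names when you do the task.

```shell
# Two toy FASTA files (made-up sequences); the first one lacks a final newline.
merge_dir=$(mktemp -d)
printf '>toy_known\nATGACC' > "$merge_dir/known.fasta"
printf '>toy_unknown\nATGATC\n' > "$merge_dir/unknown.fasta"

# "awk 1" prints every input line followed by a newline, so a file with no
# final newline cannot run into the header of the next file.
# Usage: merge_fasta OUTPUT INPUT [INPUT ...]
merge_fasta() {
  out=$1
  shift
  awk 1 "$@" > "$out"
}

merge_fasta "$merge_dir/all.fasta" "$merge_dir/known.fasta" "$merge_dir/unknown.fasta"

# Both records should still be counted as separate sequences.
grep -c '^>' "$merge_dir/all.fasta"
```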

3.2 Saving Your Work

/project/inbre-train/username/History
history > LASTNAME_week3_history.sh