Week 3b. Sequence Alignment & Phylogenetic Reconstruction

November 12, 2019

Molb 4485/5485 – Computers in Biology


Nucleotide and amino acid (protein) sequences are commonly used to reconstruct the phylogenetic history of organisms. Sequences obtained by researchers are deposited in various online databases and are freely available for use by others. The analysis of sequence data frequently begins by obtaining an alignment. Sequences diverge through mutations (insertions and deletions, or indels; point mutations; translocations; etc.), and because indels change sequence length, gaps are added to sequences to maintain positional homology. Sequence data can also be obtained from unrecognized or undescribed organisms to provide information about their identity or their proper placement among groups of organisms whose identity is already known.

Analysis typically includes alignment of sequence information obtained for the same gene region in several organisms to see how homologous sites in the sequences compare.


1. Obtain Sequences for cytochrome b Genes

Use the NCBI website to locate and obtain at least one nucleotide (DNA) sequence of the cytochrome b gene for each of the bears in the table below, and put them all into a single FASTA file on Teton. Your searches through NCBI in week two exposed you to all of the tools needed to quickly complete this portion of this week’s exercise.

1.1 Get onto Teton and prepare to input data

ssh <YourNetID>@teton.uwyo.edu
cd /project/inbre-train/<YourNetID>/


cd molb4485



mkdir <your_last_name>_Phylo    # example: Jones_Phylo

cd Jones_Phylo

vi KnownBears.fasta


>gi|38679207|gb|AY390359.1| Ailuropoda melanoleuca from Daili cytochrome b (cytb) gene, complete cds; mitochondrial

In your NCBI search, PLEASE SET the Algorithm parameter Max Target Sequences to 500 to make sure you get a hit from each bear in the table below. Copy each of the sequences you get into your KnownBears.fasta file.

Latin name   Common name
Ursus arctos    Brown Bear
Ursus spelaeus    Cave Bear
Ursus maritimus    Polar Bear
Helarctos malayanus    Sun Bear
Melursus ursinus    Sloth Bear
Ursus americanus    American Black Bear
Ursus thibetanus    Asiatic Black Bear
Tremarctos ornatus    Spectacled Bear
Ailuropoda melanoleuca    Giant Panda (given to you)
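
Before moving on, it can help to confirm that every bear in the table is represented in your file. The following is a minimal sketch, not part of the original exercise: missing_species is a hypothetical helper name, and it assumes that each definition line contains the Latin name exactly as written in the table (as the NCBI example header above does).

```shell
# Print each requested name that does not appear in any definition
# line (a line starting with ">") of the given FASTA file.
# Usage: missing_species FILE NAME [NAME ...]
missing_species() {
    fasta="$1"
    shift
    for name in "$@"; do
        # First grep keeps only definition lines; second grep looks for
        # the name as a fixed string (-F) without printing anything (-q).
        if ! grep '^>' "$fasta" | grep -qF -- "$name"; then
            echo "$name"
        fi
    done
}
```

For example, missing_species KnownBears.fasta "Ursus arctos" "Ursus spelaeus" "Ursus maritimus" prints nothing when all three are present, and prints the name of any bear that still needs a sequence.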

cp KnownBears.fasta Backup.fasta

ls /project/inbre-train/Week3Data

GiantPanda.fasta  UnknownBear1.fasta  UnknownBear2.fasta  UnknownBear3.fasta

cat /project/inbre-train/Week3Data/GiantPanda.fasta KnownBears.fasta > KnownBears2.fasta

grep ">" KnownBears2.fasta

mv KnownBears2.fasta KnownBears.fasta
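
After combining files, a quick sanity check is to count the sequences and look for records that were added twice (for example, a sequence that was pasted by hand and also concatenated). This is a sketch under the assumption that every record starts with a ">" line; count_seqs and duplicate_deflines are hypothetical helper names.

```shell
# Print the number of sequences in a FASTA file
# (one definition line per sequence).
count_seqs() {
    grep -c '^>' "$1"
}

# Print any definition line that occurs more than once.
duplicate_deflines() {
    grep '^>' "$1" | sort | uniq -d
}
```

Running count_seqs KnownBears.fasta should report at least nine sequences (eight bears plus the giant panda), and duplicate_deflines KnownBears.fasta should print nothing.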

2. Sequence Alignment with MAFFT

Today we will use a program called MAFFT (http://mafft.cbrc.jp/alignment/software) to perform our multiple sequence alignments. Whenever you are new to a command line program, it is a good idea to get a handle on just “how” to go about executing the program properly. Nearly EVERY program on the command line has a help function.

The help output will look like Greek to you now, but trust us, it becomes second nature in a short period of time. Let’s look at the help page for MAFFT, but first we need to load the module or we will get an error. MAFFT also needs the GCC (GNU Compiler Collection) module loaded:

module load swset gcc mafft

mafft --help

mafft --auto --thread 4 KnownBears.fasta > KnownBears.aln


cat KnownBears.aln
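
Because gaps are added to maintain positional homology, every sequence in a finished alignment should have exactly the same length. The sketch below checks this; aln_lengths and aln_is_rectangular are hypothetical helper names, and the code assumes the alignment is in FASTA format (MAFFT's default output) with sequences possibly wrapped over several lines.

```shell
# Print "name<TAB>length" for every record in an aligned FASTA file.
# The name is the first word of the definition line without the ">".
aln_lengths() {
    awk '
        /^>/ { if (seen) print name "\t" len
               seen = 1; name = substr($1, 2); len = 0; next }
             { len += length($0) }      # sum wrapped sequence lines
        END  { if (seen) print name "\t" len }
    ' "$1"
}

# Succeed (exit status 0) only if the file has at least one record
# and all records have the same length, gaps included.
aln_is_rectangular() {
    aln_lengths "$1" | awk -F '\t' '
        NR == 1      { first = $2 }
        $2 != first  { bad = 1 }
        END          { exit (NR == 0 || bad) }
    '
}
```

For example, aln_lengths KnownBears.aln lists each sequence with its aligned length, and aln_is_rectangular KnownBears.aln && echo "alignment OK" prints the message only when the lengths all agree.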

3. Phylogenetic Reconstruction with RAxML

RAxML (Randomized Axelerated Maximum Likelihood) is one of many programs that will produce a phylogenetic tree based on an alignment. Once again, let’s look at what the help page looks like for RAxML after we load the module. This time we only need to type -h, not --help:

module load raxml 
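
The help command itself is the program name followed by -h. As a hedged sketch, the helper below (show_help is a hypothetical name, not part of RAxML or Teton) prints the first lines of a program's help text and reminds you to load the module if the program cannot be found. It assumes the executable is called raxmlHPC-PTHREADS-AVX, the name used in the RAxML command later in this exercise.

```shell
# Print the first 40 lines of a program's help text, or a reminder
# if the program is not on the PATH.
# Usage: show_help PROGRAM [HELP_FLAG]   (HELP_FLAG defaults to -h)
show_help() {
    prog="$1"
    flag="${2:--h}"
    if command -v "$prog" >/dev/null 2>&1; then
        "$prog" "$flag" 2>&1 | head -n 40
    else
        echo "$prog not found; has its module been loaded?" >&2
        return 1
    fi
}
```

For example, show_help raxmlHPC-PTHREADS-AVX shows the start of the RAxML help page, and show_help mafft --help does the same for MAFFT.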

mkdir BearTrees

vi MyFirstRAxML.sh

#!/bin/bash
#SBATCH -n 1
#SBATCH -t 30:00
#SBATCH --mem=20g
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>
#SBATCH --ntasks-per-node=8
#SBATCH --account=inbre-train
#SBATCH --partition=inv-inbre

#each echo command is just there to confirm that your script is running properly
echo "Starting raxmlHPC run at $(date)"

module load swset gcc raxml

#This next bit of code is all on one line

raxmlHPC-PTHREADS-AVX -T 8 -f a -m GTRGAMMA -p 618 -N 50 -x 309 -s KnownBears.aln -n KnownBearTree -w <The_absolute_path_to_the_directory_you_just_Made/BearTrees>  -o GiantPanda

echo "RAxML has finished at $(date)"

sbatch MyFirstRAxML.sh

ls BearTrees

cat BearTrees/RAxML_bestTree.KnownBearTree
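
The tree file is written in Newick format, which is hard to read by eye. A quick way to confirm that every bear (and the outgroup) made it into the tree is to list only the tip names. This is a rough sketch: tree_tips is a hypothetical helper name, and it assumes that tip names contain no parentheses, commas, colons, or semicolons.

```shell
# Print the tip (leaf) names of a Newick tree, one per line.
# A tip name is the text that directly follows "(" or "," and runs
# up to the next ":", ",", ")", "(" or ";".
tree_tips() {
    grep -o '[(,][^(),:;]\{1,\}' "$1" | cut -c 2-
}
```

For example, tree_tips BearTrees/RAxML_bestTree.KnownBearTree lists one name per sequence in your alignment; GiantPanda should be among them.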

3.1 Task

Now that you have made a tree, let’s place 3 unknown bear samples into the context of your known bear phylogeny. There are 3 samples to examine, and their DefLines tell us where the samples came from. Each of these sequences is in this directory (where you got the panda sequence from):

ls -l /project/inbre-train/Week3Data

-rwxr-xr-x 1 nblouin inbre-train  1171 Oct 17 15:03 GiantPanda.fasta
-rwxr-xr-x 1 nblouin inbre-train  1214 Oct 17 13:44 UnknownBear1.fasta
-rwxr-xr-x 1 nblouin inbre-train  1200 Oct 17 13:45 UnknownBear2.fasta
-rwxr-xr-x 1 nblouin inbre-train  1217 Oct 17 15:41 UnknownBear3.fasta

grep ">" /project/inbre-train/Week3Data/UnknownBear*

UnknownBear1.fasta:>UnknownBear1|Bear Carcass (tissue only) Found Kenai, AK
UnknownBear2.fasta:>UnknownBear2|scat near salmon run Siberia
UnknownBear3.fasta:>UnknownBear3|Blood sample from poacher's encampment, NW India

Hint: cat all of the unknown sequences (/project/inbre-train/Week3Data/UnknownBear*) and your known sequences (KnownBears.fasta) into a new file called AllBears.fasta
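
One pitfall when combining FASTA files with cat: if a file does not end with a newline, the next file's ">" line gets glued onto the end of the previous sequence and that record is silently lost. The sketch below guards against this; cat_fasta is a hypothetical helper name, not a standard command.

```shell
# Concatenate FASTA files to standard output, adding a newline after
# any file that lacks one so that records never run together.
cat_fasta() {
    for f in "$@"; do
        cat "$f"
        # tail -c 1 prints the last byte of the file. Command
        # substitution strips a trailing newline, so a non-empty
        # result means the file did not end with a newline.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo
        fi
    done
}
```

For example: cat_fasta /project/inbre-train/Week3Data/UnknownBear*.fasta KnownBears.fasta > AllBears.fasta, followed by grep ">" AllBears.fasta to confirm that every DefLine is on its own line.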

4. Saving Your Work

history > /project/inbre-train/inbre0XX/History/LASTNAME_week3_history.sh

5. Push to GitHub

git status

git add .

git commit -m "Added files related to phylogenetics exercise"

git push