BLAST TUTORIAL

Molb 4485/5485 – Computers in Biology

Nicolas Blouin and Vikram Chhatre

Wyoming INBRE Bioinformatics Core
Dept. of Molecular Biology
University of Wyoming
nblouin@uwyo.edu
vchhatre@uwyo.edu
http://molb4485.uwyo.online

What is BLAST?

Today you will learn how to use both the GUI (graphical user interface) and command line versions of BLAST. Both are useful for finding sequence similarity among DNA, RNA, and amino acid (protein) sequences. Which version you use depends on your question. The GUI is user friendly and has many tools that will help you perform sequence analysis. The command line version lets you customize your searches and results, and analyze hundreds to millions of sequences at the same time.

1. GUI BLASTn Activity

Use BLASTn to identify 3 sequences and use the information from your search to answer the questions below. To begin, copy this file from MtMoran to your iMac desktop using scp (secure copy).


scp UserID@teton.uwyo.edu:/project/inbre-train/Week2Data/GUIBLASTsequences.fasta ~/Desktop/

Each group (or 4 computers) will look at only 3 sequences and answer the following 7 questions for each of them:

Place your answers in a text file in your project space (/project/inbre-train/<YourNetId>). Call this file YourLastNameGUIBLAST.txt. This means that you will be working on the command line and on your iMac at the same time.

  1. How long is your sequence? (provide sequence ID and length)

  2. Which sequence matches your query the best? What data support this conclusion? List the sequence GenBank ID and state the rationale for your response.

  3. What organism (or group of organisms) is the most likely source of your query sequence?

    1. What is the common name for this organism or group?

    2. What phylum contains this organism?

  4. Estimate the number of sequences with an E value less than 1e-3 in your search.

  5. If possible, give the names of at least three different organisms with significant E values. Record the name of the organism, the common name, and the E value.

  6. Look at the first matching sequence in the table. For that subject sequence, determine the length of the alignment, in nucleotides, and the fraction of nucleotides that match your query sequence. (This is included in the alignment information for that match).

  7. Use the GenBank records for the hits, and their links (e.g. PubMed), to find the possible function of the protein that is specified by your DNA sequence. Describe what is known about the role of this protein in the organism that provided the DNA. Note: not all of you will find information for this.
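For question 1, sequence lengths can also be checked on the command line. The following is a minimal sketch, assuming a standard FASTA layout in which each record starts with a ">" header line. The file sample.fasta and its two records are made up for illustration and are not part of the course data.

```shell
# Made-up two-record FASTA file so the command can be tried anywhere
printf '>seq1 made-up example\nACGTACGT\nACGT\n>seq2 made-up example\nGGCC\n' > sample.fasta

# For each record, print the sequence ID and the total number of residues.
# A header line starts a new record; every other line adds to the length.
awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (id != "") print id "\t" len }' sample.fasta > sample_lengths.tsv

cat sample_lengths.tsv
```

To use it on the exercise data, the made-up file name would be replaced with GUIBLASTsequences.fasta.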

Bonus: Rather than copying the entire fasta file to your desktop from MtMoran, can you describe how else you might extract only those sequences that you need for this exercise?

Double Bonus: Can you translate your explanation above into Linux code? Below is a table of common commands for your reference, many of which we went over last week.

2. Command Line BLASTn Activity

Now you will use BLAST on the command line.

  1. Log in to Teton and clear your history for today’s work
history -c
  1. Copy the sequences for this exercise into your home directory
cp /project/inbre-train/Week2Data/Week2SearchSeqsForBLASTp.faa ./
  1. Now let’s make a shell script to run our analysis. This is just like what we did with the GUI, but here we specify all of the parameters ourselves. Open a file called MyFirstBLAST.sh using the vi text editor and enter the following:

#!/bin/bash
#SBATCH -J BLASTp
#SBATCH -n 1
#SBATCH --cpus-per-task=8
#SBATCH -t 30:00
#SBATCH --mem=20g
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>
#SBATCH --account=inbre-train

# Each echo command is just there to confirm that every step of the script ran
echo "started"

module load swset gcc blast-plus
echo "loaded required modules: Blast"

blastp -db swissprot \
    -evalue 1e-20 \
    -num_threads 8 \
    -outfmt "6 qseqid stitle  sscinames sseqid qcovs evalue pident \
        length mismatch gapopen qstart qend sstart send bitscore" \
    -query Week2SearchSeqsForBLASTp.faa \
    -out MyFirstBLAST.out \
    -max_target_seqs 5 

echo "Completed BLAST query."
  1. Now let’s run the script
sbatch MyFirstBLAST.sh
  1. This should run in a few minutes. Let’s look at the results using head.
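A few ways of taking a first look at a tabular BLAST result are sketched below. So that the commands can be tried anywhere, they run on a small stand-in file, sample_results.out, whose rows and IDs are invented and which shows only the first 4 of the 15 columns requested in the script.

```shell
# Stand-in for MyFirstBLAST.out: tab-separated, invented rows and IDs
printf 'UnknownA\tExample protein 1\tN/A\tgi|111111|sp|P00001.1|EX1_TEST\n'  > sample_results.out
printf 'UnknownA\tExample protein 2\tN/A\tgi|222222|sp|P00002.1|EX2_TEST\n' >> sample_results.out
printf 'UnknownB\tExample protein 3\tN/A\tgi|333333|sp|P00003.1|EX3_TEST\n' >> sample_results.out

# Show the first 2 rows (head prints 10 rows if -n is not given)
head -n 2 sample_results.out

# Count the total number of hit rows
wc -l < sample_results.out

# List each distinct query ID found in column 1
cut -f 1 sample_results.out | sort -u
```

On Teton the same commands would be pointed at MyFirstBLAST.out once the job has finished.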

2.1 Task

Using grep, cut, pipes (“|”) and redirects (“>”), create a file called A_hits.txt that contains only the GenBank ID of each hit for your UnknownA query in your BLAST results.

Hint: the GenBank ID is buried in the 4th column in between two pipes just after “gi”. Talk through this with your neighbors.

Think about this logically:
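One possible way to break the task into steps is sketched here on a made-up table, so it can be tried before working on the real output. The file names sample_blast.out and sample_A_hits.txt and all IDs in them are invented. The sketch assumes the 4th column looks like gi|NUMBER|..., as the hint describes.

```shell
# Made-up BLAST table (tab-separated); every ID below is invented
printf 'UnknownA\tExample protein 1\tN/A\tgi|111111|sp|P00001.1|EX1_TEST\t100\t1e-50\n'  > sample_blast.out
printf 'UnknownA\tExample protein 2\tN/A\tgi|222222|sp|P00002.1|EX2_TEST\t98\t1e-40\n'  >> sample_blast.out
printf 'UnknownB\tExample protein 3\tN/A\tgi|333333|sp|P00003.1|EX3_TEST\t95\t1e-30\n'  >> sample_blast.out

# Step 1: grep keeps only the rows that start with the query name
# Step 2: cut -f 4 keeps only column 4 (tab is cut's default delimiter)
# Step 3: cut -d '|' -f 2 splits that column on "|" and keeps the number after "gi"
# Step 4: > redirects the result into a file
grep '^UnknownA' sample_blast.out | cut -f 4 | cut -d '|' -f 2 > sample_A_hits.txt

cat sample_A_hits.txt
```

Running each step on its own first (grep alone, then grep piped into the first cut, and so on) makes it easy to see what every stage of the pipe contributes.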

2.2 Saving Your Work

Saving on the Teton Remote Server: Make sure you are in the directory you made last week

/project/inbre-train/username/

After you are finished with all exercises, use the history command to direct all your activity to a text file as follows:

history > History/LASTNAME_week2_history.sh

2.3 Retrieve hit sequences from GenBank using Batch Entrez

Now that we have our list of hits, we want to do some downstream analysis (such as building a tree that shows how the sequences are related). To do so, we will use our text file and NCBI’s Batch Entrez program.

  1. Download the text file to your desktop in VirtualBox using scp. For this you will need to know the path to your file (use pwd to do this).

  2. Then go to the Batch Entrez website in your web browser and use your file to retrieve your sequences. Make sure you set the database drop-down to protein, like in the image below, since our GenBank IDs are for protein sequences.
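For step 1 above, the scp command can be assembled from the path that pwd prints on Teton. This is a minimal sketch: the user ID and directory are placeholders to be replaced with your own values, and the host name and ~/Desktop/ destination follow the scp command used in the first activity. The command is only printed, as a dry run, so it can be checked before being run for real.

```shell
# Placeholders (assumptions): swap in your own user ID and the directory
# that pwd printed on Teton
remote_user="UserID"
remote_dir="/project/inbre-train/UserID"
remote_file="A_hits.txt"

# Assemble user@host:path, the form scp expects for a remote file
remote_spec="${remote_user}@teton.uwyo.edu:${remote_dir}/${remote_file}"

# Dry run: print the command instead of executing it
echo "scp ${remote_spec} ~/Desktop/"
```

Once the printed line looks right, it can be copied and run in a terminal on the desktop machine (not on Teton).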