An Introduction to Linux & Commandline

Molb 4485/5485 -- Computers in Biology

Nicolas Blouin and Vikram Chhatre

Wyoming INBRE Bioinformatics Core
Dept. of Molecular Biology
University of Wyoming
nblouin@uwyo.edu
vchhatre@uwyo.edu
http://molb4485.uwyo.online

Table of Contents

  1. The Terminal
  2. Connecting to Remote Server
  3. Where Am I?
  4. Linux Directory Structure
  5. Changing Directories
  6. Making Directories
  7. Making & Editing Files
  8. Moving Directories & Files
  9. Copying Directories
  10. Viewing Directory Contents
  11. Removing Files and Directories
  12. Display File Contents
  13. File & Folder Permissions
  14. Your First Shell Script
  15. Unix Power Commands
  16. Saving Your Work

1. The Terminal

"Terminal" is the common name for the application that gives you text-based access to the operating system of the computer. It lets you type commands into the computer and read the output of the programs you call. Unix-based machines always include a terminal program.

Some brief tips before we move further:

2. Connecting To Remote Server

Often with bioinformatics analysis, you will be performing tasks on a remote network server, one much more powerful than your workstation. Let's use the terminal program you just learned about to connect with this server. We will use Secure SHell (SSH) protocol to talk to the remote server. In the following example, replace username with the user name provided to you.

Two-Factor Authentication

To improve computer security, the University of Wyoming requires two-factor authentication to grant access to network servers. Two-factor means a password plus a second type of authentication. You all have YubiKeys (USB devices that resemble a flash drive), which generate the token needed for this.

    $ ssh username@teton.uwyo.edu
                         TWO-FACTOR AUTHENTICATION
=============================================================================
This system requires two-factor authentication.

The password requirement is your UWYO domain password.

The token can be generated by your registered YubiKey or manually input with
the Duo mobile app. If you have questions about using this implementation of
two-factor authentication, contact the ARCC team at arcc-help@uwyo.edu

Please enter the two-factor password the in the form:

                            <password>,<token>

=============================================================================
    $ wyoinbre,<Press YubiKey Gold Button>
    [username@tlog1 ~]$

3. Where Am I (on the filesystem)?

As you will learn, everything in Linux is relative to where you are in the file system, so knowing where you are before launching a command is valuable. Luckily, there are built-in commands that report this information. Understanding the location of files will be a key part of success.
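One such built-in command is pwd (print working directory). A minimal sketch, assuming nothing beyond a standard shell; the variable name here is chosen only for illustration:

```shell
# pwd prints the absolute path of the directory you are currently in.
pwd

# The same path can be captured in a variable for later use.
here=$(pwd)
echo "You are in: $here"
```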

4. Linux Directory Structure

Linux files are arranged in a hierarchical structure, or directory tree. From the root directory (/) there are many subdirectories, and each subdirectory can contain files or further subdirectories. Whenever you're using the terminal you will always be 'in' a directory. By default, opening a terminal window or logging into a remote computer places you in your 'home' directory. This was the case when we logged into Teton. The home directory contains files and directories that only you can modify; we will get to those permissions later.

To see what files, or directories, you have in your home directory we will use the ls command.

5. Changing Directories

To move between directories (folders) we use the cd (change directory) command. We are currently in our home directory. Let's move to /project/inbre-train/username/LearnLinux. The cd command uses the following syntax:

    $ cd DIRECTORY
    $ cd /project/inbre-train/username/LearnLinux
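The difference between absolute and relative paths can be sketched as follows. The scratch directory tree is hypothetical, created only so the example runs anywhere; it is not part of the course data.

```shell
# Hypothetical scratch tree, created only for this illustration.
demo=$(mktemp -d)
mkdir -p "$demo/LearnLinux/Work"

cd "$demo/LearnLinux/Work"   # absolute path: works from any location
pwd
cd ..                        # relative path: .. means the parent directory
up=$(pwd)
echo "$up"
cd Work                      # relative path: a subdirectory of where you are now
down=$(pwd)
echo "$down"
```

Running cd with no argument takes you back to your home directory.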

6. Making Directories

Creating directories in Unix is done with the mkdir (make directory) command.

    $ mkdir DirectoryName

Using spaces when naming directories, like on your desktop, is not advised in the Unix file system. This is why you see the use of _ in place of spaces. You can escape a space in Unix but it creates unnecessary typing and can create issues executing certain programs. Generally, using spaces in file and directory names is something to avoid.
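A short sketch of the naming advice above. The directory names are made up for illustration, and the scratch location is an assumption so that nothing in your project directory is touched.

```shell
demo=$(mktemp -d)   # hypothetical scratch location for this illustration

mkdir "$demo/My_Project"                # underscore instead of a space
mkdir -p "$demo/My_Project/Raw/Run_1"   # -p also creates any missing parent directories

# A name containing a space must be quoted (or escaped) every time it is used.
mkdir "$demo/My Project"
ls "$demo"
```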

7. Making and Editing Files

In this section you will learn the basics of making files and putting things into those files. There are a variety of ways we can accomplish this as Unix has built in multiple editors for these tasks. We will review a few here.

    $ touch FILENAME

This will create a new, empty file.

    $ vi FILENAME 

vi is a built-in text editor that allows us to put information into a file.
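Because vi is interactive, only touch is sketched here. The file name and scratch directory are hypothetical.

```shell
demo=$(mktemp -d)          # hypothetical scratch location

touch "$demo/notes.txt"    # creates a new, empty file
ls -l "$demo/notes.txt"    # the size column shows 0

touch "$demo/notes.txt"    # file already exists: only its timestamp is updated
```

Inside vi, press i to start typing text, press Esc when you are done, then type :wq and press Enter to save and quit.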

8. Moving Directories and Files

To move a file or directory, the mv (move) command is used. This is the first command we have used that requires two arguments: you need to specify both the source and the destination of the move.

    $ mv SOURCE DESTINATION
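What mv does depends on the destination. A minimal sketch, using hypothetical file and directory names in a scratch location:

```shell
demo=$(mktemp -d)          # hypothetical scratch location
mkdir "$demo/Work"
touch "$demo/draft.txt"

mv "$demo/draft.txt" "$demo/final.txt"   # destination is a new name: the file is renamed
mv "$demo/final.txt" "$demo/Work"        # destination is an existing directory: the file moves into it
ls "$demo/Work"
```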

9. Copying Directories

To copy a file or directory, the cp (copy) command is used. Just like mv, you will need a source and a destination to copy something.

    $ cp SOURCE DESTINATION
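Copying a directory needs one extra flag, -r (recursive), that copying a single file does not. A sketch with hypothetical names in a scratch location:

```shell
demo=$(mktemp -d)          # hypothetical scratch location
mkdir "$demo/Data"
touch "$demo/Data/seq.txt"

cp "$demo/Data/seq.txt" "$demo/Data/seq_backup.txt"   # copy a single file
cp -r "$demo/Data" "$demo/Data_copy"                  # -r copies a directory and its contents
ls "$demo/Data_copy"
```

Unlike mv, the source is left in place after the copy.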

10. Viewing Directory Contents

To view the contents of directories we use the ls (list segments) command.

    $ ls DIRECTORY

If no directory is provided ls will list the contents of the current directory.
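Two ls flags that you are likely to use often are -a and -l. The sketch below assumes a scratch directory with two made-up files, one of them hidden:

```shell
demo=$(mktemp -d)                       # hypothetical scratch location
touch "$demo/visible.txt" "$demo/.hidden"

ls "$demo"       # names only; files whose names start with . are not listed
ls -a "$demo"    # -a also lists hidden entries, including . and ..
ls -l "$demo"    # -l long format: permissions, owner, size, and date
```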

11. Removing Files and Directories

Caution: This is a dangerous command. File/Folder deletion in Unix is permanent and irreversible.

If you run ls on your LearnLinux/Work/ directory, it is likely full of lots of empty files and directories by this point. Wouldn't it be nice if there were a way to clean that up? Of course there is a way, however it can be dangerous. To delete directories and files from the system we have two options: the rm (remove) and rm -r commands.

    $ rm FILE

One more time, just to be clear: it is possible to delete EVERY file you have ever created with the rm command. With the -i flag, rm will ask for confirmation before deleting anything.
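A cautious way to practice is inside a throwaway directory. The names below are hypothetical; the interactive form is shown only as a comment because it waits for a y/n answer.

```shell
demo=$(mktemp -d)                      # hypothetical scratch location
mkdir -p "$demo/Work/Old_Results"
touch "$demo/Work/empty1.txt" "$demo/Work/Old_Results/a.txt"

ls "$demo/Work"                    # look before you delete
rm "$demo/Work/empty1.txt"         # remove a single file
rm -r "$demo/Work/Old_Results"     # -r removes a directory and everything inside it

# Interactive form (asks before each deletion):
#   rm -i FILE
```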

12. Display File Contents

There are various commands available to display/print the contents of a file. The default of all these commands is to display the contents of the file on the terminal. These commands are less, cat, head, and tail.

    $ less FILENAME 

Displays file contents on the screen with line scrolling (to scroll you can use 'arrow' keys, 'PgUp/PgDn' keys, 'space bar' or 'Enter' key). Press 'q' to exit.

    $ cat FILENAME

The simplest way to display contents. cat (concatenate) prints the entire contents of the file to the screen. For large files, the entire file will scroll past without pausing.

    $ head FILENAME

Displays only the 10 starting lines of a file by default. Any number of lines can be displayed with the -n flag followed by the number of lines.

    $ tail FILENAME 

As the name implies, tail is the opposite of head: it displays the last 10 lines. Again, the -n option can be used to change this.
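The -n flag works the same way for both commands. A small sketch on a made-up five-line file:

```shell
demo=$(mktemp -d)   # hypothetical scratch location
printf '%s\n' line1 line2 line3 line4 line5 > "$demo/five.txt"

head -n 2 "$demo/five.txt"    # prints line1 and line2
tail -n 2 "$demo/five.txt"    # prints line4 and line5
```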

13. File & Folder Permissions

All files in any operating system have a set of permissions associated with the file that define what can be done with the file and by whom. What = read, write (modify), and/or execute a file. Whom = user, group, or public. These permissions are denoted with the following syntax:

Permissions
Read: r
Write: w
Execute: x

Relations
User: u
Group: g
All: a

Changing permissions is done via chmod (change mode) command

    $ chmod [Options] RELATIONS [+ or -] PERMISSIONS FILE
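The syntax above reads as "who, add or remove, what". A minimal sketch on a hypothetical empty file; comparing the two ls -l lines shows which permission letters changed.

```shell
demo=$(mktemp -d)              # hypothetical scratch location
touch "$demo/script.sh"

ls -l "$demo/script.sh"        # permissions before
chmod u+x "$demo/script.sh"    # user gains execute permission
chmod g-w "$demo/script.sh"    # group loses write permission
chmod a+r "$demo/script.sh"    # all (user, group, and others) gain read permission
ls -l "$demo/script.sh"        # permissions after
```

chmod also accepts o (others), which corresponds to the "public" category mentioned above.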

14. Your First Shell Script

Just like Perl, Python, R, C++, etc., Bash (Bourne Again SHell) is a programming language that works on Unix and Unix-like computers (Linux, Macintosh, BSD, etc.). All the commands that you have been passing to the terminal are in fact being executed by bash, the command shell. A shell script is simply a collection of bash commands that are executed sequentially. To make a script, we simply write shell commands into a file and then treat that file like any other program or command.

    # This is my first shell script.
    echo "Hello World!"

In vi, press Esc, then type :wq and press Enter to save the file and quit.
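One way to run such a script is sketched below. The file name hello.sh is hypothetical, the script is written with a here-document only so the example is self-contained (in class you would type it in vi), and the first line assumes bash is installed at /bin/bash.

```shell
demo=$(mktemp -d)   # hypothetical scratch location

# Write the script without opening an editor.
cat > "$demo/hello.sh" <<'EOF'
#!/bin/bash
# This is my first shell script.
echo "Hello World!"
EOF

bash "$demo/hello.sh"         # run it by handing the file to bash
chmod u+x "$demo/hello.sh"    # or make it executable ...
"$demo/hello.sh"              # ... and run it like any other program
```

The first line of the script (#!/bin/bash) tells the system which interpreter to use when the file is run directly.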

15. Unix Power Commands

The commands that you have learned so far are essential for doing any work in Unix, but they don’t really let you do anything that is very useful. The following section will introduce new commands that will start to show you the power of Unix.

15.1 Pipes & Redirects

Everything we have done so far has sent the result of the command to the screen. This is feasible when the data being displayed is small enough to fit the screen or if it is the endpoint of your analysis. But for large data outputs, or if you need a new file, printing to the screen isn't very useful. Unix has built-in methods to handle the output of commands using the > (greater than), < (less than), and >> signs.

    # Creates a new file (file2) with same contents as old file (file1) 
    $ cat FILE1 > FILE2 

    
    # Appends the contents for file1 to file2, equivalent to opening file1, 
    # copying all the contents, pasting the copied contents to the end of 
    # the file2 and saving it! 
    $ cat FILE1 >> FILE2 

    $ cat FILE1 | less 

Here, the cat command reads the contents of FILE1, but instead of sending them to the screen (the default standard output), it sends them through the pipe to the next command, less, so that the contents of the file are displayed on the screen with line scrolling.
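The difference between >, >>, <, and | is easiest to see on a tiny file. This is a sketch with made-up file contents; wc -l (covered later in this handout) is used only to count lines.

```shell
demo=$(mktemp -d)   # hypothetical scratch location
printf 'a\nb\n' > "$demo/file1"       # > creates file1 (or overwrites it if it exists)

cat "$demo/file1" >  "$demo/file2"    # file2 now has 2 lines
cat "$demo/file1" >> "$demo/file2"    # >> appends: file2 now has 4 lines

wc -l < "$demo/file2"                 # < feeds file2 to wc as its input
cat "$demo/file2" | wc -l             # | feeds the output of cat to wc
```

Running the > line again would reset file2 to 2 lines, which is why > should be used with care on files you want to keep.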

From the LearnLinux/Data/ directory

    $ cat seq.fasta

    $ head seq.fasta > new.txt

    $ cat new.txt

    $ tail seq.fasta > new.txt

    $ cat new.txt

Now let's try that with the append option.

    $ head -n 1 seq.fasta > new.txt

    $ tail -n 1 seq.fasta >> new.txt

15.2 Grep

grep (globally search a regular expression and print) is one of the most useful commands in Unix. It is commonly used to filter a file or other input, line by line, against a pattern.

    $ grep [OPTIONS] PATTERN FILENAME 

Like any other command, grep has various options available (see man grep). The most useful options include:

Some typical scenarios to use grep

You might already know that each header line in a FASTA file must start with a > character, followed by the DNA or protein sequence on subsequent lines. To find only those header lines in a FASTA file, we can use grep.
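A minimal sketch of the header search, using a made-up two-record FASTA file rather than the course data:

```shell
demo=$(mktemp -d)   # hypothetical scratch location
# Hypothetical two-record FASTA file, invented for this illustration.
printf '>seq1 sample\nATGCATGC\n>seq2 sample\nGGCCTTAA\n' > "$demo/tiny.fasta"

# Quote the pattern: an unquoted > would be read by the shell as a redirect.
grep '^>' "$demo/tiny.fasta"       # print the header lines (^ anchors the match to the line start)
grep -c '^>' "$demo/tiny.fasta"    # -c counts matching lines, i.e. the number of sequences
grep -v '^>' "$demo/tiny.fasta"    # -v inverts the match: sequence lines only
```

Forgetting the quotes, as in grep > FILE, would overwrite FILE instead of searching it.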


15.3 Word Count

wc (word count) is a useful command in bioinformatics because it can quickly identify how many lines or words are in a file.

    $ wc FILENAME
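Individual counts can be requested with flags. A sketch on a made-up two-line file:

```shell
demo=$(mktemp -d)   # hypothetical scratch location
printf 'one two\nthree\n' > "$demo/words.txt"

wc "$demo/words.txt"         # lines, words, and bytes, followed by the file name
wc -l < "$demo/words.txt"    # lines only: 2
wc -w < "$demo/words.txt"    # words only: 3
```

Reading the file through < keeps the file name out of the output, which is convenient when the number is used in a script.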


15.4 Sorting Files

The sort command can be used to arrange the lines of a file. The simplest way to use this command is:

    $ sort FILE1 > SORTED_FILE1
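By default sort compares lines character by character, which gives surprising results for numbers. A short sketch on a made-up file of three numbers:

```shell
demo=$(mktemp -d)   # hypothetical scratch location
printf '10\n9\n100\n' > "$demo/numbers.txt"

sort "$demo/numbers.txt"        # character by character: 10, 100, 9
sort -n "$demo/numbers.txt"     # -n compares as numbers: 9, 10, 100
sort -nr "$demo/numbers.txt"    # -r reverses the order: 100, 10, 9
```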

sort has these commonly used flags:

TASK

The LearnLinux/Data/Sequences directory consists of numerically labeled files. Unix can sort either alphabetically or numerically (not both) and hence they are arranged in Seq1.fa, Seq10.fa, Seq11.fa etc. In order to sort them in an easy to read way, try using:

    $ ls | sort -t 'q' -k 2n

This command lists all the files in the Sequences/ directory and passes the list to the sort command. sort splits each file name into fields at the letter 'q' (-t 'q') and then sorts numerically on the second field (-k 2n), which begins with the number that follows 'Seq'.
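The same sort can be tried without the course data. The four file names below are hypothetical, written in the style of the Sequences/ directory and fed to sort with printf instead of ls:

```shell
# Hypothetical file names, one per line, passed through the same sort command.
printf '%s\n' Seq1.fa Seq10.fa Seq11.fa Seq2.fa | sort -t 'q' -k 2n
# prints Seq1.fa, Seq2.fa, Seq10.fa, Seq11.fa (one per line)
```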

Try using sort on Data/Arabidopsis/At_genes.gff

    $ sort -r -k 1 At_genes.gff 
    $ sort -r -k 4 At_genes.gff


15.5 Uniq

The uniq (unique) command removes duplicate lines from a sorted file, retaining only one instance of each run of matching lines. Optionally, it can show only lines that appear exactly once, or only lines that appear more than once. uniq requires sorted input because it compares only consecutive lines. As a result, you will nearly always use uniq after a sort and a pipe, e.g., ($ sort <some file> | uniq)

    $ uniq [OPTIONS] INFILE OUTFILE
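The behaviors described above can be sketched on a made-up list of gene names. The flags shown (-c, -d, -u) are standard uniq options; the file itself is hypothetical.

```shell
demo=$(mktemp -d)   # hypothetical scratch location
printf '%s\n' geneB geneA geneB geneC geneA geneB > "$demo/hits.txt"

sort "$demo/hits.txt" | uniq       # one copy of each line: geneA, geneB, geneC
sort "$demo/hits.txt" | uniq -c    # -c prefixes each line with the number of times it occurred
sort "$demo/hits.txt" | uniq -d    # -d shows only lines that are repeated: geneA, geneB
sort "$demo/hits.txt" | uniq -u    # -u shows only lines that occur exactly once: geneC
```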

Useful options include:

TASK

From Data/


15.6 Dividing files by Columns

cut extracts entire columns of data from files. By default, it assumes that columns are tab delimited, but this is not always the case. If your data file contains columns (called 'fields' here) that are separated by other delimiters e.g. space or comma, then you will need to tell cut about it.

The following example assumes that the fields are separated by tabs. This will print the first column from the input file to the screen.

    $ cut -f1 FILE

Here is an example of a .csv file (comma separated values). The following command will display columns 2 through 4 from this file.

    $ cut -d ',' -f2-4 FILE

Another example where the delimiter is a pipe (|) and the command will display 1st and 9th column.

    $ cut -d '|' -f1,9 FILE
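To see cut in action without the course data, here is a sketch on two tiny made-up tables, one tab-delimited and one comma-delimited. The file names and column names are hypothetical.

```shell
demo=$(mktemp -d)   # hypothetical scratch location
printf 'id\tlength\tgc\ns1\t120\t0.41\n' > "$demo/table.tsv"
printf 'id,length,gc,source\ns1,120,0.41,lab\n' > "$demo/table.csv"

cut -f1 "$demo/table.tsv"             # tab is the default delimiter: prints id, s1
cut -d ',' -f2-4 "$demo/table.csv"    # columns 2 through 4 of the CSV
cut -d ',' -f1,3 "$demo/table.csv"    # columns 1 and 3 only
```

Note that the flags use a plain hyphen (-); a dash pasted from a word processor will not be recognized by the shell.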

TASK



16. Saving Your Work

i. Saving on the Teton Remote Server

    $ history > History/netid_week1_history.sh