Indexing a Reference Genome (L12)
Class Exercise #2: Indexing a Genomic Reference (FASTA)
For this tutorial, we will use an organism-specific biological database called CryptoDB. Databases of this kind are often updated more frequently than broader biological databases like Ensembl, especially for non-model organisms. As a result, they tend to be more comprehensive for the species they cover.
If you are working with a non-model organism, it is important to check whether a dedicated database exists for your species, as it may provide more up-to-date and detailed information.
The steps we will perform include:
- Make a directory called indexed_genomes_example
- Grab the correct FASTA file from the CryptoDB website
- Load the hisat2 program using module load
- Build the genome index using hisat2-build
What is the purpose of building a genome index?
Prior to mapping, most aligners require you to build an index of the reference genome so that the aligner can quickly and efficiently retrieve reference sequence information.
Indexing in general is widely used in bioinformatics to improve performance. This step can take a long time for large genomes (on the order of hours), so it is a good idea to put the command in a script.
For this example, we will be indexing the genome of the parasite Cryptosporidium parvum, which is very small in size.
Instructions
- Make a directory called indexed_genomes_example
- Grab the correct FASTA file from the CryptoDB website and place it inside indexed_genomes_example:
  https://cryptodb.org/common/downloads/Current_Release/CparvumIOWA-ATCC/fasta/data/
  - Which file will you select? Think about the TYPE of aligner we are using.
  - Once you make a decision, right-click the file and select copy-link.
  Use the wget command:
  wget path-you-copied
You should see the following:
Connecting to cryptodb.org (cryptodb.org)|128.192.21.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9275167 (8.8M) [application/x-fasta]
Saving to: ‘CryptoDB-67_CparvumIOWA-ATCC_Genome.fasta’
100%[=================================================================>] 9,275,167 5.93MB/s in 1.5s
Check the file with head once the download is complete:
>CP044422 | organism=Cryptosporidium_parvum_IOWA-ATCC | version=2022-10-21 | length=920510 | SO=chromosome
AAACCCCTAAACCTAAACCTAAACCTAAACCTAAACCCTAAACCTAAACCTAAACCTAAA
CCTAAAACCTAAACCTAAAACCTAAACCTAAAACCTAAAACCTAAACCTAAACCTAAACC
CCTAAACCTAAACCTAAACCTAAAAACCTAAACCCTAAACCTAAACCTAAAAAACCTAAA
CCTAAACCTAAACCTAAACCTAAACCTAAACCTAAACCTAAAAAACCTAAAACCTAAACC
TAAACCTAAACCTAAACCTAAAACCTAAAAACCTAAACCTAAAAACCTAAAAAACCTAAA
CCTAAAAAACCTAAACCTAAACCTAAAACCTAAAACCTAAACCTAAAACCTAAAACCTAA
ACCTAAACTAAACCTAAAAAACCTAAAAACCTAAACCCCTAAACCTAAACCTAAAACCTA
AACCTAAACCTAAACCTAAACCTAAAACCTAAACCTAAACCTAAAACCTAAACCTAAACC
TAAACCTAAACCTCCTAAAAACCTAAACCTAAACCTAAACCTAAACCTAAAAAACCTAAA
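Beyond looking at the first lines, a couple of quick checks can help confirm that the download is a complete FASTA file. The sketch below counts the header lines and sums the sequence lengths. The function names count_records and total_bases are illustrative helpers written for this example, not part of any tool, and the FASTA file name is the one shown in the wget output.

```shell
# Count sequences: every FASTA record starts with a ">" header line.
count_records() {
    grep -c '^>' "$1"
}

# Total number of bases: sum the length of every non-header line.
total_bases() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# File name taken from the wget output; adjust if your download differs.
fasta="CryptoDB-67_CparvumIOWA-ATCC_Genome.fasta"
if [ -f "$fasta" ]; then
    echo "records: $(count_records "$fasta")"
    echo "bases:   $(total_bases "$fasta")"
fi
```

The base count should come out somewhat below the 9,275,167-byte file size reported by wget, since every line also carries a newline character and the headers are not counted.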
- Load the hisat2 program using module load:
  module load gcc/13.3.0-xp3epyt
  module load hisat2/2.2.1-x7h4grf
- Build the genome index using the hisat2-build program. hisat2-build can index reference genomes of any size. Run it without arguments to see its usage:
  hisat2-build
  Output:
  Usage: hisat2-build [options]* <reference_in> <ht2_index_base>
      reference_in            comma-separated list of files with ref sequences
      hisat2_index_base       write ht2 data to files with this dir/basename
  Options:
      -c                      reference sequences given on cmd line (as <reference_in>)
To index a reference genome using hisat2-build, the basic command syntax is:
hisat2-build input-fasta-file basename
- input-fasta-file: The reference genome FASTA file you want to index.
- basename: The prefix for the output index files. This is the "name" that will be used for all generated index files.
For this case, use CparvumIOWA_ATCC_v68 as the basename.
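Putting the syntax and the basename together, the call might look like the sketch below, written as a small script since indexing larger genomes is best run that way. It assumes the FASTA file is the one named in the wget output and sits in the current directory. The -p option sets the number of threads for hisat2-build, and the value 4 is an arbitrary choice. The build_index function is a hypothetical wrapper named for this example.

```shell
#!/usr/bin/env bash
# Sketch: index the C. parvum genome with hisat2-build.
# Assumptions: FASTA name taken from the wget output; 4 threads is arbitrary.

fasta="CryptoDB-67_CparvumIOWA-ATCC_Genome.fasta"
index_base="CparvumIOWA_ATCC_v68"
threads=4

build_index() {
    # $1 = reference FASTA, $2 = index basename, $3 = number of threads
    if [ ! -s "$1" ]; then
        echo "FASTA file not found or empty: $1" >&2
        return 1
    fi
    hisat2-build -p "$3" "$1" "$2"
}

# Only run when hisat2-build is on the PATH (i.e. the module is loaded).
if command -v hisat2-build >/dev/null 2>&1; then
    build_index "$fasta" "$index_base" "$threads"
else
    echo "hisat2-build not found; load the hisat2 module first" >&2
fi
```

Checking for the input file first gives a clearer error than letting the build fail partway, and the PATH check catches the common case of a forgotten module load.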
Final Output
Small indexes are stored in files with the .ht2 extension. These files together constitute the index: they are all that is needed to align reads to that reference. The original FASTA file is no longer used by HISAT2 once the index is built.
The final output should look similar to below once complete:
CparvumIOWA_ATCC_v68.1.ht2 CparvumIOWA_ATCC_v68.6.ht2
CparvumIOWA_ATCC_v68.2.ht2 CparvumIOWA_ATCC_v68.7.ht2
CparvumIOWA_ATCC_v68.3.ht2 CparvumIOWA_ATCC_v68.8.ht2
CparvumIOWA_ATCC_v68.4.ht2
CparvumIOWA_ATCC_v68.5.ht2
Check that the file sizes are correct:
-rw-r--r-- 1 pdrodrig pi-jdragon 7.0M Feb 19 11:45 CparvumIOWA_ATCC_v68.1.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 2.2M Feb 19 11:45 CparvumIOWA_ATCC_v68.2.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 80 Feb 19 11:45 CparvumIOWA_ATCC_v68.3.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 2.2M Feb 19 11:45 CparvumIOWA_ATCC_v68.4.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 3.9M Feb 19 11:45 CparvumIOWA_ATCC_v68.5.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 2.3M Feb 19 11:45 CparvumIOWA_ATCC_v68.6.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 12 Feb 19 11:45 CparvumIOWA_ATCC_v68.7.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 8 Feb 19 11:45 CparvumIOWA_ATCC_v68.8.ht2
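As a complement to reading the listing by eye, the check can be scripted. The sketch below assumes a small index made of eight files numbered 1 to 8, as in the listing above, and only tests that each file exists and is non-empty. It does not compare exact sizes, which will differ for other genomes. The function name check_ht2_index is an illustrative helper, not a HISAT2 command.

```shell
# Return 0 only if all eight <basename>.N.ht2 files exist and are non-empty.
check_ht2_index() {
    base="$1"
    status=0
    for i in 1 2 3 4 5 6 7 8; do
        f="${base}.${i}.ht2"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

if check_ht2_index "CparvumIOWA_ATCC_v68"; then
    echo "index looks complete"
fi
```

Run it from inside indexed_genomes_example; any file it reports as missing or empty suggests the build was interrupted and should be rerun.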