Indexing a Reference Genome (L12)
Class Exercise #2: Indexing a Genomic Reference (FASTA)
For this tutorial, we will be using an Organism-specific biological database called Cryptodb. These types of databases are updated more frequently than broader biological databases like Ensembl, especially for non-model organisms. As a result, they tend to be more comprehensive.
If you are working with a non-model organism, it is important to check whether a dedicated database exists for your species, as it may provide more up-to-date and detailed information.
The steps we will perform include:
- Making a directory called
indexed_genomes_example - Grab the correct FASTA file from the Cryptodb website
- Load the
hisat2program usingmodule load - Build the genome index using
hisat2-build
What is the purpose of building a genome index?
Prior to mapping, most aligners require you to construct and index the genome, so that the aligner can quickly and efficiently retrieve reference sequence information.
Indexing in general is widely used in bioinformatics to improve performance. This step can take a long time depending on how large the genome is (i.e. hours) so it is a good idea to generate a script.
For this example, we will be indexing the genome of the parasite Cryptosporidium parvum, which is very small in size.
Instructions
-
Make a directory called
indexed_genomes_example -
Grab the correct FASTA file from the Cryptodb website and place inside of
indexed_genomes_example.https://cryptodb.org/common/downloads/Current_Release/CparvumIOWA-ATCC/fasta/data/- Which file will you select? Think about the TYPE of aligner we are using.
- Once you make a decision, right-click, select copy-link
Use the
wgetcommandwget path-you-copiedYou should see the following:
Connecting to cryptodb.org (cryptodb.org)|128.192.21.13|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 9275167 (8.8M) [application/x-fasta] Saving to: ‘CryptoDB-67_CparvumIOWA-ATCC_Genome.fasta’ 100%[=================================================================>] 9,275,167 5.93MB/s in 1.5sCheck the file with
headonce the download is complete.>CP044422 | organism=Cryptosporidium_parvum_IOWA-ATCC | version=2022-10-21 | length=920510 | SO=chromosome AAACCCCTAAACCTAAACCTAAACCTAAACCTAAACCCTAAACCTAAACCTAAACCTAAA CCTAAAACCTAAACCTAAAACCTAAACCTAAAACCTAAAACCTAAACCTAAACCTAAACC CCTAAACCTAAACCTAAACCTAAAAACCTAAACCCTAAACCTAAACCTAAAAAACCTAAA CCTAAACCTAAACCTAAACCTAAACCTAAACCTAAACCTAAAAAACCTAAAACCTAAACC TAAACCTAAACCTAAACCTAAAACCTAAAAACCTAAACCTAAAAACCTAAAAAACCTAAA CCTAAAAAACCTAAACCTAAACCTAAAACCTAAAACCTAAACCTAAAACCTAAAACCTAA ACCTAAACTAAACCTAAAAAACCTAAAAACCTAAACCCCTAAACCTAAACCTAAAACCTA AACCTAAACCTAAACCTAAACCTAAAACCTAAACCTAAACCTAAAACCTAAACCTAAACC TAAACCTAAACCTCCTAAAAACCTAAACCTAAACCTAAACCTAAACCTAAAAAACCTAAA -
Load the
hisat2program usingmodule loadmodule load gcc/13.3.0-xp3epyt module load hisat2/2.2.1-x7h4grf -
Build the genome index using the
hisat2-buildprogram.hisat2-buildcan index reference genomes of any size.hisat2-buildOutput:
Usage: hisat2-build [options]* <reference_in> <ht2_index_base> reference_in comma-separated list of files with ref sequences hisat2_index_base write ht2 data to files with this dir/basename Options: -c reference sequences given on cmd line (as <reference_in>)To index a reference genome using hisat2-build, the basic command syntax is:
hisat2-build input-fasta-file basenameinput-fasta-file: The reference genome FASTA file you want to index.basename: The prefix for the output index files. This is the "name" that will be used for all generated index files.
For this case, use
CparvumIOWA_ATCC_v68as the basename
Final Ouput
Small indexes are stored in files with the .ht2 extension. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files is no longer used by HISAT2 once the index is built.
The final output should look similar to below once complete:
CparvumIOWA_ATCC_v68.1.ht2 CparvumIOWA_ATCC_v68.6.ht2
CparvumIOWA_ATCC_v68.2.ht2 CparvumIOWA_ATCC_v68.7.ht2
CparvumIOWA_ATCC_v68.3.ht2 CparvumIOWA_ATCC_v68.8.ht2
CparvumIOWA_ATCC_v68.4.ht2
CparvumIOWA_ATCC_v68.5.ht2
Check that the file sizes are correct:
-rw-r--r-- 1 pdrodrig pi-jdragon 7.0M Feb 19 11:45 CparvumIOWA_ATCC_v68.1.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 2.2M Feb 19 11:45 CparvumIOWA_ATCC_v68.2.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 80 Feb 19 11:45 CparvumIOWA_ATCC_v68.3.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 2.2M Feb 19 11:45 CparvumIOWA_ATCC_v68.4.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 3.9M Feb 19 11:45 CparvumIOWA_ATCC_v68.5.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 2.3M Feb 19 11:45 CparvumIOWA_ATCC_v68.6.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 12 Feb 19 11:45 CparvumIOWA_ATCC_v68.7.ht2
-rw-r--r-- 1 pdrodrig pi-jdragon 8 Feb 19 11:45 CparvumIOWA_ATCC_v68.8.ht2