FASTA and FASTQ: A Guide to Key File Formats for Sequencing Data SEQanswers (2024)

Performing successful sequencing analysis requires an understanding of different file formats and how they are used for various applications. Scientists interested in completing their own sequencing analysis should learn the purpose and contents of each format. In the next two articles, we will explore some of the essential file formats used in sequencing data analysis and their significance in the field.

FASTA

The FASTA file format is one of the most popular formats for storing biological sequence data. These text-based files can be used for storing strings of amino acids (peptides) or nucleotide sequences (DNA or RNA). They are routinely used for sequence annotation, database searches, and multiple sequence alignment.

The FASTA format was originally created with the development of the FASTP program [1], a platform used for searching amino acid sequence databases. “When the FASTP program was written in the fall of 1983, there were no publicly available protein sequence databases, so there was no standard format for protein sequences,” explained FASTA creator Dr. William R. Pearson, Professor of Biochemistry and Molecular Genetics in the School of Medicine at the University of Virginia. “There were two standard formats for DNA sequence databases, the Genbank format and the EMBL format. Both formats were developed by database people, and had field labels in specific columns with multiple field types, which included a lot more information than simply the sequence; those formats are still in use today.”

During the FASTP program’s development, Pearson and his colleagues collaborated with Margaret Dayhoff’s group from the Protein Identification Resource (PIR) at Georgetown University. Her group had a relatively simple but important format for protein sequence databases, and that original format looked like the image in Figure 1.

Figure 1: Original format for storing protein sequence information developed by the PIR (provided by Dr. Pearson)

“We received a protein sequence database from the PIR group in this format, so it was the first format the FASTP program could read,” explained Pearson. “However, molecular biologists using this format often forgot to include the description line, which meant that the first line of the sequence was lost (because it was read as the description).” This ultimately led to the simpler and more common file bioinformaticians use today. “The FASTA format was invented by putting both the accession information (HAHU) and the description on the line starting with the ‘>’ (greater-than sign),” Pearson explained about the new file example, shown in Figure 2 below.

Figure 2: An example of the updated and current FASTA layout (provided by Dr. Pearson)

The file’s new format was rapidly adopted for a number of reasons. “This was very easy for biologists to remember, and, because there were no fixed location fields, it was easy to type in sequences correctly,” said Pearson. When asked if there were any advantages of storing data in this file type, Pearson stated succinctly, “The advantage was simplicity: A line starting with ‘>’ for a description (and to indicate the beginning of a new sequence in a file/database with multiple sequences), everything else is a sequence to be analyzed.”
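The rule Pearson describes (a line starting with ‘>’ opens a new record, and everything else is sequence) can be illustrated with a minimal sketch in Python. This is not the FASTP or FASTA program's own reader; the function name `parse_fasta` and the example records are hypothetical, and the sketch assumes that blank lines, and any sequence lines appearing before the first ‘>’, can be ignored.

```python
def parse_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A '>' line closes the previous record (if any) and opens a new one.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Every other line is sequence; it may be wrapped over many lines.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-record file, used for illustration only.
example = """>seq1 hypothetical example peptide
MVLSPADKTN
VKAAWGKVGA
>seq2 hypothetical example nucleotide
ACGTACGT
""".splitlines()

records = list(parse_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```

Because the sequence lines are simply concatenated, the sketch does not depend on how long each line is or how the sequence is wrapped.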

The FASTA format has changed very little since its initial development in 1983. “Since then, different groups have used the information in the description line in different ways, but there were no constraints on either the length of the description line or the length of the sequence line,” added Pearson. “This was another feature that made the format easy to use and easy to incorporate into analysis workflows.”

Despite the growing number of file types used for sequencing analysis and sequence storage, the FASTA format is still widely used. As Pearson explained, “Almost all other bioinformatics file formats involve some kind of field-based format, which in general can be much more powerful and easier to compute on. But the FASTA format allowed biologists to easily enter (and examine) sequence data to create their own sequence sets. It is very information-dense, and is well suited to similarity searching, the purpose it was designed for.”

FASTA facts:

  • FASTA uses standard IUB/IUPAC amino acid and nucleic acid codes
  • Some of the common file extensions are: “.fasta”, “.fa”, “.ffn”, “.frn”, “.fna”, and “.faa”
  • Pearson clarified that “FASTA” is pronounced “FAST-long-A”, not “FAST-Ah”
  • In-depth details about FASTA organization can be found at: https://blast.ncbi.nlm.nih.gov/doc/blast-topics/
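As a small illustration of the first point, the sketch below checks a nucleotide sequence against the IUPAC nucleic acid codes (the four bases, U for RNA, and the ambiguity codes such as R, Y, and N). The function name `invalid_nucleotide_positions` is hypothetical, and the choice to reject gap characters such as ‘-’ is an assumption of this sketch rather than a rule of the format.

```python
# IUPAC nucleic acid codes: A, C, G, T, U plus the ambiguity codes.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def invalid_nucleotide_positions(sequence):
    """Return (index, character) pairs for characters outside the IUPAC set.

    Lowercase letters are accepted; gaps ('-') are treated as invalid here,
    which is an assumption of this sketch.
    """
    return [
        (index, char)
        for index, char in enumerate(sequence)
        if char.upper() not in IUPAC_NUCLEOTIDES
    ]


print(invalid_nucleotide_positions("ACGTNRYacgt"))
print(invalid_nucleotide_positions("ACGXT*"))
```

A similar check for protein sequences would use the amino acid code set instead.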

From FASTA to FASTQ

Derived from FASTA, FASTQ is a similar text-based format for storing sequence information. FASTQ files, however, also carry details related to the sequencing run from which they originated. The main difference between the two formats is that FASTQ holds raw sequencing information, specifically the quality scores associated with the base calls.

The FASTQ format was created by Dr. Jim Mullikin during his time at the Wellcome Trust Sanger Institute [2], although its widespread use and an official publication on the format didn’t occur until years later. Initially designed for Sanger capillary sequencing, the FASTQ format was adapted for use with next-generation sequencing. Several other variations of FASTQ were created for specific technologies, but now the format has become fairly consistent across platforms.

It is important to understand the contents of FASTQ files because this format contains raw sequence data together with the quality information needed to evaluate the accuracy of the base calls and filter out low-quality reads and sequencing errors. Additionally, FASTQ files are widely used and fit into many analysis pipelines. Other file types containing primary sequencing data that users should be familiar with include FAST5 files, which are based on the HDF5 format.

FASTQ Layout

Unlike FASTA, whose description line starts with a greater-than sign (‘>’), each FASTQ record (shown in Figure 3) begins with an ‘@’ followed by a description. The description may include details about the sequence or the sequencing run, such as the instrument the data was generated on. The nucleotide sequence is on the second line of the record, and the third line begins with a ‘+’ (plus sign), which serves as a separator and may optionally be followed by a brief description. On the fourth line, the quality score for each base on the second line is represented by a single American Standard Code for Information Interchange (ASCII) character.

Figure 3: A typical layout of a FASTQ record, containing the ‘@’ and description (line one), the bases (line two), the ‘+’ separator (line three), and the quality scores represented by ASCII characters (line four).
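The four-line layout can be read with a short Python sketch. It assumes every record occupies exactly four lines (sequence and quality are not wrapped), which is common in modern files but is an assumption here; the function name `parse_fastq` and the example reads are hypothetical.

```python
def parse_fastq(lines):
    """Yield (description, sequence, quality) from four-line FASTQ records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        # The next three lines complete the record.
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


# Hypothetical records, used for illustration only.
example = """@read1 hypothetical instrument:run:lane
GATTACA
+
IIIIHH#
@read2
ACGTN
+read2
@@@@!
""".splitlines()

records = list(parse_fastq(example))
for description, sequence, quality in records:
    print(description, sequence, quality)
```

Reading by position rather than searching for ‘@’ matters because ‘@’ is also a valid quality character, as the quality line of the second example record shows.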

The quality measurements on line four are Phred scores, which assess the reliability of base calls. A Phred quality score is logarithmically related to the estimated probability that a given base call is wrong: the higher the Phred score, the lower the probability that the base is incorrect. Phred scores are generally used as the standard measure of base quality across technologies.
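The standard Phred definition is Q = -10 * log10(P), where P is the estimated probability that the base call is wrong, so Q20 corresponds to a 1 in 100 chance of error and Q40 to 1 in 10,000. The Python sketch below decodes a quality string under the assumption that the file uses the Sanger-style encoding with an ASCII offset of 33; older Solexa/Illumina variants described in reference 2 used other conventions. The function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to a list of integer Phred scores.

    The default offset of 33 assumes the Sanger-style (Phred+33) encoding.
    """
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Estimated probability that a base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!5I")
print(scores)
for score in scores:
    print(score, error_probability(score))
```

A simple read filter could, for example, discard reads whose mean error probability exceeds a chosen threshold; the threshold itself depends on the application.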

FASTQ facts:

  • FASTQ uses the base calls A, C, T, G, and N, where N marks a base that could not be determined
  • Common file extensions include: “.fastq” and “.fq” or the gzip-compressed format, “.fastq.gz”
  • Short-read technologies performing paired-end sequencing typically generate a separate FASTQ file for each read of the pair
  • FastQC [3] and NanoPlot [4] are popular tools for quality control and visualization of FASTQ data
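Because FASTQ files are often stored gzip-compressed, a reader usually needs to handle both “.fastq” and “.fastq.gz”. The Python sketch below picks the opener from the file extension and counts records by assuming four lines per record; the helper names `open_text` and `count_records` are hypothetical, and relying on the extension rather than inspecting the file contents is a simplification.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decodes the stream as text
    return open(path, "r")


def count_records(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open_text(path) as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4
```

For paired-end data, calling `count_records` on both files of a pair is a quick sanity check that the two files contain the same number of reads.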

References:

  1. Lipman D, Pearson W. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435-1441. doi:https://doi.org/10.1126/science.2983426
  2. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2010;38(6):1767-1771. doi:https://doi.org/10.1093/nar/gkp1137
  3. Andrews S. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. www.bioinformatics.babraham.ac.uk. Published 2010. http://www.bioinformatics.babraham.a...rojects/fastqc
  4. De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Berger B, ed. Bioinformatics. 2018;34(15):2666-2669. doi:https://doi.org/10.1093/bioinformatics/bty149