FASTA and FASTQ: A Guide to Key File Formats for Sequencing Data SEQanswers (2024)

Performing successful sequencing analysis requires an understanding of different file formats and how they are used for various applications. Scientists interested in completing their own sequencing analysis should learn the purpose and contents of each format. In the next two articles, we will explore some of the essential file formats used in sequencing data analysis and their significance in the field.

FASTA

The FASTA file format is one of the most popular formats for storing biological sequence data. These text-based files can be used for storing strings of amino acids (peptides) or nucleotide sequences (DNA or RNA). They are routinely used for sequence annotation, database searches, and multiple sequence alignment.

The FASTA format was originally created with the development of the FASTP program¹, a platform used for searching amino acid sequence databases. “When the FASTP program was written in the fall of 1983, there were no publicly available protein sequence databases, so there was no standard format for protein sequences,” explained FASTA creator Dr. William R. Pearson, Professor of Biochemistry and Molecular Genetics in the School of Medicine at the University of Virginia. “There were two standard formats for DNA sequence databases, the Genbank format and the EMBL format. Both formats were developed by database people, and had field labels in specific columns with multiple field types, which included a lot more information than simply the sequence; those formats are still in use today.”

During the FASTP program’s development, Pearson and his colleagues collaborated with Margaret Dayhoff’s group from the Protein Identification Resource (PIR) at Georgetown University. Her group had a relatively simple but important format for protein sequence databases, and that original format looked like the image in Figure 1.

Figure 1: Original format for storing protein sequence information developed by the PIR (provided by Dr. Pearson)

“We received a protein sequence database from the PIR group in this format, so it was the first format the FASTP program could read,” explained Pearson. “However, molecular biologists using this format often forgot to include the description line, which meant that the first line of the sequence was lost (because it was read as the description).” This ultimately led to the simpler and more common file bioinformaticians use today. “The FASTA format was invented by putting both the accession information (HAHU) and the description on the line starting with the ‘>’ (greater-than sign),” Pearson explained about the new file example, shown in Figure 2 below.

Figure 2: An example of the updated and current FASTA layout (provided by Dr. Pearson)

The file’s new format was rapidly adopted for a number of reasons. “This was very easy for biologists to remember, and, because there were no fixed location fields, it was easy to type in sequences correctly,” said Pearson. When asked if there were any advantages of storing data in this file type, Pearson stated succinctly, “The advantage was simplicity: A line starting with ‘>’ for a description (and to indicate the beginning of a new sequence in a file/database with multiple sequences), everything else is a sequence to be analyzed.”
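The rule Pearson describes (a line starting with ‘>’ opens a new record, and everything else is sequence) is simple enough to sketch in a few lines of Python. This is a minimal illustration rather than a reference parser: the function name read_fasta and the two example records are hypothetical, and the sketch assumes a plain-text file with no comment lines.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from an open FASTA text stream."""
    description: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A '>' line closes the previous record (if any) and opens a new one.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None and line:
            # Any other non-empty line is sequence; lines may have any length.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-record file: one nucleotide and one peptide sequence.
example = io.StringIO(
    ">seq1 hypothetical example record\n"
    "ACGTAC\n"
    "GTTA\n"
    ">seq2 second record\n"
    "MKV\n"
)
records = list(read_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```

Because sequence lines are simply concatenated, the sketch does not care how a sequence is wrapped across lines, which mirrors the lack of fixed-location fields in the format.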

Since its initial development in 1983, the FASTA format has remained largely unchanged. “Since then, different groups have used the information in the description line in different ways, but there were no constraints on either the length of the description line or the length of the sequence line,” added Pearson. “This was another feature that made the format easy to use and easy to incorporate into analysis workflows.”

Despite the growing number of file types used for sequencing analysis and sequence storage, the FASTA format is still widely used today. As Pearson explained, “Almost all other bioinformatics file formats involve some kind of field-based format, which in general can be much more powerful and easier to compute on. But the FASTA format allowed biologists to easily enter (and examine) sequence data to create their own sequence sets. It is very information-dense, and is well suited to similarity searching, the purpose it was designed for.”

FASTA facts:

FASTA uses standard IUB/IUPAC amino acid and nucleic acid codes
Some of the common file extensions are: “.fasta”, “.fa”, “.ffn”, “.frn”, “.fna”, and “.faa”
Pearson clarified that “FASTA” is pronounced “FAST-long-A”, not “FAST-Ah”
In-depth details about FASTA organization can be found at: https://blast.ncbi.nlm.nih.gov/doc/blast-topics/

From FASTA to FASTQ

Derived from FASTA, FASTQ is a similar text-based format for storing sequence information. Unlike FASTA, however, FASTQ files also carry details from the sequencing run that produced them. The main difference between the two formats is that FASTQ stores raw sequencing reads together with the quality scores for their base calls.

The FASTQ format was created by Dr. Jim Mullikin during his time at the Wellcome Trust Sanger Institute², although its widespread use and an official publication on the format didn’t occur until years later. Initially designed for Sanger capillary sequencing, the FASTQ format was adapted for use with next-generation sequencing. Several other variations of FASTQ were created for specific technologies, but now the format has become fairly consistent across platforms.

It is important to understand the contents of FASTQ files because this format contains raw sequence data along with per-base quality scores, which can be used to evaluate the accuracy of the base calls and to filter out low-quality reads and sequencing errors. Additionally, FASTQ files are widely used and fit into many analysis pipelines. Other important file types containing primary sequencing data that users should be familiar with include FAST5 files and the HDF5 format on which they are based.

FASTQ Layout

Unlike the greater-than sign (‘>’) that starts the FASTA description line, each record in the FASTQ format (shown in Figure 3) begins with a line that starts with an ‘@’, followed by the description. The description may include details about the sequences or the sequencing run, such as the instrument the data was generated on. The nucleotide sequence is on the second line of the record, and the third line is simply a ‘+’ (plus sign), which serves as a separator and may also contain a brief description. On the fourth line of the record, the quality score for each base on the second line is represented by a single American Standard Code for Information Interchange (ASCII) character, so the second and fourth lines have the same length.

Figure 3: A normal layout of a FASTQ containing the ‘@’ and description (line one), the bases (line two), the ‘+’ separator (line three), and the quality scores represented by ASCII characters (line four).
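The four-line layout can be read with a short Python sketch. It assumes every record occupies exactly four lines (sequence and quality strings are not wrapped), which is the common case but not guaranteed for every file. The names FastqRecord and read_fastq and the two example reads are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    description: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ text stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:  # an empty string means end of file
            return
        header = header.rstrip("\r\n")
        if not header:  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator line, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield FastqRecord(header[1:], sequence, quality)


# Two hypothetical reads; the description text is made up for illustration.
example = io.StringIO(
    "@read1 hypothetical_instrument\n"
    "GATTACA\n"
    "+\n"
    "IIIIIII\n"
    "@read2\n"
    "ACGTN\n"
    "+\n"
    "II5!#\n"
)
fastq_records = list(read_fastq(example))
for record in fastq_records:
    print(record.description, record.sequence, record.quality)
```

Reading by position rather than scanning for ‘@’ is a deliberate choice here: ‘@’ is also a valid quality character, so a quality line may itself begin with it.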

The quality measurements on line four are Phred scores, which assess the reliability of base calls. A Phred quality score is a logarithmic transform of the estimated probability that a base call is wrong: each 10-point increase corresponds to a tenfold drop in that probability, so a score of 20 means roughly a 1-in-100 chance of error and a score of 30 means 1 in 1,000. The higher the Phred score, the lower the probability that the base is incorrect. Phred scores are the de facto standard for reporting base quality across sequencing technologies.
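The conversion from quality characters to error probabilities can be sketched as follows. A Phred score Q relates to the estimated error probability P by Q = -10 log10(P), so P = 10^(-Q/10). The sketch assumes the Sanger-style encoding in which the score is the character's ASCII code minus 33; some older Solexa/Illumina variants described in reference 2 used a different offset, so the offset is left as a parameter. The function names and the threshold of 20 are illustrative choices, not a recommendation.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes the Sanger-style encoding (ASCII code minus 33) by default.
    """
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


def passes_filter(quality: str, min_mean: float = 20.0) -> bool:
    """Keep a read if its mean Phred score reaches a (hypothetical) threshold."""
    return mean_quality(quality) >= min_mean


example_quality = "II5!#"  # made-up quality string for a five-base read
scores = phred_scores(example_quality)
for char, score in zip(example_quality, scores):
    print(char, score, error_probability(score))
print("mean:", mean_quality(example_quality), "keep:", passes_filter(example_quality))
```

Averaging Phred scores is a simple heuristic; because the scale is logarithmic, a filter based on the mean error probability would weight low-quality bases more heavily.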

FASTQ facts:

FASTQ uses the base calls A, C, T, G, and N
Common file extensions include: “.fastq” and “.fq” or the gzip-compressed format, “.fastq.gz”
Short-read technologies performing paired-end sequencing generate a separate FASTQ file for each read of the pair (commonly labeled R1 and R2)
FastQC³ and NanoPlot⁴ are popular tools for processing FASTQ files
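As a small illustration of the file-extension and paired-end points above, the following Python sketch opens either a plain or a gzip-compressed FASTQ file and checks that two paired files hold the same number of reads. It assumes four lines per record, and the file names (sample_R1.fastq.gz, sample_R2.fastq) and read contents are hypothetical; the demo writes its own temporary files so it can run anywhere.

```python
import gzip
import os
import tempfile
from typing import TextIO


def open_fastq(path: str) -> TextIO:
    """Open a FASTQ file as text, decompressing on the fly if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path: str) -> int:
    """Count reads, assuming every record occupies exactly four lines."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4


# Demo with two tiny, made-up paired files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    r1_path = os.path.join(tmp, "sample_R1.fastq.gz")
    r2_path = os.path.join(tmp, "sample_R2.fastq")
    with gzip.open(r1_path, "wt") as out:
        out.write("@pair1/1\nACGT\n+\nIIII\n@pair2/1\nGGCA\n+\nIIII\n")
    with open(r2_path, "w") as out:
        out.write("@pair1/2\nTTGA\n+\nIIII\n@pair2/2\nCCAT\n+\nIIII\n")
    r1_count = count_reads(r1_path)
    r2_count = count_reads(r2_path)

# Paired files are expected to list the same reads in the same order.
pairs_consistent = r1_count == r2_count
print(r1_count, r2_count, pairs_consistent)
```

A mismatch in read counts between the two files usually signals a truncated or separately filtered file and is worth checking before running a paired-end pipeline.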

References:

1. Lipman D, Pearson W. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435-1441. doi:https://doi.org/10.1126/science.2983426
2. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2010;38(6):1767-1771. doi:https://doi.org/10.1093/nar/gkp1137
3. Andrews S. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. www.bioinformatics.babraham.ac.uk. Published 2010. http://www.bioinformatics.babraham.a...rojects/fastqc
4. De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Berger B, ed. Bioinformatics. 2018;34(15):2666-2669. doi:https://doi.org/10.1093/bioinformatics/bty149