When reproducing data, it is critical to understand how a sequence is derived from its sample. It is therefore important to document how a sample was prepared for sequencing: the library preparation method, including any use of PCR; any specific regions that were targeted with primers (e.g. the V gene or constant region, which determines the exact region covered by the resulting sequences); and whether the library is derived from DNA or RNA, as this can change the sequencing methods. This information should be stored in the appropriate MiAIRR fields in the Metadata CSV file.
Once the preparation of a sample has been considered and documented, the sequences must be downloaded, demultiplexed (separated into individual files if they contain specific barcodes), and stripped of primers. At iReceptor, we use Cutadapt for all primer removal as well as for quality filtering, an important step in ensuring high-quality, less error-prone sequences. If the sequences are paired-end, as many are, we use PEAR to merge the forward and reverse reads into sequences that are ready for annotation (otherwise, each sequence would hold only the forward or the reverse read). This process should be documented in the appropriate MiAIRR fields in the Metadata CSV file.
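As an illustration of the primer removal and read merging steps, the following Python sketch assembles one Cutadapt command and one PEAR command for a single paired-end sample. It is a minimal sketch under stated assumptions, not the exact iReceptor pipeline: the primer sequences, file names, and quality cutoff of 20 are hypothetical placeholders, and the primers are assumed to sit at the 5' end of each read (Cutadapt's -g and -G options). Demultiplexing is not shown.

```python
import shlex


def cutadapt_command(fwd_primer, rev_primer, r1_in, r2_in, r1_out, r2_out,
                     min_quality=20):
    """Build a Cutadapt call that trims 5' primers and low-quality ends.

    Assumption: the forward primer is at the start of read 1 and the
    reverse primer is at the start of read 2. Adjust for other designs.
    """
    return [
        "cutadapt",
        "-g", fwd_primer,         # 5' primer on read 1
        "-G", rev_primer,         # 5' primer on read 2
        "-q", str(min_quality),   # quality-trimming cutoff (placeholder value)
        "-o", r1_out,             # trimmed read 1 output
        "-p", r2_out,             # trimmed read 2 output
        r1_in, r2_in,
    ]


def pear_command(r1, r2, out_prefix):
    """Build a PEAR call that merges forward and reverse reads."""
    return ["pear", "-f", r1, "-r", r2, "-o", out_prefix]


if __name__ == "__main__":
    # All sequences and file names below are hypothetical placeholders.
    trim = cutadapt_command(
        "ACGTACGTACGT", "TGCATGCATGCA",
        "sample_R1.fastq", "sample_R2.fastq",
        "trimmed_R1.fastq", "trimmed_R2.fastq",
    )
    merge = pear_command("trimmed_R1.fastq", "trimmed_R2.fastq", "sample")
    # Print the commands for review; they could instead be executed with
    # subprocess.run(trim, check=True) once the tools are installed.
    print(shlex.join(trim))
    print(shlex.join(merge))
```

Printing the commands rather than running them makes it easy to copy the exact invocation into the processing notes that accompany the Metadata CSV file.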
After the reads have been processed and paired, the sequences are prepared for annotation by converting them from FASTQ to FASTA format. This is done for every file because many annotation tools require FASTA input rather than FASTQ. Although the conversion can be done with simple commands or short Python scripts, we have also used the FASTX-Toolkit in the past to prepare the reads for annotation.
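The kind of short Python script mentioned above might look like the following. This is a minimal sketch that assumes each FASTQ record occupies exactly four lines (header, sequence, separator, quality) with no wrapped sequences; the function names and file names are hypothetical, and this is not the FASTX-Toolkit's implementation.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines (header, then sequence) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between or after records
        try:
            seq = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            next(it)  # quality line: read and discarded, FASTA has no qualities
        except StopIteration:
            raise ValueError(f"truncated FASTQ record: {header!r}") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield ">" + header[1:]  # swap the leading '@' for '>'
        yield seq


def convert_file(fastq_path, fasta_path):
    """Convert one FASTQ file to FASTA, streaming record by record."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for line in fastq_to_fasta(src):
            dst.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    convert_file("sample.assembled.fastq", "sample.assembled.fasta")
```

Because the quality line is consumed by position rather than by content, a quality string that happens to begin with "@" is not mistaken for a new record header.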