Raw and processed sequences should be archived both before and after annotations have been produced. Maintaining the preliminary sequence files allows us to connect samples directly to their final annotations. Keeping a consistent filename format for the pre-annotation files and the annotation files is essential for linking all stages of sequence processing and ensuring that each sample can be traced through to its annotation files. For example, files taken from the SRA will generally have SRR IDs (SRR1234567, etc.) attached to them. If the SRR ID is kept unchanged in the filenames at every processing stage, the sample can be easily traced from start to finish. Typically, this is the main identifier that is conserved between steps; only the prefixes and extensions of the filenames are altered (filtered_SRR#######.fastq, paired_SRR#######.fastq, SRR#######.fasta, etc.). Each SRR ID corresponds to a particular sample ID, and that mapping is documented in the metadata.
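As an illustration of why a conserved SRR ID makes tracing straightforward, here is a minimal Python sketch that pulls the SRR ID out of each filename and groups files from different processing stages under it. This is not part of the iReceptor tooling: the function names, the `SRR` followed by digits pattern, and the example annotation filename (`SRR1234567_annotations.tsv`) are assumptions made for illustration only.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: an SRR ID is "SRR" followed by one or more digits.
SRR_PATTERN = re.compile(r"SRR\d+")


def extract_srr_id(filename):
    """Return the SRR ID embedded in a filename, or None if there is none."""
    match = SRR_PATTERN.search(Path(filename).name)
    return match.group(0) if match else None


def group_by_srr(filenames):
    """Group filenames from all processing stages under their shared SRR ID."""
    groups = defaultdict(list)
    for name in filenames:
        srr_id = extract_srr_id(name)
        if srr_id is not None:  # skip files that carry no SRR ID
            groups[srr_id].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical study directory listing; the .tsv name is illustrative only.
    files = [
        "filtered_SRR1234567.fastq",
        "paired_SRR1234567.fastq",
        "SRR1234567.fasta",
        "SRR1234567_annotations.tsv",
        "README.txt",
    ]
    for srr_id, members in group_by_srr(files).items():
        print(srr_id, members)
```

Because only the prefix and extension change between stages, a single pattern is enough to recover the identifier. Files without an SRR ID are simply skipped, which makes it easy to spot anything in a study directory that cannot be traced back to a sample.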
Though only the annotation files will be uploaded to the iReceptor database, we maintain copies of the original sequence data (fasta/fastq) locally. All pre-annotation files remain in the same directory as the annotation files, which keeps each study together and easy to traverse. The curation and archival process for data curated in iReceptor repositories is detailed in the "Hints on Curation and Provenance" section.