Hints on curation and provenance

The iReceptor curation process uses a well-defined directory structure to maintain data provenance for the studies it loads into its repositories. Although this process may seem overly structured, we have found that such a curation directory structure is extremely helpful for data provenance when managing many data sets across many research studies. A methodology like this one is not required, but the iReceptor team recommends it.

In addition, the iReceptor team maintains a data provenance page for each of the repositories it manages. These pages record all of the changes that have been made to each iReceptor repository, so that a researcher can track how a repository has changed over time.

When data is curated for iReceptor, the following directory structure and process are used:

Each study should reside in a single directory
- The directory name should contain the Study ID (as per MiAIRR "Unique ID assigned by study registry e.g. PRJNA001"), the first author, and the last author (e.g. PRJNA001_Smith_Jones).
Each directory should have the following sub-directories
- papers - directory where the original paper and the extracted metadata spreadsheet for the study reside
  - The file name of the metadata spreadsheet should contain the Study ID, the author names, and the date the spreadsheet was changed (e.g. PRJNA001_Smith_Jones_2018-09-25.xlsx).
  - A dated version of the metadata spreadsheet should be kept each time the metadata changes, for data provenance. We typically keep a separate tab in the spreadsheet that logs each change, when it occurred, and why it was necessary.
  - Scripts utilized to process data can be included as another directory within papers if so inclined.
- SRA_files - directory where the files downloaded from the online repository reside (typically from the SRA). See "Get access to the paper's data"
- fastq_files - directory where the fastq files are stored. These files are the result of the "Bioinformatics processing" curation step
- fasta_files - directory where the fasta files are stored. These files are typically the result of the "Bioinformatics processing" curation step
- annotation_files - directory where the annotation files are stored. These files are the result of the "Annotate sequences" curation step and are the files that are typically loaded into the iReceptor repository.
  - mixcr-YYYY-MM-DD or igblast-YYYY-MM-DD or vquest-YYYY-MM-DD
    - A folder for the annotation tool used should be created for each annotation run and should contain all annotation files for that annotation run.
- uploads - directory that holds the specific data used for an upload to an iReceptor repository. This is a critical step as it provides a direct mapping from a set of files on disk to a set of data that is in a public repository.
  - Each upload into a repository should have its own folder, encoded with the date as below.
  - upload-YYYY-MM-DD
    - A folder that holds the details of the process of uploading the metadata into the repository.
      - Each time the metadata spreadsheet is uploaded into the repository, a CSV file needs to be exported from it, and the CSV file for this specific upload should be stored here. This file should take the name of the metadata spreadsheet it was exported from, keeping the spreadsheet's date (e.g. PRJNA001_Smith_Jones_2018-09-25.csv).
      - Each time a data set is uploaded to the repository, a record of the input files should be kept here. This should be done by creating a Unix filesystem "link" to the annotations folder where the annotation data is stored. For example, for a study PRJNA001 with authors Smith and Jones:
        - There should be a study directory PRJNA001_Smith_Jones
        - Assuming the metadata for this study was created on 2018-09-25, there should be an Excel file PRJNA001_Smith_Jones/papers/PRJNA001_Smith_Jones_2018-09-25.xlsx
        - Assuming there is a MiXCR annotation run that was performed on 2018-12-03, there should be a directory PRJNA001_Smith_Jones/annotation_files/mixcr-2018-12-03. This directory should hold all of the MiXCR annotation files for that annotation run.
        - Assuming a data upload was performed on 2018-12-22, there should be a directory PRJNA001_Smith_Jones/uploads/upload-2018-12-22
          - This directory should contain a dump of the Excel metadata file in CSV format, with a single header row, named PRJNA001_Smith_Jones_2018-09-25.csv. The date should remain the same as that of the Excel file to indicate that the CSV came from that Excel file.
          - This directory should contain a link to the folder PRJNA001_Smith_Jones/annotation_files/mixcr-2018-12-03
          - This directory should contain an output file for each of the input files that were processed during the upload. The iReceptor data loading scripts create these files automatically if you use them to load the data. The preferred output file name is the name of the input file with a .out extension appended.
            - The output for metadata loading would be PRJNA001_Smith_Jones_2018-09-25.csv.out
            - The output for loading an annotation file SRR00001_annotation.txt would be SRR00001_annotation.txt.out
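
As an illustration only, the layout described above could be set up with a short Python script. This is a minimal sketch and not part of the iReceptor tooling: the function names `create_study_skeleton`, `prepare_upload`, and `output_name` are hypothetical, and the choice of a relative symbolic link (so that the study directory can be moved as a unit) is an assumption rather than a documented requirement.

```python
from pathlib import Path

# Sub-directories named in the curation structure above.
SUBDIRS = ["papers", "SRA_files", "fastq_files", "fasta_files",
           "annotation_files", "uploads"]


def create_study_skeleton(root, study_id, first_author, last_author):
    """Create <study_id>_<first_author>_<last_author>/ with its sub-directories."""
    study_dir = Path(root) / f"{study_id}_{first_author}_{last_author}"
    for name in SUBDIRS:
        (study_dir / name).mkdir(parents=True, exist_ok=True)
    return study_dir


def prepare_upload(study_dir, upload_date, annotation_run):
    """Create uploads/upload-<upload_date>/ and link it to one annotation run.

    annotation_run is the name of an existing directory under
    annotation_files, e.g. "mixcr-2018-12-03".
    """
    study_dir = Path(study_dir)
    target = study_dir / "annotation_files" / annotation_run
    if not target.is_dir():
        raise FileNotFoundError(f"annotation run not found: {target}")

    upload_dir = study_dir / "uploads" / f"upload-{upload_date}"
    upload_dir.mkdir(parents=True, exist_ok=True)

    link = upload_dir / annotation_run
    if not link.is_symlink():
        # Assumption: a relative link, resolved from inside upload_dir.
        relative_target = Path("..") / ".." / "annotation_files" / annotation_run
        link.symlink_to(relative_target, target_is_directory=True)
    return upload_dir


def output_name(input_file):
    """Preferred output file name: the input file name with .out appended."""
    return f"{Path(input_file).name}.out"
```

For the example study, `create_study_skeleton(".", "PRJNA001", "Smith", "Jones")` followed by `prepare_upload("PRJNA001_Smith_Jones", "2018-12-22", "mixcr-2018-12-03")` would produce the upload directory and link described above, provided the annotation run directory already exists.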

Given the above structure, for any given study it should be possible to trace back any errors in the data loading process and determine the source of the errors. Any errors that are discovered should be tracked through the above process and fixed, with new directories and files created as required.
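
One way to make this traceability checkable is a small script that inspects a single upload directory. The sketch below is hypothetical and is not one of the iReceptor loading scripts. It assumes that the metadata export ends in `.csv`, that the annotation run is recorded as a symbolic link inside the upload directory, that every file in the linked run was loaded, and that the preferred `.out` naming was used.

```python
from pathlib import Path


def check_upload_dir(upload_dir):
    """Return a list of problems found in one upload-YYYY-MM-DD directory."""
    upload_dir = Path(upload_dir)
    problems = []

    # The CSV export of the metadata spreadsheet used for this upload.
    csv_files = sorted(upload_dir.glob("*.csv"))
    if not csv_files:
        problems.append("no metadata CSV export found")

    # Links that record which annotation run was loaded.
    links = sorted(p for p in upload_dir.iterdir() if p.is_symlink())
    if not links:
        problems.append("no link to an annotation run directory found")

    # Every input file (metadata CSV and annotation files) should have
    # a matching <input name>.out file in the upload directory.
    inputs = list(csv_files)
    for link in links:
        if not link.is_dir():  # is_dir() follows the link; False if it is broken
            problems.append(f"broken link: {link.name}")
            continue
        inputs.extend(sorted(p for p in link.iterdir() if p.is_file()))

    for input_file in inputs:
        out_file = upload_dir / f"{input_file.name}.out"
        if not out_file.is_file():
            problems.append(f"missing output file: {out_file.name}")
    return problems
```

An empty list means the upload directory contains everything described above. Each reported problem names the missing piece, which gives a starting point for tracing an error back to its source.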