iReceptor v3.0 Data Provenance

On June 4, 2020 the iReceptor Platform, including all of the repositories used in the platform, moved to iReceptor v3.0. iReceptor v3.0 uses the AIRR Data Commons (ADC) query API v1.0 as well as the v1.3 release of the AIRR Specification. As a result, a number of fields, and in particular the contents of some of these fields, that are stored in the iReceptor Public Archive (IPA) and displayed through the iReceptor Scientific Gateway will have changed. These changes apply to all IPA repositories. Data that is downloaded from these repositories directly or through the iReceptor Gateway will also contain these changes, and therefore care should be used when comparing data that was downloaded before and after this change. All deprecated fields in the AIRR specification still exist in the specification but are marked as deprecated. A number of new fields were added and a number of fields have had type definition changes. Most of the changes apply to the Repertoire level in the AIRR specification, although there are a small number of changes to the Rearrangement specification as well. For more information on these changes, please refer to the AIRR Specification v1.3 Release Notes.

One of the main changes across all repositories is driven by the AIRR Community's adoption of ontologies and controlled vocabularies for use in the AIRR Specification. Ontologies are now used for the fields study_type, species, age_unit, disease_diagnosis, tissue, cell_subset, and cell_species. The main impact of this change is that, in addition to the value of the field (e.g. "Homo sapiens" for the species field), the ontology ID for that field is also provided (e.g. "NCBITAXON:9606"). Please refer to the AIRR Ontology page for more details.
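As a rough illustration of what this means for code that reads repertoire metadata, the sketch below accepts both a plain string (the pre-v3.0 shape) and a value paired with an ontology ID. The helper name read_ontology_field is hypothetical, and the key names "id", "label", and "value" are assumptions that should be checked against the AIRR schema version in use.

```python
from typing import Any, Optional, Tuple


def read_ontology_field(field: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (ontology_id, label) for a metadata field value.

    Handles three shapes: None (missing), a plain string (pre-v3.0 style),
    and a dict that pairs an ontology ID with a human-readable value.
    """
    if field is None:
        return (None, None)
    if isinstance(field, str):
        # Legacy shape: only the value is known, there is no ontology ID.
        return (None, field)
    # ASSUMPTION: the ID is stored under "id" and the readable value under
    # "label" or "value"; verify the key names against the AIRR schema.
    label = field.get("label", field.get("value"))
    return (field.get("id"), label)


# Example values taken from the text above.
old_species = "Homo sapiens"
new_species = {"id": "NCBITAXON:9606", "label": "Homo sapiens"}

print(read_ontology_field(old_species))
print(read_ontology_field(new_species))
```

Comparing on the label when no ID is available, and on the ID otherwise, is one way to line up data downloaded before and after the change.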

A comprehensive list of the data provenance changes involved in moving from iReceptor v2.0 to iReceptor v3.0 is given below:

  • Fields study_type, species, age_unit, disease_diagnosis, tissue, cell_subset, and cell_species: updated to use the AIRR-specified ontologies. Although this affects many important fields such as disease_diagnosis, tissue, and cell_subset, basing these fields on ontologies will make AIRR-seq data far more interoperable and reusable (the I and R in FAIR).
  • Missing data: Field values that inconsistently represented no data in the IPA have been fixed to conform to the AIRR specification for representing missing data. Values that used to stand for no data ("NA", "na") were updated to consistently return either null (in AIRR Repertoire JSON and ADC API responses) or an empty string (in AIRR TSV files). In IPA repositories this affected the following fields: age_event, ancestry_population, ethnicity, race, strain_name, link_type, linked_subjects, disease_diagnosis, disease_length, disease_stage, prior_therapies, immunogen, intervention, medical_history, anatomic_site, disease_state_sample, collection_time_point_relative, collection_time_point_reference, biomaterial_provider, tissue_processing, cell_phenotype, cell_quality, cell_isolation, cell_processing_protocol, template_quality, template_amount, library_generation_kit_version, reverse_pcr_primer_target_location, sequencing_facility, sequencing_run_id, sequencing_run_date, sequencing_kit, filename, paired_reads_assembly, quality_thresholds, primary_match_cutoffs, collapsing_method, data_processing_protocols, data_processing_files.
  • cell_subset: The AIRR specification states that the ontology-based cell_subset field should only hold cell types that are expected to undergo VDJ recombination and that have been sorted into cell subsets. These cells are listed in the Cell Ontology beneath the Lymphocyte node. The IPA repositories previously stored other cell types, such as PBMC, splenocytes, and glial cells, as a cell subset; this is now considered incorrect. Studies that previously stored such cells in cell_subset now have a cell_subset of null and use the tissue field to specify where the cells were obtained. For example, unless the PBMC were further sorted into cell subsets using flow cytometry or a similar technique, such a study has the following metadata: tissue = blood, cell_subset = null. Whether B cells or T cells were targeted for sequencing through PCR can be determined from pcr_target_locus.
  • pcr_target_locus: This field is now a controlled vocabulary in the AIRR specification. All data in the IPA has been updated accordingly to one of IGH, IGI, IGK, IGL, TRA, TRB, TRD, TRG.
  • sex: The sex of the subject is now a controlled vocabulary in the AIRR specification. All data in the IPA has been updated accordingly to one of male, female, pooled, hermaphrodite, intersex, not collected, or not applicable.
  • physical_linkage: This field is now a controlled vocabulary in the AIRR specification. All data in the IPA has been updated accordingly to "none".
  • age_min, age_max: The AIRR Specification now supports age_min and age_max, and either field can take a null value to express an unbounded age range (e.g. >40 equates to age_min = 40 and age_max = null). The IPA repositories and the old API returned -1 and 999 for unknown age_min and age_max values. These placeholder values are no longer returned.
  • cell_storage: cell_storage is a required field in the AIRR specification. In some instances the IPA repositories returned an empty value. These studies now return FALSE.
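
When comparing metadata downloaded before and after the move to v3.0, it can help to bring the older records into the newer conventions first. The following is a minimal sketch of that idea, not part of iReceptor or the AIRR tooling: the function name normalize_legacy_record is hypothetical, records are assumed to be flat dictionaries keyed by field name, and mapping the old -1 and 999 age placeholders to null is an assumption based on the missing-data convention described above.

```python
from typing import Any, Dict

# Strings that pre-v3.0 data used to mean "no data" (from the list above).
LEGACY_MISSING = {"NA", "na"}


def normalize_legacy_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a pre-v3.0 metadata record using v3.0 conventions.

    ASSUMPTION: `record` is a flat dict of field name -> value. Real AIRR
    Repertoire JSON is nested, so this would need adapting.
    """
    out: Dict[str, Any] = {}
    for key, value in record.items():
        # "NA" / "na" become null (None in Python).
        if isinstance(value, str) and value in LEGACY_MISSING:
            value = None
        out[key] = value

    # ASSUMPTION: the old -1 / 999 placeholders are treated as null.
    if out.get("age_min") == -1:
        out["age_min"] = None
    if out.get("age_max") == 999:
        out["age_max"] = None

    # cell_storage is required; empty legacy values are treated as False.
    if "cell_storage" in out and out["cell_storage"] in (None, ""):
        out["cell_storage"] = False
    return out


legacy = {
    "age_min": 40,
    "age_max": 999,
    "ethnicity": "NA",
    "cell_storage": "",
    "tissue": "blood",
}
print(normalize_legacy_record(legacy))
```

When writing the normalized record to an AIRR TSV file rather than JSON, the null values would be emitted as empty strings, in line with the missing-data rule above.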