The focus of the iReceptor project is to federate large Adaptive Immune Receptor Repertoire (AIRR-seq) data repositories from multiple laboratories and enable researchers to easily and efficiently perform complex analyses on these federated repositories. One of the key design components of iReceptor is its distributed data model. A distributed data model, although difficult to support, is we believe critical to the success of research in this area. This is for two main reasons:
- Next generation sequencing has caused an explosion in the data available to labs that are carrying out immunogenetics research. In order to answer complex immunogenetic questions, these labs need to collaborate in a variety of ways. Although large scale repositories for AIRR-seq data exist, it is our belief that it is not practical to provide a single repository (or even a small number of large repositories) at the scale that will be effective in enabling the types of collaboration that need to occur. A distributed data model means that each lab needs to store a relatively small amount of data. A Scientific Gateway that enables the federation of such data sets means that complex research questions (through queries across distributed AIRR-seq data sets) can be answered.
- AIRR-seq data is invariably patient data, and therefore needs to be treated with confidentiality and security. Data use typically goes through institutional ethics boards and the data stewards at given institutions need to be confident that the data is treated securely. A distributed data model enables the data steward to store, monitor, and share data while at the same time having explicit and direct control over who has access to that data.
The goal of the iReceptor project is to hide the technical complexities of the above problem, while at the same time empowering immunogenetics researchers to perform very sophisticated (and in many cases, computationally expensive) analyses on federated data from multiple, distributed AIRR-seq repositories.
There are a number of architectural components of the iReceptor Platform:
- An iReceptor AIRR-seq data model and repository design that builds on the AIRR Community's Minimal Standard for AIRR-seq data (MiAIRR). The MiAIRR standard encompasses the basic sequence data, meta-data about the source of the sequence data (subject, lab, experiment, and sample data), and annotation data from annotation tools such as IMGT's vQuest, igblast, and MiXCR. This data model has been developed by the AIRR Community and implemented by the iReceptor team.
- A data repository service that provides access to AIRR-seq data repositories that use the MiAIRR data standard. This service will allow research labs that have an AIRR-seq repository that supports the MiAIRR data model to expose parts of their repository to external clients. Access to this data is provided through the implementation of the iReceptor REST API, a web-based application programming interface that allows queries to be made against an iReceptor repository. Through this service, the iReceptor Gateway will enable researchers to pose queries across multiple, distributed, AIRR-seq repositories and to federate the results of those queries for analysis.
- An iReceptor Data Commons. The above two components, in combination, provide the ability to create a network of international AIRR compliant data repositories that we call the iReceptor Data Commons.
- A scientific gateway web platform that can federate distributed AIRR-seq repositories and perform complex analyses on this federated data. The iReceptor Gateway will not only link distributed AIRR-seq repositories but it will also enable researchers to perform complex analyses on high end computational systems (for example, systems in Compute Canada). Note, the key component of the iReceptor Gateway is that it hides the complexity of the repository queries, data staging of the federated data, and the advanced computation from the end user.
- A set of analysis services that enable AIRR-seq researchers to perform computationally expensive analyses on advanced computational infrastructure. Using the AGAVE Science-as-a-Service architecture, the iReceptor Gateway enables extensible and flexible data movement and advanced computation on a wide range of HPC platforms including Compute Canada infrastructure.
A diagram of the iReceptor architecture is given above. In this diagram, there are a set of distributed AIRR-seq repositories (the iReceptor Data Commons), existing at collaborating international institutions (e.g Simon Fraser University, Compute Canada, and the University of Texas South Western (UTSW) Medical School). Each centre maintains its own repository (using the iReceptor repository model) and has explicit control over that labs data. Through the iReceptor Repository Service, each lab/institution can control who has access to which components in the repository (managed by their lab data steward). Using the iReceptor Gateway, researchers can pose complex immunogenetics research questions across multiple repositories, with researchers able to access only those repositories to which they have been granted access. The iReceptor Gateway federates the results of the queries across the distributed repositories and stores that data in a researcher's workspace. That data can then be staged to a large HPC resource for computation. After the computation is completed, the iReceptor Gateway stages the analysis results back to the researcher's workspace. Using this model, a researcher can then perform iterative data exploration and analysis across all of their collaborators data.
In the architecture diagram above, the blue components are existing components of the v2.0 release of the iReceptor platform (May 2018). The green components are components that are proposed components that we expect to integrate into the iReceptor Data Commons in the near future. The white components are conceptual components that the current architeture supports but have not been implemented and are not on the immediate road map for iReceptor.