The focus of the iReceptor project is to federate large immunogenetic databases from multiple laboratories and enable researchers to easily and efficiently perform complex analyses on these federated databases. One of the key design components of iReceptor is its distributed data model. A distributed data model, although difficult to support, is, we believe, critical to the success of research in this area. This is for two main reasons:
- Next-generation sequencing has caused an explosion in the data available to labs that are carrying out immunogenetics research. In order to answer complex immunogenetic questions, these labs need to collaborate in a variety of ways. Although large-scale repositories for sequence data exist, it is our belief that it is not practical to provide centralized repositories at the scale needed to enable the types of collaboration that need to occur. A distributed data model means that each lab needs to store only a relatively small amount of data. A Scientific Gateway that enables the federation of such data sets means that complex research questions can be answered through queries across distributed data sets.
- Immunogenetic data is invariably patient data, and therefore needs to be treated with confidentiality and security. Data use typically goes through institutional ethics boards and the data stewards at given institutions need to be confident that the data is treated securely. A distributed data model enables the data steward to store, monitor, and share data while at the same time having explicit and direct control over who has access to that data.
The goal of the iReceptor project is to hide the technical complexities of the above problem, while at the same time empowering immunogenetics researchers to perform very sophisticated (and in many cases, computationally expensive) analyses on federated data from multiple, distributed databases. There are five architectural components of the system:
- An iReceptor immunogenetics data model and database design that builds on and enhances existing immunogenetics data models. This is a general data model that encompasses the basic sequence data, meta-data about the source of the sequence data (subject, lab, experiment, and sample data), and annotation data from annotation tools such as IMGT/V-QUEST. This data model will be shared with the research community, enabling research labs to create databases using this data model.
- A data adaptor/import/conversion service for transforming collaborator immunogenetics data sets into the iReceptor data model. Such a service will enable researchers to import data into a database that uses the iReceptor data model. The service will support the importing of sequence data, meta-data about the sequence data, and annotation data (as described above).
- A database service that will expose access to immunogenetics databases that use the iReceptor data model. This service will allow research labs that have an immunogenetics database that supports the iReceptor data model to expose parts of their database to the iReceptor world. Through this service, the iReceptor Gateway will allow researchers to pose queries across multiple, distributed immunogenetics databases and to federate the results of those queries for analysis.
- A scientific gateway web platform that can federate distributed immunogenetics databases and perform complex analyses on this federated data. The iReceptor Gateway will not only link distributed immunogenetic databases but will also enable researchers to perform complex analyses on high-end computational systems (for example, systems in the Compute Canada network of HPC machines). The key feature of the iReceptor Gateway is that it hides the complexity of the database queries, the staging of the federated data, and the advanced computation from the end user.
- A set of analysis services that enable immunogenetics researchers to perform computationally expensive analyses on advanced computational infrastructure. Using the Agave Scientific Gateway web service architecture, the iReceptor Gateway enables extensible and flexible data movement and advanced computation on a wide range of HPC platforms including Compute Canada infrastructure.
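The first two components above (a shared data model and an import/conversion service) can be illustrated with a minimal sketch. This is not the actual iReceptor schema: every class name, field name, and the flat input row layout below are assumptions chosen only to show how sequence data, source meta-data, and annotation data might sit together in one record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleMetadata:
    """Hypothetical meta-data about the source of a sequence."""
    lab: str
    experiment: str
    subject_id: str
    sample_id: str


@dataclass
class Annotation:
    """Hypothetical annotation produced by an external annotation tool."""
    tool: str
    v_gene: Optional[str] = None
    j_gene: Optional[str] = None


@dataclass
class SequenceRecord:
    """One sequence together with its meta-data and annotations."""
    sequence_id: str
    sequence: str
    metadata: SampleMetadata
    annotations: list[Annotation] = field(default_factory=list)


def import_row(row: dict[str, str]) -> SequenceRecord:
    """Convert one flat row (an assumed collaborator export format) into the model."""
    metadata = SampleMetadata(
        lab=row["lab"],
        experiment=row["experiment"],
        subject_id=row["subject_id"],
        sample_id=row["sample_id"],
    )
    # Normalise the sequence to upper case so records from different labs compare equal.
    record = SequenceRecord(row["sequence_id"], row["sequence"].upper(), metadata)
    # Annotation columns are optional; attach an annotation only if a V gene is present.
    if row.get("v_gene"):
        record.annotations.append(
            Annotation(
                tool=row.get("annotation_tool", "unknown"),
                v_gene=row["v_gene"],
                j_gene=row.get("j_gene") or None,
            )
        )
    return record
```

Keeping the meta-data and annotations as separate nested structures mirrors the three kinds of data the import service is described as handling, so each can be loaded or omitted independently.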
A diagram of the iReceptor architecture can be seen below. In this diagram, there is a set of distributed immunogenetics databases hosted at collaborating international institutions (e.g., Simon Fraser University, Compute Canada, and the University of Texas Southwestern (UTSW) Medical School). Each centre maintains its own database (using the iReceptor DB Model) and has explicit control over its own lab's data. Through the iReceptor DB Import Service, research teams can load lab, experiment, subject, sequence, and annotation data into their own local database. Through the iReceptor DB Service, each lab/institution can control who has access to which components of the database (managed by the lab's data steward). Using the iReceptor Gateway, researchers can pose complex immunogenetics research questions across multiple databases, with each researcher able to query only those databases to which they have been granted access. The iReceptor Gateway federates the results of the distributed database queries and stores that data in the researcher's workspace. That data can then be staged to a large HPC resource for computation. After the computation is completed, the iReceptor Gateway stages the analysis results back to the researcher's workspace. Using this model, a researcher can then perform iterative data exploration and analysis across all of their collaborators' data.
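The query flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the iReceptor Gateway implementation: the `Node` class, its in-memory record list, and the `federated_query` function are hypothetical stand-ins for a lab database behind its DB Service and for the Gateway's federation step.

```python
from dataclasses import dataclass, field
from typing import Callable

Record = dict[str, str]


@dataclass
class Node:
    """Hypothetical stand-in for one lab's database behind its DB service."""
    name: str
    records: list[Record]
    allowed_users: set[str] = field(default_factory=set)

    def query(self, user: str, predicate: Callable[[Record], bool]) -> list[Record]:
        # Access is enforced by the node itself, reflecting the data steward's control.
        if user not in self.allowed_users:
            raise PermissionError(f"{user} may not query {self.name}")
        return [record for record in self.records if predicate(record)]


def federated_query(
    nodes: list[Node], user: str, predicate: Callable[[Record], bool]
) -> list[Record]:
    """Run one query on every node the user may access and merge the results."""
    merged: list[Record] = []
    for node in nodes:
        try:
            hits = node.query(user, predicate)
        except PermissionError:
            continue  # Nodes that deny access are skipped rather than failing the whole query.
        # Tag each result with its source node so provenance survives federation.
        merged.extend({**hit, "source": node.name} for hit in hits)
    return merged
```

A caller would pass a predicate such as `lambda r: r.get("v_gene") == "IGHV1-69"`; the merged list plays the role of the data stored in the researcher's workspace before it is staged for computation.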
Proposed configuration for the next phase of the iReceptor environment. Data migration services facilitate input of data into nodes of receptor databases (e.g., VDJServer data commons, BC Genome Sciences Centre, iReceptor Public Archive (iPA) at SFU, etc.). The iReceptor database service authenticates access at three levels: public data “commons”; sharing within consortia (common consent, MTA, etc.); and within a laboratory. The Agave (TACC) iReceptor Gateway web service queries sequences across nodes (e.g., “give me all sequences from anti-HIV antibodies using the IGHV1-69 gene”) and packages these for analysis by offsite immune repertoire tools.
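The three access levels named in the caption can be expressed as a small authorization check. The sketch below is an assumption about how such a rule could look, not the iReceptor database service's actual logic: the `AccessLevel`, `User`, and `Dataset` types are hypothetical, and the rule that the owning lab can always read its own consortium-level data is an illustrative choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    PUBLIC = "public"          # public data "commons"
    CONSORTIUM = "consortium"  # shared within a consortium
    LAB = "lab"                # restricted to the owning laboratory


@dataclass(frozen=True)
class User:
    name: str
    lab: str
    consortia: frozenset[str] = frozenset()


@dataclass(frozen=True)
class Dataset:
    owner_lab: str
    level: AccessLevel
    consortium: Optional[str] = None  # only meaningful for CONSORTIUM data


def can_access(user: User, dataset: Dataset) -> bool:
    """Return True if the user may read the dataset under the three-level scheme."""
    if dataset.level is AccessLevel.PUBLIC:
        return True
    if dataset.level is AccessLevel.CONSORTIUM:
        # Assumption: consortium members and the owning lab may both read.
        return dataset.consortium in user.consortia or user.lab == dataset.owner_lab
    # LAB level: only members of the owning laboratory.
    return user.lab == dataset.owner_lab
```

In practice, conditions such as common consent or a material transfer agreement would have to be checked before a user is added to a consortium; the sketch treats consortium membership as already established.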