A project of this scale and complexity demands an open, modular, and extensible approach to data coordination.
The Human Cell Atlas will be organizing and standardizing terabytes of data for billions of cells, across multiple modalities, generated by hundreds of labs around the world. We want to make this data open and easily accessible to all researchers, enabling the scientific community to innovate rapidly without barriers to data access. We also want to make it easy for computational researchers to develop and share new analysis approaches. To do this, we intend to design and build a modern, cloud-based, modular architecture for organizing and sharing data for the Human Cell Atlas. All software will be developed in the open and made available as open source.
Diagram of key components of the open-source data coordination platform, including a data ingestion service, a synchronized data store with multiple cloud replicas, a collection of secondary analysis pipelines for basic data processing, and a collection of tertiary portals for analyses, visualizations, and rich forms of data access.
As currently conceived, this data coordination platform will provide four key components: ingestion services for submission of data; synchronized data storage across multiple clouds; standardized secondary analysis pipelines; and portals for data access, tertiary analysis, and visualization.
Learn more about the key components below, and check back here soon for more detailed project roadmaps, code repositories, and ways to get involved!
The HCA Ingestion Service will provide the single point of entry for all HCA data. This includes raw data and metadata for projects, experiments, and samples submitted by investigators, as well as derived analyses and quality metrics automatically generated from running vetted secondary analysis pipelines.
Researchers will submit data through one of several data brokers that act as links between labs and the single Ingestion Service API. Brokers might include user-facing websites or other web services. Some may target specific geographical regions for upload efficiency, and some may provide domain- or lab-specific handling or formatting (e.g. data from image-based transcriptomics may require different handling than data from single-cell RNA sequencing). Staging systems in cloud storage will also be developed to enable faster uploads. Upon submission, the Ingestion Service will perform basic quality assurance and then deposit the data into the Data Store.
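The submission flow described above (broker, then the single Ingestion Service, then basic quality assurance, then the Data Store) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, method names, and the required metadata fields are hypothetical and do not describe the actual Ingestion Service API, which has not been published here.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """A hypothetical submission: a project name, metadata, and raw data files."""
    project: str
    metadata: dict
    files: dict  # file name -> raw bytes


@dataclass
class DataStore:
    """In-memory stand-in for the Data Store."""
    records: list = field(default_factory=list)

    def deposit(self, submission: Submission) -> int:
        self.records.append(submission)
        return len(self.records) - 1  # list index used as a toy identifier


class IngestionService:
    """Single point of entry: runs basic QA, then deposits into the Data Store."""

    # Assumed fields, chosen for illustration only.
    REQUIRED_METADATA = ("sample", "experiment")

    def __init__(self, store: DataStore):
        self.store = store

    def check(self, submission: Submission) -> list:
        """Return a list of QA problems; an empty list means the submission passes."""
        problems = []
        for key in self.REQUIRED_METADATA:
            if key not in submission.metadata:
                problems.append(f"missing metadata field: {key}")
        if not submission.files:
            problems.append("no data files attached")
        for name, payload in submission.files.items():
            if len(payload) == 0:
                problems.append(f"empty file: {name}")
        return problems

    def submit(self, submission: Submission) -> int:
        problems = self.check(submission)
        if problems:
            raise ValueError("; ".join(problems))
        return self.store.deposit(submission)


class Broker:
    """A lab-facing broker that tags data and forwards it to the one Ingestion Service."""

    def __init__(self, name: str, ingestion: IngestionService):
        self.name = name
        self.ingestion = ingestion

    def upload(self, project: str, metadata: dict, files: dict) -> int:
        tagged = dict(metadata, broker=self.name)  # record which broker handled the upload
        return self.ingestion.submit(Submission(project, tagged, files))
```

In this sketch several brokers can share one `IngestionService` instance, so every submission passes through the same checks before it reaches the store, whichever broker it came from.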
The HCA Data Store will provide a multi-site cloud-based storage system for all Human Cell Atlas data, including raw data, metadata, and certain forms of derived data from vetted secondary analysis pipelines.
One of the key goals of the platform is to ensure simple and open access to Human Cell Atlas data, and allow researchers to either download the data or compute on it directly in the cloud. For that reason, all data in the Data Store will be open to the public, and replicated across multiple cloud providers. Researchers will be able to access data directly through the Data Store’s Consumer API, or through Tertiary Portals providing query interfaces, visualization tools, and other forms of data access. Data will be submitted to the Data Store only through the Ingestion Service.
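As a rough illustration of the replication idea, the sketch below keeps identical copies of each object in several in-memory "replicas", separates the write path from the public read path, and lets a reader pick the provider closest to where they compute. Everything here is an assumption made for the example: the class names, the provider labels, and the choice to key objects by their SHA-256 digest are not taken from the platform's design.

```python
import hashlib


class Replica:
    """In-memory stand-in for one cloud provider's copy of the data."""

    def __init__(self, provider: str):
        self.provider = provider
        self.objects = {}


class ReplicatedDataStore:
    """Toy data store that writes every object to all replicas."""

    def __init__(self, providers):
        self.replicas = [Replica(p) for p in providers]

    def put(self, payload: bytes) -> str:
        """Write path; in the text above this is reserved for the Ingestion Service."""
        key = hashlib.sha256(payload).hexdigest()  # content-derived key (an assumption)
        for replica in self.replicas:
            replica.objects[key] = payload
        return key

    def get(self, key: str, provider: str = None) -> bytes:
        """Read path, standing in for the Consumer API: any reader, any replica."""
        for replica in self.replicas:
            if provider is None or replica.provider == provider:
                return replica.objects[key]
        raise KeyError(f"unknown provider: {provider}")

    def in_sync(self) -> bool:
        """True when every replica holds exactly the same objects."""
        first = self.replicas[0].objects
        return all(r.objects == first for r in self.replicas)
```

A researcher computing on one cloud would call `get(key, provider="cloud-b")` to read the local copy, while a download client could omit `provider` and take whichever replica answers first.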
The HCA Secondary Analysis service will provide pipeline execution to process raw data using community-vetted algorithms and generate intermediate derived results that will be deposited back into the Data Store.
Most data types for the Human Cell Atlas will require some processing to support the majority of downstream use cases (e.g. alignment and demultiplexing for single-cell RNA sequencing, detection and segmentation for image-based transcriptomics). The platform will provide robust, community-vetted pipelines that run on all newly submitted data and generate secondary analysis results to be deposited back into the Data Store. The Human Cell Atlas Analysis Working Group will identify which analysis pipelines will be run, including at least one vetted pipeline for each anticipated type of data (e.g. sequencing, imaging). All pipelines will be built using open-source software, and we will ensure that pipelines are available, reproducible, and executable across multiple computing environments.
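One way to picture "at least one vetted pipeline per data type, run on every new submission" is a registry keyed by data type. The sketch below is a hypothetical illustration, not the platform's pipeline framework: the `vetted` decorator, the `on_new_submission` hook, and the toy `count_reads` pipeline (which merely counts FASTQ records instead of performing alignment or demultiplexing) are all invented for this example.

```python
from typing import Callable, Dict, List

# Registry mapping a data type (e.g. "sequencing") to its vetted pipelines.
PIPELINES: Dict[str, List[Callable[[bytes], dict]]] = {}


def vetted(data_type: str):
    """Decorator that registers a function as a vetted pipeline for a data type."""
    def register(func):
        PIPELINES.setdefault(data_type, []).append(func)
        return func
    return register


@vetted("sequencing")
def count_reads(raw: bytes) -> dict:
    """Toy stand-in for a real pipeline: count FASTQ records (4 lines per read)."""
    lines = raw.decode().strip().splitlines()
    return {"pipeline": "count_reads", "reads": len(lines) // 4}


def on_new_submission(data_type: str, raw: bytes, store: list) -> int:
    """Run every vetted pipeline for this data type and deposit the results.

    `store` is a plain list standing in for the Data Store.
    Returns the number of pipelines that ran.
    """
    pipelines = PIPELINES.get(data_type)
    if not pipelines:
        raise LookupError(f"no vetted pipeline registered for {data_type!r}")
    for pipeline in pipelines:
        store.append(pipeline(raw))
    return len(pipelines)
```

Raising an error for an unregistered data type mirrors the stated goal that every anticipated type of data has at least one vetted pipeline; a real service might instead queue such data until a pipeline is approved.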
The HCA Tertiary Portals will include a wide range of downstream user-facing services, including analyses, visualizations, and other forms of access for consuming and working with HCA data.
We intend to provide a few simple portals as a starting point, but new portals can be developed by anyone in the scientific or computational community for a wide range of use cases, including clustering, differential expression, spatial reconstruction, visualization, and graph-based analysis. These portals may include web-based interfaces, analysis results, custom APIs for performing rich and structured queries, and other novel interfaces. To encourage a fully open ecosystem, there will be no requirements or governance around portal development; instead, we encourage the community to work together to develop best practices and share resources.