Approach for Aim 1
Data Standards and Operational Informatics Infrastructure
This short-term, fast-paced project, which responds to an FOA issued under a declared public health emergency, requires rapid deployment of a basic operational informatics infrastructure to describe, standardize, process, analyze, and share the generated datasets while working closely with the consortium, in particular the DCC. For the overall goals of the consortium, data management and stewardship will be guided by the FAIR principles (62), which are now widely endorsed and increasingly implemented by funding agencies, industry, academia, and publishers. We will therefore substantially reuse software tools, workflows, and protocols that we previously developed in the LINCS (25), IDG (46), and NIH Data Commons research consortia. These include metadata standards (59) and data exchange specifications that can be rapidly deployed and implemented; operational informatics components to capture, process, and manage metadata (53); full-stack software systems, including the LINCS Data Portal (27, 54) and the recently developed COVID Tracker (https://xsat.idsc.miami.edu); and various scripts and efficiency tools, for example, to formalize metadata specifications.
Overall Process of Sample and Data Collection, Standardization, Processing, and Sharing
Wastewater samples will be collected under different conditions at various locations and times and then processed, followed by quantitation of SARS-CoV-2 using different technologies and processing/normalization protocols. Metadata and data will be standardized to enable data integration, analysis, and sharing. Fig. 1 illustrates the process we implemented in the NIH Illuminating the Druggable Genome (IDG) consortium: developing data standards that describe samples and datasets, formalizing these standards, describing the data and metadata in a standardized format (e.g., JSON-LD), and making them accessible via an API. We previously developed and deployed a sustainable end-to-end process and various supporting tools for data submission, validation, standardization, aggregation, and sharing (53). Briefly, the standardized sample and dataset descriptions and data files that are collected include the required sample/reagent, assay/detection, experimental, dataset, and processing pipeline metadata. Data validation is based on data types, allowed data values (e.g., from ontologies), and additional rules. Metadata and data are then standardized, uniquely identified, further annotated and processed by mapping to ontologies and other external data sources, and stored. Dataset packages, including the dataset and all metadata, are created, and a unique persistent identifier (PURL) is generated for each. These dataset packages and all metadata are made accessible via an open Data Portal, APIs, and a set of R packages.
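The validation step described above (data types, allowed values, and additional rules) can be illustrated with a minimal sketch. The field names, the controlled vocabulary, and the two extra rules are hypothetical placeholders chosen for illustration; they are not the consortium's agreed standard, and the production pipeline (53) is more extensive.

```python
from datetime import datetime

# Hypothetical field specification: expected type and, optionally, allowed values.
FIELD_SPEC = {
    "sample_id": {"type": str},
    "collected_at": {"type": str},  # expected to be an ISO 8601 timestamp
    "ph": {"type": float},
    "concentration_method": {
        "type": str,
        "allowed": {"ultrafiltration", "electronegative filtration"},
    },
}


def validate_record(record):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for field, spec in FIELD_SPEC.items():
        if field not in record:
            errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        # Check 1: data type.
        if not isinstance(value, spec["type"]):
            errors.append(f"{field}: expected {spec['type'].__name__}")
            continue
        # Check 2: allowed values (stand-in for an ontology lookup).
        if "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"{field}: value '{value}' not in controlled vocabulary")
    # Check 3: additional rules (illustrative only).
    ph = record.get("ph")
    if isinstance(ph, float) and not 0.0 <= ph <= 14.0:
        errors.append("ph: outside the range 0-14")
    timestamp = record.get("collected_at")
    if isinstance(timestamp, str):
        try:
            datetime.fromisoformat(timestamp)
        except ValueError:
            errors.append("collected_at: not an ISO 8601 timestamp")
    return errors


good = {
    "sample_id": "WW-0001",
    "collected_at": "2020-09-01T08:30:00",
    "ph": 7.2,
    "concentration_method": "ultrafiltration",
}
bad = {
    "sample_id": "WW-0002",
    "collected_at": "not a date",
    "ph": 15.0,
    "concentration_method": "centrifugation",
}
print(validate_record(good))
print(validate_record(bad))
```

Returning a list of errors rather than raising on the first failure lets a submitter correct all problems in a record in one pass, which suits batch submissions.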
Datatypes and Metadata
Wastewater data will include the date and time of sampling; the sampling location; the buildings and the configuration of the sewage system contributing to the sampling point; basic climatologic properties at the time of sampling, such as ambient temperature and humidity; the measured physical-chemical properties of the sample (temperature, salinity, turbidity, pH); the non-COVID microbiological characteristics (E. coli by culture and qPCR plus the viral indicator polyomavirus); and the levels of SARS-CoV-2 RNA obtained with the different concentration methods (ultrafiltration versus electronegative filtration) and detection methods (qRT-PCR, qLAMP, FA, RNA sequencing, and metatranscriptomics).
Human surveillance data will be consolidated at the population level. At the county level, the daily number, severity (e.g., requiring hospitalization and ICU care), and zip code of cases will be combined with the time of human specimen draw, the number and degree of infections contributing to a particular building, and the degree to which those infected utilized the building. At the community level, additional data will include the buildings visited, the degree of use of each building (e.g., dormitory versus classroom), and the results from the re-analysis of biobanked samples based on qRT-PCR, qLAMP, FA, sequencing (Aim 2), and metatranscriptomics (Aim 3).
Data Standards and Data Harmonization
Data standards include reporting guidelines (to describe collection sites, samples, assay methods, protocols, datasets, etc.), terminology artifacts (controlled vocabularies or ontologies), and data models and formats. In two projects, we have been collaborating with and contributing to FAIRsharing to define such standards (50). Standards will be developed in close coordination with the consortium under the guidance of the DCC, likely via a working group. We propose to adopt our previous approach, which was successfully implemented for diverse data types (59). To differentiate model and confounder metadata, we will leverage the PhenX Toolkit (21), which can help improve protocols and reporting guidelines, and Common Data Elements to report data and metadata (51). To formalize descriptions of data and metadata, we will first look at Bioschemas (bioschemas.org) (16) and then leverage other resources. For example, to describe sample collection and experimental parameters, we will consider the Experimental Factor Ontology (EFO) (39) and the Ontology for Biomedical Investigations (OBI) (6). For general metadata, we propose Dublin Core (60). Once reporting guidelines, data elements, and reference standards (such as ontologies for the various sample and data types) are agreed upon within the consortium, they will be implemented as a schema formalized in JSON. We have already developed many such schemas, as well as tools to generate JSON schemas from tabular descriptions of data elements (https://github.com/schurerlab/FAIR-Schema-Utils).
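Generating a JSON schema from a tabular description of data elements can be sketched as follows. This is a minimal illustration only and does not reproduce the interface of FAIR-Schema-Utils; the column layout, the element names, and the function table_to_json_schema are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical tabular description of data elements (one row per element).
# Allowed values are separated by "|" within a single cell.
ELEMENTS_CSV = """name,type,required,allowed_values,description
sample_id,string,yes,,Unique sample identifier
ph,number,no,,Measured pH of the sample
detection_method,string,yes,qRT-PCR|qLAMP,Assay used to quantify SARS-CoV-2 RNA
"""


def table_to_json_schema(csv_text, title):
    """Build a JSON Schema (draft-07) dictionary from a tabular description."""
    properties = {}
    required = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        prop = {"type": row["type"], "description": row["description"]}
        if row["allowed_values"]:
            # A controlled vocabulary becomes an enum constraint.
            prop["enum"] = row["allowed_values"].split("|")
        properties[row["name"]] = prop
        if row["required"].strip().lower() == "yes":
            required.append(row["name"])
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": title,
        "type": "object",
        "properties": properties,
        "required": required,
    }


schema = table_to_json_schema(ELEMENTS_CSV, "WastewaterSample")
print(json.dumps(schema, indent=2))
```

Keeping the data elements in a table lets domain experts in a working group review and edit them without writing JSON, while the generated schema remains machine-checkable.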
Virus and Microbiome Standards
To create a rigorous set of control data for sample collection, sequencing, and analysis, we will leverage the ongoing work to create and benchmark microbiome and genome standards at the National Institute of Standards and Technology (NIST). This includes NIST’s Steering Committee for the Genome in a Bottle Consortium (GIAB) and the International Metagenomics and Microbiome Standards Alliance (IMMSA), for which Mason (PI) is a co-leader. In addition, sewage testing of the controls being used in the NIST Coronavirus Standards Working Group and the COVID-19 X-Prize will provide needed stability data on titrated, well-characterized controls (including purified RNA and encapsulated viruses) for the water treatment plants.
Data Management and Processing
Once formal JSON schemas are established, sample and dataset descriptions can be collected directly in a standardized manner. We have already developed a Resource Submission System (RSS, http://rss.ccs.miami.edu) that allows users to directly submit descriptions of resources and reagents. RSS is in production in the IDG consortium and currently supports several data- and resource-generating centers with six resource types and eight data types, plus resource descriptions and batch submissions. The standardized submitted descriptions are then stored in a document store (MongoDB is used for RSS) as JSON-LD, which includes the formalized description values along with standardized data fields. The JSON-LD formalized data are made available via a Swagger RESTful API; in the case of RSS, we use SmartAPI specifications (13). Further, we will leverage the infrastructure of the LINCS Data Portal (LDP), including the LINCS Data Registry and dataset packaging (27), and the latest iteration of LDP, which includes a signature store that makes signatures such as expression data directly searchable and computable (54). This is relevant for predictive data modeling and metatranscriptomics analysis (Aim 3).
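As an illustration of storing a submitted description as JSON-LD, the sketch below wraps a flat metadata record with an @context that maps field names to vocabulary terms. The mapping, the example.org identifier, and the function to_jsonld are hypothetical; the schema.org and Dublin Core term IRIs are real, but their selection here is an assumption and does not reflect the RSS data model.

```python
import json

# Hypothetical mapping of local field names to vocabulary IRIs.
CONTEXT = {
    "identifier": "http://purl.org/dc/terms/identifier",
    "name": "https://schema.org/name",
    "description": "https://schema.org/description",
    "dateCreated": "https://schema.org/dateCreated",
}


def to_jsonld(record, record_id, record_type="https://schema.org/Dataset"):
    """Wrap a flat metadata record as a JSON-LD document.

    Fields without a vocabulary mapping are rejected so that every stored
    value carries a formal definition.
    """
    unmapped = sorted(key for key in record if key not in CONTEXT)
    if unmapped:
        raise ValueError(f"fields without a vocabulary mapping: {unmapped}")
    return {"@context": CONTEXT, "@id": record_id, "@type": record_type, **record}


doc = to_jsonld(
    {
        "identifier": "WW-0001",
        "name": "Wastewater SARS-CoV-2 quantitation, site A",
        "dateCreated": "2020-09-01",
    },
    record_id="https://example.org/dataset/WW-0001",  # placeholder identifier
)
print(json.dumps(doc, indent=2))
```

Because the result is an ordinary JSON object, it can be inserted into a document store unchanged and returned by an API without further transformation.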
Data Sharing and Publication
Most recently, we have been developing infrastructure at the Sylvester Comprehensive Cancer Center that operationally integrates the functionality of LDP and RSS directly with the Onco-Genomics Shared Resource (OGSR), so that generated data can be seamlessly processed and samples can be described using established and formalized metadata. This infrastructure will be used to share data between our UM and WCM sites and with the Consortium DCC. With the UM Institute for Data Science and Computing, we have deployed the COVID tracker eXperimental Situation Awareness Tool (XSAT) application (https://xsat.idsc.miami.edu). Currently, XSAT processes and visualizes new COVID-19 cases, hospitalizations, and deaths by geographical location, time, and patient age and gender. Data from the Florida Department of Health are processed and updated every 6 hours (Fig. 2). XSAT is a full-stack software application that ingests, indexes, processes, and visualizes data with geographic detail from FL Health, leveraging PostGIS, Apache Solr, Node.js/Express, MapServer GL, and AngularJS/Leaflet/Bootstrap. We plan to integrate the wastewater surveillance data into XSAT and leverage the same technology and architecture to code, process, and visualize geographic and temporal data. These data standards can help the broader RADx-rad SARS-CoV-2 wastewater-based surveillance research consortium, in coordination with NIH COVID-19 DR2 and NIST.
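One way wastewater measurements could be prepared for a map-based viewer is to encode them as GeoJSON, a format that Leaflet can render directly. The sketch below is an assumption about a possible exchange format, not a description of XSAT's internal data model; the site names, coordinates, values, and field names are invented for illustration.

```python
import json


def to_feature_collection(measurements):
    """Convert per-site measurements into a GeoJSON FeatureCollection."""
    features = []
    for m in measurements:
        features.append(
            {
                "type": "Feature",
                # GeoJSON positions are ordered [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [m["lon"], m["lat"]]},
                "properties": {
                    "site": m["site"],
                    "date": m["date"],
                    "copies_per_liter": m["copies_per_liter"],
                },
            }
        )
    return {"type": "FeatureCollection", "features": features}


# Invented example values; not real measurements.
example = [
    {"site": "Site A", "lat": 25.72, "lon": -80.28,
     "date": "2020-09-01", "copies_per_liter": 1200.0},
    {"site": "Site B", "lat": 25.79, "lon": -80.21,
     "date": "2020-09-01", "copies_per_liter": 450.0},
]
collection = to_feature_collection(example)
print(json.dumps(collection, indent=2))
```

Carrying the sampling date in each feature's properties lets the same structure support both the geographic and the temporal views.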
13 Dumontier, M.; Dastgheib, S.; Whetzel, T.; Assisi, P.; Avillach, P.; Jagodnik, K.; Korodi, G.; Pilarczyk, M.; Schurer, S.; Terryn, R. smartAPI: Towards a More Intelligent Network of Web APIs. In the 25th Conference on Intelligent Systems for Molecular Biology and the 16th European Conference on Computational Biology, 2017.
16 Garcia, L.; Giraldo, O.; Garcia, A.; Dumontier, M. Bioschemas: schema.org for the Life Sciences. Proceedings of SWAT4LS, 2017.
21 Hendershot, T.; Pan, H.; Haines, J.; Harlan, W. R.; Marazita, M. L.; McCarty, C. A.; Ramos, E. M.; Hamilton, C. M. Using the PhenX Toolkit to Add Standard Measures to a Study. Current Protocols in Human Genetics 2015, 86, 1
25 Keenan, A. B.; Jenkins, S. L.; Jagodnik, K. M.; Koplev, S.; He, E.; Torre, D.; Wang, Z.; Dohlman, A. B.; Silverstein, M. C.; Lachmann, A.; Kuleshov, M. V.; Ma’ayan, A.; Stathias, V.; Terryn, R.; Cooper, D.; Forlin, M.; Koleti, A.; Vidovic, D.; Chung, C.; Schurer, S. C., et al. The Library of Integrated Network-Based Cellular Signatures NIH Program: System-Level Cataloging of Human Cells Response to Perturbations. Cell Systems 2018, 6, 13-24.
27 Koleti, A.; Terryn, R.; Stathias, V.; Chung, C.; Cooper, D. J.; Turner, J. P.; Vidovic, D.; Forlin, M.; Kelley, T. T.; D’Urso, A.; Allen, B. K.; Torre, D.; Jagodnik, K. M.; Wang, L.; Jenkins, S. L.; Mader, C.; Niu, W.; Fazel, M.; Mahi, N.; Pilarczyk, M.; Clark, N.; Shamsaei, B.; Meller, J.; Vasiliauskas, J.; Reichard, J.; Medvedovic, M.; Ma’ayan, A.; Pillai, A.; Schurer, S. C. Data Portal for the Library of Integrated Network-based Cellular Signatures (LINCS) program: integrated access to diverse large-scale cellular perturbation response data. Nucleic Acids Research 2018, 46, D558-D566.
39 Malone, J.; Holloway, E.; Adamusiak, T.; Kapushesky, M.; Zheng, J.; Kolesnikov, N.; Zhukova, A.; Brazma, A.; Parkinson, H. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010, 26, 1112-1118.
46 Oprea, T. I.; Bologa, C. G.; Brunak, S.; Campbell, A.; Gan, G. N.; Gaulton, A.; Gomez, S. M.; Guha, R.; Hersey, A.; Holmes, J.; Jadhav, A.; Jensen, L. J.; Johnson, G. L.; Karlson, A.; Leach, A. R.; Ma’ayan, A.; Malovannaya, A.; Mani, S.; Mathias, S. L.; McManus, M. T.; Meehan, T. F.; von Mering, C.; Muthas, D.; Nguyen, D. T.; Overington, J. P.; Papadatos, G.; Qin, J.; Reich, C.; Roth, B. L.; Schurer, S. C.; Simeonov, A.; Sklar, L. A.; Southall, N.; Tomita, S.; Tudose, I.; Ursu, O.; Vidovic, D.; Waller, A.; Westergaard, D.; Yang, J. J.; Zahoranszky-Kohalmi, G. Unexplored therapeutic opportunities in the human genome. Nature Reviews Drug Discovery 2018, 17, 377
50 Sansone, S. A.; McQuilton, P.; Rocca-Serra, P.; Gonzalez-Beltran, A.; Izzo, M.; Lister, A. L.; Thurston, M.; Community, F. A. FAIRsharing as a community approach to standards, repositories and policies. Nature Biotechnology 2019, 37, 358-367
53 Stathias, V.; Koleti, A.; Vidovic, D.; Cooper, D. J.; Jagodnik, K. M.; Terryn, R.; Forlin, M.; Chung, C.; Torre, D.; Ayad, N.; Medvedovic, M.; Ma’ayan, A.; Pillai, A.; Schürer, S. C. Sustainable data and metadata management at the BD2K-LINCS Data Coordination and Integration Center. Scientific Data 2018, 5.
54 Stathias, V.; Turner, J.; Koleti, A.; Vidovic, D.; Cooper, D.; Fazel-Najafabadi, M.; Pilarczyk, M.; Terryn, R.; Chung, C.; Umeano, A.; Clarke, D. J. B.; Lachmann, A.; Evangelista, J. E.; Ma’ayan, A.; Medvedovic, M.; Schurer, S. C. LINCS Data Portal 2.0: next generation access point for perturbation-response signatures. Nucleic Acids Research 2020, 48, D431-D439.
59 Vempati, U. D.; Chung, C.; Mader, C.; Koleti, A.; Datar, N.; Vidovic, D.; Wrobel, D.; Erickson, S.; Muhlich, J. L.; Berriz, G.; Benes, C. H.; Subramanian, A.; Pillai, A.; Shamu, C. E.; Schurer, S. C. Metadata Standard and Data Exchange Specifications to Describe, Model, and Integrate Complex and Diverse High-Throughput Screening Data from the Library of Integrated Network-based Cellular Signatures (LINCS). Journal of Biomolecular Screening 2014, 19, 803-816.
60 Weibel, S. L.; Koch, T. The Dublin core metadata initiative. D-lib Magazine 2000, 6, 1082-9873.
62 Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J. W.; da Silva Santos, L. B.; Bourne, P. E.; Bouwman, J.; Brookes, A. J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, S.; Evelo, C. T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A. J.; Groth, P.; Goble, C.; Grethe, J. S.; Heringa, J.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 2016, 3, 160018.