Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

FAIRSCAPE: a Framework for FAIR and Reproducible Biomedical Analytics

FAIRSCAPE: a Framework for FAIR and Reproducible Biomedical Analytics Results of computational analyses require transparent disclosure of their supporting resources, while the analyses themselves often can be very large scale and involve multiple processing steps separated in time. Evidence for the correctness of any analysis should include not only a textual description, but also a formal record of the computations which produced the result, including accessible data and software with runtime parameters, environment, and personnel involved. This article describes FAIRSCAPE, a reusable computational framework, enabling simplified access to modern scalable cloud-based components. FAIRSCAPE fully implements the FAIR data principles and extends them to provide fully FAIR Evidence, including machine-interpretable provenance of datasets, software and computations, as metadata for all computed results. The FAIRSCAPE microservices framework creates a complete Evidence Graph for every computational result, including persistent identifiers with metadata, resolvable to the software, computations, and datasets used in the computation; and stores a URI to the root of the graph in the result’s metadata. An ontology for Evidence Graphs, EVI (https://w3id.org/EVI), supports inferential reasoning over the evidence. FAIRSCAPE can run nested or disjoint workflows and preserves provenance across them. It can run Apache Spark jobs, scripts, workflows, or user-supplied containers. All objects are assigned persistent IDs, including software. All results are annotated with FAIR metadata using the evidence graph model for access, validation, reproducibility, and re-use of archived data and software. . . . . . . Keywords FAIR data FAIR software Digital Commons Evidence graph Provenance Reproducibility Agumentation Introduction Motivation Maxwell Adam Levinson, Justin Niestroy and Sadnan Al Manir Computation is an integral part of the preparation and content contributed equally to this work. of modern biomedical scientific publications, and the findings they report. Computations can range in scale from simple * Timothy Clark twclark@virginia.edu statistical routines run in Excel spreadsheets to massive or- chestrations of very large primary datasets, computational Department of Public Health Sciences (Biomedical Informatics), workflows, software, cloud environments, and services. University of Virginia School of Medicine, Charlottesville, VA, USA They typically produce data and generate images or tables as Department of Pediatrics, University of Virginia School of Medicine, output. Scientific claims of the authors are supported by evi- Charlottesville, VA, USA dence that includes reference to the theoretical constructs em- Center for Advanced Medical Analytics, University of Virginia bodied in existing domain literature, and to the experimental School of Medicine, Charlottesville, VA, USA or observational data and its analysis represented in images or Department of Medicine, University of Virginia School of Medicine, tables. Charlottesville, VA, USA Today, increasingly strict requirements are demanded to Department of Statistics, University of Virginia College and leave a digital footprint of each preparation and analysis step Graduate School of Arts and Sciences, Charlottesville, VA, USA in derivation of a finding to support reproducibility and reuse University of Virginia School of Data Science, Charlottesville, VA, of both data and tools. The widely recommended and often USA 188 Neuroinform (2022) 20:187–202 required practice by publishers and funders today is to archive the objects themselves may be retrieved, given the appropriate and cite one’s own experimental data (Cousijn et al., 2018; permissions. In this model, core metadata retrieved on resolu- Data Citation Synthesis Group, 2014; Fenner et al., 2019; tion of a persistent identifier (PID) (Juty et al., 2020; Starr Groth et al., 2020); and to make it FAIR (Wilkinson et al., et al., 2015) will include an evidence graph for the object 2016). These approaches were developed over more than a referenced by the PID. A link to the object’s evidence graph decade by a significant community of researchers, archivists, can be embedded in its metadata. funders, and publishers, prior to the current recommendations The central goals of FAIRSCAPE can be summarized as (Altman et al., 2001; Altman & King, 2007; Borgman, 2012; (1) to develop reusable cloud-based “data commons” frame- Bourne et al., 2012;Brase, 2009; CODATA/ITSCI Task works adapted for very large-scale data analysis, providing Force on Data Citation, 2013;King, 2007; Starr et al., 2015; significant value to researchers; and (2) to make the computa- Uhlir, 2012). There is increasing support among publishers tions, data, and software in these environments fully transpar- and the data science community to recommend, in addition, ent and FAIR (findable, accessible, interoperable, reusable). archiving and citing the specific software versions used in FAIRSCAPE supports a “data ecosystem” model (Grossman, analysis (Katz et al., 2021a; Smith et al., 2016), with persistent 2019) in which computational results and their provenance are identification and standardized core metadata, to establish transparent, verifiable, citable, and FAIR across the research FAIRness for research software (Katz et al., 2021b; lifecycle. We combined elements of prior work by ourselves Lamprecht et al., 2020); and to require identification via per- and others on provenance, abstract argumentation frame- sistent identifiers, of critical research reagents (A. works, data commons models, and citable research objects, Bandrowski, 2014; A. E. Bandrowski & Martone, 2016; to create the FAIRSCAPE framework. This work very signif- Prager et al., 2018). icantly extends and refactors the identifier and Metadata How do we facilitate and unify these developments? Can Services we and our colleagues developed in the NIH Data we make the recorded digital footprints as broadly useful as Commons Pilot Project Consortium (Clark et al., 2018; possible in the research ecosystem, while their generation oc- Fenner et al., 2018; NIH Data Commons Pilot: Object curs as side-effects of processes inherently useful to the re- Registration Service (ORS), 2018). searcher – for example, in large scale data analytics and data FAIRSCAPE has a unique position in comparison to other commons environments? provenance-related, reproducibility-enabling, and “data com- The solution we developed is a reusable framework for mons” projects. We combine elements of all three approaches, building provenance-aware data commons environments, while providing transparency, FAIRness, validation, and re- which we call FAIRSCAPE. It provides several features di- use of resources; and emphasize reusability of the rectly useful to the computational scientist, by simplifying and FAIRSCAPE platform itself. Our goal is to enable researchers to implement effective and useful provenance-aware compu- accelerating important data management and computational tasks; while providing, as metadata, an integrated evidence tational data commons in their own research environments, at graph of the resources used in performing the work, allowing any scale, while supporting full transparency of results across them to be retrieved, validated, reused, modified, and projects, via Evidence Graphs represented using a formal extended. ontology. Evidence graphs are formal models inspired by a large body of work in abstract argumentation (Bench-Capon & Related Work Dunne, 2007; Brewka et al., 2014; Carrera & Iglesias, 2015; Cayrol & Lagasquie-Schiex, 2009;Dung, 1995;Dung& Works focusing on provenance per se such as (Alterovitz Thang, 2018; Gottifredi et al., 2018; Rahwan, 2009), and et al., 2018; Ellison et al., 2020) and the various workflow analysis of evidence chains in biomedical publications provenance systems such as (Khan et al., 2019; (Clark et al., 2014;Greenberg, 2009, 2011), which shows that Papadimitriou et al., 2021;Yakutovich et al., 2021) are pri- the evidence for correctness of any finding, can be represented marily concerned with very detailed documentation of each as a directed acyclic support graph, an Evidence Graph. When computation on one or more datasets. The W3C PROV model combined with a graph of challenges to statements, or their (Gil et al., 2013; Lebo et al., 2013;Moreau et al., 2013)was evidence, this becomes a bipolar argument graph - or argu- developed initially to support interoperability across the trans- mentation system (Cayrol & Lagasquie-Schiex, 2009, 2010, formation logs of workflow systems. Our prior work on 2013). Micropublications (Clark et al., 2014) extending and The nodes in these graphs can readily provide metadata repurposing several core classes and predicates from W3C about the objects related to the computation, including the PROV, were preliminary work forming a basis for the EVI computation parameters and history. Each set of metadata ontology (Al Manir et al., 2021a, 2021b). may be indexed by one or more persistent identifiers, as spec- The EVI ontology used in FAIRSCAPE to represent evi- ified in the FAIR principles; and may include a URI by which dence graphs, is concerned with creating reasonable Neuroinform (2022) 20:187–202 189 transparency of evidence supporting scientific claims, includ- Enabling Transparency through EVI’sFormalModel ing computational results; it reuses the three major PROV classes Entity, Activity,and Agent as a basis to develop a To enable the necessary results transparency across separate detailed ontology and rule system for reasoning across the computations, we abstracted core elements of our evidence for (and against) results. When a computational re- micropublications model (Clark et al., 2014) to create EVI sult is reused in any new computation, that information is (http://w3id.org/EVI), an ontology of evidence relationships added to the graph, whether or not the operations were con- that extends W3C PROV to support specific evidence types trolled by a workflow manager. Challenges to results, found in biomedical publications; and enable reasoning across datasets, or methods, may also be added to the graph. While deep evidence graphs, and propagation of evidence challenges our current use of EVI is on computational evidence, it is deep in the graph, such as: retractions, reagent contamination, designed to be extensible to objects across the full experimen- errors detected in algorithms, disputed validity of methods, tal and publication lifecycle. challenges to validity of animal models, and others. EVI is Systems providing data commons environments, such as based on the fundamental idea that scientific findings or the various NCI and NHLBI cloud platforms (Birger et al., claims are not facts, but assertions backed by some level of 2017; Brody et al., 2017; Lau et al., 2017; Malhotra et al., evidence, i.e., they are defeasible components of 2017;Wilson et al., 2017) while providing many highly useful argumentation. Therefore, EVI focuses on the structure of specialized capabilities for their domain users, including re- evidence chains that support or challenge a result, and on use of data and software, have not focused extensively on providing access to the resources identified in those chains. providing re-use of their own frameworks, and are centralized. Evidence in a scientific article is in essence, a record of the As noted later in this article, FAIRSCAPE can be – and is provenance of the finding, result, or claim asserted as likely to meant to be - installed on public, private, or hybrid cloud be true; along with the theoretical background material platforms, “bare metal” clusters, and even on high-end lap- supporting the result’s interpretation. tops, for use at varying scopes – personal, institution-wide, If the data and software used in analysis are all registered lab-wide, multi-center, etc. and receive persistent identifiers (PIDs) with appropriate Reproducibility platforms such as Whole Tale and metadata, a provenance-aware computational data lake, i.e., CodeOcean, (Brinckman et al., 2019;Chard etal., 2019; a data lake with provenance-tracking computational services, Merkys et al., 2017) attempt to take on a one-stop-shop role can be built that attaches evidence graphs to the output of each for researchers wishing to demonstrate or at least assert, repro- process. At some point, a citable object - a dataset, image, ducibility of their computational research. Of these, figure, or table will be produced as part of the research. If this, CodeOcean (https://codeocean.com) is a special case – it is too, is archived with its evidence graph as part of the metadata run by a company and appears to be principally described in and the final supporting object is either directly cited in the press releases, and not in any peer reviewed articles. text, or in a figure caption, then the complete evidence graph FAIRSCAPE’s primary goals are to enable construction of may be retrieved as a validation of the object’s derivation and multi-scale computational data lakes, or commons; and to as a set of URIs resolvable to reusable versions of the toolsets make results transparent for reuse across the digital research and data. Evidence graphs are themselves entities that can be ecosystem, via FAIRness of data, software, and computational consumed and extended at each transformation or records. FAIRSCAPE supports reproducibility via computation. transparency. The remainder of this article describes the approach, In very many cases - such as the very large analytic microservices architecture, and interaction model of the workflows in our first use case - we believe that no reviewer FAIRSCAPE framework in detail. will attempt to replicate such large-scale computations, which ran for months on substantial resources. The primary use case will be validation via inspection, and en passant validation via Materials and Methods software reuse. FAIRSCAPE is not meant to be a one-stop shop. It is a FAIRSCAPE Architectural Layers transferable, reusable framework. It is not only intended to enable localized participation in a global, fully FAIR data FAIRSCAPE is built on a multi-layer set of components using and software ecosystem – it is itself FAIR software. The a containerized microservice architecture (MSA) (Balalaie FAIRSCAPE software, including installation and deployment et al., 2016; Larrucea et al., 2018; Lewis & Fowler, 2014; instructions, is available in the CERN Zenodo archive Wan et al., 2018) running under Kubernetes (Burns et al., (Levinson et al., 2021); and in the FAIRSCAPE Github re- 2016). We run our local instance in an OpenStack (Adkins, pository (https://github.com/fairscape/fairscape). 2016) private cloud environment, and maintain it using a DevOps deployment process (Balalaie et al., 2016;Leite 190 Neuroinform (2022) 20:187–202 et al., 2020). FAIRSCAPE may also be installed on laptops API Gateway running minikube in Ubuntu Linux, MacOS, or Windows environments; and on Google Cloud managed Kubernetes. Access to the FAIRSCAPE environment is through an API An architectural sketch of this model is shown in Fig. 1. gateway, mediated by a reverse proxy. Our gateway is medi- Ingress to microservices in the various layers is through a ated by Traefik (https://traefik.io) which dispatches calls to the reverse proxy using an API gateway pattern. The top layer pro- various microservices endpoints. vides an interface to the end users with raw data and the associ- Traefik is a reverse proxy that we configure as a ated metadata. The mid layer is a collection of tightly coupled Kubernetes Ingress Controller, to dynamically configure services that allow end users with proper authorization to submit and expose multiple microservices using a single API. and view their data, metadata, and various types of computations The endpoints of the services are exposed through the performed on them. The bottom layer is built with special pur- OpenAPI specification (formerly Swagger Specification) pose storage and analytics platforms for storing and analyzing (Darrel Miller et al., 2020) which defines the standard and data, metadata and provenance information. All objects are the language-agnostic interface for publishing RESTful APIs assigned PIDs using local ARK (Kunze & Rodgers, 2008)as- and allows service discovery. Accessing the services requires signment for speed, with global resolution for generality. user authentication, which we implement using the Globus Auth authentication broker (Tuecke et al., 2016). Users of GlobusAuth may be authenticated via a number of permitted UI Layer authentication services, and are issued a token which serves as an identity credential. In our current installation we require use The User Interface layer in FAIRSCAPE offers end of the CommonShare authenticator, with site-specific two-fac- users various ways to utilize the functionalities in the tor authentication necessary to obtain an identity token. This framework. A Python client simplifies calls to the token is then used by the microservices to determine a user’s microservices. Data, metadata, software, scripts, permission to access various functionality. workflows, containers, etc. are all submitted and regis- tered by the end users from the UI Layer, which may Authentication and Authorization Layer be configured to include an interactive executable note- book environment such as Binder or Deepnote. Authentication and authorization (authN/authZ) in FAIRSCAPE are handled by Keycloak (Christie et al., Fig. 1 FAIRSCAPE architectural layers and components Neuroinform (2022) 20:187–202 191 2020), a widely-used open source identity and access manage- storage. These objects may include structured or unstructured ment tool. data, application software, workflow, scripts. The associated When Traefik receives a request, it handles an authentica- metadata contains essential descriptive information such as tion check to Keycloak, which then determines whether or not context, type, name, textual description, author, location, the requestor has a valid token for an identity that can perform checksum, etc. about these objects. Metadata are expressed the requested action. as JSON-LD and sent to the Metadata Service for further We distribute FAIRSCAPE with a preconfigured Keycloak processing. for basic username / password authentication & authorization Hashing is used to verify correct transmission of the object of service requests. This can be easily modified to support – users are required to specify a hash which is then alternative identity providers, including LDAP, OpenID recomputed by the Object Service after the object is stored. Connect, and OAuth2.0 for institutional single sign-on. Hash computation is currently based on the SHA-256 secure Services continue to interact the same way, even if you change cryptographic hash algorithm (Dang, 2015). Upon successful the configured identity provider. execution, the service returns a PID of the object in the form of Within our local Keycloak configuration, we chose to define an ARK, which resolves to the metadata. The metadata in- Globus Auth as the identity provider. Globus Auth then serves cludes, as is normal in PID architecture (Starr et al., 2015), a as a dispatching broker amongst multiple other possible final link to the actual data location. identity providers. We selected the login service at the An OpenAPI description of the interface is here: University of Virginia as our final provider, providing two- https://app.swaggerhub.com/apis/FAIRSCAPE/Transfer/ factor authentication and institutional single sign-on. Keycloak 0.1 is very flexible in allowing selection of various authentication schemes, such as LDAP, SAML, OAuth2.0, etc. Selection of Metadata Service authentication schemes is an administrator decision. The Metadata Service handles metadata registration and reso- Microservices Layer lution including identifier minting in association with the ob- ject metadata. The Metadata Service takes user POSTed The microservices layer is composed of seven services: (1) JSON-LD metadata and uploads the metadata to MongoDB Transfer, (2) Metadata, (3) Object, (4) Evidence Graph, (5) and Stardog, and returns a PID. To retrieve metadata for an Compute, (6) Search, and (7) Visualization. These are de- existing PID a user makes a GET call to the service. A PUT scribed in more detail in Section 2. Each microservice does call to the service will update an existing PID with new meta- its own request authorization, subsequent to Keycloak, en- data. While other services may read from MongoDB and abling fine-grained access control. Stardog directly, the Metadata Service handles all writes to MongoDB and Stardog. Storage and Analytic Engine Layer An OpenAPI description of the interface is here: https://app.swaggerhub.com/apis/FAIRSCAPE/Metadata- In FAIRSCAPE, an S3 compatible object store is required for Service/0.1 storing objects, a document store for storing metadata, and a graph store for storing graph data. Persistence for these data- Object Service bases is configured through Kubernetes volumes, which map specific paths on containers to disk storage. The current re- The Object Service provides a direct interface between the lease of FAIRSCAPE uses the S3 compatible MinIO as the Transfer Service and MinIO as well as maintaining consisten- object store, MongoDB as the document store, and Stardog as cy between MinIO and the metadata store. The Object Service the graph store. Computations invoked by the Compute handles uploads of new objects as well as uploading new Service are managed by Kubernetes, Apache SPARK, and versions of existing files. In both cases the Object Service the Nipype neuroinformatics workflow engine. accepts a file and desired file location as inputs and (if the location is available) uploads the file to desired location in FAIRSCAPE Microservice Components MinIO and returns a PID representing the location of the uploaded file. A DELETE call to the service will delete the Transfer Service requested file from MinIO as well as delete the PID with the link to the data, however the PID representing the object meta- This service transfers and registers digital research objects - data remains. datasets, software, etc., − and their associated metadata, to the An OpenAPI description of the interface is here: Commons. These objects are sent to the transfer service as https://app.swaggerhub.com/apis/FAIRSCAPE/Object- binary data streams, which are then stored in MinIO object Service/0.1 192 Neuroinform (2022) 20:187–202 Evidence Graph Service performs a search over all literals in the metadata for exact string matches and returns a list of all PIDs with a literal con- The Evidence Graph Service creates a JSON-LD Evidence taining the query string. It is invoked via the GET method of Graph of all provenance related metadata to a PID of interest. API endpoint to the service with the search string as argument. The Evidence Graph documents all objects such as datasets, An OpenAPI description of the interface is here: software, workflows, and the computations which are directly https://app.swaggerhub.com/apis/FAIRSCAPE/Search/0.1 involved in creating the requested entity. The service accepts a PID as its input, runs a PATH query built on top of the SPARQL Visualization Service query engine in Stardog with the PID of interest as its source to retrieve all supporting nodes. To retrieve an Evidence Graph for This service allows users to visualize Evidence Graphs inter- a PID a user may make a GET call to the service. actively in the form of nodes and directed edges, offering a The Evidence Graph Service plays an important role in consolidated view of the entities and the activities supporting reproducing computations. All resources required to run a correctness of the computed result. Our current visualization computation are exposed using persistent identifiers by the engine is Cytoscape (Shannon, 2003). Each node displays its evidence graph. A user can reproduce the same computation relevant metadata information, including its type and PID, by invoking the appropriate services available through the resolved in real-time. Python client with the help of these identifiers. This feature The Visualization Service renders the graph on an HTML allows a user to verify the accuracy of the results and detect page. any discrepancies. An OpenAPI description of the interface is here: An OpenAPI description of the interface is here: https://app.swaggerhub.com/apis/FAIRSCAPE/ https://app.swaggerhub.com/apis/FAIRSCAPE/Evidence- Visualization/0.1 Graph/0.1 FAIRSCAPE Service Orchestration Compute Service FAIRSCAPE orchestrates a set of containers to provide pat- This service executes user uploaded scripts, workflows, or con- terns for object registration, including identifier minting and tainers, on uploaded data. It currently offers two compute en- resolution; object retrieval; computation; search; evidence gines (Spark, Nipype) in addition to native Kubernetes container graph visualization, and object deletion. These patterns are execution, to meet a variety of computational needs. Users may orchestrated following API ingress, authentication, and ser- execute any script they would like to run as long as they provide vice dispatch, by microservice calls, invoking the relevant a docker container with the required dependencies. To complete service containers. jobs the service spawns specialized pods on Kubernetes de- signed to perform domain specific computations that can be Object Registration scaled to the size of the cluster. This service provides the essen- tial ability to recreate computations based solely on identifiers. Object registration occurs initially via the Transfer Service, For data to be computed on it must first be uploaded via the with an explicit user service call, and again automatically Transfer Service and be issued an associated PID. using the same service, each time a computation generates The service accepts a PID for a dataset, a script, software, output. Objects in FAIRSCAPE may be software, containers, or a container, as input and produces a PID representing the or datasets. Descriptive metadata must be specified for object activity to be completed. The request returns a job identifier registration to occur. from which job progress can be followed. Upon completion of When invoked, the Transfer Service calls the Metadata a job all outputs are automatically uploaded and assigned new Service (MDS) to mint a new persistent identifier, implement- PIDs, with provenance aware metadata. At job termination, ed as an Archival Resource Key (ARK), generated locally, the service performs a ‘cleanup’ operation, where a job is and to store it associated with the descriptive metadata, includ- removed from the queue once it is completed. ing the new registered object location. MDS stores object An OpenAPI description of the interface is here: metadata, including provenance, in both MongoDB and in https://app.swaggerhub.com/apis/FAIRSCAPE/Compute/ the Stardog graph store, allowing subsequent access to the 0.1 object metadata by other services. After minting an identifier and storing the metadata, the Search Service Transfer Service calls the Object Service to persist the new object, and then updates the metadata with the stored object The Search Service allows users to search for object metadata location. Hashing is used to verify correct transmission of the containing strings of interest. It accepts a string as input and object – users are requiredtospecify aSHA256hashon Neuroinform (2022) 20:187–202 193 registration, which is then recomputed by the Object Service workloads in the Compute Service enables all data, results, and verified after the object is stored. Internally computed and methods to be tracked via a connected evidence graph, hashes are provided for re-verification when the object is with persistent identifiers available for every node. accessed. Failure of hashes to match generates an error. The Compute Service executes computations using (a) a container specified by the user, or (b) the Apache Spark ser- Identifier Minting vice, or (c) the Nipype workflow engine. Like datasets and software (including scripts), computations are represented by The Metadata Service mints PIDs in the form of ARKs. persistent identifiers assigned to them. Objects are passed to Multiple alternative PIDs may exist for an object and PIDs the Compute Service by their PIDs and the computation is are resolved to their associated object level metadata including formally linked to the software (or script) by the the object’s Evidence Graph and location with appropriate usedSoftware property, and the input datasets by the permissions. usedDataset property. In the current deployment, ARKs created locally are regis- Runtime parameters may be passed with objects and a sin- tered to an assigned Name Assigning Authority Number. The gle identifier is minted with the given parameters and connect- ARK globally unique identifier ecosystem employs a flexible ed to the computation via the ‘parameters’ property. However, minimalistic standard and existing infrastructure. at this moment these parameters are not incorporated in the evidence graph. Identifier Resolution The Compute Service spawns a Kubernetes pod with the input objects mounted in the /data directory by default. ARK identifier resolution may be handled locally and/or by Upon completion of the job all output files in the /outputs external resolver services such as Name-to-Thing (https://n2t. directory are transferred to the object store and identifiers net). The Name-to-Thing resolver allows for Name Assigning for them are minted with the property generatedBy. The Authority Numbers (NAAN) to have redirect rules for their generatedBy property references the identifier for the ARKs, which forwards requests to the Name Mapping computation. Authority Hostport for the corresponding commons. Each FAIRSCAPE instance should independently obtain a Object Search NAAN, and a DNS name for their local FAIRSCAPE instal- lation, if they wish their ARKs to be resolved by n2t.net. Object searches are performed by the Search Service, DataCite DOI registration and resolution are planned for called directly on Service Dispatch. Search makes use of future work. Stardog’s full text retrieval, which in turn is based on Apache Lucene. Object Retrieval Objects are accessed by their PID, after prior resolution of the Evidence Graph Visualization object’s PID to its metadata (MDS) and authorization of the user’s authentication token for data access on that object. Evidence graphs of any object acquired by the system may be Object access is either directly from the Object Store, or from visualized at any point in this workflow using the wherever else the object may reside. Certain large objects Visualization Service. Nipype provides a chart of the residing in robust external archives, may not be acquired into workflows it executes using the Graphviz package. Our local object storage, but remain in place, up to the point of Evidence Graph Service is interactive, using the Cytoscape computation. package (Shannon, 2003), and allows Evidence graphs of multiple workflows in sequence to be displayed whether or Computation not they have been combined into a single flow. When executing a workload through the compute service, data, software, and containers are referenced through their Object Deletion PIDs, and by no other means. The Compute Service utilizes the stored metadata to dereference the object locations, and Objects are deleted by calls to the Object Service to clear the transfers them to the managed containers. The compute ser- object from storage, which then calls MDS and nulls out the vice also creates a provenance record of its own execution, object location in the metadata record. Metadata is retained associated with an identifier of type evi:Computation. Upon even though the object may cease to be held in the system, in the completion of a job, the Compute Service stores the gen- accordance with the Data Citation Principles (Data Citation erated output through the Transfer Service. Running Synthesis Group, 2014). 194 Neuroinform (2022) 20:187–202 Results series and make it easier for researchers to quickly build models for outcomes where the physiology is not Two use cases are presented here to demonstrate the applica- known. FAIRSCAPE can be used to build such models tion of FAIRSCAPE services. The first use case performs using its services. analysis of time series algorithms while the second runs a A series of steps are required to execute and reproduce the neuroimaging workflow. For each use case, the operations HCTSA of NICU data. They include transferring NICU data, involving data transfer, computation, and evidence graph gen- Python scripts, and associated metadata to the storage to be eration are described below: used later, running the scripts in the FAIRSCAPE compute environment, and generating the evidence graph. The first Use Case Demonstration 1: Highly Comparative Time script runs the time series algorithms on the vital sign data Series Analysis (HCTSA) of NICU Data while the second script performs clustering of these algo- rithms to generate a heatmap image. The evidence graph gen- Researchers at the Neonatal Intensive Care Unit erated for all patients contains over 17,000 nodes. However, a (NICU) at the University of Virginia continuously mon- simplified version of the computational analysis based on a itor infants and collect vital signs such as heart rate single patient is described here as the steps for executing the (HR) and oxygen saturation. Patterns in vital sign mea- analysis are common for all patients. These steps are briefly surements may indicate acute or chronic pathology described below: among infants. In the past a few standard statistics and algorithms specially designed for discovering cer- Transfer Data, Software and Metadata tain pathologies were applied to similar vital sign data. In this work we additionally applied many time series algorithms from other domains with the hope that these Before any computation is performed FAIRSCAPE re- algorithms would be helpful for prediction of unstudied quires each piece of data and software to be uploaded outcomes. A total of 67 time series algorithms have to the object store with its metadata using the POST been recoded as Python scripts and run on the vital method of the transfer service. The raw data file con- signs of 5997 infants collected over 10 years during taining the vital signs and the associated metadata are 2009–2019. The data are then merged, sampled and uploaded first. The scripts and the associated metadata clustered to find representative time series algorithms are uploaded next. The upload_file function shown below is used to transfer which express unique characteristics about the time the raw data file UVA_7219_HR.csv as the first parameter and the associated metadata referenced by the variable dataset_meta as the second parameter: Neuroinform (2022) 20:187–202 195 As part of the transfer, identifiers are minted by the Generating the Evidence Graph Metadata Service for each successfully uploaded object. The variable raw_time_series_fs_id refers to the minted identifier An evidence graph is generated using the GET method of the returned by the function. Each identifier is resolvable to the Evidence Graph Service. Figure 3 illustrates all computations uploaded object which can be accessed only by an authorized and the associated inputs and outputs for a single patient. The user. graph for all patients contains 17,995 nodes of types Image, Computation, Dataset and Software. Each patient has a unique Raw Time Series Feature Set, a Raw Data Analysis Time Series Data Analysis computation, and a Processed Time Series file. The Raw Data Analysis Script, HCTSA Clustering Script, HCTSA Heatmap Once the transfer is complete, the computation for the data Generation, and HCTSA Heatmap are shared among all analysis can be started. The computation takes the raw vital patients. sign measurements as input, groups the measurements into The simplified evidence graph in Fig. 3 contains 7 nodes, 10-min intervals and runs each algorithm on them. each with its own PID, where a Computation (Raw Data FAIRSCAPE makes launching such a computation easy by Analysis) uses a Dataset (Raw Time Series Feature Set) as executing the POST method of the compute service with iden- the input to a Software (Raw Data Analysis Script), tifiers of the data and script as parameters. The compute ser- representing the script to execute all the time series algo- vice creates an identifier with metadata pointing to the provid- ed inputs and launches a Kubernetes pod to perform the com- rithms, and generates the Dataset (Processed Time Series Feature Set) as the output. The next Computation (HCTSA putation. Upon completion of the script, all output files are assigned identifiers and stored in the object store. Cluster Heatmap Generation) uses the processed Dataset gen- erated during the previous computation as the input of the The compute function takes the PIDs of the dataset, soft- Software (HCTSA Clustering Script), which generates an ware/script, and the type of job such Apache Spark, Nipype, Image (HCTSA Cluster Heatmap), representing the clustering or custom containers as parameters. The PID it returns refers of the algorithms as an output. The evidence_graph function to the submitted job which can be used to track the progress of takes the PID of the HCTSA Heatmap image: the computation and its outputs. evidence_graph_jsond = FAIR.evidence_graph The compute function shown below is used to launch a computation on the raw data file raw_ts_fs_id as the first (HCTSA_heathmap_id) and generates the evidence graph for that PID serialized in JSON-LD (shown in Fig. 4). parameter, the analysis script raw_date_analysis_script_id as the second parameter using their identifiers, and the type 2. Use Case Demonstration 2: Neuroimaging Analysis of job spark as the third parameter: raw_date_analysis_jod_id = FAIR.compute(raw_ts_fs_id, raw_date_analysis_script_id, Using Nipype Workflow Engine spark') The PID the compute function returns resolves to the Data analysis in neuroimaging often requires multiple het- submitted jo b and is referenced by. erogeneous algorithms which sometimes lack transparent in- raw_date_analysis_jod_id. teroperability under a uniform platform. Workflows offer so- lutions to this problem by bringing the algorithms and soft- Clustering of Algorithms ware under a single umbrella. The open-source neuroimaging workflow engine Nipype combines heterogenous neuroimag- The next computation step is to perform clustering of algo- ing analysis software packages under a uniform operating rithms. Many algorithms are from similar domains and the platform which resolves the interoperability issues by operations they perform express similar characteristics. The allowing them to talk to each other. Nipype provides access HCTSA Clustering script clusters these algorithms into groups to a detailed representation of the complete execution of a which are highly correlated and a representative algorithm workflow consisting of inputs, output and runtime parameters. could be chosen from each grouping. The compute service is The containerization-friendly release, detailed workflow rep- then invoked with identifiers of the clustering script and the resentation, and minimal effort required to modify existing processed data as parameters. The compute function below services to produce a deep evidence graph have made takes PIDs of the processed time series feature set, clustering Nipype an attractive target for integration within the script, and spark job type as input parameters and returns a PID FAIRSCAPE framework. representing the job to generate the HCTSA heatmap: Among the services in FAIRSCAPE, only the compute service needed to be modified to run and interrogate Nipype. An image showing the clustered algorithms is produced at The modifications include repurposing the service to run the workflow from the Nipype-specific container generated by the the end of this step which is shown in Fig. 2. 196 Neuroinform (2022) 20:187–202 Fig. 2 NICU HCTSA clustering heatmap. X axis and Y axis are operations (algorithms using specific parameter sets), color is correlation between algorithms. The large white squares are clusters of highly correlated operations which suggest the dimension of the data may be greatly diminished by selecting “representative” algorithms from these clusters Neurodocker tool and to capture all entities from the internal contains intermediate inputs and outputs. It provides a detailed graph generated after the workflow is executed. Whereas an understanding of each analysis performed using the computa- evidence graph typically includes the primary inputs and out- tions, software and datasets. puts, the deep evidence graph produced here additionally Fig. 3 Simplified Evidence graph for one patient’s computations. Vital signs = dark blue box bottom right; computations = yellow boxes; processed data = dark blue box in middle; green box = heatmap of correlations Neuroinform (2022) 20:187–202 197 Fig. 4 JSON-LD Evidence Graph for patient computation as illustrated in Fig. 3 A workflow is considered simple if it consists of a sequence within and outside that environment. It supports every of processing steps and complex if there is nesting of workflow requirement defined in the FAIR Principles at a detailed execution such that the output of one workflow is used as the level, as defined in (Wilkinson et al., 2016), including a input to another workflow. The simple neuroimaging prepro- deep and comprehensive provenance model via cessing workflow demonstrated here (Notter, 2020) involves Evidence Graphs, contributing to more transparent sci- steps to correct motion of functional images, co-register function- ence and improved reusability of methods. al images to anatomical images, smooth the co-registered func- Scientific rigor depends on the transparency of methods (in- tional images, and detect artifacts in functional images. As part of cluding software) and materials (including data). The historian data transfer, dataset containing images and the script to run the of science Steven Shapin, described the approach developed processing steps with their associated metadata were uploaded with the first scientific journals as “virtual witnessing” using the file_upload function as shown in the previous use case. (Shapin, 1984), and this is still valid today. The typical scientific The compute function is then used to launch a computation on reader does not actually reproduce the experiment but is invited the image dataset which runs the processing script on the to review mentally every detail of how it was done to the extent repurposed Nipype container. The only exception in the compute that s/he becomes a “virtual witness” to an envisioned live dem- the function is that it uses nipype as the third input parameter onstration. That is clearly how most people read scientific papers instead of spark when the Compute Service was invoked. The - except perhaps when they are citing them, in which case less full evidence graph, generated using the evidence_graph func- care is often taken. Scientists are not really incentivized to rep- tion as demonstrated above, is too large to document due to space licate experiments; their discipline rewards novelty. constraints. Therefore, only the graph of the motion correction of The ultimate validation of any claim once it has been accept- functional images with FSL’sMCFLIRT is showninFig. 5.For ed as reasonable on its face comes with support from multiple additional details on this workflow, please consult the original distinct angles, by different investigators; successful re-use of Nipype tutorial (Notter, 2020). the materials and methods upon which it is based; and consis- tency with some body of theory. If the materials and methods are sufficiently transparent and thoroughly disclosed as to be reusable, and they cannot be made to work, or give bad results, Discussion that debunks the original experiments - precisely the way in which the promising-sounding STAP phenomenon was FAIRSCAPE enables rapid construction of a shared dig- discredited (“RETRACTED ARTICLE: Stimulus-triggered ital commons environment and supports FAIRness 198 Neuroinform (2022) 20:187–202 Fig. 5 Evidence Graph visualization for the neuroimaging workflow execution fate conversion of somatic cells into pluripotency”, 2014;Shiu, FAIRSCAPE is itself reusable and we have taken pains to pro- 2014), before the elaborate formal effort of Riken to replicate vide well-documented straightforward installation procedures. the experiments (Ishii et al., 2014;RIKEN, 2014). All resources on FAIRSCAPE are assigned identifiers As a first step then, it is not only a matter of reproducing which allow them to be shared. FAIRCAPE allows users to experiments but also of producing transparent evidence that capture the complete evidence graph of the tasks performed. the experiments have been done correctly. This permits chal- These evidence graphs show all steps of computations per- lenges to the procedures to develop over time, especially formed and the software and data that went into each compu- through re-use of materials (including data) and methods - tation. Evidence graphs, along with FAIRSCAPE’s other ser- which today significantly include software and computing vices, allow users to review and reproduce an experiment with environments. We definitely view these methods as being significantly less overhead than other standard approaches. extensible to materials such as reagents, using the RRID ap- Users can see all computations that were performed, review proach; and to other computational disciplines. the source code, and download all the data. This allows an- other party to reproduce the exact computations performed, apply the experimenter’s software to their own data, or apply Conclusion their own methods to the experimenter’sdata. The optimal use case for a FAIRSCAPE installation is a local or multi-institution digital commons, in a managed FAIRSCAPE is a reusable framework for scientific computa- Kubernetes environment. It can also be installed on high-end tions that provides a simplified interface for research users to an laptops for testing and development purposes as needed. We array of modern, dynamically scalable, cloud-based are actively looking for collaborators wishing to use, adapt, componentry. Our goal in developing FAIRSCAPE was to pro- and co-develop this software. vide an ease-of-use (and re-use) incentive for researchers, while FAIRSCAPE is not a tool for individual use, it is software rendering all the artifacts marshalled to produce a result, and the for creating a high-efficiency collective environment. As a evidence supporting them, Findable, Accessible, Interoperable, framework for sharing results and methods in the present, it and Reusable. FAIRSCAPE can be used to construct, as we also provides reliable deep provenance records across multiple have done, a provenance-aware computational data lake or Commons. It supports transparent disclosure of the Evidence computations, to support future reuse and to improve guaran- tees of reliability. This is the kind of effort we hope that cen- Graphs of computed results, with access to the persistent identi- ters and institutions will be increasingly making, to support fiers of the cited data or software, and to their stored metadata. reliable and shareable computational research results. We End-users do not need to learn a new programming language have seen increasing examples of philanthropic RFAs to use services provided by FAIRSCAPE. They require no ad- directing investment to this area. In our own research, we have ditional special expertise, other than basic familiarity with found FAIRSCAPE’s methods invaluable in supporting high- Python and the skillsets they already possess in statistics, com- ly productive computational research. putational biology, machine learning, or other data science tech- The major barriers to widespread adoption of digital com- niques. FAIRSCAPE provides an environment that makes large- mons environments, in our view, have been the relative non- scale computational work easier and results FAIRer. Neuroinform (2022) 20:187–202 199 (Harvard University), Satra Ghosh (MIT), Carole Goble (University of reusability (despite claims to the contrary) of existing com- Manchester), John Kunze (California Digital Library), Sherry Lake mons frameworks, and the additional effort required to man- (University of Virginia), Maryann Martone (University of California age a FAIR digital commons. We feel that FAIRSCAPE is a San Diego), and Neal Magee (University of Virginia), for helpful discus- contribution to resolving the reusability issue, and may also sions; and Neal Magee for technical assistance with the University of Virginia computing infrastructure. This work was supported in part by help to simplify some digital commons management issues. the U.S. National Institutes of Health, grants NIH OT3 OD025456-01, Time will tell if the philanthropic investments in these areas NIH 1U01HG009452, NIH R01-HD072071-05, and NIH U01- noted above are continued. We hope they are. HL133708-01; and by a grant from the Coulter Foundation. We plan several enhancements in future research and de- velopment with this project, such as integration of additional Declarations workflow engines, including engines for genomic analysis. We intend to provide support for DataCite DOI and Conflict of Interests The authors declare that they have no conflicts of interest. Software Heritage Identifier (SWHID) (Software Heritage Foundation, 2020) registration; with metadata and data trans- Additional Information Data used in preparing this article was obtained fer to Dataverse instances, in future. Transfer of data, soft- from the University of Virginia Center for Advanced Medical Analytics ware, and metadata to long-term digital archives such as these, and from OpenNeuro.org which are managed at the scale of universities or countries, is important in providing long-term access guarantees, beyond Rights and Permissions This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, shar- the life of an institutional center or institute. ing, adaptation, distribution and reproduction in any medium or format, as Many projects involving overlapping groups have worked long as you give appropriate credit to the original author(s) and the to address parts of the scale, accessibility, verifiability, repro- source, provide a link to the Creative Commons license, and indicate if ducibility, and reuse problems targeted by FAIRSCAPE. Such changes were made. The images or other third-party material in this article are included in the article ' s Creative Commons license, unless challenges are in large part outcomes of the transition of bio- indicated otherwise in a credit line to the material. If material is not medical and other scientific research from print to digital, and included in the article ' s Creative Commons license and your intended our increasing ability to generate data and to run computations use is not permitted by statutory regulation or exceeds the permitted use, on it at enormous scale. We make use of many of these prior you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/ approaches in our FAIRSCAPE framework, providing an in- 4.0/. tegrated model for FAIRness and reproducibility, with ease of use incentives. We believe it will be a helpful tool for con- Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adap- structing provenance-aware FAIR digital commons, as part of tation, distribution and reproduction in any medium or format, as long as an interoperating model for reproducible biomedical science. you give appropriate credit to the original author(s) and the source, pro- vide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a Information Sharing Statement credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain & Code for the microservices and the python client described permission directly from the copyright holder. To view a copy of this in this paper are publicly available in the Zenodo reposi- licence, visit http://creativecommons.org/licenses/by/4.0/. tory at https://doi.org/10.5281/zenodo.4711204 and on GitHub at https://github.com/fairscape/fairscape.All versions are available under MIT license. & Installation instructions, python Demo Notebooks, and References API documentation are available at https://fairscape. github.io, also under MIT license. Adkins, S. (2016). OpenStack: Cloud application development. & The MongoDB noSQL DB Community Version is avail- Indianapolis, IN: Wrox. http://RE5QY4SB7X.search. able under MongoDB’s license terms at https://www. serialssolutions.com/?V=1.0&L=RE5QY4SB7X&S=JCs&C= mongodb.com/try/download/community. TC0001588151&T=marc & The Stardog knowledge graph DB is available under Al Manir, S., Niestroy, J., Levinson, M. A., & Clark, T. (2021a). Evidence graphs: Supporting transparent and FAIR computation, Stardog’s license terms at https://www.stardog.com. with defeasible reasoning on Data, methods and results. BioRXiv, & The EVI ontology OWL2 vocabulary is available at 2021/437561,9. https://doi.org/10.1101/2021.03.29.437561. https://w3id.org/EVI# under MIT license. Al Manir, S., Niestroy, J., Levinson, M., & Clark, T. (2021b). EVI: The evidence graph ontology, OWL 2 vocabulary. Zenodo. https://doi. org/10.5281/zenodo.4630931. Acknowledgements We thank Chris Baker (University of New Alterovitz, G., Dean, D., Goble, C., Crusoe, M. R., Soiland-Reyes, S., Bell, A., Hayes, A., Suresh, A., Purkayastha, A., King, C. H., Brunswick), Caleb Crane (Mitsubishi Corporation), Mercè Crosas 200 Neuroinform (2022) 20:187–202 Taylor, D., Johanson, E., Thompson, E. E., Donaldson, E., discovery in a big-data environment for genetic epidemiology. Nat Genet, 49,1560–1563. https://doi.org/10.1038/ng.3968. Morizono, H., Tsang, H., Vora, J. K., Goecks, J., Yao, J., Almeida, J. S., Keeney, J., Addepalli, K. D., Krampis, K., Smith, Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). K. M., Guo, L., Walderhaug, M., Schito, M., Ezewudo, M., Borg, Omega, and Kubernetes. Commun ACM, 59(5), 50–57. Guimera, N., Walsh, P., Kahsay, R., Gottipati, S., Rodwell, T. C., https://doi.org/10.1145/2890784. Bloom, T., Lai, Y., Simonyan, V., & Mazumder, R. (2018). Carrera, Á., & Iglesias, C. A. (2015). A systematic review of argumen- Enabling precision medicine via standard communication of HTS tation techniques for multi-agent systems research. Artif Intell Rev, provenance, analysis, and results. PLoS Biol, 16(12), e3000099. 44(4), 509–535. https://doi.org/10.1007/s10462-015-9435-9. https://doi.org/10.1371/journal.pbio.3000099. Cayrol, C., & Lagasquie-Schiex, M.-C. (2009). Bipolar abstract argumen- Altman, M., Andreev, L., Diggory, M., King, G., Sone, A., Verba, S., & tation systems. In I. Rahwan & G. R. Simari (Eds.), Argumentation Kiskis, D. L. (2001). A digital library for the dissemination and in artificial intelligence. Springer. replication of quantitative social science research. Soc Sci Comput Cayrol, C., & Lagasquie-Schiex, M.-C. (2010). Coalitions of arguments: Re v, 19 (4), 458 – 470. https://doi.org/10.1177/ A tool for handling bipolar argumentation frameworks. Int J Intell Syst, 25(1), 83–109. https://doi.org/10.1002/int.20389. Altman, M., & King, G. (2007). A proposed standard for the scholarly Cayrol, C., & Lagasquie-Schiex, M.-C. (2013). Bipolarity in argumenta- citation of quantitative Data. DLib Magazine, 13(3/4), march2007- tion graphs: Towards a better understanding. Int J Approx Reason, altman. http://www.dlib.org/dlib/march07/altman/03altman.html 54(7), 876–899. https://doi.org/10.1016/j.ijar.2013.03.001. Balalaie, A., Heydarnoori, A., & Jamshidi, P. (2016). Microservices ar- Chard, K., Willis, C., Gaffney, N., Jones, M. B., Kowalik, K., Ludäscher, chitecture enables DevOps: Migration to a cloud-native architecture. B., et al. (2019). Implementing computational reproducibility in the IEEE Softw, 33(3), 42–52. https://doi.org/10.1109/MS.2016.64. whole tale environment. In Proceedings of the 2nd International Bandrowski, A. (2014). RRID’s are in the wild! Thanks to JCN and Workshop on Practical Reproducible Evaluation of Computer PeerJ. The NIF Blog: Neuroscience Information Framework. Systems - P-RECS ‘19 (pp. 17–22). Presented at the the 2nd inter- http://blog.neuinfo.org/index.php/essays/rrids-are-in-the-wild- national workshop, Phoenix, AZ, USA: ACM press. https://doi.org/ thanks-to-jcn-and-peerj 10.1145/3322790.3330594. Bandrowski, A. E., & Martone, M. E. (2016). RRIDs: A simple step Christie, M. A., Bhandar, A., Nakandala, S., Marru, S., Abeysinghe, E., toward improving reproducibility through rigor and transparency Pamidighantam, S., & Pierce, M. E. (2020). Managing authentica- of experimental methods. Neuron, 90(3), 434–436. https://doi.org/ tion and authorization in distributed science gateway middleware. 10.1016/j.neuron.2016.04.030. Futur Gener Comput Syst, 111,780–785. https://doi.org/10.1016/j. Bench-Capon, T. J. M., & Dunne, P. E. (2007). Argumentation in artifi- future.2019.07.018. cial intelligence. Artif Intell, 171(10–15), 619–641. https://doi.org/ Clark, Tim, Ciccarese, P., & Goble, C. (2014). Micropublications: A 10.1016/j.artint.2007.05.001. semantic model for claims, evidence, arguments and annotations Birger, C., Hanna, M., Salinas, E., Neff, J., Saksena, G., Livitz, D., et al. in biomedical communications. Journal of Biomedical Semantics, (2017). FireCloud, a scalable cloud-based platform for collabora- 5(1). http://www.jbiomedsem.com/content/5/1/28 tive genome analysis: Strategies for reducing and controlling costs Clark, T., Katz, D. S., Bernal Llinares, M., Castillo, C., Chard, K., Crosas, (preprint). Bioinformatics. https://doi.org/10.1101/209494. M., et al. (2018, September 3). DCPPC DRAFT: KC2 Globally Borgman, C. (2012). Why are the attribution and citation of scientific data Unique Identifier Services. National Institutes of Health, Data important? In P. Uhlir & D. Cohen (Eds.), Report from developing Commons Pilot Phase Consortium. https://public. Data attribution and citation PRactices and standards: An interna- nihdatacommons.us/DCPPC-DRAFT-8_KC2/ tional symposium and workshop. Washington DC: National CODATA/ITSCI Task Force on Data Citation. (2013). Out of cite, out of Academy of Sciences’ Board on Research Data and Information. mind: The current state of practice, policy and Technology for Data National Academies Press. http://works.bepress.com/cgi/ Citation. Data Science Journal, 12,1–75. https://doi.org/10.2481/ viewcontent.cgi?article=1286&context=borgman dsj.OSOM13-043. Bourne, P., Clark, T., Dale, R., de Waard, A., Herman, I., Hovy, E., & Cousijn, H., Kenall, A., Ganley, E., Harrison, M., Kernohan, D., Shotton, D. (2012). Improving future research communication and Lemberger, T., Murphy, F., Polischuk, P., Taylor, S., Martone, e-scholarship: A summary of findings. Informatik Spectrum, 35(1), M., & Clark, T. (2018). A data citation roadmap for scientific pub- 56–57. https://doi.org/10.1007/s00287-011-0592-1. lishers. Scientific data, 5, 180259. Brase, J. (2009). DataCite - A Global Registration Agency for Research Dang, Q. H. (2015). Secure Hash Standard (no. NIST FIPS 180-4) (p. Data. In Proceedings of the 2009 Fourth International Conference NIST FIPS 180-4). National Institute of Standards and Technology. on Cooperation and Promotion of Information Resources in Science https://doi.org/10.6028/NIST.FIPS.180-4. and Technology (pp. 257–261). Presented at the Cooperation and Miller, D., Whitlock, J., Gardiner, M., Ralphson, M., Ratovsky, R., Sarid, Promotion of Information Resources in Science and Technology, U.. (2020). OpenAPI specification, version 3.03. OpenAPI. http:// 2009. COINFO ‘09. Fourth International Conference on https:// spec.openapis.org/oas/v3.0.3. Accessed 2 February 2021. doi.org/10.1109/COINFO.2009.66. Data Citation Synthesis Group. (2014). Joint Declaration of Data Brewka, G., Polberg, S., & Woltran, S. (2014). Generalizations of Dung Citation Principles. San Diego CA: Future of research communica- frameworks and their role in formal argumentation. Intelligent tion and e-scholarship (FORCE11). https://doi.org/10.25490/a97f- Systems, IEEE, 29(1), 30–38. https://doi.org/10.1109/MIS.2013. egyk. Dung, P. M. (1995). On the acceptability of arguments and its fundamen- Brinckman, A., Chard, K., Gaffney, N., Hategan, M., Jones, M. B., tal role in nonmonotonic reasoning, logic programming and n- Kowalik, K., Kulasekaran, S., Ludäscher, B., Mecum, B. D., person games. Artif Intell, 77(2), 321–357. https://doi.org/10.1016/ Nabrzyski, J., Stodden, V., Taylor, I. J., Turk, M. J., & Turner, K. 0004-3702(94)00041-x. (2019). Computing environments for reproducibility: Capturing the Dung, P. M., & Thang, P. M. (2018). Representing the semantics of “Whole Tale.”. Futur Gener Comput Syst, 94,854–867. https://doi. abstract dialectical frameworks based on arguments and attacks. org/10.1016/j.future.2017.12.029. Argument & Computation, 9(3), 249–267. https://doi.org/10.3233/ Brody, J. A., Morrison, A. C., Bis, J. C., O’Connell, J. R., Brown, M. R., AAC-180427. Huffman, J. E., et al. (2017). Analysis commons, a team approach to Neuroinform (2022) 20:187–202 201 Ellison, A. M., Boose, E. R., Lerner, B. S., Fong, E., & Seltzer, M. Lau, J. W., Lehnert, E., Sethi, A., Malhotra, R., Kaushik, G., Onder, Z., et al. (2017). The Cancer genomics cloud: Collaborative, reproduc- (2020). The End-to-End Provenance Project. Patterns, 1(2), 100016. https://doi.org/10.1016/j.patter.2020.100016. ible, and democratized-a new paradigm in large-scale computational research. Cancer Res, 77(21), e3–e6. https://doi.org/10.1158/0008- Fenner, M., Clark, T., Katz, D., Crosas, M., Cruse, P., Kunze, J., & 5472.CAN-17-0387. Wimalaratne, S. (2018, July 23). Core metadata for GUIDs. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, National Institutes of Health, Data Commons Pilot Phase D., et al. (2013). PROV-O: The PROV Ontology W3C Consortium. https://public.nihdatacommons.us/DCPPC-DRAFT- Recommendation 30 April 2013. http://www.w3.org/TR/prov-o/ 7_KC2/. Leite, L., Rocha, C., Kon, F., Milojicic, D., & Meirelles, P. (2020). A Fenner, M., Crosas, M., Grethe, J. S., Kennedy, D., Hermjakob, H., survey of DevOps concepts and challenges. ACM Comput Surv, Rocca-Serra, P., Durand, G., Berjon, R., Karcher, S., Martone, M., 52(6), 1–35. https://doi.org/10.1145/3359981. & Clark, T. (2019). A data citation roadmap for scholarly data re- Levinson, M. A., Niestroy, J., Al Manir, S., Fairchild, K. D., Lake, D. E., positories. Scientific Data, 6(1), 28. https://doi.org/10.1038/s41597- Moorman, J. R., & Clark, T. (2021). Fairscape v0.1.0 Release. 019-0031-8. CERN Zenodo. DOI:https://doi.org/10.5281/zenodo.4711204. Gil, Y., Miles, S., Belhajjame, K., Deus, H., Garijo, D., Klyne, G., et al. Lewis, J., & Fowler, M. (2014). Microservices: A definition of this new (2013, April 30). PROV Model Primer: W3C Working Group Note architectural term. MartinFowler.com. https://martinfowler.com/ 30 April 2013. World Wide Web Consortium (W3C). https://www. articles/microservices.html#ProductsNotProjects w3.org/TR/prov-primer/ Malhotra, R., Seth, I., Lehnert, E., Zhao, J., Kaushik, G., Williams, E. H., Gottifredi, S., Cohen, A., García, A. J., & Simari, G. R. (2018). Sethi, A., & Davis-Dusenbery, B. N. (2017). Using the seven brid- Characterizing acceptability semantics of argumentation frame- ges Cancer genomics cloud to access and analyze petabytes of works with recursive attack and support relations. Artif Intell, 262, Cancer Data. Curr Protoc Bioinformatics, 60, 11.16.1–11.16.32. 336–368. https://doi.org/10.1016/j.artint.2018.06.008. https://doi.org/10.1002/cpbi.39. Greenberg, S. A. (2009). How citation distortions create unfounded au- Merkys, A., Mounet, N., Cepellotti, A., Marzari, N., Gražulis, S., & Pizzi, thority: Analysis of a citation network. Br Med J, 339, b2680. G. (2017). A posteriori metadata from automated provenance track- https://doi.org/10.1136/bmj.b2680. ing: Integration of AiiDA and TCOD. Journal of Cheminformatics, Greenberg, S. A. (2011). Understanding belief using citation networks. J 9(1), 56. https://doi.org/10.1186/s13321-017-0242-y. Eval Clin Pract, 17(2), 389–393. https://doi.org/10.1111/j.1365- Moreau, L., Missier, P., Belhajjame, K., B’Far, R., Cheney, J., Coppens, 2753.2011.01646.x. S., et al. (2013). PROV-DM: The PROV Data model: W3C recom- Grossman, R. L. (2019). Data Lakes, clouds, and Commons: A review of mendation 30 April 2013. World Wide Web Consortium. http:// platforms for analyzing and sharing genomic Data. Trends Genet, www.w3.org/TR/prov-dm/ 35(3), 223–234. https://doi.org/10.1016/j.tig.2018.12.006. NIH Data Commons Pilot: Object registration service (ORS). (2018). Groth, P., Cousijn, H., Clark, T., & Goble, C. (2020). FAIR Data reuse – https://github.com/mlev71/ors_wsgi The Path through Data citation. Data Intelligence, 2(1–2), 78–86. Notter, M. (2020). Nipype tutorial. Example 1: Preprocessing workflow. https://doi.org/10.1162/dint_a_00030. Github. https://miykael.github.io/nipype_tutorial/notebooks/ Ishii, S., Iwama, A., Koseki, H., Shinkai, Y., Taga, T., & Watanabe, J. example_preprocessing.html. Accessed 5 February 2021. (2014). Report on STAP Cell Research Paper Investigation (p. 11). Papadimitriou, G., Wang, C., Vahi, K., da Silva, R. F., Mandal, A., Liu, Saitama, JP: RIKEN. http://www3.riken.jp/stap/e/f1document1.pdf Z., Mayani, R., Rynge, M., Kiran, M., Lynch, V. E., Kettimuthu, R., Juty, N., Wimalaratne, S. M., Soiland-Reyes, S., Kunze, J., Goble, C. A., Deelman, E., Vetter, J. S., & Foster, I. (2021). End-to-end online & Clark, T. (2020). Unique, persistent, resolvable: Identifiers as the performance data capture and analysis for scientific workflows. foundation of FAIR. Data Intelligence, 2(1–2), 30–39. https://doi. Futur Gener Comput Syst, 117,387–400. https://doi.org/10.1016/j. org/10.5281/zenodo.3267434. future.2020.11.024. Katz, D., Chue Hong, N., Clark, T., Muench, A., Stall, S., Bouquin, D., Prager,E.M.,Chambers,K.E.,Plotkin,J.L.,McArthur,D. L., et al. (2021a). Recognizing the value of software: A software cita- Bandrowski, A. E., Bansal, N., Martone, M. E., Bergstrom, H. C., tion guide [version 2; peer review: 2 approved]. F1000Research, Bespalov, A., & Graf, C. (2018). Improving transparency and sci- 9(1257). https://doi.org/10.12688/f1000research.26932.2. entific rigor in academic publishing. Brain and Behavior, 9, e01141. Katz, D. S., Gruenpeter, M., Honeyman, T., Hwang, L., Sochat, V., Anzt, https://doi.org/10.1002/brb3.1141. H., & Goble, C. (2021b). A Fresh Look at FAIR for Research Rahwan, I.(Ed.).(2009). Argumentation in artificial intelligence. Software, 35. Springer. Khan, F. Z., Soiland-Reyes, S., Sinnott, R. O., Lonie, A., Goble, C., & RETRACTED ARTICLE: Stimulus-triggered fate conversion of somatic Crusoe, M. R. (2019). Sharing interoperable workflow provenance: cells into pluripotency. (2014). PubPeer: The Online Journal Club. A review of best practices and their practical application in ht tps: //pubpeer. com /public ati o ns/ CWLProv. GigaScience, 8(11). https://doi.org/10.1093/ B9BF2D3E83DF32CAEFFDAC159A2A94#14 gigascience/giz095. RIKEN. (2014). Interim report on the investigation of the Obokata et al. King, G. (2007). An introduction to the Dataverse network as an infra- articles. RIKEN. https://www.riken.jp/en/news_pubs/research_ structure for Data sharing. Sociol Methods Res, 36(2), 173–199. news/pr/2014/20140314_1/ https://doi.org/10.1177/0049124107306660. Shannon, P. (2003). Cytoscape: A software environment for integrated Kunze, J., & Rodgers, R. (2008). The ARK Identifier Scheme. University models of biomolecular interaction networks. Genome Res, 13(11), of California, Office of the President. https://escholarship.org/uc/ 2498–2504. https://doi.org/10.1101/gr.1239303. item/9p9863nc Shapin, S. (1984). Pump and circumstance: Robert Boyle’sliterary tech- Lamprecht, A.-L., Garcia, L., Kuzak, M., Martinez, C., Arcila, R., Martin nology. Soc Stud Sci, 14(4), 481–520 http://sss.sagepub.com/ Del Pico, E., et al. (2020). Towards FAIR principles for research content/14/4/481.abstractN2. software. Data Science, 3(1), 37–59. https://doi.org/10.3233/DS- Shiu, A. (2014). The STAP scandal: A post-pub review success story. 190026. Publons. https://publons.com/blog/the-stap-scandal-a-post-pub- review-success-story/ Larrucea, X., Santamaria, I., Colomo-Palacios, R., & Ebert, C. (2018). Smith, A. M., Katz, D. S., Niemeyer, K. E., & FORCE11 Software Microservices. IEEE Softw, 35(3), 96–100. https://doi.org/10.1109/ MS.2018.2141030. Citation Working Group. (2016). Software citation principles. 202 Neuroinform (2022) 20:187–202 PeerJ Computer Science, 2,e86. https://doi.org/10.7717/peerj-cs. Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Software Heritage Foundation. (2020, May 14). SoftWare Heritage per- Finkers,R.,Gonzalez-Beltran,A.,Gray,A. J.G.,Groth,P., sistent IDentifiers (SWHIDs), version 1.5. Software Heritage Goble, C., Grethe, J. S., Heringa, J., ’t Hoen, P. A. C., Hooft, R., Foundation. https://docs.softwareheritage.org/devel/swh-model/ Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., persistent-identifiers.html#overview. Accessed 5 February 2021. Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, Starr, J., Castro, E., Crosas, M., Dumontier, M., Downs, R. R., Duerr, R., R., Sansone, S. A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Haak, L. L., Haendel, M., Herman, I., Hodson, S., Hourclé, J., Kratz, Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., J. E., Lin, J., Nielsen, L. H., Nurnberger, A., Proell, S., Rauber, A., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Sacchi, S., Smith, A., Taylor, M., & Clark, T. (2015). Achieving Zhao, J., & Mons, B. (2016). The FAIR guiding principles for sci- human and machine accessibility of cited data in scholarly publica- entific data management and stewardship. Scientific Data, 3, tions. PeerJ Computer Science, 1,1. https://doi.org/10.7717/peerj- 160018. https://doi.org/10.1038/sdata.2016.18. cs.1. Wilson, S., Fitzsimons, M., Ferguson, M., Heath, A., Jensen, M., Miller, Tuecke, S., Ananthakrishnan, R., Chard, K., Lidman, M., McCollam, B., J., Murphy, M. W., Porter, J., Sahni, H., Staudt, L., Tang, Y., Wang, Rosen, S., & Foster, I. (2016). Globus auth: A research identity and Z., Yu, C., Zhang, J., Ferretti, V., Grossman, R. L., & GDC Project. access management platform. In 2016 IEEE 12th International (2017). Developing Cancer informatics applications and tools using Conference on e-Science (e-Science) (pp. 203–212). Presented at the NCI genomic Data Commons API. Cancer Res, 77(21), e15– the 2016 IEEE 12th international conference on e-science (e-sci- e18. https://doi.org/10.1158/0008-5472.CAN-17-0598. ence), Baltimore, MD, USA: IEEE https://doi.org/10.1109/ Yakutovich, A. V., Eimre, K., Schütt, O., Talirz, L., Adorf, C. S., eScience.2016.7870901. Andersen, C. W., Ditler, E., du, D., Passerone, D., Smit, B., Uhlir, P. (2012). For Attribution - Developing Data Attribution and Marzari, N., Pizzi, G., & Pignedoli, C. A. (2021). AiiDAlab – An Citation Practices and Standards: Summary of an International ecosystem for developing, executing, and sharing scientific Workshop (2012) (p. 220). The National Academies Press. http:// workflows. Comput Mater Sci, 188, 110165. https://doi.org/10. www.nap.edu/catalog.php?record_id=13564 1016/j.commatsci.2020.110165. Wan, X., Guan, X., Wang, T., Bai, G., & Choi, B.-Y. (2018). Application deployment using microservice and Docker containers: Framework Publisher’sNote Springer Nature remains neutral with regard to jurisdic- and optimization. J Netw Comput Appl, 119,97–109. https://doi. tional claims in published maps and institutional affiliations. org/10.1016/j.jnca.2018.07.003. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton,M.,Baak,A.,Blomberg,N.,Boiten,J.W.,daSilva http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Neuroinformatics Springer Journals

FAIRSCAPE: a Framework for FAIR and Reproducible Biomedical Analytics

Loading next page...
 
/lp/springer-journals/fairscape-a-framework-for-fair-and-reproducible-biomedical-analytics-Y8RQh7Kcs3
Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2021
ISSN
1539-2791
eISSN
1559-0089
DOI
10.1007/s12021-021-09529-4
Publisher site
See Article on Publisher Site

Abstract

Results of computational analyses require transparent disclosure of their supporting resources, while the analyses themselves often can be very large scale and involve multiple processing steps separated in time. Evidence for the correctness of any analysis should include not only a textual description, but also a formal record of the computations which produced the result, including accessible data and software with runtime parameters, environment, and personnel involved. This article describes FAIRSCAPE, a reusable computational framework, enabling simplified access to modern scalable cloud-based components. FAIRSCAPE fully implements the FAIR data principles and extends them to provide fully FAIR Evidence, including machine-interpretable provenance of datasets, software and computations, as metadata for all computed results. The FAIRSCAPE microservices framework creates a complete Evidence Graph for every computational result, including persistent identifiers with metadata, resolvable to the software, computations, and datasets used in the computation; and stores a URI to the root of the graph in the result’s metadata. An ontology for Evidence Graphs, EVI (https://w3id.org/EVI), supports inferential reasoning over the evidence. FAIRSCAPE can run nested or disjoint workflows and preserves provenance across them. It can run Apache Spark jobs, scripts, workflows, or user-supplied containers. All objects are assigned persistent IDs, including software. All results are annotated with FAIR metadata using the evidence graph model for access, validation, reproducibility, and re-use of archived data and software. . . . . . . Keywords FAIR data FAIR software Digital Commons Evidence graph Provenance Reproducibility Agumentation Introduction Motivation Maxwell Adam Levinson, Justin Niestroy and Sadnan Al Manir Computation is an integral part of the preparation and content contributed equally to this work. of modern biomedical scientific publications, and the findings they report. Computations can range in scale from simple * Timothy Clark twclark@virginia.edu statistical routines run in Excel spreadsheets to massive or- chestrations of very large primary datasets, computational Department of Public Health Sciences (Biomedical Informatics), workflows, software, cloud environments, and services. University of Virginia School of Medicine, Charlottesville, VA, USA They typically produce data and generate images or tables as Department of Pediatrics, University of Virginia School of Medicine, output. Scientific claims of the authors are supported by evi- Charlottesville, VA, USA dence that includes reference to the theoretical constructs em- Center for Advanced Medical Analytics, University of Virginia bodied in existing domain literature, and to the experimental School of Medicine, Charlottesville, VA, USA or observational data and its analysis represented in images or Department of Medicine, University of Virginia School of Medicine, tables. Charlottesville, VA, USA Today, increasingly strict requirements are demanded to Department of Statistics, University of Virginia College and leave a digital footprint of each preparation and analysis step Graduate School of Arts and Sciences, Charlottesville, VA, USA in derivation of a finding to support reproducibility and reuse University of Virginia School of Data Science, Charlottesville, VA, of both data and tools. The widely recommended and often USA 188 Neuroinform (2022) 20:187–202 required practice by publishers and funders today is to archive the objects themselves may be retrieved, given the appropriate and cite one’s own experimental data (Cousijn et al., 2018; permissions. In this model, core metadata retrieved on resolu- Data Citation Synthesis Group, 2014; Fenner et al., 2019; tion of a persistent identifier (PID) (Juty et al., 2020; Starr Groth et al., 2020); and to make it FAIR (Wilkinson et al., et al., 2015) will include an evidence graph for the object 2016). These approaches were developed over more than a referenced by the PID. A link to the object’s evidence graph decade by a significant community of researchers, archivists, can be embedded in its metadata. funders, and publishers, prior to the current recommendations The central goals of FAIRSCAPE can be summarized as (Altman et al., 2001; Altman & King, 2007; Borgman, 2012; (1) to develop reusable cloud-based “data commons” frame- Bourne et al., 2012;Brase, 2009; CODATA/ITSCI Task works adapted for very large-scale data analysis, providing Force on Data Citation, 2013;King, 2007; Starr et al., 2015; significant value to researchers; and (2) to make the computa- Uhlir, 2012). There is increasing support among publishers tions, data, and software in these environments fully transpar- and the data science community to recommend, in addition, ent and FAIR (findable, accessible, interoperable, reusable). archiving and citing the specific software versions used in FAIRSCAPE supports a “data ecosystem” model (Grossman, analysis (Katz et al., 2021a; Smith et al., 2016), with persistent 2019) in which computational results and their provenance are identification and standardized core metadata, to establish transparent, verifiable, citable, and FAIR across the research FAIRness for research software (Katz et al., 2021b; lifecycle. We combined elements of prior work by ourselves Lamprecht et al., 2020); and to require identification via per- and others on provenance, abstract argumentation frame- sistent identifiers, of critical research reagents (A. works, data commons models, and citable research objects, Bandrowski, 2014; A. E. Bandrowski & Martone, 2016; to create the FAIRSCAPE framework. This work very signif- Prager et al., 2018). icantly extends and refactors the identifier and Metadata How do we facilitate and unify these developments? Can Services we and our colleagues developed in the NIH Data we make the recorded digital footprints as broadly useful as Commons Pilot Project Consortium (Clark et al., 2018; possible in the research ecosystem, while their generation oc- Fenner et al., 2018; NIH Data Commons Pilot: Object curs as side-effects of processes inherently useful to the re- Registration Service (ORS), 2018). searcher – for example, in large scale data analytics and data FAIRSCAPE has a unique position in comparison to other commons environments? provenance-related, reproducibility-enabling, and “data com- The solution we developed is a reusable framework for mons” projects. We combine elements of all three approaches, building provenance-aware data commons environments, while providing transparency, FAIRness, validation, and re- which we call FAIRSCAPE. It provides several features di- use of resources; and emphasize reusability of the rectly useful to the computational scientist, by simplifying and FAIRSCAPE platform itself. Our goal is to enable researchers to implement effective and useful provenance-aware compu- accelerating important data management and computational tasks; while providing, as metadata, an integrated evidence tational data commons in their own research environments, at graph of the resources used in performing the work, allowing any scale, while supporting full transparency of results across them to be retrieved, validated, reused, modified, and projects, via Evidence Graphs represented using a formal extended. ontology. Evidence graphs are formal models inspired by a large body of work in abstract argumentation (Bench-Capon & Related Work Dunne, 2007; Brewka et al., 2014; Carrera & Iglesias, 2015; Cayrol & Lagasquie-Schiex, 2009;Dung, 1995;Dung& Works focusing on provenance per se such as (Alterovitz Thang, 2018; Gottifredi et al., 2018; Rahwan, 2009), and et al., 2018; Ellison et al., 2020) and the various workflow analysis of evidence chains in biomedical publications provenance systems such as (Khan et al., 2019; (Clark et al., 2014;Greenberg, 2009, 2011), which shows that Papadimitriou et al., 2021;Yakutovich et al., 2021) are pri- the evidence for correctness of any finding, can be represented marily concerned with very detailed documentation of each as a directed acyclic support graph, an Evidence Graph. When computation on one or more datasets. The W3C PROV model combined with a graph of challenges to statements, or their (Gil et al., 2013; Lebo et al., 2013;Moreau et al., 2013)was evidence, this becomes a bipolar argument graph - or argu- developed initially to support interoperability across the trans- mentation system (Cayrol & Lagasquie-Schiex, 2009, 2010, formation logs of workflow systems. Our prior work on 2013). Micropublications (Clark et al., 2014) extending and The nodes in these graphs can readily provide metadata repurposing several core classes and predicates from W3C about the objects related to the computation, including the PROV, were preliminary work forming a basis for the EVI computation parameters and history. Each set of metadata ontology (Al Manir et al., 2021a, 2021b). may be indexed by one or more persistent identifiers, as spec- The EVI ontology used in FAIRSCAPE to represent evi- ified in the FAIR principles; and may include a URI by which dence graphs, is concerned with creating reasonable Neuroinform (2022) 20:187–202 189 transparency of evidence supporting scientific claims, includ- Enabling Transparency through EVI’sFormalModel ing computational results; it reuses the three major PROV classes Entity, Activity,and Agent as a basis to develop a To enable the necessary results transparency across separate detailed ontology and rule system for reasoning across the computations, we abstracted core elements of our evidence for (and against) results. When a computational re- micropublications model (Clark et al., 2014) to create EVI sult is reused in any new computation, that information is (http://w3id.org/EVI), an ontology of evidence relationships added to the graph, whether or not the operations were con- that extends W3C PROV to support specific evidence types trolled by a workflow manager. Challenges to results, found in biomedical publications; and enable reasoning across datasets, or methods, may also be added to the graph. While deep evidence graphs, and propagation of evidence challenges our current use of EVI is on computational evidence, it is deep in the graph, such as: retractions, reagent contamination, designed to be extensible to objects across the full experimen- errors detected in algorithms, disputed validity of methods, tal and publication lifecycle. challenges to validity of animal models, and others. EVI is Systems providing data commons environments, such as based on the fundamental idea that scientific findings or the various NCI and NHLBI cloud platforms (Birger et al., claims are not facts, but assertions backed by some level of 2017; Brody et al., 2017; Lau et al., 2017; Malhotra et al., evidence, i.e., they are defeasible components of 2017;Wilson et al., 2017) while providing many highly useful argumentation. Therefore, EVI focuses on the structure of specialized capabilities for their domain users, including re- evidence chains that support or challenge a result, and on use of data and software, have not focused extensively on providing access to the resources identified in those chains. providing re-use of their own frameworks, and are centralized. Evidence in a scientific article is in essence, a record of the As noted later in this article, FAIRSCAPE can be – and is provenance of the finding, result, or claim asserted as likely to meant to be - installed on public, private, or hybrid cloud be true; along with the theoretical background material platforms, “bare metal” clusters, and even on high-end lap- supporting the result’s interpretation. tops, for use at varying scopes – personal, institution-wide, If the data and software used in analysis are all registered lab-wide, multi-center, etc. and receive persistent identifiers (PIDs) with appropriate Reproducibility platforms such as Whole Tale and metadata, a provenance-aware computational data lake, i.e., CodeOcean, (Brinckman et al., 2019;Chard etal., 2019; a data lake with provenance-tracking computational services, Merkys et al., 2017) attempt to take on a one-stop-shop role can be built that attaches evidence graphs to the output of each for researchers wishing to demonstrate or at least assert, repro- process. At some point, a citable object - a dataset, image, ducibility of their computational research. Of these, figure, or table will be produced as part of the research. If this, CodeOcean (https://codeocean.com) is a special case – it is too, is archived with its evidence graph as part of the metadata run by a company and appears to be principally described in and the final supporting object is either directly cited in the press releases, and not in any peer reviewed articles. text, or in a figure caption, then the complete evidence graph FAIRSCAPE’s primary goals are to enable construction of may be retrieved as a validation of the object’s derivation and multi-scale computational data lakes, or commons; and to as a set of URIs resolvable to reusable versions of the toolsets make results transparent for reuse across the digital research and data. Evidence graphs are themselves entities that can be ecosystem, via FAIRness of data, software, and computational consumed and extended at each transformation or records. FAIRSCAPE supports reproducibility via computation. transparency. The remainder of this article describes the approach, In very many cases - such as the very large analytic microservices architecture, and interaction model of the workflows in our first use case - we believe that no reviewer FAIRSCAPE framework in detail. will attempt to replicate such large-scale computations, which ran for months on substantial resources. The primary use case will be validation via inspection, and en passant validation via Materials and Methods software reuse. FAIRSCAPE is not meant to be a one-stop shop. It is a FAIRSCAPE Architectural Layers transferable, reusable framework. It is not only intended to enable localized participation in a global, fully FAIR data FAIRSCAPE is built on a multi-layer set of components using and software ecosystem – it is itself FAIR software. The a containerized microservice architecture (MSA) (Balalaie FAIRSCAPE software, including installation and deployment et al., 2016; Larrucea et al., 2018; Lewis & Fowler, 2014; instructions, is available in the CERN Zenodo archive Wan et al., 2018) running under Kubernetes (Burns et al., (Levinson et al., 2021); and in the FAIRSCAPE Github re- 2016). We run our local instance in an OpenStack (Adkins, pository (https://github.com/fairscape/fairscape). 2016) private cloud environment, and maintain it using a DevOps deployment process (Balalaie et al., 2016;Leite 190 Neuroinform (2022) 20:187–202 et al., 2020). FAIRSCAPE may also be installed on laptops API Gateway running minikube in Ubuntu Linux, MacOS, or Windows environments; and on Google Cloud managed Kubernetes. Access to the FAIRSCAPE environment is through an API An architectural sketch of this model is shown in Fig. 1. gateway, mediated by a reverse proxy. Our gateway is medi- Ingress to microservices in the various layers is through a ated by Traefik (https://traefik.io) which dispatches calls to the reverse proxy using an API gateway pattern. The top layer pro- various microservices endpoints. vides an interface to the end users with raw data and the associ- Traefik is a reverse proxy that we configure as a ated metadata. The mid layer is a collection of tightly coupled Kubernetes Ingress Controller, to dynamically configure services that allow end users with proper authorization to submit and expose multiple microservices using a single API. and view their data, metadata, and various types of computations The endpoints of the services are exposed through the performed on them. The bottom layer is built with special pur- OpenAPI specification (formerly Swagger Specification) pose storage and analytics platforms for storing and analyzing (Darrel Miller et al., 2020) which defines the standard and data, metadata and provenance information. All objects are the language-agnostic interface for publishing RESTful APIs assigned PIDs using local ARK (Kunze & Rodgers, 2008)as- and allows service discovery. Accessing the services requires signment for speed, with global resolution for generality. user authentication, which we implement using the Globus Auth authentication broker (Tuecke et al., 2016). Users of GlobusAuth may be authenticated via a number of permitted UI Layer authentication services, and are issued a token which serves as an identity credential. In our current installation we require use The User Interface layer in FAIRSCAPE offers end of the CommonShare authenticator, with site-specific two-fac- users various ways to utilize the functionalities in the tor authentication necessary to obtain an identity token. This framework. A Python client simplifies calls to the token is then used by the microservices to determine a user’s microservices. Data, metadata, software, scripts, permission to access various functionality. workflows, containers, etc. are all submitted and regis- tered by the end users from the UI Layer, which may Authentication and Authorization Layer be configured to include an interactive executable note- book environment such as Binder or Deepnote. Authentication and authorization (authN/authZ) in FAIRSCAPE are handled by Keycloak (Christie et al., Fig. 1 FAIRSCAPE architectural layers and components Neuroinform (2022) 20:187–202 191 2020), a widely-used open source identity and access manage- storage. These objects may include structured or unstructured ment tool. data, application software, workflow, scripts. The associated When Traefik receives a request, it handles an authentica- metadata contains essential descriptive information such as tion check to Keycloak, which then determines whether or not context, type, name, textual description, author, location, the requestor has a valid token for an identity that can perform checksum, etc. about these objects. Metadata are expressed the requested action. as JSON-LD and sent to the Metadata Service for further We distribute FAIRSCAPE with a preconfigured Keycloak processing. for basic username / password authentication & authorization Hashing is used to verify correct transmission of the object of service requests. This can be easily modified to support – users are required to specify a hash which is then alternative identity providers, including LDAP, OpenID recomputed by the Object Service after the object is stored. Connect, and OAuth2.0 for institutional single sign-on. Hash computation is currently based on the SHA-256 secure Services continue to interact the same way, even if you change cryptographic hash algorithm (Dang, 2015). Upon successful the configured identity provider. execution, the service returns a PID of the object in the form of Within our local Keycloak configuration, we chose to define an ARK, which resolves to the metadata. The metadata in- Globus Auth as the identity provider. Globus Auth then serves cludes, as is normal in PID architecture (Starr et al., 2015), a as a dispatching broker amongst multiple other possible final link to the actual data location. identity providers. We selected the login service at the An OpenAPI description of the interface is here: University of Virginia as our final provider, providing two- https://app.swaggerhub.com/apis/FAIRSCAPE/Transfer/ factor authentication and institutional single sign-on. Keycloak 0.1 is very flexible in allowing selection of various authentication schemes, such as LDAP, SAML, OAuth2.0, etc. Selection of Metadata Service authentication schemes is an administrator decision. The Metadata Service handles metadata registration and reso- Microservices Layer lution including identifier minting in association with the ob- ject metadata. The Metadata Service takes user POSTed The microservices layer is composed of seven services: (1) JSON-LD metadata and uploads the metadata to MongoDB Transfer, (2) Metadata, (3) Object, (4) Evidence Graph, (5) and Stardog, and returns a PID. To retrieve metadata for an Compute, (6) Search, and (7) Visualization. These are de- existing PID a user makes a GET call to the service. A PUT scribed in more detail in Section 2. Each microservice does call to the service will update an existing PID with new meta- its own request authorization, subsequent to Keycloak, en- data. While other services may read from MongoDB and abling fine-grained access control. Stardog directly, the Metadata Service handles all writes to MongoDB and Stardog. Storage and Analytic Engine Layer An OpenAPI description of the interface is here: https://app.swaggerhub.com/apis/FAIRSCAPE/Metadata- In FAIRSCAPE, an S3 compatible object store is required for Service/0.1 storing objects, a document store for storing metadata, and a graph store for storing graph data. Persistence for these data- Object Service bases is configured through Kubernetes volumes, which map specific paths on containers to disk storage. The current re- The Object Service provides a direct interface between the lease of FAIRSCAPE uses the S3 compatible MinIO as the Transfer Service and MinIO as well as maintaining consisten- object store, MongoDB as the document store, and Stardog as cy between MinIO and the metadata store. The Object Service the graph store. Computations invoked by the Compute handles uploads of new objects as well as uploading new Service are managed by Kubernetes, Apache SPARK, and versions of existing files. In both cases the Object Service the Nipype neuroinformatics workflow engine. accepts a file and desired file location as inputs and (if the location is available) uploads the file to desired location in FAIRSCAPE Microservice Components MinIO and returns a PID representing the location of the uploaded file. A DELETE call to the service will delete the Transfer Service requested file from MinIO as well as delete the PID with the link to the data, however the PID representing the object meta- This service transfers and registers digital research objects - data remains. datasets, software, etc., − and their associated metadata, to the An OpenAPI description of the interface is here: Commons. These objects are sent to the transfer service as https://app.swaggerhub.com/apis/FAIRSCAPE/Object- binary data streams, which are then stored in MinIO object Service/0.1 192 Neuroinform (2022) 20:187–202 Evidence Graph Service performs a search over all literals in the metadata for exact string matches and returns a list of all PIDs with a literal con- The Evidence Graph Service creates a JSON-LD Evidence taining the query string. It is invoked via the GET method of Graph of all provenance related metadata to a PID of interest. API endpoint to the service with the search string as argument. The Evidence Graph documents all objects such as datasets, An OpenAPI description of the interface is here: software, workflows, and the computations which are directly https://app.swaggerhub.com/apis/FAIRSCAPE/Search/0.1 involved in creating the requested entity. The service accepts a PID as its input, runs a PATH query built on top of the SPARQL Visualization Service query engine in Stardog with the PID of interest as its source to retrieve all supporting nodes. To retrieve an Evidence Graph for This service allows users to visualize Evidence Graphs inter- a PID a user may make a GET call to the service. actively in the form of nodes and directed edges, offering a The Evidence Graph Service plays an important role in consolidated view of the entities and the activities supporting reproducing computations. All resources required to run a correctness of the computed result. Our current visualization computation are exposed using persistent identifiers by the engine is Cytoscape (Shannon, 2003). Each node displays its evidence graph. A user can reproduce the same computation relevant metadata information, including its type and PID, by invoking the appropriate services available through the resolved in real-time. Python client with the help of these identifiers. This feature The Visualization Service renders the graph on an HTML allows a user to verify the accuracy of the results and detect page. any discrepancies. An OpenAPI description of the interface is here: An OpenAPI description of the interface is here: https://app.swaggerhub.com/apis/FAIRSCAPE/ https://app.swaggerhub.com/apis/FAIRSCAPE/Evidence- Visualization/0.1 Graph/0.1 FAIRSCAPE Service Orchestration Compute Service FAIRSCAPE orchestrates a set of containers to provide pat- This service executes user uploaded scripts, workflows, or con- terns for object registration, including identifier minting and tainers, on uploaded data. It currently offers two compute en- resolution; object retrieval; computation; search; evidence gines (Spark, Nipype) in addition to native Kubernetes container graph visualization, and object deletion. These patterns are execution, to meet a variety of computational needs. Users may orchestrated following API ingress, authentication, and ser- execute any script they would like to run as long as they provide vice dispatch, by microservice calls, invoking the relevant a docker container with the required dependencies. To complete service containers. jobs the service spawns specialized pods on Kubernetes de- signed to perform domain specific computations that can be Object Registration scaled to the size of the cluster. This service provides the essen- tial ability to recreate computations based solely on identifiers. Object registration occurs initially via the Transfer Service, For data to be computed on it must first be uploaded via the with an explicit user service call, and again automatically Transfer Service and be issued an associated PID. using the same service, each time a computation generates The service accepts a PID for a dataset, a script, software, output. Objects in FAIRSCAPE may be software, containers, or a container, as input and produces a PID representing the or datasets. Descriptive metadata must be specified for object activity to be completed. The request returns a job identifier registration to occur. from which job progress can be followed. Upon completion of When invoked, the Transfer Service calls the Metadata a job all outputs are automatically uploaded and assigned new Service (MDS) to mint a new persistent identifier, implement- PIDs, with provenance aware metadata. At job termination, ed as an Archival Resource Key (ARK), generated locally, the service performs a ‘cleanup’ operation, where a job is and to store it associated with the descriptive metadata, includ- removed from the queue once it is completed. ing the new registered object location. MDS stores object An OpenAPI description of the interface is here: metadata, including provenance, in both MongoDB and in https://app.swaggerhub.com/apis/FAIRSCAPE/Compute/ the Stardog graph store, allowing subsequent access to the 0.1 object metadata by other services. After minting an identifier and storing the metadata, the Search Service Transfer Service calls the Object Service to persist the new object, and then updates the metadata with the stored object The Search Service allows users to search for object metadata location. Hashing is used to verify correct transmission of the containing strings of interest. It accepts a string as input and object – users are requiredtospecify aSHA256hashon Neuroinform (2022) 20:187–202 193 registration, which is then recomputed by the Object Service workloads in the Compute Service enables all data, results, and verified after the object is stored. Internally computed and methods to be tracked via a connected evidence graph, hashes are provided for re-verification when the object is with persistent identifiers available for every node. accessed. Failure of hashes to match generates an error. The Compute Service executes computations using (a) a container specified by the user, or (b) the Apache Spark ser- Identifier Minting vice, or (c) the Nipype workflow engine. Like datasets and software (including scripts), computations are represented by The Metadata Service mints PIDs in the form of ARKs. persistent identifiers assigned to them. Objects are passed to Multiple alternative PIDs may exist for an object and PIDs the Compute Service by their PIDs and the computation is are resolved to their associated object level metadata including formally linked to the software (or script) by the the object’s Evidence Graph and location with appropriate usedSoftware property, and the input datasets by the permissions. usedDataset property. In the current deployment, ARKs created locally are regis- Runtime parameters may be passed with objects and a sin- tered to an assigned Name Assigning Authority Number. The gle identifier is minted with the given parameters and connect- ARK globally unique identifier ecosystem employs a flexible ed to the computation via the ‘parameters’ property. However, minimalistic standard and existing infrastructure. at this moment these parameters are not incorporated in the evidence graph. Identifier Resolution The Compute Service spawns a Kubernetes pod with the input objects mounted in the /data directory by default. ARK identifier resolution may be handled locally and/or by Upon completion of the job all output files in the /outputs external resolver services such as Name-to-Thing (https://n2t. directory are transferred to the object store and identifiers net). The Name-to-Thing resolver allows for Name Assigning for them are minted with the property generatedBy. The Authority Numbers (NAAN) to have redirect rules for their generatedBy property references the identifier for the ARKs, which forwards requests to the Name Mapping computation. Authority Hostport for the corresponding commons. Each FAIRSCAPE instance should independently obtain a Object Search NAAN, and a DNS name for their local FAIRSCAPE instal- lation, if they wish their ARKs to be resolved by n2t.net. Object searches are performed by the Search Service, DataCite DOI registration and resolution are planned for called directly on Service Dispatch. Search makes use of future work. Stardog’s full text retrieval, which in turn is based on Apache Lucene. Object Retrieval Objects are accessed by their PID, after prior resolution of the Evidence Graph Visualization object’s PID to its metadata (MDS) and authorization of the user’s authentication token for data access on that object. Evidence graphs of any object acquired by the system may be Object access is either directly from the Object Store, or from visualized at any point in this workflow using the wherever else the object may reside. Certain large objects Visualization Service. Nipype provides a chart of the residing in robust external archives, may not be acquired into workflows it executes using the Graphviz package. Our local object storage, but remain in place, up to the point of Evidence Graph Service is interactive, using the Cytoscape computation. package (Shannon, 2003), and allows Evidence graphs of multiple workflows in sequence to be displayed whether or Computation not they have been combined into a single flow. When executing a workload through the compute service, data, software, and containers are referenced through their Object Deletion PIDs, and by no other means. The Compute Service utilizes the stored metadata to dereference the object locations, and Objects are deleted by calls to the Object Service to clear the transfers them to the managed containers. The compute ser- object from storage, which then calls MDS and nulls out the vice also creates a provenance record of its own execution, object location in the metadata record. Metadata is retained associated with an identifier of type evi:Computation. Upon even though the object may cease to be held in the system, in the completion of a job, the Compute Service stores the gen- accordance with the Data Citation Principles (Data Citation erated output through the Transfer Service. Running Synthesis Group, 2014). 194 Neuroinform (2022) 20:187–202 Results series and make it easier for researchers to quickly build models for outcomes where the physiology is not Two use cases are presented here to demonstrate the applica- known. FAIRSCAPE can be used to build such models tion of FAIRSCAPE services. The first use case performs using its services. analysis of time series algorithms while the second runs a A series of steps are required to execute and reproduce the neuroimaging workflow. For each use case, the operations HCTSA of NICU data. They include transferring NICU data, involving data transfer, computation, and evidence graph gen- Python scripts, and associated metadata to the storage to be eration are described below: used later, running the scripts in the FAIRSCAPE compute environment, and generating the evidence graph. The first Use Case Demonstration 1: Highly Comparative Time script runs the time series algorithms on the vital sign data Series Analysis (HCTSA) of NICU Data while the second script performs clustering of these algo- rithms to generate a heatmap image. The evidence graph gen- Researchers at the Neonatal Intensive Care Unit erated for all patients contains over 17,000 nodes. However, a (NICU) at the University of Virginia continuously mon- simplified version of the computational analysis based on a itor infants and collect vital signs such as heart rate single patient is described here as the steps for executing the (HR) and oxygen saturation. Patterns in vital sign mea- analysis are common for all patients. These steps are briefly surements may indicate acute or chronic pathology described below: among infants. In the past a few standard statistics and algorithms specially designed for discovering cer- Transfer Data, Software and Metadata tain pathologies were applied to similar vital sign data. In this work we additionally applied many time series algorithms from other domains with the hope that these Before any computation is performed FAIRSCAPE re- algorithms would be helpful for prediction of unstudied quires each piece of data and software to be uploaded outcomes. A total of 67 time series algorithms have to the object store with its metadata using the POST been recoded as Python scripts and run on the vital method of the transfer service. The raw data file con- signs of 5997 infants collected over 10 years during taining the vital signs and the associated metadata are 2009–2019. The data are then merged, sampled and uploaded first. The scripts and the associated metadata clustered to find representative time series algorithms are uploaded next. The upload_file function shown below is used to transfer which express unique characteristics about the time the raw data file UVA_7219_HR.csv as the first parameter and the associated metadata referenced by the variable dataset_meta as the second parameter: Neuroinform (2022) 20:187–202 195 As part of the transfer, identifiers are minted by the Generating the Evidence Graph Metadata Service for each successfully uploaded object. The variable raw_time_series_fs_id refers to the minted identifier An evidence graph is generated using the GET method of the returned by the function. Each identifier is resolvable to the Evidence Graph Service. Figure 3 illustrates all computations uploaded object which can be accessed only by an authorized and the associated inputs and outputs for a single patient. The user. graph for all patients contains 17,995 nodes of types Image, Computation, Dataset and Software. Each patient has a unique Raw Time Series Feature Set, a Raw Data Analysis Time Series Data Analysis computation, and a Processed Time Series file. The Raw Data Analysis Script, HCTSA Clustering Script, HCTSA Heatmap Once the transfer is complete, the computation for the data Generation, and HCTSA Heatmap are shared among all analysis can be started. The computation takes the raw vital patients. sign measurements as input, groups the measurements into The simplified evidence graph in Fig. 3 contains 7 nodes, 10-min intervals and runs each algorithm on them. each with its own PID, where a Computation (Raw Data FAIRSCAPE makes launching such a computation easy by Analysis) uses a Dataset (Raw Time Series Feature Set) as executing the POST method of the compute service with iden- the input to a Software (Raw Data Analysis Script), tifiers of the data and script as parameters. The compute ser- representing the script to execute all the time series algo- vice creates an identifier with metadata pointing to the provid- ed inputs and launches a Kubernetes pod to perform the com- rithms, and generates the Dataset (Processed Time Series Feature Set) as the output. The next Computation (HCTSA putation. Upon completion of the script, all output files are assigned identifiers and stored in the object store. Cluster Heatmap Generation) uses the processed Dataset gen- erated during the previous computation as the input of the The compute function takes the PIDs of the dataset, soft- Software (HCTSA Clustering Script), which generates an ware/script, and the type of job such Apache Spark, Nipype, Image (HCTSA Cluster Heatmap), representing the clustering or custom containers as parameters. The PID it returns refers of the algorithms as an output. The evidence_graph function to the submitted job which can be used to track the progress of takes the PID of the HCTSA Heatmap image: the computation and its outputs. evidence_graph_jsond = FAIR.evidence_graph The compute function shown below is used to launch a computation on the raw data file raw_ts_fs_id as the first (HCTSA_heathmap_id) and generates the evidence graph for that PID serialized in JSON-LD (shown in Fig. 4). parameter, the analysis script raw_date_analysis_script_id as the second parameter using their identifiers, and the type 2. Use Case Demonstration 2: Neuroimaging Analysis of job spark as the third parameter: raw_date_analysis_jod_id = FAIR.compute(raw_ts_fs_id, raw_date_analysis_script_id, Using Nipype Workflow Engine spark') The PID the compute function returns resolves to the Data analysis in neuroimaging often requires multiple het- submitted jo b and is referenced by. erogeneous algorithms which sometimes lack transparent in- raw_date_analysis_jod_id. teroperability under a uniform platform. Workflows offer so- lutions to this problem by bringing the algorithms and soft- Clustering of Algorithms ware under a single umbrella. The open-source neuroimaging workflow engine Nipype combines heterogenous neuroimag- The next computation step is to perform clustering of algo- ing analysis software packages under a uniform operating rithms. Many algorithms are from similar domains and the platform which resolves the interoperability issues by operations they perform express similar characteristics. The allowing them to talk to each other. Nipype provides access HCTSA Clustering script clusters these algorithms into groups to a detailed representation of the complete execution of a which are highly correlated and a representative algorithm workflow consisting of inputs, output and runtime parameters. could be chosen from each grouping. The compute service is The containerization-friendly release, detailed workflow rep- then invoked with identifiers of the clustering script and the resentation, and minimal effort required to modify existing processed data as parameters. The compute function below services to produce a deep evidence graph have made takes PIDs of the processed time series feature set, clustering Nipype an attractive target for integration within the script, and spark job type as input parameters and returns a PID FAIRSCAPE framework. representing the job to generate the HCTSA heatmap: Among the services in FAIRSCAPE, only the compute service needed to be modified to run and interrogate Nipype. An image showing the clustered algorithms is produced at The modifications include repurposing the service to run the workflow from the Nipype-specific container generated by the the end of this step which is shown in Fig. 2. 196 Neuroinform (2022) 20:187–202 Fig. 2 NICU HCTSA clustering heatmap. X axis and Y axis are operations (algorithms using specific parameter sets), color is correlation between algorithms. The large white squares are clusters of highly correlated operations which suggest the dimension of the data may be greatly diminished by selecting “representative” algorithms from these clusters Neurodocker tool and to capture all entities from the internal contains intermediate inputs and outputs. It provides a detailed graph generated after the workflow is executed. Whereas an understanding of each analysis performed using the computa- evidence graph typically includes the primary inputs and out- tions, software and datasets. puts, the deep evidence graph produced here additionally Fig. 3 Simplified Evidence graph for one patient’s computations. Vital signs = dark blue box bottom right; computations = yellow boxes; processed data = dark blue box in middle; green box = heatmap of correlations Neuroinform (2022) 20:187–202 197 Fig. 4 JSON-LD Evidence Graph for patient computation as illustrated in Fig. 3 A workflow is considered simple if it consists of a sequence within and outside that environment. It supports every of processing steps and complex if there is nesting of workflow requirement defined in the FAIR Principles at a detailed execution such that the output of one workflow is used as the level, as defined in (Wilkinson et al., 2016), including a input to another workflow. The simple neuroimaging prepro- deep and comprehensive provenance model via cessing workflow demonstrated here (Notter, 2020) involves Evidence Graphs, contributing to more transparent sci- steps to correct motion of functional images, co-register function- ence and improved reusability of methods. al images to anatomical images, smooth the co-registered func- Scientific rigor depends on the transparency of methods (in- tional images, and detect artifacts in functional images. As part of cluding software) and materials (including data). The historian data transfer, dataset containing images and the script to run the of science Steven Shapin, described the approach developed processing steps with their associated metadata were uploaded with the first scientific journals as “virtual witnessing” using the file_upload function as shown in the previous use case. (Shapin, 1984), and this is still valid today. The typical scientific The compute function is then used to launch a computation on reader does not actually reproduce the experiment but is invited the image dataset which runs the processing script on the to review mentally every detail of how it was done to the extent repurposed Nipype container. The only exception in the compute that s/he becomes a “virtual witness” to an envisioned live dem- the function is that it uses nipype as the third input parameter onstration. That is clearly how most people read scientific papers instead of spark when the Compute Service was invoked. The - except perhaps when they are citing them, in which case less full evidence graph, generated using the evidence_graph func- care is often taken. Scientists are not really incentivized to rep- tion as demonstrated above, is too large to document due to space licate experiments; their discipline rewards novelty. constraints. Therefore, only the graph of the motion correction of The ultimate validation of any claim once it has been accept- functional images with FSL’sMCFLIRT is showninFig. 5.For ed as reasonable on its face comes with support from multiple additional details on this workflow, please consult the original distinct angles, by different investigators; successful re-use of Nipype tutorial (Notter, 2020). the materials and methods upon which it is based; and consis- tency with some body of theory. If the materials and methods are sufficiently transparent and thoroughly disclosed as to be reusable, and they cannot be made to work, or give bad results, Discussion that debunks the original experiments - precisely the way in which the promising-sounding STAP phenomenon was FAIRSCAPE enables rapid construction of a shared dig- discredited (“RETRACTED ARTICLE: Stimulus-triggered ital commons environment and supports FAIRness 198 Neuroinform (2022) 20:187–202 Fig. 5 Evidence Graph visualization for the neuroimaging workflow execution fate conversion of somatic cells into pluripotency”, 2014;Shiu, FAIRSCAPE is itself reusable and we have taken pains to pro- 2014), before the elaborate formal effort of Riken to replicate vide well-documented straightforward installation procedures. the experiments (Ishii et al., 2014;RIKEN, 2014). All resources on FAIRSCAPE are assigned identifiers As a first step then, it is not only a matter of reproducing which allow them to be shared. FAIRCAPE allows users to experiments but also of producing transparent evidence that capture the complete evidence graph of the tasks performed. the experiments have been done correctly. This permits chal- These evidence graphs show all steps of computations per- lenges to the procedures to develop over time, especially formed and the software and data that went into each compu- through re-use of materials (including data) and methods - tation. Evidence graphs, along with FAIRSCAPE’s other ser- which today significantly include software and computing vices, allow users to review and reproduce an experiment with environments. We definitely view these methods as being significantly less overhead than other standard approaches. extensible to materials such as reagents, using the RRID ap- Users can see all computations that were performed, review proach; and to other computational disciplines. the source code, and download all the data. This allows an- other party to reproduce the exact computations performed, apply the experimenter’s software to their own data, or apply Conclusion their own methods to the experimenter’sdata. The optimal use case for a FAIRSCAPE installation is a local or multi-institution digital commons, in a managed FAIRSCAPE is a reusable framework for scientific computa- Kubernetes environment. It can also be installed on high-end tions that provides a simplified interface for research users to an laptops for testing and development purposes as needed. We array of modern, dynamically scalable, cloud-based are actively looking for collaborators wishing to use, adapt, componentry. Our goal in developing FAIRSCAPE was to pro- and co-develop this software. vide an ease-of-use (and re-use) incentive for researchers, while FAIRSCAPE is not a tool for individual use, it is software rendering all the artifacts marshalled to produce a result, and the for creating a high-efficiency collective environment. As a evidence supporting them, Findable, Accessible, Interoperable, framework for sharing results and methods in the present, it and Reusable. FAIRSCAPE can be used to construct, as we also provides reliable deep provenance records across multiple have done, a provenance-aware computational data lake or Commons. It supports transparent disclosure of the Evidence computations, to support future reuse and to improve guaran- tees of reliability. This is the kind of effort we hope that cen- Graphs of computed results, with access to the persistent identi- ters and institutions will be increasingly making, to support fiers of the cited data or software, and to their stored metadata. reliable and shareable computational research results. We End-users do not need to learn a new programming language have seen increasing examples of philanthropic RFAs to use services provided by FAIRSCAPE. They require no ad- directing investment to this area. In our own research, we have ditional special expertise, other than basic familiarity with found FAIRSCAPE’s methods invaluable in supporting high- Python and the skillsets they already possess in statistics, com- ly productive computational research. putational biology, machine learning, or other data science tech- The major barriers to widespread adoption of digital com- niques. FAIRSCAPE provides an environment that makes large- mons environments, in our view, have been the relative non- scale computational work easier and results FAIRer. Neuroinform (2022) 20:187–202 199 (Harvard University), Satra Ghosh (MIT), Carole Goble (University of reusability (despite claims to the contrary) of existing com- Manchester), John Kunze (California Digital Library), Sherry Lake mons frameworks, and the additional effort required to man- (University of Virginia), Maryann Martone (University of California age a FAIR digital commons. We feel that FAIRSCAPE is a San Diego), and Neal Magee (University of Virginia), for helpful discus- contribution to resolving the reusability issue, and may also sions; and Neal Magee for technical assistance with the University of Virginia computing infrastructure. This work was supported in part by help to simplify some digital commons management issues. the U.S. National Institutes of Health, grants NIH OT3 OD025456-01, Time will tell if the philanthropic investments in these areas NIH 1U01HG009452, NIH R01-HD072071-05, and NIH U01- noted above are continued. We hope they are. HL133708-01; and by a grant from the Coulter Foundation. We plan several enhancements in future research and de- velopment with this project, such as integration of additional Declarations workflow engines, including engines for genomic analysis. We intend to provide support for DataCite DOI and Conflict of Interests The authors declare that they have no conflicts of interest. Software Heritage Identifier (SWHID) (Software Heritage Foundation, 2020) registration; with metadata and data trans- Additional Information Data used in preparing this article was obtained fer to Dataverse instances, in future. Transfer of data, soft- from the University of Virginia Center for Advanced Medical Analytics ware, and metadata to long-term digital archives such as these, and from OpenNeuro.org which are managed at the scale of universities or countries, is important in providing long-term access guarantees, beyond Rights and Permissions This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, shar- the life of an institutional center or institute. ing, adaptation, distribution and reproduction in any medium or format, as Many projects involving overlapping groups have worked long as you give appropriate credit to the original author(s) and the to address parts of the scale, accessibility, verifiability, repro- source, provide a link to the Creative Commons license, and indicate if ducibility, and reuse problems targeted by FAIRSCAPE. Such changes were made. The images or other third-party material in this article are included in the article ' s Creative Commons license, unless challenges are in large part outcomes of the transition of bio- indicated otherwise in a credit line to the material. If material is not medical and other scientific research from print to digital, and included in the article ' s Creative Commons license and your intended our increasing ability to generate data and to run computations use is not permitted by statutory regulation or exceeds the permitted use, on it at enormous scale. We make use of many of these prior you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/ approaches in our FAIRSCAPE framework, providing an in- 4.0/. tegrated model for FAIRness and reproducibility, with ease of use incentives. We believe it will be a helpful tool for con- Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adap- structing provenance-aware FAIR digital commons, as part of tation, distribution and reproduction in any medium or format, as long as an interoperating model for reproducible biomedical science. you give appropriate credit to the original author(s) and the source, pro- vide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a Information Sharing Statement credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain & Code for the microservices and the python client described permission directly from the copyright holder. To view a copy of this in this paper are publicly available in the Zenodo reposi- licence, visit http://creativecommons.org/licenses/by/4.0/. tory at https://doi.org/10.5281/zenodo.4711204 and on GitHub at https://github.com/fairscape/fairscape.All versions are available under MIT license. & Installation instructions, python Demo Notebooks, and References API documentation are available at https://fairscape. github.io, also under MIT license. Adkins, S. (2016). OpenStack: Cloud application development. & The MongoDB noSQL DB Community Version is avail- Indianapolis, IN: Wrox. http://RE5QY4SB7X.search. able under MongoDB’s license terms at https://www. serialssolutions.com/?V=1.0&L=RE5QY4SB7X&S=JCs&C= mongodb.com/try/download/community. TC0001588151&T=marc & The Stardog knowledge graph DB is available under Al Manir, S., Niestroy, J., Levinson, M. A., & Clark, T. (2021a). Evidence graphs: Supporting transparent and FAIR computation, Stardog’s license terms at https://www.stardog.com. with defeasible reasoning on Data, methods and results. BioRXiv, & The EVI ontology OWL2 vocabulary is available at 2021/437561,9. https://doi.org/10.1101/2021.03.29.437561. https://w3id.org/EVI# under MIT license. Al Manir, S., Niestroy, J., Levinson, M., & Clark, T. (2021b). EVI: The evidence graph ontology, OWL 2 vocabulary. Zenodo. https://doi. org/10.5281/zenodo.4630931. Acknowledgements We thank Chris Baker (University of New Alterovitz, G., Dean, D., Goble, C., Crusoe, M. R., Soiland-Reyes, S., Bell, A., Hayes, A., Suresh, A., Purkayastha, A., King, C. H., Brunswick), Caleb Crane (Mitsubishi Corporation), Mercè Crosas 200 Neuroinform (2022) 20:187–202 Taylor, D., Johanson, E., Thompson, E. E., Donaldson, E., discovery in a big-data environment for genetic epidemiology. Nat Genet, 49,1560–1563. https://doi.org/10.1038/ng.3968. Morizono, H., Tsang, H., Vora, J. K., Goecks, J., Yao, J., Almeida, J. S., Keeney, J., Addepalli, K. D., Krampis, K., Smith, Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). K. M., Guo, L., Walderhaug, M., Schito, M., Ezewudo, M., Borg, Omega, and Kubernetes. Commun ACM, 59(5), 50–57. Guimera, N., Walsh, P., Kahsay, R., Gottipati, S., Rodwell, T. C., https://doi.org/10.1145/2890784. Bloom, T., Lai, Y., Simonyan, V., & Mazumder, R. (2018). Carrera, Á., & Iglesias, C. A. (2015). A systematic review of argumen- Enabling precision medicine via standard communication of HTS tation techniques for multi-agent systems research. Artif Intell Rev, provenance, analysis, and results. PLoS Biol, 16(12), e3000099. 44(4), 509–535. https://doi.org/10.1007/s10462-015-9435-9. https://doi.org/10.1371/journal.pbio.3000099. Cayrol, C., & Lagasquie-Schiex, M.-C. (2009). Bipolar abstract argumen- Altman, M., Andreev, L., Diggory, M., King, G., Sone, A., Verba, S., & tation systems. In I. Rahwan & G. R. Simari (Eds.), Argumentation Kiskis, D. L. (2001). A digital library for the dissemination and in artificial intelligence. Springer. replication of quantitative social science research. Soc Sci Comput Cayrol, C., & Lagasquie-Schiex, M.-C. (2010). Coalitions of arguments: Re v, 19 (4), 458 – 470. https://doi.org/10.1177/ A tool for handling bipolar argumentation frameworks. Int J Intell Syst, 25(1), 83–109. https://doi.org/10.1002/int.20389. Altman, M., & King, G. (2007). A proposed standard for the scholarly Cayrol, C., & Lagasquie-Schiex, M.-C. (2013). Bipolarity in argumenta- citation of quantitative Data. DLib Magazine, 13(3/4), march2007- tion graphs: Towards a better understanding. Int J Approx Reason, altman. http://www.dlib.org/dlib/march07/altman/03altman.html 54(7), 876–899. https://doi.org/10.1016/j.ijar.2013.03.001. Balalaie, A., Heydarnoori, A., & Jamshidi, P. (2016). Microservices ar- Chard, K., Willis, C., Gaffney, N., Jones, M. B., Kowalik, K., Ludäscher, chitecture enables DevOps: Migration to a cloud-native architecture. B., et al. (2019). Implementing computational reproducibility in the IEEE Softw, 33(3), 42–52. https://doi.org/10.1109/MS.2016.64. whole tale environment. In Proceedings of the 2nd International Bandrowski, A. (2014). RRID’s are in the wild! Thanks to JCN and Workshop on Practical Reproducible Evaluation of Computer PeerJ. The NIF Blog: Neuroscience Information Framework. Systems - P-RECS ‘19 (pp. 17–22). Presented at the the 2nd inter- http://blog.neuinfo.org/index.php/essays/rrids-are-in-the-wild- national workshop, Phoenix, AZ, USA: ACM press. https://doi.org/ thanks-to-jcn-and-peerj 10.1145/3322790.3330594. Bandrowski, A. E., & Martone, M. E. (2016). RRIDs: A simple step Christie, M. A., Bhandar, A., Nakandala, S., Marru, S., Abeysinghe, E., toward improving reproducibility through rigor and transparency Pamidighantam, S., & Pierce, M. E. (2020). Managing authentica- of experimental methods. Neuron, 90(3), 434–436. https://doi.org/ tion and authorization in distributed science gateway middleware. 10.1016/j.neuron.2016.04.030. Futur Gener Comput Syst, 111,780–785. https://doi.org/10.1016/j. Bench-Capon, T. J. M., & Dunne, P. E. (2007). Argumentation in artifi- future.2019.07.018. cial intelligence. Artif Intell, 171(10–15), 619–641. https://doi.org/ Clark, Tim, Ciccarese, P., & Goble, C. (2014). Micropublications: A 10.1016/j.artint.2007.05.001. semantic model for claims, evidence, arguments and annotations Birger, C., Hanna, M., Salinas, E., Neff, J., Saksena, G., Livitz, D., et al. in biomedical communications. Journal of Biomedical Semantics, (2017). FireCloud, a scalable cloud-based platform for collabora- 5(1). http://www.jbiomedsem.com/content/5/1/28 tive genome analysis: Strategies for reducing and controlling costs Clark, T., Katz, D. S., Bernal Llinares, M., Castillo, C., Chard, K., Crosas, (preprint). Bioinformatics. https://doi.org/10.1101/209494. M., et al. (2018, September 3). DCPPC DRAFT: KC2 Globally Borgman, C. (2012). Why are the attribution and citation of scientific data Unique Identifier Services. National Institutes of Health, Data important? In P. Uhlir & D. Cohen (Eds.), Report from developing Commons Pilot Phase Consortium. https://public. Data attribution and citation PRactices and standards: An interna- nihdatacommons.us/DCPPC-DRAFT-8_KC2/ tional symposium and workshop. Washington DC: National CODATA/ITSCI Task Force on Data Citation. (2013). Out of cite, out of Academy of Sciences’ Board on Research Data and Information. mind: The current state of practice, policy and Technology for Data National Academies Press. http://works.bepress.com/cgi/ Citation. Data Science Journal, 12,1–75. https://doi.org/10.2481/ viewcontent.cgi?article=1286&context=borgman dsj.OSOM13-043. Bourne, P., Clark, T., Dale, R., de Waard, A., Herman, I., Hovy, E., & Cousijn, H., Kenall, A., Ganley, E., Harrison, M., Kernohan, D., Shotton, D. (2012). Improving future research communication and Lemberger, T., Murphy, F., Polischuk, P., Taylor, S., Martone, e-scholarship: A summary of findings. Informatik Spectrum, 35(1), M., & Clark, T. (2018). A data citation roadmap for scientific pub- 56–57. https://doi.org/10.1007/s00287-011-0592-1. lishers. Scientific data, 5, 180259. Brase, J. (2009). DataCite - A Global Registration Agency for Research Dang, Q. H. (2015). Secure Hash Standard (no. NIST FIPS 180-4) (p. Data. In Proceedings of the 2009 Fourth International Conference NIST FIPS 180-4). National Institute of Standards and Technology. on Cooperation and Promotion of Information Resources in Science https://doi.org/10.6028/NIST.FIPS.180-4. and Technology (pp. 257–261). Presented at the Cooperation and Miller, D., Whitlock, J., Gardiner, M., Ralphson, M., Ratovsky, R., Sarid, Promotion of Information Resources in Science and Technology, U.. (2020). OpenAPI specification, version 3.03. OpenAPI. http:// 2009. COINFO ‘09. Fourth International Conference on https:// spec.openapis.org/oas/v3.0.3. Accessed 2 February 2021. doi.org/10.1109/COINFO.2009.66. Data Citation Synthesis Group. (2014). Joint Declaration of Data Brewka, G., Polberg, S., & Woltran, S. (2014). Generalizations of Dung Citation Principles. San Diego CA: Future of research communica- frameworks and their role in formal argumentation. Intelligent tion and e-scholarship (FORCE11). https://doi.org/10.25490/a97f- Systems, IEEE, 29(1), 30–38. https://doi.org/10.1109/MIS.2013. egyk. Dung, P. M. (1995). On the acceptability of arguments and its fundamen- Brinckman, A., Chard, K., Gaffney, N., Hategan, M., Jones, M. B., tal role in nonmonotonic reasoning, logic programming and n- Kowalik, K., Kulasekaran, S., Ludäscher, B., Mecum, B. D., person games. Artif Intell, 77(2), 321–357. https://doi.org/10.1016/ Nabrzyski, J., Stodden, V., Taylor, I. J., Turk, M. J., & Turner, K. 0004-3702(94)00041-x. (2019). Computing environments for reproducibility: Capturing the Dung, P. M., & Thang, P. M. (2018). Representing the semantics of “Whole Tale.”. Futur Gener Comput Syst, 94,854–867. https://doi. abstract dialectical frameworks based on arguments and attacks. org/10.1016/j.future.2017.12.029. Argument & Computation, 9(3), 249–267. https://doi.org/10.3233/ Brody, J. A., Morrison, A. C., Bis, J. C., O’Connell, J. R., Brown, M. R., AAC-180427. Huffman, J. E., et al. (2017). Analysis commons, a team approach to Neuroinform (2022) 20:187–202 201 Ellison, A. M., Boose, E. R., Lerner, B. S., Fong, E., & Seltzer, M. Lau, J. W., Lehnert, E., Sethi, A., Malhotra, R., Kaushik, G., Onder, Z., et al. (2017). The Cancer genomics cloud: Collaborative, reproduc- (2020). The End-to-End Provenance Project. Patterns, 1(2), 100016. https://doi.org/10.1016/j.patter.2020.100016. ible, and democratized-a new paradigm in large-scale computational research. Cancer Res, 77(21), e3–e6. https://doi.org/10.1158/0008- Fenner, M., Clark, T., Katz, D., Crosas, M., Cruse, P., Kunze, J., & 5472.CAN-17-0387. Wimalaratne, S. (2018, July 23). Core metadata for GUIDs. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, National Institutes of Health, Data Commons Pilot Phase D., et al. (2013). PROV-O: The PROV Ontology W3C Consortium. https://public.nihdatacommons.us/DCPPC-DRAFT- Recommendation 30 April 2013. http://www.w3.org/TR/prov-o/ 7_KC2/. Leite, L., Rocha, C., Kon, F., Milojicic, D., & Meirelles, P. (2020). A Fenner, M., Crosas, M., Grethe, J. S., Kennedy, D., Hermjakob, H., survey of DevOps concepts and challenges. ACM Comput Surv, Rocca-Serra, P., Durand, G., Berjon, R., Karcher, S., Martone, M., 52(6), 1–35. https://doi.org/10.1145/3359981. & Clark, T. (2019). A data citation roadmap for scholarly data re- Levinson, M. A., Niestroy, J., Al Manir, S., Fairchild, K. D., Lake, D. E., positories. Scientific Data, 6(1), 28. https://doi.org/10.1038/s41597- Moorman, J. R., & Clark, T. (2021). Fairscape v0.1.0 Release. 019-0031-8. CERN Zenodo. DOI:https://doi.org/10.5281/zenodo.4711204. Gil, Y., Miles, S., Belhajjame, K., Deus, H., Garijo, D., Klyne, G., et al. Lewis, J., & Fowler, M. (2014). Microservices: A definition of this new (2013, April 30). PROV Model Primer: W3C Working Group Note architectural term. MartinFowler.com. https://martinfowler.com/ 30 April 2013. World Wide Web Consortium (W3C). https://www. articles/microservices.html#ProductsNotProjects w3.org/TR/prov-primer/ Malhotra, R., Seth, I., Lehnert, E., Zhao, J., Kaushik, G., Williams, E. H., Gottifredi, S., Cohen, A., García, A. J., & Simari, G. R. (2018). Sethi, A., & Davis-Dusenbery, B. N. (2017). Using the seven brid- Characterizing acceptability semantics of argumentation frame- ges Cancer genomics cloud to access and analyze petabytes of works with recursive attack and support relations. Artif Intell, 262, Cancer Data. Curr Protoc Bioinformatics, 60, 11.16.1–11.16.32. 336–368. https://doi.org/10.1016/j.artint.2018.06.008. https://doi.org/10.1002/cpbi.39. Greenberg, S. A. (2009). How citation distortions create unfounded au- Merkys, A., Mounet, N., Cepellotti, A., Marzari, N., Gražulis, S., & Pizzi, thority: Analysis of a citation network. Br Med J, 339, b2680. G. (2017). A posteriori metadata from automated provenance track- https://doi.org/10.1136/bmj.b2680. ing: Integration of AiiDA and TCOD. Journal of Cheminformatics, Greenberg, S. A. (2011). Understanding belief using citation networks. J 9(1), 56. https://doi.org/10.1186/s13321-017-0242-y. Eval Clin Pract, 17(2), 389–393. https://doi.org/10.1111/j.1365- Moreau, L., Missier, P., Belhajjame, K., B’Far, R., Cheney, J., Coppens, 2753.2011.01646.x. S., et al. (2013). PROV-DM: The PROV Data model: W3C recom- Grossman, R. L. (2019). Data Lakes, clouds, and Commons: A review of mendation 30 April 2013. World Wide Web Consortium. http:// platforms for analyzing and sharing genomic Data. Trends Genet, www.w3.org/TR/prov-dm/ 35(3), 223–234. https://doi.org/10.1016/j.tig.2018.12.006. NIH Data Commons Pilot: Object registration service (ORS). (2018). Groth, P., Cousijn, H., Clark, T., & Goble, C. (2020). FAIR Data reuse – https://github.com/mlev71/ors_wsgi The Path through Data citation. Data Intelligence, 2(1–2), 78–86. Notter, M. (2020). Nipype tutorial. Example 1: Preprocessing workflow. https://doi.org/10.1162/dint_a_00030. Github. https://miykael.github.io/nipype_tutorial/notebooks/ Ishii, S., Iwama, A., Koseki, H., Shinkai, Y., Taga, T., & Watanabe, J. example_preprocessing.html. Accessed 5 February 2021. (2014). Report on STAP Cell Research Paper Investigation (p. 11). Papadimitriou, G., Wang, C., Vahi, K., da Silva, R. F., Mandal, A., Liu, Saitama, JP: RIKEN. http://www3.riken.jp/stap/e/f1document1.pdf Z., Mayani, R., Rynge, M., Kiran, M., Lynch, V. E., Kettimuthu, R., Juty, N., Wimalaratne, S. M., Soiland-Reyes, S., Kunze, J., Goble, C. A., Deelman, E., Vetter, J. S., & Foster, I. (2021). End-to-end online & Clark, T. (2020). Unique, persistent, resolvable: Identifiers as the performance data capture and analysis for scientific workflows. foundation of FAIR. Data Intelligence, 2(1–2), 30–39. https://doi. Futur Gener Comput Syst, 117,387–400. https://doi.org/10.1016/j. org/10.5281/zenodo.3267434. future.2020.11.024. Katz, D., Chue Hong, N., Clark, T., Muench, A., Stall, S., Bouquin, D., Prager,E.M.,Chambers,K.E.,Plotkin,J.L.,McArthur,D. L., et al. (2021a). Recognizing the value of software: A software cita- Bandrowski, A. E., Bansal, N., Martone, M. E., Bergstrom, H. C., tion guide [version 2; peer review: 2 approved]. F1000Research, Bespalov, A., & Graf, C. (2018). Improving transparency and sci- 9(1257). https://doi.org/10.12688/f1000research.26932.2. entific rigor in academic publishing. Brain and Behavior, 9, e01141. Katz, D. S., Gruenpeter, M., Honeyman, T., Hwang, L., Sochat, V., Anzt, https://doi.org/10.1002/brb3.1141. H., & Goble, C. (2021b). A Fresh Look at FAIR for Research Rahwan, I.(Ed.).(2009). Argumentation in artificial intelligence. Software, 35. Springer. Khan, F. Z., Soiland-Reyes, S., Sinnott, R. O., Lonie, A., Goble, C., & RETRACTED ARTICLE: Stimulus-triggered fate conversion of somatic Crusoe, M. R. (2019). Sharing interoperable workflow provenance: cells into pluripotency. (2014). PubPeer: The Online Journal Club. A review of best practices and their practical application in ht tps: //pubpeer. com /public ati o ns/ CWLProv. GigaScience, 8(11). https://doi.org/10.1093/ B9BF2D3E83DF32CAEFFDAC159A2A94#14 gigascience/giz095. RIKEN. (2014). Interim report on the investigation of the Obokata et al. King, G. (2007). An introduction to the Dataverse network as an infra- articles. RIKEN. https://www.riken.jp/en/news_pubs/research_ structure for Data sharing. Sociol Methods Res, 36(2), 173–199. news/pr/2014/20140314_1/ https://doi.org/10.1177/0049124107306660. Shannon, P. (2003). Cytoscape: A software environment for integrated Kunze, J., & Rodgers, R. (2008). The ARK Identifier Scheme. University models of biomolecular interaction networks. Genome Res, 13(11), of California, Office of the President. https://escholarship.org/uc/ 2498–2504. https://doi.org/10.1101/gr.1239303. item/9p9863nc Shapin, S. (1984). Pump and circumstance: Robert Boyle’sliterary tech- Lamprecht, A.-L., Garcia, L., Kuzak, M., Martinez, C., Arcila, R., Martin nology. Soc Stud Sci, 14(4), 481–520 http://sss.sagepub.com/ Del Pico, E., et al. (2020). Towards FAIR principles for research content/14/4/481.abstractN2. software. Data Science, 3(1), 37–59. https://doi.org/10.3233/DS- Shiu, A. (2014). The STAP scandal: A post-pub review success story. 190026. Publons. https://publons.com/blog/the-stap-scandal-a-post-pub- review-success-story/ Larrucea, X., Santamaria, I., Colomo-Palacios, R., & Ebert, C. (2018). Smith, A. M., Katz, D. S., Niemeyer, K. E., & FORCE11 Software Microservices. IEEE Softw, 35(3), 96–100. https://doi.org/10.1109/ MS.2018.2141030. Citation Working Group. (2016). Software citation principles. 202 Neuroinform (2022) 20:187–202 PeerJ Computer Science, 2,e86. https://doi.org/10.7717/peerj-cs. Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Software Heritage Foundation. (2020, May 14). SoftWare Heritage per- Finkers,R.,Gonzalez-Beltran,A.,Gray,A. J.G.,Groth,P., sistent IDentifiers (SWHIDs), version 1.5. Software Heritage Goble, C., Grethe, J. S., Heringa, J., ’t Hoen, P. A. C., Hooft, R., Foundation. https://docs.softwareheritage.org/devel/swh-model/ Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., persistent-identifiers.html#overview. Accessed 5 February 2021. Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, Starr, J., Castro, E., Crosas, M., Dumontier, M., Downs, R. R., Duerr, R., R., Sansone, S. A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Haak, L. L., Haendel, M., Herman, I., Hodson, S., Hourclé, J., Kratz, Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., J. E., Lin, J., Nielsen, L. H., Nurnberger, A., Proell, S., Rauber, A., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Sacchi, S., Smith, A., Taylor, M., & Clark, T. (2015). Achieving Zhao, J., & Mons, B. (2016). The FAIR guiding principles for sci- human and machine accessibility of cited data in scholarly publica- entific data management and stewardship. Scientific Data, 3, tions. PeerJ Computer Science, 1,1. https://doi.org/10.7717/peerj- 160018. https://doi.org/10.1038/sdata.2016.18. cs.1. Wilson, S., Fitzsimons, M., Ferguson, M., Heath, A., Jensen, M., Miller, Tuecke, S., Ananthakrishnan, R., Chard, K., Lidman, M., McCollam, B., J., Murphy, M. W., Porter, J., Sahni, H., Staudt, L., Tang, Y., Wang, Rosen, S., & Foster, I. (2016). Globus auth: A research identity and Z., Yu, C., Zhang, J., Ferretti, V., Grossman, R. L., & GDC Project. access management platform. In 2016 IEEE 12th International (2017). Developing Cancer informatics applications and tools using Conference on e-Science (e-Science) (pp. 203–212). Presented at the NCI genomic Data Commons API. Cancer Res, 77(21), e15– the 2016 IEEE 12th international conference on e-science (e-sci- e18. https://doi.org/10.1158/0008-5472.CAN-17-0598. ence), Baltimore, MD, USA: IEEE https://doi.org/10.1109/ Yakutovich, A. V., Eimre, K., Schütt, O., Talirz, L., Adorf, C. S., eScience.2016.7870901. Andersen, C. W., Ditler, E., du, D., Passerone, D., Smit, B., Uhlir, P. (2012). For Attribution - Developing Data Attribution and Marzari, N., Pizzi, G., & Pignedoli, C. A. (2021). AiiDAlab – An Citation Practices and Standards: Summary of an International ecosystem for developing, executing, and sharing scientific Workshop (2012) (p. 220). The National Academies Press. http:// workflows. Comput Mater Sci, 188, 110165. https://doi.org/10. www.nap.edu/catalog.php?record_id=13564 1016/j.commatsci.2020.110165. Wan, X., Guan, X., Wang, T., Bai, G., & Choi, B.-Y. (2018). Application deployment using microservice and Docker containers: Framework Publisher’sNote Springer Nature remains neutral with regard to jurisdic- and optimization. J Netw Comput Appl, 119,97–109. https://doi. tional claims in published maps and institutional affiliations. org/10.1016/j.jnca.2018.07.003. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton,M.,Baak,A.,Blomberg,N.,Boiten,J.W.,daSilva

Journal

NeuroinformaticsSpringer Journals

Published: Jan 1, 2022

Keywords: FAIR data; FAIR software; Digital Commons; Evidence graph; Provenance; Reproducibility; Agumentation

References