Motivation and design: public interface for lab data
Sparrow is software for managing the geochemical data created by an individual geochronology laboratory. Its goal is to manage analytical data for indexing and public access. It is designed for flexibility and extensibility, so that it can be tailored to the needs of individual analytical labs that manage a wide variety of data.
- Lab-level data store
- Standardized basic schema
- Standardized web-facing API
- Flexible and extensible
Modes of access
When data leaves an analytical lab, it is integrated into publications and archived by authors. It is also archived internally by the lab itself. We intend to provide several modes of data access to ease parts of this process.
A project-centric web user interface, managed by the lab and possibly also the researcher. We hope to eventually support several interactions for managing the lifecycle of analytical data:
- Link literature references to laboratory archival data
- Manage sample metadata (locations, sample names, etc.)
- Manage data embargos and public access
- Visualize data (e.g. step-heating plots, age spectra)
- Track measurement versions (e.g. new corrections)
- Download data (for authors' own analysis and archival purposes)
On the server, direct database access and a command line interface will allow the lab to:
- Upload new and legacy data using customized scripts
- Apply new corrections without breaking links to published versions or raw data
- Run global checks for data integrity
- Back up the database
A web frontend will allow users outside the lab to:
- Access data directly from the lab through an API for meta-analysis
- Browse a snapshot of the lab's publicly available data, possibly with data visualizations
- Pull the lab's data into other endpoints, such as the Geochron and Macrostrat databases
Place within the lab
This software is designed to run on a standard virtualized UNIX server with a minimum of setup and intervention, and outside of the data analysis pipeline.
It will be able to accept data from a variety of data management pipelines through simple import scripts. Generally, these import scripts will be run on an in-lab machine with access to the server. Data collection, storage, and analysis tools sit immediately prior to this system in a typical lab's data production pipeline.
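As a rough illustration of what a simple import script might look like, the sketch below reads a CSV export from a lab's existing tools and hands each record to a loader. Everything here is an assumption made for illustration: the column names (`sample_id`, `age_ma`, `age_err`), the record layout, and the `load` callable are hypothetical and are not part of Sparrow's actual interface.

```python
import csv
import io
from typing import Callable, Dict, List


def read_lab_export(handle) -> List[Dict]:
    """Parse a (hypothetical) lab CSV export into plain record dictionaries."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "sample": row["sample_id"].strip(),
            "parameter": "age",
            "value": float(row["age_ma"]),
            "error": float(row["age_err"]),
        })
    return records


def run_import(handle, load: Callable[[Dict], None]) -> int:
    """Pass every parsed record to `load` and return the number of records."""
    count = 0
    for record in read_lab_export(handle):
        load(record)  # in a real lab this would write to the server
        count += 1
    return count


# Usage: an in-memory file stands in for the lab's export,
# and a list stands in for the database.
export = io.StringIO("sample_id,age_ma,age_err\nS-1,12.3,0.2\nS-2,45.6,0.9\n")
loaded = []
n = run_import(export, loaded.append)
```

Keeping parsing separate from loading means each lab would only need to customize the parsing step for its own file formats.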
We want this software to be useful to many labs, so a strong and flexible design is crucial. Sparrow will have an extensible core with well-documented interfaces for pluggable components. From a development perspective, the key goals are a clear, concise, well-documented, and extensible schema, and a reasonably small and stable code footprint for the core functionality, with clear "hooks" for lab-specific functionality.
Sparrow's technology stack consists of several parts:
- A Python-based API server
  - sqlalchemy for database access
- PostgreSQL database backend
  - configurable and extensible schema
  - stateless schema migrations with
- React-based administration interface
- Managed with git, with separate branches for analytical types and individual labs
- Software packaged primarily for lightweight, containerized (e.g. Docker) instances
Code and issues for this project are tracked on GitHub.
Hierarchical levels of analytical data
datum: an individual data point (any numerical parameter and its error)
analysis: a collection of data points measured at the same time (roughly synonymous with aliquot)
session: a set of measurements conducted on the same sample at the same time
sample: a geological sample
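The four levels above nest inside one another, which can be sketched with plain Python dataclasses. This is only an illustration of the hierarchy, not Sparrow's schema: the class and field names (`parameter`, `unit`, `date`, `all_data`, and so on) and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datum:
    """An individual data point: a numerical parameter and its error."""
    parameter: str
    value: float
    error: Optional[float] = None
    unit: Optional[str] = None


@dataclass
class Analysis:
    """Data points measured at the same time (roughly an aliquot)."""
    name: str
    data: List[Datum] = field(default_factory=list)


@dataclass
class Session:
    """Measurements conducted on the same sample at the same time."""
    date: str
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Sample:
    """A geological sample."""
    name: str
    sessions: List[Session] = field(default_factory=list)

    def all_data(self) -> List[Datum]:
        """Flatten the hierarchy into a single list of data points."""
        return [d for s in self.sessions for a in s.analyses for d in a.data]


# Usage with made-up values: one sample, one session, one analysis, two data points.
sample = Sample(
    name="example-sample-1",
    sessions=[
        Session(
            date="2019-01-01",
            analyses=[
                Analysis(
                    name="aliquot-1",
                    data=[
                        Datum("age", 12.3, 0.2, "Ma"),
                        Datum("age", 12.5, 0.3, "Ma"),
                    ],
                )
            ],
        )
    ],
)
```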
- A data-storage schema to store heterogeneous geochronology data
- Flexible to store lab-specific data shapes
- A common core of standardized tables
- Standard vocabularies to manage meaning
Data must be loaded into this standardized core in order to be exposed to the outside world.
Schema → API
The Sparrow API will map lab-specific vocabulary to community standards.
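One way to picture this mapping is a lookup table from a lab's own column names to shared terms, applied before data is exposed. The sketch below is an assumption-laden illustration: the vocabulary entries and the `to_standard` helper are hypothetical and do not reflect Sparrow's real API or any actual community standard.

```python
from typing import Dict

# Hypothetical mapping from lab-specific terms to standardized terms.
LAB_TO_STANDARD: Dict[str, str] = {
    "Age (Ma)": "age",
    "Age err": "age_error",
    "Sample ID": "sample_name",
}


def to_standard(record: Dict[str, object],
                vocabulary: Dict[str, str] = LAB_TO_STANDARD) -> Dict[str, object]:
    """Rename the keys of a lab record using the vocabulary.

    Unmapped terms raise an error rather than passing through silently,
    so that only standardized terms are exposed.
    """
    unmapped = [key for key in record if key not in vocabulary]
    if unmapped:
        raise KeyError(f"No standard term for: {unmapped}")
    return {vocabulary[key]: value for key, value in record.items()}


# Usage with a made-up lab record.
standardized = to_standard({"Sample ID": "S-1", "Age (Ma)": 12.3})
```

Rejecting unmapped terms is one possible design choice; a lab might instead prefer to keep them in a separate, lab-specific part of the record.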