Data Importers

Big picture

Sparrow helps make your lab data findable, accessible, interoperable, and reusable. To do this, your data needs to be in a form that can be passed to the Sparrow database, which usually means minimizing the human-introduced variability in your files. Some useful information, such as the source of samples and associated papers, may not be easily accessible in your archive; Sparrow will help with that later.

You should plan on iterative development of an import pipeline, especially if you have a large dataset that may include variation from human input. For instance, if you have 3 common methods, start with an import pipeline that extracts data for one of them, then build on it. It is difficult to build an automated pipeline that works for an entire archive on the first try, so be persistent and work toward completeness over time.

  1. Clean your analytical data files to make uniform columns and cells.
  2. Iterate to ingest as much of your archive as possible.
  3. Link metadata to analytical data in Sparrow.
  4. Share data with the outside world!

Data cleaning

Making column headers uniform and ensuring consistent data types in cells are two of the most important steps in organizing and cleaning data from Excel files. This may be as simple as writing a single regular expression that matches all columns of a similar form (see this example website); often, matching the minimum set of structured characters is enough to reliably find the needed columns. If you want more control, make your regular expressions more specific, or use a look-up dictionary listing all possible column names (see the WiscSIMS importer example).
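For example, a small normalization helper might map loose header patterns to canonical names. This is only a sketch: the patterns and canonical names below are hypothetical, and a real mapping will need to reflect your own archive.

import re

# Hypothetical mapping from loose header patterns to canonical names
COLUMN_PATTERNS = {
    r"d(elta)?\s*18\s*O": "d18O",
    r"sample\s*(id|name)": "sample_id",
}

def normalize_columns(df):
    renames = {}
    for col in df.columns:
        for pattern, canonical in COLUMN_PATTERNS.items():
            if re.search(pattern, str(col), re.IGNORECASE):
                renames[col] = canonical
                break
    return df.rename(columns=renames)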

For data in cells, it is important to ensure that there are no null values where the Sparrow import pipeline does not allow them. This can be handled in the steps that transform tables into JSON. See the schema for more information on what cannot be null.
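One way to handle this with pandas is sketched below: drop rows that are missing required fields, then convert any remaining NaN cells to None so they serialize to JSON null. The required field names here are hypothetical; consult the schema for the real constraints.

import pandas

def clean_nulls(df, required=("sample_id",)):
    # Drop rows missing hypothetical required fields
    df = df.dropna(subset=[c for c in required if c in df.columns])
    # Convert remaining NaN cells to None so they become JSON null
    return df.where(pandas.notnull(df), None)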

Iterative workflows

"Start simple and build complexity" is the fundamental rule for building importers. Unless your initial files are well structured with minimal variability from user input, it is unreasonable to expect 100% success on the first attempt at importing data. Identify the minimal data necessary to populate your instance of Sparrow with findable, accessible, interoperable, and reusable data. For instance, if standards are run alongside samples and the values of both are needed to check the accuracy and precision of your lab's data, make sure to import both.
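A first pass might simply filter rows to the one method you understand best, then widen the filter as the pipeline matures. The column names in this sketch are hypothetical:

def should_import(row):
    # First pass: restrict to one well-understood method; broaden later.
    # "METHOD" and "IS_STANDARD" are hypothetical column names.
    return row["METHOD"] == "SIMS d18O" or bool(row["IS_STANDARD"])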

Summary

Writing data importers for Sparrow is an iterative and sometimes lengthy, yet rewarding, process that makes your lab's data findable, accessible, interoperable, and reusable.

Once data is cleaned and structured through this iterative process, Sparrow provides a straightforward function for importing it, db.load_data. The examples below show this function in more detail.
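As a minimal sketch, loading a record might look like the following. The payload here is hypothetical; consult your instance's schema for the exact required fields, and note that sparrow.get_database() is assumed as the way to obtain a database handle.

import sparrow

db = sparrow.get_database()

# Hypothetical "session" payload; field names must match your schema
db.load_data("session", {
    "name": "Example session",
    "date": "2020-01-01T00:00:00",
    "sample": {"name": "SAMPLE-001"},
})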

Two examples of importers can be found:

Tips and tricks

Embedding a shell

When building an importer, you might want to debug your data import code as you go. For this, you can embed an IPython shell in your application, which allows you to inspect the state of the program at a given point.

from pathlib import Path
from IPython import embed
from rich import print # Rich is a nice library for pretty-printing data
import pandas
import sparrow

@sparrow.task(name="import-earthchem")
def import_earthchem():
    for file in Path(sparrow.settings.DATA_DIR).glob("*.txt"):
        df = pandas.read_table(file, sep="\t")
        ## The line below embeds a shell prompt in the importer
        embed()
        ## Here we raise an error, so we don't get stuck in an endless
        ## loop of shell embedding :)
        raise RuntimeError("Stopping after the first file")

Start simple and get more complex

First, read the data sheet, and inspect it to get a sense of its form.

Then, focus on getting a single row in the proper form to be imported. A second-phase importer for the one above might look like this:

@sparrow.task(name="import-earthchem")
def import_earthchem():
    for file in Path(sparrow.settings.DATA_DIR).glob("*.txt"):
        df = pandas.read_table(file, sep="\t")
        for ix, row in df.iterrows():
            print(f"Importing row {ix} of {df.shape[0]}")
            import_sample(row)

def import_sample(row):
    # Now we can inspect a single row!
    embed()
    raise

Build towards producing "clean" JSON

Now, the focus of your development can be the import_sample function. Your next target is to start producing "clean" JSON that approximates the form of the data in the Sparrow archive.

import re

def import_sample(row):
    # Split year and author from references
    ref = row["REFERENCE"]
    year = re.search(r", (\d{4})$", ref)

    if year is None:
        raise Exception(f"Could not parse reference {ref}")
    authors = ref[: year.start()].strip()
    year = int(year.group(1))

    sample = {
        "name": data["SAMPLE ID"],
        "location": {
            "type": "Point",
            "coordinates": [
                float(data["LONGITUDE"]),
                float(data["LATITUDE"]),
            ],
        },
        "publication": {
            "author": authors,
            "year": year,
        },
    }
    # For now, just print the data and move on
    print(sample)
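Once the printed JSON reliably matches the schema, replace the print call with db.load_data to actually write records to your Sparrow instance.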

Regular expressions

Sometimes it's necessary to split strings using regular expressions. Some helpful testing tools: regexr.com, regex101.com, and pythex.org.
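For example, the pattern used in import_sample above pulls a trailing year off a made-up reference string:

import re

ref = "Smith, J. and Jones, A., 1995"
match = re.search(r", (\d{4})$", ref)
if match is not None:
    year = int(match.group(1))       # 1995
    authors = ref[: match.start()]   # "Smith, J. and Jones, A."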