FairChoices Architecture

Table of Contents
  1. Summary
  2. General Guidelines and Rules of Thumb
  3. Data Cleaning and Pre-Processing
  4. File Structure
  5. Clean
  6. Scripts
  7. Output
  8. Impact Model Implementation (Pending)
  9. Demography Model Implementation (Pending)
  10. Cost Model Implementation (Pending)

Summary

FairChoices runs on four abstracted components:

  1. Data cleaning and pre-processing
  2. Impact model
  3. Demography model
  4. Cost model

See below for a detailed breakdown of how these components are implemented. For details on the specification of these components (that is, their analytical framework), see our analytical overview.

FairChoices is built on two technologies:

  1. R programming language: For writing the component code that makes up the end-to-end analytical tool.
  2. Dropbox file server: For storing the end-to-end data (input, mid-process state management, and output).

General Guidelines and Rules of Thumb

Data frame in, data frame out: Any integration between components (for example, loading clean data into the impact model, or passing results from the impact model to the demography model) should use a data frame. Within a component you can do whatever computation is optimal: transform the data frame into a matrix, break it out into individual vectors, and so on. At the boundary between components, though, keep to a data frame format. This helps with legibility, data storage, logging, and testing, because each step along the way can produce a simplified, human-readable output.
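
As an illustration of this rule, here is a minimal sketch of a component that accepts a data frame, works on plain vectors internally, and hands a data frame back. The function name apply_effect, the column names, and the numbers are all hypothetical and are not taken from the FairChoices code base.

```r
# Hypothetical component: data frame in, data frame out.
# Column names (location, age_group, deaths, effect_size) are illustrative only.
apply_effect <- function(clean_df) {
  stopifnot(is.data.frame(clean_df))

  # Inside the component, use whatever structure is convenient (here: vectors).
  averted <- clean_df$deaths * clean_df$effect_size

  # Hand the result to the next component as a data frame again.
  data.frame(
    location = clean_df$location,
    age_group = clean_df$age_group,
    deaths_averted = averted,
    stringsAsFactors = FALSE
  )
}

# Made-up input data, for demonstration only.
clean_df <- data.frame(
  location = c("A", "A", "B"),
  age_group = c("0-4", "5-9", "0-4"),
  deaths = c(100, 40, 250),
  effect_size = c(0.10, 0.25, 0.20),
  stringsAsFactors = FALSE
)

impact_df <- apply_effect(clean_df)
print(impact_df)
```

Because the output is an ordinary data frame, it can be printed, logged, or written to disk at any step without extra conversion.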

Data Cleaning and Pre-Processing

Fig. 1: Schematic of the FairChoices model

File Structure

Raw:

The data/raw folder contains raw data. This data can come from an online source (e.g., GBD, WPP) or be created manually if the information it contains comes from the academic literature (e.g., intervention effect sizes). If the data comes from an online source, it should be saved into the data/raw folder exactly as it was downloaded (i.e., without any manual modifications in Excel). All data cleaning must be done in R through a data cleaning script, which must be saved in the scripts/clean folder.

Rules:

  • The beginning of every data/raw file name should be the date of download in YYYYMMDD format.
  • Except for the date of download prefix, the file name should be the default file name when downloaded from the internet.
  • If you put data into the data/raw folder, you are responsible for also cleaning the data in R.
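
The naming rules above could be checked with a small helper like the one sketched below. The function names are hypothetical, and the underscore between the date and the default file name is an assumption, since the rules do not specify a separator.

```r
# Prefix a downloaded file's default name with the download date (YYYYMMDD).
# ASSUMPTION: an underscore separates the date from the default file name.
raw_file_name <- function(default_name, download_date = Sys.Date()) {
  paste0(format(as.Date(download_date), "%Y%m%d"), "_", default_name)
}

# Check that a file name starts with eight digits that parse as a real date.
has_date_prefix <- function(file_name) {
  prefix <- substr(basename(file_name), 1, 8)
  grepl("^[0-9]{8}$", prefix) && !is.na(as.Date(prefix, format = "%Y%m%d"))
}

# Example file names are made up.
raw_file_name("example_download.csv", as.Date("2023-05-12"))
has_date_prefix("20230512_example_download.csv")
has_date_prefix("example_download.csv")
```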

Extracting Data from Global Burden of Disease

https://vizhub.healthdata.org/gbd-results

  • GBD Estimate: Cause of death or injury
    • Measures: Deaths, YLDs (Years Lived with Disability), Prevalence, Incidence
    • Metric: Number, Rate
    • Cause: Select all causes
    • Location: Select all countries and territories (filter in cleaning)
    • Age: Select all (filter in cleaning)
    • Sex: Male, Female, Both
    • Year: 2019
  • GBD Estimate: Impairment
    • Measures: YLDs (Years Lived with Disability), Prevalence
    • Metric: Rate
    • Impairment: Select all impairments (filter in cleaning*)
    • Cause: Select all causes
    • Location: Select all countries and territories (filter in cleaning)
    • Age: Select all (filter in cleaning)
    • Sex: Male, Female, Both
    • Year: 2019
  • GBD Estimate: Etiology
    • Measures: Deaths, YLDs (Years Lived with Disability)
    • Metric: Rate
    • Etiology: Select all etiologies (filter in cleaning*)
    • Cause: Select all causes
    • Location: Select all countries and territories (filter in cleaning)
    • Age: Select all (filter in cleaning)
    • Sex: Male, Female, Both
    • Year: 2019
  • GBD Estimate: Injuries by nature
    • Measures: YLDs (Years Lived with Disability), Prevalence, Incidence
    • Metric: Rate
    • Injury: Select all injuries (filter in cleaning*)
    • Cause: Jan-Magnus to figure out
    • Location: Select all countries and territories (filter in cleaning)
    • Age: Select all (filter in cleaning)
    • Sex: Male, Female, Both
    • Year: 2019
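
Since locations and ages are downloaded in full and narrowed down later ("filter in cleaning"), the cleaning step might look roughly like the sketch below. The data frame is a made-up stand-in for a GBD download. The column names (measure_name, location_name, age_name, sex_name, val) and the filter lists are assumptions and should be checked against the actual downloaded file.

```r
# Toy stand-in for a GBD download; all values are made up.
# ASSUMPTION: column names follow a <dimension>_name pattern plus a val column.
gbd_raw <- data.frame(
  measure_name = c("Deaths", "Deaths", "Deaths", "Prevalence"),
  location_name = c("Country A", "Country B", "Country A", "Country A"),
  age_name = c("Age group 1", "Age group 1", "Age group 2", "Age group 1"),
  sex_name = c("Both", "Both", "Both", "Both"),
  val = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical filter lists; the real ones depend on the analysis.
keep_locations <- c("Country A")
keep_ages <- c("Age group 1")

# Keep only the rows that match both filters.
keep <- gbd_raw$location_name %in% keep_locations &
  gbd_raw$age_name %in% keep_ages
gbd_clean <- gbd_raw[keep, , drop = FALSE]
rownames(gbd_clean) <- NULL

print(gbd_clean)
```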

Clean

The data/clean folder contains cleaned data that originates from the data/raw folder. The only way that data can be added to the data/clean folder is through a data cleaning script (saved in the scripts/clean folder).

Rules:

  • All data/clean data should be reflected in the PostgreSQL schema.
  • Clean data tables should be stored in long format.
  • Clean data tables should have as little overlapping information (i.e., columns) with other clean data tables as possible.
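
To illustrate the long-format rule, the sketch below reshapes a small wide table (one column per year) into long format using base R only. The table, its column names, and its values are made up for demonstration.

```r
# Made-up wide table: one population column per year.
wide <- data.frame(
  location = c("A", "B"),
  pop_2019 = c(100, 200),
  pop_2020 = c(110, 210),
  stringsAsFactors = FALSE
)

value_cols <- c("pop_2019", "pop_2020")

# Stack the year columns into (location, year, population) rows.
long <- data.frame(
  location = rep(wide$location, times = length(value_cols)),
  year = rep(as.integer(sub("^pop_", "", value_cols)), each = nrow(wide)),
  population = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Sort for readability and reset the row names.
long <- long[order(long$location, long$year), ]
rownames(long) <- NULL

print(long)
```

In long format, adding another year adds rows rather than columns, so the table structure stays the same as the data grows.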

Scripts

The scripts/clean folder contains scripts that take data from the data/raw folder, clean it to a basic level, and save it to the data/clean folder. This data will be used by a wide variety of FairChoices users and will be the input data for all subsequent analytic (e.g., demography, epidemiology) scripts.

Rules:

The beginning of each script must contain basic metadata on where the raw data came from. Include information on the data source, including the URL if relevant, as well as who was responsible for processing the raw data and the date it was last processed (see example below). The scripts/clean folder contains a template (scripts/clean/template.R) to help users get started.

    # Source:
    # - World Population Prospects 2022
    # - https://population.un.org/wpp/Download/Standard/MostUsed/
    # Processor:
    # - Sarah Bolongaita
    # - 2023-05-12
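
For orientation, a cleaning script built around this header might follow the read, clean, save pattern sketched below. This is not the contents of scripts/clean/template.R. The file names, columns, and cleaning steps are hypothetical, and temporary folders stand in for data/raw and data/clean so the sketch runs anywhere.

```r
# Source:
# - <data source name>
# - <URL, if relevant>
# Processor:
# - <name>
# - <date last processed, YYYY-MM-DD>

# ASSUMPTION: temporary folders stand in for data/raw and data/clean.
raw_dir <- file.path(tempdir(), "data", "raw")
clean_dir <- file.path(tempdir(), "data", "clean")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up raw file, written here only so there is something to read.
raw_path <- file.path(raw_dir, "20230512_example.csv")
writeLines(c("Location,Value", "A,1", "B,", "C,3"), raw_path)

# 1. Read the raw file exactly as it was downloaded.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# 2. Clean to a basic level: lower-case column names, drop missing values.
clean <- raw
names(clean) <- tolower(names(clean))
clean <- clean[!is.na(clean$value), , drop = FALSE]
rownames(clean) <- NULL

# 3. Save the result to the clean folder.
clean_path <- file.path(clean_dir, "example.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Keeping every modification in the script, rather than editing the raw file by hand, means the clean table can always be regenerated from data/raw.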

Output

Tables:

  • Ordinally sorted, then alphabetically sorted
  • 2 significant digits for numbers
  • Comma for thousands separator
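
The two number-formatting rules can be combined in a small base R helper, sketched below. The function names format_number and format_column are hypothetical, and the sorting rule is not covered here.

```r
# Format one number with 2 significant digits and a comma thousands separator.
format_number <- function(x) {
  format(signif(x, 2), big.mark = ",", scientific = FALSE, trim = TRUE)
}

# Apply element by element so each value keeps its own number of decimals.
format_column <- function(x) {
  vapply(x, format_number, character(1))
}

# Made-up values, for demonstration only.
print(format_column(c(1234567, 1500, 0.01234)))
```

Formatting each value separately avoids format() padding every entry in a column to a shared number of decimal places.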