Getting started with Phea

Background

Phea is a framework for computing electronic [patient] phenotypes:

using formulas written in SQL language,
and without using SQL joins.

The intended use case is where you have a database with patient records and want to compute a formula. However, in principle, Phea’s approach need not be restricted to this use case.

Phea passes formulas unmodified to the SQL query, so they need not be restricted to mathematical operators. The formulas can include any SQL expression that would be valid in a SELECT statement, such as CASE WHEN ... expressions and function calls.

To use Phea, you need a DBI connection to the SQL server, generated by functions such as DBI::dbConnect() or DatabaseConnector::connect().

Phea also includes a function to create an interactive plot of the phenotype results in one patient. Below is an example of the body mass index formula.

Intuition

Phea goes over patient records in chronological order. Say you want to compute a formula:

body_mass_index = body_weight / (body_height * body_height)

Phea will compute body_mass_index for each person using the most recently available data at each point in time.

In this example, the formula depends on body_weight and body_height. Therefore, their timestamps will be the timestamps for the resulting body_mass_index. This means that at every point in time where there is a body_weight or body_height a record for a given patient, Phea will try to calculate body_mass_index again.

Suppose a patient had a body_weight = 75 kg measurement on May 1st, a body_height = 1.8 meters on August 1st, then a new body_weight = 79 kg on October 1st. Phea would produce 3 result rows as follows:

May 1st:
body_weight = 75 kg
body_height = NULL
body_mass_index = 75 / (NULL * NULL) = NULL

August 1st:
body_weight = 75 kg
body_height = 1.8 meters
body_mass_index = 75 / (1.8 * 1.8) = 23.14815

October 1st:
body_weight = 79 kg
body_height = 1.8 meters
body_mass_index = 79 / (1.8 * 1.8) = 24.38272

The algorithm continues as long as new records are avaialable. For example, if on November 1st a new record body_height = 1.78 meters appears, the result would be:

November 1st:
body_weight = 79 kg
body_height = 1.78 meters
body_mass_index = 75 / (1.78 * 1.78) = 24.93372

At each point in time, the most recently available data is used, unless you specify otherwise when creating a component.

How to set up

First you need to connect to your relational server. While Phea is not restricted to PostgreSQL, so far it has only been fully tested on PostgreSQL version 15.

# library(phea)
devtools::load_all()
#> ℹ Loading phea

# Connect to SQL server
# (fabcred credentials are not declared in this file)
dbcon <- DBI::dbConnect(RPostgres::Postgres(),
  host = fabcred$pg$host, port = fabcred$pg$port, dbname = fabcred$pg$database,
  user = fabcred$pg$user, password = fabcred$pg$pass)

Calling setup_phea() is optional, but it enables the use of convenience functions sql0() and sqlt().

# Setup Phea and define cdm_new_york3 as the default schema.
setup_phea(
  connection = dbcon,
  schema = 'cdm_new_york3')

How to compute a formula

Each formula computation comes from combining three parts:

One or more record sources
One or more components
One or more formulas

Record sources

A record source is any SQL query that provides a table with the data you want to use. For example, select * from procedure_occurrence;. It can be arbitrarily complex, with any number of clauses, subqueries, JOIN, UNION, etc. The record source is assumed to be a long table, that is each row corresponds to one record, and among the columns there is one that contains a patient identifier (pid) and one that contains a timestamp (ts).

For example, suppose you have an OMOP Common Data Model schema called cdm_new_york3, and you would like to compute a formula involving hemoglobin A1c levels.

Such hemoglobin A1c records are in table MEASUREMENT inside schema cdm_new_york3.
They are identified by the OMOP CDM concept ID 3004410, corresponding to LOINC code 4548-4, Hemoglobin A1c/Hemoglobin.total in Blood.
The patient identifiers are in column person_id.
The timestamps are in column measurement_datetime.

That would constitute a record source. Let us create it in Phea:

hemoglobin_query <- sql0('
  select * from cdm_new_york3.measurement
  where measurement_concept_id = 3004410
')

hemoglobin_record_source <- make_record_source(
  records = hemoglobin_query,
  ts = 'measurement_datetime',
  pid = 'person_id')

Components

A component is a specification of what records to pick from record source. You provide a record source to function make_component(), and it returns a component. If the records you want to use in the formula are simply the most recently available at each point in time, just use the default values:

hemoglobin_component <- make_component(hemoglobin_record_source)

The hemoglobin_component can now be used to call calculate_formula().

The other records that can be picked by make_component() are:
- line = N (a.k.a. lag(N)): pick the (N+1)th most recent record. line = 1 gives the 2nd most recent record, line = 2 the third, etc. If you ask for the 2nd most recent (line = 1), but the patient only has 1 instance of that record, the value is NULL.
- delay = T: pick the most recent record that is at least time distance T older than the phenotype date. For example, if you put in a formula a diagnosis component with delay('60 days'), and a lab result component without delay, the result will include only the cases when the timestamp of the diagnosis is at least 60 days older than the timestamp of the lab result. If there is no such record, the value is NULL.

Calculate a formula

For the sake of the example, let us suppose the phenotype that you want to compute is merely the square root of the patient’s hemoglobin A1c level.

a1c_sqrt = sqrt(hemoglobin A1c level)

The numeric value of hemoglobin A1c is located in column value_as_number in the record source. The call to calculate_formula() is then:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component),
  fml = list(
    a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)'))

Phea automatically recognizes that hemo_a1c_value_as_number means column value_as_number in the SQL table associated with the component you named hemo_a1c in the list() when you called calculate_formula(). This is the syntax for accessing column values.

Object phen, returned by calculate_formula(), is a lazy table in the dplyr/dbplyr framework. It can be materialized into a tibble by calling dplyr::collect(), or manipulated further using Phea or dplyr/dbplyr.

dplyr::collect(head(phen)) |>
  knitr::kable()

row_id	pid	ts	window	hemo_a1c_value_as_number	a1c_sqrt
1	1	2011-05-20	00:00:00	6.9	2.626785
2	1	2014-03-07	00:00:00	6.9	2.626785
3	1	2016-03-11	00:00:00	6.9	2.626785
4	1	2018-03-16	00:00:00	6.9	2.626785
5	1	2020-03-20	00:00:00	6.9	2.626785
6	1	2022-02-11	00:00:00	6.9	2.626785

To obtain just the SQL code, use the convenience function code_shot(), or the option .clip_sql in calculate_formula(), or dbplyr::sql_render():

code_shot(phen)

SELECT
  "row_id",
  "pid",
  "ts",
  "window",
  "hemo_a1c_value_as_number",
  sqrt(hemo_a1c_value_as_number) AS "a1c_sqrt"
FROM (
  SELECT
    *,
    "ts" - "ts" AS "window",
    last_value(row_id) over (partition by pid, ts) AS "phea_ts_row"
  FROM (
    SELECT
      "row_id",
      "pid",
      "ts",
      MAX("hemo_a1c_value_as_number") OVER (PARTITION BY "pid", "..dbplyr_partion_1") AS "hemo_a1c_value_as_number",
      MAX("hemo_a1c_ts") OVER (PARTITION BY "pid", "..dbplyr_partion_2") AS "hemo_a1c_ts"
    FROM (
      SELECT
        *,
        SUM(CASE WHEN (("hemo_a1c_value_as_number" IS NULL)) THEN 0 ELSE 1 END) OVER (PARTITION BY "pid" ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_1",
        SUM(CASE WHEN (("hemo_a1c_ts" IS NULL)) THEN 0 ELSE 1 END) OVER (PARTITION BY "pid" ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_2"
      FROM (
        SELECT
          row_number() over (order by "pid", "ts") AS "row_id",
          "pid",
          "ts",
          last_value(case when "name" = 'r5njyfpm' then "value_as_number" else null end) OVER (PARTITION BY "pid", "name" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "hemo_a1c_value_as_number",
          last_value(case when "name" = 'r5njyfpm' then "ts" else null end) OVER (PARTITION BY "pid", "name" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "hemo_a1c_ts"
        FROM (
          SELECT
            'r5njyfpm' AS "name",
            "person_id" AS "pid",
            "measurement_datetime" AS "ts",
            "value_as_number"
          FROM (
select * from cdm_new_york3.measurement
  where measurement_concept_id = 3004410
          ) "q01"
        ) "q02"
      ) "q03"
    ) "q04"
  ) "q05"
) "q06"
WHERE ("row_id" = "phea_ts_row")

Multiple formulas can be computed at once:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component),
  fml = list(
    a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)',
    a1c_half_sqrt = 'a1c_sqrt / 2'))

dplyr::collect(head(phen)) |> knitr::kable()

row_id	pid	ts	window	hemo_a1c_value_as_number	a1c_sqrt	a1c_half_sqrt
1	1	2011-05-20	00:00:00	6.9	2.626785	1.313393
2	1	2014-03-07	00:00:00	6.9	2.626785	1.313393
3	1	2016-03-11	00:00:00	6.9	2.626785	1.313393
4	1	2018-03-16	00:00:00	6.9	2.626785	1.313393
5	1	2020-03-20	00:00:00	6.9	2.626785	1.313393
6	1	2022-02-11	00:00:00	6.9	2.626785	1.313393

Example with more components

So far the examples showed formulas with only one component. Suppose you would like to compute:
phen = sqrt([hemoglobin a1c]) * [patient's age at the time of measurement]

To obtain the patient’s age, we can collect the date of birth from table person, then create a component from it.

person_records <- make_record_source(
  records = sqlt(person),
  ts = 'birth_datetime',
  pid = 'person_id')

person_component <- make_component(person_records)

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    person = person_component),
  fml = list(
    pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
    a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'))

dplyr::collect(head(phen)) |> knitr::kable()

row_id	pid	ts	window	hemo_a1c_measurement_datetime	person_birth_datetime	hemo_a1c_value_as_number	pat_age	a1c_times_age
1	1	1974-03-01	00:00:00	NA	1974-03-01	NA	NA	NA
2	1	2011-05-20	13594 days	2011-05-20	1974-03-01	6.9	37 years 2 mons 19 days	255.3
3	1	2014-03-07	14616 days	2014-03-07	1974-03-01	6.9	40 years 6 days	276.0
4	1	2016-03-11	15351 days	2016-03-11	1974-03-01	6.9	42 years 10 days	289.8
5	1	2018-03-16	16086 days	2018-03-16	1974-03-01	6.9	44 years 15 days	303.6
6	1	2020-03-20	16821 days	2020-03-20	1974-03-01	6.9	46 years 19 days	317.4

Phea includes in the result all data points used in the formula computation. If other columns from the same rows are wanted, export = can be used:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    person = person_component),
  fml = list(
    pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
    a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'),
  export = c(
    'hemo_a1c_provider_id',
    'hemo_a1c_visit_occurrence_id',
    'person_gender_concept_id'))

dplyr::collect(head(phen)) |> knitr::kable()

row_id	pid	ts	window	hemo_a1c_measurement_datetime	person_birth_datetime	hemo_a1c_value_as_number	hemo_a1c_provider_id	hemo_a1c_visit_occurrence_id	person_gender_concept_id	pat_age	a1c_times_age
1	1	1974-03-01	00:00:00	NA	1974-03-01	NA	NA	NA	8507	NA	NA
2	1	2011-05-20	13594 days	2011-05-20	1974-03-01	6.9	22084	21	8507	37 years 2 mons 19 days	255.3
3	1	2014-03-07	14616 days	2014-03-07	1974-03-01	6.9	27	24	8507	40 years 6 days	276.0
4	1	2016-03-11	15351 days	2016-03-11	1974-03-01	6.9	22084	19	8507	42 years 10 days	289.8
5	1	2018-03-16	16086 days	2018-03-16	1974-03-01	6.9	22084	18	8507	44 years 15 days	303.6
6	1	2020-03-20	16821 days	2020-03-20	1974-03-01	6.9	22084	17	8507	46 years 19 days	317.4

Using records other than the one most recently available

When creating a component, arguments line and delay can be used to pick records other than the one most recently available.

line causes records to be skipped in reverse chronological order. line = 0 (default) gives the most recent record available. line = 1 skips 1 record, that is, gives the second most recent. line = 2 skips 2 records, and so on.

hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source)

previous_hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source,
  line = 1)

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    prev_hemo_a1c = previous_hemoglobin_component),
  fml = list(
    a1c_difference = 'hemo_a1c_value_as_number - prev_hemo_a1c_value_as_number'),
  export = c(
    'hemo_a1c_visit_occurrence_id',
    'prev_hemo_a1c_visit_occurrence_id'))

dplyr::collect(head(phen)) |> knitr::kable()

row_id	pid	ts	window	hemo_a1c_value_as_number	prev_hemo_a1c_value_as_number	hemo_a1c_visit_occurrence_id	prev_hemo_a1c_visit_occurrence_id	a1c_difference
1	1	2011-05-20	00:00:00	6.9	NA	21	NA	NA
2	1	2014-03-07	1022 days	6.9	6.9	24	21	0
3	1	2016-03-11	735 days	6.9	6.9	19	24	0
4	1	2018-03-16	735 days	6.9	6.9	18	19	0
5	1	2020-03-20	735 days	6.9	6.9	17	18	0
6	1	2022-02-11	693 days	6.9	6.9	29	17	0

delay imposes a minimum time difference between the timestamp of the component and the timestamp of the phenotype. It is defined in SQL language. For example, delay = "'1 day'::interval gives the most recently available record that is at least 1 day older than the phenotype (in PostgreSQL). If there are no such records, the value is empty.

hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source)

old_hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source,
  delay = "'6 years'::interval")

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    old_hemo_a1c = old_hemoglobin_component),
  fml = list(
    six_year_diff = 'hemo_a1c_value_as_number - old_hemo_a1c_value_as_number'),
  export = c(
    'hemo_a1c_visit_occurrence_id',
    'old_hemo_a1c_visit_occurrence_id'))

dplyr::collect(head(phen)) |> knitr::kable()

row_id	pid	ts	window	hemo_a1c_value_as_number	old_hemo_a1c_value_as_number	hemo_a1c_visit_occurrence_id	old_hemo_a1c_visit_occurrence_id	six_year_diff
1	1	2011-05-20	00:00:00	6.9	NA	21	NA	NA
2	1	2014-03-07	00:00:00	6.9	NA	24	NA	NA
3	1	2016-03-11	00:00:00	6.9	NA	19	NA	NA
4	1	2018-03-16	2492 days	6.9	6.9	18	21	0
5	1	2020-03-20	2205 days	6.9	6.9	17	24	0
6	1	2022-02-11	2898 days	6.9	6.9	29	24	0

line and delay cannot be used simultaneously in the same component. If both are provided, line is considered and delay is ignored. delay is only considered when line = NA.