Phea is a framework for computing electronic [patient] phenotypes:
The intended use case is where you have a database with patient records and want to compute a formula. However, in principle, Phea’s approach need not be restricted to this use case.
Phea passes formulas unmodified to the SQL query, so they need not be
restricted to mathematical operators. The formulas can include any SQL
expression that would be valid in a SELECT
statement, such
as CASE WHEN ...
expressions and function calls.
To use Phea, you need a DBI connection to the SQL server, generated
by functions such as DBI::dbConnect()
or
DatabaseConnector::connect()
.
Phea also includes a function to create an interactive plot of the phenotype results in one patient. Below is an example of the body mass index formula.
Phea goes over patient records in chronological order. Say you want to compute a formula:
body_mass_index = body_weight / (body_height * body_height)
Phea will compute body_mass_index
for each person using
the most recently available data at each point in time.
In this example, the formula depends on body_weight
and
body_height
. Therefore, their timestamps will be the
timestamps for the resulting body_mass_index
. This means
that at every point in time where there is a body_weight
or body_height
a record for a given patient, Phea
will try to calculate body_mass_index
again.
Suppose a patient had a body_weight = 75 kg
measurement
on May 1st, a body_height = 1.8 meters
on August 1st, then
a new body_weight = 79 kg
on October 1st. Phea would
produce 3 result rows as follows:
May 1st:
body_weight = 75 kg
body_height = NULL
body_mass_index = 75 / (NULL * NULL) = NULL
August 1st:
body_weight = 75 kg
body_height = 1.8 meters
body_mass_index = 75 / (1.8 * 1.8) = 23.14815
October 1st:
body_weight = 79 kg
body_height = 1.8 meters
body_mass_index = 79 / (1.8 * 1.8) = 24.38272
The algorithm continues as long as new records are avaialable. For
example, if on November 1st a new record
body_height = 1.78 meters
appears, the result would be:
November 1st:
body_weight = 79 kg
body_height = 1.78 meters
body_mass_index = 75 / (1.78 * 1.78) = 24.93372
At each point in time, the most recently available data is used, unless you specify otherwise when creating a component.
First you need to connect to your relational server. While Phea is not restricted to PostgreSQL, so far it has only been fully tested on PostgreSQL version 15.
# library(phea)
::load_all()
devtools#> ℹ Loading phea
# Connect to SQL server
# (fabcred credentials are not declared in this file)
<- DBI::dbConnect(RPostgres::Postgres(),
dbcon host = fabcred$pg$host, port = fabcred$pg$port, dbname = fabcred$pg$database,
user = fabcred$pg$user, password = fabcred$pg$pass)
Calling setup_phea()
is optional, but it enables the use
of convenience functions sql0()
and
sqlt()
.
# Setup Phea and define cdm_new_york3 as the default schema.
setup_phea(
connection = dbcon,
schema = 'cdm_new_york3')
Each formula computation comes from combining three parts:
record sources
components
formulas
A record source
is any SQL query that provides a table
with the data you want to use. For example,
select * from procedure_occurrence;
. It can be arbitrarily
complex, with any number of clauses, subqueries, JOIN
,
UNION
, etc. The record source
is assumed to be
a long table, that is each row corresponds to one record, and
among the columns there is one that contains a patient identifier
(pid
) and one that contains a timestamp
(ts
).
For example, suppose you have an OMOP Common Data Model schema called
cdm_new_york3
, and you would like to compute a formula
involving hemoglobin A1c levels.
MEASUREMENT
inside schema cdm_new_york3
.3004410
,
corresponding to LOINC code 4548-4
,
Hemoglobin A1c/Hemoglobin.total in Blood
.person_id
.measurement_datetime
.That would constitute a record source. Let us create it in Phea:
<- sql0('
hemoglobin_query select * from cdm_new_york3.measurement
where measurement_concept_id = 3004410
')
<- make_record_source(
hemoglobin_record_source records = hemoglobin_query,
ts = 'measurement_datetime',
pid = 'person_id')
A component
is a specification of what records
to pick from record source
. You provide a
record source
to function make_component()
,
and it returns a component
. If the records you want to use
in the formula are simply the most recently available at each point in
time, just use the default values:
<- make_component(hemoglobin_record_source) hemoglobin_component
The hemoglobin_component
can now be used to call
calculate_formula()
.
The other records that can be picked by make_component()
are:
- line = N
(a.k.a. lag(N)
): pick the
(N+1)th most recent record. line = 1
gives the 2nd
most recent record, line = 2
the third, etc. If you ask for
the 2nd most recent (line = 1
), but the patient only has 1
instance of that record, the value is NULL.
- delay = T
: pick the most recent record that is at
least time distance T
older than the phenotype
date. For example, if you put in a formula a diagnosis
component with delay('60 days')
, and a lab result component
without delay
, the result will include only the cases when
the timestamp of the diagnosis is at least 60 days older than the
timestamp of the lab result. If there is no such record, the value is
NULL.
For the sake of the example, let us suppose the phenotype that you want to compute is merely the square root of the patient’s hemoglobin A1c level.
a1c_sqrt = sqrt(hemoglobin A1c level)
The numeric value of hemoglobin A1c is located in column
value_as_number
in the record source. The call to
calculate_formula()
is then:
<- calculate_formula(
phen components = list(
hemo_a1c = hemoglobin_component),
fml = list(
a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)'))
Phea automatically recognizes that
hemo_a1c_value_as_number
means column
value_as_number
in the SQL table associated with the
component you named hemo_a1c
in the list()
when you called calculate_formula()
. This is the syntax for
accessing column values.
Object phen
, returned by
calculate_formula()
, is a lazy table in the
dplyr
/dbplyr
framework. It can be materialized
into a tibble by calling dplyr::collect()
, or manipulated
further using Phea or dplyr
/dbplyr
.
::collect(head(phen)) |>
dplyr::kable() knitr
row_id | pid | ts | window | hemo_a1c_value_as_number | a1c_sqrt |
---|---|---|---|---|---|
1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | 2.626785 |
2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | 2.626785 |
3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | 2.626785 |
4 | 1 | 2018-03-16 | 00:00:00 | 6.9 | 2.626785 |
5 | 1 | 2020-03-20 | 00:00:00 | 6.9 | 2.626785 |
6 | 1 | 2022-02-11 | 00:00:00 | 6.9 | 2.626785 |
To obtain just the SQL code, use the convenience function
code_shot()
, or the option .clip_sql
in
calculate_formula()
, or
dbplyr::sql_render()
:
code_shot(phen)
SELECT
"row_id",
"pid",
"ts",
"window",
"hemo_a1c_value_as_number",
sqrt(hemo_a1c_value_as_number) AS "a1c_sqrt"
FROM (
SELECT
*,
"ts" - "ts" AS "window",
last_value(row_id) over (partition by pid, ts) AS "phea_ts_row"
FROM (
SELECT
"row_id",
"pid",
"ts",
MAX("hemo_a1c_value_as_number") OVER (PARTITION BY "pid", "..dbplyr_partion_1") AS "hemo_a1c_value_as_number",
MAX("hemo_a1c_ts") OVER (PARTITION BY "pid", "..dbplyr_partion_2") AS "hemo_a1c_ts"
FROM (
SELECT
*,
SUM(CASE WHEN (("hemo_a1c_value_as_number" IS NULL)) THEN 0 ELSE 1 END) OVER (PARTITION BY "pid" ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_1",
SUM(CASE WHEN (("hemo_a1c_ts" IS NULL)) THEN 0 ELSE 1 END) OVER (PARTITION BY "pid" ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_2"
FROM (
SELECT
row_number() over (order by "pid", "ts") AS "row_id",
"pid",
"ts",
last_value(case when "name" = 'r5njyfpm' then "value_as_number" else null end) OVER (PARTITION BY "pid", "name" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "hemo_a1c_value_as_number",
last_value(case when "name" = 'r5njyfpm' then "ts" else null end) OVER (PARTITION BY "pid", "name" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "hemo_a1c_ts"
FROM (
SELECT
'r5njyfpm' AS "name",
"person_id" AS "pid",
"measurement_datetime" AS "ts",
"value_as_number"
FROM (
select * from cdm_new_york3.measurement
where measurement_concept_id = 3004410
"q01"
) "q02"
) "q03"
) "q04"
) "q05"
) "q06"
) WHERE ("row_id" = "phea_ts_row")
Multiple formulas can be computed at once:
<- calculate_formula(
phen components = list(
hemo_a1c = hemoglobin_component),
fml = list(
a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)',
a1c_half_sqrt = 'a1c_sqrt / 2'))
::collect(head(phen)) |> knitr::kable() dplyr
row_id | pid | ts | window | hemo_a1c_value_as_number | a1c_sqrt | a1c_half_sqrt |
---|---|---|---|---|---|---|
1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
4 | 1 | 2018-03-16 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
5 | 1 | 2020-03-20 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
6 | 1 | 2022-02-11 | 00:00:00 | 6.9 | 2.626785 | 1.313393 |
So far the examples showed formulas with only one component. Suppose
you would like to compute:
phen = sqrt([hemoglobin a1c]) * [patient's age at the time of measurement]
To obtain the patient’s age, we can collect the date of birth from
table person
, then create a component from it.
<- make_record_source(
person_records records = sqlt(person),
ts = 'birth_datetime',
pid = 'person_id')
<- make_component(person_records)
person_component
<- calculate_formula(
phen components = list(
hemo_a1c = hemoglobin_component,
person = person_component),
fml = list(
pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'))
::collect(head(phen)) |> knitr::kable() dplyr
row_id | pid | ts | window | hemo_a1c_measurement_datetime | person_birth_datetime | hemo_a1c_value_as_number | pat_age | a1c_times_age |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1974-03-01 | 00:00:00 | NA | 1974-03-01 | NA | NA | NA |
2 | 1 | 2011-05-20 | 13594 days | 2011-05-20 | 1974-03-01 | 6.9 | 37 years 2 mons 19 days | 255.3 |
3 | 1 | 2014-03-07 | 14616 days | 2014-03-07 | 1974-03-01 | 6.9 | 40 years 6 days | 276.0 |
4 | 1 | 2016-03-11 | 15351 days | 2016-03-11 | 1974-03-01 | 6.9 | 42 years 10 days | 289.8 |
5 | 1 | 2018-03-16 | 16086 days | 2018-03-16 | 1974-03-01 | 6.9 | 44 years 15 days | 303.6 |
6 | 1 | 2020-03-20 | 16821 days | 2020-03-20 | 1974-03-01 | 6.9 | 46 years 19 days | 317.4 |
Phea includes in the result all data points used in the formula
computation. If other columns from the same rows are wanted,
export =
can be used:
<- calculate_formula(
phen components = list(
hemo_a1c = hemoglobin_component,
person = person_component),
fml = list(
pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'),
export = c(
'hemo_a1c_provider_id',
'hemo_a1c_visit_occurrence_id',
'person_gender_concept_id'))
::collect(head(phen)) |> knitr::kable() dplyr
row_id | pid | ts | window | hemo_a1c_measurement_datetime | person_birth_datetime | hemo_a1c_value_as_number | hemo_a1c_provider_id | hemo_a1c_visit_occurrence_id | person_gender_concept_id | pat_age | a1c_times_age |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1974-03-01 | 00:00:00 | NA | 1974-03-01 | NA | NA | NA | 8507 | NA | NA |
2 | 1 | 2011-05-20 | 13594 days | 2011-05-20 | 1974-03-01 | 6.9 | 22084 | 21 | 8507 | 37 years 2 mons 19 days | 255.3 |
3 | 1 | 2014-03-07 | 14616 days | 2014-03-07 | 1974-03-01 | 6.9 | 27 | 24 | 8507 | 40 years 6 days | 276.0 |
4 | 1 | 2016-03-11 | 15351 days | 2016-03-11 | 1974-03-01 | 6.9 | 22084 | 19 | 8507 | 42 years 10 days | 289.8 |
5 | 1 | 2018-03-16 | 16086 days | 2018-03-16 | 1974-03-01 | 6.9 | 22084 | 18 | 8507 | 44 years 15 days | 303.6 |
6 | 1 | 2020-03-20 | 16821 days | 2020-03-20 | 1974-03-01 | 6.9 | 22084 | 17 | 8507 | 46 years 19 days | 317.4 |
When creating a component, arguments line
and
delay
can be used to pick records other than the one most
recently available.
line
causes records to be skipped in reverse
chronological order. line = 0
(default) gives the most
recent record available. line = 1
skips 1 record, that is,
gives the second most recent. line = 2
skips 2 records, and
so on.
<- make_component(
hemoglobin_component input_source = hemoglobin_record_source)
<- make_component(
previous_hemoglobin_component input_source = hemoglobin_record_source,
line = 1)
<- calculate_formula(
phen components = list(
hemo_a1c = hemoglobin_component,
prev_hemo_a1c = previous_hemoglobin_component),
fml = list(
a1c_difference = 'hemo_a1c_value_as_number - prev_hemo_a1c_value_as_number'),
export = c(
'hemo_a1c_visit_occurrence_id',
'prev_hemo_a1c_visit_occurrence_id'))
::collect(head(phen)) |> knitr::kable() dplyr
row_id | pid | ts | window | hemo_a1c_value_as_number | prev_hemo_a1c_value_as_number | hemo_a1c_visit_occurrence_id | prev_hemo_a1c_visit_occurrence_id | a1c_difference |
---|---|---|---|---|---|---|---|---|
1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | NA | 21 | NA | NA |
2 | 1 | 2014-03-07 | 1022 days | 6.9 | 6.9 | 24 | 21 | 0 |
3 | 1 | 2016-03-11 | 735 days | 6.9 | 6.9 | 19 | 24 | 0 |
4 | 1 | 2018-03-16 | 735 days | 6.9 | 6.9 | 18 | 19 | 0 |
5 | 1 | 2020-03-20 | 735 days | 6.9 | 6.9 | 17 | 18 | 0 |
6 | 1 | 2022-02-11 | 693 days | 6.9 | 6.9 | 29 | 17 | 0 |
delay
imposes a minimum time difference between the
timestamp of the component and the timestamp of the phenotype. It is
defined in SQL language. For example,
delay = "'1 day'::interval
gives the most recently
available record that is at least 1 day older than the phenotype (in
PostgreSQL). If there are no such records, the value is empty.
<- make_component(
hemoglobin_component input_source = hemoglobin_record_source)
<- make_component(
old_hemoglobin_component input_source = hemoglobin_record_source,
delay = "'6 years'::interval")
<- calculate_formula(
phen components = list(
hemo_a1c = hemoglobin_component,
old_hemo_a1c = old_hemoglobin_component),
fml = list(
six_year_diff = 'hemo_a1c_value_as_number - old_hemo_a1c_value_as_number'),
export = c(
'hemo_a1c_visit_occurrence_id',
'old_hemo_a1c_visit_occurrence_id'))
::collect(head(phen)) |> knitr::kable() dplyr
row_id | pid | ts | window | hemo_a1c_value_as_number | old_hemo_a1c_value_as_number | hemo_a1c_visit_occurrence_id | old_hemo_a1c_visit_occurrence_id | six_year_diff |
---|---|---|---|---|---|---|---|---|
1 | 1 | 2011-05-20 | 00:00:00 | 6.9 | NA | 21 | NA | NA |
2 | 1 | 2014-03-07 | 00:00:00 | 6.9 | NA | 24 | NA | NA |
3 | 1 | 2016-03-11 | 00:00:00 | 6.9 | NA | 19 | NA | NA |
4 | 1 | 2018-03-16 | 2492 days | 6.9 | 6.9 | 18 | 21 | 0 |
5 | 1 | 2020-03-20 | 2205 days | 6.9 | 6.9 | 17 | 24 | 0 |
6 | 1 | 2022-02-11 | 2898 days | 6.9 | 6.9 | 29 | 24 | 0 |
line
and delay
cannot be
used simultaneously in the same component. If both are provided,
line
is considered and delay
is ignored.
delay
is only considered when line = NA
.