This vignette assumes a SQL server at localhost (we use PostgreSQL), with data in OMOP Common Data Model v5.4 format in schema cdm_new_york3. The patient records shown in this example are synthetic data from the Synthea(TM) Patient Generator.
library(phea)
library(dplyr)
# Connect to SQL server.
dbcon <- DBI::dbConnect(RPostgres::Postgres(),
  host = 'localhost', port = 8765, dbname = 'fort',
  user = cred$pg$user, password = cred$pg$pass)
# Call setup_phea so we can use sqlt() and sql0().
setup_phea(dbcon, 'cdm_new_york3')
How complex can a phenotype be in Phea? And how does performance change as you increase the complexity? Do the queries ever become too complex?
Specifically, we consider two dimensions of complexity:
A. How many components you can put into one formula.
B. How many consecutive times you can repeat this: compute a formula, then use its results as a component inside another formula.
In this vignette we will stress-test Phea in dimension A. We are calling this stress test A.
The phenotype we will compute here is not relevant. We just want to test Phea!
These are this vignette’s takeaways:
We will be using table MEASUREMENT from cdm_new_york3. For context, let us measure the size of that table.
size_stats <- sqlt(measurement) |>
  group_by() |>
  summarise(
    rows = n(),
    patients = n_distinct(person_id)) |>
  collect()
measurement_summary <- sql0('
  select measurement_concept_id as concept_id, concept_code, concept_name, vocabulary_id,
    count(*) as records, count(distinct person_id) as patients
  from cdm_new_york3.measurement a
  inner join cdm_new_york3.concept b
    on a.measurement_concept_id = b.concept_id
  group by measurement_concept_id, concept_code, concept_name, vocabulary_id
  order by patients desc, records desc
') |>
  collect()
MEASUREMENT has 9,668,839 rows and 23,065 patients. Let's see the most popular concepts in it.
head(measurement_summary, 10) |>
kable()
concept_id | concept_code | concept_name | vocabulary_id | records | patients |
---|---|---|---|---|---|
43055141 | 72514-3 | Pain severity - 0-10 verbal numeric rating [Score] - Reported | LOINC | 548651 | 23046 |
3000963 | 718-7 | Hemoglobin [Mass/volume] in Blood | LOINC | 98650 | 23044 |
3009744 | 786-4 | MCHC [Mass/volume] by Automated count | LOINC | 98245 | 23044 |
3020416 | 789-8 | Erythrocytes [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
3023599 | 787-2 | MCV [Entitic volume] by Automated count | LOINC | 98245 | 23044 |
3024929 | 777-3 | Platelets [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
3000905 | 6690-2 | Leukocytes [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
3012030 | 785-6 | MCH [Entitic mass] by Automated count | LOINC | 98245 | 23044 |
3025315 | 29463-7 | Body weight | LOINC | 470537 | 23042 |
3023314 | 4544-3 | Hematocrit [Volume Fraction] of Blood by Automated count | LOINC | 94026 | 23042 |
That is the data we will use to benchmark Phea. Each component will be exactly like this:
SELECT *
FROM cdm_new_york3.measurement
WHERE measurement_concept_id = [chosen concept]
In other words, each component uses a different measurement_concept_id, from the summary table, to SELECT from MEASUREMENT.
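To make that concrete, here is what one such component looks like when built by hand, using the same calls that the helper function further below will use. (The names hb_records and hb_component are illustrative; concept 3000963 is Hemoglobin from the summary table above.)

# Illustrative example of a single component; the helper function below
# builds N of these in a loop. Concept 3000963 is Hemoglobin [Mass/volume] in Blood.
hb_records <- sqlt(measurement) |>
  filter(measurement_concept_id == 3000963)

hb_component <- make_component(hb_records,
  .ts = measurement_datetime,
  .pid = person_id)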
We will be picking rows (concepts) from the summary table starting from the first, so, to reduce bias, let us shuffle them.
# Shuffle the rows
set.seed(42)
measurement_summary <- measurement_summary[sample(1:nrow(measurement_summary)), ]
This was the machine used to execute this test:
Processor Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz 2.59 GHz
Installed RAM 64.0 GB (63.8 GB usable)
System type 64-bit operating system, x64-based processor
For each run of test A, we will pick N concepts, build a component for each (C1 to Cn), and compute the following formula:
test_a = C1_value_as_number + C2_value_as_number + C3_value_as_number + … + CN_value_as_number
In other words, we will just add up the value_as_number of each component.
We do not want to put a NULL into that sum, because it would cause the entire result to permanently collapse to NULL. This is because any number plus NULL equals NULL. To circumvent that, we will use function coalesce(). If a component Cx is NULL, the number 0 will be used instead. So the actual formula we'll use is:
test_a = coalesce(C1_value_as_number, 0) + coalesce(C2_value_as_number, 0) + coalesce(C3_value_as_number, 0) + … + coalesce(CN_value_as_number, 0)
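To make that concrete, here is how the formula string would be assembled for N = 3, using the same paste0() call that run_test_a() uses further below (component_names is just an illustrative variable):

# Illustrative only: building the formula string for N = 3 components.
component_names <- c('c1', 'c2', 'c3')
paste0('coalesce(', component_names, '_value_as_number, 0)', collapse = ' + ')
#> [1] "coalesce(c1_value_as_number, 0) + coalesce(c2_value_as_number, 0) + coalesce(c3_value_as_number, 0)"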
From the finalized phenotype, we will just count the rows, unique patients, and sum the result across patients. This is just to force the SQL server to compute the phenotype entirely, for all patients, so we can measure how long it takes.
In each run, we use library tictoc to measure the time spent.
library(tictoc)
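For reference, this is the timing pattern used inside the test function below: tic() starts the timer, toc(quiet = TRUE) returns a list with $tic and $toc (in seconds), and their difference is the elapsed time. (Sys.sleep() here just stands in for the work being timed.)

tic()
Sys.sleep(1)
t <- toc(quiet = TRUE)
t$toc - t$tic  # elapsed time in seconds, roughly 1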
Function produce_N_components(N) produces N components using the first N rows of measurement_summary.
produce_N_components <- function(N) {
  make_test_component <- function(i) {
    concept_id <- measurement_summary$concept_id[i]
    records <- sqlt(measurement) |>
      filter(measurement_concept_id == local(concept_id))
    component <- make_component(records,
      .ts = measurement_datetime,
      .pid = person_id)
    return(component)
  }

  # Pick the first N lab tests
  components <- lapply(1:N, make_test_component)

  # Their names will be simply "Cn"
  names(components) <- paste0('c', 1:length(components))
  return(components)
}
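As a quick sanity check of what the helper returns: for N = 3 it yields a named list of three components, and those names become the column prefixes (c1_value_as_number, c2_value_as_number, …) that the formula will reference.

# For example:
names(produce_N_components(3))
#> [1] "c1" "c2" "c3"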
Function run_test_a(N) runs Test A using N components.
run_test_a <- function(N) {
  wrap(paste0('test_a_N', N), by_name = TRUE, pass_val = TRUE, {
    components <- produce_N_components(N)

    formula_string <- paste0('coalesce(', names(components), '_value_as_number, 0)', collapse = ' + ')

    tic()
    phen <- calculate_formula(
      components = components,
      fml = list(test_a = formula_string))
    phea_time <- toc(quiet = TRUE)
    phea_time <- phea_time$toc - phea_time$tic

    tic()
    result <- phen |>
      group_by() |>
      summarise(
        rows = n(),
        patients = n_distinct(pid),
        test_a_sum = sum(test_a/100000, na.rm = TRUE)) |>
      collect()
    query_time <- toc(quiet = TRUE)
    query_time <- query_time$toc - query_time$tic

    result_rows <- result$rows
    result_patients <- result$patients
    test_a_sum <- result$test_a_sum

    # Compute component stats
    input_rows <- lapply(components, \(x) summarise(x$rec_source$records, rows = n()) |> pull('rows')) |>
      purrr::reduce(`+`)

    input_patients <- lapply(components, \(x) select(x$rec_source$records, person_id)) |>
      purrr::reduce(union_all) |>
      summarise(patients = n_distinct(person_id)) |>
      pull('patients')

    phen_head <- head(phen, 5) |> collect()

    res <- tibble(
      N = N,
      input_rows = input_rows,
      result_rows = result_rows,
      test_a_sum = test_a_sum,
      input_patients = input_patients,
      result_patients = result_patients,
      phea_time = phea_time,
      query_time = query_time,
      pps_query = input_patients/query_time,
      rps_query = input_rows/query_time,
      phen_head = list(phen_head)
    )
    return(res)
  })
}
Function print_test_a_results() will also be used. For brevity it is not printed here; you can see it by downloading the markdown file stress_test_a.Rmd.
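For readers who just want to follow along, here is a rough, hypothetical sketch of what such a helper could look like, assuming it simply stacks the run results and prints them with kable(). This is not the actual function from stress_test_a.Rmd, which also abbreviates large counts (e.g. "12.9k") before printing.

# Hypothetical sketch only -- see stress_test_a.Rmd for the real print_test_a_results().
print_test_a_results_sketch <- function() {
  test_a_runs |>
    lapply(\(x) select(x, -phen_head)) |>
    bind_rows() |>
    kable()
}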
test_a_runs = list()
Let’s first run Test A with N = 1, that is, a single component.
test_a_runs[[length(test_a_runs)+1]] <- run_test_a(1)
Done! Now let us briefly peek at the results of the phenotype itself.
test_a_runs[[length(test_a_runs)]]$phen_head[[1]] |>
  kable()
row_id | pid | ts | window | c1_value_as_number | test_a |
---|---|---|---|---|---|
1 | 6 | 2011-07-13 | 00:00:00 | NA | 0 |
2 | 6 | 2014-07-30 | 00:00:00 | NA | 0 |
3 | 17 | 2021-12-26 | 00:00:00 | NA | 0 |
4 | 26 | 2017-11-05 | 00:00:00 | NA | 0 |
5 | 28 | 2017-04-28 | 00:00:00 | NA | 0 |
Column test_a is the result of the formula, i.e. the sum of the components. As we can see, there are many NAs in value_as_number in our table MEASUREMENT, but that is not an issue.
How was the performance? Let’s see.
print_test_a_results()
N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
---|---|---|---|---|---|---|---|---|---|
1 | 12.9k | 12.9k | 0 | 5136 | 5136 | 0.5 | 0.8 | 6420 | 16.1k |
Here are the meanings of the columns:

- N: number of components in the formula.
- in. rows: total rows across the components' input record sources.
- res. rows: rows in the phenotype produced by calculate_formula().
- test_a_sum: sum of test_a (divided by 100,000) across all rows of the phenotype.
- in. patients: distinct patients across the components' input record sources.
- res. patients: distinct patients in the phenotype produced by calculate_formula().
- Phea time (s): time Phea took to assemble the query in calculate_formula(), in seconds.
- query time (s): time the SQL server took to compute (collect()) the phenotype (aggregated), in seconds.
- query pps: input patients per second of query time.
- query rps: input rows per second of query time.

With N = 1 (a single component), Phea took 0.5 seconds to assemble the phenotype, and the SQL server took 0.8 seconds to run the query. That is 16,086 input rows per second of query time.
test_a_runs[[length(test_a_runs)+1]] <- run_test_a(5)

print_test_a_results()
N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
---|---|---|---|---|---|---|---|---|---|
1 | 12.9k | 12.9k | 0.0 | 5136 | 5136 | 0.5 | 0.8 | 6420.0 | 16.1k |
5 | 70.7k | 65.5k | 3.5 | 9548 | 9548 | 1.2 | 4.4 | 2150.5 | 15.9k |
With N = 5 (five components), Phea took 1.17 sec to assemble the phenotype, and the SQL server took 4.44 sec to run the query. That is 15,916 input rows per second of query time.
Now that we have a grip on how Test A works, let’s run it for N = 10 to 150, increasing 10 at a time.
N <- 10
N_max <- 150
while(N <= N_max) {
  test_a_runs[[length(test_a_runs)+1]] <- run_test_a(N)
  N <- N + 10
}
print_test_a_results()
N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
---|---|---|---|---|---|---|---|---|---|
1 | 12.9k | 12.9k | 0.0 | 5136 | 5136 | 0.5 | 0.8 | 6420.0 | 16.1k |
5 | 70.7k | 65.5k | 3.5 | 9548 | 9548 | 1.2 | 4.4 | 2150.5 | 15.9k |
10 | 205.9k | 161.8k | 4.4 | 17673 | 17673 | 1.9 | 10.1 | 1755.0 | 20.4k |
20 | 576.1k | 377.7k | 11.2 | 20867 | 20867 | 3.4 | 33.4 | 624.6 | 17.2k |
30 | 1118.5k | 439.9k | 975.0 | 23052 | 23052 | 5.3 | 121.9 | 189.1 | 9.2k |
40 | 1536.7k | 574.2k | 1743.8 | 23053 | 23053 | 7.4 | 234.7 | 98.2 | 6.5k |
50 | 1797.3k | 575k | 3383.9 | 23053 | 23053 | 8.2 | 348.4 | 66.2 | 5.2k |
60 | 2153.1k | 575.3k | 3766.4 | 23053 | 23053 | 9.9 | 612.9 | 37.6 | 3.5k |
70 | 3539.8k | 637.1k | 5422.1 | 23053 | 23053 | 11.5 | 1275.4 | 18.1 | 2.8k |
80 | 3568.5k | 637.4k | 5485.4 | 23053 | 23053 | 13.4 | 1570.2 | 14.7 | 2.3k |
90 | 3752.7k | 662.5k | 7690.8 | 23056 | 23056 | 17.6 | 2017.3 | 11.4 | 1.9k |
100 | 3809.5k | 662.5k | 7708.4 | 23056 | 23056 | 18.2 | 2383.0 | 9.7 | 1.6k |
110 | 4590.1k | 682.7k | 8057.5 | 23059 | 23059 | 20.7 | 3477.5 | 6.6 | 1.3k |
120 | 5279.3k | 709.7k | 10484.0 | 23059 | 23059 | 21.9 | 5483.3 | 4.2 | 1k |
130 | 5703.5k | 722.8k | 10721.4 | 23064 | 23064 | 30.5 | 7381.4 | 3.1 | 0.8k |
140 | 6536.1k | 722.9k | 11997.7 | 23064 | 23064 | 26.7 | 9472.8 | 2.4 | 0.7k |
150 | 6843.3k | 722.9k | 12752.9 | 23064 | 23064 | 29.6 | 11618.8 | 2.0 | 0.6k |
That is a lot of numbers. Let us do some plots. For brevity, the code that makes the plots is hidden. Please download stress_test_a.Rmd from GitHub if you'd like to see it.
library(plotly)
As we increase the number of components, the query time increases at a sub-exponential rate (on the plot's logarithmic scale, exponential growth would appear as a straight line).
Same as before: as we increase the number of input rows (and the number of components), the query time increases at a sub-exponential rate.
As we increase the number of components, fewer rows per second are computed, because each row takes longer. This is on top of the fact that the more components you add, the more input rows there are, because each component comes from a different SQL query (in Phea's parlance, a different record source). Nevertheless, these two factors are not enough to cause exponential growth of query time: the rate of growth in query time, with respect to the number of rows or components, tapers off as N increases.
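For reference, the first plot (query time versus N) could be reproduced roughly as follows. This is a hypothetical sketch, not the vignette's hidden plotting code; it assumes the runs are first combined into a single data frame (results is an illustrative name).

# Hypothetical sketch, not the vignette's actual (hidden) plotting code.
results <- bind_rows(lapply(test_a_runs, \(x) select(x, -phen_head)))

plot_ly(results, x = ~N, y = ~query_time,
        type = 'scatter', mode = 'lines+markers') |>
  layout(xaxis = list(title = 'number of components (N)'),
         yaxis = list(type = 'log', title = 'query time (s)'))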
Takeaways from the results: