Detect event streaks based on specified thresholds

R
analysis
MHW
This vignette demonstrates how to use a heatwaveR function to the analysis of experimental data for finding the run lengths of events that meet certain criteria.
Published

November 22, 2023

Modified

November 23, 2023

library(heatwaveR)
library(plyr) # because I like plyr
library(dplyr)

Here is a question we received via heatwaveR’s GitHub page regarding using our package’s functions to detect streaks of outcomes based on a threshold in some lab experiment. The question is as follows:

“I was looking for a function that could help me with data wrangling and I believe the exceedance() function is exactly what I’m looking for. I have a dataframe with person IDs, date of treatment, day since treatment, and a lab value. Essentially, I am hoping to find the longest streak of consecutive days where the lab value is below a certain threshold, and extending the streak if the lab value does not rise above the threshold for >3 days.

I wrangled my dataframe into a dataframe with the IDs, date, and lab value as follows …”

Prepare a dataframe with the IDs, date, and lab values

num_subjects <- 5
num_days <- 20
set.seed(13)
subject_ids <- rep(1:num_subjects, each = num_days)
dates <- rep(seq(as.Date("2023-01-01"), by = "1 day", length.out = num_days),
             times = num_subjects)
lab_values <- runif(num_subjects * num_days, min = 0, max = 3000)
df <- data.frame(
  SubjectID = factor(subject_ids),
  Date = as.Date(dates),
  Lab = lab_values
)

Calculate streaks lengths based on linear (over time) thresholds

It turns out that one of the other built-in functions can exactly do what you need. We simply need to make some adjustments to the dataframe fed to the detect_event() function, which is the function that will count the streak lengths.

Calculate a mean value and a threshold; you can calculate an overall mean and threshold, or a mean and threshold for each group— this will depend on your experimental design and hypotheses.

I calculate a mean and threshold based on the pooled data (across A-C):

df2 <- df |> 
  mutate(seas = mean(Lab),
         thresh = 500) # your threshold value here

Even though we calculated the mean value, this is not used; only the threshold is used.

results <- plyr::dlply(.data = df2, .variables = "SubjectID", function(sub_df) {
  detect_event(sub_df,
               x = Date,
               y = Lab,
               seasClim = seas,
               threshClim = thresh,
               minDuration = 1,
               maxGap = 3, 
               coldSpell = TRUE,
               protoEvents = FALSE)
})

results is a list of dataframes, one pair of dataframes for each subject. Let us look at the first list element, which is for SubjectID == 1:

results[[1]]
$climatology
   SubjectID       Date        Lab     seas thresh threshCriterion
1          1 2023-01-01 2130.96734 1484.172    500           FALSE
2          1 2023-01-02  738.41191 1484.172    500           FALSE
3          1 2023-01-03 1168.90333 1484.172    500           FALSE
4          1 2023-01-04  274.15102 1484.172    500            TRUE
5          1 2023-01-05 2886.19363 1484.172    500           FALSE
6          1 2023-01-06   32.79999 1484.172    500            TRUE
7          1 2023-01-07 1722.88553 1484.172    500           FALSE
8          1 2023-01-08 2293.19397 1484.172    500           FALSE
9          1 2023-01-09 2620.14693 1484.172    500           FALSE
10         1 2023-01-10  123.19006 1484.172    500            TRUE
11         1 2023-01-11 1983.36480 1484.172    500           FALSE
12         1 2023-01-12 2635.11255 1484.172    500           FALSE
13         1 2023-01-13 2671.67709 1484.172    500           FALSE
14         1 2023-01-14 1698.84140 1484.172    500           FALSE
15         1 2023-01-15 1780.64203 1484.172    500           FALSE
16         1 2023-01-16 1093.54355 1484.172    500           FALSE
17         1 2023-01-17 1072.23855 1484.172    500           FALSE
18         1 2023-01-18 1774.37116 1484.172    500           FALSE
19         1 2023-01-19 2596.35438 1484.172    500           FALSE
20         1 2023-01-20 2041.57098 1484.172    500           FALSE
   durationCriterion event event_no
1              FALSE FALSE       NA
2              FALSE FALSE       NA
3              FALSE FALSE       NA
4               TRUE  TRUE        1
5              FALSE  TRUE        1
6               TRUE  TRUE        1
7              FALSE  TRUE        1
8              FALSE  TRUE        1
9              FALSE  TRUE        1
10              TRUE  TRUE        1
11             FALSE FALSE       NA
12             FALSE FALSE       NA
13             FALSE FALSE       NA
14             FALSE FALSE       NA
15             FALSE FALSE       NA
16             FALSE FALSE       NA
17             FALSE FALSE       NA
18             FALSE FALSE       NA
19             FALSE FALSE       NA
20             FALSE FALSE       NA

$event
  event_no index_start index_peak index_end duration date_start  date_peak
1        1           4          6        10        7 2023-01-04 2023-01-06
    date_end intensity_mean intensity_max intensity_var intensity_cumulative
1 2023-01-10       -62.3774     -1451.372      1249.218            -436.6417
  intensity_mean_relThresh intensity_max_relThresh intensity_var_relThresh
1                 921.7944                  -467.2                1249.218
  intensity_cumulative_relThresh intensity_mean_abs intensity_max_abs
1                       6452.561           1421.794              32.8
  intensity_var_abs intensity_cumulative_abs rate_onset rate_decline
1          1249.218                 9952.561  -275.4909    -226.7728

The first dataframe is called climatology and the other is called events. Don’t worry about the names as the function was initially written for climate events. The climatology dataframe contains all the data for SubjectID that were initially supplied in df2 and a few new columns, threshCriterion, durationCriterion, event, and event_no are added at the end. When the Lab value dips below the thresh, threshCriterion will flag as TRUE regardless of how long it remains below the threshold. durationCriterion flags as TRUE if the number of times threshCriterion is equal to or greater than minDuration. event flags as TRUE if threshCriterion is TRUE AND durationCriterion is TRUE. A unique identifier is given for each event in event_no.

The events dataframe contains the event_no, the start and end dates of the event, and the event duration (your ‘streaks’). Various other summary stats are also calculated, but these might not be relevant for your question. Or are they?

If you are only interested in the event dataframe and want to combine all the streaks into one table with the results, do this—note the use of ddply() rather than dlply():

results <- plyr::ddply(.data = df2, .variables = "SubjectID", function(sub_df) {
  detect_event(sub_df,
               x = Date,
               y = Lab,
               seasClim = seas,
               threshClim = thresh,
               minDuration = 1,
               maxGap = 3, 
               coldSpell = TRUE,
               protoEvents = FALSE)$event
})
results |> 
  select(SubjectID, event_no, duration)
  SubjectID event_no duration
1         1        1        7
2         2        1        7
3         2        2        5
4         3        1        1
5         3        2        8
6         4        1        1
7         4        2        1
8         5        1        3

Reuse

Citation

BibTeX citation:
@online{j._smit2023,
  author = {J. Smit, Albertus},
  title = {Detect Event Streaks Based on Specified Thresholds},
  date = {2023-11-22},
  url = {http://tangledbank.netlify.app/blog/2023-11-22-run-lengths/},
  langid = {en}
}
For attribution, please cite this work as:
J. Smit A (2023) Detect event streaks based on specified thresholds. http://tangledbank.netlify.app/blog/2023-11-22-run-lengths/.