Detecting Events in Gridded Data

Marine heatwaves and cold spells as per Hobday et al. (2016) and Schlegel et al. (2017).

This vignette demonstrates how to detect marine heatwaves (MHWs) in dataframes of gridded data in which each pixel is a separate time series.
Author

AJ Smit and Robert W Schlegel

Published

February 15, 2023

Overview

This vignette uses the data we acquired earlier in Downloading and Preparing NOAA OISST Data: ERDDAP. We will use these subsetted data for our example on how to detect MHWs in gridded data.

library(dplyr) # For basic data manipulation
library(ggplot2) # For visualising data
library(heatwaveR) # For detecting MHWs
library(tidync) # For easily dealing with NetCDF data
library(doParallel) # For parallel processing

Loading data

Because we saved our data as an .Rds file, loading it into R is easy.

OISST <- readRDS("~/Desktop/OISST_vignette.Rds")
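
Before detecting anything it is worth confirming that the data have the shape heatwaveR expects: one row per pixel per day. The column names mentioned below (lon, lat, t, temp) are those produced in the previous vignette and may differ if the data were prepared another way.

# Inspect the first few rows; we expect 'lon', 'lat', 't' (date) and 'temp'
head(OISST)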

Event detection

Two good choices: dplyr vs. plyr

When we want to make the same calculation across multiple groups of data within one dataframe we have two good options available to us. The first is to use the map() suite of functions from the purrr package, a split-apply pattern now also available in dplyr via group_modify(). This is a fast, tidyverse-friendly approach to splitting up tasks. The other good option is to go back in time a bit and use the ddply() function from the plyr package. This is arguably the better approach as it allows us to very easily use multiple cores to detect the MHWs. The catch is that one must never load the plyr library directly, as it has some fundamental inconsistencies with the tidyverse; instead, call its functions with the plyr:: prefix. We will see below how to perform these two techniques without causing ourselves any headaches.

It is a little clumsy to use multiple functions at once with the two methods so we will combine the calculations we want to make into one wrapper function.

event_only <- function(df){
  # First calculate the climatologies
  clim <- ts2clm(data = df, climatologyPeriod = c("1982-01-01", "2011-01-01"))
  # Then the events
  event <- detect_event(data = clim)
  # Return only the event metric dataframe of results
  return(event$event)
}
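
Before unleashing this wrapper on thousands of pixels it is reassuring to test it on a single one. The snippet below is a quick sanity check that assumes the OISST dataframe carries lon, lat, t and temp columns; any pixel present in the subset will do.

# Run the wrapper on the first pixel in the dataset as a sanity check
OISST %>% 
  filter(lon == lon[1], lat == lat[1]) %>% 
  event_only() %>% 
  head()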

The dplyr method

This method requires no special consideration and is performed just as any other friendly tidyverse code chunk would be.

system.time(
# First we start by choosing the 'OISST' dataframe
MHW_dplyr <- OISST %>% 
  # Then we group the data by the 'lon' and 'lat' columns
  group_by(lon, lat) %>% 
  # Then we run our MHW detecting function on each group
  group_modify(~event_only(.x))
) # ~123 seconds

Running the above calculations with only one of the 2.8 GHz cores on a modern laptop took ~123 seconds. It must be noted, however, that a recent update to the dplyr package allows it to interrogate one's computer to determine how many cores it has at its disposal. It then uses one core at full capacity and the other cores usually at half capacity.
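
For completeness, the same grouped calculation can also be written with the nest/map pattern from purrr mentioned above. This is only a sketch: it assumes the tidyr and purrr packages are installed and that the data carry the t and temp columns expected by event_only(), and it should give the same events as group_modify() at a broadly similar speed.

# A purrr-style alternative to group_modify()
MHW_purrr <- OISST %>% 
  group_by(lon, lat) %>% 
  tidyr::nest() %>%                                # one nested tibble per pixel
  mutate(event = purrr::map(data, event_only)) %>% # detect events per pixel
  select(-data) %>% 
  tidyr::unnest(cols = event)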

The plyr technique

This method requires that we first tell our machine how many of its processor cores to give us for our calculation.

# NB: One should never use ALL available cores, save at least 1 for other essential tasks
# The computer I'm writing this vignette on has 8 cores, so I use 7 here
registerDoParallel(cores = 7)

# Detect events
system.time(
MHW_plyr <- plyr::ddply(.data = OISST, .variables = c("lon", "lat"), .fun = event_only, .parallel = TRUE)
) # 33 seconds

The plyr technique took 33 seconds using seven cores. It is not seven times faster because multi-core work carries some overhead: the computer must keep track of which results belong where so that it can stitch everything back together for you. This bookkeeping uses very little memory, but over large jobs it can start to become problematic. Occasionally 'slippage' can also occur, where an entire task is simply forgotten. This is very rare, but it does happen, and it is partly what keeps dplyr a viable option, as it does not suffer from this problem. The other reason is that dplyr performs its calculations more efficiently than plyr. But what if we could have the best of both worlds?
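
Rather than hard-coding the number of workers as was done above, one can ask the machine how many cores it has and keep at least one free. This is just a convenience sketch using parallel::detectCores() from base R.

# Choose a worker count portably, always leaving one core free
n_cores <- max(1, parallel::detectCores() - 1)
registerDoParallel(cores = n_cores)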

A harmonious third option

As one may see above, running these calculations on a very large (or even global) gridded dataset can quickly become very heavy. While running these calculations myself on the global OISST dataset, I have found that the fastest option is to combine the two approaches above. In my workflow I save each longitude slice of the global OISST dataset as a separate file, use the dplyr method within each file, and use the plyr method to run as many files at once as my core limit will allow. One should not do this the other way around and use dplyr to run multiple plyr calculations at once; this will confuse your computer and likely cause a stack overflow, which sounds more fun than it actually is... as I have had to learn.

In order to combine these two options happily into one, we will need to convert the dplyr code we wrote above into its own wrapper function, which we will then call on a stack of files using the plyr technique. Before we do that we must first create the aforementioned stack of files.

for(i in 1:length(unique(OISST$lon))){
  OISST_sub <- OISST %>% 
    filter(lon == unique(lon)[i])
  saveRDS(object = OISST_sub, file = paste0("~/Desktop/OISST_lon_",i,".Rds"))
}

This may initially seem like an unnecessary extra step, but when one is working with time series data it is necessary to have all of the dates at a given pixel loaded at once. Unless one is working from a server/virtual machine/supercomputer this means that one will often not be able to comfortably hold an entire grid for a study area in memory at once. Having the data accessible as thin strips like this makes life easier. And as we see in the code chunk below it also (arguably) allows us to perform the most efficient calculations on our data.
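
To get a rough sense of the memory saved, one can compare the size of the full dataframe with that of a single longitude strip. The file name below assumes the naming scheme used in the loop above; exact sizes will depend on the extent of your subset.

# Compare the in-memory size of the full grid with one longitude strip
format(object.size(OISST), units = "MB")
format(object.size(readRDS("~/Desktop/OISST_lon_1.Rds")), units = "MB")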

# The 'dplyr' wrapper function to pass to 'plyr'
dplyr_wrapper <- function(file_name){
  readRDS(file_name) %>% 
    group_by(lon, lat) %>% 
    group_modify(~event_only(.x))
}
# Create a vector of the files we want to use
OISST_files <- dir("~/Desktop", pattern = "OISST_lon_", full.names = T)

# Use 'plyr' technique to run 'dplyr' technique with multiple cores
system.time(
MHW_result <- plyr::ldply(OISST_files, .fun = dplyr_wrapper, .parallel = T)
) # 31 seconds

# Save for later use as desired
saveRDS(MHW_result, "~/Desktop/MHW_result.Rds")

Even though this technique is not much faster computationally, it is much lighter on our memory (RAM) as it only loads one longitude slice of our data at a time. To maximise efficiency even further I would recommend writing out this full workflow in a stand-alone script and then running it using source() directly from an R terminal. The gain in speed here appears nominal, but as one scales this up the speed boost becomes apparent.
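
As a minimal sketch of that suggestion, assuming the workflow above has been saved to a hypothetical stand-alone script called MHW_detect.R, one would simply run it from a clean R session:

# Run the stand-alone detection script
source("~/Desktop/MHW_detect.R")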

As mentioned above, recent changes to how dplyr interacts with one's computer have perhaps slowed down the plyr + dplyr workflow shown here. It may now be that simply using plyr by itself is the better option; it depends on the number of cores and the amount of RAM that one has available.

Case study

Because of human-induced climate change, we anticipate that extreme events will occur more frequently and that they will become greater in intensity. Here we investigate this hypothesis by using gridded SST data, which is the only way that we can assess if this trend is unfolding across large ocean regions. Using the gridded 0.25 degree Reynolds OISST, we will detect marine heatwaves (MHWs) around South Africa by applying the detect_event() function pixel-by-pixel to the data we downloaded in the previous vignette. After detecting the events, we will fit a generalised linear model (GLM) to each pixel to calculate rates of change in some MHW metrics, and then plot the estimated trends.

Trend detection

With our MHW detected we will now look at how to fit some GLMs to the results in order to determine long-term trends in MHW occurrence.

First we calculate the number of events that occurred per pixel per year.

# summarise the number of unique longitude, latitude and year combination:
OISST_n <- MHW_result %>% 
  mutate(year = lubridate::year(date_start)) %>% 
  group_by(lon, lat, year) %>% 
  summarise(n = n(), .groups = "drop") %>% 
  group_by(lon, lat) %>%
  tidyr::complete(year = c(1982:2019)) %>% # Note that these dates may differ
  mutate(n = ifelse(is.na(n), 0, n))
head(OISST_n)

Then we specify the particulars of the GLM we are going to use.

lin_fun <- function(ev) {
  mod1 <- glm(n ~ year, family = poisson(link = "log"), data = ev)
  # extract slope coefficient and its p-value
  tr <- data.frame(slope = summary(mod1)$coefficients[2,1],
                   p = summary(mod1)$coefficients[2,4])
  return(tr)
}
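
For those who prefer tidier model output, the same slope and p-value can be pulled out with broom instead of indexing the summary() table directly. This is only an alternative formulation and assumes the broom package is installed.

# An equivalent helper using broom::tidy() to extract the 'year' coefficient
lin_fun_broom <- function(ev) {
  broom::tidy(glm(n ~ year, family = poisson(link = "log"), data = ev)) %>% 
    filter(term == "year") %>% 
    select(slope = estimate, p = p.value)
}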

Lastly we make the calculations.

OISST_nTrend <- plyr::ddply(OISST_n, c("lon", "lat"), lin_fun, .parallel = T)
OISST_nTrend$pval <- cut(OISST_nTrend$p, breaks = c(0, 0.001, 0.01, 0.05, 1))
head(OISST_nTrend)
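
Because the GLM is log-linear, the slope is on the log scale: a slope of b corresponds to a multiplicative change of exp(b) in the expected annual count of events. For example:

# A slope of 0.05 implies roughly a 5% increase in the expected number of
# MHWs per year
exp(0.05)
# [1] 1.051271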

Visualising the results

Let’s finish this vignette by visualising the long-term trends in the annual occurrence of MHWs per pixel in the chosen study area. First we will grab the base global map from the maps package.

# The base map
map_base <- ggplot2::fortify(maps::map(fill = TRUE, plot = FALSE)) %>% 
  dplyr::rename(lon = long)

Then we will create two maps that we will stick together using ggpubr. The first map will show the slope of the count of events detected per year over time as shades of red, and the second map will show the significance (p-value) of these trends in shades of grey.

map_slope <- ggplot(OISST_nTrend, aes(x = lon, y = lat)) +
  geom_rect(size = 0.2, fill = NA,
       aes(xmin = lon - 0.1, xmax = lon + 0.1, ymin = lat - 0.1, ymax = lat + 0.1,
           colour = pval)) +
  geom_raster(aes(fill = slope), interpolate = FALSE, alpha = 0.9) +
  scale_fill_gradient2(name = "count/year (slope)", high = "red", mid = "white",
                       low = "darkblue", midpoint = 0,
                       guide = guide_colourbar(direction = "horizontal",
                                               title.position = "top")) +
  scale_colour_manual(breaks = c("(0,0.001]", "(0.001,0.01]", "(0.01,0.05]", "(0.05,1]"),
                      values = c("firebrick1", "firebrick2", "firebrick3", "white"),
                      name = "p-value", guide = FALSE) +
  geom_polygon(data = map_base, aes(group = group), 
               colour = NA, fill = "grey80") +
  coord_fixed(ratio = 1, xlim = c(13.0, 23.0), ylim = c(-33, -42), expand = TRUE) +
  labs(x = "", y = "") +
  theme_bw() +
  theme(legend.position = "bottom")

map_p <- ggplot(OISST_nTrend, aes(x = lon, y = lat)) +
  geom_raster(aes(fill = pval), interpolate = FALSE) +
  # NB: these p-value bins must match the intervals produced by cut() above
  scale_fill_manual(breaks = c("(0,0.001]", "(0.001,0.01]", "(0.01,0.05]", "(0.05,1]"),
                    values = c("black", "grey40", "grey70", "white"),
                    name = "p-value",
                    guide = guide_legend(direction = "horizontal",
                                               title.position = "top")) +
  geom_polygon(data = map_base, aes(group = group), 
               colour = NA, fill = "grey80") +
  coord_fixed(ratio = 1, xlim = c(13.0, 23.0), ylim = c(-33, -42), expand = TRUE) +
  labs(x = "", y = "") +
  theme_bw() +
  theme(legend.position = "bottom")

map_both <- ggpubr::ggarrange(map_slope, map_p, align = "hv")
map_both
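
Should one wish to keep the figure, it can be written to disk with ggsave(); the file name and dimensions below are only suggestions.

# Save the combined panel figure; adjust the path and size to taste
ggsave(filename = "~/Desktop/MHW_trend_maps.png", plot = map_both,
       width = 8, height = 4, dpi = 300)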

From the figure above we may see that the entire study area shows significant (p ≤ 0.05) increases in the count of MHWs per year. This is generally the case for the entire globe. Not shown here is the corresponding significant increase in the intensity of MHWs.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:
@online{smit_schlegel,
  author = {Smit, A. J. and Schlegel, Robert W.},
  title = {Detecting {Events} in {Gridded} {Data}},
  date = {},
  url = {http://tangledbank.netlify.app/vignettes/gridded_data.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. and Schlegel, Robert W. Detecting Events in Gridded Data. http://tangledbank.netlify.app/vignettes/gridded_data.html.
© 2025, AJ Smit
