BCB744 Intro R Example 2

Author

Affiliation

Below is an example of a test or exam question similar to those you may encounter in the BCB744 Intro R course.

This is a practice exercise. While I will not assess your script, I will provide a rubric to guide your self-evaluation. You are expected to complete the task within the allocated time, submit your script to iKamva by the deadline. This allows me to track participation and I have reason to believe that engagement with these practice tasks correlates with improved performance in the final exam—a hypothesis supported by prior observations.

For your own benefit, I strongly encourage you to work independently. Doing so will ensure that you develop the problem-solving skills necessary for success in the final assessment.

Due date: Thursday, 13 March 2025, 17:00.

1 Question 1

The datasets::UKDriverDeaths and datasets::Seatbelts datasets

These datasets are meant to be used together—UKDriverDeaths has the same data as is provided in the variable drivers in seatbelts, but it also provides information about the temporal structure of the Seatbelts dataset. You will have to devise a way to use this temporal information in your analysis.

Produce a dataframe that combines the temporal information provided in UKDriverDeaths with the other information in Seatbelts.
Produce a faceted graph (using facet_wrap(), placing drivers, front, rear, VanKilled in facets) showing a timeline of monthly means of deaths (means taken across years) whilst distinguishing between the two levels of law.
What do you conclude from your analysis?

2 Question 2

Please use the nycflights13 package for this exercise.

2.1 2.a

What are the 10 most common destinations for flights from NYC airports in 2013, and what is the total distance travelled to each of these airports? Make a 2-panel figure and display these data graphically.

2.2 2.b

Which airlines have the most flights departing from NYC airports in 2013? Make a table that lists these in descending order of frequency shows the number of flights for each airline. In your table, list the names of the airlines as well. Hint: you can use the airlines dataset to look up the airline name based on carrier code.

2.3 2.c

Consider only flights that have non-missing arrival delay information. Your answer should include the name of the carrier in addition to the carrier code and the values asked.

Which carrier had the highest mean arrival delay?
Which carrier had the lowest mean arrival delay?

Make sure that your answer includes the name of the carrier and the calculated mean (±SD) delay times, and use a sensible number of decimal digits.

2.4 2.d

What were the mean values for the weather variables at the origin airport on the top 10 days with the highest departure delays? Contrast this with a similar view on the 10 days with the lowest departure delays. Your table(s) should include the names of origin airports, the dates with the highest (lowest) departure delays, and the mean (±SD) weather variables on these days.

Can you make any inferences about the effect of weather conditions on flight delays? Are there any problems with this analysis, and how might you improve this analysis for a clearer view of the effect of weather conditions on the ability of flights to depart on time?

2.5 2.e

Partition each day into four equal time intervals, e.g. 00:01-06:00, 06:01-12:00, 12:01-18:00, and 18:01-00:00.

At each time interval, what is the proportion of flights delayed at departure? Illustrate your finding in a figure.

Based on your analysis, does the chance of being delayed change throughout the day?

See answer to Question 2.e.i.

For each weekday (1-7) aggregated over 2013, which of the time intervals has the most flights? Create a figure to show your finding.

2.6 2.f

Find the 10 planes that spend the longest time (cumulatively) in the air.

For each model, what are the cumulative, mean flight times? In this table and also mention their type, manufacturer, model, number of engines, and speed.

Create a table that lists, for each air-plane identified in (i.), each flight (and associated destination) that it undertook during 2013.

Summarise all the in formation in (ii.) on a map of the USA. Use lines to connect departure and destination locations (each labelled). Different facets in the figure must be used for each of the 10 planes. You can use the alpha value in ggplot2 such that the colour intensity of overlapping flight lines is proportional to the number of flights taken along the path. For bonus marks, ensure that the curvature of Earth is indicated in the flight lines. Hint: such lines would display as curves, not straight lines.

2.7 2.g

Limit this analysis to only the coldest three winter and warmest three summer months (show evidence for how this is decided). For each of these two seasons, create a visualisation to explore if there is a relationship between the mean daily departure delay and the mean daily temperature. Be as economical with your code as possible.

Discuss your answer.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.,
  author = {Smit, A. J.,},
  title = {BCB744 {Intro} {R} {Example} 2},
  url = {http://tangledbank.netlify.app/BCB744/assessments/BCB744_Intro_R_Example_2.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. BCB744 Intro R Example 2. http://tangledbank.netlify.app/BCB744/assessments/BCB744_Intro_R_Example_2.html.

--- title: "BCB744 Intro R Example 2" format: html params: hide_answers: true --- Below is an example of a test or exam question similar to those you may encounter in the BCB744 Intro R course. This is a practice exercise. While I will not assess your script, I will provide a rubric to guide your self-evaluation. You are expected to complete the task within the allocated time, submit your script to iKamva by the deadline. This allows me to track participation and I have reason to believe that engagement with these practice tasks correlates with improved performance in the final exam—a hypothesis supported by prior observations. For your own benefit, I strongly encourage you to work independently. Doing so will ensure that you develop the problem-solving skills necessary for success in the final assessment. [**Due date:** Thursday, 13 March 2025, 17:00.]{.my-highlight} # Question 1 **The `datasets::UKDriverDeaths` and `datasets::Seatbelts` datasets** These datasets are meant to be used together---`UKDriverDeaths` has the same data as is provided in the variable `drivers` in `seatbelts`, but it also provides information about the temporal structure of the `Seatbelts` dataset. You will have to devise a way to use this temporal information in your analysis. * Produce a dataframe that combines the temporal information provided in `UKDriverDeaths` with the other information in `Seatbelts`. * Produce a faceted graph (using `facet_wrap()`, placing `drivers`, `front`, `rear`, `VanKilled` in facets) showing a timeline of monthly means of deaths (means taken across years) whilst distinguishing between the two levels of `law`. * What do you conclude from your analysis? `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} library(tidyverse) deaths <- datasets::UKDriverDeaths seatbelts <- datasets::Seatbelts # checking the data (not necessary) plot(deaths) plot(seatbelts) # make the combined dataframe df <- data.frame(date = seq.Date(as.Date("1969-01-01"), as.Date("1984-12-01"), by = "month"), as.matrix(seatbelts)) # make long dataframe df |> select(-DriversKilled, -kms, -PetrolPrice) |> pivot_longer(cols = c(drivers, front, rear, VanKilled), names_to = "type", values_to = "killed") |> mutate(year = year(date)) |> group_by(year, type, law) |> summarise(mean_killed = round(mean(killed), 2), .groups = "drop") |> ggplot(aes(x = year, y = mean_killed)) + geom_line(colour = "grey70") + geom_point(aes(colour = as.factor(law)), size = 0.3) + facet_wrap(~type, ncol = 2, scales = "free") + guides(colour = guide_legend("Law")) + theme_minimal() + labs(x = "Year", y = "Mean deaths (count)", title = "UK Car Driver and Passenger Deaths", subtitle = "Jan 1969 to Dec 1984") ``` The data indicate that mandatory the seatbelt law, introduced on January 31, 1983, appears to have reduced the number of deaths resulting from car crashes for drivers, front passengers of passenger cars and for the drivers of vans. However, there was no apparent reduction in deaths for rear passengers, likely because the law requiring them to wear seatbelts did not come into effect until 1991. It should be noted, however, that the interpretation of these results is not entirely clear. From 1969 to December 1982, there was a consistent reduction in deaths, which may have been due to other advancements in car safety features, road safety legislation. Additionally and only two data points are available following the introduction of mandatory seatbelt laws in 1983, and more data are needed to determine whether this reduction was statistically significant. Nevertheless, it is worth noting that the reduction did prove to be statistically significant. `r if (params$hide_answers) ":::"` # Question 2 Please use the **nycflights13** package for this exercise. ## 2.a What are the 10 most common destinations for flights from NYC airports in 2013, and what is the total distance travelled to each of these airports? Make a 2-panel figure and display these data graphically. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true #| fig-cap: "The figure for Question 2." library(ggpubr) library(nycflights13) tab1 <- flights %>% group_by(dest) %>% summarise(n = n(), total_distance = sum(distance)) %>% arrange(desc(n)) %>% slice(1:10) plt1 <- ggplot(tab1) + geom_col(aes(x = dest, y = n), fill = "grey90", alpha = 0.7, colour = "indianred") + theme_minimal() + labs(x = "Airport code", y = "Number of\ndestimations") plt2 <- ggplot(tab1) + geom_col(aes(x = dest, y = total_distance * 0.621371), # convert to km fill = "grey90", alpha = 0.7, colour = "indianred") + theme_minimal() + labs(x = "Airport code", y = "Total\ndistance (km)") ggarrange(plt1, plt2, ncol = 1, labels = "AUTO") ``` `r if (params$hide_answers) ":::"` ## 2.b Which airlines have the most flights departing from NYC airports in 2013? Make a table that lists these in descending order of frequency shows the number of flights for each airline. In your table, list the names of the airlines as well. *Hint:* you can use the `airlines` dataset to look up the airline name based on `carrier` code. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true popular_destinations <- flights %>% count(carrier) %>% arrange(desc(n)) %>% inner_join(airlines, by = "carrier") %>% as_tibble() head(popular_destinations, n = 16) ``` The carrier with the highest number of flights departing from NYC airports in 2013 is United Airlines, followed by JetBlue Airways, ExpressJet Airlines. `r if (params$hide_answers) ":::"` ## 2.c Consider only flights that have non-missing arrival delay information. Your answer should include the name of the carrier in addition to the carrier code and the values asked. i. Which carrier had the highest mean arrival delay? ii. Which carrier had the lowest mean arrival delay? Make sure that your answer includes the name of the carrier and the calculated mean (±SD) delay times, and use a sensible number of decimal digits. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true flights %>% filter(!is.na(arr_delay)) %>% group_by(carrier) %>% summarise(mean_arr_delay = round(mean(arr_delay), 1), sd_arr_delay = round(sd(arr_delay), 1)) %>% arrange(desc(mean_arr_delay)) %>% inner_join(airlines, by = "carrier") %>% slice(c(1, n())) %>% as_tibble() ``` The longest arrival delay was on Frontier Airlines that, on average, arrived 21.9 ± 61.5 (mean ± SD) minutes late. On the contrary, Alaska Airlines typically arrived earlier than anticipated by 9.9 ± 36.5 (mean ± SD) minutes. `r if (params$hide_answers) ":::"` ## 2.d What were the mean values for the weather variables at the origin airport on the top 10 days with the highest departure delays? Contrast this with a similar view on the 10 days with the lowest departure delays. Your table(s) should include the names of origin airports, the dates with the highest (lowest) departure delays, and the mean (±SD) weather variables on these days. Can you make any inferences about the effect of weather conditions on flight delays? Are there any problems with this analysis, and how might you improve this analysis for a clearer view of the effect of weather conditions on the ability of flights to depart on time? `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true #| warning: false top_delay <- bind_rows( flights %>% slice_max(order_by = dep_delay, n = 10), flights %>% slice_min(order_by = dep_delay, n = 10), ) %>% select(carrier, flight, tailnum, dep_delay, month, day, origin) %>% inner_join(weather, by = c("origin", "month", "day")) %>% pivot_longer(cols = temp:visib, names_to = "weather_var", values_to = "value") %>% na.omit() %>% group_by(carrier, flight, tailnum, dep_delay, weather_var) %>% summarise(mean_weather_var = round(mean(value, na.rm = TRUE), ), sd_weather_var = round(sd(value, na.rm = TRUE), 1)) %>% unite("mean_sd", mean_weather_var:sd_weather_var, sep = " ± ") %>% arrange(desc(dep_delay)) %>% pivot_wider(names_from = weather_var, values_from = mean_sd) %>% as_tibble() top_delay ``` Nothing obvious I can see about the effect of weather variables in affecting the departure delay. We would need to do some multivariate stats to assess. `r if (params$hide_answers) ":::"` ## 2.e Partition each day into four equal time intervals, e.g. 00:01-06:00, 06:01-12:00, 12:01-18:00, and 18:01-00:00. i. At each time interval, what is the proportion of flights delayed at departure? Illustrate your finding in a figure. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true # Create time of day variable flights_tod <- flights %>% mutate(time_of_day = case_when( sched_dep_time >= 001 & sched_dep_time <= 600 ~ "00:01-06:00", sched_dep_time >= 601 & sched_dep_time <= 1200 ~ "06:01-12:00", sched_dep_time >= 1201 & sched_dep_time <= 1800 ~ "12:01-18:00", sched_dep_time >= 1801 ~ "18:01-00:00" )) %>% mutate(day_of_week = wday(time_hour)) # Find proportion of delayed flights for each time of day delay_time <- flights_tod %>% filter(!is.na(dep_delay)) %>% mutate(dep_delayed = ifelse(dep_delay > 0, "delayed", "ontime")) %>% count(time_of_day, dep_delayed) %>% group_by(time_of_day) %>% mutate(prop_delayed = n / sum(n)) %>% filter(dep_delayed == "delayed") %>% arrange(prop_delayed) %>% as_tibble() head(delay_time) ``` ```{r} #| fig-cap: "The figure for Question 6a." #| echo: true #| eval: true ggplot(delay_time, aes(x = time_of_day, y = prop_delayed)) + geom_col(fill = "grey90", alpha = 0.7, colour = "indianred") + theme_minimal() + labs(x = "Time of day", y = "Proportion of\nflights delayed") ``` About 21% of flights are delayed between midnight and 6:00, 26% are delayed between 6:00-12:00, 46% delays between 12:00-18:00, and 52% delays between 18:00pm, midnight. As the day progresses and the better the chance is of there being a delay. `r if (params$hide_answers) ":::"` ii. Based on your analysis, does the chance of being delayed change throughout the day? See answer to Question 2.e.i. iii. For each weekday (1-7) aggregated over 2013, which of the time intervals has the most flights? Create a figure to show your finding. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| fig-cap: "The figure for Question 6c." #| echo: true #| eval: true flights_tod %>% group_by(time_of_day, day_of_week) %>% summarise(n_flights = n()) %>% ggplot(aes(x = day_of_week, y = n_flights)) + geom_col(fill = "grey90", alpha = 0.7, colour = "indianred") + theme_minimal() + labs(x = "Time of day", y = "Number of\nflights") + facet_wrap(~time_of_day) ``` Midnight to 6-am has the fewest flights, regardless of what day of the week we are looking at — although Sundays (day 1), Saturday (day 7) have slightly fewer flights. Similarly and Sun?Sat have slightly fewer flights between 6-am to 6-pm, although the total number of flights are much higher (closer to 20000 flights per day). Evenings from 6-pm to midnight the flights decrease in numbers, and Saturdays have fewer flights during this time than other days. `r if (params$hide_answers) ":::"` ## 2.f Find the 10 planes that spend the longest time (cumulatively) in the air. i. For each model, what are the cumulative, mean flight times? In this table and also mention their type, manufacturer, model, number of engines, and speed. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true cum_flights <- flights %>% group_by(tailnum) %>% summarise(cum_air_time = sum(air_time), mean_air_time = round(mean(air_time, na.rm = TRUE), 1)) %>% slice_max(order_by = cum_air_time, n = 10) %>% inner_join(planes, by = "tailnum") %>% select(tailnum, cum_air_time, mean_air_time, type, manufacturer, model, engines, speed) %>% as_tibble() head(cum_flights) ``` `r if (params$hide_answers) ":::"` ii. Create a table that lists, for each air-plane identified in (i.), each flight (and associated destination) that it undertook during 2013. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r} #| echo: true #| eval: true flight_dest <- flights %>% filter(tailnum %in% cum_flights$tailnum) %>% select(tailnum, origin, dest, time_hour) %>% as_tibble() head(flight_dest) ``` `r if (params$hide_answers) ":::"` iii. Summarise all the in formation in (ii.) on a map of the USA. Use lines to connect departure and destination locations (each labelled). Different facets in the figure must be used for each of the 10 planes. You can use the `alpha` value in **ggplot2** such that the colour intensity of overlapping flight lines is proportional to the number of flights taken along the path. For bonus marks, ensure that the curvature of Earth is indicated in the flight lines. *Hint:* such lines would display as curves, not straight lines. `r if (params$hide_answers) "::: {.content-hidden}"` ```{r} # to be done in due course... ``` `r if (params$hide_answers) ":::"` ## 2.g Limit this analysis to only the coldest three winter and warmest three summer months (show evidence for how this is decided). For each of these two seasons, create a visualisation to explore if there is a relationship between the mean daily departure delay and the mean daily temperature. Be as economical with your code as possible. Discuss your answer. `r if (params$hide_answers) "::: {.content-hidden}"` **Answer** ```{r ec} #| echo: true #| eval: true seas <- weather %>% group_by(month) %>% summarise(mean_temp = round(mean(temp, na.rm = TRUE), 1)) %>% as_tibble() seas ggplot(seas, aes(x = month, y = (mean_temp - 32)*5/9)) + geom_col(fill = "grey90", alpha = 0.7, colour = "indianred") + scale_x_continuous(breaks = seq(2, 12, by = 2)) + theme_minimal() + labs(x = "Month", y = "Mean temp. (°C)") ``` The coldest months are December, January, and February. The warmest time of year is during June, July, August. ```{r} #| echo: true #| eval: true flights %>% filter(month %in% c(12, 1, 2, 6, 7, 8)) %>% inner_join(weather, by = c("origin", "month", "day")) %>% group_by(month) %>% summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE), mean_temp = round(mean(temp, na.rm = TRUE), 1)) %>% mutate(seas = c("winter", "winter", "summer", "summer", "summer", "winter")) %>% ggplot(aes(x = (mean_temp - 32)*5/9, y = mean_dep_delay)) + geom_point(aes(col = seas)) + theme_minimal() + labs(x = "Mean temp. (°C)", y = "Mean delay (min)") ``` It seems that, in general, shorter delays are experienced during winter months. To fully assess the effect of weather variables on delays, a more detailed statistical analysis will be required. `r if (params$hide_answers) ":::"`