---
title: "BCB744 Intro R Example 2"
format: html
params:
hide_answers: true
---
Below is an example of a test or exam question similar to those you may encounter in the BCB744 Intro R course.
This is a practice exercise. While I will not assess your script, I will provide a rubric to guide your self-evaluation. You are expected to complete the task within the allocated time, submit your script to iKamva by the deadline. This allows me to track participation and I have reason to believe that engagement with these practice tasks correlates with improved performance in the final exam—a hypothesis supported by prior observations.
For your own benefit, I strongly encourage you to work independently. Doing so will ensure that you develop the problem-solving skills necessary for success in the final assessment.
[**Due date:** Thursday, 13 March 2025, 17:00.]{.my-highlight}
# Question 1
**The `datasets::UKDriverDeaths` and `datasets::Seatbelts` datasets**
These datasets are meant to be used together---`UKDriverDeaths` has the same data as is provided in the variable `drivers` in `seatbelts`, but it also provides information about the temporal structure of the `Seatbelts` dataset. You will have to devise a way to use this temporal information in your analysis.
* Produce a dataframe that combines the temporal information provided in `UKDriverDeaths` with the other information in `Seatbelts`.
* Produce a faceted graph (using `facet_wrap()`, placing `drivers`, `front`, `rear`, `VanKilled` in facets) showing a timeline of monthly means of deaths (means taken across years) whilst distinguishing between the two levels of `law`.
* What do you conclude from your analysis?
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
library(tidyverse)
deaths <- datasets::UKDriverDeaths
seatbelts <- datasets::Seatbelts
# checking the data (not necessary)
plot(deaths)
plot(seatbelts)
# make the combined dataframe
df <- data.frame(date = seq.Date(as.Date("1969-01-01"), as.Date("1984-12-01"), by = "month"),
as.matrix(seatbelts))
# make long dataframe
df |>
select(-DriversKilled, -kms, -PetrolPrice) |>
pivot_longer(cols = c(drivers, front, rear, VanKilled),
names_to = "type",
values_to = "killed") |>
mutate(year = year(date)) |>
group_by(year, type, law) |>
summarise(mean_killed = round(mean(killed), 2), .groups = "drop") |>
ggplot(aes(x = year, y = mean_killed)) +
geom_line(colour = "grey70") +
geom_point(aes(colour = as.factor(law)), size = 0.3) +
facet_wrap(~type, ncol = 2, scales = "free") +
guides(colour = guide_legend("Law")) +
theme_minimal() +
labs(x = "Year", y = "Mean deaths (count)",
title = "UK Car Driver and Passenger Deaths",
subtitle = "Jan 1969 to Dec 1984")
```
The data indicate that mandatory the seatbelt law, introduced on January 31, 1983, appears to have reduced the number of deaths resulting from car crashes for drivers, front passengers of passenger cars and for the drivers of vans. However, there was no apparent reduction in deaths for rear passengers, likely because the law requiring them to wear seatbelts did not come into effect until 1991.
It should be noted, however, that the interpretation of these results is not entirely clear. From 1969 to December 1982, there was a consistent reduction in deaths, which may have been due to other advancements in car safety features, road safety legislation. Additionally and only two data points are available following the introduction of mandatory seatbelt laws in 1983, and more data are needed to determine whether this reduction was statistically significant. Nevertheless, it is worth noting that the reduction did prove to be statistically significant.
`r if (params$hide_answers) ":::"`
# Question 2
Please use the **nycflights13** package for this exercise.
## 2.a
What are the 10 most common destinations for flights from NYC airports
in 2013, and what is the total distance travelled to each of these
airports? Make a 2-panel figure and display these data graphically.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
#| fig-cap: "The figure for Question 2."
library(ggpubr)
library(nycflights13)
tab1 <- flights %>%
group_by(dest) %>%
summarise(n = n(),
total_distance = sum(distance)) %>%
arrange(desc(n)) %>%
slice(1:10)
plt1 <- ggplot(tab1) +
geom_col(aes(x = dest, y = n), fill = "grey90", alpha = 0.7,
colour = "indianred") +
theme_minimal() +
labs(x = "Airport code",
y = "Number of\ndestimations")
plt2 <- ggplot(tab1) +
geom_col(aes(x = dest, y = total_distance * 0.621371), # convert to km
fill = "grey90", alpha = 0.7, colour = "indianred") +
theme_minimal() +
labs(x = "Airport code",
y = "Total\ndistance (km)")
ggarrange(plt1, plt2, ncol = 1, labels = "AUTO")
```
`r if (params$hide_answers) ":::"`
## 2.b
Which airlines have the most flights departing from NYC airports in
2013? Make a table that lists these in descending order of frequency
shows the number of flights for each airline. In your table, list the
names of the airlines as well. *Hint:* you can use the `airlines`
dataset to look up the airline name based on `carrier` code.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
popular_destinations <- flights %>%
count(carrier) %>%
arrange(desc(n)) %>%
inner_join(airlines, by = "carrier") %>%
as_tibble()
head(popular_destinations, n = 16)
```
The carrier with the highest number of flights departing from NYC airports in 2013 is United Airlines, followed by JetBlue Airways, ExpressJet Airlines.
`r if (params$hide_answers) ":::"`
## 2.c
Consider only flights that have non-missing arrival delay information.
Your answer should include the name of the carrier in addition to the
carrier code and the values asked.
i. Which carrier had the highest mean arrival delay?
ii. Which carrier had the lowest mean arrival delay?
Make sure that your answer includes the name of the carrier and the
calculated mean (±SD) delay times, and use a sensible number of decimal
digits.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(carrier) %>%
summarise(mean_arr_delay = round(mean(arr_delay), 1),
sd_arr_delay = round(sd(arr_delay), 1)) %>%
arrange(desc(mean_arr_delay)) %>%
inner_join(airlines, by = "carrier") %>%
slice(c(1, n())) %>%
as_tibble()
```
The longest arrival delay was on Frontier Airlines that, on average, arrived 21.9 ± 61.5 (mean ± SD) minutes late. On the contrary, Alaska Airlines typically arrived earlier than anticipated by 9.9 ± 36.5 (mean ± SD) minutes.
`r if (params$hide_answers) ":::"`
## 2.d
What were the mean values for the weather variables at the origin
airport on the top 10 days with the highest departure delays? Contrast
this with a similar view on the 10 days with the lowest departure
delays. Your table(s) should include the names of origin airports, the
dates with the highest (lowest) departure delays, and the mean (±SD)
weather variables on these days.
Can you make any inferences about the effect of weather conditions on
flight delays? Are there any problems with this analysis, and how might
you improve this analysis for a clearer view of the effect of weather
conditions on the ability of flights to depart on time?
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
#| warning: false
top_delay <- bind_rows(
flights %>% slice_max(order_by = dep_delay, n = 10),
flights %>% slice_min(order_by = dep_delay, n = 10),
) %>%
select(carrier, flight, tailnum, dep_delay, month, day, origin) %>%
inner_join(weather, by = c("origin", "month", "day")) %>%
pivot_longer(cols = temp:visib,
names_to = "weather_var",
values_to = "value") %>%
na.omit() %>%
group_by(carrier, flight, tailnum, dep_delay, weather_var) %>%
summarise(mean_weather_var = round(mean(value, na.rm = TRUE), ),
sd_weather_var = round(sd(value, na.rm = TRUE), 1)) %>%
unite("mean_sd", mean_weather_var:sd_weather_var, sep = " ± ") %>%
arrange(desc(dep_delay)) %>%
pivot_wider(names_from = weather_var, values_from = mean_sd) %>%
as_tibble()
top_delay
```
Nothing obvious I can see about the effect of weather variables in affecting the departure delay. We would need to do some multivariate stats to assess.
`r if (params$hide_answers) ":::"`
## 2.e
Partition each day into four equal time intervals, e.g. 00:01-06:00,
06:01-12:00, 12:01-18:00, and 18:01-00:00.
i. At each time interval, what is the proportion of flights delayed at
departure? Illustrate your finding in a figure.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
# Create time of day variable
flights_tod <- flights %>%
mutate(time_of_day = case_when(
sched_dep_time >= 001 & sched_dep_time <= 600 ~ "00:01-06:00",
sched_dep_time >= 601 & sched_dep_time <= 1200 ~ "06:01-12:00",
sched_dep_time >= 1201 & sched_dep_time <= 1800 ~ "12:01-18:00",
sched_dep_time >= 1801 ~ "18:01-00:00"
)) %>%
mutate(day_of_week = wday(time_hour))
# Find proportion of delayed flights for each time of day
delay_time <- flights_tod %>%
filter(!is.na(dep_delay)) %>%
mutate(dep_delayed = ifelse(dep_delay > 0, "delayed", "ontime")) %>%
count(time_of_day, dep_delayed) %>%
group_by(time_of_day) %>%
mutate(prop_delayed = n / sum(n)) %>%
filter(dep_delayed == "delayed") %>%
arrange(prop_delayed) %>%
as_tibble()
head(delay_time)
```
```{r}
#| fig-cap: "The figure for Question 6a."
#| echo: true
#| eval: true
ggplot(delay_time, aes(x = time_of_day, y = prop_delayed)) +
geom_col(fill = "grey90", alpha = 0.7, colour = "indianred") +
theme_minimal() +
labs(x = "Time of day",
y = "Proportion of\nflights delayed")
```
About 21% of flights are delayed between midnight and 6:00, 26% are delayed between 6:00-12:00, 46% delays between 12:00-18:00, and 52% delays between 18:00pm, midnight. As the day progresses and the better the chance is of there being a delay.
`r if (params$hide_answers) ":::"`
ii. Based on your analysis, does the chance of being delayed change
throughout the day?
See answer to Question 2.e.i.
iii. For each weekday (1-7) aggregated over 2013, which of the time
intervals has the most flights? Create a figure to show your
finding.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| fig-cap: "The figure for Question 6c."
#| echo: true
#| eval: true
flights_tod %>%
group_by(time_of_day, day_of_week) %>%
summarise(n_flights = n()) %>%
ggplot(aes(x = day_of_week, y = n_flights)) +
geom_col(fill = "grey90", alpha = 0.7, colour = "indianred") +
theme_minimal() +
labs(x = "Time of day",
y = "Number of\nflights") +
facet_wrap(~time_of_day)
```
Midnight to 6-am has the fewest flights, regardless of what day of the week we are looking at — although Sundays (day 1), Saturday (day 7) have slightly fewer flights. Similarly and Sun?Sat have slightly fewer flights between 6-am to 6-pm, although the total number of flights are much higher (closer to 20000 flights per day). Evenings from 6-pm to midnight the flights decrease in numbers, and Saturdays have fewer flights during this time than other days.
`r if (params$hide_answers) ":::"`
## 2.f
Find the 10 planes that spend the longest time (cumulatively) in the air.
i. For each model, what are the cumulative, mean flight times? In this table and also mention their type, manufacturer, model, number of engines, and speed.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
cum_flights <- flights %>%
group_by(tailnum) %>%
summarise(cum_air_time = sum(air_time),
mean_air_time = round(mean(air_time, na.rm = TRUE), 1)) %>%
slice_max(order_by = cum_air_time, n = 10) %>%
inner_join(planes, by = "tailnum") %>%
select(tailnum, cum_air_time, mean_air_time, type, manufacturer, model, engines, speed) %>%
as_tibble()
head(cum_flights)
```
`r if (params$hide_answers) ":::"`
ii. Create a table that lists, for each air-plane identified in (i.), each flight (and associated destination) that it undertook during 2013.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r}
#| echo: true
#| eval: true
flight_dest <- flights %>%
filter(tailnum %in% cum_flights$tailnum) %>%
select(tailnum, origin, dest, time_hour) %>%
as_tibble()
head(flight_dest)
```
`r if (params$hide_answers) ":::"`
iii. Summarise all the in formation in (ii.) on a map of the USA. Use
lines to connect departure and destination locations (each
labelled). Different facets in the figure must be used for each of
the 10 planes. You can use the `alpha` value in **ggplot2** such
that the colour intensity of overlapping flight lines is
proportional to the number of flights taken along the path. For
bonus marks, ensure that the curvature of Earth is indicated in the
flight lines. *Hint:* such lines would display as curves, not straight
lines.
`r if (params$hide_answers) "::: {.content-hidden}"`
```{r}
# to be done in due course...
```
`r if (params$hide_answers) ":::"`
## 2.g
Limit this analysis to only the coldest three winter and warmest three
summer months (show evidence for how this is decided). For each of these
two seasons, create a visualisation to explore if there is a
relationship between the mean daily departure delay and the mean daily
temperature. Be as economical with your code as possible.
Discuss your answer.
`r if (params$hide_answers) "::: {.content-hidden}"`
**Answer**
```{r ec}
#| echo: true
#| eval: true
seas <- weather %>%
group_by(month) %>%
summarise(mean_temp = round(mean(temp, na.rm = TRUE), 1)) %>%
as_tibble()
seas
ggplot(seas, aes(x = month, y = (mean_temp - 32)*5/9)) +
geom_col(fill = "grey90", alpha = 0.7, colour = "indianred") +
scale_x_continuous(breaks = seq(2, 12, by = 2)) +
theme_minimal() +
labs(x = "Month",
y = "Mean temp. (°C)")
```
The coldest months are December, January, and February. The warmest time of year is during June, July, August.
```{r}
#| echo: true
#| eval: true
flights %>%
filter(month %in% c(12, 1, 2, 6, 7, 8)) %>%
inner_join(weather, by = c("origin", "month", "day")) %>%
group_by(month) %>%
summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE),
mean_temp = round(mean(temp, na.rm = TRUE), 1)) %>%
mutate(seas = c("winter", "winter", "summer", "summer", "summer", "winter")) %>%
ggplot(aes(x = (mean_temp - 32)*5/9, y = mean_dep_delay)) +
geom_point(aes(col = seas)) +
theme_minimal() +
labs(x = "Mean temp. (°C)", y = "Mean delay (min)")
```
It seems that, in general, shorter delays are experienced during winter months. To fully assess the effect of weather variables on delays, a more detailed statistical analysis will be required.
`r if (params$hide_answers) ":::"`