We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from.
Below is an example of a test or exam question similar to those you may encounter in the BCB744 Intro R course.
This is a practice exercise. While I will not assess your script, I will provide a rubric to guide your self-evaluation. You are expected to complete the task within the allocated time and submit your script to iKamva by the deadline. This allows me to track participation, and I have reason to believe that engagement with these practice tasks correlates with improved performance in the final exam—a hypothesis supported by prior observations.
For your own benefit, I strongly encourage you to work independently. Doing so will ensure that you develop the problem-solving skills necessary for success in the final assessment.
These datasets are meant to be used together—UKDriverDeaths has the same data as is provided in the variable drivers in seatbelts, but it also provides information about the temporal structure of the Seatbelts dataset. You will have to devise a way to use this temporal information in your analysis.
Produce a dataframe that combines the temporal information provided in UKDriverDeaths with the other information in Seatbelts.
Produce a faceted graph (using facet_wrap(), placing drivers, front, rear, and VanKilled in facets) showing a timeline of monthly means of deaths (means taken across years) whilst distinguishing between the two levels of law.
What do you conclude from your analysis?
Question 2
Please use the nycflights13 package for this exercise.
2.a
What are the 10 most common destinations for flights from NYC airports in 2013, and what is the total distance travelled to each of these airports? Make a 2-panel figure and display these data graphically.
2.b
Which airlines have the most flights departing from NYC airports in 2013? Make a table that lists these in descending order of frequency and shows the number of flights for each airline. In your table, list the names of the airlines as well. Hint: you can use the airlines dataset to look up the airline name based on carrier code.
2.c
Consider only flights that have non-missing arrival delay information. Your answer should include the name of the carrier in addition to the carrier code and the values asked.
Which carrier had the highest mean arrival delay?
Which carrier had the lowest mean arrival delay?
Make sure that your answer includes the name of the carrier and the calculated mean (±SD) delay times, and use a sensible number of decimal digits.
2.d
What were the mean values for the weather variables at the origin airport on the top 10 days with the highest departure delays? Contrast this with a similar view on the 10 days with the lowest departure delays. Your table(s) should include the names of origin airports, the dates with the highest (lowest) departure delays, and the mean (±SD) weather variables on these days.
Can you make any inferences about the effect of weather conditions on flight delays? Are there any problems with this analysis, and how might you improve this analysis for a clearer view of the effect of weather conditions on the ability of flights to depart on time?
2.e
Partition each day into four equal time intervals, e.g. 00:01-06:00, 06:01-12:00, 12:01-18:00, and 18:01-00:00.
At each time interval, what is the proportion of flights delayed at departure? Illustrate your finding in a figure.
Based on your analysis, does the chance of being delayed change throughout the day?
See answer to Question 2.e.i.
For each weekday (1-7) aggregated over 2013, which of the time intervals has the most flights? Create a figure to show your finding.
2.f
Find the 10 planes that spend the longest time (cumulatively) in the air.
For each model, what are the cumulative and mean flight times? In this table, also mention their type, manufacturer, model, number of engines, and speed.
Create a table that lists, for each air-plane identified in (i.), each flight (and associated destination) that it undertook during 2013.
Summarise all the in formation in (ii.) on a map of the USA. Use lines to connect departure and destination locations (each labelled). Different facets in the figure must be used for each of the 10 planes. You can use the alpha value in ggplot2 such that the colour intensity of overlapping flight lines is proportional to the number of flights taken along the path. For bonus marks, ensure that the curvature of Earth is indicated in the flight lines. Hint: such lines would display as curves, not straight lines.
2.g
Limit this analysis to only the coldest three winter and warmest three summer months (show evidence for how this is decided). For each of these two seasons, create a visualisation to explore if there is a relationship between the mean daily departure delay and the mean daily temperature. Be as economical with your code as possible.
---title: "BCB744 Intro R Example 2"format: htmlparams: hide_answers: true---Below is an example of a test or exam question similar to those you may encounter in the BCB744 Intro R course.This is a practice exercise. While I will not assess your script, I will provide a rubric to guide your self-evaluation. You are expected to complete the task within the allocated time and submit your script to iKamva by the deadline. This allows me to track participation, and I have reason to believe that engagement with these practice tasks correlates with improved performance in the final exam—a hypothesis supported by prior observations.For your own benefit, I strongly encourage you to work independently. Doing so will ensure that you develop the problem-solving skills necessary for success in the final assessment.[**Due date:** Thursday, 13 March 2025, 17:00.]{.my-highlight}# Question 1**The `datasets::UKDriverDeaths` and `datasets::Seatbelts` datasets**These datasets are meant to be used together---`UKDriverDeaths` has the same data as is provided in the variable `drivers` in `seatbelts`, but it also provides information about the temporal structure of the `Seatbelts` dataset. You will have to devise a way to use this temporal information in your analysis.* Produce a dataframe that combines the temporal information provided in `UKDriverDeaths` with the other information in `Seatbelts`.* Produce a faceted graph (using `facet_wrap()`, placing `drivers`, `front`, `rear`, and `VanKilled` in facets) showing a timeline of monthly means of deaths (means taken across years) whilst distinguishing between the two levels of `law`.* What do you conclude from your analysis?`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}library(tidyverse)deaths <- datasets::UKDriverDeathsseatbelts <- datasets::Seatbelts# checking the data (not necessary)plot(deaths)plot(seatbelts)# make the combined dataframedf <-data.frame(date =seq.Date(as.Date("1969-01-01"), as.Date("1984-12-01"), by ="month"),as.matrix(seatbelts))# make long dataframedf |>select(-DriversKilled, -kms, -PetrolPrice) |>pivot_longer(cols =c(drivers, front, rear, VanKilled),names_to ="type",values_to ="killed") |>mutate(year =year(date)) |>group_by(year, type, law) |>summarise(mean_killed =round(mean(killed), 2), .groups ="drop") |>ggplot(aes(x = year, y = mean_killed)) +geom_line(colour ="grey70") +geom_point(aes(colour =as.factor(law)), size =0.3) +facet_wrap(~type, ncol =2, scales ="free") +guides(colour =guide_legend("Law")) +theme_minimal() +labs(x ="Year", y ="Mean deaths (count)",title ="UK Car Driver and Passenger Deaths",subtitle ="Jan 1969 to Dec 1984")```The data indicate that mandatory the seatbelt law, introduced on January 31, 1983, appears to have reduced the number of deaths resulting from car crashes for drivers and front passengers of passenger cars, and for the drivers of vans. However, there was no apparent reduction in deaths for rear passengers, likely because the law requiring them to wear seatbelts did not come into effect until 1991.It should be noted, however, that the interpretation of these results is not entirely clear. From 1969 to December 1982, there was a consistent reduction in deaths, which may have been due to other advancements in car safety features and road safety legislation. Additionally, only two data points are available following the introduction of mandatory seatbelt laws in 1983, and more data are needed to determine whether this reduction was statistically significant. Nevertheless, it is worth noting that the reduction did prove to be statistically significant.`r if (params$hide_answers) ":::"`# Question 2Please use the **nycflights13** package for this exercise.## 2.aWhat are the 10 most common destinations for flights from NYC airportsin 2013, and what is the total distance travelled to each of theseairports? Make a 2-panel figure and display these data graphically.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: true#| fig-cap: "The figure for Question 2."library(ggpubr)library(nycflights13)tab1 <- flights %>%group_by(dest) %>%summarise(n =n(),total_distance =sum(distance)) %>%arrange(desc(n)) %>%slice(1:10)plt1 <-ggplot(tab1) +geom_col(aes(x = dest, y = n), fill ="grey90", alpha =0.7,colour ="indianred") +theme_minimal() +labs(x ="Airport code",y ="Number of\ndestimations")plt2 <-ggplot(tab1) +geom_col(aes(x = dest, y = total_distance *0.621371), # convert to kmfill ="grey90", alpha =0.7, colour ="indianred") +theme_minimal() +labs(x ="Airport code",y ="Total\ndistance (km)")ggarrange(plt1, plt2, ncol =1, labels ="AUTO")````r if (params$hide_answers) ":::"`## 2.bWhich airlines have the most flights departing from NYC airports in2013? Make a table that lists these in descending order of frequency andshows the number of flights for each airline. In your table, list thenames of the airlines as well. *Hint:* you can use the `airlines`dataset to look up the airline name based on `carrier` code.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: truepopular_destinations <- flights %>%count(carrier) %>%arrange(desc(n)) %>%inner_join(airlines, by ="carrier") %>%as_tibble()head(popular_destinations, n =16)```The carrier with the highest number of flights departing from NYC airports in 2013 is United Airlines, followed by JetBlue Airways and ExpressJet Airlines.`r if (params$hide_answers) ":::"`## 2.cConsider only flights that have non-missing arrival delay information.Your answer should include the name of the carrier in addition to thecarrier code and the values asked.i. Which carrier had the highest mean arrival delay?ii. Which carrier had the lowest mean arrival delay?Make sure that your answer includes the name of the carrier and thecalculated mean (±SD) delay times, and use a sensible number of decimaldigits.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: trueflights %>%filter(!is.na(arr_delay)) %>%group_by(carrier) %>%summarise(mean_arr_delay =round(mean(arr_delay), 1),sd_arr_delay =round(sd(arr_delay), 1)) %>%arrange(desc(mean_arr_delay)) %>%inner_join(airlines, by ="carrier") %>%slice(c(1, n())) %>%as_tibble()```The longest arrival delay was on Frontier Airlines that, on average, arrived 21.9 ± 61.5 (mean ± SD) minutes late. On the contrary, Alaska Airlines typically arrived earlier than anticipated by 9.9 ± 36.5 (mean ± SD) minutes.`r if (params$hide_answers) ":::"`## 2.dWhat were the mean values for the weather variables at the originairport on the top 10 days with the highest departure delays? Contrastthis with a similar view on the 10 days with the lowest departuredelays. Your table(s) should include the names of origin airports, thedates with the highest (lowest) departure delays, and the mean (±SD)weather variables on these days.Can you make any inferences about the effect of weather conditions onflight delays? Are there any problems with this analysis, and how mightyou improve this analysis for a clearer view of the effect of weatherconditions on the ability of flights to depart on time?`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: true#| warning: falsetop_delay <-bind_rows( flights %>%slice_max(order_by = dep_delay, n =10), flights %>%slice_min(order_by = dep_delay, n =10),) %>%select(carrier, flight, tailnum, dep_delay, month, day, origin) %>%inner_join(weather, by =c("origin", "month", "day")) %>%pivot_longer(cols = temp:visib,names_to ="weather_var",values_to ="value") %>%na.omit() %>%group_by(carrier, flight, tailnum, dep_delay, weather_var) %>%summarise(mean_weather_var =round(mean(value, na.rm =TRUE), ),sd_weather_var =round(sd(value, na.rm =TRUE), 1)) %>%unite("mean_sd", mean_weather_var:sd_weather_var, sep =" ± ") %>%arrange(desc(dep_delay)) %>%pivot_wider(names_from = weather_var, values_from = mean_sd) %>%as_tibble()top_delay```Nothing obvious I can see about the effect of weather variables in affecting the departure delay. We would need to do some multivariate stats to assess.`r if (params$hide_answers) ":::"`## 2.ePartition each day into four equal time intervals, e.g. 00:01-06:00,06:01-12:00, 12:01-18:00, and 18:01-00:00.i. At each time interval, what is the proportion of flights delayed at departure? Illustrate your finding in a figure.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: true# Create time of day variableflights_tod <- flights %>%mutate(time_of_day =case_when( sched_dep_time >=001& sched_dep_time <=600~"00:01-06:00", sched_dep_time >=601& sched_dep_time <=1200~"06:01-12:00", sched_dep_time >=1201& sched_dep_time <=1800~"12:01-18:00", sched_dep_time >=1801~"18:01-00:00" )) %>%mutate(day_of_week =wday(time_hour))# Find proportion of delayed flights for each time of daydelay_time <- flights_tod %>%filter(!is.na(dep_delay)) %>%mutate(dep_delayed =ifelse(dep_delay >0, "delayed", "ontime")) %>%count(time_of_day, dep_delayed) %>%group_by(time_of_day) %>%mutate(prop_delayed = n /sum(n)) %>%filter(dep_delayed =="delayed") %>%arrange(prop_delayed) %>%as_tibble()head(delay_time)``````{r}#| fig-cap: "The figure for Question 6a."#| echo: true#| eval: trueggplot(delay_time, aes(x = time_of_day, y = prop_delayed)) +geom_col(fill ="grey90", alpha =0.7, colour ="indianred") +theme_minimal() +labs(x ="Time of day",y ="Proportion of\nflights delayed")```About 21% of flights are delayed between midnight and 6:00, 26% are delayed between 6:00-12:00, 46% delays between 12:00-18:00, and 52% delays between 18:00pm and midnight. As the day progresses, the better the chance is of there being a delay.`r if (params$hide_answers) ":::"`ii. Based on your analysis, does the chance of being delayed change throughout the day?See answer to Question 2.e.i.iii. For each weekday (1-7) aggregated over 2013, which of the time intervals has the most flights? Create a figure to show your finding.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| fig-cap: "The figure for Question 6c."#| echo: true#| eval: trueflights_tod %>%group_by(time_of_day, day_of_week) %>%summarise(n_flights =n()) %>%ggplot(aes(x = day_of_week, y = n_flights)) +geom_col(fill ="grey90", alpha =0.7, colour ="indianred") +theme_minimal() +labs(x ="Time of day",y ="Number of\nflights") +facet_wrap(~time_of_day)```Midnight to 6-am has the fewest flights, regardless of what day of the week we are looking at -- although Sundays (day 1) and Saturday (day 7) have slightly fewer flights. Similarly, Sun?Sat have slightly fewer flights between 6-am to 6-pm, although the total number of flights are much higher (closer to 20000 flights per day). Evenings from 6-pm to midnight the flights decrease in numbers, and Saturdays have fewer flights during this time than other days.`r if (params$hide_answers) ":::"`## 2.fFind the 10 planes that spend the longest time (cumulatively) in the air.i. For each model, what are the cumulative and mean flight times? In this table, also mention their type, manufacturer, model, number of engines, and speed.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: truecum_flights <- flights %>%group_by(tailnum) %>%summarise(cum_air_time =sum(air_time),mean_air_time =round(mean(air_time, na.rm =TRUE), 1)) %>%slice_max(order_by = cum_air_time, n =10) %>%inner_join(planes, by ="tailnum") %>%select(tailnum, cum_air_time, mean_air_time, type, manufacturer, model, engines, speed) %>%as_tibble()head(cum_flights)````r if (params$hide_answers) ":::"`ii. Create a table that lists, for each air-plane identified in (i.), each flight (and associated destination) that it undertook during 2013. `r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r}#| echo: true#| eval: trueflight_dest <- flights %>%filter(tailnum %in% cum_flights$tailnum) %>%select(tailnum, origin, dest, time_hour) %>%as_tibble()head(flight_dest)````r if (params$hide_answers) ":::"`iii. Summarise all the in formation in (ii.) on a map of the USA. Use lines to connect departure and destination locations (each labelled). Different facets in the figure must be used for each of the 10 planes. You can use the `alpha` value in **ggplot2** such that the colour intensity of overlapping flight lines is proportional to the number of flights taken along the path. For bonus marks, ensure that the curvature of Earth is indicated in the flight lines. *Hint:* such lines would display as curves, not straight lines.`r if (params$hide_answers) "::: {.content-hidden}"````{r}# to be done in due course...````r if (params$hide_answers) ":::"`## 2.gLimit this analysis to only the coldest three winter and warmest threesummer months (show evidence for how this is decided). For each of thesetwo seasons, create a visualisation to explore if there is arelationship between the mean daily departure delay and the mean dailytemperature. Be as economical with your code as possible.Discuss your answer.`r if (params$hide_answers) "::: {.content-hidden}"`**Answer**```{r ec}#| echo: true#| eval: trueseas <- weather %>%group_by(month) %>%summarise(mean_temp =round(mean(temp, na.rm =TRUE), 1)) %>%as_tibble()seasggplot(seas, aes(x = month, y = (mean_temp -32)*5/9)) +geom_col(fill ="grey90", alpha =0.7, colour ="indianred") +scale_x_continuous(breaks =seq(2, 12, by =2)) +theme_minimal() +labs(x ="Month",y ="Mean temp. (°C)")```The coldest months are December, January, and February. The warmest time of year is during June, July, August.```{r}#| echo: true#| eval: trueflights %>%filter(month %in%c(12, 1, 2, 6, 7, 8)) %>%inner_join(weather, by =c("origin", "month", "day")) %>%group_by(month) %>%summarise(mean_dep_delay =mean(dep_delay, na.rm =TRUE),mean_temp =round(mean(temp, na.rm =TRUE), 1)) %>%mutate(seas =c("winter", "winter", "summer", "summer", "summer", "winter")) %>%ggplot(aes(x = (mean_temp -32)*5/9, y = mean_dep_delay)) +geom_point(aes(col = seas)) +theme_minimal() +labs(x ="Mean temp. (°C)", y ="Mean delay (min)")```It seems that, in general, shorter delays are experienced during winter months. To fully assess the effect of weather variables on delays, a more detailed statistical analysis will be required.`r if (params$hide_answers) ":::"`