PCA: WHO SDGs

Published

January 1, 2021

Material required for this chapter
Type Name Link
Data The WHO data 💾 WHO.zip
Tasks to complete in this Chapter

The United Nations adopted an agenda for sustainable development and lists 17 development goals to achieve by 2030. These are called the Sustainable Development Goals (SDGs). The World Health Organization assembles a collection of indicators to track how countries are progressing towards these goals so as to achieve “a world free of poverty, hunger, disease and want” (WHO).

This is an ordination analysis of the SDG 3, “Good Health and Well-Being.”

Load packages

library(tidyverse)
library(vegan)
library(missMDA) # to impute missing values
library(ggcorrplot) # for the correlations

# setting up a 'root' file path so I don't have to keep doing it later...
root <- "../data/WHO/"

Define and load the data

Note The combined data and SDG descriptors are in the zip file. They are called SDG_complete.csv and SDG_description.csv, respectively. There is no need to work through the entire process below; you can simply start with loading the combined data. See the section Scale and centre the data and do the PCA, below.

SDG 1.a Domestic general government health expenditure (GGHE-D) as percentage of general government expenditure (GGE) (%)

# define base location of data files

SDG1.a <- read.csv(paste0(root, "WHO_SDG1.a_domestic_health_expenditure.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG1.a")

SDG 3.1 Maternal mortality ratio (per 100 000 live births)

SDG3.1_1 <- read.csv(paste0(root, "WHO_SDG3.1_maternal_mort.csv")) %>%
  filter(Period == 2016,
         Indicator == "Maternal mortality ratio (per 100 000 live births)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.1_1")

SDG 3.1 Births attended by skilled health personnel (%)

SDG3.1_2 <- read.csv(paste0(root, "WHO_SDG3.1_skilled_births.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.1_2")

SDG 3.2 Number of neonatal deaths (Child mortality)

SDG3.2_1 <- read.csv(paste0(root, "WHO_SDG3.2_neonatal_deaths.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.2_1")

SDG 3.2 Number of under-five deaths (Child mortality)

SDG3.2_2 <- read.csv(paste0(root, "WHO_SDG3.2_under_5_deaths.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.2_2")

SDG 3.2 Number of infant deaths (Child mortality)

SDG3.2_3 <- read.csv(paste0(root, "WHO_SDG3.2_infant_deaths.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.2_3")

SDG 3.3 New HIV infections (per 1000 uninfected population)

SDG3.3_1 <- read.csv(paste0(root, "WHO_SDG3.3_new_HIV_infections.csv")) %>%
  filter(Period == 2015,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.3_1")

SDG 3.3 Incidence of tuberculosis (per 100 000 population per year)

SDG3.3_2 <- read.csv(paste0(root, "WHO_SDG3.3_TB.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.3_2")

SDG 3.3 Malaria incidence (per 1 000 population at risk)

SDG3.3_3 <- read.csv(paste0(root, "WHO_SDG3.3_malaria.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.3_3")

SDG 3.3 Hepatitis B surface antigen (HBsAg) prevalence among children under 5 years

SDG3.3_4 <- read.csv(paste0(root, "WHO_SDG3.3_hepatitis_B.csv")) %>%
  filter(Period == 2015) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.3_4")

SDG 3.3 Reported number of people requiring interventions against NTDs

SDG3.3_5 <- read.csv(paste0(root, "WHO_SDG3.3_NCD_interventions.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.3_5")

SDG 3.4 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)

SDG3.4_1 <- read.csv(paste0(root, "WHO_SDG3.4_adult_death_prob.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.4_1")

SDG 3.4 Number of deaths attributed to non-communicable diseases, by type of disease and sex

SDG3.4_2 <- read.csv(paste0(root, "WHO_SDG3.4_NCD_by_cause.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes",
         Dim2 == "Diabetes mellitus") %>%
  mutate(Indicator = Dim2) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.4_2")

SDG3.4_3 <- read.csv(paste0(root, "WHO_SDG3.4_NCD_by_cause.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes",
         Dim2 == "Cardiovascular diseases") %>%
  mutate(Indicator = Dim2) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.4_3")

SDG3.4_4 <- read.csv(paste0(root, "WHO_SDG3.4_NCD_by_cause.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes",
         Dim2 == "Respiratory diseases") %>%
  mutate(Indicator = Dim2) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.4_4")

SDG 3.4 Crude suicide rates (per 100 000 population) (SDG 3.4.2)

SDG3.4_5 <- read.csv(paste0(root, "WHO_SDG3.4_suicides.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.4_5")

SDG3.4 Total NCD Deaths (in thousands)

SDG3.4_6 <- read.csv(paste0(root, "WHO_SDG3.4_NCD_data_total.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.4_6")

SDG 3.5 Alcohol, total per capita (15+) consumption (in litres of pure alcohol) (SDG Indicator 3.5.2)

SDG3.5 <- read.csv(paste0(root, "WHO_SDG3.5_alcohol_consumption.csv")) %>%
  filter(Period == 2015,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.5")

SDG 3.6 Estimated road traffic death rate (per 100 000 population)

SDG3.6 <- read.csv(paste0(root, "WHO_SDG3.6_traffic_deaths_prop.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.6")

SDG 3.7 Adolescent birth rate (per 1000 women aged 15-19 years)

SDG3.7 <- read.csv(paste0(root, "WHO_SDG3.7_adolescent_births.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.7")

SDG 3.8 UHC Index of service coverage (SCI)

SDG3.8_1 <- read.csv(paste0(root, "WHO_SDG3.8_UHC_data_availability.csv")) %>%
  filter(Period == "2013-2017") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.8_1")

SDG 3.8 Data availability for UHC index of essential service coverage (%)

SDG3.8_2 <- read.csv(paste0(root, "WHO_SDG3.8_UHC_index_of_service_coverage.csv")) %>%
  filter(Period == 2017) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.8_2")

SDG 3.8 Population with household expenditures on health greater than 10% of total household expenditure or income (SDG 3.8.2) (%)

Not used for some reason.

# SDG3.8_3 <- read.csv(paste0(root, "WHO_SDG3.8_UHC_percent_of_expenditure_1.csv")) %>%
#   filter(Period == 2016) %>%
#   select(Indicator, ParentLocation, Location, FactValueNumeric)

SDG 3.8 Population with household expenditures on health greater than 25% of total household expenditure or income ( SDG indicator 3.8.2) (%)

Not used for some reason.

# SDG3.8_4 <- read.csv(paste0(root, "WHO_SDG3.8_UHC_percent_of_expenditure_2.csv")) %>%
#   filter(Period == 2016) %>%
#   select(Indicator, ParentLocation, Location, FactValueNumeric)

SDG 3.9 Poison control and unintentional poisoning

SDG3.9_1 <- read.csv(paste0(root, "WHO_SDG3.9_unintentional_poisoning_prop.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.9_1")

SDG 3.9 Indicator 3.9.1: Mortality rate attributed to household and ambient air pollution (per 100 000 population)

Data in a format that’s not easy to use.

# SDG3.9_2 <- read.csv(paste0(root, "WHO_SDG3.9_ambient_air_pollution.csv")) %>%
#   filter(Period == 2016,
#          Dim1 == "Both sexes") %>%
#   select(Indicator, ParentLocation, Location, FactValueNumeric)

SDG 3.9 Mortality rate attributed to exposure to unsafe WASH services (per 100 000 population) (SDG 3.9.2)

SDG3.9_3 <- read.csv(paste0(root, "WHO_SDG3.9_WASH_mortalities.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.9_3")

SDG 16.1 Estimates of rate of homicides (per 100 000 population)

SDG16.1 <- read.csv(paste0(root, "WHO_SDG16.1_homicides.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG16.1")

SDG 3.a Prevalence of current tobacco use among persons aged 15 years and older (age-standardized rate)

SDG3.a <- read.csv(paste0(root, "WHO_SDG3.a_tobacco_control.csv")) %>%
  filter(Period == 2016,
         Dim1 == "Both sexes") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.a")

SDG 3.b Total net official development assistance to medical research and basic health sectors per capita (US$), by recipient country

SDG3.b_1 <- read.csv(paste0(root, "WHO_SDG3.b_dev_assistence_for_med_research.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.b_1")

SDG 3.b Measles-containing-vaccine second-dose (MCV2) immunization coverage by the nationally recommended age (%)

SDG3.b_2 <- read.csv(paste0(root, "WHO_SDG3.b_measles_vaccine.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.b_2")

SDG 3.b Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

SDG3.b_3 <- read.csv(paste0(root, "WHO_SDG3.b_diphtheria_vaccine.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.b_3")

SDG 3.b Pneumococcal conjugate vaccines (PCV3) immunization coverage among 1-year-olds (%)

SDG3.b_4 <- read.csv(paste0(root, "WHO_SDG3.b_pneumococcal_vaccine.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.b_4")

SDG 3.b Girls aged 15 years old that received the recommended doses of HPV vaccine

SDG3.b_5 <- read.csv(paste0(root, "WHO_SDG3.b_HPV_vaccine.csv")) %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.b_5")

SDG 3.b Proportion of health facilities with a core set of relevant essential medicines available and affordable on a sustainable basis

Full data not available.

SDG 3.c SDG Target 3.c | Health workforce: Substantially increase health financing and the recruitment, development, training and retention of the health workforce in developing countries, especially in least developed countries and small island developing States

SDG3.c_1 <- read.csv(paste0(root, "WHO_SDG3.c_health_workforce.csv"))  %>%
  filter(Period == 2016,
         Indicator == "Medical doctors (per 10,000)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.c_1")

SDG3.c_2 <- read.csv(paste0(root, "WHO_SDG3.c_health_workforce.csv"))  %>%
  filter(Period == 2016,
         Indicator == "Nursing and midwifery personnel (per 10,000)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.c_2")

SDG3.c_3 <- read.csv(paste0(root, "WHO_SDG3.c_health_workforce.csv"))  %>%
  filter(Period == 2016,
         Indicator == "Dentists (per 10,000)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.c_3")

SDG3.c_4 <- read.csv(paste0(root, "WHO_SDG3.c_health_workforce.csv"))  %>%
  filter(Period == 2016,
         Indicator == "Pharmacists  (per 10,000)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.c_4")

SDG 3.d SDG Target 3.d | National and global health risks: Strengthen the capacity of all countries, in particular developing countries, for early warning, risk reduction and management of national and global health risks

Data not available.

SDG 3.d Average of 13 International Health Regulations core capacity scores, SPAR version

SDG3.d_1 <- read.csv(paste0(root, "WHO_SDG3.d_health_risks.csv"))  %>%
  filter(Period == 2016) %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "SDG3.d_1")

Other Life expectancy at birth (years)

other_1 <- read.csv(paste0(root, "WHO_Other_life_expectancy.csv")) %>%
  filter(Period == 2015,
         Dim1 == "Both sexes",
         Indicator == "Life expectancy at birth (years)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "other_1")

Other Life expectancy at age 60 (years)

other_2 <- read.csv(paste0(root, "WHO_Other_life_expectancy.csv")) %>%
  filter(Period == 2015,
         Dim1 == "Both sexes",
         Indicator == "Life expectancy at age 60 (years)") %>%
  select(Indicator, ParentLocation, Location, FactValueNumeric) %>%
  mutate(SDG = "other_2")

rbind the data

health_ls = sapply(.GlobalEnv, is.data.frame) 
health <- do.call(rbind, mget(names(health_ls)[health_ls]))

write.csv(health, file = "WHO_health.csv")

Create list of SDGs used

unique(health[, c(5, 1)])
# ...not shown
# write_csv(unique(health[, c(5, 1)]), file = paste0(root, "SDG_description.csv"))

Pivot wider

health_wide <- health %>%
  arrange(Location) %>%
  select(-Indicator) %>%
  pivot_wider(names_from = SDG, values_from = FactValueNumeric) %>%
  as_tibble()
health_wide <- health_wide[2:nrow(health_wide), -3]

Add world population data

popl <- read_csv(paste0(root, "WHO_population.csv")) %>%
  filter(Year == 2016) %>%
  rename(popl_size = `Population (in thousands) total`,
         Location = Country) %>%
  select(Location, popl_size) %>%
  mutate(popl_size = as.numeric(gsub("[[:space:]]", "", popl_size)) * 1000)

health_wide <- health_wide %>%
  left_join(popl)

Express some variables to unit of population size

health_wide <- health_wide %>%
  mutate(SDG3.4_4 = SDG3.4_4 / popl_size * 100000,
         SDG3.4_3 = SDG3.4_3 / popl_size * 100000,
         SDG3.4_2 = SDG3.4_2 / popl_size * 100000,
         SDG3.4_6 = SDG3.4_6 / 100,
         SDG3.2_2 = SDG3.2_2 / popl_size * 100000,
         SDG3.2_3 = SDG3.2_3 / popl_size * 100000,
         SDG3.2_1 = SDG3.2_1 / popl_size * 100000)

Histograms of missing values, and correlations

# calculate histograms
health_wide$na_count <- apply(health_wide[, 3:(ncol(health_wide) - 1)], 1,
                              function(x) sum(is.na(x)))
hist(health_wide$na_count, breaks = 14, plot = TRUE)

# remove rows where there are more than 10 NAs
health_wide <- health_wide %>%
  dplyr::filter(na_count <= 10) %>%
  dplyr::select(-na_count)
# calculate pairwise correlations
corr <- round(cor(health_wide[, 3:(ncol(health_wide) - 1)]), 1)

# visualization of the correlation matrix
ggcorrplot(corr, type = 'upper', outline.col = "grey60",
           colors = c("#1679a1", "white", "#f8766d"),
           lab = TRUE)

Some of the variables are multicollinear. See Graham (2003) for a discussion of collieanrity in ecological data and how to deal with it.

Impute remaining NAs

There are still many remaining NAs, and I impute them with the imputePCA() method in the missMDA package (see Dray and Josse 2015).

health_wide_complete <- imputePCA(health_wide[, 3:(ncol(health_wide) - 1)])$completeObs

# save for later use
# SGD_data <- cbind(health_wide[, 1:2], health_wide_complete)
# write_csv(SGD_data, file = paste0(root, "SDG_complete.csv"))

Scale and center the data and do the PCA

Note The analysis can proceed from here from the SDG_complete.csv and SDG_description.csv files.

health_wide_complete_std <- decostand(health_wide_complete,
                                      method = "standardize")
health_pca <- rda(health_wide_complete_std)
health_pca
Call: rda(X = health_wide_complete_std)

              Inertia Rank
Total              37     
Unconstrained      37   37
Inertia is variance 

Eigenvalues for unconstrained axes:
   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
17.349  3.211  1.967  1.654  1.526  1.357  1.025  0.825 
(Showing 8 of 37 unconstrained eigenvalues)
# summary(health_pca)

Graphical displays

Make figure using the vegan biplot.rda() function:

biplot(health_pca, scaling = 1, main = "PCA scaling 1", choices = c(1, 2))

biplot(health_pca, scaling = 2, main = "PCA scaling 2", choices = c(1, 2))

Assemble the ordination plot using the vegan component functions:

pl1 <- ordiplot(health_pca, type = "none", scaling = 1, main = "PCA WHO/SDG")
points(pl1, "sites", pch = 21, cex = 1.0, col = "grey20", bg = "grey80")
points(pl1, "species", pch = 21, col = "turquoise", arrows = TRUE)
text(pl1, "species", col = "blue4", cex = 0.9)

# text(pl1, "sites", col = "red4", cex = 0.9)

pl2 <- ordiplot(health_pca, type = "none", scaling = 2, main = "PCA WHO/SDG")
points(pl2, "sites", pch = 21, cex = 1.75, col = "grey80", bg = "grey80")
points(pl2, "species", pch = 21, col = "turquoise", arrows = TRUE)
text(pl2, "species", col = "blue4", cex = 0.9)
text(pl2, "sites", col = "red4", cex = 0.9)

Yet another way to make an ordination plot. Notice how I use ggplot() to assemble the figure from pre-assembled dataframes containing the species (SDGs) and site (countries) scores. In the respecitve dataframes I also include appropriate labels that can be used to colour-code the ParentLocation (major groupings of countries).

site_scores <- tibble(ParentLocation = health_wide$ParentLocation,
                      Location = health_wide$Location)
site_scores <- tibble(cbind(site_scores, scores(health_pca, display = "sites", choices = c(1:7))))
species_scores <- data.frame(scores(health_pca, display = "species", choices = c(1:7)))
species_scores$species <- rownames(species_scores)
species_scores <- tibble(species_scores)

ggplot(data = site_scores, aes(x = PC1, y = PC2)) +
  geom_point(aes(col = ParentLocation)) +
  geom_segment(data = species_scores, 
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.4, "cm"), type = "closed"), 
               color = "lightseagreen", alpha = 1, size = 0.3) +
  geom_text(data = species_scores, 
            aes(x = PC1, y = PC2, label = species),
            color = "black") +
  xlab("PC1") + ylab("PC2") + 
  ggtitle("WHO SDGs, Scaling 2")

There seems to be separate groups of colours (ParentLocation). Certain countries come out together in this ordination. This analysis will benefit from a cluster analyses of some kind.

References

Dray S, Josse J (2015) Principal component analysis with missing values: A comparative survey of methods. Plant Ecology 216:657–667.
Graham MH (2003) Confronting multicollinearity in ecological multiple regression. Ecology 84:2809–2815.

Reuse

Citation

BibTeX citation:
@online{j._smit2021,
  author = {J. Smit, Albertus},
  title = {PCA: {WHO} {SDGs}},
  date = {2021-01-01},
  url = {http://tangledbank.netlify.app/BCB743/PCA_SDG_example.html},
  langid = {en}
}
For attribution, please cite this work as:
J. Smit A (2021) PCA: WHO SDGs. http://tangledbank.netlify.app/BCB743/PCA_SDG_example.html.