Randomisation of Rows in a Data Frame

Published

July 11, 2025

Randomising the order of observations is effectively a permutation of the data.frame’s (row) index vector. Because R stores a data.frame as a list of equal-length vectors, shuffling the rows amounts to reordering each vector simultaneously according to a single, randomly drawn permutation.

Base-R

set.seed(42)                       # fix the RNG for reproducibility when needed
df_perm <- df[sample(nrow(df)), ]  # sample() without the replace argument returns a permutation
row.names(df_perm) <- NULL         # (optional) drop the original row indices

sample(nrow(df)) returns the vector 1:n rearranged into a fresh random order, so subsetting the data.frame with that vector reorders every column by the same index and keeps each row intact. The original row names stay attached unless one removes them.
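As a quick check of the behaviour just described, the sketch below builds a small toy data frame (the columns id, x, and y are made up for illustration) and confirms that the shuffle reorders whole rows without breaking them apart.

```r
# Toy data; the column names are illustrative only
df <- data.frame(id = 1:6, x = letters[1:6], y = (1:6)^2)

set.seed(42)
idx <- sample(nrow(df))      # a random permutation of 1:6
df_perm <- df[idx, ]         # every column is reordered by the same index

# Row names still record the original row positions ...
identical(as.integer(row.names(df_perm)), idx)   # TRUE

# ... and each row is intact: y still equals id^2 after shuffling
all(df_perm$y == df_perm$id^2)                   # TRUE

row.names(df_perm) <- NULL   # reset row names to 1..n if the originals are not needed
```

Keeping idx as a separate object is a design choice rather than a requirement: it lets the same permutation be reapplied to a second table or inverted later with order(idx).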

Tidyverse

library(dplyr)

df_perm <- df %>%
  slice_sample(n = nrow(.))

# or, predating slice_sample():
df_perm <- df %>%
  sample_frac(1)

slice_sample() accepts an optional weight_by argument, which permits unequal selection weights should the null hypothesis require them. sample_frac(1), by contrast, asks for a “100 % sample” and, being fraction-based, sidesteps the need to compute nrow(df); in current dplyr, where sample_frac() is superseded, slice_sample(prop = 1) achieves the same.

Statistical context

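Within permutation-based inference, randomising the rows of one set of variables relative to the other (for example, shuffling the response while the predictors stay in place) removes any systematic linkage between predictor and response variables while conserving their empirical marginal distributions. Note that reordering complete rows, with all columns moving together, leaves each predictor-response pairing untouched. Recomputing a test statistic over many such reshufflings yields a distribution that approximates the statistic’s sampling distribution under the null hypothesis of no association. One call to sample() creates one permutation; placing that call inside replicate() supplies the Monte-Carlo ensemble. purrr::rerun() did the same job but has been deprecated since purrr 1.0.0.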
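A minimal sketch of such a Monte-Carlo permutation test follows. It assumes simulated data, the Pearson correlation as the test statistic, and 999 permutations; none of these choices come from the text above, and any statistic computed from the two variables could take the place of cor().

```r
set.seed(1)

# Simulated data with a built-in linear association (illustrative only)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

obs <- cor(x, y)    # observed test statistic

# Null distribution: shuffle y relative to x and recompute the statistic each time
n_perm <- 999
null_stats <- replicate(n_perm, cor(x, sample(y)))

# Two-sided Monte-Carlo p-value; the +1 counts the observed arrangement itself
p_value <- (sum(abs(null_stats) >= abs(obs)) + 1) / (n_perm + 1)
p_value
```

Only y is shuffled here, which is what breaks the pairing between the two variables while leaving the values of each unchanged.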

Randomisation of rows presupposes exchangeability. If one’s data carry temporal, spatial, or hierarchical structure, a naïve global shuffle can invalidate the null model. In such cases one would confine sample() to blocks delineated by the relevant grouping variable, for example via df %>% group_by(g) %>% group_modify(~ .x[sample(nrow(.x)), ]), where g stands in for the grouping column, and so restrict permutations to observations that are exchangeable under the null hypothesis.
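The block-restricted idea can also be sketched in base R. The example below assumes a toy data frame with a hypothetical grouping column site and shuffles the response y within each site only; ave() is used here as one convenient way to permute indices by group, not as the only option.

```r
# Toy data: two sites (blocks), three observations each; names are illustrative
df <- data.frame(
  site = rep(c("A", "B"), each = 3),
  x    = 1:6,
  y    = c(10, 11, 12, 20, 21, 22)
)

set.seed(7)

# Permute row indices separately within each level of `site`.
# ave() applies FUN to each group and returns the results in the original positions.
# The length guard avoids sample() treating a single number k as the range 1:k.
idx <- ave(seq_len(nrow(df)), df$site,
           FUN = function(i) if (length(i) > 1) sample(i) else i)

df_block <- df
df_block$y <- df$y[idx]    # shuffle the response within blocks; x and site stay put
```

After this step every y value still belongs to its original site, so between-site differences are preserved and only the within-site pairing of x and y is randomised.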

Citation

BibTeX citation:
@online{smit,_a._j.2025,
  author = {Smit, A. J.},
  title = {Randomisation of {Rows} in a {Data} {Frame}},
  date = {2025-07-11},
  url = {http://tangledbank.netlify.app/BCB743/randomisation.html},
  langid = {en}
}
For attribution, please cite this work as:
Smit, A. J. (2025) Randomisation of Rows in a Data Frame. http://tangledbank.netlify.app/BCB743/randomisation.html.