Ultra-Fast Large Scale Principal Component Analysis (PCA) in R

In this post, you will learn how to perform principal component analysis (PCA) with lower precision vectors and matrices using the R programming language and the state-of-the-art R package called MPCR. The MPCR package is currently accessible on https://github.com/stsds/MPCR and will soon be conveniently available for download on CRAN.

What is PCA?

PCA is an exploratory data analysis based in dimensions reduction. The general idea is to reduce the dataset to have fewer dimensions and at the same time preserve as much information as possible.

PCA allows us to make visual representations in two dimensions and check for groups or differences in the data related to different states, treatment, etc. Besides, we can get some clue about which variables in the data are responsible for the visual differences.

It is important to highlight that PCA is not used exclusively for the above, and since is an exploratory analysis, the data similarities or differences must be considered in the context where the data come from.

How to Perform PCA?


In this example, I will generate my own dataset with two variables. The following codes show how this can be done in R.

#Generate values for Variable 1.
var_1 <- rnorm(n = 50, mean = 50, sd = 3)

#Generate values for Variable 2.
var_2 <- 0.5 * var_1 + rnorm(n = 50, mean = 0, sd = sqrt(3))

#Combine the values as a data frame.
data_set_1 <- data.frame(var_1, var_2)

#View the entries of the newly created data frame.

#Visualize the relationship between var_1 and var_2 using a scatterplot.
ggplot(data_set_1, aes(x = var_1, y = var_2)) + geom_point(color = "blue", size = 2) + xlab("Variable 1") + ylab("Variable 2") + theme_classic()

Below is the resulting scatterplot:

2. Center each variable.

Compute the mean or average of each variable and subtract from each value its corresponding mean.


#Remove the mean.
data_set_1 <- data_set_1 %>% mutate(var_centered_1 = var_1 - mean(var_1), var_centered_2 = var_2 - mean(var_2))

#Visualize the relationship between var_centered_1 and var_centered_2 using a scatterplot.
ggplot(data_set_1, aes(x = var_centered_1, y = var_centered_2)) + geom_point(color = "blue", size = 2) + xlab("Centered Variable 1") + ylab("Centered Variable 2") + theme_classic()

Removing the mean shifts the position of the points in such a way that the x and y axes are centered around zero, while preserving the relationship between the two variables. This is why the shape of the scatterplot of centered values below is similar to the scatterplot of raw values above.


The built-in R function cov() facilitates the computation of the empirical covariance matrix.

#Extract the centered variables.
data_set_2 <- data_set_1 %>% select(var_centered_1, var_centered_2) %>% as.matrix()

#Compute the empirical covariance matrix.
emp_cov <- cov(data_set_2)

Note that emp_cov is a 2×2 matrix containing in its diagonal the marginal variances of each variable and in its off-diagonals are the covariances between variable 1 and variable 2.


Principal components represent the directions of the data that explain the maximal amount of variance.

The built-in R function eigen() readily computes the eigenvectors and eigenvalues from the empirical covariance matrix.

#Obtain the eigenvectors and eigenvalues.
cov_eigen <- eigen(emp_cov)

#Extract the eigenvectors.
eigen_vec <- cov_eigen$vectors

#Extract the eigenvalues.
eigen_val <- cov_eigen$values

The span of each eigenvector can be considered

#Extract the first eigenvector.
eigen_vec1 <- eigen_vec[,1]

#Compute the slope of the first eigenvector.
eigen_vec1_slope <- eigen_vec1[2] / eigen_vec1[1]

#Extract the second eigenvector.
eigen_vec2 <- eigen_vec[,2]

#Compute the slope of the second eigenvector.
eigen_vec2_slope <- eigen_vec2[2] / eigen_vec2[1]

ggplot(data.frame(data_set_2), aes(x = var_centered_1, y = var_centered_2)) + geom_point(color = "blue", size = 2) + geom_vline(xintercept = 0, size = 0.5) + geom_hline(yintercept = 0, size = 0.5) + geom_abline(slope = eigen_vec1_slope, color = "green", size = 0.7) + geom_abline(slope = eigen_vec2_slope, color = "red", size = 0.7) + xlab("Centered Variable 1") + ylab("Centered Variable 2") + theme_classic()

Below is the scatterplot of the centered values with the green line representing the first eigenvector and the red line representing the second eigenvector:

Regarding to the eigenvalues, their numeric values are equal to the sum of squares of the distances of each projected data point in the corresponding principal component. This sum of squares is maximized in the first principal component.

5. MAke a scree plot.

Dividing each eigenvalue by n – 1 (n is the number of rows in the original data) will provide an estimation of the variance accounted by each principal component. The sum of all the variances (the total variance) can be used to calculate the percentage of variation and visualize them with a Scree Plot:

# Calculate the estimated variance for each eigenvalue
eigen_var <- eigen_val / (nrow(data_set_2) - 1)
# Data frame with variance percentages
var_per <- data.frame(
  PC  = c("PC1", "PC2"),
  PER = c(eigen_var) * 100 / sum(eigen_var) # Calculate the percentage
# Scree plot 
ggplot(var_per, aes(x = PC, y = PER)) +
  geom_col(width = 0.5, color = "black") +
  xlab("Principal component") +
  ylab("Percentage of variation (%)") +