In this series, we're going to look at dimensionality reduction algorithms for use in the display and analysis of high-parameter flow cytometry data. Dimensionality reduction (DR) tools allow us to look at the complex data from large flow panels in a format that we can actually comprehend, most commonly as 2D plots. These tools are useful for getting an overview of the data, seeing changes between conditions and perhaps, sometimes, understanding relationships between cells.
The goal of this series is to provide some practical guidance on how to use these tools to analyze flow cytometry data. Oftentimes, publications compare these tools using performance metrics that are hard for biologists to interpret. Instead, we'll look at the time it takes to run each method, how easy it is to run the analysis, how the plots look, and do some very basic analysis of how well the visualization represents the data, in terms of how well relationships between biologically similar cell types are maintained.
For these analyses, I'll be using a 50-colour mouse data set from 2021, specifically the lymph node and gut samples. A lot of new fluorophores and conjugates have come on the market since then, so there are now better ways of reaching 50 colours for flow. You can find the manuscript and data online, and I'll provide the R code and data in this Dropbox folder as well as on GitHub in case you want to replicate it or run your own data through any of these methods.
Some quick notes on the pre-processing of the data:
Data scaling is critical for all of these methods (see this paper for an example of how this matters). We'll look at how scaling affects the output in a future post. For now, all the data are scaled in FlowJo prior to analysis in R. For the aspects where we look at cell type relationships, the clustering has been performed using FlowSOM via EmbedSOM with automated cluster identification based on marker expression. To reduce the analysis time a bit, the data are downsampled to 20,000 cells per sample (120,000 in total). All of these steps are covered in the pre-processing code, which is also provided.
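The downsampling step is simple enough to sketch in a few lines. The actual pre-processing for this post is in the R code provided; the Python below is only a minimal illustration, and the function name, the sample names and the random matrices standing in for scaled events are all made up.

```python
import numpy as np

def downsample_per_sample(samples, n_cells=20_000, seed=42):
    """Randomly keep at most n_cells rows (cells) from each sample matrix."""
    rng = np.random.default_rng(seed)
    kept = {}
    for name, events in samples.items():
        n_keep = min(n_cells, events.shape[0])
        # Sample row indices without replacement so no cell is duplicated.
        idx = rng.choice(events.shape[0], size=n_keep, replace=False)
        kept[name] = events[np.sort(idx)]
    return kept

# Hypothetical stand-in data: 6 samples, 50 markers, varying cell counts.
rng = np.random.default_rng(0)
samples = {
    f"sample_{i}": rng.normal(size=(25_000 + 1_000 * i, 50))
    for i in range(6)
}

subset = downsample_per_sample(samples)
combined = np.vstack(list(subset.values()))
print(combined.shape)  # (120000, 50)
```

Taking the same number of cells from every sample keeps any one sample from dominating the embedding, at the cost of discarding events from the larger files.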
Before we start, let's review what these methods are trying to achieve. They are all trying to preserve some aspects of the data's structure (relationships between points) while reducing the dimensions down to a number that we can visualise. Some algorithms, like tSNE and UMAP, focus on preserving local structure, while others, like PCA, PaCMAP and PHATE, try to preserve global structure. Local structure preservation more or less means keeping relationships correct within "islands" or "blobs" on the plots, as opposed to keeping a consistent distance metric between blobs. The PaCMAP paper has a really nice breakdown of the difference between local and global structure preservation using the "Mammoth" dataset, which is data in the shape of a pachyderm that gets distorted in various ways by different algorithms.
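One rough way to put a number on "local structure" is to ask what fraction of each cell's nearest neighbours in the original space are still among its nearest neighbours in the 2D embedding. The sketch below is not taken from this series' code or from the PaCMAP paper. The function names and toy data are made up for illustration, and the brute-force distance calculation only suits a few thousand points.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of every row (brute force)."""
    sq_norms = (points ** 2).sum(axis=1)
    # Squared Euclidean distances between all pairs of rows.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def knn_preservation(high_dim, embedding, k=10):
    """Mean fraction of high-dimensional k-NN retained in the embedding."""
    nn_high = knn_indices(high_dim, k)
    nn_low = knn_indices(embedding, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlap)) / k

# Hypothetical toy data: 500 "cells" with 50 "markers".
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 50))

same = knn_preservation(data, data)                            # identical spaces
random_2d = knn_preservation(data, rng.normal(size=(500, 2)))  # unrelated layout
print(same, random_2d)
```

A score near 1 means neighbourhoods survive the embedding, while an unrelated layout scores close to zero. Note that this says nothing about global structure: two islands can each be internally perfect and still sit at meaningless distances from one another.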
To start, I recommend reading this excellent paper looking at using various DR tools for mass cytometry data. They found that tSNE, UMAP, PHATE, scvis and SQuaD-MDS performed well. In this series, I'm going to focus on methods that are straightforward to implement in R.
This includes:
PCA
tSNE
UMAP
PHATE
TriMAP
PaCMAP
densMAP
denSNE
EmbedSOM
Today, let's look at the plots that come out and how long it takes to generate each one. All of the processing is being done on a pretty standard Windows laptop with 8 cores, so nothing fancy.
PCA
PCA is quick. It isn't really meant for reducing the data down into two dimensions, but the first two dimensions of the PCA result generally carry a lot of information.
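For reference, this is roughly what taking "the first two dimensions" means. It's a minimal sketch using NumPy's SVD on made-up data, not the R call used for the timings in this post, and `pca_2d` is a hypothetical helper name.

```python
import numpy as np

def pca_2d(events):
    """Project cells (rows) onto the first two principal components."""
    centred = events - events.mean(axis=0)
    # Rows of vt are the principal axes, ordered by variance explained.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:2].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:2]

# Hypothetical stand-in for scaled marker data: 1,000 cells, 50 markers.
rng = np.random.default_rng(0)
events = rng.normal(size=(1_000, 50))
events[:, 0] *= 5  # give one marker much more spread than the rest

scores, explained = pca_2d(events)
print(scores.shape)     # (1000, 2)
print(explained.sum())  # fraction of total variance kept in the 2D plot
```

The explained-variance fraction is a quick check on how much of the data the 2D PCA plot is actually showing; with 50 markers it is often well short of the whole picture.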
How long does it take?
user system elapsed
0.72 0.02 1.25
That's quick. Just over a second (1.25).
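A quick note on reading these timings, which come from R's system.time(): user and system are CPU time, while elapsed is the wall-clock time you actually wait. For methods that run on several cores, user time can be much larger than elapsed because it adds up the work done across threads. Here is a rough Python equivalent as a sketch; the `timed` helper is hypothetical and not part of the code for this post.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, elapsed_seconds)."""
    cpu_start = time.process_time()    # user + system CPU time of this process
    wall_start = time.perf_counter()   # wall-clock time
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return result, cpu, elapsed

# Toy workload standing in for a dimensionality reduction run.
result, cpu, elapsed = timed(sum, range(1_000_000))
print(f"cpu {cpu:.2f}s, elapsed {elapsed:.2f}s")
```

For comparing methods in this series, elapsed is the number to watch.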
How does it look?
Well, the clusters aren't separating very well, and there's a lot of confusion in the center. Not great.
The plots here are being generated using the scattermore package created by Mirek Kratochvil. I highly recommend this for anyone trying to use R to plot large numbers of points. So much faster.
Is it easy to run?
If you're comfortable with coding, absolutely. Otherwise, not so much. I'm not aware of this being available in any flow analysis software, but correct me if I'm wrong.
tSNE
tSNE is a very robust method that provides great preservation of the relationships between similar cells. It tells us next to nothing about points that are far away from each other. Generally, if there's white space between the points, the distance is meaningless.
The tSNE here is being run in a manner that replicates most of the modifications of OptSNE. This includes massively increasing the learning rate (also used in OpenTSNE), as well as reducing both the early exaggeration and total iterations. This really speeds it up while giving us quite good results. The perplexity parameter provides some control over how many neighboring cells influence the position of each cell, so larger values give slightly better preservation of far-away relationships.
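To give a feel for what "massively increasing the learning rate" means: as far as I know, the opt-SNE authors suggest setting the learning rate to the number of cells divided by the early exaggeration factor, compared with the fixed value of 200 that many older tSNE implementations default to. The snippet below only does that arithmetic. The exaggeration factor of 12 and the baseline of 200 are assumptions based on common defaults, not necessarily the settings used for the plots in this post, and the function name is made up.

```python
def optsne_style_learning_rate(n_cells, early_exaggeration=12.0):
    """Learning rate heuristic: number of cells / early exaggeration factor."""
    return n_cells / early_exaggeration

n_cells = 120_000        # 6 samples x 20,000 cells, as in this post
classic_default = 200.0  # assumed traditional fixed learning rate

learning_rate = optsne_style_learning_rate(n_cells)
print(learning_rate)                    # 10000.0
print(learning_rate / classic_default)  # 50.0
```

With a data set of this size, that is a 50-fold jump over the assumed classic default, which is a large part of why fewer iterations are needed.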
How long does it take?
user system elapsed
1446.50 2.36 323.33
A lot slower than the PCA, coming in at 323sec, or just over 5min.
How does it look?
Yeah, pretty good. The different clusters are well separated, and more common cell types take up more space on the plot. We get nice agreement with the FlowSOM clustering. You can see why tSNE + FlowSOM is a popular combination.
Is it easy to run?
Yes. tSNE is available in most flow cytometry analysis software packages, online in tools like Cytobank and Omiq as well as R, Python, etc. Some of these are easier than others, and the web-based ones are probably the easiest (and most expensive).
tSNE with PCA initialization
This is often used for scRNA-Seq. Running PCA first can give the tSNE a better starting point, particularly when the expression of several genes moves in concert or when there are drop-outs. For flow data, this doesn't usually matter.
Time and appearance are essentially unchanged. Some of the blobs have rotated around a bit, which will happen every time you run a tSNE with a different starting point. While you'll have control over PCA-initialization in R or Python, this is less likely in commercial software.
UMAP
UMAP purports to preserve global distance better than tSNE. It does, but that's a bit like saying Perth is closer to London than Sydney is. It still focuses on local structure, so don't try to interpret distances across white spaces on the plot.
How long does it take?
Slightly quicker than tSNE, done in about 5min. Note that I'm using the uwot library, which runs in parallel, unlike the other umap package for R. The Python implementation may be quicker. Also, standard tSNE takes much longer to achieve the type of results seen here with the OptSNE approach.
How does it look?
There's a lot more white space in a UMAP than in a tSNE. The blobs are more compressed and separated. UMAP tends to have trailing tails leading between blobs, unlike the more rounded shape of tSNE.
Is it easy to run?
Nearly as easy as tSNE. UMAP is accessible via plug-in only for FlowJo, running in R. This honestly is harder (for me) than running it directly in R. FCS Express, on the other hand, has native UMAP support.
EmbedSOM
EmbedSOM arose out of FlowSOM and also uses self-organizing maps to determine the structure of the data.
How long does it take?
Oh, it's quick! About 6sec. And this is the slow version. For really big data, try GigaSOM.
How does it look?
EmbedSOM is pretty different to tSNE or UMAP in appearance. The islands are a lot more connected to each other, which suggests it may preserve some of the global structure. The same cell clusters are grouped together as in UMAP or tSNE (e.g., the two clusters of naive CD8s, orange and purple in the upper right).
EmbedSOM has a lot of options that allow you to get different embeddings. The one above is the default, vanilla EmbedSOM based on a standard self-organizing map. Below we've got an embedding using the growing quadtree SOM, which gives a better representation of the underlying data according to the authors. This takes slightly longer to run (~19sec rather than ~6sec).
In the future we'll look at the "cheap trick" that EmbedSOM uses to replicate any dimensionality reduction algorithm in a fraction of the time.
Is it easy to run?
Not quite as accessible as tSNE or UMAP, although it's really easy to use in R. EmbedSOM is accessible via plug-in only for FlowJo, running in R again.
PHATE
PHATE was developed for scRNA-Seq data. It tries to preserve both local and global structure by creating branches in the data. This is great if you're looking at developmental processes, relatedness between cells or perhaps time/pseudotime relationships.
How long does it take?
user system elapsed
2755.06 7.96 405.25
The implementation in R is fairly slow at almost 7min. About 80% of the time is dedicated to finding the k-nearest neighbors (only 5 in this case). The uwot and RcppHNSW packages can do this in a fraction of the time, so I suspect phateR could do with some optimization.
How does it look?
Different. Not very aesthetically pleasing to my eye, but it shows connections between related cell types. For instance, we can follow the large red cluster of follicular B cells up through B cells to germinal centre (GC) B cells, or see how the two naive CD8 clusters (left) connect to the CD8 gamma-delta T cells. That's useful.
Is it easy to run?
It runs easily in R or Python. There's a plug-in for FlowJo, running in R.
densvis
Work here by Hyunghoon Cho and colleagues implements an algorithm that preserves the density of the data in the original space in the final embedding. In a high dimensional dataset, a lot of the space is empty. In flow data particularly, though, there are areas with a high density. For instance, if you have a panel focused on T cells, but include CD14 to identify monocytes or CD19 to identify B cells, those cells will occupy a small area because they won't express any of the T cell markers that would disperse them across those channels or dimensions. Densvis integrates this density-preserving algorithm into tSNE and UMAP, with user control over how much influence it has.
Let's look at both tSNE and UMAP with a typical 50% weighting for density preservation.
How long does it take?
densMAP
user system elapsed
0.13 0.00 185.12
denSNE
user system elapsed
2010.77 3.06 2193.42
densMAP is fast, similar to UMAP running via Python, coming in at 3min. denSNE is very slow at about 37min (2193sec). This is despite using OptSNE settings to speed it up. In fact, it's about 7x slower than the tSNE above because it isn't actually running in parallel, at least for me (note that the user and elapsed times are nearly the same). Perhaps it works better in Python.
How does it look?
densMAP:
The density-preserved UMAPs always end up compressed for me, so we can't appreciate the data very well. This seems to be happening because there are a few outlier cells in low-density regions that are taking up most of the space. Changing the weighting makes this more or less pronounced. Perhaps this would be less of an issue with lower parameter data sets, but I suspect densMAP isn't great for flow data.
denSNE:
Here we can actually see what densvis is trying to accomplish and it's pretty useful. The fuzzy areas are low-density, while the saturated areas have a high density in both the original data and in the embedding. These are the naive follicular B cells (red), naive CD8 clusters (purple and orange), naive CD4 T cells (blue) and intestinal gamma-delta CD8 T cells (green, less so). These are cell types with very uniform expression of markers, and this visualization reflects that information in a way that other dimensionality reductions don't.
Is it easy to run?
In Python, yes. In R, you need to have Python installed and set up correctly.
TriMAP
TriMAP attempts to preserve the global structure of the data and does so better than UMAP or tSNE. So, with TriMAP, the distances between islands are actually somewhat meaningful.
How long does it take?
user system elapsed
207.06 105.76 180.94
Fairly fast at 3min.
How does it look?
Similar to UMAP, but with more tails connecting the islands. The connections mostly make sense, too. We've got T cell groups in the lower left with connections between them. On the lower right, the B cells connect slightly to pDCs (centre) and nearly to the myeloid cluster of cDCs and monocytes (middle right). The eosinophils are pretty far out to the top, with a bit of a connection to monocytes.
Is it easy to run?
In Python, yes. In R, you can run it in Python via reticulate, as I've done here.
PaCMAP
PaCMAP tries for the best of both worlds, aiming to keep both local and global structure intact. In toy data sets like the mammoth, this works remarkably well.
How long does it take?
user system elapsed
230.42 129.63 115.49
Pretty fast. Just under 2min.
How does it look?
Quite similar to UMAP for this data set.
Is it easy to run?
In Python, yes. In R, you can run it in Python via reticulate, as I've done here.
References and further reading:
Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization | Communications Biology (nature.com)
A cross entropy test allows quantitative statistical comparison of t-SNE and UMAP representations - PMC (nih.gov)
[2012.04456] Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMAP, and PaCMAP for Data Visualization (arxiv.org)
YingfanWang/PaCMAP: PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure (github.com)
What always bugs me is the dependence on manual gating. All these colors represent a bunch of cell populations that we can only arrive at manually. Only once that is done can we overlay these populations on the map. This means any incorrect gating or bias can change the data visualization.