In a previous post, I compared a few dimensionality reduction approaches for visualizing flow cytometry data, plotting the data with each approach and coloring the cells by their FlowSOM cluster. Today we're going to look at how well these approaches maintain the structure of the data.
Fair warning: I'm not a mathematician and this is not intended to be definitive. I'm just hoping this provides some insight into differences in functioning between these algorithms.
To understand whether the dimensionality reduction algorithms are "preserving" the shape of the original data, we need a metric for that shape. Whatever we choose to measure in the multidimensional flow data is going to influence our analysis of how well that structure is preserved. I'm choosing to use FlowSOM clusters as my waypoints in the data, and to treat the distances between clusters as the structure of the data. In the multidimensional space of the original data, that's hard to visualize, but you might think about it like this on a tSNE plot:
On this plot, we have a short distance between the purple Naive CD8 cluster and the orange Naive CD8 cluster. The line between the purple Naive CD8 cluster and the red Follicular B cell cluster is longer. So, the CD8 clusters are closer to each other.
If we take all those measurements and draw a dendrogram of the relationships between the clusters in the multidimensional space, we get this:
It's not perfect, but there are groupings that make sense like the T cell branch, the ILC branch and the pDC branch. As a reminder, on a dendrogram, the longer you have to travel up a branch and back out again, the less related the cell types are.
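The cluster-to-cluster distances described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the exact procedure used for the figures: it assumes each cluster is summarized by the mean of its cells and that distances are Euclidean, and the function names and toy marker values are made up for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def cluster_centroids(cells, labels):
    """Return {cluster label: mean marker vector} for the given cells."""
    groups = defaultdict(list)
    for cell, label in zip(cells, labels):
        groups[label].append(cell)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }


def centroid_distances(centroids):
    """Return {(label_a, label_b): Euclidean distance} for every cluster pair."""
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


# Toy data: three markers per cell (hypothetical values, not real cytometry data).
cells = [
    [5.0, 0.1, 0.0],  # naive CD8, first cluster
    [5.2, 0.0, 0.1],
    [4.8, 0.3, 0.0],  # naive CD8, second cluster
    [4.6, 0.5, 0.1],
    [0.1, 0.0, 6.0],  # follicular B cells
    [0.0, 0.2, 5.8],
]
labels = ["CD8_a", "CD8_a", "CD8_b", "CD8_b", "FoB", "FoB"]

distances = centroid_distances(cluster_centroids(cells, labels))
# Print the cluster pairs from closest to farthest.
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))
```

The same two functions work on the 2-D coordinates of an embedding, which gives one set of cluster distances per dimensionality reduction. Feeding those distances to a hierarchical clustering routine is what produces a dendrogram.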
What we can do now is run this analysis for each type of dimensionality reduction covered in the earlier post and in the one on EmbedSOM. Using those distance relationships, we can create a meta-dendrogram of how the dimensionality reduction approaches relate to each other. Remember, this is about how they deal with the relationships between FlowSOM clusters.
At first, we get this:
This suggests that none of our dimensionality reduction approaches preserves the distance relationships between clusters, at least as they were in the original multidimensional (multid) dataset.
Why?
The multidimensional space has more dimensions, so the raw distances in it are much larger than those in a two-dimensional embedding. We can't really compare the distances directly, so we need to put them all on the same scale. If we normalize the distances, we get this dendrogram:
Notably with this, EmbedSOM and EmbedSOM based on growing quad-trees (GQT) are close together, UMAP and EmbedSOM with UMAP landmarks are similar, and PHATE and EmbedSOM with PHATE landmarks are on the same branch. That suggests this approach has some validity.
Interestingly, PaCMAP is quite close to the original (multid) data, as is DenSNE. PaCMAP is designed to preserve the global and local structure, and it appears to be doing that here.
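To make the effect of normalization concrete, here is a small sketch. It assumes the simplest possible rescaling, dividing each method's cluster distances by that method's largest distance; the post does not say which normalization was used, so treat this as one option among several. The profile names and numbers are hypothetical.

```python
import math


def normalize(distances):
    """Rescale a list of distances so the largest one equals 1."""
    largest = max(distances)
    return [d / largest for d in distances]


def profile_distance(a, b):
    """Euclidean distance between two normalized distance profiles."""
    return math.dist(normalize(a), normalize(b))


# Hypothetical cluster-pair distances, listed in the same pair order for each method.
profiles = {
    "multid": [12.0, 60.0, 58.0],
    "embedding_1": [2.0, 10.0, 9.5],  # similar proportions, smaller scale
    "embedding_2": [8.0, 9.0, 3.0],   # different proportions
}

# Without normalization, the difference in scale dominates the comparison.
raw_gap = math.dist(profiles["multid"], profiles["embedding_1"])
# After normalization, only the relative arrangement of the clusters matters.
gap_1 = profile_distance(profiles["multid"], profiles["embedding_1"])
gap_2 = profile_distance(profiles["multid"], profiles["embedding_2"])
print(f"raw: {raw_gap:.1f}, normalized: {gap_1:.3f} vs {gap_2:.3f}")
```

In this toy case the first embedding looks far from the original data before normalization, purely because its coordinates are smaller. Once rescaled, it sits much closer to the original than the second embedding does.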
Neighborhood visualization
Another way of understanding the data is to look at the nearest neighbors (k-NN) of each cell. This means we're looking at the relationships between cells rather than clusters, so we're no longer relying on an external reference like FlowSOM to define the structure of the data. Several of these algorithms also use k-NN to build the embedding, so it's a natural thing to compare.
k-NN struggles with high dimensional data, though. In the original space, a lot of the dimensions are pretty empty. Think about it this way: if we have B cell markers in the data, the T cells will all be centered around zero for those channels and vice versa. It's hard to figure out what the nearest neighbors are if they're all really far apart.
Furthermore, although some of the dimensionality reduction approaches use k-NN, they won't be calculating it exactly the same way. Really, if we want to assess how well the dimensionality reductions work, we should be running them on the same input k-NN matrix. This can be done in R for tSNE and UMAP. For tSNE, use the Rtsne_neighbors function from the Rtsne package; in uwot, pass the precomputed neighbors to the nn_method argument.
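One way to put a number on the per-cell neighborhoods discussed in this section is to check how many of each cell's nearest neighbors in the original space are still its nearest neighbors in the embedding. The sketch below is an illustration under assumptions: it uses exact brute-force Euclidean k-NN, which only suits small examples, and the function names and toy coordinates are hypothetical. It is not how Sleepwalk or any of the algorithms above compute neighbors.

```python
import math


def knn(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def knn_overlap(high_dim, low_dim, k):
    """Mean fraction of each cell's k nearest neighbours kept in the embedding."""
    high = knn(high_dim, k)
    low = knn(low_dim, k)
    return sum(len(h & l) / k for h, l in zip(high, low)) / len(high)


# Toy data: two tight groups of three cells in a 3-marker space.
high_dim = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0], [5.0, 5.1, 5.0],
]
# An embedding that keeps the groups together.
faithful = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
# An embedding that mixes cells from the two groups.
scrambled = [[0.0, 0.0], [3.0, 3.0], [0.1, 0.0], [3.1, 3.0], [0.0, 0.1], [3.0, 3.1]]

print("faithful:", knn_overlap(high_dim, faithful, k=2))
print("scrambled:", round(knn_overlap(high_dim, scrambled, k=2), 2))
```

A score near 1 means the embedding keeps most local neighborhoods intact, and a low score means cells have been moved away from their original neighbors. On real cytometry data the brute-force search would be far too slow, so an approximate nearest neighbor library would be the practical choice.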
As Tyler Burns has recently highlighted, the Sleepwalk package developed by Svetlana Ovchinnikova and Simon Anders provides a great visualization tool for comparing how cells are placed on different dimensionality reduction plots.
Here are some videos of the neighborhoods of cells on different embeddings.
For a fully interactive comparison of tSNE, UMAP, PaCMAP and EmbedSOM built on PaCMAP landmarks, visit Sleepwalk (drcytometer.github.io).
Finally, I'm going to mention this pre-print from David Novak in Yvan Saeys's lab. This project develops a new dimensionality reduction approach called ViVAE that aims to preserve the overall structure of the data better, both locally like tSNE and at longer range like PCA or MDS. The authors also developed some math to compare and quantify how well different algorithms preserve both kinds of structure. ViVAE scores well in both categories, and is likely to be a good method for analyzing scRNA-Seq and high dimensional cytometry data. As an aside, in this paper and others, tSNE performs best for preserving local structure in the data (relationships within blobs).