Previously, we looked at automating the identification of clusters of cells in high-dimensional flow cytometry data. The method in flowcytoscript is a very basic approach adapted from ScType by Aleksander Ianevski (article here). It is now available as stand-alone code that you can use independently of flowcytoscript to identify, or automate the naming of, your clusters. Code is available on GitHub.
What is it, how does it work, how well does it work?
It's R code.
It matches the expression of markers in your clusters to cell type definitions in a spreadsheet.
It's not amazing.
I'll be working on improving this, but you may still find the code useful for quickly getting an idea of which cell types may be present in a cluster. There are several problems with trying to automate cluster identification, even with the best of data. One of the major ones is that clustering algorithms don't split cells into the neat groupings we expect to find, and if you under-cluster your data, you'll have multiple cell types in a given cluster. Second, unlike scRNA-Seq data, flow cytometry data don't give us access to all of the expression information, just the markers we chose to include in our panel. That makes it practically impossible to identify all cell types all of the time. Still, it can be better than it currently is.
If you're interested in this problem, you may wish to get in touch with Ryan Brinkman of Dotmatics regarding the SOULCAP project.
Anyway, let's have a look at an example of automated cluster annotation using Florian Mair's OMIP-102 50-colour human PBMC data. This is close to ideal data for the purpose because it's deep phenotyping covering well-characterized cell types.
I've pre-processed the data in FlowJo, setting biexponential transforms as detailed here. I've then exported the data as a channel values CSV file, preserving those transforms in the written data. The data get clustered in R using FlowSOM (via EmbedSOM and ConsensusClusterPlus), and for ease of visualization, I've assigned a color to each cluster and mapped the data to a tSNE. The details of all of this are available in the omip102_example file on GitHub. The actual data from OMIP-102 are too big to put on GitHub, so reach out if you want a copy.
Pre-processed data from OMIP-102:
To run your data like this example, you want your CSV file to look like this:
Each row is a cell and has the expression value for each marker (columns). Additionally, there's a column for which cluster the cell belongs to (called Cluster). Add in the Clust_color and tSNE or UMAP columns if you want to create nice graphs (otherwise delete the relevant parts from the code).
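If you want to sanity-check a file before running the annotation, something like the sketch below may help. It is written in Python rather than R, uses only the standard library, and is not part of the GitHub code. The Cluster and Clust_color column names come from the description above; treating any column whose name starts with "tSNE" or "UMAP" as an embedding column is my assumption, and the example values are made up.

```python
import csv
import io

# Made-up example rows: three markers, then Cluster, Clust_color and tSNE columns.
EXAMPLE_CSV = """CD3,CD19,CD56,Cluster,Clust_color,tSNE1,tSNE2
512.3,40.1,35.0,1,#E41A1C,-12.4,3.3
498.7,38.9,41.2,1,#E41A1C,-11.9,2.8
45.0,610.5,30.7,2,#377EB8,8.1,-4.6
"""

def is_plot_column(name):
    # Assumption: embedding columns are named like tSNE1/tSNE2 or UMAP1/UMAP2.
    return name == "Clust_color" or name.lower().startswith(("tsne", "umap"))

def summarize_layout(csv_text):
    """Check the one-row-per-cell layout and report markers and cluster sizes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no cells found in the file")
    columns = list(rows[0].keys())
    if "Cluster" not in columns:
        raise ValueError("missing the required 'Cluster' column")
    markers = [c for c in columns if c != "Cluster" and not is_plot_column(c)]
    cells_per_cluster = {}
    for row in rows:
        for marker in markers:
            float(row[marker])  # raises ValueError if a marker value is not numeric
        cluster = row["Cluster"]
        cells_per_cluster[cluster] = cells_per_cluster.get(cluster, 0) + 1
    return {"markers": markers, "cells_per_cluster": cells_per_cluster}

summary = summarize_layout(EXAMPLE_CSV)
print(summary["markers"])
print(summary["cells_per_cluster"])
```

To check a real export, read the file contents into a string and pass that to summarize_layout in place of EXAMPLE_CSV.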
There are a couple of points in the code where you can provide more information to get better results with the annotations. In particular, you'll need to specify which species you're working with and which tissue source(s) you've got.
The code then tries to standardize your marker names so it can match them to the cell type database. If you've written your marker names in a non-standard way, you can either change them yourself or update the marker_names spreadsheet to add your way of writing things.
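To give a feel for what this kind of name standardization involves, here is a minimal Python sketch. It is not the R code from the repository, and the small synonym table is a hypothetical stand-in for the marker_names spreadsheet. The idea is to strip case and punctuation before looking a name up, and to report anything that couldn't be matched so you know what to add.

```python
import re

# Hypothetical synonym table standing in for the marker_names spreadsheet.
SYNONYMS = {
    "CD8": ["CD8", "CD8a"],
    "HLA-DR": ["HLA-DR", "HLADR", "HLA DR"],
    "CD127": ["CD127", "IL-7Ra", "IL7R"],
}

def lookup_key(name):
    # Ignore case, spaces, hyphens and underscores when comparing names.
    return re.sub(r"[^a-z0-9]", "", name.lower())

LOOKUP = {
    lookup_key(alias): canonical
    for canonical, aliases in SYNONYMS.items()
    for alias in aliases
}

def standardize(names):
    """Return (matched, unmatched): a name-to-canonical map and the leftovers."""
    matched, unmatched = {}, []
    for name in names:
        canonical = LOOKUP.get(lookup_key(name))
        if canonical is None:
            unmatched.append(name)
        else:
            matched[name] = canonical
    return matched, unmatched

matched, unmatched = standardize(["cd8a", "HLA_DR", "Il-7RA", "MyMarker"])
print(matched)
print(unmatched)
```

Anything that ends up in the unmatched list is a candidate for either renaming in your data or adding to the spreadsheet.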
Examples from the marker_names spreadsheet:
Once that's done, the code scores each cluster against the definitions in the celltype_database.
Each cell type is listed with the markers it does and doesn't express (in theory). You can modify the spreadsheet to add new cell types or change the definitions of existing cell types. Contact me if there are specific items you'd like added.
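As a rough illustration of scoring clusters against positive and negative marker lists, here is a simplified Python sketch loosely in the spirit of ScType-style scoring. It is an assumption-laden toy, not the scoring used in the repository: it z-scores per-cluster median expression across clusters, adds the scaled values of markers a cell type should express, subtracts those it shouldn't, and divides each sum by the square root of the number of markers used. The medians and the three cell type definitions are invented for the example.

```python
from math import sqrt
from statistics import mean, pstdev

# Toy per-cluster median expression (made-up numbers).
cluster_medians = {
    "1": {"CD3": 900, "CD19": 50, "CD56": 40},
    "2": {"CD3": 60, "CD19": 850, "CD56": 30},
    "3": {"CD3": 55, "CD19": 45, "CD56": 700},
}

# Toy definitions standing in for rows of the celltype_database spreadsheet.
definitions = {
    "T cell": {"positive": ["CD3"], "negative": ["CD19", "CD56"]},
    "B cell": {"positive": ["CD19"], "negative": ["CD3"]},
    "NK cell": {"positive": ["CD56"], "negative": ["CD3", "CD19"]},
}

def zscore_by_marker(medians):
    """Scale each marker across clusters so markers are comparable."""
    markers = list(next(iter(medians.values())).keys())
    scaled = {cluster: {} for cluster in medians}
    for marker in markers:
        values = [medians[cluster][marker] for cluster in medians]
        mu, sd = mean(values), pstdev(values)
        for cluster in medians:
            scaled[cluster][marker] = 0.0 if sd == 0 else (medians[cluster][marker] - mu) / sd
    return scaled

def score(scaled_row, definition):
    """Positive markers raise the score, negative markers lower it."""
    pos = [m for m in definition["positive"] if m in scaled_row]
    neg = [m for m in definition["negative"] if m in scaled_row]
    total = 0.0
    if pos:
        total += sum(scaled_row[m] for m in pos) / sqrt(len(pos))
    if neg:
        total -= sum(scaled_row[m] for m in neg) / sqrt(len(neg))
    return total

def annotate(medians, defs):
    """Label each cluster with its highest-scoring cell type."""
    scaled = zscore_by_marker(medians)
    labels = {}
    for cluster, row in scaled.items():
        scores = {cell_type: score(row, d) for cell_type, d in defs.items()}
        labels[cluster] = max(scores, key=scores.get)
    return labels

labels = annotate(cluster_medians, definitions)
print(labels)
```

Markers missing from the panel are simply skipped in this sketch, which is one reason a sparse panel can leave two cell types with near-identical scores.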
Examples from the human cell type spreadsheet:
The script creates a heatmap showing the scores for each cluster against every cell type (red = high score).
And we also get histograms (ridgeline plots) for each cluster showing the expression of all markers. This is helpful for evaluating the labels that have been assigned to each cluster. Are these all accurate? I doubt it. Some of the B cell subsets and pDC/Basophil distinctions seem problematic. This is a work in progress. Suggestions are appreciated.
Histograms of marker expression with annotated clusters
If you've clustered a lot, like this example, there will probably be multiple clusters per cell type. For instance, there are multiple NK clusters. What's the difference between these?
To address this, we can run the second part of the cluster annotation, which figures out which markers are the most differentially expressed between clusters with the same label. It then adds those marker names to the cluster label if the cluster expresses the marker. This generates longer, usually unique names. In this case, one of the NK clusters is now called NK CD16.
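The refinement step can be sketched roughly as follows, again in Python rather than the repository's R. This is a toy under stated assumptions: "most differentially expressed" is approximated by the largest range of median expression among clusters sharing a label, and "expresses the marker" by a hypothetical fixed cutoff (positive_above). The real code may define both differently, and the numbers are made up.

```python
# Toy per-cluster medians and first-pass labels (made-up numbers).
cluster_medians = {
    "7": {"CD56": 650, "CD16": 800, "CD57": 100},
    "8": {"CD56": 720, "CD16": 90, "CD57": 80},
    "2": {"CD56": 30, "CD16": 40, "CD57": 20},
}
labels = {"7": "NK", "8": "NK", "2": "B cell"}

def refine_labels(medians, first_pass, n_markers=1, positive_above=300.0):
    """Append the most variable markers to labels shared by several clusters."""
    by_label = {}
    for cluster, label in first_pass.items():
        by_label.setdefault(label, []).append(cluster)

    refined = {}
    for label, clusters in by_label.items():
        if len(clusters) == 1:
            refined[clusters[0]] = label  # already unique, nothing to add
            continue
        markers = list(medians[clusters[0]].keys())
        # Range of medians across the clusters that share this label.
        spread = {
            m: max(medians[c][m] for c in clusters) - min(medians[c][m] for c in clusters)
            for m in markers
        }
        top = sorted(spread, key=spread.get, reverse=True)[:n_markers]
        for cluster in clusters:
            # Hypothetical rule: a marker counts as expressed above the cutoff.
            expressed = [m for m in top if medians[cluster][m] > positive_above]
            refined[cluster] = " ".join([label] + expressed)
    return refined

refined = refine_labels(cluster_medians, labels)
print(refined)
```

With n_markers=1, two clusters that differ on more than one marker can still end up with the same name, so raising it trades shorter labels for more unique ones.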
Histograms of marker expression with highly annotated clusters
Finally, here are some tSNE plots of the data with the clusters labeled in both long and short form.
Anyway, try it out if you want. It may be useful. Check the results before you go ahead, though.
In my experience (mostly mass cytometry), a lot of clustering programs have difficulty with separating out pDCs/cDCs/Basophils. There seem to be at least 2 major issues:
Issue #1. Rarity (Freq of Total, not Freq of Parent). In PBMCs, all 3 of those are comparatively rare populations. Some of the density-dependent methods (SPADE and others) help combat this issue, but it's always easier to find larger groupings.
Issue #2. Poorly described by panel. Looking at the OMIP-102 panel, there are definitely some major DC lineage markers (esp CD11c, CD123), but only 10/50 markers in Table 2 even mention DCs, and only ~5 are "lineage" for DCs or basophils.
Regarding B cells: I've also encountered issues where an expression level i…