If you’ve worked with single-cell data, there is always one random undefined cluster that refuses to be annotated, often leading to a non-trivial number of hours being spent digging through cell marker databases and publications. When I came across the clustermole R package by Igor Dolgalev, I was pleasantly surprised at its effectiveness and simplicity. The only issue was that Python is my go-to daily driver for most tasks, which also includes single-cell analysis. So, I did the only logical thing anyone would do, spend my next few weekends writing clustermolepy.
Now, clustermolepy
isn’t meant to be an automated cell type annotation tool, but a quick way to explore the enrichment of genes in your unidentified cluster and query its cell type in different gene set libraries.
How Does Enrichment Work?
It’s worth briefly understanding how the “enrichment” is actually calculated and what happens under the hood. At the core of this is a statistical test that checks whether your list of marker genes overlaps with known cell type–specific gene sets more than you’d expect by random chance.
This is done using Fisher’s exact test, which builds a simple 2×2 contingency table like this:
From this table, Fisher’s Exact Test is used to calculate a p-value, basically asking:
If I randomly picked genes, what’s the chance I’d see this much overlap?
Here’s the formula behind it:
$$ p = \frac{{\binom{a + b}{a} \binom{c + d}{c}}}{{\binom{N}{a + c}}} $$The adjusted p-value is corrected for multiple hypothesis testing using Benjimini-Hocberg. For more details, read the FAQ section of Enricher website!
To illustrate the main use case of clustermolepy
, let’s walk through an example using the well-known pbmc3k dataset from the scanpy
library.
Getting Started with clustermolepy
Installing clustermolepy
You can install the stable release from PyPI using pip:
pip install clustermolepy
You will require Python 3.10 or higher. Make sure your environment meets this requirement before installing.
Annotating Cluster in pbmc3k Dataset
Next, we can import scanpy
and clustermolepy
and set up our data for this example. The pbmc3k dataset is a popular single-cell RNA-seq dataset that contains 3,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor. It includes various cell types, including B cells, T cells, and monocytes. It comes pre-processed with cell type annotations and UMAP embeddings already done, so we can only focus on the cluster enrichment results.
We will use the Leiden clustering algorithm to identify cell clusters in the pbmc3k dataset.
Code:
# Loading Enricher module from clustermolepy along with scanpy
from clustermolepy.enrichr import Enrichr
import scanpy as sc
# Reading the pbmc3k data
adata = sc.datasets.pbmc3k_processed()
# Running the leiden clustering on adata
sc.tl.leiden(adata, flavor='igraph', n_iterations=2, resolution=0.8)
# Plotting the UMAP plots with orginal cell types present in the louvain column
# the newly assigned cluster ids are present in the leiden column
sc.pl.umap(adata, color=['louvain', 'leiden'], legend_loc='on data')
Ouput:
Identify Cluster DE Gene for Enricher
We need to identify marker genes for each Leiden cluster. We’ll use Scanpy’s rank_genes_groups
method to perform differential gene expression and find genes that are upregulated in each cluster compared to the others. clustermolepy
provides a wrapper around Enrichr API to find the enrichment of genes in gene sets.
We will use B cells for this example, which is a well-defined population in pbmc3k dataset. Using scanpy.tl.rank_gene_groups()
, we can identify the top 25 up-regulated genes in each cluster.
Code:
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon', use_raw=False)
# Let's extract the top N marker genes per cluster (e.g., top 25)
top_n_markers = 25
b_cell_markers = sc.get.rank_genes_groups_df(adata, '1').head(top_n_markers).names # The Leiden Cluster 1 corresponds to B cells
print(f"Cluster Marker Genes (Top {top_n_markers} cluster_1 / B cells markers):")
print(b_cell_markers)
Output:
Cluster Marker Genes (Top 25 cluster_1 / B cells markers):
0 HLA-DQA1
1 CD79A
2 HLA-DPB1
3 HLA-DRB1
4 HLA-DQB1
5 CD79B
6 HLA-DPA1
7 CD37
8 MS4A1
9 HLA-DMA
10 TCL1A
11 HLA-DMB
12 SMIM14
13 MZB1
14 LTB
15 PLD4
16 IGJ
17 EAF2
18 CD1C
19 RABEP2
20 TNFRSF17
21 RPL22L1
22 IRF8
23 CEPT1
24 NT5C
Name: names, dtype: object
Using the Enrichr Module for querying Enrichr libraries
Now that we have our list of marker genes for each Leiden cluster, we can use clustermolepy
’s built-in Enrichr
module to directly query the Enrichr API! This makes it super easy to get biological insights from our marker genes.
clustermolepy
provides the get_cell_type_enrichment()
method! This handy function simplifies the process by automatically querying a curated set of Enrichr libraries that are specifically relevant for cell type identification.
Under the hood, get_cell_type_enrichment()
is multi-threaded, making it efficient for querying multiple gene sets efficiently. It automatically checks your marker genes against these ten key gene set libraries:
1. CellMarker_2024
2. CellMarker_Augmented_2021
3. Descartes_Cell_Types_and_Tissue_2021
4. PanglaoDB_Augmented_2021
5. Azimuth_Cell_Types_2021
6. Azimuth_2023
7. Tabula_Sapiens
8. Human_Gene_Atlas
9. Tabula_Muris
10. Mouse_Gene_Atlas
Using get_cell_type_enrichment()
is incredibly straightforward. Using this with our Leiden Cluster 1 marker genes:
Code:
enrichr = Enrichr(list(b_cell_markers), adj_pval_cutoff=0.05) # filter results > 0.05 adjusted p-value
enrichr.get_cell_type_enrichment().head()
Looking at the output below, clustermolepy
identifies strong enrichment for B cell–related terms across multiple annotation databases. Both CellMarker and PanglaoDB return highly significant results, with adjusted p-values well below 0.05 and high odds ratios. Interestingly, we see subtypes like “Naive B Cells”
and “Memory B Cells”
in PanglaoDB, while CellMarker provides tissue-specific annotations like “B cell:Kidney”
or “B Cell Lung Human”
. This highlights how different databases can offer complementary perspectives on cell identity, which is especially helpful when working with complex or heterogeneous cell populations.
Output:
| | term name | p-value | odds ratio | combined score | overlapping genes | adjusted p-value | old p-value | old adjusted p-value | gene_set |
|---:|:--------------------|------------:|-------------:|-----------------:|:--------------------------------------------------------------------------------------------------------------------------------|-------------------:|--------------:|-----------------------:|:--------------------------|
| 0 | B cell:Kidney | 5.08603e-18 | 103.437 | 4118.88 | ['SMIM14', 'EAF2', 'CD79B', 'CD79A', 'TCL1A', 'RABEP2', 'MZB1', 'TNFRSF17', 'IRF8', 'CD37', 'RPL22L1', 'LTB', 'CEPT1', 'MS4A1'] | 8.54452e-16 | 0 | 0 | CellMarker_Augmented_2021 |
| 1 | B Cell Kidney Human | 4.98039e-18 | 103.601 | 4127.57 | ['SMIM14', 'EAF2', 'CD79B', 'CD79A', 'TCL1A', 'RABEP2', 'MZB1', 'TNFRSF17', 'IRF8', 'CD37', 'RPL22L1', 'LTB', 'CEPT1', 'MS4A1'] | 9.21372e-16 | 0 | 0 | CellMarker_2024 |
| 2 | B Cell Lung Human | 1.92402e-16 | 1664.67 | 60239.2 | ['CD79B', 'CD79A', 'TCL1A', 'MZB1', 'TNFRSF17', 'MS4A1'] | 1.77972e-14 | 0 | 0 | CellMarker_2024 |
| 3 | B Cells Memory | 2.2445e-15 | 143.797 | 4850.32 | ['CD79B', 'CD79A', 'TCL1A', 'TNFRSF17', 'IRF8', 'CD37', 'LTB', 'CD1C', 'MS4A1'] | 2.87264e-14 | 0 | 0 | PanglaoDB_Augmented_2021 |
| 4 | B Cells Naive | 2.87264e-15 | 139.718 | 4678.26 | ['CD79B', 'CD79A', 'TCL1A', 'EAF2', 'IRF8', 'CD37', 'LTB', 'CD1C', 'MS4A1'] | 2.87264e-14 | 0 | 0 | PanglaoDB_Augmented_2021 |
While get_cell_type_enrichment()
defaults to querying ten curated cell type gene set libraries, you’re not locked into those. You can easily provide your own list of Enrichr gene set libraries. Maybe you’re working on something niche, or if you just want to query mouse gene set libraries to check the overlap of genes.
Code:
gene_sets = ['Mouse_Gene_Atlas', 'Tabula_Muris']
enrichr.get_cell_type_enrichment(gene_sets).head(3)
Output:
| | term name | p-value | odds ratio | combined score | overlapping genes | adjusted p-value | old p-value | old adjusted p-value | gene_set |
|---:|:-------------------------|------------:|-------------:|-----------------:|:---------------------------------------------------|-------------------:|--------------:|-----------------------:|:-------------|
| 0 | B Cell Fat CL:0000236 | 5.17138e-11 | 138.264 | 3274.82 | ['CD79B', 'CD79A', 'IRF8', 'CD37', 'LTB', 'MS4A1'] | 1.18942e-09 | 0 | 0 | Tabula_Muris |
| 1 | B Cell Liver CL:0000236 | 4.98722e-09 | 110.998 | 2121.88 | ['CD79B', 'CD79A', 'CD37', 'LTB', 'MS4A1'] | 3.20099e-08 | 0 | 0 | Tabula_Muris |
| 2 | B Cell Muscle CL:0000236 | 5.71356e-09 | 107.86 | 2047.23 | ['CD79B', 'CD79A', 'CD37', 'LTB', 'MS4A1'] | 3.20099e-08 | 0 | 0 | Tabula_Muris |
Extra Utilities
Beyond just sending your marker genes to Enrichr,clustermolepy
comes with a couple of utilities for working with gene sets.
Fuzzy Matching for Enrichr Libraries
If you’ve ever used the Enrichr web interface, you know how many libraries there are. clustermolepy
adds a small helpful feature: fuzzy matching for library names. So if you forget whether it’s called "CellMarker_2024"
or "Cell_Marker_2023"
, you can just pass a partial name and let the fuzzy match do the work:
Code:
enrichr.get_libraries(name="cellmarker") # fuzzy search!
Output:
| | geneCoverage | genesPerTerm | libraryName | link | numTerms | categoryId |
|---:|---------------:|---------------:|:--------------------------|:--------------------------------------------|-----------:|-------------:|
| 0 | 14167 | 80 | CellMarker_Augmented_2021 | http://biocc.hrbmu.edu.cn/CellMarker/ | 1097 | 5 |
| 1 | 12642 | 30 | CellMarker_2024 | http://bio-bigdata.hrbmu.edu.cn/CellMarker/ | 1692 | 5 |
Gene Name Conversion via Biomart
clustermolepy
also provides a convenient way to convert gene symbols between different species using the biomart
module. This is particularly useful when you want to compare gene sets or perform enrichment analysis across different organisms.
In this example, we’ll convert human gene symbols to mouse gene symbols using the biomart
module. We’ll use the convert_gene_names()
function to fetch the gene conversion data and then apply it to our list of human genes.
Code:
from clustermolepy.utils import Biomart
# Convert gene names from human to mouse
bm = Biomart(verbose=False)
result = bm.convert_gene_names(
genes=["TP53", "CD4", "FOXP3"], # Example gene names
from_organism="hsapiens", # Human
to_organism="mmusculus" # Mouse
)
print(result)
Output:
{
'TP53': ['Trp53'],
'CD4': ['Cd4'],
'FOXP3': ['Foxp3']
}