If you’ve worked with single-cell data, there is always one random undefined cluster that refuses to be annotated, often leading to a non-trivial number of hours being spent digging through cell marker databases and publications. When I came across the clustermole R package by Igor Dolgalev, I was pleasantly surprised at its effectiveness and simplicity. The only issue was that Python is my go-to daily driver for most tasks, which also includes single-cell analysis. So, I did the only logical thing anyone would do, spend my next few weekends writing clustermolepy.

Now, clustermolepy isn’t meant to be an automated cell type annotation tool, but a quick way to explore the enrichment of genes in your unidentified cluster and query its cell type in different gene set libraries.


How Does Enrichment Work?

It’s worth briefly understanding how the “enrichment” is actually calculated and what happens under the hood. At the core of this is a statistical test that checks whether your list of marker genes overlaps with known cell type–specific gene sets more than you’d expect by random chance.

This is done using Fisher’s exact test, which builds a simple 2×2 contingency table like this:

Contingency Table

From this table, Fisher’s Exact Test is used to calculate a p-value, basically asking:

If I randomly picked genes, what’s the chance I’d see this much overlap?

Here’s the formula behind it:

$$ p = \frac{{\binom{a + b}{a} \binom{c + d}{c}}}{{\binom{N}{a + c}}} $$

The adjusted p-value is corrected for multiple hypothesis testing using Benjimini-Hocberg. For more details, read the FAQ section of Enricher website!

To illustrate the main use case of clustermolepy, let’s walk through an example using the well-known pbmc3k dataset from the scanpy library.


Getting Started with clustermolepy

Installing clustermolepy

You can install the stable release from PyPI using pip:

pip install clustermolepy

You will require Python 3.10 or higher. Make sure your environment meets this requirement before installing.

Annotating Cluster in pbmc3k Dataset

Next, we can import scanpy and clustermolepy and set up our data for this example. The pbmc3k dataset is a popular single-cell RNA-seq dataset that contains 3,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor. It includes various cell types, including B cells, T cells, and monocytes. It comes pre-processed with cell type annotations and UMAP embeddings already done, so we can only focus on the cluster enrichment results.

We will use the Leiden clustering algorithm to identify cell clusters in the pbmc3k dataset.

Code:

# Loading Enricher module from clustermolepy along with scanpy
from clustermolepy.enrichr import Enrichr
import scanpy as sc

# Reading the pbmc3k data
adata = sc.datasets.pbmc3k_processed()

# Running the leiden clustering on adata
sc.tl.leiden(adata, flavor='igraph', n_iterations=2, resolution=0.8)

# Plotting the UMAP plots with orginal cell types present in the louvain column
# the newly assigned cluster ids are present in the leiden column
sc.pl.umap(adata, color=['louvain', 'leiden'], legend_loc='on data')

Ouput: PBMC3K UMAP

Identify Cluster DE Gene for Enricher

We need to identify marker genes for each Leiden cluster. We’ll use Scanpy’s rank_genes_groups method to perform differential gene expression and find genes that are upregulated in each cluster compared to the others. clustermolepy provides a wrapper around Enrichr API to find the enrichment of genes in gene sets.

We will use B cells for this example, which is a well-defined population in pbmc3k dataset. Using scanpy.tl.rank_gene_groups(), we can identify the top 25 up-regulated genes in each cluster.

Code:

sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon', use_raw=False)

# Let's extract the top N marker genes per cluster (e.g., top 25)
top_n_markers = 25
b_cell_markers = sc.get.rank_genes_groups_df(adata, '1').head(top_n_markers).names # The Leiden Cluster 1 corresponds to B cells
print(f"Cluster Marker Genes (Top {top_n_markers} cluster_1 / B cells markers):")
print(b_cell_markers)

Output:

Cluster Marker Genes (Top 25 cluster_1 / B cells markers):
0     HLA-DQA1
1        CD79A
2     HLA-DPB1
3     HLA-DRB1
4     HLA-DQB1
5        CD79B
6     HLA-DPA1
7         CD37
8        MS4A1
9      HLA-DMA
10       TCL1A
11     HLA-DMB
12      SMIM14
13        MZB1
14         LTB
15        PLD4
16         IGJ
17        EAF2
18        CD1C
19      RABEP2
20    TNFRSF17
21     RPL22L1
22        IRF8
23       CEPT1
24        NT5C
Name: names, dtype: object

Using the Enrichr Module for querying Enrichr libraries

Now that we have our list of marker genes for each Leiden cluster, we can use clustermolepy’s built-in Enrichr module to directly query the Enrichr API! This makes it super easy to get biological insights from our marker genes.

clustermolepy provides the get_cell_type_enrichment() method! This handy function simplifies the process by automatically querying a curated set of Enrichr libraries that are specifically relevant for cell type identification.

Under the hood, get_cell_type_enrichment() is multi-threaded, making it efficient for querying multiple gene sets efficiently. It automatically checks your marker genes against these ten key gene set libraries:

1.  CellMarker_2024
2.  CellMarker_Augmented_2021
3.  Descartes_Cell_Types_and_Tissue_2021
4.  PanglaoDB_Augmented_2021
5.  Azimuth_Cell_Types_2021
6.  Azimuth_2023
7.  Tabula_Sapiens
8.  Human_Gene_Atlas
9.  Tabula_Muris
10. Mouse_Gene_Atlas

Using get_cell_type_enrichment() is incredibly straightforward. Using this with our Leiden Cluster 1 marker genes:

Code:

enrichr = Enrichr(list(b_cell_markers), adj_pval_cutoff=0.05) # filter results > 0.05 adjusted p-value
enrichr.get_cell_type_enrichment().head()

Looking at the output below, clustermolepy identifies strong enrichment for B cell–related terms across multiple annotation databases. Both CellMarker and PanglaoDB return highly significant results, with adjusted p-values well below 0.05 and high odds ratios. Interestingly, we see subtypes like “Naive B Cells” and “Memory B Cells” in PanglaoDB, while CellMarker provides tissue-specific annotations like “B cell:Kidney” or “B Cell Lung Human”. This highlights how different databases can offer complementary perspectives on cell identity, which is especially helpful when working with complex or heterogeneous cell populations.

Output:

|    | term name           |     p-value |   odds ratio |   combined score | overlapping genes                                                                                                               |   adjusted p-value |   old p-value |   old adjusted p-value | gene_set                  |
|---:|:--------------------|------------:|-------------:|-----------------:|:--------------------------------------------------------------------------------------------------------------------------------|-------------------:|--------------:|-----------------------:|:--------------------------|
|  0 | B cell:Kidney       | 5.08603e-18 |      103.437 |          4118.88 | ['SMIM14', 'EAF2', 'CD79B', 'CD79A', 'TCL1A', 'RABEP2', 'MZB1', 'TNFRSF17', 'IRF8', 'CD37', 'RPL22L1', 'LTB', 'CEPT1', 'MS4A1'] |        8.54452e-16 |             0 |                      0 | CellMarker_Augmented_2021 |
|  1 | B Cell Kidney Human | 4.98039e-18 |      103.601 |          4127.57 | ['SMIM14', 'EAF2', 'CD79B', 'CD79A', 'TCL1A', 'RABEP2', 'MZB1', 'TNFRSF17', 'IRF8', 'CD37', 'RPL22L1', 'LTB', 'CEPT1', 'MS4A1'] |        9.21372e-16 |             0 |                      0 | CellMarker_2024           |
|  2 | B Cell Lung Human   | 1.92402e-16 |     1664.67  |         60239.2  | ['CD79B', 'CD79A', 'TCL1A', 'MZB1', 'TNFRSF17', 'MS4A1']                                                                        |        1.77972e-14 |             0 |                      0 | CellMarker_2024           |
|  3 | B Cells Memory      | 2.2445e-15  |      143.797 |          4850.32 | ['CD79B', 'CD79A', 'TCL1A', 'TNFRSF17', 'IRF8', 'CD37', 'LTB', 'CD1C', 'MS4A1']                                                 |        2.87264e-14 |             0 |                      0 | PanglaoDB_Augmented_2021  |
|  4 | B Cells Naive       | 2.87264e-15 |      139.718 |          4678.26 | ['CD79B', 'CD79A', 'TCL1A', 'EAF2', 'IRF8', 'CD37', 'LTB', 'CD1C', 'MS4A1']                                                     |        2.87264e-14 |             0 |                      0 | PanglaoDB_Augmented_2021  |

While get_cell_type_enrichment() defaults to querying ten curated cell type gene set libraries, you’re not locked into those. You can easily provide your own list of Enrichr gene set libraries. Maybe you’re working on something niche, or if you just want to query mouse gene set libraries to check the overlap of genes.

Code:

gene_sets = ['Mouse_Gene_Atlas', 'Tabula_Muris']
enrichr.get_cell_type_enrichment(gene_sets).head(3)

Output:

|    | term name                |     p-value |   odds ratio |   combined score | overlapping genes                                  |   adjusted p-value |   old p-value |   old adjusted p-value | gene_set     |
|---:|:-------------------------|------------:|-------------:|-----------------:|:---------------------------------------------------|-------------------:|--------------:|-----------------------:|:-------------|
|  0 | B Cell Fat CL:0000236    | 5.17138e-11 |      138.264 |          3274.82 | ['CD79B', 'CD79A', 'IRF8', 'CD37', 'LTB', 'MS4A1'] |        1.18942e-09 |             0 |                      0 | Tabula_Muris |
|  1 | B Cell Liver CL:0000236  | 4.98722e-09 |      110.998 |          2121.88 | ['CD79B', 'CD79A', 'CD37', 'LTB', 'MS4A1']         |        3.20099e-08 |             0 |                      0 | Tabula_Muris |
|  2 | B Cell Muscle CL:0000236 | 5.71356e-09 |      107.86  |          2047.23 | ['CD79B', 'CD79A', 'CD37', 'LTB', 'MS4A1']         |        3.20099e-08 |             0 |                      0 | Tabula_Muris |

Extra Utilities

Beyond just sending your marker genes to Enrichr,clustermolepycomes with a couple of utilities for working with gene sets.

Fuzzy Matching for Enrichr Libraries

If you’ve ever used the Enrichr web interface, you know how many libraries there are. clustermolepy adds a small helpful feature: fuzzy matching for library names. So if you forget whether it’s called "CellMarker_2024" or "Cell_Marker_2023", you can just pass a partial name and let the fuzzy match do the work:

Code:

enrichr.get_libraries(name="cellmarker")  # fuzzy search!

Output:

|    |   geneCoverage |   genesPerTerm | libraryName               | link                                        |   numTerms |   categoryId |
|---:|---------------:|---------------:|:--------------------------|:--------------------------------------------|-----------:|-------------:|
|  0 |          14167 |             80 | CellMarker_Augmented_2021 | http://biocc.hrbmu.edu.cn/CellMarker/       |       1097 |            5 |
|  1 |          12642 |             30 | CellMarker_2024           | http://bio-bigdata.hrbmu.edu.cn/CellMarker/ |       1692 |            5 |

Gene Name Conversion via Biomart

clustermolepy also provides a convenient way to convert gene symbols between different species using the biomart module. This is particularly useful when you want to compare gene sets or perform enrichment analysis across different organisms. In this example, we’ll convert human gene symbols to mouse gene symbols using the biomart module. We’ll use the convert_gene_names() function to fetch the gene conversion data and then apply it to our list of human genes.

Code:

from clustermolepy.utils import Biomart

# Convert gene names from human to mouse
bm = Biomart(verbose=False)
result = bm.convert_gene_names(
    genes=["TP53", "CD4", "FOXP3"], # Example gene names
    from_organism="hsapiens", # Human
    to_organism="mmusculus" # Mouse
)
print(result)

Output:

{
    'TP53': ['Trp53'],
    'CD4': ['Cd4'],
    'FOXP3': ['Foxp3']
}