The following code replicates the analysis shown in the Kandinsky paper to study spatial co-localization/dispersion between different cell types found in a breast cancer sample profiled with 10X Genomics Xenium platform.
The Xenium Breast Cancer dataset has been downloaded from 10X Genomics website.
This sample has been analysed using the Xenium Prime 5K Human Pan Tissue & Pathways Panel.
To work with Xenium data using Kandinsky, we first need to set the path to the folder containing Xenium raw input data.
list.files(path)
#> [1] "aux_outputs.tar.gz" "cell_boundaries.csv.gz"
#> [3] "cell_boundaries.parquet" "cell_feature_matrix.tar.gz"
#> [5] "cells.csv.gz" "cells.parquet"
#> [7] "transcripts.parquet"At this point we can import and load Xenium data
into a Seurat object using the Kandinsky function
prepare_xenium_seurat().
We need to specify:
the path to the input folder
the dataset.id that will be used to name the Seurat
dataset.
Additional parameters include:
fovs: to import only the cells mapped within the
specified list of FOVS.
If we want to import the whole dataset, we can set
fov = NULL (default option)
h5: a logical value indicating the type of input
file to use for importing the Xenium cell expression count matrix.
When set to h5 = TRUE, the function will use the R package
rhdf5 (if installed) to import the count matrix from the
cell_feature_matrix.h5 file located in the Xenium
folder.
Alternatively, setting h5 = FALSE (the default) will import
the expression count matrix by reading the compressed
cell_feature_matrix.tar.gz file in the Xenium
folder.
xen = prepare_xenium_seurat(path, dataset.id = 'Breast', h5 = FALSE)
#> Loading cell polygon file...
#> Loading cell metadata...
#> Locating cell transcript file...
#> Preparing fov position data...
#> Preparing gene expression matrix...
#> Creating Seurat object...Before proceeding with the analysis, we can remove single cells with a low number of detected transcripts and genes, using arbitrary thresholds.
In this dataset, we will retain cells that have at least 25 detected transcripts across a minimum of 20 unique genes:
# Filter low-count cells
ncol(xen)
#> [1] 699110
xen = subset(xen, subset = nCount_Xenium > 25)
xen = subset(xen, subset = nFeature_Xenium > 20)
ncol(xen)
#> [1] 468583Now that the new Seurat object is ready, we can generate
the associated Kandinsky data using the kandinsky_init()
function.
Since we are working with Xenium data, we specify the argument
tech = 'xenium' to ensure Kandinsky recognizes the Xenium
data format.
We use Kandinsky to identify cell neighbourhoods based on
cell-to-cell centroid distance
(nb.method = 'C'), and setting a distance threshold of 40um
(d.max = 40).
# Initialize Kandinsky
xen = kandinsky_init(xen
, tech = 'xenium'
, nb.method = 'C'
, d.max = 40)
#> Building Kandinsky slot "sf"...
#> Building Kandinsky slot "nb"...
#> Building Kandinsky slot "tx"...
#> Building Kandinsky slot "fov_mask"...
#> Kandinsky data loaded into Seurat object!To identify major cell types in our Xenium Breast cancer sample, we will use Seurat to perform a sketch-based pre-processing and clustering analysis pipeline.
Specifically, we will first identify single-cell clusters on a subset of 75K cells, and then project the identified cluster into the full dataset to annotate the remaining cells excluded from the initial clustering step. Users can found more information about Seurat sketch-based analysis here.
#Start cell clustering (sketch method)
xen = NormalizeData(xen)
#> Normalizing layer: counts
xen <- FindVariableFeatures(xen)
#> Finding variable features for layer counts
xen <- SketchData(object = xen,
ncells = 75000,
method = "LeverageScore",
sketched.assay = "sketch"
)
#> Calcuating Leverage Score
#> Attempting to cast layer counts to dgCMatrix
#> Attempting to cast layer data to dgCMatrix
DefaultAssay(xen) <- "sketch"
xen <- FindVariableFeatures(xen)
#> Finding variable features for layer counts
xen <- ScaleData(xen)
#> Centering and scaling data matrix
xen <- RunPCA(xen)
#> PC_ 1
#> Positive: IL7R, CCL2, LUM, C11orf96, POSTN, CD52, CXCR4, DCN, SLC2A3, BGN
#> THY1, C1QC, TRAC, SERPINE1, TGFBR2, COL4A1, TRBC1, CCN1, BIRC3, PIM1
#> ZEB2, MS4A6A, CCN2, CTSL, FSTL1, AQP1, AIF1, SGK1, CXCL9, EMP1
#> Negative: CA12, XBP1, RAB11FIP1, SLC9A3R1, TSPAN13, GATA3, CCND1, DDR1, ESR1, HSPB1
#> ANKRD30A, PLXNB1, WWP1, MLPH, HSPB8, THSD4, PDZK1, FOXA1, RGL3, GREB1
#> SERPINA3, PABPC1L, TRPS1, CELSR1, DHCR24, NOTCH2NLA, PDCD4, PRLR, IL6ST, LRATD2
#> PC_ 2
#> Positive: IL7R, TRBC1, TRAC, CXCR4, CD52, MS4A1, CD8A, CD3E, CTLA4, TNFRSF13C
#> CD3D, P2RX5, CD2, CD247, SH2D1A, BIRC3, PIM2, RAB11FIP1, CD79A, TIGIT
#> CD5, CA12, CD28, SEPTIN6, PDCD4, PPP1R16B, RUNX3, TENT5C, XBP1, CD19
#> Negative: COL4A1, CCN1, SERPINE1, COL4A2, BGN, POSTN, PLPP3, THY1, COL5A1, CCN2
#> HSPG2, AQP1, C11orf96, PLVAP, EMP1, THBS1, EPAS1, ADAMTS1, CCL2, TNC
#> FLT1, LUM, SLCO2A1, SERPINH1, FSTL1, COL5A2, SULF1, PTPRB, SOCS3, MMP14
#> PC_ 3
#> Positive: PLVAP, AQP1, FLT1, SLCO2A1, SHANK3, PTPRB, CD93, EGFL7, ADAMTS1, NOS3
#> CALCRL, EPAS1, KDR, TMEM255B, TIE1, MALL, PECAM1, PODXL, ENG, ADGRL4
#> NOTCH4, SELP, SELE, MMRN2, HSPG2, ROBO4, TEK, CD34, PTPRM, TSPAN7
#> Negative: POSTN, LUM, CCN2, COL5A1, SULF1, DCN, SERPINE1, MEG3, TNC, ADAM12
#> COL5A2, CTHRC1, THBS1, ABI3BP, LTBP2, BGN, INHBA, THBS2, CCDC71L, TIMP3
#> CCL2, CCDC80, CCN4, AEBP1, CDH11, CTSL, MRC2, CTSK, VGLL3, LRP1
#> PC_ 4
#> Positive: C1QC, MS4A6A, BIRC5, ITGAX, TOP2A, AIF1, TYMS, C5AR1, KIF18B, CDK1
#> FCGR2A, MYBL2, MPP1, CTSL, CD68, ITGB2, LGMN, ADAM8, TPX2, RRM2
#> MAFB, FOXM1, SGK1, CIITA, STMN1, CCR1, MSR1, NUSAP1, SLCO2B1, MKI67
#> Negative: TIMP3, ERBB2, THBS1, XBP1, ANKRD30A, IGSF1, TCIM, FOXA1, FGFR1, CACNG4
#> NPY1R, MLPH, KIAA1324, WWP1, RAB11FIP1, TFAP2A, TNC, ERBB3, GATA3, SERPINA3
#> ESR1, DCN, IRS1, MEG3, THSD4, SLC9A3R1, IL6ST, PVT1, PLXNB1, PAWR
#> PC_ 5
#> Positive: C1QC, ITGAX, MS4A6A, C5AR1, CTSL, FCGR2A, AIF1, LGMN, MAFB, MPP1
#> CD68, MSR1, ADAM8, SLCO2B1, SGK1, CCR1, PLAUR, LILRB4, LAIR1, PLA2G7
#> MRC1, GRN, ABCA1, NR1H3, HAVCR2, ZEB2, CYBB, ITGB2, SIGLEC1, SIRPA
#> Negative: BIRC5, TOP2A, KIF18B, TYMS, CDK1, MYBL2, RRM2, FOXM1, TPX2, MKI67
#> STMN1, NUSAP1, FANCI, RAD54L, CENPF, SPAG5, MELK, UHRF1, CDCA3, TK1
#> KIF2C, NEK2, DHFR, BUB1, SMC4, ASPM, PCNA, KIF11, TMPO, E2F1
xen <- FindNeighbors(xen, dims = 1:50)
#> Computing nearest neighbor graph
#> Computing SNN
xen <- FindClusters(xen, resolution = 1.2)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 75000
#> Number of edges: 3021467
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8710
#> Number of communities: 28
#> Elapsed time: 26 seconds
#> 5 singletons identified. 23 final clusters.
xen <- RunUMAP(xen, dims = 1:50, return.model = TRUE)
#> UMAP will return its model
#> 11:13:51 UMAP embedding parameters a = 0.9922 b = 1.112
#> 11:13:51 Read 75000 rows and found 50 numeric columns
#> 11:13:51 Using Annoy for neighbor search, n_neighbors = 30
#> 11:13:51 Building Annoy index with metric = cosine, n_trees = 50
#> 0% 10 20 30 40 50 60 70 80 90 100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 11:14:00 Writing NN index file to temp file /var/tmp/pbs.750154.peralta/RtmpLxjNjH/fileab0163608201
#> 11:14:00 Searching Annoy index using 1 thread, search_k = 3000
#> 11:14:35 Annoy recall = 100%
#> 11:14:36 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 11:14:40 Initializing from normalized Laplacian + noise (using RSpectra)
#> 11:14:43 Commencing optimization for 200 epochs, with 3719930 positive edges
#> 11:16:36 Optimization finished
xen <- ProjectData(object = xen,
assay = "Xenium",
full.reduction = "pca.full",
sketched.assay = "sketch",
sketched.reduction = "pca",
umap.model = "umap",
dims = 1:50,
refdata = list(cluster_full = "seurat_clusters")
)
#> pca.full is not in the object. Data from all cells will be projected to pca
#> Projecting cell embeddings
#> Finding sketch neighbors
#> Finding sketch weight matrix
#> Transfering refdata from sketch
#> Projection to sketch umap
#> Running UMAP projection
#> 11:21:16 Read 468583 rows
#> 11:21:16 Processing block 1 of 1
#> 11:21:16 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 11:21:23 Initializing by weighted average of neighbor coordinates using 1 thread
#> 11:21:26 Commencing optimization for 67 epochs, with 14057490 positive edges
#> 11:23:37 Finished
# Now that we have projected the full dataset, switch back to analyzing all cells
DefaultAssay(xen) <- "Xenium"
Idents(xen) = 'cluster_full'
length(unique(xen$cluster_full))
#> [1] 23We have identified a total of 23 clusters.
We can now annotate each cluster for its corresponding cell
type by inspecting cluster-specific marker genes through
differential expression (DE) analysis using Seurat
FindAllMarkers() function. In particular, we define the
list of marker genes for each cluster by focusing on significantly
up-regulated genes (adjusted p-value < 0.05 and logFC > 0.5).
library(presto)
res = FindAllMarkers(xen, logfc.thresold = 0.1, min.pct = 0.1, only.pos = T)
res = res %>% filter(p_val_adj < 0.05, avg_log2FC > 0.5)res %>% group_by(cluster) %>% arrange(desc(avg_log2FC)) %>% ungroup() %>% arrange(cluster)
#> # A tibble: 2,785 × 7
#> p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
#> 1 0 3.03 0.107 0.008 0 0 CRH
#> 2 0 2.73 0.236 0.024 0 0 PLIN5
#> 3 0 2.56 0.223 0.023 0 0 AGR3
#> 4 0 2.52 0.378 0.043 0 0 AGR2
#> 5 0 2.49 0.406 0.052 0 0 WNT9A
#> 6 0 2.49 0.79 0.121 0 0 TSPAN13
#> 7 0 2.47 0.517 0.074 0 0 SERPINA3
#> 8 0 2.40 0.569 0.081 0 0 DDR1
#> 9 0 2.39 0.104 0.01 0 0 PC
#> 10 0 2.38 0.396 0.055 0 0 CELSR1
#> # ℹ 2,775 more rows
# 23 clusters found by Seurat
clust_names = seq_len(23) - 1
names(clust_names) = c('Tumour.1','Fibro.1','Endo','TCD8','Macro','Tumour.2',
'Fibro.2','Tumour.3','Tumour.4','Mono','Bcell','Tumour.5',
'Tumour.6','TCD4','Plasma','PVL','Unassigned','DC.1',
'MEC','Mast','Tumour.7','DC.2','Adipocytes')
clust_names = data.frame(seurat_clust = clust_names,final_clust = names(clust_names))
xen$cluster_full = as.numeric(as.character(xen$cluster_full))
new_meta = merge(xen@meta.data, clust_names, by.x = 'cluster_full', by.y = 'seurat_clust')
rownames(new_meta) = new_meta$cell_id
new_meta = new_meta[colnames(xen),]
xen@meta.data = new_meta
xen$final_clust = factor(xen$final_clust,levels = sort(unique(xen$final_clust)))
Idents(xen) = 'final_clust'
xen$top_anno = unname(sapply(as.character(xen$final_clust),function(x){strsplit(x,split='\\.')[[1]][[1]]}))We can now visualize our single cells annotated according to their cell type of origin. Specifically, we will use three different visualization strategies:
DotPlot();RunUMAP()) and coloured according to their cell type of
origin;KanPlot() is used to map single cells into space according
to their associated segmentation coordinates. Each cell will be coloured
according to their cell type of origin.markers = FindAllMarkers(xen, logfc.thresold = 0.5, min.pct = 0.1, only.pos = T)
markers = markers %>%
group_by(cluster) %>%
filter(p_val_adj < 0.05, avg_log2FC >= 1) %>%
slice_max(n = 2, order_by = avg_log2FC) %>%
pull(gene)
markers = unique(markers)
library(ggplot2)
library(viridis)
# Cell type marker dot plot
DotPlot(xen, features = markers, dot.min = 0.05) + scale_color_viridis(option = 'viridis') +
theme(axis.text.x = element_text(size = 4, angle = 90), legend.text = element_text(size = 3), legend.key.size = unit(1, 'cm'), legend.title = element_text(size = 5)) + labs(x = '', y = '')
pal = c("ivory1", "#ff0062","#e6ad1e", "cyan","#cfcf1b", "#FFCC99", "gray90", "#94FFB5","#f0f29d",
"limegreen", "#e8e9fd", "#FFA405", "#FFA8BB", "lightcyan", "magenta", "dodgerblue3", "#00d1c3","#E0FF66",
"#ddbefa","indianred1", "#FFFF80", "#FFE100", "#ff7236")
# UMAP
DimPlot(xen, reduction = 'full.umap', cols = pal, raster.dpi = c(8000, 8000), pt.size = 5) + theme_void() + NoLegend()
# KanPlot of representative sample region (upper-left sample quarter)
max_x = median(xen$x_centroid)
min_y = median(xen$y_centroid)
KanPlot(subset(xen, subset = x_centroid > max_x & y_centroid< min_y),
feature = 'final_clust'
, palette_discrete = pal
, show_border = T
, border.size = 0.02)By examining the spatial visualization of the whole Xenium breast
cancer slide generated with KanPlot(), we can notice a
highly variable density/abundance of cells,
particularly when comparing peri-tumoral and stromal regions.
Having defined c-NBs with Kandinsky using a fixed centroid distance of 40um, we might have a relatively variable size of c-NBs depending on cell localization within the sample.
To assess this variability, we can use the nb_summary()
function from Kandinsky to get a text summary of the range of
c-NB size values observed across all cells in the Xenium
dataset.
nb_summary(xen)
#> Neighbour method: Centroid distance (C), d.max = 40
#> Total cells/spots: 468583
#> Total neighbour links: 17828594
#> Minimum number of neighbours per cell/spot: 0
#> Mean number of neighbours per cell/spot: 38.05
#> Median number of neighbours per cell/spot: 35
#> Max number of neighbours per cell/spot: 140
#> Number of cells/spots with no neighbours: 1757
# Get list of c-NB size for all cells
tot_nb = data.frame(size = Matrix::colSums(as(KanData(xen,'nb'),'CsparseMatrix')))
# Plot distribution of c-NB size values across cells
labels = c(min(tot_nb$size), median(tot_nb$size), max(tot_nb$size))
ggplot(tot_nb, aes(y = size)) + theme_classic()+
geom_density() + scale_y_reverse(breaks = (labels), labels = (labels))+
geom_hline(yintercept = labels[2], linetype = 'dashed') +
theme(axis.text.x = element_blank(), axis.line.x = element_blank(), axis.ticks.x = element_blank()) +
guides(y = guide_axis(cap = "both")) + labs(y = 'c-NB size', x = '')Kandinsky implements a co-localisation/dispersion analysis module to test for spatial co-localisation or dispersion between any pair of cell types present in the analysed dataset.
Co-localisation/dispersion analysis is performed by using the
Kandinsky jc_coloc() function.
The cell type annotation name is specified through the
label parameter. The function will internally performs a
join count test to calculate a co-localisation
z-coefficient for each cell type pairs, with positive and negative
coefficients indicating a tendency of co-localisation or dispersion
between cell types, respectively.
The main output of jc_coloc() is a heatmap plot
displaying all co-localisation z-values computed by Kandinsky for any
cell type combination. By setting the parameter
return.mat = TRUE, the function will return a list object
containing both the heatmap plot and a data frame reporting all the
numeric co-localisation z-values computed by Kandinsky.
xen$final_clust = factor(xen$final_clust, levels = c(
'Bcell','Plasma','DC.1','DC.2','Mono','Macro',
'TCD4','TCD8','Mast','Tumour.1','Tumour.2',
'Tumour.3','Tumour.4','Tumour.5','Tumour.6','Tumour.7',
'MEC','Adipocytes','Endo',
'Fibro.1','Fibro.2','PVL','Unassigned'))
coloc = jc_coloc(xen, label = 'final_clust', return.mat = TRUE)
head(coloc$coloc_mat)
#> Joincount Expected Variance z-value Type1 Type2
#> Bcell:Bcell 208821 35284.556 79180.219 616.71192 Bcell Bcell
#> Plasma:Plasma 29564 11883.349 20715.906 122.84184 Plasma Plasma
#> DC.1:DC.1 30027 4006.886 5764.844 342.70103 DC.1 DC.1
#> DC.2:DC.2 5506 1364.572 1717.168 99.94101 DC.2 DC.2
#> Mono:Mono 33847 18853.386 36322.642 78.67153 Mono Mono
#> Macro:Macro 46688 32008.312 70061.469 55.45966 Macro Macro
#> oe_log2ratio oe_ratio
#> Bcell:Bcell 2.5651580 5.918198
#> Plasma:Plasma 1.3149001 2.487851
#> DC.1:DC.1 2.9057069 7.493849
#> DC.2:DC.2 2.0125560 4.034965
#> Mono:Mono 0.8442043 1.795274
#> Macro:Macro 0.5446052 1.458621
coloc$plot + ggtitle('JC co-localization test')From the co-localisation results, we noticed a
tendency of myoepithelial cells (MEC) to co-localise
with Tumour 5 and Tumour 6 cells.
We can check for phenotypic differences between this
subset of tumour cells and the remaining tumour clusters via
differential gene expression analysis:
de_tumour = FindMarkers(xen, ident.1 = c('Tumour.5','Tumour.6'), ident.2 = c('Tumour.4','Tumour.1','Tumour.2','Tumour.3','Tumour.7')
, logfc.thresold = -Inf, return.thresh = 1.1, only.pos = FALSE)
de_tumour$gene = rownames(de_tumour)
de_tumour$comparison = '56_vs_12347'To interpret differential gene expression results, we will use the package fgsea to perform pre-ranked gene set enrichment analysis considering the Hallmark gene gene sets from the Molecular Signature DataBase (MSigDB).
# Load list of Hallmark signatures
h = system.file('extdata','MSigDB_Hallmark_sigs.rds', package = 'Kandinsky')
h = readRDS(h)
de_tumour = de_tumour %>% mutate(FC = 2^avg_log2FC) %>% arrange(desc(FC))
# Run fgsea
library(fgsea)
stats = de_tumour$avg_log2FC
names(stats) = de_tumour$gene
set.seed(347548)
gsea_res = fgsea(stats = stats, pathways = h)
gsea_res = gsea_res %>% as.data.frame()
# Clean fgsea result table
gsea_res$pathway = gsub('HALLMARK_','',gsea_res$pathway)
gsea_res = gsea_res[,c('pathway','padj','NES')] %>% filter(padj < 0.05) %>%
arrange((NES)) %>%
mutate(pathway = factor(pathway, levels = unique(pathway)), label = ifelse(NES > 0,'Enriched in Tumour 5/6','Depleted in Tumour 5/6'))
# Plot significant enrichment results
nes_limits = c(min(gsea_res$NES), max(gsea_res$NES))
ggplot(gsea_res,aes(x = NES, y = pathway, fill = label))+
theme_classic() + geom_bar(stat = 'identity', color = 'black')+
scale_fill_manual(values = c('dodgerblue3','yellow'))+
scale_x_continuous(breaks = nes_limits, labels = round(nes_limits, 1))+
theme(axis.line.y = element_blank(), axis.ticks.y = element_blank(),
axis.text = element_text(color = 'black'), axis.text.y = element_text(size = 3),
legend.position = 'top')+
guides(y = guide_axis(cap = "both"), x = guide_axis(cap = "both"))+
labs(fill = '', y = '')Tumour 5 and 6 cells showed increased interferon response and reduced proliferation potential compared to tumour cells in other clusters.