Introduction


The following code replicates the analysis shown in the Kandinsky paper to study spatial co-localization/dispersion between different cell types found in a breast cancer sample profiled with 10X Genomics Xenium platform.


Xenium


The Xenium Breast Cancer dataset has been downloaded from 10X Genomics website.

This sample has been analysed using the Xenium Prime 5K Human Pan Tissue & Pathways Panel.


1) Loading data


To work with Xenium data using Kandinsky, we first need to set the path to the folder containing Xenium raw input data.

# Import Xenium data
path = 'Kandinsky/test_data/Xenium/Breast_5K/'
list.files(path)
#> [1] "aux_outputs.tar.gz"         "cell_boundaries.csv.gz"    
#> [3] "cell_boundaries.parquet"    "cell_feature_matrix.tar.gz"
#> [5] "cells.csv.gz"               "cells.parquet"             
#> [7] "transcripts.parquet"

At this point we can import and load Xenium data into a Seurat object using the Kandinsky function prepare_xenium_seurat().

We need to specify:

  • the path to the input folder

  • the dataset.id that will be used to name the Seurat dataset.

Additional parameters include:

  • fovs: to import only the cells mapped within the specified list of FOVS.
    If we want to import the whole dataset, we can set fov = NULL (default option)

  • h5: a logical value indicating the type of input file to use for importing the Xenium cell expression count matrix.
    When set to h5 = TRUE, the function will use the R package rhdf5 (if installed) to import the count matrix from the cell_feature_matrix.h5 file located in the Xenium folder.
    Alternatively, setting h5 = FALSE (the default) will import the expression count matrix by reading the compressed cell_feature_matrix.tar.gz file in the Xenium folder.

xen = prepare_xenium_seurat(path, dataset.id = 'Breast', h5 = FALSE)
#> Loading cell polygon file...
#> Loading cell metadata...
#> Locating cell transcript file...
#> Preparing fov position data...
#> Preparing gene expression matrix...
#> Creating Seurat object...

Before proceeding with the analysis, we can remove single cells with a low number of detected transcripts and genes, using arbitrary thresholds.

In this dataset, we will retain cells that have at least 25 detected transcripts across a minimum of 20 unique genes:

# Filter low-count cells
ncol(xen)
#> [1] 699110
xen = subset(xen, subset = nCount_Xenium > 25)
xen = subset(xen, subset = nFeature_Xenium > 20)
ncol(xen)
#> [1] 468583


2) c-NB identification


Now that the new Seurat object is ready, we can generate the associated Kandinsky data using the kandinsky_init() function.

Since we are working with Xenium data, we specify the argument tech = 'xenium' to ensure Kandinsky recognizes the Xenium data format.

We use Kandinsky to identify cell neighbourhoods based on cell-to-cell centroid distance (nb.method = 'C'), and setting a distance threshold of 40um (d.max = 40).

# Initialize Kandinsky
xen = kandinsky_init(xen
                     , tech = 'xenium'
                     , nb.method = 'C'
                     , d.max = 40)
#> Building Kandinsky slot "sf"...
#> Building Kandinsky slot "nb"...
#> Building Kandinsky slot "tx"...
#> Building Kandinsky slot "fov_mask"...
#> Kandinsky data loaded into Seurat object!


3) Data normalization and clustering


To identify major cell types in our Xenium Breast cancer sample, we will use Seurat to perform a sketch-based pre-processing and clustering analysis pipeline.

Specifically, we will first identify single-cell clusters on a subset of 75K cells, and then project the identified cluster into the full dataset to annotate the remaining cells excluded from the initial clustering step. Users can found more information about Seurat sketch-based analysis here.

#Start cell clustering (sketch method)
xen = NormalizeData(xen)
#> Normalizing layer: counts

xen <- FindVariableFeatures(xen)
#> Finding variable features for layer counts
xen <- SketchData(object = xen,
                  ncells = 75000,
                  method = "LeverageScore",
                  sketched.assay = "sketch"
)   
#> Calcuating Leverage Score
#> Attempting to cast layer counts to dgCMatrix
#> Attempting to cast layer data to dgCMatrix

DefaultAssay(xen) <- "sketch"
xen <- FindVariableFeatures(xen)
#> Finding variable features for layer counts
xen <- ScaleData(xen)
#> Centering and scaling data matrix
xen <- RunPCA(xen)
#> PC_ 1 
#> Positive:  IL7R, CCL2, LUM, C11orf96, POSTN, CD52, CXCR4, DCN, SLC2A3, BGN 
#>     THY1, C1QC, TRAC, SERPINE1, TGFBR2, COL4A1, TRBC1, CCN1, BIRC3, PIM1 
#>     ZEB2, MS4A6A, CCN2, CTSL, FSTL1, AQP1, AIF1, SGK1, CXCL9, EMP1 
#> Negative:  CA12, XBP1, RAB11FIP1, SLC9A3R1, TSPAN13, GATA3, CCND1, DDR1, ESR1, HSPB1 
#>     ANKRD30A, PLXNB1, WWP1, MLPH, HSPB8, THSD4, PDZK1, FOXA1, RGL3, GREB1 
#>     SERPINA3, PABPC1L, TRPS1, CELSR1, DHCR24, NOTCH2NLA, PDCD4, PRLR, IL6ST, LRATD2 
#> PC_ 2 
#> Positive:  IL7R, TRBC1, TRAC, CXCR4, CD52, MS4A1, CD8A, CD3E, CTLA4, TNFRSF13C 
#>     CD3D, P2RX5, CD2, CD247, SH2D1A, BIRC3, PIM2, RAB11FIP1, CD79A, TIGIT 
#>     CD5, CA12, CD28, SEPTIN6, PDCD4, PPP1R16B, RUNX3, TENT5C, XBP1, CD19 
#> Negative:  COL4A1, CCN1, SERPINE1, COL4A2, BGN, POSTN, PLPP3, THY1, COL5A1, CCN2 
#>     HSPG2, AQP1, C11orf96, PLVAP, EMP1, THBS1, EPAS1, ADAMTS1, CCL2, TNC 
#>     FLT1, LUM, SLCO2A1, SERPINH1, FSTL1, COL5A2, SULF1, PTPRB, SOCS3, MMP14 
#> PC_ 3 
#> Positive:  PLVAP, AQP1, FLT1, SLCO2A1, SHANK3, PTPRB, CD93, EGFL7, ADAMTS1, NOS3 
#>     CALCRL, EPAS1, KDR, TMEM255B, TIE1, MALL, PECAM1, PODXL, ENG, ADGRL4 
#>     NOTCH4, SELP, SELE, MMRN2, HSPG2, ROBO4, TEK, CD34, PTPRM, TSPAN7 
#> Negative:  POSTN, LUM, CCN2, COL5A1, SULF1, DCN, SERPINE1, MEG3, TNC, ADAM12 
#>     COL5A2, CTHRC1, THBS1, ABI3BP, LTBP2, BGN, INHBA, THBS2, CCDC71L, TIMP3 
#>     CCL2, CCDC80, CCN4, AEBP1, CDH11, CTSL, MRC2, CTSK, VGLL3, LRP1 
#> PC_ 4 
#> Positive:  C1QC, MS4A6A, BIRC5, ITGAX, TOP2A, AIF1, TYMS, C5AR1, KIF18B, CDK1 
#>     FCGR2A, MYBL2, MPP1, CTSL, CD68, ITGB2, LGMN, ADAM8, TPX2, RRM2 
#>     MAFB, FOXM1, SGK1, CIITA, STMN1, CCR1, MSR1, NUSAP1, SLCO2B1, MKI67 
#> Negative:  TIMP3, ERBB2, THBS1, XBP1, ANKRD30A, IGSF1, TCIM, FOXA1, FGFR1, CACNG4 
#>     NPY1R, MLPH, KIAA1324, WWP1, RAB11FIP1, TFAP2A, TNC, ERBB3, GATA3, SERPINA3 
#>     ESR1, DCN, IRS1, MEG3, THSD4, SLC9A3R1, IL6ST, PVT1, PLXNB1, PAWR 
#> PC_ 5 
#> Positive:  C1QC, ITGAX, MS4A6A, C5AR1, CTSL, FCGR2A, AIF1, LGMN, MAFB, MPP1 
#>     CD68, MSR1, ADAM8, SLCO2B1, SGK1, CCR1, PLAUR, LILRB4, LAIR1, PLA2G7 
#>     MRC1, GRN, ABCA1, NR1H3, HAVCR2, ZEB2, CYBB, ITGB2, SIGLEC1, SIRPA 
#> Negative:  BIRC5, TOP2A, KIF18B, TYMS, CDK1, MYBL2, RRM2, FOXM1, TPX2, MKI67 
#>     STMN1, NUSAP1, FANCI, RAD54L, CENPF, SPAG5, MELK, UHRF1, CDCA3, TK1 
#>     KIF2C, NEK2, DHFR, BUB1, SMC4, ASPM, PCNA, KIF11, TMPO, E2F1
xen <- FindNeighbors(xen, dims = 1:50)
#> Computing nearest neighbor graph
#> Computing SNN
xen <- FindClusters(xen, resolution = 1.2)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 75000
#> Number of edges: 3021467
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8710
#> Number of communities: 28
#> Elapsed time: 26 seconds
#> 5 singletons identified. 23 final clusters.

xen <- RunUMAP(xen, dims = 1:50, return.model = TRUE)
#> UMAP will return its model
#> 11:13:51 UMAP embedding parameters a = 0.9922 b = 1.112
#> 11:13:51 Read 75000 rows and found 50 numeric columns
#> 11:13:51 Using Annoy for neighbor search, n_neighbors = 30
#> 11:13:51 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 11:14:00 Writing NN index file to temp file /var/tmp/pbs.750154.peralta/RtmpLxjNjH/fileab0163608201
#> 11:14:00 Searching Annoy index using 1 thread, search_k = 3000
#> 11:14:35 Annoy recall = 100%
#> 11:14:36 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 11:14:40 Initializing from normalized Laplacian + noise (using RSpectra)
#> 11:14:43 Commencing optimization for 200 epochs, with 3719930 positive edges
#> 11:16:36 Optimization finished

xen <- ProjectData(object = xen,
                   assay = "Xenium",
                   full.reduction = "pca.full",
                   sketched.assay = "sketch",
                   sketched.reduction = "pca",
                   umap.model = "umap",
                   dims = 1:50,
                   refdata = list(cluster_full = "seurat_clusters")
)
#> pca.full is not in the object. Data from all cells will be projected to pca
#> Projecting cell embeddings
#> Finding sketch neighbors
#> Finding sketch weight matrix
#> Transfering refdata from sketch
#> Projection to sketch umap
#> Running UMAP projection
#> 11:21:16 Read 468583 rows
#> 11:21:16 Processing block 1 of 1
#> 11:21:16 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 11:21:23 Initializing by weighted average of neighbor coordinates using 1 thread
#> 11:21:26 Commencing optimization for 67 epochs, with 14057490 positive edges
#> 11:23:37 Finished

# Now that we have projected the full dataset, switch back to analyzing all cells
DefaultAssay(xen) <- "Xenium"
Idents(xen) = 'cluster_full'
length(unique(xen$cluster_full))
#> [1] 23


We have identified a total of 23 clusters.

We can now annotate each cluster for its corresponding cell type by inspecting cluster-specific marker genes through differential expression (DE) analysis using Seurat FindAllMarkers() function. In particular, we define the list of marker genes for each cluster by focusing on significantly up-regulated genes (adjusted p-value < 0.05 and logFC > 0.5).

library(presto)
res = FindAllMarkers(xen, logfc.thresold = 0.1, min.pct = 0.1, only.pos = T)
res = res %>% filter(p_val_adj < 0.05, avg_log2FC > 0.5)
res %>% group_by(cluster) %>% arrange(desc(avg_log2FC)) %>% ungroup() %>% arrange(cluster)
#> # A tibble: 2,785 × 7
#>    p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene    
#>    <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
#>  1     0       3.03 0.107 0.008         0 0       CRH     
#>  2     0       2.73 0.236 0.024         0 0       PLIN5   
#>  3     0       2.56 0.223 0.023         0 0       AGR3    
#>  4     0       2.52 0.378 0.043         0 0       AGR2    
#>  5     0       2.49 0.406 0.052         0 0       WNT9A   
#>  6     0       2.49 0.79  0.121         0 0       TSPAN13 
#>  7     0       2.47 0.517 0.074         0 0       SERPINA3
#>  8     0       2.40 0.569 0.081         0 0       DDR1    
#>  9     0       2.39 0.104 0.01          0 0       PC      
#> 10     0       2.38 0.396 0.055         0 0       CELSR1  
#> # ℹ 2,775 more rows

# 23 clusters found by Seurat
clust_names = seq_len(23) - 1

names(clust_names) = c('Tumour.1','Fibro.1','Endo','TCD8','Macro','Tumour.2',
                       'Fibro.2','Tumour.3','Tumour.4','Mono','Bcell','Tumour.5',
                       'Tumour.6','TCD4','Plasma','PVL','Unassigned','DC.1',
                       'MEC','Mast','Tumour.7','DC.2','Adipocytes')

clust_names = data.frame(seurat_clust = clust_names,final_clust = names(clust_names))

xen$cluster_full = as.numeric(as.character(xen$cluster_full))
new_meta = merge(xen@meta.data, clust_names, by.x = 'cluster_full', by.y = 'seurat_clust')
rownames(new_meta) = new_meta$cell_id
new_meta = new_meta[colnames(xen),]
xen@meta.data = new_meta
xen$final_clust = factor(xen$final_clust,levels = sort(unique(xen$final_clust)))
Idents(xen) = 'final_clust'
xen$top_anno = unname(sapply(as.character(xen$final_clust),function(x){strsplit(x,split='\\.')[[1]][[1]]}))


We can now visualize our single cells annotated according to their cell type of origin. Specifically, we will use three different visualization strategies:

  • markers dot plot: the top 2 DE genes for each cell type will be displayed in a dot plot using the Seurat function DotPlot();
  • Uniform Manifold Approximation and Projection (UMAP): cells will be visualized in a low-dimensional space (previously generated as part of the Seurat clustering pipeline with RunUMAP()) and coloured according to their cell type of origin;
  • spatial visualization: Kandinsky function KanPlot() is used to map single cells into space according to their associated segmentation coordinates. Each cell will be coloured according to their cell type of origin.
markers = FindAllMarkers(xen, logfc.thresold = 0.5, min.pct = 0.1, only.pos = T)
markers =  markers  %>% 
  group_by(cluster) %>% 
  filter(p_val_adj < 0.05, avg_log2FC >= 1) %>% 
  slice_max(n = 2, order_by = avg_log2FC) %>%
  pull(gene)

markers = unique(markers)

library(ggplot2)
library(viridis)

# Cell type marker dot plot
DotPlot(xen, features = markers, dot.min = 0.05) + scale_color_viridis(option = 'viridis') +
  theme(axis.text.x = element_text(size = 4, angle = 90), legend.text = element_text(size = 3), legend.key.size = unit(1, 'cm'), legend.title = element_text(size = 5)) + labs(x = '', y = '')


pal = c("ivory1", "#ff0062","#e6ad1e", "cyan","#cfcf1b", "#FFCC99", "gray90", "#94FFB5","#f0f29d", 
        "limegreen", "#e8e9fd", "#FFA405", "#FFA8BB", "lightcyan", "magenta", "dodgerblue3", "#00d1c3","#E0FF66",
        "#ddbefa","indianred1", "#FFFF80", "#FFE100", "#ff7236")

# UMAP
DimPlot(xen, reduction = 'full.umap', cols = pal, raster.dpi = c(8000, 8000), pt.size = 5) + theme_void() + NoLegend()


# KanPlot of representative sample region (upper-left sample quarter)
max_x = median(xen$x_centroid)
min_y = median(xen$y_centroid)
KanPlot(subset(xen, subset = x_centroid > max_x & y_centroid< min_y),
        feature = 'final_clust'
        , palette_discrete = pal
        , show_border = T
        , border.size = 0.02)


#Visualize the whole sample
KanPlot(xen, feature = 'final_clust', palette_discrete = pal)

By examining the spatial visualization of the whole Xenium breast cancer slide generated with KanPlot(), we can notice a highly variable density/abundance of cells, particularly when comparing peri-tumoral and stromal regions.

Having defined c-NBs with Kandinsky using a fixed centroid distance of 40um, we might have a relatively variable size of c-NBs depending on cell localization within the sample.

To assess this variability, we can use the nb_summary() function from Kandinsky to get a text summary of the range of c-NB size values observed across all cells in the Xenium dataset.

nb_summary(xen)
#> Neighbour method: Centroid distance (C), d.max = 40
#> Total cells/spots: 468583
#> Total neighbour links: 17828594
#> Minimum number of neighbours per cell/spot: 0
#> Mean number of neighbours per cell/spot: 38.05
#> Median number of neighbours per cell/spot: 35
#> Max number of neighbours per cell/spot: 140
#> Number of cells/spots with no neighbours: 1757

# Get list of c-NB size for all cells
tot_nb = data.frame(size = Matrix::colSums(as(KanData(xen,'nb'),'CsparseMatrix')))

# Plot distribution of c-NB size values across cells
labels = c(min(tot_nb$size), median(tot_nb$size), max(tot_nb$size))

ggplot(tot_nb, aes(y = size)) + theme_classic()+
  geom_density() + scale_y_reverse(breaks = (labels), labels = (labels))+
  geom_hline(yintercept = labels[2], linetype = 'dashed') +
  theme(axis.text.x = element_blank(), axis.line.x = element_blank(), axis.ticks.x = element_blank()) +
  guides(y = guide_axis(cap = "both")) + labs(y = 'c-NB size', x = '')


4) Co-localisation/dispersion


Kandinsky implements a co-localisation/dispersion analysis module to test for spatial co-localisation or dispersion between any pair of cell types present in the analysed dataset.

Co-localisation/dispersion analysis is performed by using the Kandinsky jc_coloc() function.

The cell type annotation name is specified through the label parameter. The function will internally performs a join count test to calculate a co-localisation z-coefficient for each cell type pairs, with positive and negative coefficients indicating a tendency of co-localisation or dispersion between cell types, respectively.

The main output of jc_coloc() is a heatmap plot displaying all co-localisation z-values computed by Kandinsky for any cell type combination. By setting the parameter return.mat = TRUE, the function will return a list object containing both the heatmap plot and a data frame reporting all the numeric co-localisation z-values computed by Kandinsky.


xen$final_clust = factor(xen$final_clust, levels = c(
  'Bcell','Plasma','DC.1','DC.2','Mono','Macro',
  'TCD4','TCD8','Mast','Tumour.1','Tumour.2',
  'Tumour.3','Tumour.4','Tumour.5','Tumour.6','Tumour.7',
  'MEC','Adipocytes','Endo',
  'Fibro.1','Fibro.2','PVL','Unassigned'))

coloc = jc_coloc(xen, label = 'final_clust', return.mat = TRUE)

head(coloc$coloc_mat)
#>               Joincount  Expected  Variance   z-value  Type1  Type2
#> Bcell:Bcell      208821 35284.556 79180.219 616.71192  Bcell  Bcell
#> Plasma:Plasma     29564 11883.349 20715.906 122.84184 Plasma Plasma
#> DC.1:DC.1         30027  4006.886  5764.844 342.70103   DC.1   DC.1
#> DC.2:DC.2          5506  1364.572  1717.168  99.94101   DC.2   DC.2
#> Mono:Mono         33847 18853.386 36322.642  78.67153   Mono   Mono
#> Macro:Macro       46688 32008.312 70061.469  55.45966  Macro  Macro
#>               oe_log2ratio oe_ratio
#> Bcell:Bcell      2.5651580 5.918198
#> Plasma:Plasma    1.3149001 2.487851
#> DC.1:DC.1        2.9057069 7.493849
#> DC.2:DC.2        2.0125560 4.034965
#> Mono:Mono        0.8442043 1.795274
#> Macro:Macro      0.5446052 1.458621

coloc$plot + ggtitle('JC co-localization test')


From the co-localisation results, we noticed a tendency of myoepithelial cells (MEC) to co-localise with Tumour 5 and Tumour 6 cells.
We can check for phenotypic differences between this subset of tumour cells and the remaining tumour clusters via differential gene expression analysis:


de_tumour = FindMarkers(xen, ident.1 = c('Tumour.5','Tumour.6'), ident.2 = c('Tumour.4','Tumour.1','Tumour.2','Tumour.3','Tumour.7')
                        , logfc.thresold = -Inf, return.thresh = 1.1, only.pos = FALSE)

de_tumour$gene = rownames(de_tumour)
de_tumour$comparison = '56_vs_12347'

To interpret differential gene expression results, we will use the package fgsea to perform pre-ranked gene set enrichment analysis considering the Hallmark gene gene sets from the Molecular Signature DataBase (MSigDB).

# Load list of Hallmark signatures
h = system.file('extdata','MSigDB_Hallmark_sigs.rds', package = 'Kandinsky')
h = readRDS(h)

de_tumour = de_tumour %>% mutate(FC = 2^avg_log2FC) %>% arrange(desc(FC))

# Run fgsea
library(fgsea)
stats = de_tumour$avg_log2FC
names(stats) = de_tumour$gene

set.seed(347548)
gsea_res = fgsea(stats = stats, pathways = h)
gsea_res = gsea_res %>% as.data.frame()

# Clean fgsea result table
gsea_res$pathway = gsub('HALLMARK_','',gsea_res$pathway)
gsea_res = gsea_res[,c('pathway','padj','NES')] %>% filter(padj < 0.05) %>% 
  arrange((NES)) %>% 
  mutate(pathway = factor(pathway, levels = unique(pathway)), label = ifelse(NES > 0,'Enriched in Tumour 5/6','Depleted in Tumour 5/6'))

# Plot significant enrichment results
nes_limits = c(min(gsea_res$NES), max(gsea_res$NES))
ggplot(gsea_res,aes(x = NES, y = pathway, fill = label))+
  theme_classic() + geom_bar(stat = 'identity', color = 'black')+
  scale_fill_manual(values = c('dodgerblue3','yellow'))+
  scale_x_continuous(breaks = nes_limits, labels = round(nes_limits, 1))+
  theme(axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.text = element_text(color = 'black'), axis.text.y = element_text(size = 3),
        legend.position = 'top')+
  guides(y = guide_axis(cap = "both"), x = guide_axis(cap = "both"))+
  labs(fill = '', y = '')

Tumour 5 and 6 cells showed increased interferon response and reduced proliferation potential compared to tumour cells in other clusters.