Last updated: 2022-02-22
Checks: 7 0
Knit directory: MelanomaIMC/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200728)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version d246c15. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: Table_S4.csv
Ignored: code/.DS_Store
Ignored: code/._.DS_Store
Ignored: data/.DS_Store
Ignored: data/._.DS_Store
Ignored: data/data_for_analysis/
Ignored: data/full_data/
Unstaged changes:
Modified: .gitignore
Modified: analysis/Supp-Figure_10.rmd
Modified: analysis/_site.yml
Deleted: analysis/license.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/04_1_Protein_celltype_classification.rmd
) and HTML (docs/04_1_Protein_celltype_classification.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 73aa800 | toobiwankenobi | 2022-02-22 | add .html for static website |
Rmd | f9a3a83 | toobiwankenobi | 2022-02-08 | clean repo for release |
Rmd | 3da15db | toobiwankenobi | 2021-11-24 | changes for revision |
html | 4109ff1 | toobiwankenobi | 2021-07-07 | delete html files and adapt gitignore |
Rmd | 3203891 | toobiwankenobi | 2021-02-19 | change celltype names |
html | 3203891 | toobiwankenobi | 2021-02-19 | change celltype names |
Rmd | ee1595d | toobiwankenobi | 2021-02-12 | clean repo and adapt files |
html | ee1595d | toobiwankenobi | 2021-02-12 | clean repo and adapt files |
html | 3f5af3f | toobiwankenobi | 2021-02-09 | add .html files |
Rmd | f9bb33a | toobiwankenobi | 2021-02-04 | new Figure 5 and minor changes in figure order |
Rmd | 2ac1833 | toobiwankenobi | 2021-01-08 | changes to Figures |
Rmd | 9442cb9 | toobiwankenobi | 2020-12-22 | add all new files |
Rmd | d8819f2 | toobiwankenobi | 2020-10-08 | read new data (nuclei expansion) and adapt scripts |
Rmd | 2c11d5c | toobiwankenobi | 2020-08-05 | add new scripts |
This script performs cell-type classification based on manually labelled cells. We will create increasing complexity for cell type labelling.
knitr::opts_chunk$set(echo = TRUE, message= FALSE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
First, we will read in the SingleCellExperiment object and load all libraries.
library(caret)
library(scater)
library(tidyverse)
library(dittoSeq)
library(viridis)
library(doParallel)
library(ggpubr)
library(randomForest)
sce <- readRDS("data/data_for_analysis/sce_protein.rds")
# load all subsetted sce object from hierarchichal gating and combine the
label.files <- list.files("data/data_for_analysis/protein/celltype_classifier/", full.names = TRUE)
# Read in SCE objects
cur_sces <- lapply(label.files, readRDS)
# Merge SCE objects
# Due to relabelling, we now need to match the colData entries and remove rowData
cur_entries <- names(colData(cur_sces[[1]]))
cur_sces <- lapply(cur_sces, function(x){
colData(x) <- colData(x)[,cur_entries]
rowData(x) <- NA
return(x)
})
labelled_sce <- do.call("cbind", cur_sces)
# add rowData
rowData(labelled_sce) <- rowData(sce)
unique(labelled_sce$cytomapper_CellLabel)
[1] "Bcell" "BnTcell" "Macrophage" "Neutrophil" "pDC"
[6] "Stroma" "Tcytotoxic" "Thelper" "Tregulatory" "Tumor"
[11] "unknown"
labelled_sce[,labelled_sce$cytomapper_CellLabel == "Bcell"]$cytomapper_CellLabel <- "B cell"
labelled_sce[,labelled_sce$cytomapper_CellLabel == "BnTcell"]$cytomapper_CellLabel <- "BnT cell"
labelled_sce[,labelled_sce$cytomapper_CellLabel == "Tcytotoxic"]$cytomapper_CellLabel <- "CD8+ T cell"
labelled_sce[,labelled_sce$cytomapper_CellLabel == "Thelper"]$cytomapper_CellLabel <- "CD4+ T cell"
labelled_sce[,labelled_sce$cytomapper_CellLabel == "Tregulatory"]$cytomapper_CellLabel <- "FOXP3+ T cell"
# how many duplicates do we have?
ncol(labelled_sce[,duplicated(labelled_sce$cellID) == T]) / ncol(labelled_sce[,duplicated(labelled_sce$cellID) == F]) * 100
[1] 0.06727787
# remove duplicates (more than 1 label per cellID)
unique_labels <- labelled_sce[,duplicated(labelled_sce$cellID) == F]
unique(unique_labels$cytomapper_CellLabel)
[1] "B cell" "BnT cell" "Macrophage" "Neutrophil"
[5] "pDC" "Stroma" "CD8+ T cell" "CD4+ T cell"
[9] "FOXP3+ T cell" "Tumor" "unknown"
label_vector <- rep("unlabelled", ncol(sce))
names(label_vector) <- colnames(sce)
label_vector[colnames(unique_labels)] <- unique_labels$cytomapper_CellLabel
# add to sce
colData(sce)$layer_1_gated <- label_vector
Here, we will define a colour vector for the cell-types contained in layer 1.
layer1_colours <- vector(length = length(unique(label_vector)))
names(layer1_colours) <- unique(label_vector)
layer1_colours["BnT cell"] <- "goldenrod2"
layer1_colours["B cell"] <- "green1"
layer1_colours["Macrophage"] <- "greenyellow"
layer1_colours["Neutrophil"] <- "blue1"
layer1_colours["CD4+ T cell"] <- "yellow"
layer1_colours["FOXP3+ T cell"] <- "sienna1"
layer1_colours["pDC"] <- "red2"
layer1_colours["Stroma"] <- "tomato"
layer1_colours["CD8+ T cell"] <- "deepskyblue"
layer1_colours["Tumor"] <- "sienna4"
layer1_colours["unknown"] <- "gray"
layer1_colours["unlabelled"] <- "gray"
# Save in SCE object
metadata(sce)$colour_vectors$layer_1 <- layer1_colours
In the next step, we will check the quality of the labels by:
Next, we will check how many cells and how many images are labelled.
# 2. How many cells of how many images are labelled
# Percent cells labelled
as_tibble(colData(sce)) %>%
summarise(labelled_cells = sum(layer_1_gated != "unlabelled")/n()) * 100
labelled_cells
1 8.863518
# Percent images labelled
as_tibble(colData(sce)) %>%
group_by(ImageNumber) %>%
summarise(labelled_cells = sum(layer_1_gated != "unlabelled")) %>%
ungroup() %>%
summarise(labelled_images = sum(labelled_cells != 0)/n()) * 100
labelled_images
1 42.51497
# Percent of cells labelled per image
as_tibble(colData(sce)) %>%
group_by(ImageNumber) %>%
summarise(labelled_cells = sum(layer_1_gated != "unlabelled")/n(),
number_cells = n()) %>%
as.data.frame()
ImageNumber labelled_cells number_cells
1 1 0.005673409 4054
2 2 0.693469672 8606
3 3 0.387682493 8151
4 4 0.080830616 5827
5 5 0.277162359 8937
6 6 0.646044158 4348
7 7 0.083700441 8172
8 8 0.656653747 6192
9 9 0.540270346 7102
10 10 0.145543745 1223
11 11 0.000000000 8051
12 12 0.023703041 4472
13 13 0.384896023 8223
14 14 0.240790656 6678
15 15 0.018567639 4524
16 16 0.000000000 150
17 17 0.000000000 4282
18 18 0.354059609 9730
19 19 0.460003837 5213
20 20 0.000000000 6404
21 21 0.014819588 3104
22 22 0.000000000 6044
23 23 0.000000000 5163
24 24 0.000000000 3676
25 25 0.000000000 7588
26 26 0.010978520 2095
27 27 0.263806287 2354
28 28 0.000000000 2860
29 29 0.130766749 9299
30 30 0.022828220 9681
31 31 0.450201497 5211
32 32 0.856838744 7516
33 33 0.154418539 14111
34 34 0.000000000 7979
35 35 0.000000000 6390
36 36 0.015293537 4054
37 37 0.081128189 6311
38 38 0.000000000 7972
39 39 0.000000000 5838
40 40 0.000000000 3583
41 41 0.000000000 6265
42 42 0.411458333 4032
43 43 0.000000000 12718
44 44 0.443698950 3428
45 45 0.039334577 6432
46 46 0.000000000 1725
47 47 0.119944212 10038
48 48 0.000000000 13418
49 49 0.000000000 6955
50 50 0.000000000 11348
51 51 0.000000000 397
52 52 0.089435014 7894
53 53 0.000000000 4225
54 54 0.000000000 1554
55 55 0.000000000 3822
56 56 0.016384018 3479
57 57 0.000000000 1124
58 58 0.145661811 3158
59 59 0.023677079 7011
60 60 0.000000000 7333
61 61 0.000000000 6449
62 62 0.000000000 623
63 63 0.163449163 1554
64 64 0.415044172 11659
65 65 0.000000000 1884
66 66 0.000000000 9010
67 67 0.000000000 5410
68 68 0.647529706 7995
69 69 0.035565177 6467
70 70 0.000000000 4677
71 71 0.000000000 498
72 72 0.117969495 4196
73 73 0.224781119 5254
74 74 0.000000000 936
75 75 0.005124162 7611
76 76 0.451695050 6283
77 77 0.073209073 4453
78 78 0.086204082 6125
79 79 0.000000000 2316
80 80 0.000000000 5077
81 81 0.000000000 7437
82 82 0.036069442 5933
83 83 0.044240130 9245
84 84 0.000000000 7433
85 85 0.014137264 8276
86 86 0.122383857 10655
87 87 0.000000000 4625
88 88 0.000000000 8381
89 89 0.000000000 8795
90 90 0.000000000 6276
91 91 0.000000000 5592
92 92 0.000000000 4971
93 93 0.000000000 7830
94 94 0.000000000 11871
95 95 0.223364019 14013
96 96 0.000000000 7064
97 97 0.000000000 5448
98 98 0.000000000 1583
99 99 0.000000000 5865
100 100 0.003671409 2179
101 101 0.090753425 4088
102 102 0.055220641 4147
103 103 0.000000000 8063
104 104 0.103303303 6660
105 105 0.000000000 9135
106 106 0.000000000 4230
107 107 0.000000000 5316
108 108 0.020666920 7887
109 109 0.000000000 8182
110 110 0.028205128 8580
111 111 0.000000000 994
112 112 0.000000000 7230
113 113 0.022005295 6044
114 114 0.008444537 9592
115 115 0.000000000 5326
116 116 0.000000000 5164
117 117 0.000000000 2520
118 118 0.188493151 7300
119 119 0.000000000 9371
120 120 0.000000000 7119
121 121 0.000000000 3693
122 122 0.000000000 10368
123 123 0.000000000 7635
124 124 0.000000000 5999
125 125 0.000000000 6573
126 126 0.000000000 7288
127 127 0.000000000 6411
128 128 0.000000000 6508
129 129 0.000000000 7698
130 130 0.631642879 6586
131 131 0.000000000 6072
132 132 0.064892136 5748
133 133 0.085784818 4861
134 134 0.058528428 6578
135 135 0.000000000 7615
136 136 0.000000000 5898
137 137 0.000000000 5788
138 138 0.000000000 4631
139 139 0.032942993 7771
140 140 0.022099448 4525
141 141 0.027125540 11797
142 142 0.061248528 5943
143 143 0.000000000 6024
144 144 0.000000000 5584
145 145 0.098246355 9466
146 146 0.000000000 8089
147 147 0.000000000 7454
148 148 0.021767297 3859
149 149 0.000000000 6605
150 150 0.000000000 6910
151 151 0.000000000 5720
152 152 0.000000000 3895
153 153 0.000000000 6053
154 154 0.000000000 3958
155 155 0.000000000 3183
156 156 0.000000000 6238
157 157 0.000000000 2151
158 158 0.126542400 3809
159 159 0.000000000 5162
160 160 0.000000000 844
161 161 0.120879121 4914
162 162 0.082316344 8358
163 163 0.000000000 969
164 164 0.000000000 2304
165 165 0.000000000 3467
166 166 0.000000000 578
167 167 0.000000000 1403
We will check how balanced the classes are across the images.
# Total cells per class
as_tibble(colData(sce)) %>%
group_by(layer_1_gated) %>%
summarise(number_cells = n())
# A tibble: 12 × 2
layer_1_gated number_cells
<chr> <int>
1 B cell 9476
2 BnT cell 3512
3 CD4+ T cell 7087
4 CD8+ T cell 3617
5 FOXP3+ T cell 791
6 Macrophage 3640
7 Neutrophil 2620
8 pDC 700
9 Stroma 2919
10 Tumor 53023
11 unknown 311
12 unlabelled 901708
# Total cells per class and Sample
as_tibble(colData(sce)) %>%
group_by(layer_1_gated, ImageNumber) %>%
summarise(number_cells = n()) %>%
as.data.frame() %>%
head(.)
layer_1_gated ImageNumber number_cells
1 B cell 14 813
2 B cell 33 1572
3 B cell 64 4191
4 B cell 86 844
5 B cell 95 651
6 B cell 118 440
Now, we will check the expression of selected markers across the classes and visualize cell labels on UMAP.
lab_sce <- sce[,sce$layer_1_gated != "unlabelled"]
agr_sce <- aggregateAcrossCells(lab_sce, ids = colData(lab_sce)[,c("ImageNumber", "layer_1_gated")],
average = TRUE)
Warning: 'average=' is deprecated, use 'statistics=' instead
assay(agr_sce, "asinh") <- asinh(counts(agr_sce))
assay(agr_sce, "scaled_asinh") <- t(scale(t(asinh(counts(agr_sce)))))
colnames(agr_sce) <- paste0(agr_sce$ImageNumber, "_", agr_sce$layer_1_gated)
# Define markers that were used for gating
rownames(sce)
[1] "Vimentin" "Caveolin1" "HistoneH3" "SMA" "CD15" "H3K27me3"
[7] "CD7" "CXCR2" "HLADR" "S100" "CD19" "CD45RA"
[13] "Sox9" "TOX1" "CD20" "CD68" "pERK" "CD3"
[19] "CD36" "p75" "PD1" "MiTF" "CD11b" "GrzB"
[25] "PDL1" "TCF7" "CD45RO" "FOXP3" "ICOS" "bCatenin"
[31] "CD8" "Collagen1" "Ki67" "CD11c" "pS6" "CD4"
[37] "IDO1" "SOX10" "CD303" "CD206" "PARP" "DNA1"
[43] "DNA2" "Ki67Pt198" "CD45" "MPO"
cur_markers <- c("CD15", "S100", "CD20", "CD68", "CD3", "p75", "MiTF", "FOXP3", "CD8", "Collagen1",
"CD4", "IDO1", "SOX10", "CD303", "MPO")
# Non-scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "asinh",
cells.use = colnames(agr_sce[cur_markers,]),
annot.by = c("ImageNumber", "layer_1_gated"),
order.by = "layer_1_gated", cluster_rows = FALSE,
scale = "none", heatmap.colors = viridis(100),
annotation_colors = list(layer_1_gated = metadata(sce)$colour_vectors$layer_1))
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
# Centered and scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "scaled_asinh",
annot.by = c("ImageNumber", "layer_1_gated"),
order.by = "layer_1_gated", cluster_rows = FALSE,
annotation_colors = list(layer_1_gated = metadata(sce)$colour_vectors$layer_1),
heatmap.colors = colorRampPalette(c("dark blue", "white", "dark red"))(100),
breaks = seq(-3, 3, length.out = 101))
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
unlab_sce <- sce[,sce$layer_1_gated == "unlabelled"]
ggplot() +
geom_point(aes(x = UMAP1, y = UMAP2, colour = layer_1_gated),
data = data.frame(UMAP1 = reducedDim(unlab_sce, "UMAP")[,1],
UMAP2 = reducedDim(unlab_sce, "UMAP")[,2],
layer_1_gated = colData(unlab_sce)$layer_1_gated)) +
geom_point(aes(x = UMAP1, y = UMAP2, colour = layer_1_gated), size = 0.5,
data = data.frame(UMAP1 = reducedDim(lab_sce, "UMAP")[,1],
UMAP2 = reducedDim(lab_sce, "UMAP")[,2],
layer_1_gated = colData(lab_sce)$layer_1_gated)) +
scale_color_manual(values = metadata(sce)$colour_vectors$layer_1) +
theme_bw()
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
After quality control, we will now use a random forest classifier to classify the remaining cells in the dataset.
In the first instance, we will split the labelled data based on their cell-types and ignore from which images the calls come. In the current setting most images have been labelled but in the future we want to have a closer look at how well cells of non-labelled images are classified.
We will first split the labelled data into training and test (validation) data at a ratio of 70/30 train/test.
set.seed(1234)
trainIndex <- createDataPartition(factor(lab_sce$layer_1_gated), p = 0.70)
train_sce <- lab_sce[,trainIndex$Resample1]
test_sce <- lab_sce[,-trainIndex$Resample1]
Here, we will first use a 10-fold crossvalidation by partitioning the data randomly across the full dataset. This process is repeated 5 times. We will also use parallel processing for time reasons. For the randomForrest
classifier, we need to tune the mtry
parameter - the number of variables sampled for each split.
# Define seeds for parallel processing
# Per iteration, we evaluate 10 models while tuning mtry
set.seed(222)
seeds <- vector(mode = "list", length = 11)
for (i in 1:10) {
seeds[[i]] <- sample.int(5000, 10)
}
seeds[[11]] <- sample.int(5000, 1)
fitControl <- trainControl(method = "repeatedcv",
repeats = 1,
number = 10,
seeds = seeds)
cl <- makePSOCKcluster(round(detectCores()/2,0), outfile = "")
registerDoParallel(cl)
set.seed(1234)
start = Sys.time()
rffit <- train(x = t(assay(train_sce, "asinh")[rowData(sce)$good_marker,]),
y = factor(train_sce$layer_1_gated),
method = "rf", ntree = 500,
tuneLength = 10,
trControl = fitControl,
allowParallel = TRUE)
stopCluster(cl)
end = Sys.time()
print(end-start)
Time difference of 17.97361 mins
rffit
Random Forest
61391 samples
39 predictor
11 classes: 'B cell', 'BnT cell', 'CD4+ T cell', 'CD8+ T cell', 'FOXP3+ T cell', 'Macrophage', 'Neutrophil', 'pDC', 'Stroma', 'Tumor', 'unknown'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 55250, 55253, 55251, 55252, 55253, 55253, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9953413 0.9923472
6 0.9969051 0.9949197
10 0.9970842 0.9952134
14 0.9971820 0.9953735
18 0.9973938 0.9957214
22 0.9973612 0.9956678
26 0.9973937 0.9957208
30 0.9970028 0.9950782
34 0.9964978 0.9942478
39 0.9959929 0.9934172
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 18.
We will now have a look at the accuracy measures over iterations. The only parameter that has been tuned is mtry
.
ggplot(rffit) +
geom_errorbar(data = rffit$results,
aes(ymin = Accuracy - AccuracySD,
ymax = Accuracy + AccuracySD),
width = 0.4)
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
We can also compute the confusion matrix:
confusionMatrix(rffit)
Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction B cell BnT cell CD4+ T cell CD8+ T cell FOXP3+ T cell
B cell 10.8 0.0 0.0 0.0 0.0
BnT cell 0.0 4.0 0.1 0.0 0.0
CD4+ T cell 0.0 0.0 8.0 0.0 0.0
CD8+ T cell 0.0 0.0 0.0 4.1 0.0
FOXP3+ T cell 0.0 0.0 0.0 0.0 0.9
Macrophage 0.0 0.0 0.0 0.0 0.0
Neutrophil 0.0 0.0 0.0 0.0 0.0
pDC 0.0 0.0 0.0 0.0 0.0
Stroma 0.0 0.0 0.0 0.0 0.0
Tumor 0.0 0.0 0.0 0.0 0.0
unknown 0.0 0.0 0.0 0.0 0.0
Reference
Prediction Macrophage Neutrophil pDC Stroma Tumor unknown
B cell 0.0 0.0 0.0 0.0 0.0 0.0
BnT cell 0.0 0.0 0.0 0.0 0.0 0.0
CD4+ T cell 0.0 0.0 0.0 0.0 0.0 0.0
CD8+ T cell 0.0 0.0 0.0 0.0 0.0 0.0
FOXP3+ T cell 0.0 0.0 0.0 0.0 0.0 0.0
Macrophage 4.1 0.0 0.0 0.0 0.0 0.0
Neutrophil 0.0 3.0 0.0 0.0 0.0 0.0
pDC 0.0 0.0 0.8 0.0 0.0 0.0
Stroma 0.0 0.0 0.0 3.3 0.0 0.0
Tumor 0.0 0.0 0.0 0.0 60.4 0.0
unknown 0.0 0.0 0.0 0.0 0.0 0.3
Accuracy (average) : 0.9974
We will also look at the variable importance.
cur_varImp <- varImp(rffit)
plot(cur_varImp, top = 37)
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
Finally, we will validate the model using the test data.
cur_pred <- predict(rffit, newdata = t(assay(test_sce, "asinh")[rowData(sce)$good_marker,]))
cm <- confusionMatrix(data = cur_pred, reference = factor(test_sce$layer_1_gated))
cm
Confusion Matrix and Statistics
Reference
Prediction B cell BnT cell CD4+ T cell CD8+ T cell FOXP3+ T cell
B cell 2841 0 6 0 0
BnT cell 0 1053 8 0 0
CD4+ T cell 1 0 2110 1 1
CD8+ T cell 0 0 0 1084 0
FOXP3+ T cell 0 0 2 0 236
Macrophage 0 0 0 0 0
Neutrophil 0 0 0 0 0
pDC 0 0 0 0 0
Stroma 0 0 0 0 0
Tumor 0 0 0 0 0
unknown 0 0 0 0 0
Reference
Prediction Macrophage Neutrophil pDC Stroma Tumor unknown
B cell 0 0 1 0 0 0
BnT cell 0 0 0 0 0 0
CD4+ T cell 0 0 0 0 0 0
CD8+ T cell 0 0 0 0 4 0
FOXP3+ T cell 0 0 0 0 0 0
Macrophage 1087 0 0 0 1 0
Neutrophil 0 786 0 0 2 0
pDC 5 0 209 0 0 0
Stroma 0 0 0 873 2 0
Tumor 0 0 0 2 15897 4
unknown 0 0 0 0 0 89
Overall Statistics
Accuracy : 0.9985
95% CI : (0.9979, 0.9989)
No Information Rate : 0.6047
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9975
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: B cell Class: BnT cell Class: CD4+ T cell
Sensitivity 0.9996 1.00000 0.99247
Specificity 0.9997 0.99968 0.99988
Pos Pred Value 0.9975 0.99246 0.99858
Neg Pred Value 1.0000 1.00000 0.99934
Prevalence 0.1080 0.04003 0.08082
Detection Rate 0.1080 0.04003 0.08021
Detection Prevalence 0.1083 0.04033 0.08033
Balanced Accuracy 0.9997 0.99984 0.99618
Class: CD8+ T cell Class: FOXP3+ T cell Class: Macrophage
Sensitivity 0.99908 0.995781 0.99542
Specificity 0.99984 0.999923 0.99996
Pos Pred Value 0.99632 0.991597 0.99908
Neg Pred Value 0.99996 0.999962 0.99980
Prevalence 0.04125 0.009010 0.04151
Detection Rate 0.04121 0.008972 0.04132
Detection Prevalence 0.04136 0.009048 0.04136
Balanced Accuracy 0.99946 0.997852 0.99769
Class: Neutrophil Class: pDC Class: Stroma Class: Tumor
Sensitivity 1.00000 0.995238 0.99771 0.9994
Specificity 0.99992 0.999808 0.99992 0.9994
Pos Pred Value 0.99746 0.976636 0.99771 0.9996
Neg Pred Value 1.00000 0.999962 0.99992 0.9991
Prevalence 0.02988 0.007983 0.03326 0.6047
Detection Rate 0.02988 0.007945 0.03319 0.6043
Detection Prevalence 0.02996 0.008135 0.03326 0.6046
Balanced Accuracy 0.99996 0.997523 0.99882 0.9994
Class: unknown
Sensitivity 0.956989
Specificity 1.000000
Pos Pred Value 1.000000
Neg Pred Value 0.999847
Prevalence 0.003535
Detection Rate 0.003383
Detection Prevalence 0.003383
Balanced Accuracy 0.978495
a <- data.frame(cm$byClass) %>%
mutate(class = sub("Class: ", "", rownames(cm$byClass))) %>%
ggplot() +
geom_point(aes(1 - Specificity, Sensitivity,
size = Detection.Rate,
fill = class),
shape = 21) +
scale_fill_manual(values = metadata(sce)$colour_vectors$layer_1) +
theme_bw() +
theme(text=element_text(size=12)) +
ylab("Sensitivity (TPR)") +
xlab("1 - Specificity (FPR)")
legend <- get_legend(a)
a + theme(legend.position = "none")
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
plot(legend)
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
We will also observe the distribution of classification probabilities per image and class:
cur_pred <- predict(rffit, newdata = t(assay(test_sce, "asinh")[rowData(sce)$good_marker,]),
type = "prob")
cur_pred %>%
mutate(class = test_sce$layer_1_gated,
image = test_sce$ImageNumber) %>%
reshape2::melt(id.vars = c("class", "image"), variable.name = "celltype", value.name = "probability") %>%
filter(class == celltype) %>%
ggplot() +
geom_boxplot(aes(interaction(image), probability), outlier.size = 0.5) +
facet_wrap(. ~ class) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
This plot shows the median probability for each image and class.
Finally, we will predict the labels of all other cells. For cell-type classification, we will use the method that was trained across all images.
start = Sys.time()
cell_labels.class <- as.character(predict.train(rffit,
newdata = t(assay(unlab_sce[rowData(sce)$good_marker,], "asinh")),
type = "raw"))
cell_labels.prob <- predict.train(rffit,
newdata = t(assay(unlab_sce[rowData(sce)$good_marker,], "asinh")),
type = "prob")
end = Sys.time()
print(end-start)
Time difference of 1.016046 mins
Store predictions in SCE object. We will not overwrite the labels of the already labelled cells.
cell_labels <- sce$layer_1_gated
cell_labels[colnames(unlab_sce)] <- cell_labels.class
sce$celltype <- cell_labels
Here, we will visualize the predicted cell-types and their associated classification probabilities.
dittoDimPlot(sce[,sce$layer_1_gated != "unlabelled"], var = "celltype", reduction.use = "UMAP", size = 0.5,
color.panel = metadata(sce)$colour_vectors$layer_1, main = "Cell types gated")
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
dittoDimPlot(sce[,sce$layer_1_gated == "unlabelled"], var = "celltype", reduction.use = "UMAP", size = 0.5,
color.panel = metadata(sce)$colour_vectors$layer_1, main = "Cell types classified")
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
for (i in unique(cell_labels.class)) {
cur_df <- data.frame(UMAP1 = reducedDim(unlab_sce, "UMAP")[,1],
UMAP2 = reducedDim(unlab_sce, "UMAP")[,2],
prob = cell_labels.prob[,i],
class = cell_labels.class == i)
ggplot() + geom_point(aes(UMAP1, UMAP2), data = cur_df[!cur_df$class,],
color = "gray") +
geom_point(aes(UMAP1, UMAP2, color = prob), data = cur_df[cur_df$class,],
size = 0.5)+
scale_colour_viridis(name = paste0(i, " probability"))
}
Finally, we will visualize the marker expression per cell type using the classified cells.
unlab_sce <- sce[,sce$layer_1_gated == "unlabelled"]
agr_sce <- aggregateAcrossCells(unlab_sce, ids = colData(unlab_sce)[,c("ImageNumber", "celltype")],
average = TRUE)
Warning: 'average=' is deprecated, use 'statistics=' instead
assay(agr_sce, "asinh") <- asinh(counts(agr_sce))
colnames(agr_sce) <- paste0(agr_sce$ImageNumber, "_", agr_sce$celltype)
rownames(sce)
[1] "Vimentin" "Caveolin1" "HistoneH3" "SMA" "CD15" "H3K27me3"
[7] "CD7" "CXCR2" "HLADR" "S100" "CD19" "CD45RA"
[13] "Sox9" "TOX1" "CD20" "CD68" "pERK" "CD3"
[19] "CD36" "p75" "PD1" "MiTF" "CD11b" "GrzB"
[25] "PDL1" "TCF7" "CD45RO" "FOXP3" "ICOS" "bCatenin"
[31] "CD8" "Collagen1" "Ki67" "CD11c" "pS6" "CD4"
[37] "IDO1" "SOX10" "CD303" "CD206" "PARP" "DNA1"
[43] "DNA2" "Ki67Pt198" "CD45" "MPO"
# Non-scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "asinh",
annot.by = c("celltype"),
order.by = "celltype", cluster_rows = FALSE,
scale = "none", heatmap.colors = viridis(100),
annotation_colors = list(celltype = metadata(sce)$colour_vectors$layer_1))
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
# Centered and scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "asinh",
annot.by = c("celltype"),
cluster_rows = FALSE,
annotation_colors = list(celltype = metadata(sce)$colour_vectors$layer_1),
heatmap.colors = colorRampPalette(c("dark blue", "white", "dark red"))(100),
breaks = seq(-3, 3, length.out = 101))
Version | Author | Date |
---|---|---|
7e52c83 | toobiwankenobi | 2022-02-22 |
saveRDS(sce, "data/data_for_analysis/sce_protein.rds")
# create data frame with class and probabilities and save as csv.
layer_1_dat <- as.data.frame(cell_labels.prob)
layer_1_dat$class <- cell_labels.class
write.csv(layer_1_dat, file = "data/data_for_analysis/layer_1_classification_protein.csv")
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] randomForest_4.6-14 ggpubr_0.4.0
[3] doParallel_1.0.16 iterators_1.0.13
[5] foreach_1.5.2 viridis_0.6.2
[7] viridisLite_0.4.0 dittoSeq_1.6.0
[9] forcats_0.5.1 stringr_1.4.0
[11] dplyr_1.0.7 purrr_0.3.4
[13] readr_2.1.2 tidyr_1.2.0
[15] tibble_3.1.6 tidyverse_1.3.1
[17] scater_1.22.0 scuttle_1.4.0
[19] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
[21] Biobase_2.54.0 GenomicRanges_1.46.1
[23] GenomeInfoDb_1.30.1 IRanges_2.28.0
[25] S4Vectors_0.32.3 BiocGenerics_0.40.0
[27] MatrixGenerics_1.6.0 matrixStats_0.61.0
[29] caret_6.0-90 lattice_0.20-45
[31] ggplot2_3.3.5 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.4.1
[3] plyr_1.8.6 splines_4.1.2
[5] BiocParallel_1.28.3 listenv_0.8.0
[7] digest_0.6.29 htmltools_0.5.2
[9] fansi_1.0.2 magrittr_2.0.2
[11] ScaledMatrix_1.2.0 tzdb_0.2.0
[13] recipes_0.1.17 globals_0.14.0
[15] modelr_0.1.8 gower_0.2.2
[17] colorspace_2.0-2 rvest_1.0.2
[19] ggrepel_0.9.1 haven_2.4.3
[21] xfun_0.29 callr_3.7.0
[23] crayon_1.4.2 RCurl_1.98-1.5
[25] jsonlite_1.7.3 survival_3.2-13
[27] glue_1.6.1 gtable_0.3.0
[29] ipred_0.9-12 zlibbioc_1.40.0
[31] XVector_0.34.0 DelayedArray_0.20.0
[33] car_3.0-12 BiocSingular_1.10.0
[35] future.apply_1.8.1 abind_1.4-5
[37] scales_1.1.1 pheatmap_1.0.12
[39] DBI_1.1.2 rstatix_0.7.0
[41] Rcpp_1.0.8 proxy_0.4-26
[43] rsvd_1.0.5 lava_1.6.10
[45] prodlim_2019.11.13 httr_1.4.2
[47] RColorBrewer_1.1-2 ellipsis_0.3.2
[49] farver_2.1.0 pkgconfig_2.0.3
[51] nnet_7.3-17 sass_0.4.0
[53] dbplyr_2.1.1 utf8_1.2.2
[55] labeling_0.4.2 tidyselect_1.1.1
[57] rlang_1.0.0 reshape2_1.4.4
[59] later_1.3.0 munsell_0.5.0
[61] cellranger_1.1.0 tools_4.1.2
[63] cli_3.1.1 generics_0.1.2
[65] ggridges_0.5.3 broom_0.7.12
[67] evaluate_0.14 fastmap_1.1.0
[69] yaml_2.2.2 ModelMetrics_1.2.2.2
[71] processx_3.5.2 knitr_1.37
[73] fs_1.5.2 future_1.23.0
[75] nlme_3.1-155 sparseMatrixStats_1.6.0
[77] whisker_0.4 xml2_1.3.3
[79] compiler_4.1.2 rstudioapi_0.13
[81] beeswarm_0.4.0 e1071_1.7-9
[83] ggsignif_0.6.3 reprex_2.0.1
[85] bslib_0.3.1 stringi_1.7.6
[87] highr_0.9 ps_1.6.0
[89] Matrix_1.4-0 vctrs_0.3.8
[91] pillar_1.7.0 lifecycle_1.0.1
[93] jquerylib_0.1.4 BiocNeighbors_1.12.0
[95] cowplot_1.1.1 data.table_1.14.2
[97] bitops_1.0-7 irlba_2.3.5
[99] httpuv_1.6.5 R6_2.5.1
[101] promises_1.2.0.1 gridExtra_2.3
[103] vipor_0.4.5 parallelly_1.30.0
[105] codetools_0.2-18 MASS_7.3-55
[107] assertthat_0.2.1 rprojroot_2.0.2
[109] withr_2.4.3 GenomeInfoDbData_1.2.7
[111] hms_1.1.1 grid_4.1.2
[113] rpart_4.1.16 beachmat_2.10.0
[115] timeDate_3043.102 class_7.3-20
[117] rmarkdown_2.11 DelayedMatrixStats_1.16.0
[119] carData_3.0-5 git2r_0.29.0
[121] getPass_0.2-2 pROC_1.18.0
[123] lubridate_1.8.0 ggbeeswarm_0.6.0