Last updated: 2022-02-22
Checks: 7 passed, 0 failed
Knit directory: MelanomaIMC/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200728) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version d246c15. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: Table_S4.csv
Ignored: code/.DS_Store
Ignored: code/._.DS_Store
Ignored: data/.DS_Store
Ignored: data/._.DS_Store
Ignored: data/data_for_analysis/
Ignored: data/full_data/
Unstaged changes:
Modified: .gitignore
Modified: analysis/Supp-Figure_10.rmd
Modified: analysis/_site.yml
Deleted: analysis/license.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/04_1_RNA_celltype_classification.rmd) and HTML (docs/04_1_RNA_celltype_classification.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 73aa800 | toobiwankenobi | 2022-02-22 | add .html for static website |
Rmd | f9a3a83 | toobiwankenobi | 2022-02-08 | clean repo for release |
Rmd | d6a945a | toobiwankenobi | 2021-12-06 | updated figures |
html | 4109ff1 | toobiwankenobi | 2021-07-07 | delete html files and adapt gitignore |
Rmd | 3203891 | toobiwankenobi | 2021-02-19 | change celltype names |
html | 3203891 | toobiwankenobi | 2021-02-19 | change celltype names |
Rmd | ee1595d | toobiwankenobi | 2021-02-12 | clean repo and adapt files |
html | ee1595d | toobiwankenobi | 2021-02-12 | clean repo and adapt files |
html | 3f5af3f | toobiwankenobi | 2021-02-09 | add .html files |
Rmd | f9bb33a | toobiwankenobi | 2021-02-04 | new Figure 5 and minor changes in figure order |
Rmd | 2ac1833 | toobiwankenobi | 2021-01-08 | changes to Figures |
Rmd | 9442cb9 | toobiwankenobi | 2020-12-22 | add all new files |
Rmd | d8819f2 | toobiwankenobi | 2020-10-08 | read new data (nuclei expansion) and adapt scripts |
Rmd | a21c858 | toobiwankenobi | 2020-08-06 | adapt pipeline |
Rmd | 941155c | toobiwankenobi | 2020-08-05 | Update 04_1_RNA_celltype_classification.rmd |
Rmd | 2c11d5c | toobiwankenobi | 2020-08-05 | add new scripts |
Rmd | 512d944 | toobiwankenobi | 2020-07-30 | update |
Rmd | 054e957 | toobiwankenobi | 2020-07-30 | switch to Nils’s new pipeline |
Rmd | ceedd46 | toobiwankenobi | 2020-07-28 | update scripts |
Rmd | cf46cfa | toobiwankenobi | 2020-07-28 | create files |
This script performs cell-type classification based on manually labelled cells. We will build up the cell-type labels in layers of increasing complexity.
knitr::opts_chunk$set(echo = TRUE, message= FALSE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
First, we will read in the SingleCellExperiment object and load all libraries.
library(caret)
library(scater)
library(tidyverse)
library(dittoSeq)
library(viridis)
library(doParallel)
library(ggpubr)
library(randomForest)
sce <- readRDS("data/data_for_analysis/sce_RNA.rds")
# load all subsetted SCE objects from hierarchical gating and combine them
label.files <- list.files("data/data_for_analysis/rna/celltype_classifier/", full.names = TRUE)
# Read in SCE objects
cur_sces <- lapply(label.files, readRDS)
# Merge SCE objects
# Due to relabelling, we now need to match the colData entries and remove rowData
cur_entries <- names(colData(cur_sces[[1]]))
cur_sces <- lapply(cur_sces, function(x){
colData(x) <- colData(x)[,cur_entries]
rowData(x) <- NA
return(x)
})
labelled_sce <- do.call("cbind", cur_sces)
# add rowData
rowData(labelled_sce) <- rowData(sce)
# how many duplicates do we have?
ncol(labelled_sce[,duplicated(labelled_sce$cellID) == T]) / ncol(labelled_sce[,duplicated(labelled_sce$cellID) == F]) * 100
[1] 0.09870831
# remove duplicates (more than 1 label per cellID)
unique_labels <- labelled_sce[,duplicated(labelled_sce$cellID) == F]
label_vector <- rep("unlabelled", ncol(sce))
names(label_vector) <- colnames(sce)
label_vector[colnames(unique_labels)] <- unique_labels$cytomapper_CellLabel
# unique cell labels
unique(label_vector)
[1] "unlabelled" "Vasculature" "Tumor"
[4] "CD38" "HLA-DR" "CD134- Tcell"
[7] "Tcytotoxic" "CD134+ Tcell" "Macrophage"
[10] "Neutrophil" "Stroma" "unknown"
[13] "exhausted Tcytotoxic"
# add to sce
colData(sce)$layer_1_gated <- label_vector
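Before discarding the duplicated cellIDs above, it can be worth checking whether any of them received conflicting labels during gating. A minimal sketch of such a check (not part of the original pipeline; it only uses the cellID and cytomapper_CellLabel columns already referenced above):
# Hypothetical sanity check: count cellIDs that carry more than one distinct label
as_tibble(colData(labelled_sce)) %>%
  group_by(cellID) %>%
  summarise(n_labels = n_distinct(cytomapper_CellLabel)) %>%
  filter(n_labels > 1) %>%
  nrow()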
Here, we will define a colour vector for the cell-types contained in layer 1.
layer1_colours <- vector(length = length(unique(label_vector)))
names(layer1_colours) <- unique(label_vector)
layer1_colours["CD38"] <- "goldenrod2"
layer1_colours["HLA-DR"] <- "green1"
layer1_colours["Macrophage"] <- "greenyellow"
layer1_colours["Neutrophil"] <- "blue1"
layer1_colours["CD134- Tcell"] <- "yellow"
layer1_colours["CD134+ Tcell"] <- "sienna1"
layer1_colours["Vasculature"] <- "red2"
layer1_colours["Stroma"] <- "tomato"
layer1_colours["Tcytotoxic"] <- "deepskyblue"
layer1_colours["exhausted Tcytotoxic"] <- "deeppink"
layer1_colours["Tumor"] <- "sienna4"
layer1_colours["unknown"] <- "darkgray"
layer1_colours["unlabelled"] <- "gray"
# Save in SCE object
metadata(sce)$colour_vectors$layer_1 <- layer1_colours
In the next step, we will check the quality of the labels by quantifying how many cells and how many images are labelled, checking how balanced the classes are, and inspecting marker expression and UMAP location per label.
First, we will check how many cells and how many images are labelled.
# 2. How many cells of how many images are labelled
# Percent cells labelled
as_tibble(colData(sce)) %>%
summarise(labelled_cells = sum(layer_1_gated != "unlabelled")/n()) * 100
labelled_cells
1 15.00411
# Percent images labelled
as_tibble(colData(sce)) %>%
group_by(ImageNumber) %>%
summarise(labelled_cells = sum(layer_1_gated != "unlabelled")) %>%
ungroup() %>%
summarise(labelled_images = sum(labelled_cells != 0)/n()) * 100
labelled_images
1 45.78313
# Fraction of cells labelled per image
as_tibble(colData(sce)) %>%
group_by(ImageNumber) %>%
summarise(labelled_cells = sum(layer_1_gated != "unlabelled")/n(),
number_cells = n()) %>%
as.data.frame()
ImageNumber labelled_cells number_cells
1 1 0.0608424337 1282
2 2 0.0000000000 1259
3 3 0.6302583026 5420
4 4 0.6509507179 10308
5 5 0.4851955911 4627
6 6 0.0000000000 1121
7 7 0.4774311927 5450
8 8 0.0000000000 1713
9 9 0.0000000000 2343
10 10 0.0000000000 5227
11 11 0.0857067857 7619
12 12 0.2250061713 8102
13 13 0.1802919708 4110
14 14 0.4633860574 6828
15 15 0.0000000000 3612
16 16 0.0000000000 6202
17 17 0.2685050798 2067
18 18 0.7754577092 6609
19 19 0.5932611312 4986
20 20 0.2851585877 6684
21 21 0.0009721952 5143
22 22 0.1486738887 5354
23 23 0.0000000000 3918
24 24 0.0000000000 6890
25 25 0.0000000000 7950
26 26 0.0000000000 5697
27 27 0.0000000000 5655
28 28 0.0000000000 7343
29 29 0.0000000000 5738
30 30 0.0433778858 8230
31 31 0.0253794266 11860
32 32 0.6440576230 1666
33 33 0.0477107720 6749
34 34 0.0000000000 5818
35 35 0.0000000000 5612
36 36 0.0000000000 4242
37 37 0.2052594003 8404
38 38 0.0000000000 522
39 39 0.3264580370 7030
40 40 0.0687802961 7633
41 41 0.3979097629 3923
42 42 0.3447955390 3228
43 43 0.5761103924 4638
44 44 0.0000000000 5233
45 45 0.0000000000 8065
46 46 0.5218669607 2721
47 47 0.0000000000 1183
48 48 0.0001741857 5741
49 49 0.1722336257 5359
50 50 0.0000000000 1423
51 51 0.6531093126 9761
52 52 0.0000000000 6188
53 53 0.0000000000 7246
54 54 0.4339076079 6559
55 55 0.4261586803 5092
56 56 0.0000000000 1969
57 57 0.0000000000 136
58 58 0.6358640082 7824
59 59 0.0100279851 4288
60 60 0.7352537723 5103
61 61 0.0000000000 6871
62 62 0.0000000000 9609
63 63 0.0000000000 3733
64 64 0.0000000000 5643
65 65 0.2584729256 5134
66 66 0.0000000000 3311
67 67 0.0203687822 4664
68 68 0.8311225500 5051
69 69 0.0000000000 3108
70 70 0.0000000000 4269
71 71 0.0000000000 768
72 72 0.4892120521 10521
73 73 0.4762782900 5965
74 74 0.0000000000 6471
75 75 0.0372771475 4319
76 76 0.0000000000 6186
77 77 0.0000000000 3442
78 78 0.0000000000 6356
79 79 0.0000000000 3308
80 80 0.0042877561 6297
81 81 0.0000000000 7091
82 82 0.0000000000 6886
83 83 0.0000000000 1044
84 84 0.8013168724 6075
85 85 0.0000000000 5666
86 86 0.0521910389 2031
87 87 0.0819985656 4183
88 88 0.0000000000 8960
89 89 0.0000000000 6809
90 90 0.2976159627 7718
91 91 0.0000000000 3523
92 92 0.0000000000 2111
93 93 0.0000000000 4524
94 94 0.9257018768 6447
95 95 0.0000000000 3298
96 96 0.0000000000 1660
97 97 0.0000000000 1715
98 98 0.0000000000 3352
99 99 0.4570678337 11425
100 100 0.0000000000 3155
101 101 0.0000000000 7240
102 102 0.0381465013 4273
103 103 0.1079253449 7394
104 104 0.3565464262 5988
105 105 0.0587593233 5497
106 106 0.4434270765 4587
107 107 0.0000000000 1529
108 108 0.0834852351 5486
109 109 0.0803448707 8003
110 110 0.0000000000 6358
111 111 0.3372817194 8189
112 112 0.0000000000 5278
113 113 0.0000000000 7471
114 114 0.1358318891 4616
115 115 0.1058166989 4659
116 116 0.0000000000 5665
117 117 0.0000000000 5472
118 118 0.0000000000 3151
119 119 0.2458682939 7866
120 120 0.0000000000 6598
121 121 0.0000000000 5310
122 122 0.0819408740 3112
123 123 0.0222222222 7065
124 124 0.0000000000 4971
125 125 0.0000000000 3663
126 126 0.0000000000 8504
127 127 0.0058910162 6790
128 128 0.5924764890 5742
129 129 0.6383304614 5223
130 130 0.0000000000 727
131 131 0.1086138513 9833
132 132 0.0000000000 7723
133 133 0.0000000000 3655
134 134 0.0942982456 5472
135 135 0.0000000000 9404
136 136 0.0000000000 3313
137 137 0.0000000000 7845
138 138 0.0000000000 4145
139 139 0.0537383178 6420
140 140 0.0127215552 6996
141 141 0.4080804517 5668
142 142 0.0000000000 5932
143 143 0.0000000000 729
144 144 0.1777088773 6128
145 145 0.0000000000 3713
146 146 0.0000000000 896
147 147 0.0000000000 4532
148 148 0.0000000000 761
149 149 0.0000000000 2295
150 150 0.0100050025 5997
151 151 0.0303030303 10494
152 152 0.0528153645 4582
153 153 0.0000000000 4204
154 154 0.6292493726 4383
155 155 0.0000000000 6009
156 156 0.0000000000 5227
157 157 0.0000000000 502
158 158 0.0000000000 3326
159 159 0.0000000000 5323
160 160 0.2551888397 2939
161 161 0.0000000000 4527
162 162 0.0767246937 7755
163 163 0.0000000000 6801
164 164 0.0000000000 4910
165 165 0.0315228967 4695
166 166 0.0000000000 5300
We will check how balanced the classes are across the images.
# Total cells per class
as_tibble(colData(sce)) %>%
group_by(layer_1_gated) %>%
summarise(number_cells = n())
# A tibble: 13 × 2
layer_1_gated number_cells
<chr> <int>
1 CD134- Tcell 20610
2 CD134+ Tcell 806
3 CD38 4440
4 exhausted Tcytotoxic 553
5 HLA-DR 9407
6 Macrophage 7948
7 Neutrophil 3264
8 Stroma 1560
9 Tcytotoxic 4662
10 Tumor 74661
11 unknown 72
12 unlabelled 734588
13 Vasculature 1692
# Total cells per class and Sample
as_tibble(colData(sce)) %>%
group_by(layer_1_gated, ImageNumber) %>%
summarise(number_cells = n()) %>%
as.data.frame() %>%
head(.)
layer_1_gated ImageNumber number_cells
1 CD134- Tcell 4 3995
2 CD134- Tcell 5 759
3 CD134- Tcell 7 512
4 CD134- Tcell 12 1002
5 CD134- Tcell 37 567
6 CD134- Tcell 39 1001
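The classes are strongly imbalanced (compare Tumor with exhausted Tcytotoxic or unknown above). This is not corrected for in this analysis, but if it became a problem for the classifier, one option would be to downsample the majority classes within each resampling iteration. A minimal sketch using caret's sampling option (an alternative we did not use):
# Hypothetical alternative trainControl that downsamples majority classes
# within each cross-validation fold
fitControl_down <- trainControl(method = "repeatedcv",
                                number = 10,
                                repeats = 1,
                                sampling = "down")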
Now, we will check the expression of selected markers across the classes and visualize cell labels on UMAP.
lab_sce <- sce[,sce$layer_1_gated != "unlabelled"]
agr_sce <- aggregateAcrossCells(lab_sce, ids = colData(lab_sce)[,c("ImageNumber", "layer_1_gated")],
average = TRUE)
Warning: 'average=' is deprecated, use 'statistics=' instead
assay(agr_sce, "asinh") <- asinh(counts(agr_sce))
assay(agr_sce, "scaled_asinh") <- t(scale(t(asinh(counts(agr_sce)))))
colnames(agr_sce) <- paste0(agr_sce$ImageNumber, "_", agr_sce$layer_1_gated)
# Define markers that were used for gating
cur_markers <- c("SMA", "CK5","CD38","HLADR","S100","Cadherin11","FAP", "CD134", "CD68",
"CD3", "Lag3", "PD1", "CD8", "SOX10", "CD31", "Mart1", "pRB", "CD15", "MPO",
"CD163")
# Non-scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "asinh",
cells.use = colnames(agr_sce[cur_markers,]),
annot.by = c("ImageNumber", "layer_1_gated"),
order.by = "layer_1_gated", cluster_rows = FALSE,
scale = "none", heatmap.colors = viridis(100),
annotation_colors = list(layer_1_gated = metadata(sce)$colour_vectors$layer_1))
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
# Centered and scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "scaled_asinh",
annot.by = c("ImageNumber", "layer_1_gated"),
order.by = "layer_1_gated", cluster_rows = FALSE,
annotation_colors = list(layer_1_gated = metadata(sce)$colour_vectors$layer_1),
heatmap.colors = colorRampPalette(c("dark blue", "white", "dark red"))(100),
breaks = seq(-3, 3, length.out = 101))
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
unlab_sce <- sce[,sce$layer_1_gated == "unlabelled"]
ggplot() +
geom_point(aes(x = UMAP1, y = UMAP2, colour = layer_1_gated),
data = data.frame(UMAP1 = reducedDim(unlab_sce, "UMAP")[,1],
UMAP2 = reducedDim(unlab_sce, "UMAP")[,2],
layer_1_gated = colData(unlab_sce)$layer_1_gated)) +
geom_point(aes(x = UMAP1, y = UMAP2, colour = layer_1_gated), size = 0.5,
data = data.frame(UMAP1 = reducedDim(lab_sce, "UMAP")[,1],
UMAP2 = reducedDim(lab_sce, "UMAP")[,2],
layer_1_gated = colData(lab_sce)$layer_1_gated)) +
scale_color_manual(values = metadata(sce)$colour_vectors$layer_1) +
theme_bw()
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
After quality control, we will now use a random forest classifier to classify the remaining cells in the dataset.
In the first instance, we will split the labelled data based on their cell types and ignore which images the cells come from. In the current setting most images have been labelled, but in the future we want to take a closer look at how well cells from non-labelled images are classified (a sketch of such an image-wise split follows the code below).
We will first split the labelled data into training and test (validation) sets at a 70/30 train/test ratio.
set.seed(1234)
trainIndex <- createDataPartition(factor(lab_sce$layer_1_gated), p = 0.70)
train_sce <- lab_sce[,trainIndex$Resample1]
test_sce <- lab_sce[,-trainIndex$Resample1]
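As a side note on the image-wise evaluation mentioned above, whole images could be held out so that validation cells always come from images unseen during training. A minimal sketch using caret::groupKFold (an alternative split, not used in this analysis):
# Hypothetical image-aware resampling: each fold holds out complete images
image_folds <- groupKFold(lab_sce$ImageNumber, k = 5)
fitControl_grouped <- trainControl(method = "cv", index = image_folds)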
Here, we will use a 10-fold cross-validation, partitioning the data randomly across the full dataset (a single repeat, see repeats = 1 below). We will also use parallel processing to save time. For the randomForest classifier, we need to tune the mtry parameter, i.e. the number of variables randomly sampled as candidates at each split.
# Define seeds for parallel processing
# Per iteration, we evaluate 10 models while tuning mtry
set.seed(222)
seeds <- vector(mode = "list", length = 11)
for (i in 1:10) {
seeds[[i]] <- sample.int(5000, 10)
}
seeds[[11]] <- sample.int(5000, 1)
fitControl <- trainControl(method = "repeatedcv",
repeats = 1,
number = 10,
seeds = seeds)
cl <- makePSOCKcluster(round(detectCores()/2.5,0), outfile = "")
registerDoParallel(cl)
set.seed(1234)
start = Sys.time()
rffit <- train(x = t(assay(train_sce, "asinh")[rowData(sce)$good_marker,]),
y = factor(train_sce$layer_1_gated),
method = "rf", ntree = 500,
tuneLength = 10,
trControl = fitControl,
allowParallel = TRUE)
stopCluster(cl)
end = Sys.time()
print(end-start)
Time difference of 37.06826 mins
rffit
Random Forest
90777 samples
33 predictor
12 classes: 'CD134- Tcell', 'CD134+ Tcell', 'CD38', 'exhausted Tcytotoxic', 'HLA-DR', 'Macrophage', 'Neutrophil', 'Stroma', 'Tcytotoxic', 'Tumor', 'unknown', 'Vasculature'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 81700, 81703, 81700, 81699, 81698, 81699, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9845665 0.9754440
5 0.9918592 0.9870778
8 0.9930269 0.9889361
12 0.9931921 0.9891999
15 0.9931260 0.9890953
19 0.9927184 0.9884506
22 0.9923109 0.9878048
26 0.9917050 0.9868419
29 0.9913304 0.9862476
33 0.9907135 0.9852693
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 12.
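Above, tuneLength = 10 lets caret choose ten mtry values automatically. An explicit grid could be supplied instead via the tuneGrid argument of train; a minimal sketch using roughly the values evaluated above:
# Hypothetical explicit tuning grid, equivalent to tuneLength = 10 above;
# it would be passed to train() as tuneGrid = rf_grid instead of tuneLength
rf_grid <- expand.grid(mtry = c(2, 5, 8, 12, 15, 19, 22, 26, 29, 33))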
We will now have a look at the accuracy measures across the tuned values. The only parameter that has been tuned is mtry.
ggplot(rffit) +
geom_errorbar(data = rffit$results,
aes(ymin = Accuracy - AccuracySD,
ymax = Accuracy + AccuracySD),
width = 0.4)
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
We can also compute the confusion matrix:
confusionMatrix(rffit)
Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction CD134- Tcell CD134+ Tcell CD38 exhausted Tcytotoxic
CD134- Tcell 15.8 0.0 0.0 0.0
CD134+ Tcell 0.0 0.6 0.0 0.0
CD38 0.0 0.0 3.4 0.0
exhausted Tcytotoxic 0.0 0.0 0.0 0.4
HLA-DR 0.0 0.0 0.0 0.0
Macrophage 0.0 0.0 0.0 0.0
Neutrophil 0.0 0.0 0.0 0.0
Stroma 0.0 0.0 0.0 0.0
Tcytotoxic 0.0 0.0 0.0 0.0
Tumor 0.0 0.0 0.0 0.0
unknown 0.0 0.0 0.0 0.0
Vasculature 0.0 0.0 0.0 0.0
Reference
Prediction HLA-DR Macrophage Neutrophil Stroma Tcytotoxic Tumor
CD134- Tcell 0.0 0.0 0.0 0.0 0.0 0.0
CD134+ Tcell 0.0 0.0 0.0 0.0 0.0 0.0
CD38 0.0 0.0 0.0 0.0 0.0 0.0
exhausted Tcytotoxic 0.0 0.0 0.0 0.0 0.0 0.0
HLA-DR 7.2 0.0 0.0 0.1 0.0 0.0
Macrophage 0.0 6.1 0.0 0.0 0.0 0.0
Neutrophil 0.0 0.0 2.5 0.0 0.0 0.0
Stroma 0.0 0.0 0.0 1.0 0.0 0.0
Tcytotoxic 0.0 0.0 0.0 0.0 3.5 0.0
Tumor 0.0 0.0 0.0 0.1 0.0 57.5
unknown 0.0 0.0 0.0 0.0 0.0 0.0
Vasculature 0.0 0.0 0.0 0.0 0.0 0.0
Reference
Prediction unknown Vasculature
CD134- Tcell 0.0 0.0
CD134+ Tcell 0.0 0.0
CD38 0.0 0.0
exhausted Tcytotoxic 0.0 0.0
HLA-DR 0.0 0.0
Macrophage 0.0 0.0
Neutrophil 0.0 0.0
Stroma 0.0 0.0
Tcytotoxic 0.0 0.0
Tumor 0.0 0.0
unknown 0.0 0.0
Vasculature 0.0 1.2
Accuracy (average) : 0.9932
We will also look at the variable importance.
cur_varImp <- varImp(rffit)
plot(cur_varImp, top = 34)
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
Finally, we will validate the model using the test data.
cur_pred <- predict(rffit, newdata = t(assay(test_sce, "asinh")[rowData(sce)$good_marker,]))
cm <- confusionMatrix(data = cur_pred, reference = factor(test_sce$layer_1_gated))
cm
Confusion Matrix and Statistics
Reference
Prediction CD134- Tcell CD134+ Tcell CD38 exhausted Tcytotoxic
CD134- Tcell 6164 6 10 1
CD134+ Tcell 2 234 0 0
CD38 4 0 1310 0
exhausted Tcytotoxic 0 0 0 160
HLA-DR 0 0 1 0
Macrophage 0 1 3 0
Neutrophil 0 0 0 0
Stroma 1 0 3 0
Tcytotoxic 12 0 5 4
Tumor 0 0 0 0
unknown 0 0 0 0
Vasculature 0 0 0 0
Reference
Prediction HLA-DR Macrophage Neutrophil Stroma Tcytotoxic Tumor
CD134- Tcell 0 0 0 5 12 0
CD134+ Tcell 0 0 0 0 1 0
CD38 0 2 0 9 1 0
exhausted Tcytotoxic 0 0 0 0 3 0
HLA-DR 2796 0 0 25 0 11
Macrophage 0 2365 0 19 0 5
Neutrophil 0 9 978 0 0 0
Stroma 11 4 1 386 0 3
Tcytotoxic 0 0 0 1 1381 0
Tumor 5 4 0 22 0 22368
unknown 0 0 0 0 0 0
Vasculature 10 0 0 1 0 11
Reference
Prediction unknown Vasculature
CD134- Tcell 0 3
CD134+ Tcell 0 0
CD38 0 1
exhausted Tcytotoxic 0 0
HLA-DR 0 9
Macrophage 0 0
Neutrophil 0 0
Stroma 2 0
Tcytotoxic 0 1
Tumor 3 5
unknown 16 0
Vasculature 0 488
Overall Statistics
Accuracy : 0.9935
95% CI : (0.9927, 0.9943)
No Information Rate : 0.5758
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9897
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: CD134- Tcell Class: CD134+ Tcell Class: CD38
Sensitivity 0.9969 0.970954 0.98348
Specificity 0.9989 0.999922 0.99955
Pos Pred Value 0.9940 0.987342 0.98719
Neg Pred Value 0.9994 0.999819 0.99941
Prevalence 0.1590 0.006196 0.03424
Detection Rate 0.1585 0.006016 0.03368
Detection Prevalence 0.1594 0.006093 0.03411
Balanced Accuracy 0.9979 0.985438 0.99152
Class: exhausted Tcytotoxic Class: HLA-DR
Sensitivity 0.969697 0.99079
Specificity 0.999923 0.99872
Pos Pred Value 0.981595 0.98381
Neg Pred Value 0.999871 0.99928
Prevalence 0.004242 0.07255
Detection Rate 0.004113 0.07188
Detection Prevalence 0.004190 0.07306
Balanced Accuracy 0.984810 0.99476
Class: Macrophage Class: Neutrophil Class: Stroma
Sensitivity 0.99203 0.99898 0.824786
Specificity 0.99923 0.99976 0.999349
Pos Pred Value 0.98830 0.99088 0.939173
Neg Pred Value 0.99948 0.99997 0.997869
Prevalence 0.06129 0.02517 0.012031
Detection Rate 0.06080 0.02514 0.009923
Detection Prevalence 0.06152 0.02537 0.010566
Balanced Accuracy 0.99563 0.99937 0.912068
Class: Tcytotoxic Class: Tumor Class: unknown
Sensitivity 0.98784 0.9987 0.7619048
Specificity 0.99939 0.9976 1.0000000
Pos Pred Value 0.98362 0.9983 1.0000000
Neg Pred Value 0.99955 0.9982 0.9998714
Prevalence 0.03594 0.5758 0.0005399
Detection Rate 0.03550 0.5750 0.0004113
Detection Prevalence 0.03609 0.5760 0.0004113
Balanced Accuracy 0.99361 0.9981 0.8809524
Class: Vasculature
Sensitivity 0.96252
Specificity 0.99943
Pos Pred Value 0.95686
Neg Pred Value 0.99951
Prevalence 0.01303
Detection Rate 0.01255
Detection Prevalence 0.01311
Balanced Accuracy 0.98098
a <- data.frame(cm$byClass) %>%
mutate(class = sub("Class: ", "", rownames(cm$byClass))) %>%
ggplot() +
geom_point(aes(1 - Specificity, Sensitivity,
size = Detection.Rate,
fill = class),
shape = 21) +
scale_fill_manual(values = metadata(sce)$colour_vectors$layer_1) +
theme_bw() +
theme(text=element_text(size=12)) +
ylab("Sensitivity (TPR)") +
xlab("1 - Specificity (FPR)")
legend <- get_legend(a)
a + theme(legend.position = "none")
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
plot(legend)
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
We will also observe the distribution of classification probabilities per image and class:
cur_pred <- predict(rffit, newdata = t(assay(test_sce, "asinh")[rowData(sce)$good_marker,]),
type = "prob")
cur_pred %>%
mutate(class = test_sce$layer_1_gated,
image = test_sce$ImageNumber) %>%
reshape2::melt(id.vars = c("class", "image"), variable.name = "celltype", value.name = "probability") %>%
filter(class == celltype) %>%
ggplot() +
geom_boxplot(aes(interaction(image), probability), outlier.size = 0.5) +
facet_wrap(. ~ class) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
The boxplots show the distribution of the true-class classification probability for each image and class; the centre line of each box marks the median probability.
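The same information can also be summarised numerically; a minimal sketch (reusing cur_pred and test_sce from above) that computes the median true-class probability per image and class:
# Hypothetical numeric counterpart to the boxplots: median true-class
# probability per image and class
cur_pred %>%
  mutate(class = test_sce$layer_1_gated,
         image = test_sce$ImageNumber) %>%
  reshape2::melt(id.vars = c("class", "image"),
                 variable.name = "celltype", value.name = "probability") %>%
  filter(class == celltype) %>%
  group_by(class, image) %>%
  summarise(median_prob = median(probability), .groups = "drop")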
Finally, we will predict the labels of all other cells. For cell-type classification, we will use the method that was trained across all images.
start = Sys.time()
cell_labels.class <- as.character(predict.train(rffit,
newdata = t(assay(unlab_sce[rowData(sce)$good_marker,], "asinh")),
type = "raw"))
cell_labels.prob <- predict.train(rffit,
newdata = t(assay(unlab_sce[rowData(sce)$good_marker,], "asinh")),
type = "prob")
end = Sys.time()
print(end-start)
Time difference of 59.39312 secs
Store predictions in SCE object. We will not overwrite the labels of the already labelled cells.
cell_labels <- sce$layer_1_gated
cell_labels[colnames(unlab_sce)] <- cell_labels.class
sce$celltype <- cell_labels
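In addition to the class label, one could keep each classified cell's winning-class probability for later quality filtering. A minimal sketch (the column name celltype_prob is our own and is not used downstream):
# Hypothetical addition: store the maximum classification probability per cell
max_prob <- rep(NA_real_, ncol(sce))
names(max_prob) <- colnames(sce)
max_prob[colnames(unlab_sce)] <- apply(cell_labels.prob, 1, max)
sce$celltype_prob <- max_prob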
Here, we will visualize the predicted cell-types and their associated classification probabilities.
dittoDimPlot(sce[,sce$layer_1_gated != "unlabelled"], var = "celltype", reduction.use = "UMAP", size = 0.5,
color.panel = metadata(sce)$colour_vectors$layer_1, main = "Cell types gated")
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
dittoDimPlot(sce[,sce$layer_1_gated == "unlabelled"], var = "celltype", reduction.use = "UMAP", size = 0.5,
color.panel = metadata(sce)$colour_vectors$layer_1, main = "Cell types classified")
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
for (i in unique(cell_labels.class)) {
  cur_df <- data.frame(UMAP1 = reducedDim(unlab_sce, "UMAP")[,1],
                       UMAP2 = reducedDim(unlab_sce, "UMAP")[,2],
                       prob = cell_labels.prob[,i],
                       class = cell_labels.class == i)
  # ggplot objects are not auto-printed inside a for loop; print() is needed
  # for the per-class probability UMAPs to actually be displayed
  p <- ggplot() +
    geom_point(aes(UMAP1, UMAP2), data = cur_df[!cur_df$class,],
               color = "gray") +
    geom_point(aes(UMAP1, UMAP2, color = prob), data = cur_df[cur_df$class,],
               size = 0.5) +
    scale_colour_viridis(name = paste0(i, " probability"))
  print(p)
}
Finally, we will visualize the marker expression per cell type using the classified cells.
unlab_sce <- sce[,sce$layer_1_gated == "unlabelled"]
agr_sce <- aggregateAcrossCells(unlab_sce, ids = colData(unlab_sce)[,c("ImageNumber", "celltype")],
average = TRUE)
Warning: 'average=' is deprecated, use 'statistics=' instead
assay(agr_sce, "asinh") <- asinh(counts(agr_sce))
colnames(agr_sce) <- paste0(agr_sce$ImageNumber, "_", agr_sce$celltype)
# Non-scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "asinh",
annot.by = c("celltype"),
order.by = "celltype", cluster_rows = FALSE,
scale = "none", heatmap.colors = viridis(100),
annotation_colors = list(celltype = metadata(sce)$colour_vectors$layer_1))
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
# Centered and scaled
dittoHeatmap(agr_sce[cur_markers,], assay = "asinh",
annot.by = c("celltype"),
cluster_rows = FALSE,
annotation_colors = list(celltype = metadata(sce)$colour_vectors$layer_1),
heatmap.colors = colorRampPalette(c("dark blue", "white", "dark red"))(100),
breaks = seq(-3, 3, length.out = 101))
(Figure; version 1dc2e93 by toobiwankenobi, 2022-02-22)
saveRDS(sce, "data/data_for_analysis/sce_RNA.rds")
# create data frame with class and probabilities and save as csv.
layer_1_dat <- as.data.frame(cell_labels.prob)
layer_1_dat$class <- cell_labels.class
write.csv(layer_1_dat, file = "data/data_for_analysis/layer_1_classification_rna.csv")
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] randomForest_4.6-14 ggpubr_0.4.0
[3] doParallel_1.0.16 iterators_1.0.13
[5] foreach_1.5.2 viridis_0.6.2
[7] viridisLite_0.4.0 dittoSeq_1.6.0
[9] forcats_0.5.1 stringr_1.4.0
[11] dplyr_1.0.7 purrr_0.3.4
[13] readr_2.1.2 tidyr_1.2.0
[15] tibble_3.1.6 tidyverse_1.3.1
[17] scater_1.22.0 scuttle_1.4.0
[19] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
[21] Biobase_2.54.0 GenomicRanges_1.46.1
[23] GenomeInfoDb_1.30.1 IRanges_2.28.0
[25] S4Vectors_0.32.3 BiocGenerics_0.40.0
[27] MatrixGenerics_1.6.0 matrixStats_0.61.0
[29] caret_6.0-90 lattice_0.20-45
[31] ggplot2_3.3.5 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.4.1
[3] plyr_1.8.6 splines_4.1.2
[5] BiocParallel_1.28.3 listenv_0.8.0
[7] digest_0.6.29 htmltools_0.5.2
[9] fansi_1.0.2 magrittr_2.0.2
[11] ScaledMatrix_1.2.0 tzdb_0.2.0
[13] recipes_0.1.17 globals_0.14.0
[15] modelr_0.1.8 gower_0.2.2
[17] colorspace_2.0-2 rvest_1.0.2
[19] ggrepel_0.9.1 haven_2.4.3
[21] xfun_0.29 callr_3.7.0
[23] crayon_1.4.2 RCurl_1.98-1.5
[25] jsonlite_1.7.3 survival_3.2-13
[27] glue_1.6.1 gtable_0.3.0
[29] ipred_0.9-12 zlibbioc_1.40.0
[31] XVector_0.34.0 DelayedArray_0.20.0
[33] car_3.0-12 BiocSingular_1.10.0
[35] future.apply_1.8.1 abind_1.4-5
[37] scales_1.1.1 pheatmap_1.0.12
[39] DBI_1.1.2 rstatix_0.7.0
[41] Rcpp_1.0.8 proxy_0.4-26
[43] rsvd_1.0.5 lava_1.6.10
[45] prodlim_2019.11.13 httr_1.4.2
[47] RColorBrewer_1.1-2 ellipsis_0.3.2
[49] farver_2.1.0 pkgconfig_2.0.3
[51] nnet_7.3-17 sass_0.4.0
[53] dbplyr_2.1.1 utf8_1.2.2
[55] labeling_0.4.2 tidyselect_1.1.1
[57] rlang_1.0.0 reshape2_1.4.4
[59] later_1.3.0 munsell_0.5.0
[61] cellranger_1.1.0 tools_4.1.2
[63] cli_3.1.1 generics_0.1.2
[65] ggridges_0.5.3 broom_0.7.12
[67] evaluate_0.14 fastmap_1.1.0
[69] yaml_2.2.2 ModelMetrics_1.2.2.2
[71] processx_3.5.2 knitr_1.37
[73] fs_1.5.2 future_1.23.0
[75] nlme_3.1-155 sparseMatrixStats_1.6.0
[77] whisker_0.4 xml2_1.3.3
[79] compiler_4.1.2 rstudioapi_0.13
[81] beeswarm_0.4.0 e1071_1.7-9
[83] ggsignif_0.6.3 reprex_2.0.1
[85] bslib_0.3.1 stringi_1.7.6
[87] highr_0.9 ps_1.6.0
[89] Matrix_1.4-0 vctrs_0.3.8
[91] pillar_1.7.0 lifecycle_1.0.1
[93] jquerylib_0.1.4 BiocNeighbors_1.12.0
[95] cowplot_1.1.1 data.table_1.14.2
[97] bitops_1.0-7 irlba_2.3.5
[99] httpuv_1.6.5 R6_2.5.1
[101] promises_1.2.0.1 gridExtra_2.3
[103] vipor_0.4.5 parallelly_1.30.0
[105] codetools_0.2-18 MASS_7.3-55
[107] assertthat_0.2.1 rprojroot_2.0.2
[109] withr_2.4.3 GenomeInfoDbData_1.2.7
[111] hms_1.1.1 grid_4.1.2
[113] rpart_4.1.16 beachmat_2.10.0
[115] timeDate_3043.102 class_7.3-20
[117] rmarkdown_2.11 DelayedMatrixStats_1.16.0
[119] carData_3.0-5 git2r_0.29.0
[121] getPass_0.2-2 pROC_1.18.0
[123] lubridate_1.8.0 ggbeeswarm_0.6.0