vignettes/Guidelines_Contribution_Dataset-formatting.Rmd
This vignette describes the procedure to contribute new datasets to the imcdatasets package and contains guidelines for dataset formatting.
Contributions or suggestions for new imaging mass cytometry (IMC) datasets to add to the imcdatasets package are always welcome. New datasets can be suggested by opening an issue at the imcdatasets GitHub page. The only requirements are that the new dataset (i) is publicly available and (ii) has been described in a published scientific article.
Details about creating Bioconductor’s ExperimentHub packages are available here.
The first step is to create a new branch at the imcdatasets GitHub page.
Then, create an R markdown (.Rmd) script in ./inst/scripts/ to generate the data objects:
- … ExperimentHub.
- The .Rmd script must be formatted in the same way as pre-existing scripts. Examples can be found here and here.
- Each step should be clearly and comprehensively documented.

For usability of the package and consistency across datasets, the data objects must be formatted as described in the Dataset format section below.
Other files in the imcdatasets package should be updated to include the new dataset:
- A ./R/Lastname_Year_Type.R file with a function to load the new dataset and extensive documentation. Examples can be found here and here.
- The man documentation files (go to the imcdatasets directory and run roxygen2::roxygenize(".")).
- The ./inst/scripts/make-metadata.R R script; after updating it, run it. This will update the ./inst/extdata/metadata.csv file that is used by ExperimentHub to provide metadata information about datasets.
- The ./inst/scripts/ref.bib file.
- The ./inst/extdata/alldatasets.csv file.
- ./tests/testthat/test_loading.R.

After these steps have been completed, open a pull request at the imcdatasets GitHub page.
The package maintainers will do the following:
- … AWS S3 and announce the upload to Bioconductor Hubs.
- … ExperimentHub and check the format again.
- … NEWS, DESCRIPTION (add new contributor, version bump) and README.md (if needed) files.
- … imcdatasets package can be installed from there.

Contributors will be recognized by having their names added to the DESCRIPTION file of the imcdatasets package.
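The imcdatasets package is meant to provide quick and easy access to published and curated IMC datasets. Each dataset consists of three data objects that can be retrieved individually:
- Single cell data: a SingleCellExperiment object named sce.
- Multichannel images: a CytoImageList object named images.
- Cell segmentation masks: a CytoImageList object named masks.

The three data objects can be mapped using unique image_name values contained in the metadata of each object.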
For consistency across datasets, the guidelines below must be followed when creating a new dataset.
Single cell data should be formatted into a SingleCellExperiment object named sce that contains the following slots:
- colData: observation metadata.
- rowData: marker metadata.
- assays: marker expression levels per cell.
- colPairs (optional): neighborhood information.

The colData entry of the SingleCellExperiment object is a DataFrame that contains observation metadata (e.g., for cells, slides, tissue, patients). It is recommended that all column names have a prefix that indicates the level of observation (e.g., cell_, slide_, tissue_, patient_, tumor_).
The following columns are required:
- image_name and/or image_number: unique image (ablated ROI) name or number, respectively. Should map to the image_name/image_number column(s) in the metadata of the images and masks objects.
- cell_number: integer representing cell numbers. Should map to the values of cell segmentation masks.
- cell_id: a unique cell identifier defined as {image_number}_{cell_number} (e.g., 7_138).
- cell_x and cell_y: position of the cell centroid on the image. These columns are used as SpatialCoords when converting to a SpatialExperiment object.

In addition, colnames(sce) should be set as colData(sce)$cell_id.
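The required columns can be illustrated with a minimal base R sketch. A plain data.frame stands in for colData(sce) here, and all values (image names, coordinates) are hypothetical; the same checks would be applied to the real DataFrame.

```r
# Toy stand-in for colData(sce); all values are hypothetical.
cell_metadata <- data.frame(
  image_name   = c("A01", "A01", "B02"),
  image_number = c(7L, 7L, 8L),
  cell_number  = c(138L, 139L, 1L),
  cell_x       = c(10.5, 42.0, 3.2),
  cell_y       = c(20.1, 17.8, 9.9)
)

# cell_id is defined as {image_number}_{cell_number}, e.g. "7_138".
cell_metadata$cell_id <- paste(
  cell_metadata$image_number, cell_metadata$cell_number, sep = "_"
)

# Check the required columns; image_name and/or image_number must be present.
required_cols <- c("cell_number", "cell_id", "cell_x", "cell_y")
missing_cols  <- setdiff(required_cols, colnames(cell_metadata))
has_image_key <- any(c("image_name", "image_number") %in% colnames(cell_metadata))
stopifnot(
  length(missing_cols) == 0,
  has_image_key,
  !anyDuplicated(cell_metadata$cell_id)  # cell_id must be unique
)

# Mirrors colnames(sce) <- colData(sce)$cell_id on the toy table.
rownames(cell_metadata) <- cell_metadata$cell_id
```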
The rowData entry of the SingleCellExperiment is a DataFrame that contains marker (protein, RNA, probe) information.
The following columns are required in the rowData entry:
- channel: a unique integer that maps to the channels of the associated multichannel images.
- metal: the metal isotope used for detection, formatted as {ChemicalSymbol}{IsotopeMass} (e.g., Y89, In115, Yb176, Bi209).
- name: marker name used in the publication that describes the dataset.
- full_name: full marker name.
- short_name: abbreviated marker name, preferably following the official UniProt nomenclature.

For the full_name and short_name columns, the following guidelines apply:
- In short_name, all dashes, dots and spaces should be removed or replaced with underscores.
- Prefix full_name with the modification type (e.g., phospho-, methyl-) and suffix it with the modified amino acids (e.g., [S28]).
- Prefix short_name with an abbreviation of the modification type (e.g., p_, me_) and do not indicate the modified amino acids, unless there is a possible confusion with another target in the dataset.

full_name | short_name |
---|---|
Carbonic anhydrase IX | CA9 |
CD3 epsilon | CD3e |
CD8 alpha | CD8a |
E-Cadherin | CDH1 |
cleaved-Caspase3 + cleaved-PARP | cCASP3_cPARP |
Cytokeratin 5 | KRT5 |
Forkhead box P3 | FOXP3 |
Glucose transporter 1 | SLC2A1 |
Histone H3 | H3 |
phospho-Histone H3 [S28] | p_H3 |
Ki-67 | Ki67 |
Myeloperoxidase | MPO |
Programmed cell death protein 1 | PD_1 |
Programmed death-ligand 1 | PD_L1 |
phospho-Rb [S807/S811] | p_Rb |
Smooth muscle actin | SMA |
Vimentin | VIM |
Iridium 191 | DNA1 |
Iridium 193 | DNA2 |
In addition, rownames(sce) should be set as rowData(sce)$short_name.
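The naming rules above can be checked mechanically. The following base R sketch uses a plain data.frame as a stand-in for rowData(sce); the full_name and short_name values come from the table above, while the name column and the pairing of markers with metals (other than Ir191) are hypothetical. The regular expressions are one possible reading of the guidelines, not an official validator.

```r
# Toy stand-in for rowData(sce); marker/metal pairings are hypothetical.
marker_metadata <- data.frame(
  channel    = 1:4,
  metal      = c("Y89", "In115", "Yb176", "Ir191"),
  name       = c("H3", "pH3", "Ki-67", "DNA1"),
  full_name  = c("Histone H3", "phospho-Histone H3 [S28]", "Ki-67", "Iridium 191"),
  short_name = c("H3", "p_H3", "Ki67", "DNA1")
)

# metal: chemical symbol (one or two letters) followed by the isotope mass.
metal_ok <- grepl("^[A-Z][a-z]?[0-9]+$", marker_metadata$metal)

# short_name: no dashes, dots or spaces.
short_ok <- !grepl("[-. ]", marker_metadata$short_name)

stopifnot(
  all(metal_ok),
  all(short_ok),
  !anyDuplicated(marker_metadata$channel),     # channel must be unique
  !anyDuplicated(marker_metadata$short_name)   # needed for use as row names
)

# Mirrors rownames(sce) <- rowData(sce)$short_name on the toy table.
rownames(marker_metadata) <- marker_metadata$short_name
```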
The assays slot of the SingleCellExperiment contains counts matrices representing marker expression levels per cell and channel.
It should at least contain a counts matrix with raw ion counts. The assays slot can also contain additional matrices with commonly used counts transformations, or counts transformations that were used in the publication that describes the dataset. All counts transformations must be documented in the .R function used to load the dataset. Common examples include:
- exprs: asinh-transformed counts. For IMC, a cofactor of 1 is typically used.
- quant_norm: counts censored (e.g., at the 99th percentile) and scaled from 0 to 1.

Neighborhood information, such as a list of cells that are localized next to each other, can be stored as a SelfHits object in the colPairs slot of the SingleCellExperiment object.
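The two count transformations mentioned above (exprs and quant_norm) can be sketched in base R as follows. The counts matrix is a toy example with hypothetical values, and applying the censoring and scaling per marker (row) is an assumption of this sketch; a dataset may have used a different scheme, which should then be documented in the loading function.

```r
# Toy counts matrix: markers in rows, cells in columns (hypothetical values).
counts <- matrix(
  c(0, 2, 5, 100,
    1, 3, 8, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("H3", "Ki67"), c("7_1", "7_2", "7_3", "7_4"))
)

# exprs: asinh transformation with a cofactor of 1.
cofactor <- 1
exprs <- asinh(counts / cofactor)

# quant_norm: censor each marker at its 99th percentile, then scale to [0, 1].
# Assumes each marker has at least two distinct values (no division by zero).
quant_norm <- t(apply(counts, 1, function(x) {
  x <- pmin(x, quantile(x, 0.99))
  (x - min(x)) / (max(x) - min(x))
}))

# Named list in the shape expected for the assays slot.
assay_list <- list(counts = counts, exprs = exprs, quant_norm = quant_norm)
```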
Multichannel images are stored in a CytoImageList object named images.
Channel names of the images object (channelNames(images)) must map to rownames(sce) (marker short names).
The metadata slot (mcols(images)) must contain an image_name column that maps to the image_name column of colData(sce), and to the image_name column of mcols(masks). This information is used by cytomapper to associate multichannel images, cell segmentation masks, and single cell data.
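As a minimal sketch of the channel name requirement, the check below compares two character vectors that stand in for rownames(sce) and channelNames(images); the values are hypothetical. Requiring the same order, in addition to the same set of names, is an assumption of this sketch.

```r
# Stand-ins for rownames(sce) and channelNames(images); values are hypothetical.
marker_short_names  <- c("H3", "p_H3", "Ki67", "DNA1")
image_channel_names <- c("H3", "p_H3", "Ki67", "DNA1")

# Names present on one side only (both should be empty).
only_in_sce    <- setdiff(marker_short_names, image_channel_names)
only_in_images <- setdiff(image_channel_names, marker_short_names)

# Strict variant: same names in the same order.
channels_match <- identical(marker_short_names, image_channel_names)

stopifnot(
  length(only_in_sce) == 0,
  length(only_in_images) == 0,
  channels_match
)
```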
Cell segmentation masks are stored in a CytoImageList object named masks.
The values of the masks should be integers mapping to the cell_number column of colData(sce). This information is used by cytomapper to associate single cell data and cell segmentation masks.
The metadata slot (mcols(masks)) must contain an image_name column that maps to the image_name column of colData(sce), and to the image_name column of mcols(images). This information is used by cytomapper to associate multichannel images, cell segmentation masks, and single cell data.
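The mapping between the three objects can be checked with a few set operations. In this base R sketch, a small integer matrix stands in for one segmentation mask and character vectors stand in for the image_name columns of colData(sce), mcols(images) and mcols(masks); all values are hypothetical, and treating 0 as background is an assumption.

```r
# Toy mask for one image: 0 is assumed to be background,
# positive integers are cell labels.
mask <- matrix(
  c(0L, 1L, 1L,
    0L, 2L, 2L,
    3L, 3L, 0L),
  nrow = 3, byrow = TRUE
)

# Stand-in for the cell_number values of the cells in the same image.
cell_numbers <- c(1L, 2L, 3L)

# Mask labels (excluding background) should match cell_number.
mask_labels <- setdiff(sort(unique(as.vector(mask))), 0L)
stopifnot(setequal(mask_labels, cell_numbers))

# Stand-ins for the image_name columns of the three objects.
sce_image_names    <- c("A01", "A01", "B02")  # one entry per cell
images_image_names <- c("A01", "B02")         # one entry per image
masks_image_names  <- c("A01", "B02")         # one entry per mask

stopifnot(
  setequal(images_image_names, masks_image_names),
  all(sce_image_names %in% images_image_names),
  !anyDuplicated(images_image_names)  # image_name must be unique per image
)
```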
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.2 knitr_1.45
## [4] rlang_1.1.3 xfun_0.42 purrr_1.0.2
## [7] textshaping_0.3.7 jsonlite_1.8.8 htmltools_0.5.7
## [10] ragg_1.3.0 sass_0.4.8 rmarkdown_2.26
## [13] evaluate_0.23 jquerylib_0.1.4 fastmap_1.1.1
## [16] yaml_2.3.8 lifecycle_1.0.4 memoise_2.0.1
## [19] bookdown_0.38 BiocManager_1.30.22 compiler_4.3.1
## [22] fs_1.6.3 systemfonts_1.0.6 digest_0.6.35
## [25] R6_2.5.1 magrittr_2.0.3 bslib_0.6.1
## [28] tools_4.3.1 pkgdown_2.0.7 cachem_1.0.8
## [31] desc_1.4.3