Contribution guidelines and dataset format

Introduction

This vignette describes the procedure to contribute new datasets to the imcdatasets package and contains guidelines for dataset formatting.

Contribution guidelines

Contributions or suggestions for new imaging mass cytometry (IMC) datasets to add to the imcdatasets package are always welcome. New datasets can be suggested by opening an issue at the imcdatasets GitHub page. The only requirements are that the new dataset (i) is publicly available and (ii) has been described in a published scientific article.

Details about creating Bioconductor’s ExperimentHub packages are available here.

Create a data generation script

The first step is to create a new branch at the imcdatasets GitHub page.

Then, create an R Markdown (.Rmd) script in ./inst/scripts/ to generate the data objects:

Download single cell data, multiplexed images and cell segmentation masks.
Format the single cell data into a SingleCellExperiment object.
Format the images and masks into CytoImageList objects.
Save the three data objects so that they can be uploaded to ExperimentHub.

The three data objects must contain matching information so that they can be associated by the cytomapper package, as described in the imcdatasets vignette.

The .Rmd script must be formatted in the same way as pre-existing scripts. Examples can be found here and here. Each step should be clearly and comprehensively documented.

For usability of the package and consistency across datasets, the data objects must be formatted as described in the Dataset format section below.

Update the documentation

Other files in the imcdatasets package should be updated to include the new dataset:

Make a new ./R/Lastname_Year_Type.R file with a function to load the new dataset and extensive documentation. Examples can be found here and here.
Run roxygenize to generate man documentation files (go to the imcdatasets directory and run roxygen2::roxygenize(".")).
Update the ./inst/scripts/make-metadata.R R script and run it. This will update the ./inst/extdata/metadata.csv file that is used by ExperimentHub to provide metadata information about datasets.
Add the reference of the paper that describes the dataset to the ./inst/scripts/ref.bib file.
Add the new dataset to the ./inst/extdata/alldatasets.csv file.
Add the new dataset to the dataset list in ./tests/testthat/test_loading.R.

Open a pull request

After these steps have been completed, open a pull request at the imcdatasets GitHub page.

The package maintainers will do the following:

Check the R markdown script for data generation.
Generate the data objects by knitting the R markdown script.
Make sure the data objects are well formatted and consistent with the other datasets.
Check all the new package metadata and documentation.
Upload the data objects to AWS S3 and announce the upload to Bioconductor Hubs.
Download the data objects from ExperimentHub and check the format again.
Update the NEWS, DESCRIPTION (add new contributor, version bump) and README.md (if needed) files.
Build and check the package, making sure it passes all R and Bioconductor checks.
Push to GitHub and check that the imcdatasets package can be installed from there.
Test the package functionality in R.
Once everything works, approve the pull request and merge it into the master branch.

Contributors will be recognized by having their names added to the DESCRIPTION file of the imcdatasets package.

Dataset format

The imcdatasets package is meant to provide quick and easy access to published and curated IMC datasets. Each dataset consists of three data objects that can be retrieved individually:

Single cell data in the form of a SingleCellExperiment object.
Multichannel images formatted into a CytoImageList object.
Cell segmentation masks formatted into a CytoImageList object.

The three data objects can be mapped using unique image_name values contained in the metadata of each object.

For consistency across datasets, the guidelines below must be followed when creating a new dataset.

Single cell data

Single cell data should be formatted into a SingleCellExperiment object named sce that contains the following slots:

colData: observations metadata.
rowData: marker metadata.
assays: marker expression levels per cell.
colPairs (optional): neighborhood information.

colData

The colData entry of the SingleCellExperiment object is a DataFrame that contains observation metadata (e.g., for cells, slides, tissues, patients). It is recommended that all column names have a prefix that indicates the level of observation (e.g., cell_, slide_, tissue_, patient_, tumor_).

The following columns are required:

image_name and/or image_number: unique name and/or number of the image (ablated ROI). Should map to the image_name/image_number column(s) in the metadata of the images and masks objects.
cell_number: integer representing cell numbers. Should map to the values of cell segmentation masks.
cell_id: a unique cell identifier defined as {image_number}_{cell_number} (e.g., 7_138).
cell_x and cell_y: position of the cell centroid on the image. These columns are used as SpatialCoords when converting to a SpatialExperiment object.

In addition, colnames(sce) should be set as colData(sce)$cell_id.
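
As a minimal sketch of the required columns, the toy table below builds cell_id from image_number and cell_number in base R. All values are invented for illustration, and the plain data.frame is only a stand-in: in a real script the table would be converted to a DataFrame and used as the colData of sce.

```r
# Toy cell-level metadata (all values are invented for illustration)
cell_meta <- data.frame(
  image_name   = c("A01", "A01", "B02"),
  image_number = c(7L, 7L, 8L),
  cell_number  = c(138L, 139L, 1L),
  cell_x       = c(12.5, 40.2, 3.3),
  cell_y       = c(88.1, 15.7, 64.0)
)

# cell_id joins image_number and cell_number with an underscore
cell_meta$cell_id <- paste(cell_meta$image_number,
                           cell_meta$cell_number,
                           sep = "_")

# The identifiers must be unique because they are meant to become colnames(sce)
stopifnot(!anyDuplicated(cell_meta$cell_id))
rownames(cell_meta) <- cell_meta$cell_id
```

Checking uniqueness before assigning the row names catches duplicated image or cell numbers early, before the table is attached to the SingleCellExperiment object.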

rowData

The rowData entry of the SingleCellExperiment is a DataFrame that contains marker (protein, RNA, probe) information.

The following columns are required in the rowData entry:

channel: a unique integer that maps to the channels of the associated multichannel images.
metal: the metal isotope used for detection, formatted as {ChemicalSymbol}{IsotopeMass} (e.g., Y89, In115, Yb176, Bi209).
name: marker name used in the publication that describes the dataset.
full_name: full marker name.
short_name: abbreviated marker name, preferably following the official UniProt nomenclature.
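
The format of the channel and metal columns can be checked with a few lines of base R. This is only a sketch: the regular expression is an assumption about what "chemical symbol followed by isotope mass" looks like, not a check taken from the package, and the vectors are stand-ins for rowData(sce)$channel and rowData(sce)$metal.

```r
# Stand-ins for rowData(sce)$channel and rowData(sce)$metal
channel <- 1:4
metal   <- c("Y89", "In115", "Yb176", "Bi209")

# Assumed pattern: one uppercase letter, an optional lowercase letter,
# then the isotope mass as digits
metal_pattern <- "^[A-Z][a-z]?[0-9]+$"
metal_ok <- grepl(metal_pattern, metal)

# Channels should be integers and must not repeat
channel_ok <- is.integer(channel) && !anyDuplicated(channel)

stopifnot(all(metal_ok), channel_ok)
```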

For the full_name and short_name columns, the following guidelines apply:

In short_name, all dashes, dots and spaces should be removed or replaced with underscores.
For post-translationally modified proteins:
Prefix the full_name with the modification type (e.g., phospho-, methyl-) and suffix it with the modified amino acids (e.g., [S28]).
Prefix the short_name with an abbreviation of the modification type (e.g., p_, me_) and do not indicate the modified amino acids, unless there is possible confusion with another target in the dataset.

‘full_name’ and ‘short_name’ examples for some commonly used markers
full_name	short_name
Carbonic anhydrase IX	CA9
CD3 epsilon	CD3e
CD8 alpha	CD8a
E-Cadherin	CDH1
cleaved-Caspase3 + cleaved-PARP	cCASP3_cPARP
Cytokeratin 5	KRT5
Forkhead box P3	FOXP3
Glucose transporter 1	SLC2A1
Histone H3	H3
phospho-Histone H3 [S28]	p_H3
Ki-67	Ki67
Myeloperoxidase	MPO
Programmed cell death protein 1	PD_1
Programmed death-ligand 1	PD_L1
phospho-Rb [S807/S811]	p_Rb
Smooth muscle actin	SMA
Vimentin	VIM
Iridium 191	DNA1
Iridium 193	DNA2

In addition, rownames(sce) should be set as rowData(sce)$short_name.
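
The short_name rule on dashes, dots and spaces could be applied with a small helper such as the one below. clean_short_name is a hypothetical function written for this example, not part of imcdatasets, and it always replaces the characters with underscores. The guidelines also allow removing them, as in Ki67 in the table above, so the result still needs a manual review.

```r
# Hypothetical helper: replace dashes, dots and spaces with underscores
clean_short_name <- function(x) {
  # Runs of dashes, dots or spaces become a single underscore
  x <- gsub("[-. ]+", "_", x)
  # Drop underscores left at the start or end of the name
  gsub("^_+|_+$", "", x)
}

short_names <- clean_short_name(c("PD-1", "PD-L1", "p-H3"))

# short_name values are used as rownames(sce), so they must be unique
stopifnot(!anyDuplicated(short_names))
```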

assays

The assays slot of the SingleCellExperiment contains counts matrices representing marker expression levels per cell and channel.

It should at least contain a counts matrix with raw ion counts. The assays slot can also contain additional matrices with commonly used counts transformations, or counts transformations that were used in the publication that describes the dataset. All counts transformations must be documented in the .R function used to load the dataset. Common examples include:

exprs: asinh-transformed counts. For IMC, a cofactor of 1 is typically used.
quant_norm: counts censored (e.g., at the 99th percentile) and scaled from 0 to 1.
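
The two transformations listed above can be sketched in base R as follows. The counts are invented, censor_scale is a hypothetical helper, and censoring each marker separately is an assumption made for this example. A given dataset may use a different scheme, which is why the transformations must be documented.

```r
# Toy raw counts: 3 markers (rows) x 5 cells (columns), invented values
counts <- matrix(c(0, 1, 5, 20, 100,
                   2, 0, 8,  3,  50,
                   0, 0, 1,  4,   9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("H3", "CDH1", "VIM"),
                                 paste0("1_", 1:5)))

# exprs: asinh-transformed counts with a cofactor of 1
cofactor <- 1
exprs <- asinh(counts / cofactor)

# Hypothetical helper: censor values at a percentile, then scale to [0, 1].
# A marker with constant counts would give NaN and needs separate handling.
censor_scale <- function(x, prob = 0.99) {
  upper <- quantile(x, prob)
  x <- pmin(x, upper)
  (x - min(x)) / (max(x) - min(x))
}

# apply() returns markers in columns, so transpose back to markers x cells
quant_norm <- t(apply(counts, 1, censor_scale))
```

Both matrices keep the dimensions and dimnames of counts, so they can be stored next to it in the assays slot.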

colPairs

Neighborhood information, such as a list of cells that are localized next to each other, can be stored as a SelfHits object in the colPairs slot of the SingleCellExperiment object.
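
A minimal sketch of storing neighbor pairs is shown below, assuming the SingleCellExperiment and S4Vectors packages are installed. The counts, the neighbor pairs and the entry name "neighborhood" are invented for illustration. The guidelines above do not prescribe a name for the colPairs entry.

```r
library(SingleCellExperiment)

# Toy object: 3 markers x 4 cells, invented values
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(c("H3", "CDH1", "VIM"),
                                 c("7_1", "7_2", "7_3", "7_4")))
sce <- SingleCellExperiment(assays = list(counts = counts))

# Invented neighbor pairs, given as column indices of sce
from <- c(1L, 2L, 2L, 3L)
to   <- c(2L, 1L, 3L, 2L)

# nnode must equal the number of cells in sce
hits <- S4Vectors::SelfHits(from, to, nnode = ncol(sce))
colPair(sce, "neighborhood") <- hits

colPairNames(sce)
```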

Images and masks

Images

Multichannel images are stored in a CytoImageList object named images.

Channel names of the images object (channelNames(images)) must map to rownames(sce) (marker short names).

The metadata slot (mcols(images)) must contain an image_name column that maps to the image_name column of colData(sce), and to the image_name column of mcols(masks). This information is used by cytomapper to associate multichannel images, cell segmentation masks, and single cell data.

Masks

Cell segmentation masks are stored in a CytoImageList object named masks.

The values of the masks should be integers mapping to the cell_number column of colData(sce). This information is used by cytomapper to associate single cell data and cell segmentation masks.

The metadata slot (mcols(masks)) must contain an image_name column that maps to the image_name column of colData(sce), and to the image_name column of mcols(images). This information is used by cytomapper to associate multichannel images, cell segmentation masks, and single cell data.
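
The matching rules for images, masks and single cell data can be checked before submission. The base R sketch below uses plain vectors and a small matrix as stand-ins for colData(sce)$image_name, mcols(images)$image_name, mcols(masks)$image_name and one mask. same_images is a hypothetical helper, all values are invented, and treating 0 as background in the mask is an assumption.

```r
# Stand-ins for the image_name columns of the three objects
sce_image_names   <- c("A01", "A01", "B02")  # one entry per cell
image_image_names <- c("A01", "B02")         # one entry per image
mask_image_names  <- c("A01", "B02")         # one entry per mask

# Hypothetical helper: TRUE if the three objects refer to the same images
# and no image or mask name is repeated
same_images <- function(sce_names, image_names, mask_names) {
  setequal(sce_names, image_names) &&
    setequal(image_names, mask_names) &&
    !anyDuplicated(image_names) &&
    !anyDuplicated(mask_names)
}
names_match <- same_images(sce_image_names, image_image_names, mask_image_names)

# Stand-in for one mask: integer pixel values, 0 assumed to be background
mask <- matrix(c(0L, 1L, 1L,
                 0L, 2L, 2L,
                 3L, 3L, 0L), nrow = 3, byrow = TRUE)
# Stand-in for the cell_number values of the cells in that image
cell_number <- c(1L, 2L, 3L)

# Every non-background mask value should correspond to one cell_number
mask_ids  <- setdiff(unique(as.vector(mask)), 0L)
ids_match <- setequal(mask_ids, cell_number)

stopifnot(names_match, ids_match)
```

With real objects, the same helper could be called on the three image_name columns directly, and the mask check repeated for each image.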

Session info

## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.30.0
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5         cli_3.6.2           knitr_1.45         
##  [4] rlang_1.1.3         xfun_0.42           purrr_1.0.2        
##  [7] textshaping_0.3.7   jsonlite_1.8.8      htmltools_0.5.7    
## [10] ragg_1.3.0          sass_0.4.8          rmarkdown_2.26     
## [13] evaluate_0.23       jquerylib_0.1.4     fastmap_1.1.1      
## [16] yaml_2.3.8          lifecycle_1.0.4     memoise_2.0.1      
## [19] bookdown_0.38       BiocManager_1.30.22 compiler_4.3.1     
## [22] fs_1.6.3            systemfonts_1.0.6   digest_0.6.35      
## [25] R6_2.5.1            magrittr_2.0.3      bslib_0.6.1        
## [28] tools_4.3.1         pkgdown_2.0.7       cachem_1.0.8       
## [31] desc_1.4.3

Nicolas Damond

Created: 13 September 2022; Compiled: 16 March 2024
