Code & Data
We strive to make all of our tools, protocols, and datasets available to the community. For the most up-to-date source code, visit our group page on GitHub.
Source code
We are building computational pipelines for weakly supervised image analysis in digital pathology. We strive to always incorporate the latest technologies, so our pipelines are constantly evolving.
- Our current standard pipeline for weakly supervised pathology image analysis is “marugoto” (2022-2023): https://github.com/KatherLab/marugoto
- “Deepmed” package by Marko van Treeck (Python implementation, 2020-2022): https://github.com/KatherLab/deepmed
- “Histology image analysis” (HIA) package by Narmin Ghaffari Laleh (Python implementation, 2020-2022): https://github.com/KatherLab/HIA
- Package for deep learning-based pan-cancer prediction of molecular alterations (Matlab implementation, 2019-2021): https://github.com/jnkather/DeepHistology
- Deep learning for detecting virus presence in cancer images, from 2018/2019: https://github.com/jnkather/VirusFromHE
- Deep learning for detecting MSI in gastrointestinal cancer, original code from 2018/2019: https://github.com/jnkather/MSIfromHE
Metadata
- Metadata for the TCGA cohort, preprocessed for computational pathology analyses: https://github.com/KatherLab/cancer-metadata
Trained models
- Our latest models for MSI prediction in colorectal cancer (PyTorch) are available at https://zenodo.org/record/5151502
Cancer histology images
Human solid tumors are made up of many different tissue types. Image analysis pipelines often start by classifying these regions (such as tumor, stroma, or necrosis). The following are labeled, quality-controlled image sets that can be used to train tissue classifiers:
- Benchmark datasets: we demonstrate the functionality of Deepmed on two benchmark datasets, TCGA-BRCA-A2 and TCGA-BRCA-E2, which are available at https://zenodo.org/record/5337009
- 5000 labeled images of colorectal cancer tissue (from this paper): download
- 100,000 labeled images of colorectal cancer tissue (from this paper): download
- 1,000,000 images of colorectal cancer tissue in: download
- ~12k images for tumor detection in colorectal and gastric cancer (512x512 px at 0.5 µm/px, from this paper): download
After the tissue of interest has been detected in whole slide images, deep learning classifiers can extract clinically meaningful information from the images. The following datasets can be used to train such classifiers:
- ~400k image patches of microsatellite instable (MSI) vs. microsatellite stable (MSS) colorectal and gastric cancer tissue (from this paper): download. The patches are derived from the TCGA data set at http://portal.gdc.cancer.gov.
- Image patches of all colorectal cancer (CRC) whole slide images from the TCGA database, cut into tiles of 512x512 px for subsequent deep learning analysis. Only the manually annotated tumor region was processed. Patient pseudonyms (TCGA barcodes) are preserved in the dataset: https://zenodo.org/record/3784345. The corresponding genetic information is available at https://cbioportal.org. Original data credit: http://portal.gdc.cancer.gov.
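Because the TCGA barcodes are preserved, tiles can be grouped by patient before they are matched to genetic labels. The following is a minimal sketch, not part of any of our packages. It assumes that each tile's file name contains the slide barcode; the file names and barcodes shown are hypothetical, and the actual naming in the archive should be checked first. It relies only on the standard TCGA convention that the first three barcode fields (TCGA-XX-XXXX) identify the patient.

```python
import re
from collections import defaultdict

# The first three fields of a TCGA barcode (TCGA-XX-XXXX) identify the patient.
PATIENT_BARCODE = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def patient_id(tile_name):
    """Return the patient-level TCGA barcode found in a tile name, or None."""
    match = PATIENT_BARCODE.search(tile_name)
    return match.group(0) if match else None


def group_tiles_by_patient(tile_names):
    """Map each patient barcode to the list of tile names that contain it."""
    groups = defaultdict(list)
    for name in tile_names:
        pid = patient_id(name)
        if pid is not None:  # skip files without a TCGA barcode
            groups[pid].append(name)
    return dict(groups)


# Hypothetical tile names, for illustration only.
example_tiles = [
    "TCGA-AA-3555-01Z-00-DX1_tile_0001.png",
    "TCGA-AA-3555-01Z-00-DX1_tile_0002.png",
    "TCGA-A6-2671-01Z-00-DX1_tile_0001.png",
    "README.txt",
]
groups = group_tiles_by_patient(example_tiles)
for pid, tiles in groups.items():
    print(pid, len(tiles))
```

Grouping at the patient level is useful when splitting data for training and testing, so that tiles from the same patient do not end up in both sets.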
Generated images
Deep generative adversarial networks can generate realistic histology images:
- 2500 image tiles of generated colorectal cancer tissue, 256x256 px

Protocols
- The “Aachen Protocol” for data preprocessing in deep learning histology image analysis: https://zenodo.org/record/3694994
Others
- Manual tumor annotations on TCGA diagnostic slides: https://zenodo.org/record/5320076
- Trained PyTorch models for MSI/dMMR status prediction in colorectal cancer: https://zenodo.org/record/5151502