HuBMAP + HPA — Hacking the Human Body
Our Winstars Technology team recently participated in the Kaggle competition HuBMAP + HPA — Hacking the Human Body and finished in 95th place out of 1175 contenders, earning a bronze medal. In this article, we present our solution and highlight the essential techniques it uses. A large part of the solution can be carried over to other deep-learning tasks with little or no modification. The article is structured as follows: first, we briefly present the competition and its main challenges; then we describe our solution in three parts: data processing, model design, and training procedure.
1. Competition
The competition aims to obtain a model for segmenting functional tissue units (FTU) in tissue section images of five organs: kidney, large intestine, spleen, lung, and prostate. A detailed description of the competition can be found at HuBMAP + HPA — Hacking the Human Body.
1.1. Data
The provided training dataset comes from the Human Protein Atlas (HPA). One of the challenges is that the FTUs of each organ type look very different from one another. Below are some examples.
However, the main challenge of the competition is that the hidden (unavailable for inspection) test dataset comes from a different source, the Human BioMolecular Atlas Program (HuBMAP). It has different image parameters: pixel size, tissue thickness, and staining protocol. This means that we need to train a model on data from one distribution and evaluate it on data from another, i.e., the model has to be very good at generalizing. The evaluation happens on Kaggle's servers, so it is impossible to see examples from the test set. The organizers provided a single image from the HuBMAP dataset for reference, which you can see in Figure 2.
As we can see, the images differ in colour distribution (because of different staining procedures), magnification (the white blobs are much larger in the HuBMAP image), and shape (HPA images are circular, while HuBMAP images are rectangular).
1.2. Evaluation
Model performance is evaluated by calculating the Dice coefficient on the hidden test set and averaging it over all images. The Dice coefficient is calculated as

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|},$$

where X is the predicted set of pixels corresponding to FTUs and Y is the true set of pixels. Furthermore, there are no images without FTUs, so the denominator is never zero.
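As a quick sketch, the per-image metric can be computed as follows (a minimal implementation for binary masks; the variable names mirror the formula):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, true: np.ndarray) -> float:
    """Dice coefficient for binary masks: 2|X ∩ Y| / (|X| + |Y|)."""
    intersection = np.logical_and(pred, true).sum()
    return 2.0 * intersection / (pred.sum() + true.sum())
```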
2. Our solution
We divided the training data into 5 folds and trained 5 instances of our model, each on four of the folds, with the left-out fold acting as a validation set. In the end, the best-performing submission was an average of 2 model instances and achieved a final score of 0.77.
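A minimal sketch of the split using scikit-learn (the `train_df` dataframe, its `id` column, and the random seed are illustrative assumptions, not our exact setup):

```python
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.array(sorted(train_df["id"]))  # hypothetical dataframe of training images

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # the seed is illustrative
for fold, (train_idx, val_idx) in enumerate(kf.split(image_ids)):
    # Train one model instance on train_ids and validate it on val_ids.
    train_ids, val_ids = image_ids[train_idx], image_ids[val_idx]
    print(f"fold {fold}: {len(train_ids)} train / {len(val_ids)} val")
```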
Training on each of the folds followed the same pattern. In this section, we will describe the main components of our solution, namely: data augmentations, model architecture, and training procedure.
2.1. Data augmentations
The training dataset consists of 351 images in total and is not balanced with respect to organs: the largest class is kidney, with 99 images, and the smallest is lung, with only 48 images. Combined with the difference between HPA and HuBMAP, this makes the design of augmentations an essential part of the solution. Augmentations have to be chosen to address two problems: the small size of the training dataset and the large difference between training and test images.
To increase the dataset size, it is sufficient to use simple geometric augmentations like flips, rotations, and translations. To deal with the train and test distribution differences, we employ various colour augmentations and randomly rescale our images by different factors.
In our solution, we used multiple combinations of the following augmentations: flips, rotations, rescaling, shearing, linear contrast, hue and saturation multiplication, and element-wise addition. The augmentations were taken from imgaug; a sketch of such a pipeline is shown below.
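A minimal sketch of such a pipeline in imgaug (the parameter ranges here are illustrative, not our exact settings; `images` and `masks` are assumed to be NumPy batches):

```python
import imgaug.augmenters as iaa

# Geometric and colour augmentations roughly matching the list above;
# the parameter ranges are illustrative.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                           # horizontal flips
    iaa.Flipud(0.5),                           # vertical flips
    iaa.Affine(rotate=(-45, 45),               # rotations
               scale=(0.5, 2.0),               # rescaling
               shear=(-10, 10)),               # shearing
    iaa.LinearContrast((0.75, 1.25)),          # linear contrast
    iaa.MultiplyHueAndSaturation((0.8, 1.2)),  # hue/saturation multiplication
    iaa.Add((-20, 20)),                        # element-wise addition
], random_order=True)

# Segmentation masks must be transformed together with the images.
images_aug, masks_aug = augmenter(images=images, segmentation_maps=masks)
```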
Below are 9 examples of different augmentations applied to the same image:
The previously mentioned augmentations are applicable to any computer vision task and do not use domain-specific information. For this reason, we also used stain augmentation, which is specifically designed for tissue samples. More information on this process can be found in Structure-Preserving Color Normalization and Sparse Stain Separation for Histological Images, and the Python library staintools implements it. Below are some examples of this augmentation:
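In code, the augmentation can be applied with staintools roughly as follows (the file path and sigma values are illustrative):

```python
import staintools

# Load an RGB tissue image; the library recommends standardizing
# brightness before stain operations.
image = staintools.read_image("train_images/example.png")  # hypothetical path
image = staintools.LuminosityStandardizer.standardize(image)

# Fit the augmentor to the image, then draw randomly perturbed versions.
# sigma1/sigma2 control the strength of the stain perturbation.
augmentor = staintools.StainAugmentor(method="vahadane", sigma1=0.2, sigma2=0.2)
augmentor.fit(image)
augmented = augmentor.pop()  # one randomly stain-augmented copy of the image
```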
2.2. Model architecture
In the competition, we tried several architectures, and the best results were obtained with a transformer-based encoder and a simple decoder. In the end, we settled on a Co-Scale Conv-Attentional Image Transformer as the encoder and a convolutional decoder.
Encoder
Co-Scale Conv-Attentional Image Transformers (CoaT) is a family of models, and among those, we chose the CoaT-Lite Medium variant, which is composed of 4 serial blocks. A PyTorch implementation of CoaT can be found at https://github.com/mlpc-ucsd/CoaT/blob/main/src/models/coat.py.
Furthermore, to improve training stability and speed, we initialized our model with weights pre-trained on ImageNet-1k.
Our encoder returns the outputs of all 4 serial blocks for the decoder to use, which gives us features at different scales.
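A sketch of building the encoder from the repository's coat.py; the `return_interm_layers` and `out_features` arguments follow the repository's backbone interface as we understand it, so treat the exact names as assumptions:

```python
import torch
from coat import coat_lite_medium  # src/models/coat.py from the CoaT repository

# Build the backbone so that it returns the feature maps of all 4 serial
# blocks instead of a single classification embedding.
encoder = coat_lite_medium(
    return_interm_layers=True,
    out_features=["x1_nocls", "x2_nocls", "x3_nocls", "x4_nocls"],
)
# ImageNet-1k pre-trained weights can be loaded from the repository's
# released checkpoints via encoder.load_state_dict(...).

x = torch.randn(1, 3, 768, 768)
features = encoder(x)  # dict of 4 feature maps at strides 4, 8, 16, 32
for name, feat in features.items():
    print(name, feat.shape)
```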
Decoder
The embedding from each serial block is fed into an MLP block. After this, all embeddings are concatenated and fed into a fuse layer to get the segmentation map.
Each MLP block is a 1x1 convolution followed by Filter Response Normalization (FRN), a Thresholded Linear Unit (TLU), and upscaling (if necessary). We decided to use FRN instead of BatchNormalization because, due to memory limitations, we were forced to use a batch size of 1, and BatchNormalization is notoriously unstable in such cases. You can read more about FRN in the paper Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks; the main takeaway is that the performance of FRN with TLU does not depend on batch size, and it outperforms BatchNormalization at all batch sizes.
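Below is a minimal sketch of such a block (the channel counts and the bilinear upsampling mode are our illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRN(nn.Module):
    """Filter Response Normalization followed by a Thresholded Linear Unit.

    Normalizes each channel by the mean of its squared activations over the
    spatial dimensions (no batch statistics), then applies a learned threshold.
    """
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.tau = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        nu2 = x.pow(2).mean(dim=(2, 3), keepdim=True)  # per-channel mean square
        x = x * torch.rsqrt(nu2 + self.eps)
        return torch.max(self.gamma * x + self.beta, self.tau)  # TLU

class MLPBlock(nn.Module):
    """1x1 convolution -> FRN/TLU -> optional upscaling to a common resolution."""
    def __init__(self, in_channels, out_channels, scale=1):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.norm = FRN(out_channels)
        self.scale = scale

    def forward(self, x):
        x = self.norm(self.proj(x))
        if self.scale > 1:
            x = F.interpolate(x, scale_factor=self.scale, mode="bilinear",
                              align_corners=False)
        return x
```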
The fuse layer is a simple 3x3 convolution followed by Pixel Shuffle for upscaling. Pixel Shuffle takes the output of a convolution layer and rearranges the pixels to increase the resolution while decreasing the number of channels. We used Pixel Shuffle because it produces better upscaling than the default bilinear upsampling. You can read more about Pixel Shuffle in Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network.
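Continuing the previous sketch, the fuse layer might look like this (the upscale factor of 4 is illustrative; for an upscale factor r and a single-class mask, the convolution emits r*r channels that nn.PixelShuffle rearranges into one channel at r times the resolution):

```python
class FuseLayer(nn.Module):
    """3x3 convolution followed by PixelShuffle upscaling to the mask logits."""
    def __init__(self, in_channels, upscale=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, upscale * upscale,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):                  # x: concatenated MLP-block outputs
        return self.shuffle(self.conv(x))  # logits for the segmentation map
```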
3. Training procedure
In this section, we will discuss the main components of our training procedure, namely the loss function, optimizer, and learning rate schedule.
3.1. Loss function
We used Lovász hinge loss in combination with binary cross-entropy. Lovász hinge loss demonstrates good performance on segmentation tasks, and its authors suggest using it with cross-entropy for better convergence. For more information on this loss function, refer to The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. The PyTorch implementation of this loss function can be found here.
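A sketch of the combined loss, using the lovasz_hinge function from the official repository (the 50/50 weighting here is illustrative):

```python
import torch.nn.functional as F
from lovasz_losses import lovasz_hinge  # official PyTorch implementation
                                        # (bermanmaxim/LovaszSoftmax)

def segmentation_loss(logits, targets, bce_weight=0.5):
    """Binary cross-entropy combined with the Lovász hinge loss.

    logits, targets: tensors of shape (B, 1, H, W); targets are 0/1 floats.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    lovasz = lovasz_hinge(logits.squeeze(1), targets.squeeze(1))
    return bce_weight * bce + (1.0 - bce_weight) * lovasz
```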
3.2. Optimizer
We chose Ranger as our optimizer. Ranger combines Rectified Adam (RAdam) with the Lookahead mechanism. This combination works particularly well because RAdam stabilizes training at the beginning, while Lookahead stabilizes training and convergence for the rest of the run.
For more information on RAdam, refer to On the Variance of the Adaptive Learning Rate and Beyond, and for details on Lookahead, refer to Lookahead Optimizer: k steps forward, 1 step back.
Furthermore, we used gradient accumulation over eight batches. Since we do not compute any running statistics during training, this is equivalent to having a batch size of 8 while processing each input individually. We did this to get more stable gradient updates, as a batch size of 1 led to less regular training. A sketch of this setup is shown below.
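In the sketch, `model`, `train_loader`, and `segmentation_loss` are assumed from the surrounding pipeline, and the Ranger import follows the popular lessw2020 implementation:

```python
from ranger import Ranger  # e.g. lessw2020/Ranger-Deep-Learning-Optimizer

ACCUM_STEPS = 8
optimizer = Ranger(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for step, (image, mask) in enumerate(train_loader):  # batch size 1
    logits = model(image)
    # Scale the loss so the accumulated gradient matches a batch of 8.
    loss = segmentation_loss(logits, mask) / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```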
Additionally, we used Stochastic Weight Averaging (SWA) over 5 different epochs, which increased our score by an extra 0.01–0.02. In short, SWA over 5 epochs means that during training we save the model weights at 5 different epochs and, at the end, rather than using the final weights, we average the saved weights and use those. Since our model does not use running statistics (e.g., BatchNormalization), we do not need to take any additional steps. SWA worked particularly well in this competition because one of its fundamental properties is that it improves generalization, and, as mentioned before, generalization is the main challenge of this competition. For more information on SWA, refer to Averaging Weights Leads to Wider Optima and Better Generalization.
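A sketch using PyTorch's built-in SWA utilities (`train_one_epoch` and the `swa_epochs` set are hypothetical placeholders):

```python
from torch.optim.swa_utils import AveragedModel

swa_model = AveragedModel(model)   # keeps a running average of the weights
swa_epochs = {50, 52, 54, 56, 58}  # hypothetical epochs to average over

for epoch in range(60):
    train_one_epoch(model)  # hypothetical training step
    if epoch in swa_epochs:
        swa_model.update_parameters(model)

# No update_bn() pass is needed, since the model has no BatchNorm layers;
# swa_model is used directly for inference.
```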
3.3. Learning rate
We used a cosine warmup from 0 to 0.001 over 10 epochs, then a cosine decay to 0.00001 over the next 40 epochs, and trained for an additional 10 epochs with a constant learning rate of 0.00001. We chose such a schedule because the FRN authors showed that FRN benefits from a learning rate warmup.
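This schedule can be expressed, for example, as a LambdaLR over epochs (a sketch; the optimizer is assumed to be created with lr=1e-3):

```python
import math
from torch.optim.lr_scheduler import LambdaLR

WARMUP, DECAY = 10, 40  # epochs; a constant tail of 10 epochs follows
BASE_LR, MIN_LR = 1e-3, 1e-5

def lr_lambda(epoch):
    if epoch < WARMUP:          # cosine warmup: 0 -> BASE_LR
        return 0.5 * (1.0 - math.cos(math.pi * epoch / WARMUP))
    if epoch < WARMUP + DECAY:  # cosine decay: BASE_LR -> MIN_LR
        t = (epoch - WARMUP) / DECAY
        return (MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * t))) / BASE_LR
    return MIN_LR / BASE_LR     # constant tail at MIN_LR

scheduler = LambdaLR(optimizer, lr_lambda)  # call scheduler.step() once per epoch
```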
4. Conclusion
In this article, we have presented our solution for the Kaggle competition HuBMAP + HPA — Hacking the Human Body. While some parts of the solution were competition-specific, e.g., stain augmentations, others are applicable to a broader range of tasks. We believe the presented pipeline can act as a good baseline when working on other problems and hope it will help you.
For additional information, contact us here.
Atilla BAHCHEDJIOGLOU