Introduction
Cancer is one of the most common causes of mortality in humans, and melanoma is one of the most prevalent cancers. This disease starts when melanocytes, a specific type of skin cell, begin to grow out of control. According to the annual reports of the American Cancer Society, melanoma accounts for the highest mortality rate among skin cancers, and about 100,350 new melanomas were expected to be diagnosed in 2020 (1). Therefore, timely diagnosis of this disease is very important (2).
Over the past decade, dermatologists have begun to use dermoscopy, a non-invasive imaging tool, to improve the diagnosis of melanoma skin lesions. This instrument provides a magnified image of the skin lesion through polarized light; it reveals more details of the skin structure and improves the validity of the diagnosis (3). Nevertheless, diagnosis from a large number of dermoscopy images by dermatologists remains difficult, time-consuming, and subjective. Thus, accurate automatic skin lesion recognition systems are very helpful, and even critical, for the timely diagnosis of skin cancers.
The first essential stage in any computer-based diagnostic system is object segmentation (4–6). Lesion segmentation is still a challenge because of the large variety of skin lesions in shape, size, location, color, and texture. Additional factors such as hair, blood vessels, ruler marks, air bubbles, and low-contrast borders between lesions and the surrounding tissue are among the obstacles to correct segmentation (7).
Generally, there are various methods for image segmentation, such as methods based on edge detection (8), thresholding (9), region detection (10), feature clustering (11), and deep neural networks (12). Recently, image segmentation using deep learning, and particularly convolutional neural networks (CNNs), has achieved higher accuracy in biomedical applications (13–18).
In 2017, Yu et al. developed an approach using deep residual networks (FCRN: fully convolutional residual network) for the segmentation and classification of melanoma skin lesions. They tested their method on the ISBI 2016 dataset and achieved an accuracy of 94.9% (19). Lin et al. proposed two approaches for skin lesion segmentation, one based on C-means clustering and one based on U-Net, and evaluated them on the ISBI 2017 dataset; the clustering-based and U-Net-based methods achieved Dice indices of 61% and 77%, respectively (20). Yuan et al. offered a deep FCN-based skin lesion segmentation method. They used the Jaccard distance as a loss function and improved the basic FCN method. They examined their method on the ISBI 2016 and PH2 datasets and achieved accuracies of 95.5% and 93.8%, respectively (21). In 2017, Yuan et al. presented a skin lesion segmentation method using deep convolutional-deconvolutional neural networks (CDNN). They trained their model with different color spaces of the dermoscopy images of the ISBI 2017 dataset. Their approach ranked first in the ISBI 2017 challenge, achieving a Jaccard index of 76.5% (22).
In 2018, Al-masni et al. conducted a study on skin lesion segmentation and designed a full-resolution convolutional network. They performed experiments on the ISBI 2017 and PH2 datasets and achieved Jaccard indices of 77.11% and 84.79%, respectively (7).
In this work, a two-stage CNN-based skin lesion segmentation method is proposed. Two different deep neural network structures are used in the normalization and segmentation stages to improve lesion segmentation performance.
Our contributions in this work are as follows:
- Proposing a two-stage segmentation method containing a lesion detection stage before the segmentation stage.
- Employing robust deep neural architectures in both the detection and segmentation stages.
- Using 4 different modes of each input image in the detection stage and estimating the bounding box by weighted averaging of the bounding boxes obtained for each mode.
- Using 8 different modes of each normalized image in the segmentation stage and constructing the final segmented image from the segmentation results of each mode.
Materials and Methods
Dataset
The proposed segmentation method is evaluated on the well-known ISBI 2017 challenge dataset. This dataset was prepared by the International Skin Imaging Collaboration (ISIC) archive (23) and is available online at (24). ISBI 2017 is the latest version of this dermoscopic image dataset that contains segmentation ground truth for all training, validation, and test images. It consists of 2750 8-bit RGB dermoscopy images with sizes ranging from 540×722 to 4499×6748 pixels. A total of 2000, 150, and 600 images are designated for training, validation, and test, respectively.
For a more complete evaluation, another dataset of non-dermoscopic images is used in our experiments: DermQuest, which consists of 137 images (25).
Proposed method
In recent years, various single-stage semantic segmentation methods, such as U-Net and FCN, have been used for medical image segmentation. The accuracy of these single-stage methods is sensitive to the size and location of the objects in the images. Very large and very small lesions, as well as the varying locations of lesions within images, increase the complexity of the trainable networks and reduce performance. It is therefore better to perform a pre-segmentation step that normalizes the size and location of lesions in the images; this reduces the complexity of the training procedure of the segmentation network and also increases segmentation efficiency. In the proposed method, a detection stage is placed before the segmentation stage to normalize the size and location of the lesion in an input dermoscopy image. Figure 1 shows the framework of the proposed method.
Figure 1. Framework of the proposed method
Detection stage
The most important part of the proposed method is the estimation of the bounding box of the lesion in an image, because the results of this stage substantially affect the segmentation performance: any error in the detection stage results in high costs in the segmentation stage. Therefore, the accuracy of the detection stage is very important. We use object detection networks in our detection stage. Several methods based on deep convolutional neural networks have been proposed for object detection, such as R-CNN (11), Fast R-CNN (26), Faster R-CNN (27), Mask R-CNN (28), the Single Shot multi-box Detector (SSD) (29), You Only Look Once (Yolo) (30), and RetinaNet (31).
Yolo is an object detection structure based on convolutional neural networks. It divides the image into several sub-regions and, for each region, predicts boxes and their probabilities of belonging to the classes (30). The Yolo v2 structure is used in our detection stage to estimate the bounding box of the lesion in a dermoscopic image. The output of the Yolo is the coordinates and size of the bounding box of the detected lesion, as well as the detection score. In Figure 1, x and y are the coordinates of the top-left corner of the box around the detected lesion, h and w are the height and width of the bounding box, respectively, and score is the lesion detection score. The normalized image is constructed from the input image based on the location and size of the bounding box of the lesion detected by the Yolo. As shown in Figure 1, the output of the detection stage is a normalized image in which the detected lesion lies at the center; the red dashed rectangle in this figure is the bounding box of the lesion detected by Yolo. The Yolo v2 architecture requires a convolutional network as its backbone. We use several pre-trained networks as its backbone and select the best one for our application.
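The exact construction of the normalized image from the detected box is not detailed here; the sketch below shows one plausible implementation, assuming the image is cropped around the detected box with a small margin and resized to a fixed size (the margin, output size, and use of OpenCV are our assumptions, not values reported in the paper).

```python
import cv2
import numpy as np

def normalize_image(image, x, y, w, h, margin=0.25, out_size=(448, 448)):
    """Crop a region centered on the detected lesion box and resize it.

    Only a plausible sketch of the normalization step; margin, output size,
    and resizing method are assumptions, not values reported in the paper.
    """
    H, W = image.shape[:2]
    cx, cy = x + w / 2.0, y + h / 2.0                 # lesion center
    half_w = w * (1.0 + margin) / 2.0
    half_h = h * (1.0 + margin) / 2.0
    x0, x1 = int(max(cx - half_w, 0)), int(min(cx + half_w, W))
    y0, y1 = int(max(cy - half_h, 0)), int(min(cy + half_h, H))
    crop = image[y0:y1, x0:x1]                        # lesion centered in the crop
    return cv2.resize(crop, out_size)                 # fixed-size normalized image
```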
To improve the accuracy of the detection stage, three additional images are created from the original input image by rotation and flipping. Each of these modes is applied to the Yolo network and its corresponding lesion bounding box is estimated. The four modes of an input image are: the input image itself, its horizontal and vertical flips, and the input image rotated by 180 degrees. The bounding box of each mode is flipped or rotated back to the original orientation. The size and coordinates of the final lesion box are then calculated by weighted averaging of the sizes and coordinates of the four boxes estimated by Yolo, the weights being the detection scores obtained at the Yolo output (a minimal sketch of this fusion appears below, after Figure 2). Figure 2 shows the bounding boxes estimated for the four modes of an input image as well as the final bounding box calculated from their weighted average.
Figure 2. Final bounding box estimation for an image of the ISBI 2017 dataset. Red, green, blue, and yellow rectangles are the bounding boxes obtained for the 1st, 2nd, 3rd, and 4th modes of the input image, respectively. The black dashed rectangle is the final bounding box calculated by weighted averaging of these 4 bounding boxes. The detection score of each bounding box is given in the corresponding color
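A minimal numpy sketch of this score-weighted box fusion is given below; the function and variable names, as well as the example values, are illustrative and not taken from the paper.

```python
import numpy as np

def fuse_boxes(boxes, scores):
    """Score-weighted average of the per-mode bounding boxes.

    boxes  : array of shape (4, 4); each row is [x, y, w, h] for one mode,
             already mapped back to the original image orientation.
    scores : array of shape (4,); Yolo detection scores used as weights.
    """
    boxes = np.asarray(boxes, dtype=float)
    weights = np.asarray(scores, dtype=float)
    weights /= weights.sum()                 # normalize the detection scores
    return weights @ boxes                   # weighted average of [x, y, w, h]

# Hypothetical boxes from the 4 modes of one image and their scores
boxes = [[120, 80, 300, 260], [118, 84, 305, 255],
         [125, 78, 295, 262], [121, 82, 298, 258]]
scores = [0.92, 0.88, 0.90, 0.85]
x, y, w, h = fuse_boxes(boxes, scores)
```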
Segmentation stage
The normalized image is delivered to the input of the segmentation stage. The output of this stage is a binary image in which the foreground pixels form the segmented lesion. Several methods and networks have been used for image segmentation in various applications; the DeepLab architecture is one of the most recent structures and has performed well in many of them (32).
In general, the DeepLab architecture combines two common designs: spatial pyramid pooling and encoder-decoder networks (33). The DeepLab structure has been improved over time, yielding DeepLab v1 (32), DeepLab v2 (34), DeepLab v3 (35), and DeepLab v3+ (33). In our segmentation stage, DeepLab v3+ is used to segment the lesion from the surrounding tissue. Several pre-trained networks are considered as the backbone of the DeepLab, and the best one is determined in our experiments.
In the segmentation stage, a total of eight different modes of the normalized image are considered to obtain a more accurate segmentation result: the input image, its horizontal and vertical flips, and the image rotated by -45, 45, 90, 180, and 270 degrees. The output of each mode is flipped or rotated back to the original orientation, and the final binary result is obtained from these eight binary images: a pixel is considered foreground in the final result if it is nonzero in at least 3 of the 8 result images, as in the sketch below.
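The following short sketch illustrates this majority-vote fusion, assuming the eight per-mode masks have already been flipped or rotated back to the original orientation (names are ours).

```python
import numpy as np

def fuse_masks(masks, min_votes=3):
    """Combine the eight per-mode binary masks of the normalized image.

    masks : list of 8 binary arrays with identical shape (H, W).
    A pixel is foreground in the final mask if it is nonzero in at
    least `min_votes` of the per-mode masks (3 of 8 in our setting).
    """
    votes = np.sum([m > 0 for m in masks], axis=0)
    return (votes >= min_votes).astype(np.uint8)
```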
Results
Evaluation metrics
A commonly used metric to evaluate object detection methods is the mean average precision (mAP). For evaluating segmentation methods, the following metrics have been used in the literature. Sensitivity (SEN) represents the rate of skin lesion pixels that are correctly detected. Specificity (SPE) is the rate of non-lesion pixels that are correctly classified (36). The Jaccard index (JAC) is the intersection over union (IoU) of the segmented lesions with the ground truth masks (37). The Dice index (DIC) measures the similarity of the segmented lesions to the ground truth (16). Accuracy (ACC) shows the overall performance of the segmentation (37). All of these criteria are computed from the four elements of the confusion matrix as follows:
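In terms of the four confusion-matrix elements, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), these metrics take their standard forms:

```latex
\begin{aligned}
SEN &= \frac{TP}{TP + FN}, \qquad
SPE  = \frac{TN}{TN + FP}, \qquad
ACC  = \frac{TP + TN}{TP + TN + FP + FN}, \\[4pt]
JAC &= \frac{TP}{TP + FP + FN}, \qquad
DIC  = \frac{2\,TP}{2\,TP + FP + FN}.
\end{aligned}
```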
In skin lesion segmentation applications, the main metric is the Jaccard index; lesion segmentation competitions, such as ISIC, have ranked participants by this criterion.
Experimental setup
Deep neural networks require a large number of images in the training phase in order to adjust their parameters, whereas the training set of the ISBI 2017 dataset contains only 2000 dermoscopic images. To increase the number of training images, data augmentation was performed by randomly applying horizontal/vertical flipping, rotation, brightness change, and resizing operations to the images of the ISBI 2017 training set. Finally, a set of 14,000 images was constructed and used to train the Yolo network. These images were then manually normalized, and the normalized images form the training set for the DeepLab network.
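A minimal sketch of such a random augmentation pipeline is shown below, using torchvision.transforms; the library choice and the specific parameter values are illustrative assumptions, not the settings used in our experiments.

```python
from torchvision import transforms

# Random flips, rotation/resizing, and brightness change applied to each
# training image; parameter ranges are illustrative only.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomAffine(degrees=45, scale=(0.8, 1.2)),  # rotation + resizing
    transforms.ColorJitter(brightness=0.3),                 # brightness change
])

# Generating six augmented variants per image plus the original would turn
# the 2000 training images into roughly 14,000 samples.
```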
In a similar way, the validation set of the ISBI 2017 dataset was augmented to construct a set of 600 images as our validation set. The augmented validation set was used to determine the best backbones for Yolo and DeepLab, as well as to prevent overfitting during the training of the DeepLab.
For the experiments on the DermQuest dataset, the images were randomly divided into training and test sets containing 103 and 34 images, respectively. This random division was repeated 4 times and the average results are reported in Table 5. The training images were augmented in the same way as the ISBI 2017 images, so that each augmented DermQuest training set consisted of 2163 images.
Experimental results
Several experiments were conducted to determine the proper backbone networks for the Yolo v2 and DeepLab v3+ architectures. A 6 GB NVIDIA GeForce RTX 2060 GPU was used in our experiments. The initial learning rate and mini-batch size were set to 0.001 and 8 samples, respectively. The learning rate was kept constant during Yolo training and was decreased by a factor of 0.3 every 10 epochs during training of the DeepLab. Table 1 reports the Jaccard values obtained by the DeepLab with various backbones; in these experiments the detection stage was ignored and the input image was applied directly to the input of the DeepLab. Results on the augmented validation set showed that the best backbones for the Yolo and DeepLab architectures were the Vgg19 and Resnet101 networks, respectively. In subsequent experiments, the DeepLab based on Resnet101 was used in the segmentation stage. Different pre-trained networks were used as the backbone of Yolo v2 in the detection stage; the results are given in Table 2.
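For reference, the DeepLab learning-rate schedule described above corresponds to a simple piecewise-constant decay, as in the sketch below (the function name is ours).

```python
def deeplab_lr(epoch, initial_lr=1e-3, drop=0.3, every=10):
    """Piecewise-constant schedule: multiply the rate by `drop` every `every` epochs."""
    return initial_lr * drop ** (epoch // every)

# Epochs 0-9 use 1e-3, epochs 10-19 use 3e-4, epochs 20-29 use 9e-5, and so on.
```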
The effect of using more than one mode of the images in the normalization and segmentation stages is shown in Table 3. The proposed method is compared with various methods on the ISBI 2017 and DermQuest datasets based on the evaluation metrics in Table 4 and Table 5, respectively. It can be observed that the proposed CNN-based lesion segmentation approach outperformed the other methods.
Table 1. Performance of different DeepLab structures with various pre-trained networks as the backbone on the ISBI 2017 dataset
Backbone network of the DeepLab | Jaccard (%) | No. parameters (millions)
Resnet 101 | 77.02 | 44.6
Resnet 50 | 76.84 | 25.6
Vgg 19 | 76.72 | 144
Vgg 16 | 75.56 | 138
Resnet 18 | 74.97 | 11.7
Densenet 201 | 74.85 | 20
Alexnet | 74.82 | 61
Mobilenet v2 | 74.78 | 3.5
Googlenet | 74.18 | 7
Squeezenet | 69.87 | 1.24
Xception | 64.26 | 22.9
Table 2. Performance of different Yolo v2 structures with various pre-trained networks as the backbone on the ISBI 2017 dataset
Backbone network of the Yolo v2 |
|