COVID-19 Detection




The Goal

Our team's goal was to review the existing research on COVID-19 detection through image recognition, build multiple models, and explore different classification methods using CT scan data as input.

Our datasets were as follows:

  1. UCSD Covid and Non-Covid Dataset
  2. Coronacase & Radiopaedia Dataset

Transfer Learning with Random Data Split

The first preliminary test was an attempt to reproduce the results Dr. Pham obtained using a random data split, with 80% of the images going to training and 20% to testing. The CNNs we tested were resnet18, resnet50, densenet201, and shufflenet. For each CNN, we ran this data split once without data augmentation (rotations, reflections, translations, and scaling) and once with data augmentation applied to the images. For each test, we calculated the mean accuracy, mean F-score, and their standard deviations.
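
As a rough illustration of this setup, the sketch below shows how the random split and the augmentation options might be wired up in MATLAB. The folder layout, variable names, and augmentation ranges are assumptions for illustration, not the exact values from our scripts.

% Sketch: random 80/20 split with optional augmentation (illustrative settings).
% Assumes the CT images sit in class subfolders named 'COVID' and 'nonCOVID'.
imds = imageDatastore('data', 'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.8, 'randomized');

net = resnet18;                         % also tested: resnet50, densenet201, shufflenet
inputSize = net.Layers(1).InputSize;

% Augmentation applied only in the "with augmentation" runs.
augmenter = imageDataAugmenter( ...
	'RandRotation', [-10 10], ...
	'RandXReflection', true, ...
	'RandXTranslation', [-5 5], 'RandYTranslation', [-5 5], ...
	'RandScale', [0.9 1.1]);
augTrain = augmentedImageDatastore(inputSize(1:2), imdsTrain, 'DataAugmentation', augmenter);
augTest = augmentedImageDatastore(inputSize(1:2), imdsTest);

From there, the network's final layers are replaced for the two-class problem and the model is trained with trainNetwork, as in standard transfer learning.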

Results

Table 1: Transfer learning with a random data split and data augmentation.
Table 2: Transfer learning with a random data split and no data augmentation.

Like Dr. Pham, we found that using no data augmentation with a random data split led to higher accuracies and F-scores for the four CNNs we tested; the accuracy was nearly 95% for each network. These accuracies seemed suspiciously high, so we investigated a technique called leave-one-patient-out cross validation to remove bias and obtain accuracies that are more representative of how well the CNNs actually performed on our dataset.



Leave-one-patient-out Cross Validation

Leave-one-patient-out cross validation is a process where the entire dataset is divided so that each patient's lung images form their own subset. In our case we have 384 patients, so we get 384 subsets. Each subset corresponds to one iteration of training and testing: that subset is used for testing and the rest of the data is used for training. For the preliminary testing, after training the network in each iteration, we calculated the accuracy and F-score. After every subset had been tested, we calculated the mean accuracy and mean F-score along with their standard deviations. This cross validation is useful because it lets us test our networks on the same dataset multiple times. It also ensures that a patient's images never appear in training and testing at the same time, removing the bias that can occur with a random data split.
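
A rough sketch of this loop is below. The helper functions (movePatientToTest, loadLeaveOutFolders, movePatientBackToTrain) are hypothetical stand-ins for our scripts, and for brevity the sketch pools all held-out predictions into a single F-score instead of averaging per-patient values as described above.

% Sketch of the leave-one-patient-out loop (helper names are hypothetical).
% layers/options come from the usual transfer-learning setup; class labels are
% assumed to be 'COVID' and 'nonCOVID'.
patientIDs = unique(allPatientIDs);              % 384 patients in our dataset
accs = zeros(numel(patientIDs), 1);
allPred = categorical.empty(0,1);
allTruth = categorical.empty(0,1);

for p = 1:numel(patientIDs)
	movePatientToTest(patientIDs(p));            % move this patient's images into the test folder
	[imdsTrain, imdsTest] = loadLeaveOutFolders();   % datastores over leave_out/train and leave_out/test
	trainedNet = trainNetwork(augmentedImageDatastore(inputSize(1:2), imdsTrain), layers, options);

	pred = classify(trainedNet, augmentedImageDatastore(inputSize(1:2), imdsTest));
	accs(p) = mean(pred == imdsTest.Labels);
	allPred = [allPred; pred];
	allTruth = [allTruth; imdsTest.Labels];

	movePatientBackToTrain(patientIDs(p));       % restore the folders for the next iteration
end

% F-score with COVID as the positive class, plus the mean accuracy and its spread.
TP = sum(allPred == 'COVID' & allTruth == 'COVID');
FP = sum(allPred == 'COVID' & allTruth == 'nonCOVID');
FN = sum(allPred == 'nonCOVID' & allTruth == 'COVID');
Fscore = 2*TP / (2*TP + FP + FN);
fprintf('mean accuracy %.3f +/- %.3f, F-score %.3f\n', mean(accs), std(accs), Fscore);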

For leave-one-out, the first thing we do is create a folder called leave_out that contains training and testing folders for noncovid and covid images. At first, all images go into the training folder. Depending on which patient we are holding out, we move that patient's images from the training folder into the testing folder, then train our network as usual.

	
% Go through the COVID images and get the images that need to be moved.
% (In the COVID sheet, Var2 holds the patient ID and Var1 the image file name.)
if(isCOVID == 1)
	for i = 1:covid_size
		if(strcmp(COVID_sheet.Var2{i},patient_id) == 1)
			% Store as a string so file names of different lengths concatenate cleanly.
			images_to_move = [images_to_move ; string(COVID_sheet.Var1{i})];
		end
	end
end

% Go through the nonCOVID images and get the images that need to be moved.
% (In the nonCOVID sheet, Var3 holds the patient ID and Var2 the image file name.)
if(isCOVID == 0)
	for i = 1:noncovid_size
		if(strcmp(nonCOVID_sheet.Var3{i},patient_id) == 1)
			images_to_move = [images_to_move ; string(nonCOVID_sheet.Var2{i})];
		end
	end
end

% Get number of images to move.
move_num = size(images_to_move,1);

% Move the COVID patient's images into the validation folder.
if(isCOVID == 1)
	for i = 1:move_num
		file_name = images_to_move(i);
		movefile(fullfile(train_covid_path, file_name + "*"), val_covid_path);
	end
end

% Move the nonCOVID patient's images into the validation folder.
if(isCOVID == 0)
	for i = 1:move_num
		file_name = images_to_move(i);
		movefile(fullfile(train_noncovid_path, file_name + "*"), val_noncovid_path);
	end
end
	



For the preliminary testing of transfer learning with leave-one-out validation, we found that resnet50 had the highest accuracy and F-score, while shufflenet had the lowest. Every network obtained an accuracy of at least 75%. The results obtained with leave-one-out cross validation are significantly lower and have much larger standard deviations than the results obtained with random data splits. This is because randomly splitting the data introduces bias that inflates the accuracy and F-score obtained by the CNN.
Figure 1: Preliminary classification results for Densenet201 for leave-one-patient-out cross validation.
Figure 1.1: Preliminary classification results for Shufflenet for leave-one-patient-out cross validation.
Figure 1.2: Preliminary classification results for Resnet50 for leave-one-patient-out cross validation.
Figure 1.3: Preliminary classification results for Resnet18 for leave-one-patient-out cross validation.

Feature Extraction from CNN for use on SVM Category Classifier:

(Leave One Patient Out Data Split)

F-scores Compared for All CNNs

Densenet201 performed the best across the board for the SVM classifier. Shufflenet was close behind, followed by Resnet50. Resnet18 performed the worst. This was interesting to see because resnet18 performed the best for

Densenet201

Densenet201 performed the best with feature extraction. The last concatenation layer (conv5_block32_concat) was used for this test.
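
As a rough sketch of this pipeline (the variable names and the single binary fitcsvm call are ours; the kernel settings mirror the templateSVM options shown in the SURF section below), features are pulled from that layer with activations and handed to an SVM:

% Sketch: deep features from densenet201's last concatenation layer feeding an SVM.
% augTrain/augTest are augmentedImageDatastores resized to the network's input size.
net = densenet201;
layer = 'conv5_block32_concat';
featTrain = activations(net, augTrain, layer, 'OutputAs', 'rows');
featTest = activations(net, augTest, layer, 'OutputAs', 'rows');

% Binary COVID vs. non-COVID SVM on the extracted features.
svmModel = fitcsvm(featTrain, imdsTrain.Labels, ...
	'KernelFunction', 'rbf', 'KernelScale', 'auto', 'Standardize', true);
predicted = predict(svmModel, featTest);
accuracy = mean(predicted == imdsTest.Labels);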

Below is the testing for a variety of parameters on the CNN densenet201.









Shufflenet

Shufflenet performed the 2nd best. Layer node_198 was used to achieve this result.

Below is the testing for a variety of parameters on the CNN Shufflenet.







Resnet50

Resnet50 performed third best. Layer add_16 was used to achieve this result.

Below is the testing for a variety of parameters on the CNN Resnet50.







Resnet18

Resnet18 performed by far the worst. Layer pool5 was used to achieve this result.

Below is the testing for a variety of parameters on the CNN Resnet18.



Final Results

For our final testing, we used the same four CNNs (resnet18, resnet50, densenet201, and shufflenet). The entire UCSD dataset was used for training, and the three image slices per patient (from the Coronacase and Radiopaedia CT scans) were used for testing. We tested transfer learning with and without data augmentation, this time without leave-one-out cross validation. When data augmentation was used, we trained and tested each network five times and calculated the average accuracy, average F-score, and their standard deviations. We also tested this dataset with feature extraction and an SVM classifier and calculated the accuracy and F-score. To label a patient as having COVID, two or more of their three images had to be classified as containing COVID, as sketched below.
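
The patient-level decision boils down to a majority vote over the three slices; a minimal sketch of that rule, with illustrative variable names, is:

% predictedLabels holds the per-slice predictions for the test set, three slices per patient,
% and is assumed to be ordered patient by patient.
slicePreds = reshape(predictedLabels, 3, []);           % one column per patient
patientCOVID = sum(slicePreds == 'COVID', 1) >= 2;      % COVID if two or more slices are flagged
fprintf('Patients flagged as COVID: %d of %d\n', sum(patientCOVID), numel(patientCOVID));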

Both methods for resnet18 yielded similar results (Figure 2), with an 81.67% accuracy without data augmentation and a 77% accuracy with data augmentation.

Figure 2: Classification results for resnet18.

Resnet50 performed much better without data augmentations, getting around a 10% higher accuracy than with data augmentation (Figure 3).

Figure 3: Classification results for resnet50.

Shufflenet performed significantly worse when trained with data augmentation, getting the worst accuracy out of all of the networks (Figure 4). Using this method, shufflenet was only able to identify half of the patients as having COVID.

Figure 4: Classification results for shufflenet.

Densenet201 was the only network that performed better with data augmentation, as shown in Figure 5. For our final tests with transfer learning, this network achieved the best accuracy and F-score while maintaining a very low standard deviation for both. Densenet201 correctly identified that 19 out of the 20 patients had COVID.

Figure 5: Classification results for densenet201.
Figure 6: Comparison of F-scores of the different models we tested: SVM, transfer learning without data augmentations, and transfer learning with data augmentations.

Class Activation Mapping

Since most neural networks are treated as black boxes, our group was naturally curious about what was going on inside and what our CNNs were basing their decisions on. Fortunately, we were introduced to heat-mapping techniques that can show which image features are weighted most heavily by each CNN. The method we chose was Class Activation Mapping (CAM). CAM uses the final convolutional layer of the CNN as input; this layer holds the feature maps. We first compute a score for each feature map through Global Average Pooling: the average of each feature map, obtained by summing all of its values and dividing by the number of elements. This gives us K scores for K feature maps. We can then get a score for a given class by multiplying each of the weights that connect the Global Average Pooling outputs to that class by the corresponding pooled value and summing the results. (In some networks this already happens in a fully connected layer that follows the last convolutional layer, as in densenet201, for example.)
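
In symbols (this is the standard CAM formulation, written in our own notation), with $F_k(x,y)$ the $k$-th feature map of the last convolutional layer, $Z$ the number of elements in each map, $w_k^c$ the fully connected weight from feature map $k$ to class $c$, and $b_c$ that layer's bias:

$$S_c = b_c + \sum_k w_k^c \cdot \frac{1}{Z}\sum_{x,y} F_k(x,y), \qquad \mathrm{CAM}_c(x,y) = \sum_k w_k^c \, F_k(x,y)$$

The MATLAB snippet below computes the class scores $S_c$; the heat map $\mathrm{CAM}_c$ is computed a few lines further down.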


% Feature maps from the last convolutional layer for the (resized) input image.
imageActivations = activations(net,imResized,layerName);
% Global Average Pooling: mean over the spatial dimensions of each feature map.
scores = squeeze(mean(imageActivations,[1 2]));
% Weights and bias of the fully connected layer that follows the pooling.
fcWeights = net.Layers(end-2).Weights;
fcBias = net.Layers(end-2).Bias;
% Class scores: weighted sum of the pooled feature-map values plus the bias.
scores = fcWeights*scores + fcBias;

After we have those scores, we can obtain the heat map by multiplying each weight directly with its respective feature map and summing the results.


% Take the top predicted class and reshape its weights for broadcasting.
[~,classIds] = maxk(scores,2);
weightVector = shiftdim(fcWeights(classIds(1),:),-1);
% Weighted sum of the feature maps gives the class activation map.
classActivationMap = sum(imageActivations.*weightVector,3);

We then use a helper function from a MathWorks documentation article to display the heat map properly.


% labels and maxScores come from classifying the image earlier in the script.
subplot(1,2,2)
CAMshow(im,classActivationMap)
title(string(labels) + ", " + string(maxScores));

% Overlay the class activation map on the grayscale image.
function CAMshow(im,CAM)
imSize = size(im);
CAM = imresize(CAM,imSize(1:2));          % upsample the CAM to the image size
CAM = normalizeImage(CAM);
CAM(CAM < 0.2) = 0;                       % suppress weak activations
cmap = jet(255).*linspace(0,1,255)';
CAM = ind2rgb(uint8(CAM*255),cmap)*255;
combinedImage = double(rgb2gray(im))/2 + CAM;
combinedImage = normalizeImage(combinedImage)*255;
imshow(uint8(combinedImage));
end

% Scale an array to the range [0,1].
function N = normalizeImage(I)
minimum = min(I(:));
maximum = max(I(:));
N = (I-minimum)/(maximum-minimum);
end

Heatmap Images

The heatmaps for resnet18 with data augmentation differed from the heatmaps without data augmentation, as seen in Figure 7. With data augmentation, the heatmaps showed that the CNN was looking between the lungs rather than inside of them for classification.

Figure 7: Sample heatmaps for resnet18.

Resnet50 performed better than resnet18, as seen in Figure 8. With non-augmented data, the classification heavily utilized the areas outside the actual lungs. With augmented data, we can see that there was a large focus inside of the patient's lungs.

Figure 8: Sample heatmaps for resnet50.

Shufflenet did very poorly, as displayed in Figure 9. Without data augmentation, the inside of the lungs was rarely examined; with data augmentation, the network only looked inside the lungs some of the time and also included portions of the outside.

Figure 9: Sample heatmaps for shufflenet.

Densenet201 performed the best with augmented data. With data augmentation, the CNN consistently looks inside the lungs to classify the images (Figure 10). With non-augmented data, the model would look both inside and outside of the lungs.

Figure 10: Sample heatmaps for densenet201.






Feature extraction (CNN) for SVM:

(Final Data Set)

For our final data set, the entire UCSD dataset was put into training and the three image slices for each patient (from the Coronacase and Radiopaedia CT scans) were put into our testing set. This allowed us to train on a large amount of data. The one problem with our final data set is that we have no non-COVID images in our testing set. This is why the bottom row of the confusion matrix is blank (there are no false positives and no true negatives). In future tests we would like to integrate non-COVID images into our testing set.





Resnet18 and Resnet50

Using our best parameters from what we learned with Leave One Out SVM, we achieved high accuracies for both Resnet18 and Resnet50.





Densenet201 and Shufflenet

Once again using our best parameters from what we learned with Leave One Out SVM, we achieved high accuracies for both Densenet201 and Shufflenet.

Our Thoughts on the Results:

Strangely, three of our networks (Densenet201, Shufflenet, and Resnet18) performed the same: all three correctly classified 19 of the 20 COVID patients. In the future, we can explore why this happens and see if it is consistent. Including non-COVID patients in the testing set could help explain these results.





Feature Extraction from SURF and SIFT:

Step 1:
Using the bagOfFeatures function from the Computer Vision Toolbox, we were able to extract keypoints from each image.
These keypoints were then used to create a visual "bag of words". The bagOfFeatures function makes this an automated and easy process.
 bag = bagOfFeatures(imdsTrain); 

Step 2:
Once you have your bag of visual words, you can use trainImageCategoryClassifier (also from the Computer Vision Toolbox), with SVM learner options supplied via templateSVM from the Statistics and Machine Learning Toolbox. This function creates and trains an image category classifier: you hand it your training set and the bag of words, and you can also give it a specific type of learner.
 
optionsSVM = templateSVM('KernelFunction', 'rbf', 'KernelScale', 'auto', 'Standardize', true);
categoryClassifier = trainImageCategoryClassifier(imdsTrain, bag,'LearnerOptions', optionsSVM);
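
Once trained, the classifier can be checked against a held-out set; a minimal sketch (imdsTest is assumed to be the test datastore) is:

% Confusion matrix over the held-out images; the mean of its diagonal is the average accuracy.
confMat = evaluate(categoryClassifier, imdsTest);
avgAccuracy = mean(diag(confMat));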
					 


SURF Results:



SURF was fascinating to visualize. Using an RBF kernel, we achieved an accuracy of 75.37% and an F-score of 0.7135. In the future we would like to look further into the possibilities of SURF, perhaps by tuning its parameters.



SIFT Results:

SIFT Edge Thresh: 8

SIFT seemed to pick up features outside the lung region, which was not what we wanted. However, with more tuning it could produce better results in the future.

SIFT Edge Thresh: 4

We have yet to collect quantitative results with SIFT. Despite the lack of data, it was interesting to see the difference in feature extraction.



Conclusions

Although we have done extensive testing, our models are not actually detecting whether an image of a lung has COVID-19. What they are finding is whether the lungs have features of viral pneumonia. The only reason we can attribute that viral pneumonia specifically to COVID-19 is that we are in the middle of a pandemic, which means this is not the best clinical way to determine whether someone has COVID-19. Although we are not experts with this type of data, it looked like a few of the UCSD COVID CT images had no pneumonia features, while some of the UCSD non-COVID CT images did show pneumonia features.



References

Pham, T.D. A comprehensive study on classification of COVID-19 on computed tomography with pretrained convolutional neural networks. Sci Rep 10, 16942 (2020). https://doi.org/10.1038/s41598-020-74164-z