Peaceful Deer Hunting (Steven, Justin, Jon)

CS 385: Animal Detection using Histogram of Oriented Gradients

As our final project, we decided to work on detecting a few different animals from the Sonoma State University Preserves data set. The animals we chose to focus on were squirrels (Justin), bobcats (Jon), and deer (Steven). The majority of the examples shown in this presentation will be of squirrels. For the project we are using HOG (Histogram of Oriented Gradients) features to train the detector.

Dr. Gill provided us access to the SSU Preserves data set that Dr. Christopher Halle has collected. The images come from motion-triggered cameras on a trail in the preserves: daytime photos are captured in regular color, and nighttime photos are captured with a form of night vision. However, we were instructed not to use the preserves data for training, and instead to collect images from online sources or other databases. The squirrel and deer images came from ImageNet, and the bobcat images were collected from Google Images.

We started our project by working through an image detection tutorial from the Oxford Visual Geometry Group (http://www.robots.ox.ac.uk/~vgg/practicals/category-detection/#step-10-loading-the-training-data). While the tutorial was helpful for getting us on the right track at the start of the project, I think it hampered us later on, because it obscured some of the finer details of the calculations being done. We ran into issues particularly when it came to hard negative mining, which was also the most computationally intensive portion of the assignment.

 

Method

Bringing in positives. Constructing a mean. Notice the impact of slight variations in the pose of the bobcat, compared to the smooth, clean sign training data.

cell8numOrientation18avg - Copy.PNG, Screen Shot 2016-12-13 at 10.23.23, Screen Shot 2016-12-13 at 10.23.27, Screen Shot 2016-12-14 at 4.09.11

The signs are uniform, all front facing, nearly rotationally identical, and highly prevalent in the environment.

 

Extract HOG features; the HOG cell size can be altered. A histogram of gradient directions is compiled for each cell of the image, and the descriptor is the concatenation of these histograms.
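
As a rough sketch of this step (assuming VLFeat is installed and on the MATLAB path; the image file name is a placeholder, not one of our actual files), the descriptor can be computed and rendered like this:

    % Minimal HOG extraction sketch using VLFeat's vl_hog.
    im = im2single(imread('squirrel01.jpg')) ;   % vl_hog expects a single-precision image

    cellSize = 8 ;                    % HOG cell size in pixels; can be altered
    hog = vl_hog(im, cellSize) ;      % cells-by-cells-by-dimensions array of histograms

    % The full descriptor is just the concatenation of the per-cell histograms.
    descriptor = hog(:) ;

    % Visualize the extracted HOG template.
    figure ; imagesc(vl_hog('render', hog)) ; colormap gray ; axis equal ;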

svm_hog.JPG

For each animal, we created positive training examples by simply taking the bounding box (or the entire cropped image, in the tutorial example), but creating negative examples is another story. The method we used is to generate random bounding boxes on the explicitly negative training images provided, which do not contain the animals at all. From these results, we can take the top-scoring false positives and add them to the bank of negatives to reduce the number of false positives encountered.
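
A hedged sketch of that mining loop is below. It assumes VLFeat, a cell array negImages of negative image file names, a fixed detection window size, and a current SVM template wModel with the same HOG dimensions as a window; none of these variable names come from the tutorial code.

    % Hypothetical hard-negative mining sketch: sample random windows from the
    % negative images, score them with the current template, keep the worst offenders.
    cellSize  = 8 ;
    winSize   = [128 128] ;    % detection window in pixels (assumption)
    numPerIm  = 20 ;           % random windows sampled per negative image
    hardNegs  = {} ;
    negScores = [] ;

    for i = 1:numel(negImages)
      im = imread(negImages{i}) ;
      if size(im, 3) > 1, im = rgb2gray(im) ; end
      im = im2single(im) ;
      [imH, imW] = size(im) ;
      if imH < winSize(1) || imW < winSize(2), continue ; end
      for k = 1:numPerIm
        % Random top-left corner such that the window fits inside the image.
        y = randi(imH - winSize(1) + 1) ;
        x = randi(imW - winSize(2) + 1) ;
        crop = im(y:y+winSize(1)-1, x:x+winSize(2)-1) ;
        hog  = vl_hog(crop, cellSize) ;
        negScores(end+1) = dot(wModel(:), hog(:)) ;   % score under the current SVM
        hardNegs{end+1}  = hog ;
      end
    end

    % Keep only the highest-scoring (hardest) windows as additional negatives.
    [~, order] = sort(negScores, 'descend') ;
    hardNegs = hardNegs(order(1:min(50, numel(order)))) ;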

HOG_gradient_bobcat.JPG

Results

The first training process used 100 images and 50 negatives (which took 1.25 hours to mine).

https://lh3.googleusercontent.com/rARmFr9C96qZLwpXqaHwKFkjgvMQsBWU96Xb-FX6QfwMBm1dhqTO3cuLiAsDojh_4RJQJxryr5wQeUUvOWNzzeqFME2JjwW9UBrjf4dYN4dApzmlRwH1Oev5VdcRrdB8pXU4GbRf         deer1default.JPG

The best cases involve multiple positive detections on parts of the squirrel, or a single square on a part of the squirrel.

 

 

https://lh5.googleusercontent.com/vtEo_dqEtp0lOYNiVJZpEv0MeYQXO7d1Jmhh9ovNf6Sjq6F73oVqzTpjOL5VG4xFkdVhGxjENKL96X5cYKx-wWdyQdeNHAERqZN_y67bdGG2V6oey8-LUHvRTH9Xx3V14Qc11QyU       deer2default.jpg

In the worst case there were false positives everywhere:

 

 

 

https://lh5.googleusercontent.com/ysoRqJO1t9bQI5Ql8rVBnU73_ZX3TXVrzfvN6_xd50kRf_C0gQum8ttWeWi94J4ImnjbL6RzomrLnUI03wQcvgqBdDl8HxCmMA3k5_AD4BOApG75MjuGY3t0uUxgTUqPESGymW4B                cell8orientation18side3.jpg

 

Given so many false positives in the previous attempts, I went back to the images I'd cropped with the intent to make them all perfectly square, discarding those that couldn't be. I added additional images from Google of the Western Gray squirrel (native to northern California), and found a few squirrel side shots from the SSU preserves set.

 

Before running the next test (which takes 1.5 hours to complete!), I also flipped all cropped training images horizontally so the squirrel was facing to the left, removed the right-facing images with their aux border files, and got more restrictive by removing 'nearly-sideways' images that were slightly rotated at a 3/4 view. For the next sprint I would like to create symmetric HOG templates to handle this problem.
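
As a rough sketch of the flipping step (the file names are placeholders), and of how VLFeat's HOG permutation could in principle be used to build a left/right symmetric template without duplicating images:

    % Flip a cropped training image horizontally so the animal faces left.
    im = imread('squirrel_right.jpg') ;
    imFlipped = flip(im, 2) ;               % mirror across the vertical axis
    imwrite(imFlipped, 'squirrel_left.jpg') ;

    % Sketch of a symmetric HOG template: average a descriptor with its
    % horizontally flipped version using the permutation vl_hog provides.
    cellSize = 8 ;
    hog  = vl_hog(im2single(im), cellSize) ;
    perm = vl_hog('permutation') ;          % channel permutation for a horizontal flip
    hogFlipped   = hog(:, end:-1:1, perm) ; % flip cells left/right, permute channels
    hogSymmetric = 0.5 * (hog + hogFlipped) ;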

 

 

 

https://lh3.googleusercontent.com/ry7uKGl1tgwkeOmmWsuPG7roJqjfZJH0HpNWzrzh5mlLU4UFQr1IA5DWwurYBIylRRn8eTA42R9BoaqZTC4ajqDGV1Ga_NR2TlubpMn0aTJSBblNrgrRyfOByeCgNGZyPx_l-43E

Here’s an image of the hard-negative mining process.

 

 

 

https://lh3.googleusercontent.com/GTdCZZmc-WcC6ZJ7TVV5_wAVBkd1XPVh0xC7gB1QypuRlYh8tgw6g2H2YqTnK-lbe0ynw1E88_KEDoY5S1geeDIEgYiAejX-RQroZ_RByTFArO3jqvXvQ6UM1rg0uA5wgltUc_yD                   cell8numorientation18%20-%20Copy.PNG

 

 

 

 

Here is a series of new test image results after mining on all of the left-facing cropped images:

 

https://lh5.googleusercontent.com/NKuW5e7Wxo_tV_h8uzdLUseHX8JCs70f6jO4UiWuiVVP9C0l14ppUjcQPQ3rG63nSIzgylXcDtDq_JGVjSgRCSC1xgNXqYdLemTtSuok8LSqKC9N4sPra20lOmQFKMFrb2_HIg7-

The results always seem to be better on images like this one, which have some texture consistency in the background.

 

 

https://lh6.googleusercontent.com/Diu5crupCzgcSJ9z98wvlFLzgvq86Qu3JsHxqHCvu8v0Cd9R1iGoLMhFXpaVsGoUQ9jsla3TTYMWj5-NXzHV-3Sl59R3uciCXPvrpsAZ8rP8adMZzHXo6vErY0h4xGgfm5zhxKWt

The greater the texture variety, the more false positives there seem to be.

 

https://lh5.googleusercontent.com/jsO9Cx8dv_8ZNKjZQB9TIPGQSxdSJIc1GynSkOgtX2hRVTplB2vXMnZkxzyPfQ0-9BCmNIqSdU_MFPA3A_WG31pL-OYlMu9uzlIHAsaWm7b2tewQem47JPk_rBLReX1Lq3rVj2ht

The new training with left-facing images did not improve results here.

 

 

https://lh4.googleusercontent.com/9YzCKFlamWacUjTtW3u6iUdB5swvYQZKCgIMTUs9ReWiy8MKhwTzTsWiniIw9d1wwR3bGdk-T_AY94eqZVqk5oseaNP5dwufnQTu6J7UvgwVgXvfPXXfSRCycKoBcjyiEcWfeNCV

https://lh3.googleusercontent.com/AqnDTtmg9r6oZg02cx431EeEYnzFrXegiH866d0tcbDiwwkzG41kKakFZ3GtuQsmMdp1YS-mNJM-7DZFaS9o2jkPCGlvh-2GtUB40GRRVtrrlpuF5RAm5ySqa1fheRWXU8B8Q-4G

 

Dismayed by the poor results, we decided to take a closer look at the NaN errors produced during the hard negative mining process, and found two major problems: 1) while the code in evaluateModel.m is running, the current image in trainImages(i) is checked to see whether it is in the set of trainBoxImages, and if it is not, it is ignored completely; 2) even if the images are the same, detect.m still ignores them if they have different directory locations. To remedy the problem, I had to go back to the drawing board and ended up tracing variable mutations throughout the set of source files. In the end, I threw out the pre-cropped images and only allowed full-size images with accompanying bounding box files for my trainBoxImages, and then used a subset of those images for trainImages. Only some of the images had pre-fabricated bounding box files (for those I wrote a script to extract the bounds from the .xml files); for many more I had to manually display the pixel info to find the bounding dimensions and then manually create .txt files.
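
That bounds-extraction script was roughly along these lines; this is a simplified sketch that assumes ImageNet-style .xml annotations with <xmin>/<ymin>/<xmax>/<ymax> tags in an 'annotations' folder (the folder name and output format are placeholders, not necessarily what the tutorial expects).

    % Sketch: pull the first bounding box out of each ImageNet-style .xml
    % annotation and write it to a matching .txt file.
    xmlFiles = dir(fullfile('annotations', '*.xml')) ;
    for i = 1:numel(xmlFiles)
      xmlPath = fullfile('annotations', xmlFiles(i).name) ;
      txt = fileread(xmlPath) ;

      % Grab the first occurrence of each coordinate tag.
      xmin = str2double(regexp(txt, '<xmin>(\d+)</xmin>', 'tokens', 'once')) ;
      ymin = str2double(regexp(txt, '<ymin>(\d+)</ymin>', 'tokens', 'once')) ;
      xmax = str2double(regexp(txt, '<xmax>(\d+)</xmax>', 'tokens', 'once')) ;
      ymax = str2double(regexp(txt, '<ymax>(\d+)</ymax>', 'tokens', 'once')) ;

      % Write the box next to the annotation as a simple text file.
      [~, base] = fileparts(xmlFiles(i).name) ;
      fid = fopen(fullfile('annotations', [base '.txt']), 'w') ;
      fprintf(fid, '%d %d %d %d\n', xmin, ymin, xmax, ymax) ;
      fclose(fid) ;
    end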

 

As a result, the negative mining process changed in a few ways: 1) it actually began displaying numerical percentages; 2) it showed a blue bounding box around the object (sourced from trainBoxes); 3) occasionally it would draw another green box around the object.

 

 https://lh3.googleusercontent.com/r6zleAsf74hqJYsSsc-6QSt7fYCQKsLOL5qF7Yp0hPeukAmOmANCi2pPPOxhwOMN85nBdPRrlTtKt63VzubnzVqzxePQAi7hlo8VGuJW_WO_LPE3JW1cmfnekbesJowUtzJZIJIU

The red represents areas the SVM learned are not target objects. The blue represents what we are trying to teach the SVM.

 

 

https://lh6.googleusercontent.com/R1bNUwtwOYuDqGSa01iUpL-AQKVvpgVPPQMIf2-oIMgPeS_8R1ntsrl_kkgYVwPXaGc4EB8XazB3IslpJArcles071nhpivX7fdNLnc89BHVQ3q7kqfrcW4vFq7yTkL3WmXu70Sn

And for the first time, we see a green line, which represents the SVM confirming that it learned to detect the object in this image.

https://lh3.googleusercontent.com/hqkvNimJwIsQ3tnhiyyqVRRxeayS1IOekMtxI-lkLBDwDh2jQCKpCJTDwXUqb6hCgzr7R2snjN0SM9XV0ZJ4JNFGgEaZfaTjEYR0HgvNhd_pDPVV1Uy8TVvGmHZKDubdA6vtbVtJ

 

 

 

 

I tried out just about every combination of hogCellSize, variant, and numOrientations, and settled on 12, DalalTriggs, and 21, respectively. Still, even with these settings and all objects facing left, my results were full of false positives.
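
A hedged sketch of sweeping those combinations is below. It assumes posWindows and negWindows are cell arrays of same-size, single-precision training crops and uses VLFeat's vl_svmtrain for the linear SVM; the variable names, lambda value, and loop structure are my own, not the tutorial's.

    % Sweep HOG parameters and retrain a linear SVM for each combination.
    cellSizes  = [8 12 16] ;
    variants   = {'UoCTTI', 'DalalTriggs'} ;
    numOrients = [9 18 21] ;
    lambda     = 0.01 ;                 % SVM regularization (assumption)

    for cs = cellSizes
      for v = 1:numel(variants)
        for no = numOrients
          % Fixed-length HOG descriptor for one training crop.
          feat = @(im) reshape(vl_hog(im, cs, 'variant', variants{v}, ...
                                      'numOrientations', no), [], 1) ;
          Xpos = cellfun(feat, posWindows, 'UniformOutput', false) ;
          Xneg = cellfun(feat, negWindows, 'UniformOutput', false) ;
          X = single([cat(2, Xpos{:}), cat(2, Xneg{:})]) ;
          y = [ones(1, numel(posWindows)), -ones(1, numel(negWindows))] ;

          % Train a linear SVM; evaluation on held-out images would follow here.
          [w, b] = vl_svmtrain(X, y, lambda) ;
          fprintf('trained: cellSize=%d variant=%s numOrientations=%d\n', ...
                  cs, variants{v}, no) ;
        end
      end
    end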

 

 

https://lh6.googleusercontent.com/18iOQV0Jgh-hMxlP-TrGjSb0ZpWtRKrBOmCUEWZMRH5vcXnE22PV7wHzl3-6ye166QH4P7ZKLiA5sL8Cfm8m9-reiOPBLpycF3eiX8xOjpqn37dRdTqfAGVgyZwsJbxUUKygzqF7                           cell8oriention4side3.jpg

 

https://lh3.googleusercontent.com/INqwNUrCVL6U9nYEXijd10bq-19uPb9bvtk0M28cciHmB_u4Fmp9GFD7q3frAzwE8BzTdLNsjKn3lWw9gILSG4D6azhh36I7ijv6NyxtYYqwzre5nBNYXTOMTjiuzn2xfMNUiViK                       cell8oriention9side3%20-%20Copy.jpg

 

I suspected that the bushiness of the squirrels' tails and their varying positions were introducing too much variance into the SVM, so as a last resort, I created a second image dataset with the tails cropped out.

 

This did in fact increase the rate of detection from ~15% to about 30%: instead of only 3-4 images detected in a set of 24, the number went up to 7.

 

https://lh3.googleusercontent.com/68xjMe8cl0_BOxlQOqvIvVkBE6h8lxiwt9CnkvqmBLAktFjGWbjUdWo7aOj3omBdt3ozUlhB08tfj-XdQNdmcTNYPeF1I5fg5zYAwjbF_e3YiDHV8p7PwP2N0BSHJgeavKCsca3M

 

 

 

https://lh3.googleusercontent.com/qyCWTJqHe3KbXd9ji4luLW1V3nCrn45o8qQfrlCfw1Wh4h5v5ZTHLDYEb2nxgNYvDtfMDMi23BVBHCpIqPgLpk24rEhm8Dth4gLCUoY1Z_B7lB9rdmWqLTToBnSrXead-eia-cK9

And it finally found the black squirrel in the following image, albeit swimming in a sea of false positives.

 

 

In the end, however, when I finally tested it with SSU preserves data, it did not find the squirrel very well:

https://lh4.googleusercontent.com/Sr-tDBsjjsQR2a-DzIFHuFjmT55B56sOm2G9nTiL5Fbf3pAkk2RuAcZEzP_1iz0mGvr_7jJc-CcTkUyApduYdT6zX1L6CWStC2Whx3ye8J-V3Uks_KBiTM1GJXDzeeCZMo78OalA

 

https://lh6.googleusercontent.com/55-6hLsuB3VuKiKX3U5DppWIpzrRZafes-glkrgRpMOMFj4H3GnsNFvPLox017nWmOFkd2rvb6XdTwBgs8-o4rfRbiRKlUJpcfXYWzUa2VF48pScJ_vdNFKRLbSqOnOHrxFsf5mA

 

I think part of this has to do with the fact that all except 5 (out of 54) of the trainBoxImages are smaller than the SSU preserves images, which are each 600K. The median trainBoxImages size is 158K.

 

As a final test, I tried shrinking the SSU images to make them closer to the median dimensions in the trainBox image set, and found that it changed the detection results.
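
The shrinking itself was just a resize before running detection; a minimal sketch (the file name is a placeholder, and the scale factors correspond to the halved and quartered results shown below):

    % Shrink an SSU preserves test image before detection.
    im = imread('ssu_trailcam.jpg') ;

    imHalf    = imresize(im, 0.5) ;    % image size halved
    imQuarter = imresize(im, 0.25) ;   % image size quartered

    % Detection would then be run on imHalf / imQuarter instead of the full image.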

 

https://lh3.googleusercontent.com/_C6zIaMSI2iR6-E9A06aYwIM1Xh1gjVWoP1ILpyitIvbIuLWqTY9fjc0nf1TjgokQZJyGCCdq2zA6me_WvmO4e0TtJf31pddtEeOvty5Cidxid0i9uTH2oRjq1ofkN7oNvhovVow

 

This is with the image size halved.

 

https://lh4.googleusercontent.com/Sr3afmJ-V7oE6f6bKAgpS0S6SWx7uiKVRAem772dt6Ko6zelp9_n7Y-02MAgcCPphaE9kGx28JHWAq2IpBlUTY-LxXSxTn9ukn9i99ElxQyD7cbOtNzqFgYb8JR6vANf2GwOj209

 

Image size quartered

 

The results show that the rectangles get bigger as the scale decreases.  

 

Conclusion

I suspect that a lot of our detection issues in this project relate to the fact that the SVM was trained with lower-resolution patches, which the classifier may then have tended to prefer over higher-resolution patches. This could indicate that the vl_hog approach in this tutorial is not entirely scale-invariant.

Also, in retrospect I believe that our testing plan, in omitting all SSU preserves data from training, may have undermined the classifier's accuracy in this project. Here's why:

The terrain and squirrel breeds in the image set reflect the internet's representation of the squirrel population, which does not adequately represent the population and terrain of northern California. To handle that effectively, without training the classifier too narrowly, it may need a much larger sample size of code words.

I believe that in order to remedy this, we would need to add some negative images that contain the terrain.

As a final test, I added 3 squirrel photos from the SSU preserves dataset, which did not significantly improve the results. It could have been because the additional 3 images only represent 5% of the total of 60 images.

I think if I had trained the entire classifier using strictly northern California and/or SSU data, the results would have been much more accurate, as it would have reduced the amount of variety considerably during the training process. I think that the signs tutorial was able to achieve so much accuracy because: 1) the signs fit almost perfectly into square dimensions; 2) the backgrounds were all similar (the same rural/suburban area with a consistent environment); 3) the angles of the signs were all perfectly forward-facing and nearly perfectly aligned. Animals in wildlife shots can take many more varying positions, and squirrels in particular have a tail that can point in any direction. To truly model a squirrel classifier after the signs tutorial, one would need labels with tails pointing at 3 o'clock, 12 o'clock, and every other direction, just as the signs are labeled for right arrow, right-down arrow, right-up arrow, etc. In short, we would need many, many more squirrel images from northern California to be able to accurately train the SVM.