Our task was to gain a better understanding of whether the audio features generated from MEL spectrograms of avian activity at sites across Sonoma County were distinct enough to properly distinguish and identify those sites within a global soundscape.
The goal of our project was split into three separate tasks:
One task was to take a Bag of Words approach with land cover type. The first objective was to see whether there was strong enough separation of features to accurately classify different land cover types; this would give us some idea of the possible validity of this method.
The second objective was to show a regression relationship between different land cover types across sites.
One task was to extract features using VGGish and compute a regression model. This would allow us to estimate the relationships between the extracted features and their corresponding richness values.
One task was to visualize the extracted features using UMAP.
What was our Data?
Data was supplied by Colin Quinn and Dr. Clark in the form of MEL spectrograms. These spectrograms give us an idea of the audio at a given site over a period of time
(1 minute). In our case we had 154 different sites, with 150+ spectrograms per site. Each site was also assigned a land cover type based on features found at the site, as well as a species richness number detailing how many unique avian species were present at the site. Dr. Clark also suggested that we use only spectrograms recorded between 5 AM and 10 AM, as this is a period of high activity for birds and could result in clearer trends.
Bag of Words
Landcover Type
RW - (Aquatic Vegetation, Herbaceous Wetland, Riparian Shrub, Salt Marsh, Water)
HB - (Herbaceous)
SH - (Non-native, Shrub)
FO - (Forest Silver, Hardwood Forest, Non-native Forest, Riparian Forest)
FC - (Conifer Forest, Mixed Conifer, Hardwood Forest)
I started by sorting my sites into 5 separate land cover categories (FO, HB, SH, RW, FC), placing one site of each land cover into the test set and the rest into training.
I then used bagOfFeatures (one important decision here: I did not opt to create a custom feature extractor, but used the built-in one) to generate a bag
of audio features across all land cover types.
I then trained my classifier on that bag and checked it against the test set of separate land cover types.
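A minimal sketch of that workflow, assuming the spectrogram images are organized into one folder per land cover type; note that splitEachLabel is used here for brevity, whereas the actual split held out one site per class:

```matlab
% Load spectrogram images, labeling each by its land cover folder
% (FO, HB, SH, RW, FC); folder name is an assumption for this sketch.
imds = imageDatastore('landCoverFolders', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[trainSet, testSet] = splitEachLabel(imds, 0.8, 'randomized');

% Build the bag of features with the default (SURF-based) extractor.
bag = bagOfFeatures(trainSet);

% Train a category classifier on the bag, then evaluate on held-out data.
classifier = trainImageCategoryClassifier(trainSet, bag);
confMat = evaluate(classifier, testSet);   % per-class confusion matrix
```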
Results for BoW Classification
Test 1 (all land cover types in the test set have 90% or greater of a particular type of land cover).
We can see that even when SH is the primary land cover type (s2lam003_190304 is the sole SH land cover present in this example), our classifier sees it as HB.
This is interesting because HB is not present in any way according to the CSV.
[Classifier output table; columns: FO, HB, SH, RW, FC]
Test 2 (no care given to any particular land cover; average land cover majority of 67% per site). Another interesting observation is that HB appears extremely dominant across the different land cover types (s2lam001_180604 is the sole SH land cover present in this example). RW is still the dominant land cover type, but the HB% is higher in RW than in HB, indicating a possible combination of features.
[Classifier output table; columns: FO, HB, SH, RW, FC]
Issues with using BoW for classification/regression.
The classifier is not performing as well as it should. Why?
As we can see from the previous percentages, when the test set is not composed of land covers that are overwhelmingly dominant (which is unrealistic, as sites are nearly always made up of multiple land cover types), the classifier struggles to do better than an average guess, regardless of training composition or size.
What do I think this means?
I think this is a likely indicator that the features extracted by BoW are not unique to a particular land cover type; rather, similar features are present across different land cover types, leading to the significant amount of error. (Preston will present some visualizations of land cover types later on that also indicate large amounts of overlap.) As all these sites were recorded throughout Sonoma County, it seems plausible that avian activity could be similar within a localized area.
Conclusion for BoW
Not as lit as I would like….
Overall I can't really say my approach to the problem was successful. I was not able to find a way to successfully classify different land cover types with a high degree of accuracy, and as a result there would be no point in showing a regression relationship, as those relative values can't be correct anyway.
One worry I have is that the default SURF algorithm employed by bagOfFeatures is not able to successfully separate and extract audio features. While I am sure it detects visual features correctly, I may require a feature extraction technique specific to audio. If I had this assignment to do over again, I would have spent more time looking into how to accomplish that.
Regression
VGGish: Feature Extraction/Regression:
- First we wanted to extract features from 95,000 images and get the labels, which are the site names.
- Next we wanted to use the pre-trained VGGish network to extract the features, but first we had to get the correct input size for VGGish. We used "ColorPreprocessing" with "rgb2gray" to convert each image to the size and format the network expects. Once we extracted the features, we wanted one feature vector per site, so we took the average over each site and stored it in a cell array (a sketch of this pipeline follows this list).
- Now that our images are the correct size and we have a feature vector per site, we can match the richness values to their corresponding feature vectors and concatenate the richness values onto the end of the feature vector matrix.
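A minimal sketch of this pipeline, assuming the pretrained VGGish network from Audio Toolbox and a per-image categorical site label vector siteLabels (the folder name and label variable are assumptions):

```matlab
net = vggish();                          % pretrained VGGish network
inSize = net.Layers(1).InputSize;        % input size VGGish expects

imds = imageDatastore('spectrograms', 'IncludeSubfolders', true);
% Resize each image and collapse RGB to grayscale for the network.
augds = augmentedImageDatastore(inSize(1:2), imds, ...
    'ColorPreprocessing', 'rgb2gray');

features = predict(net, augds);          % one 128-D embedding per image

% Average the embeddings within each site: one feature vector per site.
sites = categories(siteLabels);
siteFeatures = cell(numel(sites), 1);
for k = 1:numel(sites)
    siteFeatures{k} = mean(features(siteLabels == sites(k), :), 1);
end
```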
- The first method we used was fitlm, which fits a linear model describing the relationship between a dependent variable Y (response) as a function of one or more independent variables X (predictors). In our case, the response variable is the species richness and the independent predictor variables are the per-site feature vectors (see the sketch below).
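A minimal sketch, assuming siteFeatures and a matching per-site richness vector from the step above:

```matlab
% Stack the per-site feature vectors into a 144-by-128 predictor matrix
% and append richness as the final column of a table.
X = cell2mat(siteFeatures);
tbl = array2table([X richness]);
mdl = fitlm(tbl);   % fitlm treats the last table variable as the response
disp(mdl)           % observations, RMSE, R-squared, p-values
```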
- The second method we used was stepwiselm, which creates a linear model for the variables in a table or dataset array using stepwise regression to add or remove predictors. This means that if the predictors contain a lot of zero values, stepwise regression will get rid of the least contributing ones using backward selection (see the sketch below).
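A matching sketch, reusing the table from the fitlm example:

```matlab
% Start from the full linear model so stepwiselm can prune weak
% predictors via backward steps.
mdl2 = stepwiselm(tbl, 'linear');
disp(mdl2.Formula)   % shows which predictors survived selection
```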
Results:
Regression:
- fitlm: You can see that there are 144 total observations, which are our sites.
Based on the root mean squared error, we can see that there was a small amount of variability from the individual data values to the mean, and we use this to measure how accurately the model
predicts the response. The R-squared value suggests that the model explains approximately 80% of the variability in the response variable.
The p-value here is a measure of the probability that an observed difference could have occurred just by random chance. The points on this plot fit much more closely to the
best-fit line than with the stepwise backward selection method. This may be happening because we are using the true values from the per-site feature vectors, while stepwise
cuts the data down to only the non-zero values. This gives a better representation of the relationship between the extracted features and richness.
- stepwiselm: Stepwiselm used far fewer variables than fitlm did. Fitlm used all 128 features, but after backward selection only 4 predictors were kept.
You can see that the root mean squared error is a bit higher here, but not by much.
This R-squared value suggests that the model explains around 55% of the variability in the response variable, since it is using only the 4 predictors that contribute the most.
The points on this graph have some values near the best-fit line, but most of the values are spread out above or below it. This may be happening because the relationship between the
features and the true richness may not correspond as well as we would like. We know that each site has a specific richness, but based on the plot, we can deduce that
this relationship does not hold cleanly when extracting features and relating them to their corresponding richness.
Dawn Chorus Data
The “Dawn Chorus” is the period of the day between 5 and 10 AM when we expect there to be a lot of bird activity.
Initially my plan for the dawn chorus data was to put all of the files recorded between 5 and 10 AM into one mega folder containing all of the data, but that brought up a few issues:
We have no way of determining which site the data is from.
We want the data by site so we can look at the different land covers that the dawn chorus data is covering.
It makes parsing the data take much longer than if we had it at the site level.
Understanding these difficulties, I set out to learn how to copy files from one place to another in MATLAB. From this we get the algorithm that searches for any files with the dawn chorus timestamp and then parses that file's name to determine which site it belongs to.
Results
From this we can see that the dawn chorus data is a tighter fit to the linear model than all of the site data. We believe that the early morning is more likely to contain bird sounds rather than ambient/non-bird sounds.
Problems
With a large data set, if I forgot to suppress output at the MATLAB level it could cause MATLAB to freeze or crash, and this happened quite often, sometimes losing files and progress made on creating the necessary folders.
This is the code to get the timestamps of interest.
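A minimal sketch of that step, assuming each file name embeds its recording time as site_YYMMDD_HHMMSS (the naming scheme and folder names here are assumptions):

```matlab
% Scan the flat folder of spectrograms and keep only dawn chorus files.
files = dir(fullfile('allSpectrograms', '*.png'));
isDawn = false(numel(files), 1);
for k = 1:numel(files)
    % Pull the HHMMSS token off the end of the file name.
    tok = regexp(files(k).name, '_(\d{6})\.png$', 'tokens', 'once');
    if ~isempty(tok)
        hour = str2double(tok{1}(1:2));
        isDawn(k) = hour >= 5 && hour < 10;   % dawn chorus: 5-10 AM
    end
end
dawnFiles = files(isDawn);
```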
This code separates them into individual folders.
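And a matching sketch for the copy step, again assuming the site name is the token before the first underscore:

```matlab
% Copy each dawn chorus file into a per-site folder.
for k = 1:numel(dawnFiles)
    site = extractBefore(dawnFiles(k).name, '_');
    siteDir = fullfile('dawnChorusBySite', site);
    if ~exist(siteDir, 'dir')
        mkdir(siteDir);   % create the site folder on first use
    end
    copyfile(fullfile(dawnFiles(k).folder, dawnFiles(k).name), siteDir);
end
```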
UMAP
VGGish feature extraction visualized with UMAP.
- We ran our spectrograms through VGGish to extract features.
- These extracted features come out as 128-by-N values, N being the number of images run through VGGish.
- We can think of each of these features as a point in 128-dimensional space.
- We then used UMAP to reduce the dimensionality from 128 down to 2 or 3 dimensions so that we can visualize it.
Below is an example of data entry for UMAP:
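A minimal illustration of the expected input shape, with placeholder values and assumed variable names:

```matlab
% One row per spectrogram, one column per VGGish feature. In practice
% these rows come from the VGGish extraction step, not random numbers.
features = rand(500, 128);    % 500 images, 128-D embedding each
siteIdx  = randi(3, 500, 1);  % assumed numeric site label per row
```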
Originally we only had three sites to extract features from, and UMAP showed that the features extracted from these three
sites were clustered closely with themselves.
Once we had more sites, we found that the features overlapped far too much for UMAP to show clusters in 2 dimensions.
We found that if we reduced the dimensions down to only 3, we could still visualize the features and were able to
see strong clustering in 3-dimensional space.
- This was a test of roughly 20 sites reduced down to 3 dimensions instead of 2.
You can clearly see in three dimensions that sites cluster together very well. Next we thought it would be interesting
to find features in two different ways.
1) The first way was to extract features site by site, and once features were extracted we would
assign the correct land class label to those features.
2) The other way was to split sites up by land cover and then split the entire land cover into train and test
images to run through VGGish.
[Figures: SitesByLandMass, SitesBySitesByLandMass]
For some land cover areas we only had a very small amount of data, which led to outliers that warped the overall 3-D image.
So we thought it would be interesting to visualize features from the three sites that we had the most data from.
[Figures: big3SiteByLandMass, 3MainSitesBySitesByLandMass]
As you can see, when looking at only these three sites there is very strong clustering of features with very few outliers.
Below is an example command to run UMAP.
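A sketch using the File Exchange UMAP package referenced below, with the assumed variables from the data-entry example:

```matlab
% Reduce the 128-D VGGish embeddings to 3 dimensions with run_umap,
% then plot the result colored by site.
[reduction, umap] = run_umap(features, 'n_components', 3);
scatter3(reduction(:,1), reduction(:,2), reduction(:,3), 10, siteIdx, 'filled');
title('VGGish features reduced to 3-D with UMAP');
```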
References:
https://www.mathworks.com/help/vision/ug/image-category-classification-using-bag-of-features.html
https://www.mathworks.com/help/audio/ref/vggish.html
https://www.mathworks.com/matlabcentral/fileexchange/71902-uniform-manifold-approximation-and-projection-umap
https://www.inaturalist.org/lists/60082-Sonoma-County-Birds
Conclusions
We were able to accomplish some of our goals, but were unfortunately blocked by limitations of both the technology and our current understanding. One main issue is the poor performance
of VGGish: even attempting to classify simple and, from what we can tell, quite distinct features leads to an accuracy of 60% or less. This casts some doubt on the validity of our regression.
We were instructed to use VGGish specifically, but it might be interesting to look into other pretrained neural nets and see how they perform. We were also unable to validate BoW as a legitimate
means of feature detection, though that could be down more to a lack of experience in audio feature detection. UMAP produced some very interesting results, leading to a lot of excitement
from the research team we were corresponding with. Overall we feel quite satisfied with the progress we were able to make in such a new and exciting area!