Description
Overview

The bag-of-words (BoW) approach, which you learned about in class, has been applied to many recognition problems in computer vision, for example object recognition [5, 7] and scene classification [6, 8] (see footnote 1). Video Google (https://www.robots.ox.ac.uk/~vgg/research/vgoogle/) is a good demonstration of this. The technique draws on ideas from Information Retrieval and has been studied extensively, so there are many variants of and extensions to the traditional approach explained in class. Two important extensions are pyramid matching [2, 4] and feature encoding [1].

What you will be doing:
1. You will take the responses of a filter bank on images and build a dictionary of visual words (Section 2).
2. You will learn a model of the visual world based on a bag of visual words, and use nearest neighbor to predict scene classes in a test set (Section 3).
3. Finally, we ask you to dig deeper into visual words by trying to visualize them (Section 4).

If the assignment looks long or difficult to you, don't panic! We provide you with step-by-step instructions to implement a working scene recognition framework. There are not many lines of code to write. However, it may take a few hours for the base system to run, so make sure to try each component on a subset of the data set before putting everything together. We have provided you with small subsets of the data for such testing (see Section 6).

Image Data

You will be working with a subset of the SUN database (see footnote 2). The data set contains 1600 images from 9 scene categories like "kitchen", "sky" and "desert".

A complete submission consists of a zip file with the following folders (please keep each system in a separate folder):
1. baseline/: the baseline spatial pyramid classification system that you implement throughout this homework;

Footnote 1: This homework aims at being largely self-contained; however, reading the listed papers (even without trying to truly understand them) is likely to be helpful.
Footnote 2: https://groups.csail.mit.edu/vision/SUN/

2. improved/ (if you do the extra credit): the custom system that you design to boost the performance of your system;
3. a write-up (.pdf format).

We provide you with a number of functions and scripts in the hope of alleviating some tedious or error-prone sections of the implementation. You can find a list of the provided files in Section 6. Please read their descriptions.

Figure 3: The provided multi-scale filter bank

1 Warming up with some theory (10pts)

In this section, you will answer some questions about the most basic bag-of-words approach. A suggested number of lines is given for each answer.

Question 1.1 (3 pts, 2-3 lines) Given an N × N image and an h × h Gaussian filter, how would you convolve the image with the filter in O(h) instead of O(h²)?

Question 1.2 (2 pts, 2-3 lines) Does the bag-of-words approach respect spatial information? Why?

Question 1.3 (5 pts, 2-3 lines) For images, why is it a better idea to use filter responses as features rather than raw pixel values?

2 Representing the World with Visual Words (40pts)

We have provided you with a multi-scale filter bank that you will use to understand the visual world. You can get it with the following provided function:

[filterBank] = createFilterBank()

Figure 4: An input image and filter responses for each of the 3 channels
Figure 5: A sample output, rendered with imagesc. (a) Color input. (b) Output map of visual words.

filterBank is a cell array (see footnote 3). Open the file createFilterBank.m to see how the filters were created. We have also provided you with a function to extract filter responses; it takes a 3-channel RGB image and a filter bank, and returns the responses of the filters on the image.

Footnote 3: Look at MATLAB's documentation for more details, but filterBank{i} is a 2D matrix, and filterBank{i} and filterBank{j} are not necessarily the same size.
[filterResponses] = extractFilterResponses(I, filterBank)

filterResponses is an N × M matrix, where N is the number of pixels in the input image and M is the number of filter responses (three times the size of the filter bank, since each filter is applied to every channel of a 3-channel image).

Question 2.1 (5 pts, 3-4 lines) Theory: What properties does each of the filter functions pick up? You should group the filters into broad categories (e.g., all the Gaussians). Answer in your write-up.

Creating Visual Words

You will now create a dictionary of visual words from the filter responses using k-means. After applying k-means, similar filter responses will be represented by the same visual word. You will use a dictionary of fixed size. Instead of using all of the filter responses (which can exceed the memory capacity of your computer), you will use the responses at α random pixels (try using randperm). If there are T training images, then you should collect a matrix filterResponses over all the images that is αT × M, where M is the number of filter responses. Then, to generate a visual words dictionary with K words, you will cluster the responses with k-means using the built-in MATLAB function kmeans as follows:

[unused, dictionary] = kmeans(filterResponses, K, 'EmptyAction', 'drop');

Question 2.2 (15 pts) You should write the following function to generate a dictionary given a list of images:

[filterBank, dictionary] = getFilterBankAndDictionary(imPaths)

As input, getFilterBankAndDictionary takes a cell array of strings containing the full paths to all images. You can load each file by iterating over 1:length(imPaths) and calling imread(imPaths{i}). Generate the αT filter responses over the training files and call k-means.
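To make the sampling-and-clustering step concrete, here is a minimal NumPy/scikit-learn sketch of the same idea. It is only an illustration: your solution must be the MATLAB function described above, and the Python names here (sample_responses, build_dictionary) are hypothetical stand-ins, not part of the provided code.

```python
# Hypothetical Python sketch of the dictionary-building step; the actual
# assignment is implemented in MATLAB with extractFilterResponses + kmeans.
import numpy as np
from sklearn.cluster import KMeans

def sample_responses(filter_responses, alpha, rng):
    """Keep the responses at alpha random pixels of one image.
    filter_responses: (num_pixels, num_filter_responses) array."""
    idx = rng.permutation(filter_responses.shape[0])[:alpha]  # like randperm
    return filter_responses[idx]

def build_dictionary(per_image_responses, K, alpha, seed=0):
    """Stack alpha*T sampled responses and cluster them into K visual words."""
    rng = np.random.default_rng(seed)
    stacked = np.vstack([sample_responses(r, alpha, rng)
                         for r in per_image_responses])  # (alpha*T, M)
    km = KMeans(n_clusters=K, n_init=5, random_state=seed).fit(stacked)
    return km.cluster_centers_  # (K, M): one row per visual word

# Toy usage: 4 "images", each with 100 pixels and 6 filter responses
rng = np.random.default_rng(1)
responses = [rng.normal(size=(100, 6)) for _ in range(4)]
dictionary = build_dictionary(responses, K=10, alpha=50)
print(dictionary.shape)  # (10, 6)
```

The only design point that carries over to MATLAB is the sampling: clustering αT rows instead of all pixels is what keeps k-means tractable in memory and time.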
A sensible initial value to try for K is between 100 and 300, and for α between 50 and 150, but these depend on your system configuration and you might want to play with them.

Question 2.3 (5 pts, 3-4 lines) Theory: How does the dictionary size affect the representational power of the bag-of-words pipeline, e.g., if I set K = 10 or K = 10,000? What is the problem with a dictionary that is too small or too large?

Once you are done with getFilterBankAndDictionary, call the provided script computeDictionary, which will pass in the file names, and go get a coffee. If all goes well, you will have a .mat file named dictionary.mat that contains the filter bank as well as the dictionary of visual words. Don't worry about "did-not-converge" errors. If it takes more than 10 minutes, reduce the number of clusters and samples. If you have debugging issues, try passing in a small number of training files manually.

Computing Visual Words

Question 2.4 (15 pts) We want to map each pixel in the image to its closest word in the dictionary. Create the following function to do this:

[wordMap] = getVisualWords(I, filterBank, dictionary)

wordMap is a matrix with the same width and height as I, where each pixel in wordMap is assigned the index of the visual word closest to the filter response at that pixel in I. We will use the standard Euclidean distance; to compute it efficiently, use the MATLAB function pdist2. A result is shown in Fig. 5.

Since this can be slow, we have provided a function batchToVisualWords(numberOfCores) that will apply your implementation of getVisualWords to every image in the training and testing sets. This function will automatically use as many cores as you tell it to (see footnote 5). For every image "X.jpg" in images/, there will be a corresponding file named "X.mat" in wordmaps/ containing the variable wordMap. You are highly encouraged to visualize a few of the resulting word maps; they should look similar to the ones in Figs. 2 and 5.
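The nearest-word lookup itself is a one-liner once the pairwise distances are computed. As an illustration only (the assignment uses MATLAB's pdist2; SciPy's cdist plays the same role here, and the name get_visual_words is a hypothetical stand-in):

```python
# Hypothetical NumPy/SciPy sketch of the getVisualWords logic.
import numpy as np
from scipy.spatial.distance import cdist

def get_visual_words(filter_responses, dictionary, height, width):
    """filter_responses: (H*W, M) per-pixel responses; dictionary: (K, M).
    Returns an (H, W) map of nearest-word indices under Euclidean distance."""
    d = cdist(filter_responses, dictionary)  # (H*W, K) pairwise distances
    word_map = np.argmin(d, axis=1)          # index of the closest word
    return word_map.reshape(height, width)

# Toy usage: a 3x4 "image" with 6 filter responses per pixel, 5 words
rng = np.random.default_rng(0)
resp = rng.normal(size=(12, 6))
dictionary = rng.normal(size=(5, 6))
wm = get_visual_words(resp, dictionary, 3, 4)
print(wm.shape)  # (3, 4)
```

Computing all H·W × K distances in one vectorized call, rather than looping over pixels, is exactly why the handout points you at pdist2.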
Footnote 5: Interested parties should investigate batchToVisualWords.m and the MATLAB commands matlabpool and parfor.

3 Building a Recognition System (95pts)

We have now formed a convenient representation for recognition. We will produce a basic recognition system with spatial pyramid matching. The goal of the system is presented in Fig. 1: given an image, classify ("name") the scene where the image was taken.

Traditional classification problems follow two phases: training and testing. At training time, the computer is given a pile of formatted data (i.e., a collection of feature vectors) with corresponding labels (e.g., "grass", "sky") and builds a model of how the data relates to the labels: "if green, then grass". At test time, the computer takes features and uses these rules to infer the label: e.g., "this is green, therefore it is grass".

In this assignment, we will use the simplest classification model: k-nearest neighbor. At test time, we will simply look at the query's top k nearest neighbors in the training set and transfer the majority label (see footnote 6). Here, you will take the query image and look up its k nearest neighbors in a collection of training images whose labels are already known. This approach works surprisingly well given a huge amount of data, e.g., a very cool graphics application from [3].

The components of any nearest-neighbor system are: features (how do you represent your instances?) and similarity (how do you compare instances in the feature space?). You will implement both.

Extracting Features

We will first represent an image with the bag-of-words approach: in each image, we simply count how often each word appears.

Question 3.1 (10 pts) Create a function getImageFeatures that extracts the histogram (see footnote 7) of visual words within the given image (i.e., the bag of visual words).
[h] = getImageFeatures(wordMap, dictionarySize)

As inputs, the function takes:
• wordMap, an H × W image containing the IDs of the visual words;
• dictionarySize, the number of visual words in the dictionary.

As output, the function returns h, a dictionarySize × 1 histogram that is L1 normalized (i.e., Σᵢ hᵢ = 1). You may wish to load a single visual word map, visualize it, and verify that your function is working correctly before proceeding.

Question 3.2 Extra credit (5 pts, 2-3 lines) Theory: Why is it a good idea to normalize the histogram of visual words? (Hint: see Section 3.2 of [4].)

Footnote 6: In case of a tie, use an odd k.
Footnote 7: Look at hist in MATLAB.

Multi-resolution: Spatial Pyramid Matching

Bag of words ignores the spatial structure of the image entirely. In many cases this is not desirable. One way to alleviate this issue (and thus often obtain a better representation) is spatial pyramid matching. The general idea is to divide the image into a small number of cells, and concatenate the histograms of these cells to the histogram of the original image, with a proper weight per cell. We will be using the work of [4] as a reference, and will point you to specific sections as needed.

Here we will implement a popular scheme that cuts the image into 2^l × 2^l cells, where l is the layer number. We treat each cell as a small image and count how often each visual word appears. This results in a histogram for every single cell in every layer. Finally, to represent the entire image, we concatenate all the histograms together after normalization by the total number of features in the image. If the pyramid has layers 0 through L and K visual words, the resulting vector has dimensionality K · Σ_{l=0}^{L} 4^l = K (4^(L+1) − 1) / 3.

Now comes the weighting scheme. Note that when concatenating all the histograms, histograms from different levels are assigned different weights.
Typically (see Section 3 of [4]), a histogram from layer l gets half the weight of a histogram from layer l + 1, with the exception of layer 0, which is assigned the same weight as layer 1. A popular choice is to set the weight for layers 0 and 1 to 2^(−L), and for the remaining layers to 2^(l−L−1) (e.g., in a three-layer spatial pyramid, L = 2 and the weights are 1/4, 1/4 and 1/2 for layers 0, 1 and 2 respectively; see Fig. 6). Note that the L1 norm (the absolute values of all dimensions summed together) of the final vector is 1.

Question 3.3 Extra credit (5 pts, 2-3 lines) Theory: Why do we weight the different levels of the histogram differently? (Hint: see Section 3.1 of [4].)

Figure 6: Spatial Pyramid Matching, from [4]. A toy example of a pyramid for L = 2. The image has three visual words, indicated by circles, diamonds, and crosses. We subdivide the image at three different levels of resolution. For each level of resolution and each channel, we count the features that fall in each spatial bin. Finally, we weight each spatial histogram.

Question 3.4 (20 pts) Create a function getImageFeaturesSPM that forms a multi-resolution representation of the given image:

[h] = getImageFeaturesSPM(layerNum, wordMap, dictionarySize)
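To make the layout, normalization, and weighting above concrete, here is a NumPy sketch of both the plain histogram (Question 3.1) and the spatial pyramid feature (Question 3.4). Your implementation must be the MATLAB functions above; the Python names (get_image_features, layer_weight, get_image_features_spm) are hypothetical stand-ins used only to make the math checkable.

```python
# Hypothetical NumPy sketch of getImageFeatures / getImageFeaturesSPM.
import numpy as np

def get_image_features(word_map, dictionary_size):
    """Q3.1: L1-normalized histogram of visual words (the bag of words)."""
    h = np.bincount(word_map.ravel(), minlength=dictionary_size).astype(float)
    return h / h.sum()

def layer_weight(l, L):
    """Weights from [4]: layers 0 and 1 get 2^(-L); layer l gets 2^(l-L-1)."""
    return 2.0 ** (-L) if l in (0, 1) else 2.0 ** (l - L - 1)

def get_image_features_spm(L, word_map, dictionary_size):
    """Q3.4: concatenated, weighted per-cell histograms for layers 0..L."""
    H, W = word_map.shape
    total = word_map.size        # normalize by total features in the image
    parts = []
    for l in range(L + 1):
        cells = 2 ** l           # this layer cuts the image into cells x cells
        w = layer_weight(l, L)
        for i in range(cells):
            for j in range(cells):
                cell = word_map[i * H // cells:(i + 1) * H // cells,
                                j * W // cells:(j + 1) * W // cells]
                counts = np.bincount(cell.ravel(), minlength=dictionary_size)
                parts.append(w * counts / total)
    return np.concatenate(parts)

# Toy usage: an 8x8 word map over a 5-word dictionary, three layers (L = 2)
rng = np.random.default_rng(0)
wm = rng.integers(0, 5, size=(8, 8))
h0 = get_image_features(wm, 5)        # plain bag of words, sums to 1
h = get_image_features_spm(2, wm, 5)  # K(4^(L+1)-1)/3 = 5*21 = 105 entries
print(len(h), round(h.sum(), 6))  # 105 1.0
```

Note how the dimensionality and L1 norm match the text: each layer's cell counts sum to the total pixel count, so each layer contributes exactly its weight to the norm, and the weights 1/4 + 1/4 + 1/2 sum to 1.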

