Learning Descriptors

This work has focused on techniques to learn SIFT-like local image descriptors from large training datasets. Training data were obtained from existing 3D reconstructions, created using Photo Tourism (which is related to my earlier work on 3D reconstruction from unordered image collections) and multi-view stereo. Though these methods used local feature techniques to generate geometry, in principle any source (such as LIDAR) could be used to generate training data for learning.

Given a large dataset of corresponding image data, we seek a transformation (descriptor function) of that data that maximises discrimination performance under a simple classifier (we use nearest neighbour). We have used two techniques to achieve this objective: 1) Powell minimisation: we optimise parameterised descriptors to maximise ROC performance; 2) LDA: we find a discriminant embedding that maximises the ratio of between-class to within-class variance.
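The LDA objective above can be sketched in a few lines: form the between-class and within-class scatter matrices and solve the resulting generalised eigenproblem. This is a minimal illustration of the idea on toy data, not the patch pipeline used in this work (the function name and regularisation constant are my own choices).

```python
import numpy as np

def lda_embedding(X, y, n_dims):
    """Fisher/LDA embedding: directions maximising the ratio of
    between-class to within-class variance (illustrative sketch)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # solve the generalised eigenproblem Sb v = lambda Sw v
    # (small ridge on Sw keeps it invertible)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_dims]]

# toy example: two well-separated 3-D classes, projected to 1-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
W = lda_embedding(X, y, 1)
proj = X @ W
```

The top eigenvectors of this problem are exactly the directions along which class means are far apart relative to the spread within each class, which is what makes them good discriminant features for nearest-neighbour matching.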

Results from these techniques are depicted above. The top figure shows optimal projections for LDE and an orthogonal variant (first two rows); the third row is PCA for comparison. The optimal linear discriminant features tend to focus on the centre of the image patch, and they tend to have the structure of circularly smoothed derivatives. The lower figure shows the pooling regions learnt using Powell minimisation. These tend to have a foveated structure, a strategy found to be successful by other computer vision researchers (e.g. GLOH, DAISY).
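To make the "foveated structure" concrete, the following sketch builds a hand-designed DAISY-style pooling layout: a fine central region surrounded by rings of regions whose spacing and pooling width grow with eccentricity. It is an assumed illustration of the geometry, not the learnt regions shown in the figure (those were optimised, not designed).

```python
import numpy as np

def foveated_pooling_grid(n_rings=2, samples_per_ring=8, base_radius=5.0):
    """Centres and Gaussian pooling widths of a DAISY-like foveated
    layout (illustrative sketch, hypothetical parameters).

    Returns (centres, sigmas): one central region plus concentric
    rings whose pooling width grows with distance from the centre."""
    centres = [(0.0, 0.0)]
    sigmas = [base_radius * 0.5]
    for r in range(1, n_rings + 1):
        radius = base_radius * r
        for k in range(samples_per_ring):
            theta = 2 * np.pi * k / samples_per_ring
            centres.append((radius * np.cos(theta), radius * np.sin(theta)))
        # coarser pooling further from the patch centre
        sigmas.extend([base_radius * 0.5 * r] * samples_per_ring)
    return np.array(centres), np.array(sigmas)

centres, sigmas = foveated_pooling_grid()
```

Pooling more coarsely away from the centre trades spatial precision at the periphery for robustness to geometric jitter, which is the intuition behind the learnt regions resembling GLOH and DAISY.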

The localised filter responses and foveated pooling regions also bear some resemblance to similar functionality in the human visual system.


Our training datasets, consisting of hundreds of thousands of patches of corresponding image data, are available here: Local Image Descriptors Training Data.