Representing visual objects is an open question relevant to many important problems in Computer Vision, such as classification and localisation. The state of the art allows thousands of visual objects to be learned and recognised under a wide range of variations, including lighting changes, occlusion, point of view, and differences between object instances. Only a small fraction of the literature addresses variation in depictive style (photographs, drawings, paintings, etc.), yet treating photographs and artwork on an equal footing is philosophically appealing and of real practical significance.
This paper describes a model for visual object classes that is learnable and able to classify over a broad range of depictive styles. The model is a graph in which simple shapes label region nodes. We use the model to classify twenty classes from CalTech 256, each class augmented with additional images to increase the variance in style. Compared with a Bag of Words classifier and a structure-only classifier, our model shows a significant increase in robustness to variance in depictive style.
We learn visual class models from input images, each labelled with the object it contains. There are three major steps: (i) build an "image graph" for each image in the training set; (ii) compute the class model as the median graph of the image graphs; and (iii) refine the class model by maximising classification performance over the training set. Figure 1 gives an overview of the proposed method.
Figure 1: Constructing a class model, from left to right. (a): An input collection (possibly in different depictive styles) used for training. (b): Probability maps for each input image, and graph models for each map. (c): The median graph model for the whole class. (d): The refined median graph as the final class model.
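To make steps (i) and (ii) concrete, the following is a minimal sketch under several assumptions that are not taken from the paper. The names `ImageGraph`, `make_graph`, `graph_distance`, `set_median` and `classify` are hypothetical. The distance is a crude stand-in that compares multisets of shape labels and of labelled edges, not the graph-matching measure a real system would need. The median is the set median (the training graph with the smallest total distance to the others), which may differ from the median graph used here. The refinement of step (iii) is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageGraph:
    """Region graph: labels[i] is the shape label of region node i."""
    labels: tuple
    edges: frozenset  # undirected edges, each a frozenset({i, j}) with i != j


def make_graph(labels, edges):
    """Build an ImageGraph from a label list and (i, j) index pairs."""
    return ImageGraph(tuple(labels), frozenset(frozenset(e) for e in edges))


def _symmetric_difference_size(a, b):
    """Number of items that appear in one multiset but not the other."""
    return sum(((a - b) + (b - a)).values())


def graph_distance(a, b):
    """Stand-in distance (an assumption): label and edge multiset mismatch."""
    label_cost = _symmetric_difference_size(Counter(a.labels), Counter(b.labels))

    def edge_bag(g):
        # Describe each edge by the sorted pair of labels at its endpoints.
        return Counter(tuple(sorted(g.labels[i] for i in e)) for e in g.edges)

    edge_cost = _symmetric_difference_size(edge_bag(a), edge_bag(b))
    return label_cost + edge_cost


def set_median(graphs):
    """Return the graph in the set with minimal summed distance to all others."""
    return min(graphs, key=lambda g: sum(graph_distance(g, h) for h in graphs))


def classify(graph, class_models):
    """Assign the class whose model graph is nearest under graph_distance."""
    return min(class_models, key=lambda c: graph_distance(graph, class_models[c]))
```

With one model per class stored in a dictionary such as `{"face": set_median(face_graphs)}`, a query graph is assigned to the class of its nearest model. Restricting the median to the training set keeps the sketch short; constructing a new graph that minimises the summed distance is considerably harder in general.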