Learning Graphs to Model Visual Objects Across Different Depictive Styles

Abstract


The cross-depiction problem is that of recognizing visual objects regardless of whether they are photographed, painted, drawn, etc. It is an under-researched area, but advances in it would be of genuine significance to computer vision. Experiments confirm the intuition that the variance across the photo and art domains is much larger than the variance within either domain alone, which introduces additional challenges. As this new area develops, a public dataset is important for comparing techniques. Currently no such dataset exists for cross-depiction -- a gap we fill. We also provide benchmarks for leading techniques, demonstrating that none performs consistently well on the cross-depiction problem.

We train across all depictions at once, and then test across all depictions. We provide a modeling schema (a framework) for visual object classes that generalises across a broad collection of depictive styles. Each object is modeled as a graph with multi-labeled nodes and learned weights. Experiments show that our representation improves upon Deformable Part Models for detection and Bag of Words models for classification.

A New Dataset: Photo-Art-50


We release a challenging, annotated image dataset for researchers to evaluate their cross-depiction techniques. The dataset contains 50 object categories, with 90 to 138 images per category, approximately half photographs and half art images. All 50 categories appear in Caltech-256, and a few also appear in the PASCAL VOC Challenge and the ETH-Shape dataset. Some of the photographs are taken from Caltech-256; the rest come from Google search. Art images were gathered using a range of keywords to cover a wide gamut of depictive styles, e.g., 'horse cartoon', 'horse drawing', 'horse painting', 'horse sketches', 'horse kid drawing', etc. We then manually selected images in which a meaningful object area occupies a reasonable portion of the image. Finally, we manually provide ground-truth bounding boxes, so that object categorisation can be evaluated with annotated object locations.

Fig1. - Our photo-art dataset: Photo-Art-50, containing 50 object categories. Each category is displayed with one art image and one photo image.

K-L Divergence


To discover how much statistical difference exists between the feature distributions of the photo and art domains -- and to verify that our dataset is of value to the cross-depiction problem -- we compute the symmetric Kullback-Leibler (K-L) divergence between art and photo feature distributions. A small K-L divergence means that the two distributions are similar. As Tab1 shows, the photo-art divergence (0.466) is far larger than either within-domain divergence, and larger than any of the cross-domain pairs studied in [Gong-cvpr12].
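For concreteness, a minimal sketch of the symmetric divergence computation follows. It assumes the features have already been quantised into histograms (e.g., BoW histograms pooled over each domain); the smoothing constant and the averaging convention used to symmetrise are our assumptions, since the text does not specify them.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric K-L divergence between two feature histograms.

    p and q are assumed to be non-negative feature histograms pooled
    over each domain; eps avoids log(0) and is our assumption.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()      # normalise to probability distributions
    kl_pq = np.sum(p * np.log(p / q))    # KL(P || Q)
    kl_qp = np.sum(q * np.log(q / p))    # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)         # symmetrised by averaging
```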

Tab1. - Comparison of K-L divergence between domain pairs

Cross-domain sets in [Gong-cvpr12]:
  Caltech-Amazon   0.079
  Caltech-DSLR     0.271
  Amazon-Webcam    0.239
  DSLR-Amazon      0.292
  DSLR-Webcam      0.047

Photo-Art-50:
  Photo-Art        0.466
  Art-Art          0.042
  Photo-Photo      0.000

Evaluation of Baseline Approaches


We evaluate three baseline methods on our dataset.

1. Bag-of-Words (BoW) [csurka-eccv04]. It is chosen because it is well known, widely used, and performs well on standard image classification problems. We assess this popular framework with different local descriptors, including the well-known self-similarity descriptor (SSD) [shechtman-cvpr07], which was designed for cross-depiction matching. (A minimal sketch of the BoW pipeline appears after this list.)

2. Deformable Part Model (DPM) [felzenszwalb-pami10]. It is a state-of-the-art detection method, which we adapt to the classification task using the annotated object bounding boxes in the training set. It is chosen because it powerfully models variation in both appearance and non-rigid deformation.

Tab2. - Comparison of categorisation performance on our proposed Photo-Art-50 dataset, with 30 images per category for training. Average correct rates are reported over 5 rounds with random training-test splits. 'A+P' stands for a mixed training set of 15 photo images and 15 art images.

3. Geodesic Flow Kernel (GFK) [gong-cvpr12]. To investigate whether domain adaptation techniques help solve the cross-depiction problem, we evaluate this recent state-of-the-art domain adaptation technique.

Classification accuracies without (OrigFeat, PCA_S and PCA_T) and with (GFK_PCA and GFK_LDA) domain-adaptive methods on Photo-Art-50. Left: training on artworks, testing on photographs. Right: training on photographs, testing on artworks. The experiments use 30 images per class for training, repeated 5 times with random training-test splits. 'OrigFeat' means classifying with the original 5000-bin histogram; all other methods use 49-dimensional projected features.
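As a point of reference for the baselines above, here is a minimal sketch of a BoW classification pipeline in the spirit of [csurka-eccv04]. The clustering algorithm, the linear SVM, and the 5000-word vocabulary (matching the histogram size quoted for the GFK experiments) are our assumptions; local descriptor extraction (SIFT, SSD, etc.) is left abstract.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

def build_vocabulary(descriptor_sets, k=5000):
    """Quantise local descriptors (one array per training image) into k visual words."""
    return MiniBatchKMeans(n_clusters=k).fit(np.vstack(descriptor_sets))

def bow_histogram(vocab, descriptors):
    """L1-normalised histogram of visual-word assignments for one image."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bow_classifier(descriptor_sets, labels, k=5000):
    """Fit a vocabulary on the training descriptors, then a linear SVM on the histograms."""
    vocab = build_vocabulary(descriptor_sets, k)
    X = np.array([bow_histogram(vocab, d) for d in descriptor_sets])
    return vocab, LinearSVC().fit(X, labels)
```

At test time, an image is classified by computing its histogram with the learned vocabulary and passing it to the SVM; swapping the descriptor (e.g., SSD for SIFT) changes only the inputs to this pipeline.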

Algorithm - Learning Graphs to Model Visual Objects


We model a visual class using a graph with multiple labels on each node; weights on arcs and nodes indicate their relative importance (salience) to the object description. Visual class models can be learned from examples in a database that contains photographs, drawings, paintings, etc.
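To fix ideas, a minimal sketch of this structure is given below. The field names and types are illustrative assumptions, not the paper's notation (the paper writes the model as a pair <G*,β>): each node carries one appearance label per depictive style it has been observed in, plus a learned salience weight, and each arc carries a learned pairwise weight.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: list              # multiple appearance labels, one per depictive style observed
    weight: float = 1.0       # learned salience of this node

@dataclass
class ClassModel:
    nodes: list = field(default_factory=list)   # multi-labeled nodes
    arcs: dict = field(default_factory=dict)    # (i, j) -> learned arc weight
```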

Matching

Fig4. - Detection and matching process. A graph G is first extracted from the target image based on the input model <G*,β>; the matching process is then formulated as a graph-matching problem. The matched subgraph of G gives the final detection result. Φ(H,o) in the figure denotes the attributes obtained at position o.
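The objective behind this formulation can be sketched as a weighted sum of unary (appearance) and pairwise (relational) terms. Everything below is an assumption on our part -- the similarity functions, the exact shape of the score -- since the figure only states that matching is posed as a graph-matching problem; `model` is assumed to follow the ClassModel sketch above.

```python
def matching_score(model, target_feats, assignment, unary_sim, pair_sim):
    """Score one candidate assignment of model nodes to target-graph nodes.

    assignment[i] is the index of the target node matched to model node i;
    unary_sim and pair_sim are appearance and relational similarity
    functions, left abstract here.
    """
    score = 0.0
    for i, node in enumerate(model.nodes):
        # multi-labeled unary term: the target node need only match the
        # node's best label, so any one depictive style can explain it
        best = max(unary_sim(lbl, target_feats[assignment[i]]) for lbl in node.labels)
        score += node.weight * best
    for (i, j), w in model.arcs.items():
        # pairwise term: consistency of the relation between the matched nodes
        score += w * pair_sim(target_feats[assignment[i]], target_feats[assignment[j]])
    return score
```

In practice the assignment maximising this score would be found by a graph-matching solver over subgraphs of G, not by enumerating assignments.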

 

Learning

Fig5. - Learning a class model, from left to right. (a): An input collection (different depictions) used for training. (b): Training graphs are extracted. (c): The model is learned in two steps, one for G* and one for β. (d): The two are combined into the final class model.
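The second step of (c) can be sketched as follows. The structured-perceptron-style update is our assumption -- the paper only states that G* and β are learned in separate steps -- and `match(model, feats)` is assumed to return the best-assignment score together with its per-node and per-arc contributions.

```python
def learn_weights(model, pos_examples, neg_examples, match, lr=0.1, epochs=10):
    """Learn beta (node and arc weights) given a fixed structure G*.

    Adjusts the weights so that positive windows outscore negative
    ones by a margin; a hypothetical sketch, not the paper's method.
    """
    for _ in range(epochs):
        for pos, neg in zip(pos_examples, neg_examples):
            s_pos, unary_pos, pair_pos = match(model, pos)
            s_neg, unary_neg, pair_neg = match(model, neg)
            if s_pos - s_neg >= 1.0:
                continue                      # margin already satisfied
            for i, node in enumerate(model.nodes):
                node.weight += lr * (unary_pos[i] - unary_neg[i])
            for arc in model.arcs:
                model.arcs[arc] += lr * (pair_pos[arc] - pair_neg[arc])
    return model
```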

 

Detection Results

Tab4. - Detection results on the Photo-Art-50 dataset: per-class average precision scores for different methods: DPM, a single-labeled graph model with learned β, and our proposed multi-labeled graph model with and without learned β. The mAP (mean average precision) is shown in the last column.

 

(Fig7 panels: people, car, horse, bike, bottle, and giraffe results, one class per row.)

Fig7. - Examples of high-scoring detections on our cross-depiction dataset, selected from the top 20 highest-scoring detections in each class. The framed images (last in each row) illustrate false positives for each category. In each detected window, the object is matched with the learned model graph; each node of the matched graph indicates a part of the object, larger circles represent more important nodes, and darker lines denote stronger relationships.