The cross-depiction problem is that of recognizing visual objects regardless of whether they are photographed, painted, drawn, etc. It is an under-researched area, but advances would be of genuine significance to computer vision. Experiments confirm the intuition that the variance across the photo and art domains is much larger than within either domain alone, which introduces additional challenges. As this new area develops, a public dataset is important for comparing techniques. Currently no such dataset exists for cross-depiction -- a gap we fill. We also provide benchmarks for leading techniques, demonstrating that none performs consistently well on the cross-depiction problem.
We train across all depictions at once, and then test across all depictions. We provide a modeling schema (a framework) for visual object classes that generalises across a broad collection of depictive styles. Each object is modeled as a graph with multi-labeled nodes and learned weights. Experiments show that our representation improves upon Deformable Part Models for detection and Bag-of-Words models for classification.
A New Dataset: Photo-Art-50
We release a challenging, annotated image dataset for researchers to evaluate their cross-depiction techniques. The dataset contains 50 object categories with 90 to 138 images per category, approximately half photos and half art images. All 50 categories appear in Caltech-256, and a few also appear in the PASCAL VOC Challenge and the ETH-Shape dataset. Some of the photo images are taken from Caltech-256; the rest come from Google search. Art images were retrieved using several keywords per category to cover a wide gamut of depictive styles, e.g., `horse cartoon', `horse drawing', `horse painting', `horse sketches', `horse kid drawing', etc. We then manually selected images containing a meaningful object region of reasonable size. We also manually provide ground-truth bounding boxes, so the dataset supports object categorisation with annotated object locations.
In order to quantify the statistical difference between the feature distributions of the photo and art domains -- and to confirm that our dataset is of value to the cross-depiction problem -- we compute the symmetric Kullback-Leibler divergence between art and photo feature distributions. A small K-L divergence means that the two distributions are similar.
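The symmetric K-L divergence described above can be sketched as follows; this is a minimal illustration over normalised feature histograms, not the paper's exact feature pipeline (the small epsilon for numerical stability is our own assumption):

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two feature
    histograms p and q: KL(p||q) + KL(q||p).
    A small value means the two distributions are similar."""
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0)
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()  # normalise to probability distributions
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Identical histograms give near-zero divergence; disjoint ones give a
# large divergence, signalling a large domain gap.
same = symmetric_kl([1, 1, 1], [1, 1, 1])
disjoint = symmetric_kl([1, 0], [0, 1])
```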
Tab. - Symmetric K-L divergences: cross-domain sets in [Gong-cvpr12] vs. Photo-Art-50.
Evaluation of Baseline Approaches
We evaluate three baseline methods on our dataset.
1. Bag-of-Words (BoW) [csurka-eccv04]. It is chosen because it is well known, widely used, and performs well on standard image classification problems. We assess the performance of this popular framework with different local descriptors, including the well-known self-similarity descriptor (SSD) [shechtman-cvpr07], which was designed to address the cross-depiction matching problem.
2. Deformable Part Model (DPM) [felzenszwalb-pami10]. It is a state-of-the-art detection method, which we adapt to the classification task using the annotated object bounding boxes in the training set. It is chosen because it models variation in both appearance and non-rigid deformation.
3. Geodesic Flow Kernel (GFK) [gong-cvpr12]. In order to investigate whether domain adaptation techniques help solve the cross-depiction problem, we evaluate this recent state-of-the-art domain adaptation technique.
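The BoW baseline above follows the standard pipeline: cluster local descriptors into a visual vocabulary, encode each image as a histogram of visual words, and train a classifier. A minimal sketch with scikit-learn is shown below; the synthetic descriptors, `fake_image` helper, and all parameter values (vocabulary size, noise scale) are illustrative stand-ins for real SIFT or SSD features, not the paper's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def encode(descriptors, vocab):
    """Quantise an image's local descriptors against the visual
    vocabulary and return a normalised bag-of-words histogram."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

def fake_image(class_mean):
    """Hypothetical stand-in for local features (e.g. SIFT/SSD):
    50 descriptors of dimension 128 drawn around a class mean."""
    return class_mean + rng.normal(scale=0.1, size=(50, 128))

means = [rng.normal(size=128) for _ in range(3)]  # 3 toy classes
train = [(fake_image(m), c) for c, m in enumerate(means) for _ in range(10)]

# 1. Learn the visual vocabulary by clustering all training descriptors.
vocab = KMeans(n_clusters=100, n_init=3, random_state=0).fit(
    np.vstack([d for d, _ in train]))

# 2. Encode each training image and fit a linear SVM.
X = np.array([encode(d, vocab) for d, _ in train])
y = np.array([c for _, c in train])
clf = LinearSVC().fit(X, y)

# 3. Classify a held-out image.
pred = clf.predict([encode(fake_image(means[1]), vocab)])[0]
```

In the cross-depiction setting, training and test images may come from different depiction domains (photo vs. art), which is exactly where this histogram representation degrades.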
Classification accuracies without (OrigFeat, PCA_S and PCA_T) and with (GFK_PCA and GFK_LDA) domain adaptation on Photo-Art-50. Left: training on artworks, testing on photographs. Right: training on photographs, testing on artworks. The experiments use 30 images per class for training, repeated 5 times with random training-test splits. 'OrigFeat' denotes classification with the original 5000-bin histogram; all other methods use 49-dimensional projected features.
Algorithm - Learning Graphs to Model Visual Objects
We model visual classes using a graph with multiple labels on each node; weights on arcs and nodes indicate their relative importance (salience) to the object description. Visual class models can be learned from a database of examples containing photographs, drawings, paintings, etc.
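The structure described above can be sketched as a simple data type: each node carries one appearance label per depictive style plus a learned salience weight, and arcs between nodes carry learned weights. This is an illustrative skeleton under our own naming assumptions (`Node`, `ClassModel`, the style keys), not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Multi-labeled node: one appearance attribute per depictive style
    # (style names and feature vectors here are purely illustrative).
    labels: dict            # style -> feature vector
    weight: float = 1.0     # learned node salience

@dataclass
class ClassModel:
    nodes: list = field(default_factory=list)
    arcs: dict = field(default_factory=dict)   # (i, j) -> learned arc weight

    def add_node(self, labels, weight=1.0):
        self.nodes.append(Node(labels, weight))
        return len(self.nodes) - 1

    def add_arc(self, i, j, weight=1.0):
        self.arcs[(i, j)] = weight

# A tiny two-node model of a hypothetical "horse" class:
model = ClassModel()
head = model.add_node({"photo": [0.2, 0.8], "sketch": [0.1, 0.9]}, weight=1.5)
body = model.add_node({"photo": [0.7, 0.3], "sketch": [0.6, 0.4]}, weight=1.0)
model.add_arc(head, body, weight=0.8)
```

Detection then amounts to matching such a model graph against a graph extracted from the target image, with node and arc weights controlling each part's contribution to the matching score.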
Fig4. - Detection and matching process. A graph G is first extracted from the target image based on the input model <G*,β>; the matching process is then formulated as a graph matching problem. The matched subgraph of G gives the final detection result. Φ(H,o) in the figure denotes the attributes obtained at position o.
Fig5. - Learning a class model, from left to right. (a): An input collection (different depictions) used for training. (b): Extracted training graphs. (c): Learning the model in two steps, one for G* and one for β. (d): The combination forms the final class model.
Tab4. - Detection results on the Photo-Art-50 dataset: per-class average precision scores for DPM, a single-labeled graph model with learned β, and our proposed multi-labeled graph model with and without learned β. The mAP (mean average precision) is shown in the last column.