Qasim Iqbal and J. K. Aggarwal
Computer and Vision Research Center
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, Texas 78712, USA
In this paper, we apply perceptual grouping principles to image retrieval. Perceptual grouping refers to the human visual ability to extract significant image relations from low-level primitive image features without prior knowledge of the image content. We illustrate the efficacy of our approach by applying the general rules of perceptual grouping to an image database of still monocular grayscale outdoor images taken from a ground-level camera, in order to retrieve images that contain buildings.
The human visual system can detect many classes of patterns and statistically significant arrangements of image elements. Gestalt psychologists have observed the tendency of the human visual system to perceive configurational wholes, with rules that govern the uniformity of psychological grouping for perception and recognition, as opposed to recognition by analysis of discrete primitive image features. The grouping principles proposed by Gestalt psychologists embodied such concepts as grouping by proximity, similarity, continuation, closure, and symmetry. The grouping of low-level features provides a higher-level structure, and these higher-level structures may be further combined to yield still higher-level structures. The process may be repeated until a meaningful semantic representation is achieved that may be used by a higher-level reasoning process.
A number of techniques have been proposed for the application of perceptual grouping principles to solve practical computer vision problems [2,3,4,5,6,7]. A collection of current techniques for the detection of manmade objects, including buildings, may also be found in the literature. Building detection using perceptual grouping is an emerging application of the Gestalt laws of psychology to computer vision. Generic models are used to locate buildings in the images [9,10]. Detection of buildings in aerial images using the principles of stereo vision has also been investigated.
Computer vision imposes unique requirements on the representation and manipulation of image data and knowledge. At the lowest level of a computer vision system, an image may be interpreted purely as numbers. At the highest level, the semantic information presented by the image may be interpreted as a semantic world model that provides the final understanding of the scene. Consequently, the image metadata extracted by current techniques in CBIR falls into two broad categories: view-based and model-based.
Current view-based retrieval techniques analyze image metadata at a lower level on a strictly quantitative basis for color, intensity, contrast and texture features. Surveys of current CBIR techniques are available. One of the earliest measures of global image similarity was the color histogram, and several modifications to this approach have been suggested. Analyses of local image properties using texture orientation, multiscale Gaussian derivative filters, Gabor filters and wavelet transforms have also been used. User-defined queries, composed of color, texture, and user-supplied shape models to match against images in a database, have been treated in [19,20,21,22,23].
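As an illustration of the view-based category (and not the method of this paper), global similarity by color histogram can be sketched as follows. The bin count, the quantization scheme, and the use of histogram intersection as the similarity measure are illustrative assumptions, not the choices of any particular cited system:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels, count pixels in the
    resulting bins**3 cells, and normalize to a probability distribution."""
    q = (image.astype(np.int64) * bins) // 256          # per-channel bin index
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Histogram intersection similarity: 1.0 for identical histograms."""
    return float(np.minimum(h1, h2).sum())
```

Two images are then ranked by the intersection of their normalized histograms; note that this measure carries no information about where in the image each color occurs, which is exactly the limitation discussed below.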
Model-based retrieval uses extracted image metadata to define model properties and constructs a 3D CAD model of the manmade object of interest. Model-based techniques extract semantic information. They require a priori knowledge about the shape of objects (object models), which is used to predict image features for matching against features in the image or in a transformed feature space. Segmentation of an image into regions, followed by analysis of their structural information to recognize the desired model, is a top-down approach. Automatic segmentation and recognition of objects via object models is a difficult step. Consequently, image retrieval by recognition of similar objects is used less frequently than view-based approaches.
At the lowest level of computer vision, potentially useful image events such as edges and line segments can be extracted from an image without any knowledge of the image content. In an unconstrained environment, without knowledge of the viewing angle or depth information, a bottom-up approach appears more promising for the extraction of semantic information: start from the lower-level primitive image features and hierarchically group them into higher-level structures according to the principles of perceptual grouping. We use this approach for the retrieval of building images.
Current view-based techniques may not be suitable for our task because, as mentioned before, they analyze image metadata at a lower level on a strictly quantitative basis for color, intensity and texture features. These techniques are unable to extract semantic information describing the structural interrelationships among different primitive image features at a higher level in a manmade structure such as a building.
Due to the difficulties in automatic segmentation, which result in inaccurate model extraction, the top-down approach embodied by current model-based techniques also may not be feasible. Lack of viewing angle and depth information precludes estimating the 3D projection of a building as a specific 2D structure in the image plane. In addition, there is no well-defined (rigid) peripheral shape representation of a building in ground-level images, since the shape is governed by architectural style. Unless the model database is large enough to incorporate a large number of views, the results may be inaccurate. Further complexity arises from the number of buildings present in the image, occlusion, and lack of a priori information about the scene for model matching.
It may be observed that the work mentioned above relating to the detection (localization) of buildings using perceptual grouping has utilized aerial images [9,10,11], whereas we use images taken from ground level. Moreover, retrieval of building images in a CBIR framework has not been treated so far using either aerial or ground-level images. The only related work uses orientation of texture to sort photographs and classify them as ``city'' or ``suburb'' scenes.
The organization of the rest of the paper is as follows: section 2 outlines the identification of salient features extracted by perceptual grouping, section 3 presents a Bayesian framework for utilizing these features for decision-making, section 4 presents the results obtained, and, finally, section 5 presents the conclusions.
Buildings are manmade objects with sharp edges and straight boundaries. Searching for the highest-level features representing the peripheral shape of a building may give inaccurate results because of the large search space. However, the presence of a building in an image will generate a large number of significant edges, junctions, parallel lines and groups, in comparison with an image with predominantly non-building objects. These structures are generated by the presence of corners, windows, doors, boundaries of the building, etc. These intermediate-level features exhibit regularity and relationships, and are strong evidence of structure present in an image.
Straight lines extracted from non-building images are generally randomly distributed. The presence of the distinguishing intermediate-level features mentioned above follows the ``principle of non-accidentalness'' and, therefore, such features are more likely to be generated by buildings. Hence, these features can be considered discriminating criteria between a building image and a non-building image.
To detect the presence of buildings from the intermediate-level features
using the principles of perceptual grouping, the following features are
extracted hierarchically from an image:
straight line segments,
longer linear structures obtained by grouping collinear segments,
``L'' junctions,
``U'' junctions,
``significant'' parallel lines,
``significant'' parallel groups.
The symmetric, orthogonally elongated region of a given width with a line segment as its medial axis (figure 1a) is searched to collect a set of segments (which also includes the original segment, denoted the base segment) that are replaced by a representative line, provided the following conditions are satisfied:
Equations 2 and 3 ensure that all segments in the set are approximately collinear with the base segment. Equations 4 and 5 require that any segment in the set must either be close to another segment (end-points falling within a circular neighborhood of a given radius) or overlap at least one other segment in the set, respectively, to ensure continuity.
To fix the representative line we need one point through which it passes (we use its mid-point), its orientation, and its length. The mid-point and orientation of the representative line are obtained as weighted averages of the mid-points and orientations of all line segments in the set, respectively, with the weights determined by the lengths of the segments. To obtain the length, the end-points of all segments in the set are orthogonally projected onto the representative line, and the two farthest points are taken as its end-points.
The process is continued until no further merging occurs. Termination is guaranteed after a finite number of iterations, since there is a finite number of line segments and their number decreases after each iteration. Next, all lines obtained are analyzed, and only those meeting the following criteria are retained:
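The construction of the representative line described above can be sketched in code. The segment representation and the doubled-angle method for averaging orientations on the half-circle are implementation assumptions; the collinearity and proximity tests of equations 2-5 are assumed to have already selected the input set:

```python
import numpy as np

def merge_group(segments):
    """Replace a set of near-collinear segments ((x1, y1), (x2, y2)) by one
    representative line. Its mid-point and orientation are length-weighted
    averages; its extent comes from orthogonally projecting all end-points
    onto it and keeping the two farthest."""
    segs = [np.asarray(s, float) for s in segments]
    lengths = np.array([np.linalg.norm(s[1] - s[0]) for s in segs])
    mids = np.array([(s[0] + s[1]) / 2 for s in segs])
    angles = np.array([np.arctan2(*(s[1] - s[0])[::-1]) % np.pi for s in segs])
    w = lengths / lengths.sum()                      # length-based weights
    mid = (w[:, None] * mids).sum(axis=0)
    # average orientations on [0, pi) via doubled-angle unit vectors
    ang = 0.5 * np.arctan2((w * np.sin(2 * angles)).sum(),
                           (w * np.cos(2 * angles)).sum()) % np.pi
    direction = np.array([np.cos(ang), np.sin(ang)])
    # orthogonal projection of every end-point onto the representative axis
    pts = np.concatenate(segs)
    t = (pts - mid) @ direction
    return mid + t.min() * direction, mid + t.max() * direction
```

For example, merging the two collinear segments (0,0)-(4,0) and (5,0)-(10,0) yields a single representative line spanning (0,0) to (10,0).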
For the second case, two ``L'' junctions are grouped together to form a ``U'' junction if they satisfy
where the first pair of lines are those in the two ``L'' junctions perceived to be ``pointing'' towards each other (figure 1c), the second pair are the two ``other'' lines in the ``L'' junctions, the representative line is obtained by joining the mid-points of the internal end-points of the lines forming the ``L'' junctions, and the remaining quantity is a threshold.
Equations 9 and 10 imply that the angles between each ``pointing'' line and the representative line should be small enough to ensure a valid ``U'' junction. The threshold in equation 11 may be set to any value close to, but greater than, the collinear-grouping threshold of section 2.2. If this threshold is set less than or equal to that value, then no groupings from disjoint ``L'' junctions with end-points falling in a small neighborhood may be possible, as the ``pointing'' lines may already have been grouped together to form a larger collinear line. We have chosen the threshold value accordingly for convenience. If this condition is not satisfied, then equation 12 examines the ``other'' lines to see if there is overlap between them to constitute valid lines for a possible ``U'' junction. If more than one ``L'' junction fulfills all of the above criteria for a particular target ``L'' junction, then the ``L'' junction with the shortest representative line is matched with the target ``L'' junction.
It should be noted that a ``U'' junction resulting from an ``L'' junction and a ``single'' line is not possible. This is due to the fact that if the single line is close to one of the lines of an ``L'' junction, then that single line already forms a valid ``L'' junction with the line in the ``L'' junction close to it. Hence, only combinations of ``L'' junctions yield the desired ``U'' junctions.
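A minimal sketch of the basic ``L'' junction test between two line segments follows. The neighborhood radius and minimum angle are hypothetical values, not the thresholds of this paper:

```python
import numpy as np

def is_L_junction(seg1, seg2, r=5.0, min_angle=np.deg2rad(45)):
    """Two segments form a candidate "L" junction when some pair of their
    end-points falls within a circular neighborhood of radius r and the
    angle between the segments is sufficiently large."""
    a, b = np.asarray(seg1, float), np.asarray(seg2, float)
    d1, d2 = a[1] - a[0], b[1] - b[0]
    cosang = abs(d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    angle = np.arccos(np.clip(cosang, 0.0, 1.0))
    close = any(np.linalg.norm(p - q) <= r for p in a for q in b)
    return close and angle >= min_angle
```

Two perpendicular segments sharing an end-point pass the test, while distant parallel segments do not; ``U'' junctions are then built from pairs of such ``L'' junctions as described above.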
(a) the two lines have ``similar'' lengths, i.e.,
(b) the two lines are ``relatively'' close, i.e.,
(c) the two lines have ``sufficient overlap'' in one of the three projections, i.e.,
(a) at least one line in the parallel group is enclosed by an ``L'' or a ``U'' junction.
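The similar-length, closeness, and overlap criteria for a parallel pair can be sketched as follows. The threshold values, and the simplification of projecting onto the first segment's direction only (rather than the three projections of criterion (c)), are assumptions for illustration:

```python
import numpy as np

def parallel_pair(seg1, seg2, len_ratio=0.5, max_gap=20.0, min_overlap=0.5):
    """Test two roughly parallel segments for similar lengths, relative
    closeness, and sufficient overlap of their axis projections."""
    a, b = np.asarray(seg1, float), np.asarray(seg2, float)
    la, lb = np.linalg.norm(a[1] - a[0]), np.linalg.norm(b[1] - b[0])
    # (a) similar lengths
    if min(la, lb) / max(la, lb) < len_ratio:
        return False
    # (b) relatively close: bounded distance between mid-points
    if np.linalg.norm((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2) > max_gap:
        return False
    # (c) sufficient overlap of projections onto the first segment's direction
    d = (a[1] - a[0]) / la
    ta = sorted([a[0] @ d, a[1] @ d])
    tb = sorted([b[0] @ d, b[1] @ d])
    overlap = min(ta[1], tb[1]) - max(ta[0], tb[0])
    return overlap / min(la, lb) >= min_overlap
```

Two horizontal segments stacked a few pixels apart pass all three tests, whereas a distant pair fails the closeness test.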
As is evident, each component of the feature vector lies in the interval [0, 1]; i.e., an image is mapped into a feature space bounded by a cube with an edge of length 1. The feature vector represents the coordinates of the mapped image in this space. In our experiments the feature vector is obtained using the threshold values shown in table 1; the angles are displayed in degrees. These threshold values are kept constant for the generation of the results presented in section 4.
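One way such a bounded feature vector can be produced from raw grouping counts is to saturate and normalize each count. The saturation caps below are purely illustrative assumptions; the paper's actual mapping is governed by the thresholds of table 1:

```python
def feature_vector(n_lines, n_junctions, n_parallel, caps=(50, 30, 20)):
    """Map raw counts of perceptual groupings into the unit cube [0, 1]^3 by
    clipping each count at an assumed saturation cap and dividing by it."""
    counts = (n_lines, n_junctions, n_parallel)
    return tuple(min(c, cap) / cap for c, cap in zip(counts, caps))
```

Every image thus lands inside the unit cube regardless of how many groupings it contains, which keeps the subsequent decision stage well behaved.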
Bayesian formulation of the approach

Bayes' rule is fundamental in decision theory. Mathematically, it is expressed as