nextupprevious
Next:Identification of classes

Applying perceptual grouping to content-based image retrieval: Building images[*]


Qasim Iqbal and J. K. Aggarwal
Computer and Vision Research Center
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, Texas 78712, USA
aggarwaljk@mail.utexas.edu

Abstract

This paper presents an application of perceptual grouping rules for content-based image retrieval. The semantic interrelationships between different primitive image features are exploited by perceptual grouping to detect the presence of manmade structures. A methodology based on these principles in a Bayesian framework for the retrieval of building images, and the results obtained are presented. The image database consists of monocular grayscale outdoor images taken from a ground-level camera.

Introduction

With recent developments in multimedia, content-based image retrieval (CBIR) is becoming increasingly important in visual information management systems. Research is underway on CBIR systems that can retrieve images in an unconstrained environment. However, general solutions to the key problems relating to regions of interest, segmentation, and retrieval are still being sought [1].

In this paper, we have applied perceptual grouping principles for image retrieval. Perceptual grouping refers to the human visual ability to extract significant image relations from low-level primitive image features without prior knowledge of the image content [2]. We illustrate the efficacy of our idea by applying the general rules of perceptual grouping to an image database consisting of still monocular grayscale outdoor images taken from a ground-level camera in order to retrieve images that contain buildings.

The human visual system can detect many classes of patterns and statistically significant arrangements of image elements. Gestalt psychologists have observed the tendency of the human visual system to perceive configurational wholes, with rules that govern the uniformity of psychological grouping for perception and recognition, as opposed to recognition by analysis of discrete primitive image features. The grouping principles proposed by Gestalt psychologists embodied such concepts as grouping by proximity, similarity, continuation, closure, and symmetry [3]. The grouping of low-level features provides a higher-level structure. These higher-level structures may be further combined to yield another level of higher-level structures. The process may be repeated until a meaningful semantic representation is achieved that may be used by a higher-level reasoning process.

A number of techniques have been proposed for the application of perceptual grouping principles to solve practical computer vision problems [2,3,4,5,6,7]. A collection of current techniques for the detection of manmade objects, including buildings, may be found in [8]. Building detection using perceptual grouping is an emerging area of the application of the Gestalt laws of psychology to computer vision. Generic models are used to locate buildings in the images [9,10]. Detection of buildings in aerial images using the principles of stereo vision has also been investigated [11].

Computer vision imposes unique requirements on the representation and manipulation of image data and knowledge [12]. At the lowest level of a computer vision system, an image may be interpreted purely as numbers. At the highest level, the semantic information presented by the image may be interpreted as a semantic world model that provides the final understanding of the scene. Consequently, the image metadata extracted by the current techniques in CBIR falls into two broad categories: view-based and model-based.

Current view-based retrieval techniques analyze image metadata at a lower level on a strictly quantitative basis for color, intensity, contrast and texture features. A survey of the current techniques used for CBIR is presented in [13]. One of the earliest techniques for global image similarity was to use a color histogram [14]. Several modifications have been suggested to this approach. Analyses of local image properties using texture orientation [15], multiscale Gaussian derivative filters [16], Gabor filters [17] and wavelet transforms [18] have also been used. User-defined queries, comprised of color, texture, and user-supplied shape models to match with images in a database have been treated in [19,20,21,22,23].

Model-based retrieval, which uses extracted image metadata to define the model properties, constructs a 3D CAD model of the manmade object of interest. Model-based techniques extract semantic information. These techniques require a priori knowledge about the shape of objects (object models), which is used to predict image features, for matching to features in the image or in a transformed feature space. Segmentation of an image to obtain different regions, followed by the analysis of their structural information to recognize the desired model, is a top-down approach. Automatic segmentation and recognition of objects via object models is a difficult step. Consequently, image retrieval using recognition of similar objects is used less frequently than view-based approaches.

At the lowest level of computer vision, potentially useful image events such as edges and line segments can be extracted from an image without any knowledge of the image content. In an unconstrained environment, without the knowledge of the viewing angle and the depth information, a bottom-up approach appears to be more promising for the extraction of semantic information. What the bottom-up approach suggests is to start from the lower-level primitive image features and hierarchically group them into higher-level structures according to the principles of perceptual grouping. We use this approach for the retrieval of building images.

Current view-based techniques may not be suitable for our task, as mentioned before, they analyze image metadata at a lower level on a strictly quantitative basis for color, intensity and texture features. The current techniques are unable to extract semantic information that describes the structural interrelationships of different primitive image features with each other at a higher level in a manmade structure such as a building.

Due to the difficulties in automatic segmentation, resulting in inaccurate model extraction, the top-down approach embodied by current model-based techniques also may not be feasible. Lack of viewing angle and depth information precludes the estimation of the 3D projection of a building as a specific 2D structure in the image plane. In addition, there is no well-defined (rigid) peripheral shape representation of a building in ground-level images as it is governed by architecture. Unless the model database is large enough to incorporate a large number of views, the results may be inaccurate. Additional complexity may be imposed by the number of buildings present in the image, occlusion, and lack of a priori information about the scene for model matching.

It may be observed that the work mentioned above relating to detection (localization) of buildings using perceptual grouping has utilized aerial images [9,10,11], whereas we have used images taken from ground-level. Moreover, retrieval of building images in a CBIR framework has not been treated so far using either aerial or ground-level images. The only work related to this task is presented in [15], where orientation of texture has been used to sort photos and classify them as ``city or suburb'' scenes.

The organization of the rest of the paper is as follows: section 2 outlines the identification of salient features extracted by perceptual grouping, section 3 presents a Bayesian framework for utilizing these features for decision-making, section 4 presents the results obtained, and, finally, section 5 presents the conclusions.

Extracting salient information from the images

This section focuses on the development of the methodology for the extraction of salient information from an image via perceptual grouping.

Buildings are manmade objects with sharp edges and straight boundaries. Searching for the highest-level features representing the peripheral shape of a building may give inaccurate results because of the large search space. However, the presence of a building in an image will generate a large number of significant edges, junctions, parallel lines and groups, in comparison with an image with predominantly non-building objects. These structures are generated by the presence of corners, windows, doors, boundaries of the building, etc. These intermediate-level features exhibit regularity and relationships, and are strong evidence of structure present in an image.

Straight lines extracted from non-building images are generally randomly distributed. The presence of the distinguishing intermediate-level features mentioned above follow the ``principle of non-accidentalness'' [3] and, therefore, are more likely to be generated by buildings. Hence, these features can be considered to be the discriminating criteria between a building image and a non-building image.

To detect the presence of buildings from the intermediate-level features using the principles of perceptual grouping, the following features are extracted hierarchically from an image: straight line segments, longer linear lines, ``L'' junctions, ``U'' junctions, parallel lines, parallel groups, ``significant'' parallel groups.
 

Extraction of straight line segments

Burns' straight lines detector [24] has been used to extract straight line segments. To eliminate low contrast line segments, the edge strength associated with a segment is examined and only those segments which meet the following criteria are retained
 
\begin{displaymath}{\cal E}(L_i) > \delta_e \end{displaymath} (1)
where ${\cal E}(\cdot)$ represents the ratio of the edge strength associated with a segment $L_i$ to the maximum edge strength in the image, and $\delta_e$ is a threshold. Edge strength is calculated as the average gradient magnitude in the support region bounded by the segment.

Extraction of longer linear lines

In practice, the straight line segments obtained from the Burns' operator are fragmented and should be grouped together to obtain longer linear lines at a higher granularity level. The distinction between the terms ``line segments'' and ``lines'' will be made on the basis that a line is obtained by grouping these fragmented line segments.

The symmetric orthogonally elongated region of width $2\delta_n$ with a line segment $L_b$ as the medial axis (figure 1a) is searched to collect a set of segments, $S_{fe}$ (which includes the original segment denoted as base segment also), that are replaced by a representative line$L_r$ provided the following conditions are satisfied

(a)

 
\begin{displaymath}{\cal A} (L_b,L_i) < \delta_{a}\end{displaymath} (2)
 
\begin{displaymath}\max\{{\cal D}_{o}(L_b,e_{i_1}),{\cal D}_{o}(L_b,e_{i_2})\} < \delta_{n}\end{displaymath} (3)
(b) either
 
\begin{displaymath}{\cal D}(e_i,e_j) < \delta_{n}\end{displaymath} (4)
or
 
\begin{displaymath}{\vartheta} (\varphi_{L_j}(L_{i}), L_j) > 0\end{displaymath} (5)
where $L_b$ denotes the base segment, $L_i$ and $L_j$ denote any two segments in $S_{fe}$${\cal A}(\cdot)$ denotes the absolute value of the smaller angle in radians between the two segments, and $\delta_{a}$ is a threshold. In the above equations $e_i$ and $e_j$ are the end-points of the segments $L_i$ and $L_j$, respectively, ($e_{i_1}$ and $e_{i_2}$ are the two end-points of the segment $L_i$), ${\cal D}_{o}(\cdot)$ is the orthogonal distance of an end-point to a segment, and${\cal D}(\cdot)$ represents the distance between those end-points of $L_i$ and $L_j$ which are nearer to each other. In the last equation$\varphi_{L_j}(L_{i})$ is the orthogonal projection of $L_{i}$ onto $L_{j}$, and $\vartheta(\cdot)$ outputs the length of the overlap between any two lines.

Equations 2 and 3 ensure that all segments in$S_{fe}$ are approximately collinear to $L_b$. Equations 4 and 5 require that any segment in $S_{fe}$ must be either close to (end-points fall in a circular neighborhood of radius$\delta_n$), or overlap at least one other segment in $S_{fe}$, respectively, to ensure continuity.

To fix the line $L_r$ we need one point through which it passes (we calculate the mid-point of $L_r$), its orientation, and its length. The mid-point and orientation of $L_r$ are obtained as the weighted average of the mid-points and the orientations of all line segments in the set $S_{fe}$, respectively. The weights are determined using the lengths of the segments. To obtain the length of $L_r$ the end-points of all segments in $S_{fe}$ are orthogonally projected onto $L_r$, and the two farthest points are taken as the end-points of $L_r$ [5].

The process is continued until no merging occurs. Termination of the process is guaranteed after a finite number of lines, as there are a finite number of line segments and their number decreases after each iteration. Next, all lines obtained are analyzed and only those lines meeting the following criteria are retained

 
\begin{displaymath}{\cal L}(L_i) > \delta_l\end{displaymath} (6)
where ${\cal L}(\cdot)$ outputs the length and $\delta_l$ is a threshold. These lines are referred to as the ``retained lines''.

Extraction of ``L'' junctions

A set of non-parallel lines terminating at a common point is called a cotermination. In practice, a small neighborhood is constructed around a point to allow the lines to terminate in a small common region. ``L'' junctions are formed by those coterminations which have an internal angle close to $\frac {\pi} {2}$ radians. These junctions are strong evidence of salient corners in an image and, hence, indicators of a manmade structure that may be attributed to a building. Each pair of lines $\{L_i,L_j\}$, where $i \neq j$, in an ``L'' junction satisfies the following conditions
 
\begin{displaymath}{\cal D}(e_i,e_j) < {\delta_n}\end{displaymath} (7)
 
\begin{displaymath}\frac{\pi} {2} \: - \: {\cal A}(L_i,L_j) < \delta_{la}\end{displaymath} (8)
where $\delta_{la}$ is a threshold. Equation 8 requires that the angle formed by $L_i$ and $L_j$ be within a threshold $\delta_{la}$ from$\frac {\pi} {2}$. In an oblique view the angle enclosed by an ``L'' junction may be far from $\frac {\pi} {2}$. Relaxation of the value of $\delta_{la}$ can accommodate an oblique viewing angle.

Extraction of ``U'' junctions

A ``U'' junction is formed by the alignment of two ``L'' junctions. ``U'' junctions are generated by regular manmade structures such as windows and doors in a building, and their presence in an image is a strong indication of the presence of a building. Two type of ``U'' junctions are possible: (1) those ``U'' junctions which have a common line between the ``L'' junctions, and (2) those which do not have a common line. For the first case, the direction in which the ``non-common'' lines ``point'' should be approximately similar for the two ``L'' junctions to constitute a valid ``U'' junction (figure 1b).

For the second case, two ``L'' junctions are grouped together to form a ``U'' junction if they satisfy

(a)

 
\begin{displaymath}{\cal A}(L_{11},L_{ur}) < \delta_{ua}\end{displaymath} (9)
 
\begin{displaymath}{\cal A}(L_{21},L_{ur}) < \delta_{ua}\end{displaymath} (10)
(b) either
 
\begin{displaymath}{\cal D}(e_{11},e_{21}) < {2\delta_{n}}\end{displaymath} (11)
or
 
\begin{displaymath}{\vartheta (\varphi_{L_{21}}(L_{11}),L_{21}) > 0}\end{displaymath} (12)
(c) $L_{12}$ and $L_{22}$ point in approximately similar direction.

where $L_{11}$ and $L_{21}$ are the lines in the two ``L'' junctions which are perceived to be ``pointing'' towards each other (figure 1c), $L_{12}$ and $L_{22}$ are the two ``other'' lines in the ``L'' junctions,$L_{ur}$ is a representative line obtained by joining the mid-points of the internal end-points of the lines forming the ``L'' junctions, and $\delta_{ua}$ is a threshold.

Equations 9 and 10 imply that the angles between $L_{ur}$ and $L_{11}$, and $L_{ur}$ and $L_{21}$ should be less than $\delta_{ua}$ to ensure a valid ``U'' junction. The threshold in equation 11 may be set to any value close to but greater than $\delta_n$ (in section 2.2). If this threshold is set less than or equal to $\delta_n$, then no groupings from disjoint ``L'' junctions, with their end-points falling in a small neighborhood, may be possible as $L_{11}$ and $L_{21}$ may already have been grouped together to form a larger collinear line. We have chosen the threshold value to be $2\delta_n$ for convenience. If this condition is not satisfied, then equation 12 examines the lines $L_{11}$ and $L_{21}$ to see if there is overlap between them to constitute valid lines for a possible ``U'' junction. If more than one ``L'' junction fulfills all of the above criteria for a particular target ``L'' junction, then the ``L'' junction for which the length of $L_{ur}$ is the smallest is matched with the target ``L'' junction.

It should be noted that a ``U'' junction resulting from an ``L'' junction and a ``single'' line is not possible. This is due to the fact that if the single line is close to one of the lines of an ``L'' junction, then that single line already forms a valid ``L'' junction with the line in the ``L'' junction close to it. Hence, only combinations of ``L'' junctions yield the desired ``U'' junctions.

Extraction of parallel lines

All ``retained'' lines are searched to extract sets of parallel lines. A set is collected by picking a base line $L_b$ in the image and finding all lines that satisfy
 
\begin{displaymath}{\cal A} (L_b,L_i) < \delta_{a}\end{displaymath} (13)
where $L_i$ is a line other than $L_b$ and $\delta_{a}$ is a threshold. The parallel lines obtained are grouped into clusters of similar orientations to avoid a brute-force search in the detection of parallel groups.

Extraction of parallel groups

Parallel groups are obtained by grouping those parallel lines which significantly overlap each other. The overlapping is determined by orthogonal projection. However, certain groupings of parallel lines that have intrinsic orientation different from their local orientation may not be grouped together (figure 1d). We employ a general procedure for extracting parallel groups that also accounts for the difference in the intrinsic and local orientations of lines [2]. A parallel group is a set of lines $S_{pg}=\{L_1,L_2,\ldots,L_M\}$, where $M \geq 2$, which satisfies the following conditions, which state that for each line $L_i \inS_{pg}$, there exists a line $L_j \in S_{pg}$, where $i \neq j$, such that

(a) $L_i$ and $L_j$ have ``similar'' lengths, i.e.,

 
\begin{displaymath}\frac {{\cal L}(L_i)} {{\cal L}(L_j)} > \delta_{pg_1} \end{displaymath} (14)
where $\delta_{pg_1}$ is a threshold. ($L_i$ is assumed to be shorter in length than $L_j$, here, and in the subsequent.) The threshold $\delta_{pg_1}$ may be relaxed to accommodate a possible oblique view.

(b) $L_i$ and $L_j$ are ``relatively'' close, i.e.,

 
\begin{displaymath}\frac {{\cal D}(e_{{mid}_{i}},e_{{mid}_{j}})} {{\cal L}_{avg}(L_i,L_j)} <\delta_{pg_2}\end{displaymath} (15)
where $e_{{mid}_{i}}$ and $e_{{mid}_{j}}$ are the mid-points of $L_i$ and $L_j$ respectively, ${\cal L}_{avg}(\cdot)$ denotes the average length of the two lines, and $\delta_{pg_2}$ is a threshold.

(c) $L_i$ and $L_j$ have ``sufficient overlap'' in one of the three projections, i.e.,

 
\begin{displaymath}\frac {\vartheta(\varphi_{{\cal Y}_{axis}}(L_i),\varphi_{{\c......)} {{\cal L}(\varphi_{{\cal Y}_{axis}}(L_i))}> \delta_{pg_3} \end{displaymath} (16)
 
\begin{displaymath}\frac {\vartheta(\varphi_{{\cal X}_{axis}}(L_i),\varphi_{{\c......)} {{\cal L}(\varphi_{{\cal X}_{axis}}(L_i))}> \delta_{pg_3} \end{displaymath} (17)
 
\begin{displaymath}\frac {\vartheta(\varphi_{L_j}(L_i),L_j)}{{\cal L}(\varphi_{L_j}(L_i))} > \delta_{pg_3}\end{displaymath} (18)
where ${\cal Y}_{axis}$ and ${\cal X}_{axis}$ represent the $Y$-axis and the $X$-axis of an image, respectively, and $\delta_{pg_3}$ is a threshold.
 
(a) Longer linear line

 
(b) ``U'' junction

 
(c) ``U'' junction

 
(d) Parallel groups

 
Figure 1:Visualization of some of the groupings. (a) Longer linear line (b) ``U'' junction (c) ``U'' junction (d) Parallel groups.

Extraction of ``significant'' parallel groups

Not all of the parallel groups obtained in the last section may be constituted as ``significant'', since some of them may be generated by structures other than the sharp corner edges generated by windows, doors, etc., in a building. Therefore, the following criteria is adopted to identify significant parallel groups

(a) at least one line in the parallel group is enclosed by an ``L'' or a ``U'' junction.

(b) either

 
\begin{displaymath}{\cal A}(L_i,{\cal Y}_{axis})< \delta_{spga}\end{displaymath} (19)
or
 
\begin{displaymath}{\cal A}(L_i,{\cal X}_{axis})< \delta_{spga}\end{displaymath} (20)
where $L_i$ is any line in a parallel group and $\delta_{spga}$ is a threshold. The threshold $\delta_{la}$ (in section 2.3) accommodates an oblique viewing angle in (a). Equations 19 and 20 provide a control on the obliqueness of the viewing angle. The threshold $\delta_{spga}$ may be set equal to $\frac {\pi} {4} $ radians to allow parallel groups in any orientation (viewing angle).

Feature extraction

In the general form, the feature vector extracted is expressed as
 
\begin{displaymath}{\bf X} = (x_1, x_2, x_3)^t\end{displaymath} (21)
where
 
\begin{displaymath}x_1 = \frac { \mbox{lines in \lq\lq L'' junctions}}{\mbox{ Total \char93  of \lq\lq retained'' lines}}\end{displaymath} (22)
 
\begin{displaymath}x_2 = \frac { \mbox{lines in \lq\lq U'' junctions}}{\mbox{ Total \char93  of \lq\lq retained'' lines}}\end{displaymath} (23)
 
\begin{displaymath}x_3 = \frac { \mbox{lines in \lq\lq significant'' parallel groups}}{\mbox{ Total \char93  of \lq\lq retained'' lines}}\end{displaymath} (24)
The numerators in equations 22, 23 and 24 are normalized by the number of ``retained'' lines to ensure fair comparison between images. Common lines in the respective calculations of$x_1$$x_2$ and $x_3$ are only counted once.

As evident, $x_i \in [0,1]$, where $i \in\{1,2,3\}$, i.e., an image is mapped into a feature space bounded by a cube that has an edge of length 1. The feature vector $\bf X$ represents the coordinates of the mapped image in this space. In our experiments $\bf X$ is obtained by using the threshold values shown in table 1. The angles are displayed in degrees. These thresholds values are kept constant for the generation of the results presented in section 4.

Table 1: Values of thresholds used for generating the results.
$\delta_e$ $\delta_{a}$ $\delta_{n}$ $\delta_l$ $\delta_{la}$
0.1 $5^o$ 5 10 $30^o$
$\delta_{ua}$ $\delta_{{pg}_1}$ $\delta_{{pg}_2}$ $\delta_{{pg}_3}$ $\delta_{{spga}}$
$30^o$ 0.5 2.5 0.5 $30^o$

We have striven to develop a system with no constraints on the viewing angle or depth. In the absence of any knowledge of these two fundamental quantities, we believe that the empirical selection of the above parameters is justified, and that the threshold values shown in table 1may be treated as constants instead of variable parameters. Moreover, since these values are applied constantly to all images in the database, all images are equally affected. These values were carefully chosen to accommodate a wide variety of viewing angles and depths.

Bayesian framework for retrieving building images

This section focuses on the treatment of extracted feature vectors in a Bayesian framework.

Image acquisition

The 150 images in our database were acquired using Sony's Digital Mavica camera. They included a wide variety of building and non-building images. The $640 \times 480$ color images were converted to grayscale by using the XV program. Figure 2 shows some of the images in the database.

startsection subsection2@.2ex plus 1ex .2ex plus .2exsetsizesetsizesubsize12ptxipt12ptxiptBayesian formulation of the approach Bayes' rule is fundamental in decision theory. Mathematically it is expressed as [25]

 
\begin{displaymath}P(\Omega_i\vert{\bf X}) = \frac {p({\bf X}\vert\Omega_i)P(\Omega_i)}{p({\bf X})}\end{displaymath} (25)
where ${\bf X} = (x_1, x_2, x_3)^t$ is an extracted feature vector,$P(\Omega_i\vert{\bf X})$ is the a posteriori probability of observing the class $\Omega_i$ given $X$$p({\bf X}\vert\Omega_i)$ is the class conditional probability density function of $\bf X$ w.r.t. the class $\Omega_i$$P(\Omega_i)$ is the a priori probability of observing the class $\Omega_i$$p({\bf X}) = \sum_{j=1}^{N} p({\bf X}\vert\Omega_j)P(\Omega_j)$ is the probability density function of observing $\bf X$, and $N$ is the number of classes.



nextupprevious
Next:Identification of classes
Qasim Iqbal 2001-03-01