Qasim Iqbal and J. K. Aggarwal
Computer and Vision Research Center
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, Texas 78712, USA
aggarwaljk@mail.utexas.edu
Our overall aim is to extend the current state of content-based image retrieval (CBIR), which is limited to lower-level image descriptions such as histograms of pixel values [3] and texture analysis [4]. Histogram and texture analysis techniques analyze an image at a lower level on a strictly quantitative basis and are unable to capture higher-level scene descriptions that relate different primitive image features to each other. Such descriptions help in image retrieval when queries are concerned with locating images containing manmade objects (such as buildings or architectural objects). Moreover, such descriptions are less sensitive to illumination changes than lower-level histogram and texture analysis.
It is known that higher-level semantic knowledge, exhibited as the structural information in an image, may be used as effective domain knowledge to isolate potential regions of interest comprised of manmade objects [5]. In our previous work we developed a structure-based retrieval methodology that detects the presence of manmade objects in an image [1]. In this paper we compare the efficacy of structure with that of histogram and texture analysis for that purpose.
We extract structure by applying the general principles of perceptual grouping. Perceptual grouping refers to the human visual ability to extract significant image relations from lower-level primitive image features without any knowledge of the image content. It uses such concepts as grouping by proximity, similarity, continuation, closure, and symmetry [5] to group primitive image features into meaningful higher-level image relations.
For the comparison of the three paradigms, viz., histograms, texture
and structure, we develop retrieval methodologies for each of them that
query a database of monocular grayscale outdoor images taken from a ground-level
camera in order to retrieve images that contain buildings. The organization
of the rest of the paper is as follows: section 2 describes the framework
underlying the formulation of the retrieval methodologies, section 3 outlines
the results obtained, and finally, section 4 presents the conclusions.
We assume that the image space consists of three classes: building, non-building, and intermediate. The intermediate class accounts for the fact that some natural outdoor images are ambiguous and, therefore, difficult to classify even for human operators; it is convenient to treat them as belonging to a third class.
Each of these three classes has an associated discriminant function. Representing an image classifier in canonical form through this set of discriminant functions, the classifier assigns a feature vector, and hence the image from which it is extracted, to the class whose discriminant function takes the largest value for that vector (equation 1).
In the histogram and texture analysis methods, classification is performed using a nearest neighbor classifier, whereas the structural analysis uses a classifier based on Bayesian decision theory. The formulation of the appropriate discriminant functions for the three paradigms is described in the next sections.
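The canonical-form decision rule described above can be written compactly as follows. The symbols are assumed notation rather than the paper's own: $\mathbf{x}$ is the feature vector, $\omega_1, \omega_2, \omega_3$ stand for the three classes, and $g_i$ is the discriminant function of class $\omega_i$.

```latex
\text{assign } \mathbf{x} \text{ to } \omega_i
\quad \text{if} \quad
g_i(\mathbf{x}) > g_j(\mathbf{x})
\quad \text{for all } j \neq i,
\qquad i, j \in \{1, 2, 3\}
```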
Our database consists of 150 images: 55 building images, 51 non-building images, and 44 intermediate images.
We use a total of 30 images, 10 from each of the three classes, as training
images for each individual classifier. The remaining 120 images are used for testing.
The normalized grayscale histogram extracted from an image is a
256-dimensional vector contained in the histogram space, which is
represented by a unit hypercube; i.e., each of its 256 components lies in the interval [0, 1].
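As a minimal sketch of how such a feature vector could be computed, the function below counts the 256 gray levels and divides by the number of pixels. Normalizing by the pixel count is an assumption; the text only states that the histogram lies in the unit hypercube. The function name and the flat-list input format are hypothetical.

```python
def normalized_histogram(pixels, levels=256):
    """Return the normalized grayscale histogram of an image.

    `pixels` is any iterable of integer gray values in [0, levels - 1].
    Normalization by pixel count is an assumption, not taken from the paper.
    """
    counts = [0] * levels
    total = 0
    for value in pixels:
        if not 0 <= value < levels:
            raise ValueError(f"gray value {value} outside [0, {levels - 1}]")
        counts[value] += 1
        total += 1
    if total == 0:
        raise ValueError("image has no pixels")
    # Every component is a fraction of the pixel count, so it lies in [0, 1].
    return [count / total for count in counts]
```

With this normalization the components also sum to one, so the vector lies on a simplex inside the unit hypercube.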
Images are assigned to one of the three classes using the nearest neighbor
classifier. The nearest neighbor classifier assigns a pattern histogram
(feature vector) to the same class as the training feature vector nearest
to it in the histogram space; i.e., the histogram is assigned to the class
that has the highest discriminant value under the rule of equation 1.
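The nearest neighbor rule for feature vectors can be sketched as below. Two choices here are assumptions that the surviving text does not confirm: the Euclidean distance, and the use of the negated distance to the closest training vector of a class as that class's discriminant value. All names are hypothetical.

```python
import math


def discriminants(query, training_set):
    """Map each class label to the negated distance from `query` to the
    nearest training vector of that class (an assumed discriminant)."""
    scores = {}
    for vector, label in training_set:
        score = -math.dist(query, vector)  # Euclidean distance is assumed
        if label not in scores or score > scores[label]:
            scores[label] = score
    return scores


def classify(query, training_set):
    """Assign `query` to the class with the highest discriminant value."""
    scores = discriminants(query, training_set)
    return max(scores, key=scores.get)


# Toy two-dimensional vectors stand in for 256-dimensional histograms.
training = [
    ([0.0, 0.0], "building"),
    ([1.0, 1.0], "non-building"),
    ([0.5, 0.0], "intermediate"),
]
label = classify([0.1, 0.0], training)
```

Because the class with the largest negated distance is the class of the closest training vector, picking the maximum discriminant is equivalent to the nearest neighbor assignment.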
Buildings are manmade objects with sharp edges and straight boundaries.
Searching for the highest level features representing the peripheral shape
of a building may give inaccurate results because of the large search space.
However, the presence of a building in an image will generate a large number
of significant edges, junctions, parallel lines and groups, in comparison
with an image with predominantly non-building objects. These structures
are generated by the presence of corners, windows, doors, boundaries of
the building, etc. These intermediate-level features exhibit regularity
and relationships, and are strong evidence of structure present in an image.
Straight lines extracted from non-building images are generally randomly distributed. The presence of the distinguishing intermediate-level features mentioned above follows the "principle of non-accidentalness" [6]; these features are, therefore, more likely to be generated by buildings. Hence, they can discriminate between a building image and a non-building image.
We detect the presence of buildings in an unconstrained environment, i.e., with no constraints on the viewing angle and depth, by extracting these intermediate-level features using the principles of perceptual grouping. The following features are extracted hierarchically from an image: line segments, longer linear lines, "L" junctions, "U" junctions, parallel lines, parallel groups, and "significant" parallel groups. The perceptual grouping rules of similarity, continuity, and parallelism are used to extract these features. The details of the extraction of these descriptions may be found in [1].
Some of these features are self-explanatory; the others are explained
below. Longer linear lines are obtained by extending approximately
collinear fragmented line segments that either overlap or are close to
each other. The lines obtained are further pruned to eliminate those
that are very small. All other features are extracted using these longer
linear lines. Parallel groups are obtained by putting constraints on the
amount of overlap of the orthogonal projections of parallel lines and of their
projections along the x and y axes, while incorporating differences in local
and intrinsic orientation of the lines. "Significant" parallel groups are
extracted by further constraining the search to only those parallel groups
in which at least one member line is enclosed by an "L" or "U" junction,
while accommodating the obliqueness of the viewing angle.
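The parallelism test described above can be illustrated with a simplified sketch. It checks only two of the ingredients mentioned in the text: similar orientation, and the overlap of one segment's orthogonal projection onto the other. It does not reproduce the extraction procedure of [1]; the thresholds `max_angle` and `min_overlap`, and all function names, are hypothetical.

```python
import math


def orientation(segment):
    """Undirected orientation of a segment, in radians within [0, pi)."""
    (x1, y1), (x2, y2) = segment
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def angle_difference(a, b):
    """Smallest difference between two undirected orientations."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def projection_overlap(seg_a, seg_b):
    """Fraction of seg_a covered by the orthogonal projection of seg_b."""
    (ax1, ay1), (ax2, ay2) = seg_a
    dx, dy = ax2 - ax1, ay2 - ay1
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    # Scalar positions of seg_b's endpoints along the direction of seg_a.
    t = sorted((px - ax1) * ux + (py - ay1) * uy for px, py in seg_b)
    overlap = min(t[1], length) - max(t[0], 0.0)
    return max(overlap, 0.0) / length


def roughly_parallel_pair(seg_a, seg_b,
                          max_angle=math.radians(5), min_overlap=0.5):
    """True if the segments have similar orientation and overlap enough.
    Both thresholds are illustrative values, not taken from the paper."""
    similar = angle_difference(orientation(seg_a), orientation(seg_b)) <= max_angle
    return similar and projection_overlap(seg_a, seg_b) >= min_overlap
```

For example, a segment from (0, 0) to (10, 0) and a segment from (2, 3) to (8, 3) share the same orientation and the second covers six tenths of the first, so the pair passes with the default thresholds.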
We have assumed that the class-conditional probability density of the feature vector for each class is multivariate Gaussian.
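For reference, under a multivariate Gaussian assumption the discriminant function of Bayesian decision theory takes the standard textbook form below. The notation is assumed and may not match the paper's own equations term for term: $\mathbf{x}$ is the $d$-dimensional feature vector, $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are the mean and covariance of class $\omega_i$, and $P(\omega_i)$ is its prior probability.

```latex
g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)
= -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)
  - \tfrac{d}{2} \ln 2\pi
  - \tfrac{1}{2} \ln \lvert \boldsymbol{\Sigma}_i \rvert
  + \ln P(\omega_i)
```

In such a formulation the mean and covariance of each class would be estimated from that class's training images.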
We performed two experiments on the database for each of the three
paradigms. The first experiment measured recall and precision. Recall is
defined as the fraction of the total number of images in a particular class
that are retrieved correctly by the system for that class. Precision is
defined as the fraction of images retrieved that actually belong to that
class. Images are retrieved by classifying them into one of the three classes
using the discriminant functions of those classes.
Recall and precision for histogram, texture, and structural analysis are shown in Table 1. The first column shows the three classes. The second, third, and fourth columns show the number of images (T) in each of the three classes, the number of images retrieved (R) for the respective classes, and the number of correct images (C) in the set of images retrieved, respectively; recall is therefore C/T and precision is C/R. As is evident from the table, the recall and precision obtained by histogram and texture analysis are lower than those obtained by structural analysis.
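The two measures follow directly from the counts T, R, and C defined above. The sketch below derives the counts from lists of true and predicted labels; the function names and the toy labels are hypothetical and are not results from the paper.

```python
def retrieval_counts(true_labels, predicted_labels, cls):
    """Return (T, R, C) for class `cls`: images in the class, images
    retrieved for the class, and retrieved images that are correct."""
    total = sum(1 for t in true_labels if t == cls)
    retrieved = sum(1 for p in predicted_labels if p == cls)
    correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                  if t == cls and p == cls)
    return total, retrieved, correct


def recall_precision(total, retrieved, correct):
    """Recall is C/T and precision is C/R; both are 0.0 when undefined."""
    recall = correct / total if total else 0.0
    precision = correct / retrieved if retrieved else 0.0
    return recall, precision


# Toy labels for illustration only.
truth = ["building", "building", "building", "non-building", "intermediate"]
predicted = ["building", "building", "non-building", "building", "intermediate"]
counts = retrieval_counts(truth, predicted, "building")
measures = recall_precision(*counts)
```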
The second experiment retrieved the best matches for each of the three
classes and measured the efficiency of the system. The best matches for a class
were obtained by sorting the images by the value of that class's discriminant
function in descending order. The number of images within the best matches that
actually belong to a particular class is shown, in ranges of 20 images,
in Table 2. Efficiency is defined
as the number of images (O) that actually belong to a particular class
among the first T best matches for that class, expressed
as a fraction of T. These values are also shown in the table.
Again, the distribution of best matches
and the efficiency obtained by histogram and texture analysis are inferior
to those obtained using structure.
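The efficiency measure can be sketched as follows, reading "expressed as a fraction" as O divided by T, which is an interpretation of the definition above. The function name and the toy scores are hypothetical and are not results from the paper.

```python
def efficiency(scores, true_labels, cls):
    """Fraction O/T of the first T best matches that belong to `cls`.

    `scores[i]` is the discriminant value of class `cls` for image i,
    and T is the number of images that truly belong to `cls`.
    """
    total = sum(1 for t in true_labels if t == cls)
    if total == 0:
        return 0.0
    # Best matches first: sort image indices by descending discriminant value.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:total] if true_labels[i] == cls)
    return hits / total


# Toy values for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = ["building", "non-building", "building", "building"]
toy_efficiency = efficiency(toy_scores, toy_labels, "building")
```

A value of one means that the top T matches for a class are exactly the T images of that class.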
A grayscale histogram is a global description of an image and lacks the ability to directly relate to spatial locations in the image. Texture deals with the analysis of an image at local scales, but a wide variety of images in all three classes may have either smooth or rapidly varying textures (e.g., close-up images of a uniform surface and images of vegetation would have smooth and rapidly varying textures, respectively, although both belong to the non-building class). Both histogram and texture analysis techniques provide a lower-level quantitative description of an image. Buildings, however, have well-defined higher-level spatial relationships exhibited by lines, corners, junctions, and parallel groups. Therefore, structure readily discriminates a building image from a non-building image.
Finally, Figure 1 shows some
of the images retrieved by the system that are classified as building images
using the structural analysis. As seen from the figure, the results encompass
retrieved building images in a wide variety of viewing angles and depths.
The translation was initiated by Qasim Iqbal on 2001-03-02