Many research groups [5,6,7,8] are actively pursuing content-based indexing, storage and retrieval of images. Some systems have already been built that provide content-based image retrieval [9,10,11]. These systems require user interaction, which emphasizes shape, color, and texture features to build queries. The histogram and texture analysis techniques analyze an image at a lower level on a strictly quantitative basis and are unable to capture higher-level scene descriptions that relate different primitive image features with each other. These descriptions will help in image retrieval where queries are concerned with locating images containing manmade objects. Moreover, such descriptions are relatively less sensitive to illumination changes as compared to lower-level histogram and texture analysis. Hence, we develop retrieval methodologies incorporating both lower-level and higher-level image analysis methods.
A study for the comparison of separate lower-level (histogram and texture) and higher-level (structure) approaches for the retrieval of images containing manmade objects with significant structure (represented by buildings) is presented in [2]. This paper presents an extension to that approach by integrating lower-level and higher-level approaches for the retrieval of images containing large manmade objects. However, in the current integration we do not consider the histogram. Our general framework is outlined in figure 1. An image is treated separately for two different types of analysis: global higher-level analysis and global lower-level analysis.
Higher-level analysis consists of using the general rules of perceptual grouping. Perceptual grouping refers to the human visual ability to extract significant image relations from lower-level primitive image features without any knowledge of the image content [1,2,3]. It uses such concepts as grouping by proximity, similarity, continuation, closure, and symmetry to group primitive image features into meaningful higher-level image relations. The global extraction of these higher-level features that represent significant structure to construct a 3-dimensional feature vector is outlined briefly in the next section. A detailed discussion of the extraction of these features may be found in [1].
Lower-level analysis is performed by employing Gabor filters to extract texture features. The texture features are computed globally in a manmade object region of interest (ROI) extracted by using perceptual grouping. We integrate a multiscale, multiorientation channel energy model within the higher-level ROI, to extract a 16-dimensional feature vector consisting of fractional energies in various spatial channels. The framework is also capable of computing the texture features without confinement to the manmade ROI (i.e. the texture features may be computed for the whole image).
The underlying idea is to analyze the structure present in an image (using the higher-level analysis module) to make a decision regarding the presence of manmade objects. This decision is examined in an analysis block, as shown in figure 1, to use the texture represented by the structure for refinement and enhancement in the analysis of recall and precision while serving queries. Refinement refers to the process of fine tuning the set of images obtained by higher-level analysis to eliminate false alarms, whereas enhancement refers to the extension of the set obtained by the higher-level analysis to include some images that contain manmade objects that might have been missed by the higher-level analysis module. A monocular grayscale image database consisting of outdoor images taken from a ground-level camera is utilized.
The organization of the rest of the paper is as follows: section 2 details the higher-level analysis, section 3 outlines the lower-level analysis, section 4 presents the results obtained, and finally, section 5 presents the conclusions.
Some of these features are self-explanatory; the others are explained in the following. Longer linear lines are obtained by extending approximately collinear fragmented line segments that either overlap or are close to each other. The lines obtained are further pruned to eliminate very short lines. All other features are extracted using the longer linear lines. Parallel groups are obtained by constraining the amount of overlap of the orthogonal projections of parallel lines and of their projections along the $x$ and $y$ axes, while allowing for differences in the local and intrinsic orientations of the lines. ``Significant'' parallel groups are extracted by further constraining the search to only those parallel groups in which at least one member line is enclosed by an ``L'' or ``U'' junction, while accommodating the obliqueness of the viewing angle.
The feature vector extracted from each image is expressed as:
[Equation (1)]
where each of the three components lies between 0 and 1, i.e., an image is mapped into a feature space bounded by a unit cube.
Let $G = (V, E)$ be a cotermination graph, where $V$ and $E$ are the set of vertices and the set of edges of $G$, respectively. Let $e_{ij}$ be an edge connecting vertices $v_i$ and $v_j$. The weight of $e_{ij}$ is defined in terms of $d(v_i)$ and $d(v_j)$, where $d(\cdot)$ is the degree of a vertex, that is, the number of edges incident with the vertex. The edge weights are collected by extracting the adjacency matrix of the graph. The connected components of the graph are found, and the sub-graph corresponding to each component is processed separately. The weight of a spanning tree $T$ is the sum of the weights of all the branches in $T$. We search for the maximal spanning tree, which may be found by slightly altering the minimal spanning tree algorithm so that it incorporates the vertices that result in a maximal-weight spanning tree [12].
The maximal spanning tree is employed to extract the fundamental circuits.
Each fundamental circuit represents a closed figure in the image, where
edges on this circuit correspond to line segments on the closed figure.
A polygon is defined to be that fundamental circuit extracted that meets the following requirements [3]: (a) the polygon is simple, i.e., the edges of the polygon do not intersect among themselves, (b) the polygon is relatively compact, (c) the polygon does not have many cavities, and (d) the number of edges on the polygon does not exceed a given threshold.
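The spanning-tree and fundamental-circuit steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, Kruskal's algorithm is used as the spanning tree algorithm, and the edge weight is assumed to be the sum of the degrees of the two end vertices, since the exact weight formula is not reproduced here.

```python
from collections import defaultdict


def maximal_spanning_forest(vertices, edges):
    """Kruskal's algorithm with edges visited in decreasing weight order.

    edges is a list of (u, v) pairs. Returns (tree_edges, chord_edges),
    where chords are the edges left out of the spanning forest.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # ASSUMPTION: weight = sum of the end-vertex degrees (illustrative only).
    def weight(edge):
        u, v = edge
        return degree[u] + degree[v]

    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree, chords = [], []
    for u, v in sorted(edges, key=weight, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            chords.append((u, v))  # adding this edge would close a circuit
        else:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree, chords


def fundamental_circuits(vertices, edges):
    """One circuit per chord: the tree path between the chord's end vertices."""
    tree, chords = maximal_spanning_forest(vertices, edges)
    adjacency = defaultdict(list)
    for u, v in tree:
        adjacency[u].append(v)
        adjacency[v].append(u)

    def tree_path(start, goal):
        # Depth-first search; the path between two tree vertices is unique.
        stack, seen = [(start, [start])], {start}
        while stack:
            node, path = stack.pop()
            if node == goal:
                return path
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, path + [nxt]))
        return None

    # The chord (v, u) implicitly closes each returned vertex path.
    return [tree_path(u, v) for u, v in chords]
```

Each returned vertex list is a candidate closed figure; the polygon requirements (a) to (d) would then be checked on it separately.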
Let the convex hulls of the PS at the current level form a set, and let this set be mapped to a PS graph, whose vertices correspond to the convex hulls and whose edges link convex hulls that either intersect or are close. The connected components of the PS graph are then found. Each connected component is either one convex hull, or a set of convex hulls that either intersect or are close. In the former case, the component represents one PS, which is carried over to the next level unchanged. In the latter case, the PS whose regions (convex hulls) belong to the component are grouped into a larger structure, whose region is determined from the regions of its member PS. A new PS graph is then established from the regions at the next level, and its connected components are found, leading to the further grouping of PS. This process continues until no PS can be further grouped, that is, the iteration stops at the level at which the grouping no longer changes. All potential ROIs obtained are analyzed to identify the ROI with the largest area, which is labeled as the manmade object ROI. Figure 2 displays an image and its extracted manmade object ROI.
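The iterative grouping can be illustrated with a simplified sketch. It makes several assumptions that are not taken from the paper: regions are axis-aligned boxes `(x0, y0, x1, y1)` rather than convex hulls, "close" means within a hypothetical per-axis distance `gap`, and the region of a grouped structure is the bounding box of its members.

```python
def boxes_close(a, b, gap):
    """True if two boxes (x0, y0, x1, y1) intersect or lie within `gap` per axis."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge(boxes):
    """ASSUMPTION: the grouped region is the bounding box of its members."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def group_once(regions, gap):
    """One level: build the proximity graph and merge each connected component."""
    n = len(regions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if boxes_close(regions[i], regions[j], gap):
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(regions[i])
    return [merge(members) for members in components.values()]


def largest_roi(regions, gap=0):
    """Repeat the grouping until nothing merges, then return the largest region."""
    while True:
        grouped = group_once(regions, gap)
        if len(grouped) == len(regions):  # no component had more than one member
            break
        regions = grouped
    return max(regions, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
```

In this sketch a region that is close to neither of two touching regions can still be absorbed at a later level, once their merged region reaches it, which is why the grouping is repeated rather than done in a single pass.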
The image is treated by Gabor filters to extract spatial channel-dependent texture features. Gabor filters have been utilized for texture analysis because they have optimal joint localization (resolution) in both the spatial and the spatial-frequency domains. The impulse response of an even-symmetric 2-dimensional Gabor filter is expressed as:
[Equations (2) and (3)]
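The set of self-similar Gabor filters is obtained by appropriate rotations and scalings of the impulse response $h(x, y)$ through the generating function:

$$h_{mn}(x, y) = a^{-m}\, h(x', y'), \qquad (4)$$

where $m$ and $n$ are integers, $h_{mn}(x, y)$ is the rotated and scaled version of the original filter, $a^{-m}$ is the scale factor, $n$ is the current orientation index, $K$ is the total number of orientations, $m$ is the current scale index, $S$ is the total number of scales, and $x'$ and $y'$ are the rotated coordinates:

$$x' = a^{-m}(x\cos\theta + y\sin\theta), \qquad y' = a^{-m}(-x\sin\theta + y\cos\theta), \qquad (5)$$

where $\theta = n\pi/K$ is the orientation. The scale factor $a^{-m}$ ensures that the filter energy is independent of $m$.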
The values of the filter parameters are calculated as described in [2,4].
[Equation (6)]
To reduce the ``boundary effect'' caused by the segmentation of the ROI, a Gaussian smoothing filter is applied to all pixels inside the ROI whose orthogonal distance to the nearest boundary line of the ROI is within a given threshold. Due to the localized nature of Gabor filtering, the energy leakage at the boundary of the ROI is also minimized. However, to further eliminate energy leakage outside the ROI, we set
[Equation (7)]
where the fractional energy at the output of the filter in a given orientation and at a given scale is given as:
[Equation (8)]
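A rough sketch of the multiscale, multiorientation fractional-energy features is given below. It is an illustration under stated assumptions rather than the paper's implementation: the kernel is the standard even-symmetric form (a Gaussian envelope times a cosine carrier), the values of `a`, `sigma` and `u0` are hypothetical, filter responses outside the ROI mask are simply discarded, and the fractional energy of a channel is taken to be its energy divided by the total energy over all channels.

```python
import math


def gabor_kernel(scale, orientation, n_orient=4, a=2.0, sigma=1.5, u0=0.25):
    """Even-symmetric Gabor kernel, rotated and scaled by the given indices.

    ASSUMPTION: a, sigma and u0 are illustrative values, not the paper's.
    """
    theta = orientation * math.pi / n_orient  # orientation angle
    s = a ** (-scale)                         # scale factor for this scale index
    half = int(2 * sigma / s)                 # kernel radius grows with the scale
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = s * (x * cos_t + y * sin_t)  # rotated and scaled coordinates
            yr = s * (-x * sin_t + y * cos_t)
            envelope = math.exp(-0.5 * (xr * xr + yr * yr) / sigma ** 2)
            row.append(s * envelope * math.cos(2 * math.pi * u0 * xr))
        kernel.append(row)
    return kernel


def filter_energy(image, mask, kernel):
    """Sum of squared filter responses over the pixels where mask is True.

    The kernel is symmetric about its center, so correlation and convolution
    coincide. Pixels beyond the image border are treated as zero.
    """
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    energy = 0.0
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue  # ASSUMPTION: responses outside the ROI are discarded
            acc = 0.0
            for dy in range(-half, half + 1):
                rr = r + dy
                if rr < 0 or rr >= height:
                    continue
                for dx in range(-half, half + 1):
                    cc = c + dx
                    if 0 <= cc < width:
                        acc += kernel[dy + half][dx + half] * image[rr][cc]
            energy += acc * acc
    return energy


def fractional_energies(image, mask, n_scales=4, n_orient=4):
    """Feature vector of per-channel energies normalized by the total energy."""
    energies = [filter_energy(image, mask, gabor_kernel(m, n, n_orient))
                for m in range(n_scales) for n in range(n_orient)]
    total = sum(energies)
    return [e / total if total > 0 else 0.0 for e in energies]
```

With the default four scales and four orientations the function returns a 16-dimensional vector. Passing a mask that is `True` everywhere corresponds to computing the texture features for the whole image. The pure-Python loops are only practical for small images.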
The feature vectors extracted via the higher-level analysis were classified using a Bayesian classifier, whereas the feature vectors extracted using the lower-level analysis were classified using a nearest-neighbor classifier. A multivariate Gaussian class-conditional probability density function was assumed for the higher-level features, where the mean vector and the covariance matrix of the density function were obtained by using maximum likelihood estimation [1,2]. A total of 120 images were used for testing, whereas 30 images, with 10 images in each class, were used for training.
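The two classifiers can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes equal class priors for the Bayesian classifier, a nonsingular covariance estimate, and a Euclidean distance for the nearest-neighbor classifier, none of which is stated in the text.

```python
import math


def fit_gaussian(samples):
    """Maximum likelihood estimates of the mean vector and covariance matrix."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
            for j in range(d)] for i in range(d)]
    return mean, cov


def log_density(x, mean, cov):
    """Log of the multivariate Gaussian density (cov must be nonsingular)."""
    d = len(x)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Gaussian elimination on [cov | diff] gives both log|cov| and cov^-1 diff.
    a = [row[:] + [diff[i]] for i, row in enumerate(cov)]
    log_det = 0.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        log_det += math.log(abs(a[col][col]))
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d + 1):
                a[r][c] -= factor * a[col][c]
    solution = [0.0] * d
    for r in reversed(range(d)):
        rest = sum(a[r][c] * solution[c] for c in range(r + 1, d))
        solution[r] = (a[r][d] - rest) / a[r][r]
    mahalanobis = sum(di * si for di, si in zip(diff, solution))
    return -0.5 * (mahalanobis + log_det + d * math.log(2 * math.pi))


def classify_bayes(x, models):
    """models maps label -> (mean, cov). ASSUMPTION: equal class priors."""
    return max(models, key=lambda label: log_density(x, *models[label]))


def classify_nearest_neighbor(x, training):
    """training is a list of (feature_vector, label). ASSUMPTION: Euclidean."""
    return min(training, key=lambda item: math.dist(x, item[0]))[1]
```

In this sketch `fit_gaussian` would be called once per class on that class's training feature vectors, and the resulting `(mean, cov)` pairs collected into the `models` dictionary.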
The refinement process was accomplished by taking the logical ``and'' of the results obtained from the higher-level and lower-level analysis modules, i.e., an image is retained only if both modules classify it to the same class. The enhancement process was accomplished by taking the logical ``or'' of the results of the two modules, i.e., an image is retained if at least one module classifies it to the desired class.
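Treating the output of each module as the set of images it assigns to the desired class, the two combinations reduce to set intersection and set union. The sketch below uses hypothetical function names and image identifiers.

```python
def refine(higher_level, lower_level):
    """Logical "and": keep only images that both modules assign to the class."""
    return higher_level & lower_level


def enhance(higher_level, lower_level):
    """Logical "or": keep images that at least one module assigns to the class."""
    return higher_level | lower_level


# Hypothetical image identifiers returned by each module for a query.
higher = {"img01", "img02", "img03"}
lower = {"img02", "img03", "img04"}
refined = refine(higher, lower)
enhanced = enhance(higher, lower)
```

Refinement can only shrink the set returned by the higher-level module, and enhancement can only grow it, which is why the former tends to favor precision and the latter recall.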
Table 1 lists the results obtained for a query for the retrieval of images containing buildings. The first column shows the type of experiment performed, the second column shows the total number of images containing structures that are either full buildings or an extended visible portion of a building, the third column shows the number of images retrieved, the fourth column shows the number of correct images in the set of images retrieved, the fifth column shows the recall, and the last column shows the precision. Recall is defined as the fraction of the images containing buildings that are correctly retrieved. Precision is defined as the fraction of the images retrieved that are actually correct.
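In terms of the table columns, recall is the number of correct images retrieved divided by the total number of images containing buildings, and precision is the number of correct images retrieved divided by the number of images retrieved. A small sketch with a hypothetical function name and made-up image identifiers:

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved image identifiers."""
    correct = len(retrieved & relevant)  # correct images among those retrieved
    recall = correct / len(relevant) if relevant else 0.0
    precision = correct / len(retrieved) if retrieved else 0.0
    return recall, precision


# Made-up example: four relevant images, three retrieved, two of them correct.
recall, precision = recall_precision({1, 2, 5}, {1, 2, 3, 4})
```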
The first row in table 1 shows the recall and precision obtained by only using the higher-level module. The second row shows the results obtained by using only the lower-level analysis, without the extraction of the manmade object ROI, i.e., the texture features are computed for the whole image. The third row shows the results obtained by using only the lower-level analysis, but confining it to the manmade object ROI. The fourth and the fifth rows show the results obtained by integrating the results obtained in the first row and the second row, for refinement and enhancement, respectively, without using the ROI (whole image). The sixth and the seventh rows show the results obtained by integrating the results obtained in the first row and the third row, for refinement and enhancement, respectively, using the ROI.
The recall and precision obtained by using only the higher-level analysis module are given in the first row of table 1. The goal of the paper is to increase them by using a lower-level analysis module in conjunction with the higher-level analysis module. However, it must be noted that, for a CBIR system, either recall or precision can generally be improved only at the expense of the other. As seen in the fourth and the fifth rows, without using the manmade object ROI, refinement increases the precision with a corresponding reduction in recall, whereas enhancement increases the recall with a reduction in precision. The sixth and seventh rows show that, with the ROI, refinement likewise increases the precision and enhancement likewise increases the recall; the corresponding values are listed in table 1.
The results indicate that, for the retrieval of images containing large manmade objects, the higher-level analysis module gives good results, and that recall and precision may each be increased by using a lower-level analysis module in conjunction with it. The justification for employing the higher-level analysis module to extract structure when serving queries for images containing manmade objects is demonstrated in figure 3. The figure shows some images from the output of a query for images containing buildings using higher-level analysis.