Using The Sum of Distance Technique to Classify
TM Scenes of Eastern Amazonia

 

 

 

By Chris Funk
For Dar Roberts
Geography 214
Fall 1997
 

 


 

Introduction

 

Statistics developed side by side with the nation-state in 18th century Europe. The word stems from l’etat, which is French for state. Thus Statistics has had, in general, the central tendency as its primary concern. A key idea in statistics is that the mean minimizes the sum of the squared deviations between the mean value and all the points in the distribution. This paper builds on this idea – that the sum of distances between a given point in ‘distribution space’ and all the data points, when raised to a power (not necessarily 2) and summed, provides a measure of the relative ‘location’ of a given distribution. Furthermore, it should be noted that this approach is inductive rather than deductive. Our focus is on giving the data another voice "to speak for itself" [Tukey,1975], rather than trying to prove some optimal values based on assumptions that are very likely not going to be true for the majority of real world datasets. The data being analyzed is a TM scene of the Maraba region of Brazil, [Roberts, In Press]. The Sum of Distance technique is used to qualitatively explore this data and quantitatively determine a natural number of ‘best’ clusters.

 

Overview

 

This paper is comprised of six sections. In the first section we briefly state the background for this topic. In section two the data being analyzed is discussed. In the third section the Sum of Distance method is defined. In the fourth section a bivariate clustering of the Maraba scene is performed with the SoD technique. In the fifth section multivariate classification is discussed. In the sixth section an overview of this research is given, and goals for further work are set forth.

 

Background

 

The problem of clustering/classification has had a very long and broad history. Hartigan has provided one classic reference surveying this broad field [Hartigan, 1975]. His book summarizes the traditional methods of clustering, which can be generally partitioned into three regimes: optimal, agglomerative (hierarchical), and divisive. Optimal techniques make use of the rigorous methods of integer programming to ensure partitions which meet certain optimality constraints. Generally, these techniques are limited to several hundred, or at most thousands of points, and are thus inappropriate for remote sensing applications. Agglomerative techniques work by pairing individual points, and then iteratively collapsing two clusters or swapping points between them. These techniques build up a hierarchy of clusters. While more robust than optimal methods, agglomerative techniques have difficulty handling the millions of observations typically encountered in remote sensing applications. For this reason, various flavors of partitioning techniques such as k-means [Hartigan, 1975], maximum likelihood and parrallelpiped [Richards, 1993, Lillesand, 1987] are typically used to classify images. These methods commonly share the following attributes:

 

    1. They partition a set of observations into some number of classes
    2. The number of classes must be given a priori
    3. The methods are black boxie – they provide classes without providing the user much understanding of the underlying structure of the data
    4. The methods are fast, and robust, but provide no guarantees of performance
 

In this paper we explore an innovative new direction – the sum of distances transformation. This technique has as its roots in the work of Geographers and Location Scientists who have developed methods to determine the spatial median, modes and dispersion of points in geographic space [Neft,1966]. Our insight has been that these techniques may be fruitfully applied to non-geographic information spaces. Further study into this area will be required to make the appropriate citations.

 

Classification in remote sensing has continued to maintain a place of significance, in part due to the need to identify and quantify landcover change. Our study draws heavily from the work of Roberts, et al. [Roberts, In Press]. Roberts’ work stresses the need for remotes sensing techniques to map land cover change, and presents the Multitemporal Spectral Mixture Model Analysis (MeSMA) as a standard frame of reference for comparing scenes from different years. While we have chosen to work with principal components in this study, since they most compactly express the variance of the TM data allowing us to work effectively with just 2 dimensions, we plan on using Robert’s MeSMA generated fractions in future studies, due to the increased comprehensibility and temporal stability provided by this method.

 

 


 

Data

 

A 1024 by 1024 pixel TM scene was analyzed. The statistics for the bands are listed below (table 1). Note that all the bands exhibit a high degree of skew and kurtosis, with means very close to their minimum values, and maximum values ranging up to 35 times the mean for that band. Also note that the correlation matrix (table 2) denotes a high degree of cross-correlation between the bands, indicating that actual rank of the TM matrix is considerably less than 6. Bands 1 through strongly correlated with each other, and somewhat weakly with bands 5 and 7, which were also strongly correlated with each other. The interesting exception to this pattern of inter-band linkage is TM4, which is only weakly negatively correlated to bands 3 and 7. The negative correlation between bands 3 and 4 makes sense, since vegetative material reflects in 4 and chlorophyll absorbs in 3 [Roberts, A Spectral Primer], and a large amount of the scene is covered in vegetation. Similarly, leaves reflect more in the NIR than in the MIR, which seems to generate the negative correlation between bands 4 and 7.

 

Table 1: Basic Statistics of Maraba 92 TM Bands

 
 
Band Min Max Mean Stdev Standardized  

Max

TM1
51
255
62.88
8.06
25.31
TM2
15
168
24.88
4.86
31.45
TM3
12
204
22.64
7.88
24.36
TM4
10
189
62.52
10.03
17.85
TM5
4
255
57.51
19.81
12.67
TM7
0
255
17.56
10.13
25.18
 

 

Table 2: Correlation Matrix of Maraba 92 TM Bands

 
 
Band TM1 TM2 TM3 TM4 TM5 TM7
TM1
1.00
0.96
0.91
0.04
0.70
0.76
TM2
0.96
1.00
0.95
0.05
0.77
0.82
TM3
0.91
0.95
1.00
-0.09
0.87
0.92
TM4
0.04
0.05
-0.09
1.00
0.03
-0.11
TM5
0.70
0.77
0.87
0.03
1.00
0.96
TM7
0.76
0.82
0.92
-0.11
0.96
1.00
 

 

A principal components analysis was performed on the TM bands, which resulted in the factor loadings presented in table 3. A map of the first three components is shown in image 1. The first factor is, as expected, an indication of ‘brightness’, with all bands except 4 loading positively. This ‘Brightness’ factor loaded most heavily on the MIR, suggesting that it was picking up the clouds most strongly. Indeed, the high skew and kurtosis in each of the raw TM bands can largely be accounted for by the existence of a relatively few extremely reflective pixels. While the cloud pixels could have been removed, or another, less cloudy, time period could have been examined, we chose to include the cloudy regions in our data in the hope that the sum of distance analysis would reveal them and successfully isolate them from other types of pixels. After brightness, the second principal component was ‘Greenness’. We have already described how TM4 was poorly correlated with the other bands. Since PC1 is largely just a weighted average of the other bands, it is not surprising that the PC2 would largely be just TM4, which is independent of the other bands. Note that the negative correlations between TM4 and TM3 and TM7 reappear as negative factor loadings. The third component used in this study was interpreted as an NPV factor. This band differentiated reflectance in the visible with reflectance in TM5. The fourth component appeared to differentiate between ‘wet’ black bodies (e.g. rivers) and ‘dry’ dark bodies such as burned areas. Factors 5 did not have any immediate physical interpretation and Factor 6 was almost complete noise. Below table 3 is a small plot of the eigenvalues of the components. One question this study would like to address is the correct number of components to use. Clearly, the first 2 factors ‘explain’ most of the data, but will they be sufficient to differentiate between burned areas and rivers, as well as bare soil and NPV?

 

Table 3: Factor Loadings of Principal Components
 
Band Bright Green NPV Burn Band5 Band6
TM1
0.26
0.03
-0.73
0.15
-0.60
-0.16
TM2
0.17
0.02
-0.36
0.11
0.32
0.85
TM3
0.30
-0.07
-0.38
0.02
0.72
-0.49
TM4
0.00
0.99
-0.03
-0.13
0.06
-0.04
TM5
0.80
0.07
0.45
0.37
-0.10
0.01
TM7
0.41
-0.11
0.01
-0.90
-0.09
0.08
Eigenvalue
593
103
44
4
2
.5

 
 


Maraba 1992 Principal Components


Red=PC1, Green=PC2, Blue=PC3

 
Image 1. This scene from Maraba is composed large stands of primary and secondary growth, as well as pasture and degraded farmland. Note the large number of clouds and their associated shadows.


Methods

 

One of the most significant features of space is that a large amount of it is empty. This is true of variable ‘space’ as much as it is of ‘geographic’ space. The fact that our measurements of the world fall, in reality, into only a limited subspace allows us to formulate theories and generate predictions, thereby making sense out of our sensations. In remote sensing, researchers routinely face a need to map sets of wavelength measurements to more understandable types of variables. One of two general approaches are usually followed: Classification Analysis (CA) or Spectral Mixture Analysis (SMA). Cluster Analysis takes an image and generates a thematic transform describing the scene as composed of a limited number of classes. Spectral Mixture Analysis takes an image and generates a set of fractions; metric variables describing the relative membership of a pixel in a class defined by a set number of endmembers. SMA is clearly a much more rich description of a given data set, yet, in the end, a clustering of data is often required. Maps must be made, or the change of landcover types must be analyzed. Since the need for clustering is not likely to go away, it might be worthwhile to ask how we might improve upon the current methods. In this paper we explore a new way of looking at our data, using a statistic called the Sum of Distance.

 

Sum of Distance (SoD) analysis takes a given set of m points of n-dimensional data (x) and a grid G of potential locations in n-space. From these generates a sum of distances grid, SD. The relative values of SD will be indicative of the distribution of the original points of X. Formally, let X be an n-dimensional vector of m observations. A matrix G, typically of the same dimension as X, is defined which contains points regularly distributed over the span of x. Each element of G is an n-dimensional vector defining a potential cluster location. For simplicity, we assume that we have broken up the variable range of vector x into q quintiles. G will then contain qn locations. We further define a matrix SoD of the same rank as G, composed of scalar quantities equal to the sum of distance between the corresponding grid point and all the observations in X. The following equation expresses this value:

 

                                             qn    m
        let     SoD = S  S [ D( Gi, Xj)p ]
                                            i=1    j=1

where

        SoD is a sum of distance matrix
        G is an n-dimensional grid of potential locations
        X is a vector of m observed values
        m is the number of observed values
        n is the number of variables in each observation
        D is the Euclidean Distance Function
        p is the power to which the distance is raised

 
It is a well established fact that if p is greater than 1.0 then the value of SoD will be convex up, with a minimum value at the mean of X. The following sum of distance plot (image 1) of PC1 versus PC2 illustrates this. This plot was generated from a 300 x 300 sub-pixel area. The brightness axis flows along bottom of the image, and greenness goes up and down. The plot was calculated using a power value of 2.0. This image gives us almost no information concerning the distribution of points within the Maraba 1992 scene. Compare this image with the one next to it. This image was calculated with a power of –2.0. Thus metaphorically, the plot represents the sum of ‘gravitational attractions’ between each pixel in our original image and every grid cell in PC space. Note the tasseled cap structure. This structure is more clearly revealed by subtracting out the ‘expected value’ of the sum of distance distribution, represented by the third plot.

The values of this plot are calculated as:

 

                                    qn    m
    let SoDT = S  S [ D( Gi, Xj)p ]
                                   i=1   j=1

where

        SoDT is a theoretically derived sum of distance matrix
        G is an n-dimensional grid of potential locations
        X is a vector of m observed values
        m is the number of observed values
        n is the number of variables in each observation
        D is the Euclidean Distance Function
        p is the power to which the distance is raised
        P is the chance of a pixel landing in grid cell Gi

 

P was calculated by assuming a bivariate normal distribution, centered at the mean of X, and having the same variance as PC1 and PC2. Image is the ‘residual’ left after we subtract SoDT from SoD. Note how the outliers are emphasized. The images are difficult to see, but there are clearly endmembers at three points of the triangle. The images on the bottom have been convolved by a 3x3 highpass filter matrix of the form:

 
-1 -1 -1
-1 9 -1
-1 -1 -1
 

This highpass filter raises the values of abnormally high pixels, bringing up abnormally high values even further. In particular, note the bright pixels to the left of the tassled cap. These pixels lie outside the ‘veg line’ and would likely generate errors in a 3 end member mixing model. After these explorations showed promise, the study was repeated for the full Maraba image.



PC1 vs. PC2, p=2                                                             PC1 vs. PC2, p=-2


 PC1 vs. PC2, p=-2, Null Hypothesis                         PC1 vs. PC2, p=-2, ‘Normalized’

 

Highpass Filter of p=-2                                                 Highpass filter of Residuals
 
 


Maraba 1992 Full Image

                PC1 vs. PC2, p=-2                                                                                        PC1 vs. PC2, p=-2, Residuals

 


        PC1 vs. PC2, p=-2, Highpass Filtered                                 PC1 vs. PC2, p=-2, Residual, Highpass

 
Bivariate Clustering with the Sum of Distance Technique

These 2d plots show the sum of distance measurement calculated to a range of greenness and brightness values for the full Maraba image. The images were calculated to a coarser resolution in order to keep processing times reasonable. The first plot shows the sum of inverse distance squared plot for the entire 1024 by 1024 image. Note the tasseled cap structure. The orange area defines the mean for the image, which is mostly composed of forest. To the right of this image is the ‘normalized version’, which has had the expected value of the sum of distance measure subtracted away. Note how this accentuates the outliers located to the far right and on the bottom of the images. The areas are almost four standard deviations away from the mean center, and should ‘normally’ be empty. These pixels, therefore, are almost certainly pure, and make good candidates for an endmember analysis. In order to better differentiate cluster centers the normalized SoD field was passed through a highpass filter, and the twlve grid cells with highest values were extracted. Two of these points were deemed redundant, and removed. These centroids are plotted below. Note that the ‘tasseled cap’ seems stretched out to the right. Presumably this is caused by the existence of clouds in our image.

 

We took these coordinates and used them to perform a ‘minimum distance classification. This classification appears below

 

These results seem reasonable, but the classification failed to adequately distinguish between clouds and pasture. Another considerable error in this clustering is the confusion of burned and river regions.

 
To test the accuracy of our method, we reintroduced the clusters that we deemed ‘redundant’ and reran the cluster analysis. We considered it possible that we might need more contrast along the brightness dimension to catch the difference between clouds and pasture.
 

This is the resulting image. Our cloud problem has become less severe and fewer burned areas appear as ‘water’. The vegetation seems to have been nicely differentiated.

In general this is rather impressive for an unsupervised classification based on just two variables. The problems that we encountered seem to be mostly attributable to the nature of the principal components we didn’t use: PC3 – the NPV component, would have provided additional information helping us

differentiate burned areas, pastures and degraded farmland. PC4, which appears to load positively on water and negatively on burned areas, would have helped to split the river/burn class appropriately. The Sum of Inverse Distance plots for these components can be found on the next two pages. Clearly, these plots are telling us something very powerful about a small number of pixels. This is a significant advantage of the SoD approach – small portions of a large population can have a significant impact on the SoD field, if they are strongly clustered and adequately far enough away from the other points. This suggests that the SoD approach might prove very useful in detecting and visualizing weak, yet distinct, signals within remotely sensed data. This could be useful in a wide array of applications.

 
Our original intention with the SoD programs was to extend our G matrix into three, four, and possibly more dimensions. Code for the three dimensional case, as well as for 3d visual display was written. When these were tested on a 300 by 300 sub-region of the study site, for PC’s 1,2 and 3, nothing exciting happened – because the region was mostly composed of forest. Experiments with applying the method to the full 1024x1024 image were not encouraging. The computational complexity of the SoD technique, at present, is O(qn*m), where q is the number of steps in variable space (the width, height, etc. of our grid), n is the number of dimensions, and m is the total number of pixels. Thus for a 20x20x20 ‘Component Space’, which is not really a sufficient resolution, for our small subset of 10242 pixels, roughly 8x1016 distance calculation/additions were needed. This is not very appropriate for an ‘exploratory’ data analysis tool. The curse of dimensionality is indeed difficult to overcome. Still, we feel that we have shown that this method is worth exploring further. There are several approaches to exploring/utilizing higher dimensions within this framework. These are discussed in the next section: Multidimensional Clustering with the SoD technique.


Maraba 1992 Full Image
 
 
                PC1 vs. PC3 , p = -2.0                                             Highpass Filtered

                PC2 vs. PC3 , p = -2.0                                                     Highpass Filtered  

                        PC1 vs. PC4                                                            Highpass Filter

                        PC2 vs. PC4                                                             Highpass Filter

Multidimensional Clustering with the Sum of Distance Technique

 

In general, moving data analysis techniques into higher dimensions is usually either trivial, or very difficult. Performing a k-means clustering, on one single additional ‘dimension’ is as simple as adding up one more number for every point examined. Yet this ‘simplicity’ may be very misleading. Please examine the brightness-NPV plot. The NPV values ranged from –13 to +4 standard deviations. The interesting behavior all occurs at the extreme bottom of the range. There are almost no statistical models which would predict this occurrence, and a set of seed points spread ‘randomly’ throughout the data would almost surely miss these outliers entirely. The SoD method shows the outlying patterns pretty clearly, even without looking at the ‘residuals’ from a normal model as we did earlier in this study. How to take advantage of this additional structures is another matter. Working on an n-dimensional hypercube might be feasible, but since this space grows exponentially, and since visualizing higher dimensions becomes exceedingly difficult, the next section of this study will focus on ‘faking it in higher dimensions’ – that is using multidimensional relationships derived from multiple sets of bivariate plots.

 
We begin with our cluster centroids from the end of the previous section. What we would like to do is extend this set of clusters out to four dimensions to include the NPV and wetness components. We list the 12 clusters here, ranked by shade and greenness. Some of these are redundant, as shown by the chart below. What we would like to do is to add components to the ‘cloud’ and ‘bright soil’ classes so that they can differentiate better. By observing the plots above we see that indeed, both the NPV and wetness plots show 2 clusters of outliers at high brightness ranges – thus high brightness, high NPV, and low wetness is probably East Campina sand or a recently burned area, while the opposite would hold true of clouds. We modify the table below appropriately, based on the values indicated in the SoD plots. A similar analysis of the plots above allows us to modify the darker elements of our classes in a similar fashion. The middle values we assign a value of 127, the average for each PC. Note that there are still interesting aspects of the Greenness SoD plots listed above. The extremely high Greenness, low NPV values might reveal Capoeria stands, but we let this pass for now, defining a new set of clusters as:

 
Brightness Greenness NPV Wetness
222.52
127
25 189
210.58
127
200 10
186.7
124
200 20
162.82
127
127 127
103.12
148.264
127 127
103.12
91.56
127 127
79.24
127
127 127
79.24
91.56
189 25
55.36
77.384
189 25
43.42
91.56
25 189
 

This set of values produced a set of clusters so bad that we are not going to even include them in this paper. Clearly, teasing out appropriate clusters from multiple bivariate plots must be done appropriately. To increase our comprehension of this problem we ran our Sum of Distance algorithm for a 20x20x20 cube and wrote a program to interactively explore this space. A screen shot, which fails to capture the full excitement, shows the first three components, after being transformed by a high pass filter.
 

Clearly, as we can see from this image, there is a significant outlying group in the PC1/PC2/PC3 combination. Figuring out how to map and understand these spaces remains a challenge, which until resolved, will impose an impediment to the contributions that the Sum of Distance method can make. There is however, hope. At present the code which generates the Sum of Distance matrix is exhaustive, calculating the distance to every data point for every grid point. This is massively inefficient if the power exponent p is less than zero, since each inverse distance value will approach zero quickly. This suggests that order-of-magnitude sppeed enhancements could be made by cleverly written code, if a maximum distance was defined for each grid element. For example, assume that a high resolution 4 dimensional, 50 x 50 x 50 x 50 cube was being generated, but that the distance limit constrained calculations such that each data point only contributed to 64 grid cells (3 on each side). The latter process would run in something like 2/1000th of the time required for the former.

If such a speed-up was effective, then higher dimensional datasets could be generated, and the spaces searched automatically for interesting points. These sets of higher dimensional points could in turn be used to perform clustering, or to help define the convex hull of a Spectral Mixing Model. Multidimensional Scaling can reduce higher order topologies with a minimum of distortion in the inter-point distances [Kruskal and Wish, 1978]. This technique might be reasonably applied to automatically generated higher dimensional sets of clusters, bringing them down to smaller sets of dimensions comprehensible to our mortal minds.

 

Summary
 

In this study we developed a new data analysis technique, the Sum of Distance method, and used it to examine and cluster a TM Scene from Maraba. The TM scenes were converted to Principal Component scores, and it was seen that these scores reiterated the basic covariance structure of the data. The principal components appeared in the ‘typical’ KT transform format – brightness, greenness, NPV and Wetness. Bivariate sum of inverse distance squared plots were then generated for PC1 and PC2. A model SoD plot was also generated for a theoretical a bivariate (independent) normal distribution with the same variance values. This theoretical model could be improved by including a covariance term. The residuals (SoD-SoDT) were examined, and showed large outliers, lying near the usual locations of the 3-em shade, vegetation, and soil members. Both the raw SoD grid and the residual SoD grid were passed through a 3 by 3 highpass filter. The highest 12 filtered residual points were then selected as cluster centroids. The resultant clustered image appears to be physically reasonable, except for some confusion regarding clouds and burned areas. This is perhaps to be anticipated, given the principal components which were used.

 

Our experiments with higher order clustering were evocative but frustrating. Since a straight forward application of the SoD method seemed to generate good clusters in the bivariate case, this would probably be the case in higher dimensions. One issue, which has heretofore been avoided, is the question of appropriate distance metrics. In this study, simple euclidean distance was used. It is not clear that this was better than using measures of statistical distance, where each deviation value is normalized by the standard deviation of the associated variable. As higher principal components are used this issue becomes increasingly important, since by definition they have lower and lower variance levels. It may be that using an MNF transform, or MeSMA produced fractions will obviate this issue.

 
Future Research

This present study is lacking in many respects, paramount of which is the need to tie the actual clusters more tightly with the biophysical character of the remotely sensed image. Do the clusters make sense? Do they map to physical reality, or are they chimerical? These can only be answered by examining this and other SoD clusterings more rigorously. The SoD program needs to be written much efficiently, allowing fast calculations of higher dimensional grids. An image display module need to be written and linked to the SoD displays, allowing users to circle SoD regions and have the corresponding pixels highlighted on an image. Together these two innovations would allow the technique to be applied to higher dimensional data.

We hope to apply the improved SoD clustering technique to each of the Maraba scenes analyzed by Roberts: 1984, 1989, 1992, 1994. Comparisons of the resultant annual SoD fields will hopefully offer a unique way to visualize and quantify landcover change. Additionally, interclass transition matrices will be generated, and the possibility of simulating landcover change will be examined.


Bibliography

 
Breiman, L, Freidman, J. H., Olshen, R.A., Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA.

Everitt, B. (1980). Cluster Analysis (second edition). Halsted, New York.

Hartigan, J. A. (1975). Clustering Algorithms. Wiley. New York, NY.

Lillesand, T.M., Kiefer, R. W., (1987). Remote Sensing and Image Interpretation, John Wiley and Sons, New York, NY.

Neft, D. S. (1966). Statistical Analysis for Areal Distributions, Regional Science Research Institute, Philadelphia, PA.

Richards, John A., (1993). Remote Sensing Digital Image Analysis, An Introduction. Springer-Verlag, Berlin.

Roberts, D.A., Batista, G., Pereira, J., Waller, E., and Nelson, B., (In Press), Remote Sensing Change Detection: Environmental Monitoring Application and Methods. Elvidge, C.D., and Lunetta, R. Eds, Ann Arbor Press, Ann Arbor, MI.

Tukey, J. (1977), W., Exploratory Data Analysis, Addison-Wesley.