LECTURE 14: ACCURACY AND ERROR IN GIS

A. INTRODUCTION

B. ERROR PROPAGATION

C. ERROR IN DEMS

D. ERROR IN AREA CLASS MAPS

E. GENERAL STRATEGIES



A. INTRODUCTION
maps may appear to be the result of accurate, scientific measurement
even a topographic map will show the biases and conventions of its creator

maps often reveal much about the agenda of their creators

Denis Wood, The Power of Maps (Routledge, 1993): "whose agenda is in your glove compartment?"

compare the two topographic maps of the South Tirol
map made by the Austrian administration prior to the Treaty of Saint-Germain in 1919

map at the same scale, of the same area made by Italian administration

when such maps are digitized, their conventions become part of the database

results of analysis from these two maps might be very different

GIS databases built from maps are not necessarily objective, scientific measurements of the world

it is impossible to create a perfect representation of the world in a GIS database

therefore all GIS data are subject to uncertainty
uncertainty regarding what the data tell us about the real world
a range of possible truths
that uncertainty will affect the results of analysis

all GIS results should have confidence limits, "plus or minus"

it can be very difficult to determine what those confidence limits are

because GIS results come from a computer, we tend to treat them as more accurate than they really are, and to ignore uncertainty

uncertainty can arise because:

measurements were not perfectly accurate

maps were distorted to make them more readable

e.g. lines are often repositioned

e.g. US 101 and the railroad through Goleta at a scale of 1:250,000

at this scale both objects are thinner than their map symbols

the symbols would overlap if they weren't moved by the cartographer

definitions are vague, ambiguous, subjective

the real landscape has changed since the data were collected

the map is generalized

here is an example of positional errors in two commercial street centerline databases of Goleta
the background fill is darkest where errors are smallest

note how the errors are often as large as 100m

a problem if someone reports the location of a fire using one map, and a response is dispatched using the other map

the response vehicle could be sent to the wrong street
note also how many streets are not in both databases

notice how errors persist over large areas

the error at one point is not independent of error at neighboring points

this is a general characteristic of error in GIS databases



B. ERROR PROPAGATION

1. Area map measurement example

an example of how uncertainty in a database can propagate into uncertainty about GIS products

a common GIS product is the measurement of polygon area

this example comes from the city of Melbourne, Australia

a layer is to be built of land ownership

showing each parcel of land

to be derived by digitizing a map at a known scale

this GIS layer will be used to produce maps of land ownership
on these maps the area of each parcel is to be shown

this can be computed from the polygon vertices using the trapezium algorithm
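
a minimal Python sketch of this computation (the lecture gives no implementation; ordered vertices are assumed):

```python
# a minimal sketch of the trapezium (shoelace) computation, assuming the
# vertices are supplied in order around the parcel boundary
def polygon_area(vertices):
    """Area of a simple polygon from its ordered (x, y) vertices."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1   # twice the signed slice of area
    return abs(total) / 2.0

# e.g. a square parcel of roughly 1,000 sq m, like the Melbourne average
print(polygon_area([(0, 0), (31.6, 0), (31.6, 31.6), (0, 31.6)]))  # 998.56
```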

but what is the "plus or minus" on area?

it depends on the scale of the input map and the accuracy of digitizing

what accuracy can we get when a map is digitized?
as a general rule of thumb, errors from digitizing, map drafting, registration, and stretching of the paper amount to about 0.5mm at the scale of the map

the table shows what this means for maps of given scales

third column shows the same expressed as area rather than length

e.g. a map at a scale of 1:3,000 will produce a GIS layer with a positional accuracy on the ground of 1.5m
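
the table itself is on the slide, but the arithmetic is simple; a sketch with illustrative scales:

```python
# the 0.5 mm rule of thumb as code: ground accuracy is the map error
# multiplied by the scale denominator
def ground_accuracy_m(scale_denominator, map_error_mm=0.5):
    return map_error_mm / 1000.0 * scale_denominator

for denominator in (1000, 3000, 25000, 250000):
    print(f"1:{denominator:,} -> +/- {ground_accuracy_m(denominator):g} m")
# 1:3,000 gives +/- 1.5 m, the figure used in the Melbourne example
```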

the average suburban parcel in Melbourne is about 1,000 sq m

what did the map users want in the way of precision on parcel area?

two decimal places (hundredths of square meters)

what data accuracy would be needed to get this accuracy in the calculation of area?

it's a simple calculation, table shows the results

for a parcel of 1,000 sq m, it takes a map at 1:300 (15cm positional accuracy) to produce areas that are accurate to 1%
that is, the tens digit is accurate, but the units digit and all decimal places are spurious
a map at 1:3,000 (the actual maps) would result in areas accurate to 10%
only the hundreds digit is reliable
to get the required accuracy, the map would have to be at a scale of 3:1
three times larger than the city
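
a back-of-envelope check on these figures, assuming a square parcel with independent corner errors; for that geometry the standard error of area works out to roughly sqrt(2) times the side length times the positional error (the lecture's table may use a slightly different convention, so treat the constants as approximate):

```python
import math

# relative area error for a square parcel under the 0.5 mm rule,
# assuming independent errors at each corner
def area_error_percent(parcel_sq_m, scale_denominator, map_error_mm=0.5):
    sigma = map_error_mm / 1000.0 * scale_denominator   # ground error, m
    side = math.sqrt(parcel_sq_m)
    return 100.0 * math.sqrt(2) * side * sigma / parcel_sq_m

print(area_error_percent(1000, 300))    # ~0.7%: near the "accurate to 1%" row
print(area_error_percent(1000, 3000))   # ~6.7%: near the "accurate to 10%" row
```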

Lewis Carroll and Umberto Eco have both written about the fantasy of maps at 1:1 and larger



C. ERROR IN DEMS

1. USGS quality description

a DEM provides measurements of the elevation of the land surface at each grid point

errors are due to:

measurement of the wrong elevation at the grid point

measurement of the right elevation at the wrong location

any combination of these

it is impossible to determine which case applies

the USGS provides simple quality statements for its DEMs
given as "root mean square error"

this is the square root of the average squared difference between recorded elevation and the truth

roughly interpreted as the typical size of the difference

e.g. many DEMs have RMSE of 7m

an error of 7m is common

errors of 10m, even 20m occur sometimes

RMSE can be interpreted as the standard deviation of the error distribution
diagram shows what this means in terms of relative frequencies of errors

small errors are commonest

32% of errors will be more than 7m

5% of errors will be more than 14m
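
these percentages follow from the normal tail areas, assuming errors are roughly normal with zero mean:

```python
import math

# if DEM errors are roughly normal with zero mean, the RMSE acts as the
# standard deviation, and the familiar normal tail areas apply
def fraction_exceeding(threshold_m, rmse_m=7.0):
    # two-sided tail: P(|error| > threshold) for a zero-mean normal
    return math.erfc(threshold_m / (rmse_m * math.sqrt(2.0)))

print(fraction_exceeding(7.0))    # ~0.32: about one error in three beyond 7 m
print(fraction_exceeding(14.0))   # ~0.05: about one in twenty beyond 14 m
```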

2. Effects on contour maps
contour maps do not normally attempt to show uncertainty
width of any contour is determined only by cartographic emphasis, pen size

a contour map of central Pennsylvania near State College

generated from a USGS DEM with 30m spacing

the 350m contour has been widened for emphasis

suppose the RMSE is 7m, what does this mean in terms of the contour position?
the colored area shows where the contour might actually be

there is still a 5% chance the contour lies outside the colored area

reddish areas are recorded as greater than 350m but might actually be 350m

greenish areas are recorded as less than 350m but might actually be 350m

what does this error mean in terms of slope estimates?
slopes are calculated by comparing neighboring elevations

suppose two adjacent points are both at 350m

one point could actually be at 360m

the neighbor could actually be at 340m

instead of a steep slope (20m change over a 30m spacing) we would get what appears to be flat land

this assumes that neighboring errors are independent

if they were, the DEM would be virtually useless for many purposes

in fact, errors at adjacent points tend to be similar

both points are erroneously high, or erroneously low
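
a sketch of the slope argument above: two grid points 30m apart, both recorded at 350m, with 7m RMSE errors that are either independent or strongly correlated between neighbors (the 0.9 correlation is an assumed value for illustration):

```python
import math
import random

# spread of apparent slopes between two equal-elevation neighbors,
# under independent versus correlated errors
def slope_spread(correlation, n_trials=20000, spacing=30.0, rmse=7.0):
    slopes = []
    for _ in range(n_trials):
        e1 = random.gauss(0.0, rmse)
        # construct a second error with the requested correlation to e1
        e2 = (correlation * e1 +
              math.sqrt(1.0 - correlation ** 2) * random.gauss(0.0, rmse))
        slopes.append((e1 - e2) / spacing)   # the true elevations are equal
    mean = sum(slopes) / n_trials
    return math.sqrt(sum((s - mean) ** 2 for s in slopes) / n_trials)

print(slope_spread(0.0))   # ~0.33: apparent slopes of up to ~18 degrees
print(slope_spread(0.9))   # ~0.10: correlated errors largely cancel
```
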
3. Sources of error
some clues about the nature of errors can be gained by looking at the data carefully
a detailed contour map of part of the data

notice how there are vertical and horizontal ridges in parallel lines

these are created by the DEM production process, which correlates blocks of pixels on stereo air photos and tends to concentrate errors at block edges



D. ERROR IN AREA CLASS MAPS

1. Nature of errors

area class maps show a class at every point
examples are vegetation cover maps, soil maps, land use maps
they imply that class is uniform within areas, changes abruptly between areas
in fact both assumptions are wrong

there is variation within areas (heterogeneity)

due to inclusions of other classes of unknown size and frequency
there is blurring across boundaries
zones of transition
area class maps have been described as "maps showing areas that have little in common, surrounded by lines that do not exist"
example of an area class map
a map of soils in part of Northern Ohio (the Medina Quad)

originally digitized by Peter Fisher, University of Leicester

original map scale 1:15,840

4 inches to the mile
focus on the area shown in yellow
let's assume the legend says this class is "80% sand, with 20% inclusions of clay"

this map is used for many purposes

some involve land use regulation

some involve taxation, compensation

in principle, all of these are uncertain if the map is uncertain

GIS applications are in deep trouble in court if it can be shown that regulations or taxes were based on uncertain data and that no effort was made to deal with that uncertainty

2. Simulation model
one way to deal with uncertainty is to simulate the effects of the unknown variation

this map is made by a random process, constrained so that inclusions of clay (red) are small, and randomly located, and amount to 20% of the area

here is another map with the same constraints but different random locations and sizes
a parameter of the model (rho, shown in the top left) determines average inclusion size
in the first simulations it was set to .200

here and here it is increased to .240

notice how the inclusions get larger, but still occupy about 20% of the area
here and here it is increased to .250, the theoretical limit
the inclusions are still about 20% of the area
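
a minimal sketch of the simulation idea; the lecture's model is spatially autoregressive, and this sketch substitutes smoothed Gaussian noise (numpy and scipy assumed), which shows the same qualitative behavior, with the smoothing radius standing in for rho:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# spatially correlated noise, thresholded so that clay inclusions cover
# 20% of the map; larger smoothing gives larger inclusions
def simulate_inclusions(shape=(100, 100), smoothing=3.0, clay_fraction=0.2,
                        seed=None):
    rng = np.random.default_rng(seed)
    field = gaussian_filter(rng.standard_normal(shape), smoothing)
    cutoff = np.quantile(field, 1.0 - clay_fraction)
    return field > cutoff          # True where the simulated class is clay

clay = simulate_inclusions()
print(clay.mean())                 # 0.2 by construction: 20% inclusions
```
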
3. Impacts on GIS products
in practice we don't know rho
no one has ever tried to measure it for these kinds of data

but it is essential to know it if we want to determine the uncertainty of certain GIS products

e.g. uncertainty in the area of soil that is clay in a particular farmer's field

this table shows the impacts of rho on uncertainty in area
the left hand column shows rho

the top line (rho=0) corresponds to complete mixing, inclusion sizes close to zero

the bottom line (rho=.250) corresponds to the situation where the area is either all clay (probability 0.2) or all sand (probability 0.8)

this might happen in a crop example if we knew that the entire area was planted to one crop, but were uncertain which crop it was based on remote sensing

notice how the uncertainty in area estimates (last column) rises sharply with rho

it is much more of a problem with large rho, that is, large inclusions
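
the farmer's-field example, reusing simulate_inclusions() from the sketch above: clip a 20 x 20 cell window from each simulated map and record its clay fraction; the spread across runs is the uncertainty in area, and it grows sharply as the inclusions get larger:

```python
import numpy as np

# clay fraction within one "farmer's field" window, across 200 simulations
for smoothing in (0.5, 3.0, 6.0):
    fractions = [simulate_inclusions(smoothing=smoothing)[40:60, 40:60].mean()
                 for _ in range(200)]
    print(f"smoothing {smoothing}: clay fraction {np.mean(fractions):.3f} "
          f"+/- {np.std(fractions):.3f}")
```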



E. GENERAL STRATEGIES

1. Simulation strategy

these models are complex
it's not likely that the average GIS user would be able to understand them

describing uncertainty as "a spatially autoregressive model with parameter rho" doesn't help many GIS users

how to get the message across?

there are many models out there
much recent research has focused on modeling uncertainty

the average user can't be expected to understand them all

the producer of data is the person best able to describe uncertainty
uncertainty must be communicated through data quality statements

e.g. RMSE is 7m

various standards exist for describing data quality
the Spatial Data Transfer Standard (Federal Information Processing Standard 173) has five elements of quality
positional accuracy

attribute accuracy

logical consistency

do the data follow all of the expected logical rules?

e.g. do polygons close? (a toy check appears after this list)

many problems of logical consistency can be corrected automatically

completeness
are all features represented?
lineage
how the data were created, by what process
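
a toy check of the "do polygons close?" kind, for illustration only:

```python
# a digitized ring is logically consistent only if its first and last
# vertices coincide (within a tolerance); such failures can often be
# snapped shut automatically
def ring_is_closed(ring, tolerance=0.0):
    (x0, y0), (xn, yn) = ring[0], ring[-1]
    return abs(x0 - xn) <= tolerance and abs(y0 - yn) <= tolerance

print(ring_is_closed([(0, 0), (1, 0), (1, 1), (0, 0)]))   # True
print(ring_is_closed([(0, 0), (1, 0), (1, 1), (0, 1)]))   # False: gap
```
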
standards like this don't help the user who wants to know only what impact uncertainty will have on the results of analysis
e.g. knowing which model of digitizer was used is less helpful than knowing the accuracy that it produces
a general strategy for communicating about uncertainty
proposition: a method for simulating uncertainty is itself a complete description of data quality

the method is defined by the data producer

it produces simulations, each of which is an equally likely, possible version of the true map

variation among simulations represents uncertainty
the user examines the effects of different simulations on the result of analysis

the diagram compares a normal analysis done with a single data set with an analysis done repeatedly with the actual data plus a series of simulations
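
the comparison in schematic Python; analyse() and the simulations are placeholders for whatever the application and the data producer supply:

```python
# run the same analysis on the supplied data and on each producer-generated
# simulation; the spread of results is the "plus or minus" on the answer
def uncertainty_of(analyse, data, simulations):
    results = [analyse(data)] + [analyse(sim) for sim in simulations]
    return min(results), max(results)   # or percentiles, a histogram, etc.
```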

three strategies
ignore the issue completely

describe uncertainty with measures, e.g. RMSE

simulate equally probable versions of the data

2. Applets
an applet is a small piece of code, written in Java or a similar language, and distributed with the data

the picture shows a mockup of a user interface for examining possible data sets from a library or data archive

a DEM

the bounding box is shown

the sampling interval is shown, and the name of the area

the lower right shows a button

when clicked, the button will initiate a simulation process
an example
the example shows a simulation of uncertainty in the survey of a square parcel of land

each corner point is subject to an independent error in both coordinates

an RMSE of 2m
the simulation tracks the average area, standard deviation, and other statistics

it also shows a histogram of areas

execute the simulation (this will initiate a piece of Java code on your machine)
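
a Python sketch of what the applet's Java code computes; the 100m side is an assumption for illustration, since the lecture does not give the parcel size:

```python
import math
import random

# perturb each corner of a square parcel independently in x and y with a
# 2 m RMSE, recompute the area, and accumulate statistics
def survey_simulation(side=100.0, rmse=2.0, n_trials=10000):
    corners = [(0.0, 0.0), (side, 0.0), (side, side), (0.0, side)]
    areas = []
    for _ in range(n_trials):
        pts = [(x + random.gauss(0.0, rmse), y + random.gauss(0.0, rmse))
               for x, y in corners]
        twice_area = sum(pts[i][0] * pts[(i + 1) % 4][1] -
                         pts[(i + 1) % 4][0] * pts[i][1] for i in range(4))
        areas.append(abs(twice_area) / 2.0)
    mean = sum(areas) / n_trials
    sd = math.sqrt(sum((a - mean) ** 2 for a in areas) / n_trials)
    return mean, sd

mean, sd = survey_simulation()
print(f"average area {mean:,.0f} sq m, +/- {sd:,.0f} sq m")  # ~10,000 +/- ~283
```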

how well does this approach do at communicating understanding of uncertainty?