LECTURE 14: ACCURACY AND ERROR IN GIS

A. INTRODUCTION

B. ERROR PROPAGATION

C. ERROR IN DEMS

D. ERROR IN AREA CLASS MAPS

E. GENERAL STRATEGIES



A. INTRODUCTION
maps may appear to be the result of accurate, scientific measurement
even a topographic map will show the biases and conventions of its creator

maps often reveal much about the agenda of their creators

Denis Wood, The Power of Maps (Routledge, 1993): "whose agenda is in your glove compartment?"

compare the two topographic maps of the South Tirol
map made by the Austrian administration prior to the Treaty of Saint-Germain in 1919

map at the same scale, of the same area made by Italian administration

when such maps are digitized, their conventions become part of the database

results of analysis from these two maps might be very different

GIS databases built from maps are not necessarily objective, scientific measurements of the world

it is impossible to create a perfect representation of the world in a GIS database

therefore all GIS data are subject to uncertainty
uncertainty regarding what the data tell us about the real world
a range of possible truths
that uncertainty will affect the results of analysis

all GIS results should have confidence limits, "plus or minus"

it can be very difficult to determine what those confidence limits are

because GIS results come from a computer, we tend to treat them as more accurate than they really are, and to ignore uncertainty

uncertainty can arise because:

measurements were not perfectly accurate

maps were distorted to make them more readable

e.g. lines are often repositioned

e.g. US 101 and the railroad through Goleta at a scale of 1:250,000

at this scale both objects are thinner than their map symbols

the symbols would overlap if they weren't moved by the cartographer

definitions are vague, ambiguous, subjective

the real landscape has changed since the data were collected

the map is generalized

here is an example of positional errors in two commercial street centerline databases of Goleta
the background fill is darkest where errors are smallest

note how the errors are often as large as 100m

a problem if someone reports the location of a fire using one map, and a response is dispatched using the other map

the response vehicle could be sent to the wrong street
note also how many streets are not in both databases

notice how errors persist over large areas

the error at one point is not independent of error at neighboring points

this is a general characteristic of error in GIS databases



B. ERROR PROPAGATION

1. Area map measurement example

an example of how uncertainty in a database can propagate into uncertainty about GIS products

a common GIS product is the measurement of polygon area

this example comes from the city of Melbourne, Australia

a layer is to be built of land ownership

showing each parcel of land

to be derived by digitizing a map at a known scale

this GIS layer will be used to produce maps of land ownership
on these maps the area of each parcel is to be shown

this can be computed from the polygon vertices using the trapezium algorithm
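
a minimal Python sketch of this computation (the lecture gives no implementation; ordered vertices are assumed):

```python
# a minimal sketch of the trapezium (shoelace) computation, assuming the
# vertices are supplied in order around the parcel boundary
def polygon_area(vertices):
    """Area of a simple polygon from its ordered (x, y) vertices."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1   # twice the signed slice of area
    return abs(total) / 2.0

# e.g. a square parcel of roughly 1,000 sq m, like the Melbourne average
print(polygon_area([(0, 0), (31.6, 0), (31.6, 31.6), (0, 31.6)]))  # 998.56
```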

but what is the "plus or minus" on area?

it depends on the scale of the input map and the accuracy of digitizing

what accuracy can we get when a map is digitized?
as a general rule of thumb, errors from digitizing, map drafting, registration, and stretching of the paper amount to about 0.5mm at the scale of the map

the table shows what this means for maps of given scales

third column shows the same expressed as area rather than length

e.g. a map at a scale of 1:3,000 will produce a GIS layer with a positional accuracy on the ground of 1.5m
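
the table itself is on the slide, but the arithmetic is simple; a sketch with illustrative scales:

```python
# the 0.5 mm rule of thumb as code: ground accuracy is the map error
# multiplied by the scale denominator
def ground_accuracy_m(scale_denominator, map_error_mm=0.5):
    return map_error_mm / 1000.0 * scale_denominator

for denominator in (1000, 3000, 25000, 250000):
    print(f"1:{denominator:,} -> +/- {ground_accuracy_m(denominator):g} m")
# 1:3,000 gives +/- 1.5 m, the figure used in the Melbourne example
```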

the average suburban parcel in Melbourne is about 1,000 sq m

what did the map users want in the way of precision on parcel area?

two decimal places (hundredths of square meters)

what data accuracy would be needed to get this accuracy in the calculation of area?

it's a simple calculation, table shows the results

for a parcel of 1,000 sq m, it takes a map at 1:300 (15cm positional accuracy) to produce areas that are accurate to 1%
that is, the tens digit is accurate, but the units digit and all decimal places are spurious
a map at 1:3,000 (the actual maps) would result in areas accurate to 10%
only the hundreds digit is reliable
to get the required accuracy, the map would have to be at a scale of 3:1
three times larger than the city
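
a back-of-envelope check on these figures, assuming a square parcel with independent corner errors; for that geometry the standard error of area works out to roughly sqrt(2) times the side length times the positional error (the lecture's table may use a slightly different convention, so treat the constants as approximate):

```python
import math

# relative area error for a square parcel under the 0.5 mm rule,
# assuming independent errors at each corner
def area_error_percent(parcel_sq_m, scale_denominator, map_error_mm=0.5):
    sigma = map_error_mm / 1000.0 * scale_denominator   # ground error, m
    side = math.sqrt(parcel_sq_m)
    return 100.0 * math.sqrt(2) * side * sigma / parcel_sq_m

print(area_error_percent(1000, 300))    # ~0.7%: near the "accurate to 1%" row
print(area_error_percent(1000, 3000))   # ~6.7%: near the "accurate to 10%" row
```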

Lewis Carroll and Umberto Eco have both written about the fantasy of maps at 1:1 and larger



C. ERROR IN DEMS

1. USGS quality description

a DEM provides measurements of the elevation of the land surface at each grid point

errors are due to:

measurement of the wrong elevation at the grid point

measurement of the right elevation at the wrong location

any combination of these

it is impossible to determine which case applies

the USGS provides simple quality statements for its DEMs
given as "root mean square error"

this is the square root of the average squared difference between recorded elevation and the truth

roughly interpreted as the typical size of the difference

e.g. many DEMs have RMSE of 7m

an error of 7m is common

errors of 10m, even 20m occur sometimes

RMSE can be interpreted as the standard deviation of the error distribution
diagram shows what this means in terms of relative frequencies of errors

small errors are commonest

32% of errors will be more than 7m

5% of errors will be more than 14m
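
these percentages follow from the normal tail areas, assuming errors are roughly normal with zero mean:

```python
import math

# if DEM errors are roughly normal with zero mean, the RMSE acts as the
# standard deviation, and the familiar normal tail areas apply
def fraction_exceeding(threshold_m, rmse_m=7.0):
    # two-sided tail: P(|error| > threshold) for a zero-mean normal
    return math.erfc(threshold_m / (rmse_m * math.sqrt(2.0)))

print(fraction_exceeding(7.0))    # ~0.32: about one error in three beyond 7 m
print(fraction_exceeding(14.0))   # ~0.05: about one in twenty beyond 14 m
```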

2. Effects on contour maps
contour maps do not normally attempt to show uncertainty
width of any contour is determined only by cartographic emphasis, pen size

a contour map of central Pennsylvania near State College

generated from a USGS DEM with 30m spacing

the 350m contour has been widened for emphasis

suppose the RMSE is 7m, what does this mean in terms of the contour position?
the colored area shows where the contour might actually be

there is still a 5% chance the contour lies outside the colored area

reddish areas are recorded as greater than 350m but might actually be 350m

greenish areas are recorded as less than 350m but might actually be 350m

what does this error mean in terms of slope estimates?
slopes are calculated by comparing neighboring elevations

suppose two adjacent points are both at 350m

one point could actually be at 360m

the neighbor could actually be at 340m

instead of a steep slope (20m change over a 30m spacing) we would get what appears to be flat land

this assumes that neighboring errors are independent

if they were, the DEM would be virtually useless for many purposes

in fact, errors at adjacent points tend to be similar

both points are erroneously high, or erroneously low
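
a sketch of the slope argument above: two grid points 30m apart, both recorded at 350m, with 7m RMSE errors that are either independent or strongly correlated between neighbors (the 0.9 correlation is an assumed value for illustration):

```python
import math
import random

# spread of apparent slopes between two equal-elevation neighbors,
# under independent versus correlated errors
def slope_spread(correlation, n_trials=20000, spacing=30.0, rmse=7.0):
    slopes = []
    for _ in range(n_trials):
        e1 = random.gauss(0.0, rmse)
        # construct a second error with the requested correlation to e1
        e2 = (correlation * e1 +
              math.sqrt(1.0 - correlation ** 2) * random.gauss(0.0, rmse))
        slopes.append((e1 - e2) / spacing)   # the true elevations are equal
    mean = sum(slopes) / n_trials
    return math.sqrt(sum((s - mean) ** 2 for s in slopes) / n_trials)

print(slope_spread(0.0))   # ~0.33: apparent slopes of up to ~18 degrees
print(slope_spread(0.9))   # ~0.10: correlated errors largely cancel
```
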
3. Sources of error
some clues about the nature of errors can be gained by looking at the data carefully
a detailed contour map of part of the data

notice how there are vertical and horizontal ridges in parallel lines

these are created by the DEM production process, which correlates blocks of pixels on stereo air photos and tends to concentrate errors at block edges



D. ERROR IN AREA CLASS MAPS

1. Nature of errors

area class maps show a class at every point
examples are vegetation cover maps, soil maps, land use maps
they imply that class is uniform within areas, changes abruptly between areas
in fact both assumptions are wrong

there is variation within areas (heterogeneity)

due to inclusions of other classes of unknown size and frequency
there is blurring across boundaries
zones of transition
area class maps have been described as "maps showing areas that have little in common, surrounded by lines that do not exist"
example of an area class map
a map of soils in part of Northern Ohio (the Medina Quad)

originally digitized by Peter Fisher, University of Leicester

original map scale 1:15,840

4 inches to the mile
focus on the area shown in yellow
let's assume the legend says this class is "80% sand, with 20% inclusions of clay"

this map is used for many purposes

some involve land use regulation

some involve taxation, compensation

in principle, all of these are uncertain if the map is uncertain

GIS applications are in deep trouble in court if it can be shown that regulations or taxes were based on uncertain data and that no effort was made to deal with that uncertainty

2. Simulation model
one way to deal with uncertainty is to simulate the effects of the unknown variation

this map is made by a random process, constrained so that inclusions of clay (red) are small, and randomly located, and amount to 20% of the area

here is another map with the same constraints but different random locations and sizes
a parameter of the model (rho, shown in the top left) determines average inclusion size
in the first simulations it was set to .200

here and here it is increased to .240

notice how the inclusions get larger, but still occupy about 20% of the area
here and here it is increased to .250, the theoretical limit
the inclusions are still about 20% of the area
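
a minimal sketch of the simulation idea; the lecture's model is spatially autoregressive, and this sketch substitutes smoothed Gaussian noise (numpy and scipy assumed), which shows the same qualitative behavior, with the smoothing radius standing in for rho:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# spatially correlated noise, thresholded so that clay inclusions cover
# 20% of the map; larger smoothing gives larger inclusions
def simulate_inclusions(shape=(100, 100), smoothing=3.0, clay_fraction=0.2,
                        seed=None):
    rng = np.random.default_rng(seed)
    field = gaussian_filter(rng.standard_normal(shape), smoothing)
    cutoff = np.quantile(field, 1.0 - clay_fraction)
    return field > cutoff          # True where the simulated class is clay

clay = simulate_inclusions()
print(clay.mean())                 # 0.2 by construction: 20% inclusions
```
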
3. Impacts on GIS products
in practice we don't know rho
no one has ever tried to measure it for these kinds of data

but it is essential to know it if we want to determine the uncertainty of certain GIS products

e.g. uncertainty in the area of soil that is clay in a particular farmer's field

this table shows the impacts of rho on uncertainty in area
the left hand column shows rho

the top line (rho=0) corresponds to complete mixing, inclusion sizes close to zero

the bottom line (rho=.250) corresponds to the situation where the area is either all clay (probability 0.2) or all sand (probability 0.8)

this might happen in a crop example if we knew that the entire area was planted to one crop, but were uncertain which crop it was based on remote sensing

notice how the uncertainty in area estimates (last column) rises sharply with rho

it is much more of a problem with large rho, that is, large inclusions
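
the farmer's-field example, reusing simulate_inclusions() from the sketch above: clip a 20 x 20 cell window from each simulated map and record its clay fraction; the spread across runs is the uncertainty in area, and it grows sharply as the inclusions get larger:

```python
import numpy as np

# clay fraction within one "farmer's field" window, across 200 simulations
for smoothing in (0.5, 3.0, 6.0):
    fractions = [simulate_inclusions(smoothing=smoothing)[40:60, 40:60].mean()
                 for _ in range(200)]
    print(f"smoothing {smoothing}: clay fraction {np.mean(fractions):.3f} "
          f"+/- {np.std(fractions):.3f}")
```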



E. GENERAL STRATEGIES

1. Simulation strategy

these models are complex
it's not likely that the average GIS user would be able to understand them

describing uncertainty as "a spatially autoregressive model with parameter rho" doesn't help many GIS users

how to get the message across?

there are many models out there
much recent research has focused on modeling uncertainty

the average user can't be expected to understand them all

the producer of data is the person best able to describe uncertainty
uncertainty must be communicated through data quality statements

e.g. RMSE is 7m

various standards exist for describing data quality
the Spatial Data Transfer Standard (Federal Information Processing Standard 173) has five elements of quality
positional accuracy

attribute accuracy

logical consistency

do the data follow all of the expected logical rules?

e.g. do polygons close? (a toy check appears after this list)

many problems of logical consistency can be corrected automatically

completeness
are all features represented?
lineage
how the data were created, by what process
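
a toy check of the "do polygons close?" kind, for illustration only:

```python
# a digitized ring is logically consistent only if its first and last
# vertices coincide (within a tolerance); such failures can often be
# snapped shut automatically
def ring_is_closed(ring, tolerance=0.0):
    (x0, y0), (xn, yn) = ring[0], ring[-1]
    return abs(x0 - xn) <= tolerance and abs(y0 - yn) <= tolerance

print(ring_is_closed([(0, 0), (1, 0), (1, 1), (0, 0)]))   # True
print(ring_is_closed([(0, 0), (1, 0), (1, 1), (0, 1)]))   # False: gap
```
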
standards like this don't help the user who wants to know only what impact uncertainty will have on the results of analysis
e.g. knowing which model of digitizer was used is less helpful than knowing the accuracy that it produces
a general strategy for communicating about uncertainty
proposition: a method for simulating uncertainty is itself a complete description of data quality

the method is defined by the data producer

it produces simulations, each of which is an equally likely, possible version of the true map

variation among simulations represents uncertainty
the user examines the effects of different simulations on the result of analysis

the diagram compares a normal analysis done with a single data set with an analysis done repeatedly with the actual data plus a series of simulations
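
the comparison in schematic Python; analyse() and the simulations are placeholders for whatever the application and the data producer supply:

```python
# run the same analysis on the supplied data and on each producer-generated
# simulation; the spread of results is the "plus or minus" on the answer
def uncertainty_of(analyse, data, simulations):
    results = [analyse(data)] + [analyse(sim) for sim in simulations]
    return min(results), max(results)   # or percentiles, a histogram, etc.
```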

three strategies
ignore the issue completely

describe uncertainty with measures, e.g. RMSE

simulate equally probable versions of the data

2. Applets
an applet is a small piece of code, written in Java or a similar language, and distributed with the data

the picture shows a mockup of a user interface for examining possible data sets from a library or data archive

a DEM

the bounding box is shown

the sampling interval is shown, and the name of the area

the lower right shows a button

when clicked, the button will initiate a simulation process
an example
the example shows a simulation of uncertainty in the survey of a square parcel of land

each corner point is subject to an independent error in both coordinates

an RMSE of 2m
the simulation tracks the average area, standard deviation, and other statistics

it also shows a histogram of areas

execute the simulation (this will initiate a piece of Java code on your machine)
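
a Python sketch of what the applet's Java code computes; the 100m side is an assumption for illustration, since the lecture does not give the parcel size:

```python
import math
import random

# perturb each corner of a square parcel independently in x and y with a
# 2 m RMSE, recompute the area, and accumulate statistics
def survey_simulation(side=100.0, rmse=2.0, n_trials=10000):
    corners = [(0.0, 0.0), (side, 0.0), (side, side), (0.0, side)]
    areas = []
    for _ in range(n_trials):
        pts = [(x + random.gauss(0.0, rmse), y + random.gauss(0.0, rmse))
               for x, y in corners]
        twice_area = sum(pts[i][0] * pts[(i + 1) % 4][1] -
                         pts[(i + 1) % 4][0] * pts[i][1] for i in range(4))
        areas.append(abs(twice_area) / 2.0)
    mean = sum(areas) / n_trials
    sd = math.sqrt(sum((a - mean) ** 2 for a in areas) / n_trials)
    return mean, sd

mean, sd = survey_simulation()
print(f"average area {mean:,.0f} sq m, +/- {sd:,.0f} sq m")  # ~10,000 +/- ~283
```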

how well does this approach do at communicating understanding of uncertainty?