Short Introduction to Geostatistical and Spatial Data Analysis with GRASS and the R statistical language

DRAFT Document!

Distribution tests

Incomplete paragraph!
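[If you start a new session here: the dataframe from the earlier section can be recreated with a minimal stand-in, assuming it was generated with rnorm() as the text below states:]

# stand-in for the earlier example data: 100 normally distributed values
dataset <- data.frame(x=rnorm(100))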
To check whether values stored in a vector x are normally distributed, Quantile-Quantile (Q-Q) plots can be used. In our example x should be normally distributed, as it was generated by rnorm(). First we display the distribution of x in a Q-Q plot:
qqnorm(dataset$x)
[Figure: Q-Q plot of x]
As the graph is close to the 45° line, it shows a normal distribution. As we have generated our variables with rnorm(), this was expected. Plot a line which passes through the first and third quartiles:
qqline(dataset$x)
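The paragraph above is marked as incomplete; one addition that fits here (our suggestion, not in the original draft) is a formal normality test:

# Shapiro-Wilk test; the null hypothesis is that the data are normal,
# so a large p-value means no evidence against normality
shapiro.test(dataset$x)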

Now we throw away our normally distributed dataframe and use built-in demo data for further analysis:

rm(dataset, dataset.new)

5. Working with elevation data (DEM)

[If starting a new session from here, load the MASS library (see above).]

There is a nice volcano elevation model (from the description: "Maunga Whau (Mt Eden) is one of about 50 volcanos in the Auckland volcanic field.  This data set gives topographic information for Maunga Whau on a 10m by 10m grid"). We just load it:

data(volcano)

To find out about the structure of this dataset (it should be a grid, i.e. a matrix):

str(volcano)

"Filled contours" looks nice with terrain color table:

filled.contour(volcano, col=terrain.colors(30))
[Figure: volcano (Maunga Whau/Mt Eden), Auckland]
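Not in the original draft, but worth trying: a shaded perspective view of the same matrix (the call follows the persp() example from the R documentation):

# shaded 3D perspective view of the volcano matrix
persp(volcano, theta=135, phi=30, scale=FALSE, shade=0.75,
      border=NA, box=FALSE, col="green3")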

Note: You can only display surfaces, if you actually have a surface. The "volcano" dataset is such a data matrix, but the "topo" dataset consists of irregularly distributed elevation points (which need to be interpolated).
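If you want to see that difference right away, peek at the point structure of topo (it will be loaded properly in section 6):

library(MASS)   # the topo dataset ships with the MASS package
data(topo)
str(topo)       # a data frame of x, y, z points, not a matrix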

A method to analyse the DEM quality is to use the ECDF (empirical cumulative distribution function):

library(stats)
plot(ecdf(volcano))
plot(ecdf(volcano), verticals=T, do.points=F)

The ecdf() call is wrapped in the plot() command. The ECDF is also known as the hypsometric integral.

[Figure: hypsometric integral (ECDF)]
Steps are clearly visible in the curve; they indicate interpolation artefacts.
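To illustrate what ecdf() computes, here is a minimal hand-rolled version (just a sketch, not the library implementation):

# ECDF by hand: sorted values against cumulative proportions
z <- sort(as.vector(volcano))
plot(z, seq_along(z)/length(z), type="s", xlab="elevation",
     ylab="cumulative proportion")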

Another function to analyse DEM quality is the density function (a smoothed histogram). The function density() computes kernel density estimates with a given kernel (default: gaussian) and bandwidth. If the DEM was interpolated from contour lines, we can probably see a local over-representation of the contour values compared to the interpolated values (depending on the interpolation method). See Jacoby, W.G. (1997) "Statistical graphics for univariate and bivariate data" [p. 23 ff., SAGE Publications] for details:

plot(density(volcano, bw=2))
lines(density(volcano, bw=4), col="green")
lines(density(volcano, bw=8), col="red")
lines(density(volcano, bw=12), col="cyan")

Let's add a legend:

legend(locator(1), legend=c("bw=2", "bw=4", "bw=8", "bw=12"),
       col=c("black", "green", "red", "cyan"), lty=1)
You will have to position the legend with the mouse (locator(1) waits for a single click).
[Figure: density plots of the volcano DEM at different bandwidths]
What is the purpose of density plots? Here the elevation is plotted at different bandwidths (used in kernel density estimation). The black line represents 2m bandwidth, green 4m, red 8m and cyan 12m bandwidth for the gaussian kernel. It can be seen that from 4m bandwidth onwards the interpolation artifacts are less intrusive. In conclusion this would suggest smoothing the elevation model with an 8x8 matrix (for example in a GIS or directly in R, see the sketch below). There are no general rules to find the optimal bandwidth (according to Jacoby 1997).
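A minimal sketch of such a smoothing directly in R (the 8x8 moving-average window is our reading of the conclusion above; filter() leaves NA values at the borders):

# separable 8x8 mean filter: smooth all columns, then all rows
smooth2d <- function(m, n=8) {
  k <- rep(1/n, n)
  s <- apply(m, 2, function(col) filter(col, k))
  t(apply(s, 1, function(row) filter(row, k)))
}
volcano.smooth <- smooth2d(volcano)
# artifacts should be less prominent now, even at a small bandwidth
plot(density(volcano.smooth[!is.na(volcano.smooth)], bw=2))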

--
Sidenote:
#incomplete
Now we try some smoothing of the volcano DEM. To smooth the elevation values, we need to convert the data structure from the elevation matrix into a data vector:

zvector <- as.vector(volcano)
str(zvector)

The smoothing algorithm expects the value index, which we can generate with the sequence function (from 1 to the number of values, with stepsize 1):

valuenumber <- seq(1, length(zvector), 1)
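The draft stops short here. As a hedged illustration (the author intended the kde2d approach mentioned below; lowess() is just our stand-in), the value series can be smoothed like this:

# locally weighted regression; f is the fraction of points per local fit
zsmooth <- lowess(valuenumber, zvector, f=0.01)
plot(valuenumber, zvector, type="l", col="grey")
lines(zsmooth, col="red")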

To smooth the elevation values we will use the kde2d function of the GRASS/R interface (see related text below).

6. Interpolation methods/Spatial statistics

[Note: You can start a new session from here]

Interpolations are useful especially for irregularly distributed point data. R provides a sample elevation dataset. Loading the "topo" dataset of irregular elevations:

library(MASS)
data(topo)

Before interpolating, we convert the x, y and z values to SI units (see ?topo for original imperial units):

topo.meter <- data.frame(topo$x*0.3048*50, topo$y*0.3048*50, topo$z*0.3048)
names(topo.meter) <- c("x", "y", "z")
str(topo.meter)
rm(topo)
The "names" function is used to rename the variables within the dataframe (for convenience). Please keep in mind that some R functions expect the variables being named as "x", "y", "z" etc. The "c()" function is composing a character vector, which is assigned to the names within the dataframe.

Plot the elevation points, add text labels (pos=1 puts the labels south of the data points, round() gets rid of the decimals):

plot(topo.meter$x, topo.meter$y)
text(topo.meter$x, topo.meter$y, round(topo.meter$z), pos=1)

Instead of plotting with the text() function you can also identify data points by mouse (the identification is terminated by pressing any mouse button other than the first):

plot(topo.meter$x, topo.meter$y)
identify(topo.meter$x, topo.meter$y, labels=topo.meter$z)

Next we want to interpolate a surface. Different interpolation methods are provided by R. Additionally, the GRASS interpolation methods could be used (s.surf.rst etc., through the system() function in R).

6 a) Constant and Linear Interpolation (without error prediction)

[base package]
approxfun()
This function is limited to one-dimensional data and cannot be used for spatial interpolation.
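A tiny example of what approxfun() does handle, interpolation along a single axis, which shows why scattered x/y/z points are out of its scope:

# linear interpolation between known (position, value) pairs
f <- approxfun(c(0, 50, 100), c(10, 30, 20))
f(25)   # returns 20, halfway between 10 and 30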

6 b) Trend surface fitting by least squares (without error prediction)

[spatial package]
The first step is to build the model (a second-order polynomial):
library(spatial)
topo.meter.ls2 <- surf.ls(2, topo.meter)

A model description can be obtained with str():

str(topo.meter.ls2)
Then the surface will be evaluated from the trend model. The trmat() function needs to know the boundaries of the region. We can get these values by calculating min() and max() of the x and y coordinates. The last parameter is the number of cells in each direction. Here we want 2m resolution (n=100/2=50):
summary(topo.meter)
topo.meter.surface2 <- trmat(topo.meter.ls2, 0, 100, 0, 100, 50)
The surface can be visualized by image() or contour() functions:
image(topo.meter.surface2)
contour(topo.meter.surface2)
contour(topo.meter.surface2, labcex = 0.8)
If you can't read the contour labels (like me), use the labcex parameter. You can even overlay contour and image:
image(topo.meter.surface2)
contour(topo.meter.surface2, labcex = 0.8, add=T)
[Figure: least squares trend surface]
Probably a second-order polynomial is not sufficient to describe this topo surface. We can try higher polynomial degrees (here: 6th order):
topo.meter.ls6 <- surf.ls(6, topo.meter) 
str(topo.meter.ls6)
topo.meter.surface6 <- trmat(topo.meter.ls6, 0, 100, 0, 100, 50)
image(topo.meter.surface6)
contour(topo.meter.surface6, labcex = 0.8, add=T)
points(topo.meter$x, topo.meter$y)
[Figure: 6th-order least squares trend surface]
A filled contour plot is also nice (as it also provides a legend):
filled.contour(topo.meter.surface6, col=terrain.colors(20))
[Figure: filled contour plot of the 6th-order trend surface]
Cleaning up the memory:
ls()
rm(topo.meter.surface2, topo.meter.surface6)

6 c) Local polynomial regression fitting with error prediction

[modreg package]
First we build a model with local regression. loess() fits a quadratic surface (degree=2) by weighted least squares. degree is the degree of the polynomials to be used, up to 2; span controls the degree of smoothing. Set normalize to FALSE for spatial coordinate predictors:
library(modreg)
topo.meter.loess <- loess(z ~ x*y, topo.meter, degree=2, span=0.25, normalize=F)
str(topo.meter.loess)
For error prediction, we have to generate a grid. The predict() function calculates the surface and error surface from the loess model:
topo.meter.points <- list(x=seq(0, 100, 1), y=seq(0, 100, 1))
topo.meter.surf <- predict(topo.meter.loess, expand.grid(topo.meter.points), se=T)
The object topo.meter.surf$se.fit contains the spatially distributed error of the local regression. Both the resulting surface and the error surface can be displayed side by side. par() with mfrow splits the screen (here into 1 row and 2 columns):
par(mfrow=c(1,2))
xlen <- length(topo.meter.points$x)
ylen <- length(topo.meter.points$y)
contour(topo.meter.points$x, topo.meter.points$y, matrix(topo.meter.surf$fit,
        xlen, ylen), xlab="interpolated surface", labcex=0.8)
contour(topo.meter.points$x, topo.meter.points$y, 
        matrix(topo.meter.surf$se.fit, xlen, ylen),
        xlab="standard error", labcex=0.8)
Note: The topo.meter.surf$fit needs to be converted into a matrix using the matrix() function.
[Figure: loess interpolated surface and its standard error]
You can try loess with degree=1. For comparison purposes we draw the filled contour surface again (to compare with the trend surface above). Because filled.contour() wants the x,y coordinates and the z values in a list, we have to convert the data types (converting z into a matrix on the fly):
topo.meter.surfaces <- list(topo.meter.points$x,topo.meter.points$y,
                               matrix(topo.meter.surf$fit, xlen, ylen))
names(topo.meter.surfaces) <- c("x", "y", "z")
filled.contour(topo.meter.surfaces, col=terrain.colors(20))
[Figure: filled contour plot of the loess surface]
Note the differences from the trend surface above (which ignored local effects). Note additionally the slightly smaller loess surface (loess predictions are not extrapolated beyond the data range).

Cleanup:

rm(topo.meter.surfaces, topo.meter.points, topo.meter.surf,
topo.meter.loess)

6 d) Trend surface fitting by generalized least squares

# incomplete
GLS can be used for kriging interpolation (see below). This method is somewhat similar to the least squares method above. First reset the graphical output:
library(spatial)
par(mfrow=c(1,1))
Build the least squares model (for comparison purposes):
topo.meter.ls2 <- surf.ls(2,topo.meter)
The correlogram() function divides the range of the data into nint bins (here 10), computes the covariance for pairs with a separation falling in each bin, and then divides by the variance:
correlogram(topo.meter.ls2, 10)
....
Generalized least squares:
topo.meter.gls6 <- surf.gls(6, expcov, topo.meter, d=0.7)
str(topo.meter.gls6)
The covariance model used here is expcov, an exponential covariance function.
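To see the shape of the assumed covariance model, expcov() can be plotted directly (a small sketch; d=0.7 is the range parameter passed to surf.gls() above):

# exponential covariance as a function of distance r
r <- seq(0, 3, 0.05)
plot(r, expcov(r, d=0.7), type="l", xlab="distance", ylab="covariance")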

Evaluation of the surface from surface model:

topo.meter.gsurface6 <- trmat(topo.meter.gls6, 0, 100, 0, 100, 50)

Finally look at resulting surface:

image(topo.meter.gsurface6)
contour(topo.meter.gsurface6, labcex = 0.8, add=T)
points(topo.meter$x, topo.meter$y)
filled.contour(topo.meter.gsurface6, col=terrain.colors(20))

6 e) Kriging with automated variogram model estimation

#incomplete/draft!:

Example: Maas contaminated soils sample dataset

[You can start a new session from here, you need GRASS 5.0 to be installed]

In this example the "maas" dataset is used. "maas" represents zinc measurements as a groundwater quality variable along the Dutch bank of the River Maas just north of Maastricht (source: "gstat" software written by E.J. Pebesma, http://www.frw.uva.nl/~pebesma/gstat/). Burrough 1998 (pp. 102 ff.) uses a subset of the same dataset (reduced area). The examples will only work if R is started from inside GRASS. The data have been reprojected to UTM zone 32, using the WGS84 ellipsoid. An empty GRASS location has to be defined with the following parameters:
Projection UTM, ellipsoid WGS84, zone 32, north 5652930, south 5650610, west 269870, east 272460, nsres 10, ewres 10, rows 232, cols 259. The pre-defined GRASS location can also be downloaded.
To begin, start GRASS with the "maas" location and R within GRASS.

First we load the library of geostatistical functions:

library(sgeostat)
Then we need the Maas dataset, which is included in the GRASS/R interface package:
library(GRASS)
G<-gmeta()

Now we load the maas dataset, a dataframe of coordinate pairs and the related zinc concentrations:

data(maas)
str(maas)
To plot the individually measured zinc concentrations, use plot() and text() for labels:
plot(maas$x, maas$y); text(maas$x, maas$y, maas$zinc)
Optionally the area border can be plotted:
data(maas.bank)
lines(maas.bank$x, maas.bank$y, col="blue")
The columns within the dataframe must be renamed to match the naming conventions of some functions used later on:
names(maas) <- c("x", "y", "z")
str(maas)
Before interpolating, the distribution type of the dataset has to be checked. Compare the result of:
plot(density(maas$z))
and
plot(density(log(maas$z)))
The original dataset appears to show a skewed distribution. Therefore a log transformation is applied (typical for geospatial data) to normalize the dataset, as a normal distribution is required for kriging. The log-transformed density plot shows a bimodal distribution.
#Note: plot a gauss distribution additionally for comparison!
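That note is an open todo in the draft; a minimal sketch of such a comparison (the overlay approach is our assumption):

# overlay a normal curve with the sample mean/sd of log(zinc)
lz <- log(maas$z)
plot(density(lz))
curve(dnorm(x, mean=mean(lz), sd=sd(lz)), add=TRUE, col="red", lty=2)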
With:
par(mfrow=c(1,2))
you can split the output screen into two parts and display both distributions at the same time (we reset directly after plotting):
plot(density(maas$z))
plot(density(log(maas$z)))
par(mfrow=c(1,1))

[Figure: Maas zinc concentration distributions]

Note: As we still have no surface (only zinc concentration points), we cannot display contour lines or the surface yet. We can also try a different approach to normalize the dataset, a power transformation:
par(mfrow=c(2,2))
plot(density(maas$z^-0.2))
plot(density(maas$z^-0.4))
plot(density(maas$z^-0.6))
plot(density(maas$z^-0.8))
par(mfrow=c(1,1))
[Figure: power transformations to normalize the Maas dataset]
Probably the zinc data have a global trend. A quadratic trend analysis (Least Squares) will tell us more:
library(spatial)
maas.ls2 <- surf.ls(2, maas)
Now we have a trend model. The next step is to evaluate the trend surface. The trmat() function needs to know the boundaries of the region; we can get these values using min() and max() of x and y. The trmat() function also needs the resolution of the final grid. In this example we want 20m resolution. To know the extent, we calculate the range of the coordinates:
max(maas$x)-min(maas$x)
max(maas$y)-min(maas$y)
As the region is not square, we cannot define the resolution exactly (unfortunately the trmat() function accepts only a single resolution parameter). The last parameter of trmat() is the number of cells in each direction, which we get by dividing the map extent by the desired resolution (rounding up the result):
(max(maas$y)-min(maas$y))/20
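To do the rounding up explicitly, ceiling() can be used:

ceiling((max(maas$y)-min(maas$y))/20)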
maas.trend2 <- trmat(maas.ls2, min(maas$x), max(maas$x), 
                  min(maas$y), max(maas$y), 200)
Plotting the quadratic trend surface:
image(maas.trend2)
points(maas)
text(maas$x, maas$y, maas$z, pos=1)
[Figure: quadratic trend surface of the original Maas zinc data]
We will also try to calculate a cubic trend surface:
maas.ls3 <- surf.ls(3, maas)
maas.trend3 <- trmat(maas.ls3, min(maas$x), max(maas$x),
                        min(maas$y), max(maas$y), 200)
image(maas.trend3)
points(maas)
text(maas$x, maas$y, maas$z, pos=1)
The results show that no trend has to be eliminated (note: we have analysed the original dataset, not the log-transformed one!). Now we try to generate a variogram from the log-transformed data (you may also try the power-transformed data). First we generate a new dataframe, then generate the variogram:
logmaas <- data.frame(x=maas$x, y=maas$y, lz=log(maas$z))
str(logmaas)
Before modeling the variogram, the dataset has to be slightly modified. First we create an object of class "point" from the data frame, representing the observed data of a spatial process:
logmaas.point <- point(logmaas)
Then we create a pair object from the point object. A pair object contains information defining pairs of points contained in a point object; it is a list containing five vectors: from, to, lags, dist, and bins. The maxdist depends on the area's size (determined from the coordinate range, see above).
# mhhh, how to determine maxdist and num.lag correctly?
logmaas.pair <- pair(logmaas.point, num.lags=10, maxdist=2000)
print.pair(logmaas.pair)
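The comment above asks how to determine maxdist; there is no unique answer. A common rule of thumb (our addition, not from the original text) is to take roughly half of the largest point-to-point distance:

# half the maximum inter-point distance ("diameter" of the point cloud)
max(dist(cbind(logmaas$x, logmaas$y)))/2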
To fit a variogram model, the routine will need initial values. You have to read these values from the estimated variogram. First we estimate the empirical variogram parameters:
logmaas.estvar <- est.variogram(logmaas.point, logmaas.pair, "lz")
The lags, bins and fitting parameters can be printed directly:
logmaas.estvar
Now we plot the variogram estimates and read the initial parameters from the plot. These parameters will be passed to the fit.variogram() function later:
plot(logmaas.estvar)
text(logmaas.estvar$bins, logmaas.estvar$classic/2,
     labels=logmaas.estvar$n, pos=1)
[Figure: semivariogram of log-transformed zinc data (maas)]
The image above shows the estimated semivariogram. For kriging, we need to fit a variogram model. The function used to generate a variogram model needs some initial values to start from, which we have to read from the image above. The values are as follows:

[Figure: parameters of an sgeostat variogram]
The general usage is "fit.variogram(model="exponential", estvar, nugget, sill, range, ...)".
The first model to try is the gaussian (get the initial values from the variogram estimator image). The nugget and sill can be read directly from the image; the ag value needs to be calculated (ag=range/sqrt(3), the estimated range being 1100):
ag<-1100/sqrt(3)
ag
The calculated ag value is about 635 [it might be confusing that not the real range but ag has to be passed as range].
logmaas.varGmod <-fit.variogram("gaussian",
    logmaas.estvar, nugget=0.2, sill=0.5, range=635, plot.it=T,
    iterations=30)
Details about nugget, sill, range can be printed:
logmaas.varGmod
[Figure: gaussian variogram of the Maas zinc data]
To fit an exponential variogram model, nugget and sill can be entered directly; ae has to be calculated (ae=range/3):
ae<-1100/3
ae
logmaas.varEmod <-fit.variogram("exponential", logmaas.estvar,
   nugget=0.2, sill=0.5, range=350, plot.it=T, iterations=30)
Details about nugget, sill, range are printed with:
logmaas.varEmod
[Figure: exponential variogram of the Maas zinc data]
Create a spatially lagged scatter plot, i.e. plot z(s) versus z(s+h), where h is a lag in a pair object:
lagplot(logmaas.point, logmaas.pair, "lz")
Note: If you want to plot the stored variogram model later, use:
plot(logmaas.estvar, logmaas.varGmod)
plot(logmaas.estvar, logmaas.varEmod)
Plot boxplots of square-root or squared differences between variable values at pairs of points versus the distance between the points:
spacebox(logmaas.point, logmaas.pair, "lz")
Plot a scatter plot of square-root or squared differences between variable values at pairs of points versus the distance between the points:
spacecloud(logmaas.point, logmaas.pair, "lz")
Now kriging follows to interpolate the continuous surface. It requires a grid of points at which the prediction is carried out; here the GRASS region definition G (from gmeta() above) provides this grid. Then the surface generation is done:
surf.krig<- krige.G(logmaas.point, "lz", logmaas.varEmod, G)
str(surf.krig)
summary(surf.krig$zhat)
summary(surf.krig$sehat)
The result is stored in surf.krig$zhat and the standard error in surf.krig$sehat:
plot(G, surf.krig$zhat, col=grey(9:2/9))
title("Ordinary kriging predictions ln(Zinc)")
plot(G, surf.krig$sehat, col=grey(9:2/9))
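As earlier, the river bank line can be overlaid for orientation (assuming the kriged map is drawn in the same coordinates as the maas points):

lines(maas.bank$x, maas.bank$y, col="blue")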
A nice example with data(utm.maas) (UTM-projected data for GRASS, also included) can be run with:
example(krige.G)
--------
DRAFT: Trend removal from the input data before variogram estimation: lm() is used to fit linear models (this subsection is not working yet):
l1 <- lm(lz~x+y+I(x^2)+I(y^2)+I(x*y), logmaas)
logmaas$res <- residuals(l1)
logmaas.point <- point(logmaas)
logmaas.pair <- pair(logmaas.point, num.lags=10, maxdist=2000)
logmaas.estvar <- est.variogram(logmaas.point, logmaas.pair, "res")
plot(logmaas.estvar)

--
Further notes
The "dim" command can be used to retrieve cols and rows of a matrix (like this DEM). We store this information:

library(MASS)   # reload topo, it was removed above
data(topo)
topo.rows <- dim(topo)[1]
topo.cols <- dim(topo)[2]

Just entering the object's name tells us about its value(s):

topo.cols

[to be continued one day: distribution tests, regression (nlme), kriging, variograms (nlme, sgeostat), classifications (tree classifier), cluster analysis, point pattern analysis [splancs], time series analysis, neural networks, ...]



© 2000-2003 Markus Neteler
Last update: $Date: 2004-06-21 15:23:19 +0200 (Mon, 21 Jun 2004) $