For the R/PostgreSQL connection we need two extra packages. Since 10/2003 there is a new Rdbi/PostgreSQL interface for R-stats. Download it from BioConductor (but not the old SourceForge version!):
Interface installation (enter in shell terminal, appropriate version number):
R CMD INSTALL Rdbi_1.0.2.tar.gz R CMD INSTALL RdbiPgSQL_1.0.2.tar.gz
Note: The RODBC interface in an alternative, but the data transfer is rather slow.
PostgreSQL installation
We assume that you have a working installation of PostgreSQL 7.3 or later. For GNU/Linux RPM packages are available, for Mac OSX packages from FINK.
psql template1 Welcome to psql 7.3.2, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help on internal slash commands \g or terminate with semicolon to execute query \q to quit template1=> \q #maybe only user "postgres" is available, so start like this: psql -U postgres template1 Welcome to psql 7.3.2, the PostgreSQL interactive terminal. [...] \q
Server does not run? This requires 'root' permissions:
su #start postgresql daemon: service postgresql start exit #To enable the PostgreSQL server to launch after boot: chkconfig --list postgresql chkconfig postgresql on chkconfig --list postgresql #... should be changed to 'on' in some levels now.
Server running? Yes (maybe it tells you that the user name does not exist).
We add a new user now:
#become 'root': su #become special user 'postgres' su postgres #create new PostgreSQL user (warning, we don't use a passowrd here!): createuser -a -d pablo #... CREATE USER should be printed exit exit
Now retry to connect (see above, 'psql ...').
Then we explore a bit more (note that the TAB key expands command and table names). Every query has to be followed with semicolon:
psql template1 [...] template1=> SELECT * from pg_user; [...] #show all tables (called 'relation' in SQL speech): \dt No relations found. #what time is it? SELECT CURRENT_TIMESTAMP; timestamptz ------------------------------- 2003-11-01 13:24:42.960403+00 (1 row) #get some help for all backslash commands: \? [...] #get some help for all commands: \h Available help: ABORT CREATE TABLE EXECUTE [...] #print help for 'CREATE TABLE' command: \h CREATE TABLE [...] #get out of PostgreSQL: \q
Now we want to create an own database.
This requires 'root' permissions:
su #stop postgresql daemon (never ! modify databases with PostgreSQL still running): service postgresql stop #just notes, we won't do it now: # NOTE 1: new users are added in: /var/lib/pgsql/data/pg_hba.conf # it is recommended to use "md5" encryption when defining users # NOTE 2: if network access is desired, activate "tcpip_socket = true" in # /var/lib/pgsql/data/postgresql.conf #so we start postgresql daemon again: service postgresql start #-> should print "ok" #to create a new database, we login as 'postgres' user: su postgres createdb grasscourse exit #exit from root exit #now we are normal user again. #Check available databases (the new DB 'grasscourse' should be listed): psql -l
Now we create a new table within this database:
psql grasscourse #Note: column definitions are comma separated, no comma after last column, #to make it more sophisticated, we add a constraint (Y, N) to # column 'enjoyed' to only accepts these two values for that column: CREATE TABLE cities ( id integer, name varchar(50), lastvisit date, enjoyed varchar(1) check (enjoyed in ('Y','N')) ); # -> CREATE TABLE should be printed #list all tables: \d #get column info: \d cities #Insert data into the new table (use single quotes!): #Date order: YY-mm-dd INSERT INTO cities VALUES ( 1, 'Trento', '2003-11-01', 'Y'); # -> INSERT some_numbers should be printed INSERT INTO cities VALUES ( 2, 'Milano', '2003-10-12', 'N'); INSERT INTO cities VALUES ( 3, 'Povo', '2003-11-25', 'Y'); #test for the constraint: INSERT INTO cities VALUES ( 4, 'Bolzano', '2003-10-23', 'A'); # -> should fail: 'A' not accepted #test for the date INSERT INTO cities VALUES ( 4, 'Bolzano', '2003-10-33', 'Y'); # -> should fail: day 33 not accepted #now we query the table: SELECT * FROM cities; id | name | lastvisit | enjoyed ----+--------+------------+--------- 1 | Trento | 2003-11-01 | Y 2 | Milano | 2003-10-12 | N 3 | Povo | 2003-11-25 | Y (3 rows) #now we query the table, ordered by one colum: SELECT * FROM cities order by lastvisit; id | name | lastvisit | enjoyed ----+--------+------------+--------- 2 | Milano | 2003-10-12 | N 1 | Trento | 2003-11-01 | Y 3 | Povo | 2003-11-25 | Y (3 rows) #now we query the table, but only two columns: SELECT name,lastvisit FROM cities order by lastvisit; name | lastvisit --------+------------ Milano | 2003-10-12 Trento | 2003-11-01 Povo | 2003-11-25 (3 rows) #leave the 'psql' terminal: \q
R #it will appear something like this: R : Copyright 2003, The R Development Core Team Version 1.8.0 (2003-10-08) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for a HTML browser interface to help. Type 'q()' to quit R. >
plot(rnorm(123)) q() Save workspace image? [y/n/c]: n
Since you are able to start and leave R, we can try a sample session:
R #get help in terminal window: ?help #use q key to leave the help, cursor and space to browse the text. #open help in browser: help.start() 1+1 # [1] 2 #NOTE: if you get lost while entering a command, CTRL-C is your friend to #get back to the prompt. x <- 1+1 x # [1] 2 #make a sequence of 12 numbers: seq(1,12) #assign result to a variable: x <- seq(1,12) #print variable contents: x # [1] 1 2 3 4 5 6 7 8 9 10 11 12 #what's this variable: str(x) #simple statistics of this variable: summary(x) #see command help (in browser, if still open): ?summary #make some categorical data (called "factors" in R speech): rgb <- c("red", "green", "blue") rgb rgb[2] #have some more: infra <- c("nir", "mir", "fir") thermal <- c("tir", NA, NA) #for convenience we can handle variable composites in a "data frame": landsat5 <- data.frame(rgb, infra, thermal) landsat5 str(landsat5) #Now we look at some sample maps: # Maunga Whau (Mt Eden) is one of about 50 volcanos in the Auckland # volcanic field. This data set gives topographic information for # Maunga Whau on a 10m by 10m grid. data(volcano) str(volcano) #... is a matrix of numerics plot(volcano) #...looks strange! #but a nice map: filled.contour(volcano, color = terrain.colors, asp = 1) title(main = "Maunga Whau volcano: filled contour map")
Up to now we used only base functions and data. There are many extensions, called R libraries which provide enormous functionality. One is the "GRASS" library to link R and GRASS:
#load GRASS library (GRASS/R interface): library(GRASS) #there are sample data built-in, e.g. river Maas soil pollution data: ?utm.maas #many commands in R provide an example. You can just run it: example(utm.maas) #get out of this R session: q() Save workspace image? [y/n/c]: n
Next task is to use the Rdbi interface to connect R and PostgreSQL:
R library(Rdbi) conn <- dbConnect(PgSQL(), host="localhost", dbname="grasscourse") #list available tables: dbListTables(conn)
Reading data (the table we created above) from PostgreSQL into R:
res <- dbSendQuery(conn, "select * from cities") cities <- dbGetResult(res) str(cities) `data.frame': 3 obs. of 4 variables: $ id : int 1 1 1 $ name : chr "Trento" "Milano" "Povo" $ lastvisit: chr "11-01-2003" "10-12-2003" "11-25-2003" $ enjoyed : chr "Y" "N" "Y" #disconnect: dbDisconnect(conn) #print the downloaded data frame: # date: mm-dd-YYYY cities id name lastvisit enjoyed 1 1 Trento 11-01-2003 Y 2 2 Milano 10-12-2003 N 3 3 Povo 11-25-2003 Y #make the character date a real date: tmp <- strptime(cities$lastvisit, "%m-%d-%Y") tmp [1] "2003-11-01" "2003-10-12" "2003-11-25" # YYYY-mm-dd is new date format, so identical to PostgreSQL order #update in the data frame: #cities$lastvisit <- strptime(cities$lastvisit, "%m-%d-%Y")
So far, so nice. Next step is to learn writing data from R to PostgreSQL (e.g. you results): For illustration we use some meteo data, a time series of Nottingham average air temperatures at Nottingham Castle in degrees F for 20 years (note that the MASS library is part of the VR bundle):
library(MASS) data(nottem) #let's look at the data (note: Fahrenheit, not Celsius!)): str(nottem) summary(nottem) nottem plot(nottem) #connect to PostgreSQL: library(Rdbi) conn <- dbConnect(PgSQL(), host="localhost", dbname="grasscourse") #write data to PG database as new table "nottingtemp": dbWriteTable(conn, nottem, "nottingtemp") dbDisconnect(conn)
psql grasscourse \d SELECT * from nottingtemp; \q
grass57 #reset region: g.region -dP #within GRASS, start R: R [...] library(GRASS) #load the GRASS environment into R: G <- gmeta() # see current GRASS location settings: summary.grassmeta(G) # print list of available raster maps: system("g.list rast") #NOTE: You always need to specify "G" for the next commands! # #load the 'elevation.dem' raster map into R: elev <- rast.get(G,"elevation.dem") str(elev) #look at the map: plot(G, elev$elevation.dem) plot(G, elev$elevation.dem, col=terrain.colors(20)) #draw elevation histogram: library(MASS) truehist(elev$elevation.dem, nbins=200) #Is that normally distributed? qqnorm(elev$elevation.dem) qqline(elev$elevation.dem) #-> no, deviates from qqline #delete the object from R memory: rm(elev) #Now we import two raster maps into a data frame: #cat: T (true) or F (false) enables to copy over category labels rather than #numeric values: spearfish <- rast.get(G, c("elevation.dem","vegcover"),cat=(c(F,T))) str(spearfish) summary(spearfish$elevation.dem) summary(spearfish$vegcover.f) #we can also plot the categorical 'vegcover' map, converting to numeric #on-the-fly: plot(G, unclass(spearfish$vegcover)) #get basic statistics on the distribution of vegetation related to #elevation: tapply(spearfish$elevation.dem, spearfish$vegcover, summary) #make a boxplot: boxplot(tapply(spearfish$elevation.dem, spearfish$vegcover, summary)) #vice-verse (warning: long calculation...): tapply(spearfish$vegcover, spearfish$elevation.dem, summary)
grass57 #set resolution/region to Celsius temperature map from lecture 4: g.region rast=temp_celsius_12Jul2000 -p #within GRASS, start R: R library(GRASS) G <- gmeta() #NOTE: You always need to specify "G" for the next commands! # #load the 'elevation.dem' raster map into R: elev <- rast.get(G,"elevation.dem") str(elev) #load the 'temp_celsius_12Jul2000' raster map into R: celsiusmap <- rast.get(G,"temp_celsius_12Jul2000") str(celsiusmap) #Note that R changes underscores to dots #now we extract sample points. First get 200 coordinate pairs: str(G) #East: E.random <- sample(G$xseq,200) #North: N.random <- sample(G$yseq,200) #merge to a data frame: randompoints <- data.frame(E.random,N.random) str(randompoints) #nice, but complicated. Easier to fetch the cell indexes: randompoints <- sample(G$Ncells,200) #now we query the maps at the randomly selected cell numbers: elevsamples <- elev$elevation.dem[randompoints] hist(elevsamples) celsiussamples <- celsiusmap$temp.celsius.12Jul2000[randompoints] hist(celsiussamples) #...the sample data are ready. #look at the plot: plot(celsiussamples,elevsamples) #Linear regression analysis (Linear Model): lm(elevsamples ~ celsiussamples) summary(lm(elevsamples ~ celsiussamples)) #For a related discussion, compare Dalgaard 2002, p.97ff #plot in one graph data and regression line (note parameter order!): plot(celsiussamples,elevsamples) abline(lm(elevsamples ~ celsiussamples)) #The result is: temperature decreasing with height, but the linear #correlations is poor (as expected).
#simulate 4 time series for 4 stations, 500 days: sim1 <- as.double(arima.sim(list(order=c(1,1,0),ar=0.7),n=499)) sim2 <- as.double(arima.sim(list(order=c(1,1,0),ar=0.7),n=499)) sim3 <- as.double(arima.sim(list(order=c(1,1,0),ar=0.7),n=499)) sim4 <- as.double(arima.sim(list(order=c(1,1,0),ar=0.7),n=499)) #station heights: height1 <- seq(122,122,l=500) height2 <- seq(332,332,l=500) height3 <- seq(1543,1543,l=500) height4 <- seq(534,534,l=500) #generate 500 days: ISOdate(2002, 7, 11) - ISOdate(2001, 2, 26) days <- seq(ISOdate(2001, 2, 26),ISOdate(2002, 7, 10), by="d") days #we need a special table structure: tmpsim <- print(c(sim1,sim2,sim3,sim4)) tmpdate <- print(c(days,days,days,days)) tmpheight <- print(c(height1,height2,height3,height4)) #put all in one data frame: meteo <- data.frame(tmpheight,tmpdate,tmpsim) names(meteo) <- c("height","date","var") rm(tmpheight,tmpdate,tmpsim) str(meteo) library(lattice) xyplot(var ~ date | height, data=meteo, type="l")