Features:
A list of things you need to know how to do:
All Functions and Datasets | Add to Existing Plot | ANOVA Models | Bootstrap Methods |
Input/Output-Files | |||
Missing Values | |||
Starting a new Splus project:
[48]ehlehl% cd Splus5project
[49]ehlehl% Splus5 CHAPTER
Creating data directory for chapter .
Splus5 chapter Splus5project initialized.
[50]ehlehl% Splus5
S-PLUS : Copyright (c) 1988, 1999 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 5.1 Release 1 for Sun SPARC, SunOS 5.5 : 1999
Working data will be in .Data
Getting data into Splus:
> # Making a vector of numbers > xshort <- c(0,1,2,3,4) # c is a function which can take an # arbitrary number of arguments and # strings them out one after another > # Printing out an object -- type the name > xshort [1] 0 1 2 3 4 > # Make a big object > times <- seq(from=0, to=10,length=1001) > # Print part of times > times[1:10] [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 > # Generate normal data noise <- rnorm(1001) > # Read data from a file > gas <- read.table("gas.data",header=TRUE) > gas[1:3,] Cost Distance UnitPrice 1 9.80 390.4 31.2 2 10.41 413.6 31.4 3 10.50 415.2 31.5 > # gas is now a data.frame > summary(gas) Cost Distance UnitPrice Min.: 6.33 Min.:198.4 Min.:31.20 1st Qu.:11.21 1st Qu.:423.4 1st Qu.:36.10 Median:13.12 Median:443.1 Median:40.10 Mean:12.71 Mean:443.0 Mean:39.15 3rd Qu.:14.09 3rd Qu.:465.6 3rd Qu.:43.30 Max.:15.67 Max.:534.5 Max.:47.10 > # Now what objects have we created? > objects() [1] ".Last.value" "gas" "last.dump" "times" [5] "xshort"
Manipulating data
> # Most functions operate on vectors componentwise > y <- sin(2*pi*times) > # add columns to the data.frame > # Compute litres of gas used in tankful > gas$GasUsed <- gas$Cost/(gas$UnitPrice/100) > # Compute litres per hundred kilometres > gas$Consumption <- gas$GasUsed/(gas$Distance/100) > var(gas[,c(2,4,5)]) Distance GasUsed Consumption Distance 1886.49916 90.893806 -10.2972991 GasUsed 90.89381 8.195885 0.3428260 Consumption -10.29730 0.342826 0.2426734 > sqrt(diag(var(gas[,c(2,4,5)]))) [1] 43.4338480 2.8628456 0.4926189 > apply(gas,2,mean) Cost Distance UnitPrice GasUsed Consumption 12.7143 443.0374 39.15234 32.44423 7.346162 > # some examples of vector behaviour > c(1,2,3)*c(1,2,3) [1] 1 4 9 > # some matrix manipulation > c(1,-1,0)%*%var(gas[,c(2,4,5)]) %*% c(1,-1,0) [,1] [1,] 1712.907 > solve(var(gas[,c(2,4,5)]))%*%var(gas[,c(2,4,5)]) Distance GasUsed Consumption Distance 1 -1.332268e-15 2.220446e-16 GasUsed 0 1.000000e+00 -3.552714e-15 Consumption 0 -8.526513e-14 1.000000e+00
Making simple plots
> # Plot a graph of the sin(2*pi*t) > # Open up a graphics window > motif() > plot(times,y,xlab="Time",ylab="Amplitude", main="The Sine Function",type='l') > # Saving a hard copy > postscript(file='sine.ps',horizontal=F) > plot(times,y,xlab="Time",ylab="Amplitude" ,main="The Sine Function",type='l') > dev.off() > postscript("gaspairs.ps",horizontal=F) > pairs(gas) > dev.off()
Other important plotting functions:
Regression: Splus has a built-in modelling language. Many different modelling functions use the same language.
> Dreg <- lm(Distance ~ GasUsed, data=gas) > Dreg Call: lm(formula = Distance ~ GasUsed, data = gas) Coefficients: (Intercept) GasUsed 83.22513 11.09018 Degrees of freedom: 107 total; 105 residual Residual standard error: 29.77981 > summary(Dreg) Call: lm(formula = Distance ~ GasUsed, data = gas) Residuals: Min 1Q Median 3Q Max -74.02 -16.77 -2.967 17.76 100.8 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 83.2251 32.9062 2.5292 0.0129 GasUsed 11.0902 1.0103 10.9766 0.0000 Residual standard error: 29.78 on 105 degrees of freedom Multiple R-Squared: 0.5343 F-statistic: 120.5 on 1 and 105 degrees of freedom, the p-value is 0 Correlation of Coefficients: (Intercept) GasUsed -0.9962 > # what is Dreg? > # List: with components > names(Dreg) [1] "coefficients" "residuals" "fitted.values" "effects" [5] "R" "rank" "assign" "df.residual" [9] "contrasts" "terms" "call" > names(Dreg$call) [1] "" "formula" "data" > Dreg$call lm(formula = Distance ~ GasUsed, data = gas) > Dreg$call$formula Distance ~ GasUsedCategorical independent variables
> # Some data sets are built in to S > # Different objects are kept in different places. > objects(4) > summary(fuel.frame) ... [124] "freeny.x" "freeny.y" "fuel.frame" [127] "fuel.frame.orig" "galaxy" "gas" ... > summary(fuel.frame) Weight Disp. Mileage Fuel Type Min.:1845 Min.: 73.0 Min.:18.00 Min.:2.703 Compact:15 1st Qu.:2571 1st Qu.:113.8 1st Qu.:21.00 1st Qu.:3.704 Large: 3 Median:2885 Median:144.5 Median:23.00 Median:4.348 Medium:13 Mean:2901 Mean:152.1 Mean:24.58 Mean:4.210 Small:13 3rd Qu.:3231 3rd Qu.:180.0 3rd Qu.:27.00 3rd Qu.:4.762 Sporty: 9 Max.:3855 Max.:305.0 Max.:37.00 Max.:5.556 Van: 7 i> Freg <- lm(Fuel ~ Disp. + Weight +Type, data=fuel.frame) > summary(Freg) Call: lm(formula = Fuel ~ Disp. + Weight + Type, data = fuel.frame) Residuals: Min 1Q Median 3Q Max -0.6973 -0.2444 -0.01367 0.2 0.6363 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 2.9606 0.6005 4.9306 0.0000 Disp. 0.0076 0.0017 4.3616 0.0001 Weight 0.0000 0.0003 0.1611 0.8726 Type1 -0.1453 0.1293 -1.1239 0.2662 Type2 0.0981 0.0470 2.0850 0.0420 Type3 -0.1241 0.0505 -2.4556 0.0174 Type4 -0.0436 0.0252 -1.7314 0.0893 Type5 0.1915 0.0337 5.6748 0.0000 Residual standard error: 0.3139 on 52 degrees of freedom Multiple R-Squared: 0.8486 F-statistic: 41.65 on 7 and 52 degrees of freedom, the p-value is 0
Correlation of Coefficients: (Intercept) Disp. Weight Type1 Type2 Type3 Type4 Disp. 0.4864 Weight -0.9415 -0.7477 Type1 0.3786 -0.2960 -0.1559 Type2 0.0888 0.3470 -0.2166 -0.4995 Type3 -0.7591 -0.0589 0.5929 -0.6197 0.1746 Type4 -0.3099 -0.1641 0.2937 -0.2997 0.1304 0.3021 Type5 0.7121 0.6013 -0.7688 -0.0053 0.2625 -0.3933 -0.2013 > coef(Freg) (Intercept) Disp. Weight Type1 Type2 Type3 2.960617 0.007594073 4.16396e-05 -0.1452804 0.09808411 -0.1240942 Type4 Type5 -0.04358047 0.1915062 > anova(Freg) Analysis of Variance Table Response: Fuel Terms added sequentially (first to last) Df Sum of Sq Mean Sq F Value Pr(F) Disp. 1 17.25253 17.25253 175.0661 0.000000e+00 Weight 1 7.93103 7.93103 80.4783 0.000000e+00 Type 5 3.54878 0.70976 7.2021 3.465263e-05 Residuals 52 5.12453 0.09855 > # Diagnostic plots: > postscript("FuelDiag.ps",horizontal=F) > # Lay the plots out in a 3 by 2 grid > par(mfrow=c(3,2)) > plot(Freg) > dev.off() > # Examine the postscript file without leaving S > !ghostview FuelDiag.ps&Writing SPlus functions: Here I will make a contour plot for a density which is a 50/50 mixture of two bivariate normal densities with different correlations.
> x <- seq(-3, 3, length = 200) > y <- x > g <- function(x, y, rho1, rho2 = 0) { # g computes the density of a 50/50 mixture of two normal densities # at a grid of x,y values # # Use outer to make a matrix whose ijth entry is x[i]*y[j] q1 <- outer(x, y, "*") y1 <- rep(1, length(y)) x1 <- rep(1, length(x)) qx <- outer(x^2, x1, "*") qy <- outer(y1, y^2, "*") den1 <- exp( - (qx - 2 * rho1 * q1 + qy)/ (2 * sqrt(1 - rho1^2)))/(2 * pi * sqrt(1 - rho1^2)) den2 <- exp( - (qx - 2 * rho2 * q1 + qy)/ (2 * sqrt(1 - rho2^2)))/(2 * pi * sqrt(1 - rho2^2)) (den1 + den2)/2 } > postscript("partb.ps",height=6.5,width=6.5,horizontal=F) > zb <- g(x, y, 0.8, -0.8) > contour(x, y, zb, nlevels = 20, labex = 0) > dev.off()Simulations: I will simulate the sample mean when sampling from the Cauchy distribution.
> # generate 1000 samples of size 25 and put them in 1000 by 25 matrix > xc <- matrix(rcauchy(10000*25), nrow=10000) > # compute sample means the really slow way > xcm0 <- mean(xc[1,1]) > for ( i in 2:10000) xcm0 <- c(xcm0,mean(xc[i,])) > # compute sample means the slow way: > xcm <- apply(xc,1,mean) > # faster using vector arithmetic > one <- rep(1,25) > xcm1 <- xc %*% one/25 > # Plot a kernel smoothed density estimate for xbar > plot(ksmooth(xcm1,range.x=c(-6,6)),type='l', xlab="x",ylab="Density") > # Compare estimated density of xbar to > # Original Cauchy density > xs <- seq(-6,6,length=1000) > lines(xs,dcauchy(xs),lty=2)Pitfalls: When you write a function that calls a function and so on you can get caught in a trap: when functions look for variables you need to be aware of where they look.
> # in a new directory > a <- 3 > g <- function(x) {x^a} > g(2) [1] 8 > # now remove all those functions and stick the whole thing > # inside a function > remove(ls()) >ls() character(0) > w <- function(){ a <- 3 g <- function(x) { x^a } g(2) } > w() Problem in g(2): Object "a" not found Use traceback() to see the call stackThe problem is in where S looks for things it needs inside g. Look up assign in Splus for help!
Looping: is slow so use the vector calculations - make the argument a vector. Use outer. Use matrix arithmetic.
Help: Function help or args when you know the name of the function but not how to use it. Use help.start() to get a netscape based on line help browser. Go to http://lib.stat.cmu.edu/ and look at S-news, the S-Archive and other things.