Sample Data Sets
The following sample data sets are included with SAS/INSIGHT
software.
The AIR data set contains measurements of pollutant
concentrations from a city in Germany during a week in
November 1989. Variables are
- DATETIME
- date and hour in SAS DATETIME format
- DAY
- day of the week
- HOUR
- hour of the day
- CO
- carbon monoxide concentration
- O3
- ozone concentration
- SO2
- sulfur dioxide concentration
- NO
- nitrogen oxide concentration
- DUST
- dust concentration
- WIND
- wind speed
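A SAS datetime value is stored as the number of seconds since midnight, January 1, 1960, so DATETIME holds a plain number unless a format is applied. The following is a minimal sketch, in Python, of converting such a value outside SAS. The function names and the sample reading are hypothetical and are not taken from the AIR data set.

```python
from datetime import datetime, timedelta

# SAS datetime values count seconds from midnight, January 1, 1960.
SAS_EPOCH = datetime(1960, 1, 1)

def sas_datetime_to_python(seconds):
    """Convert a SAS datetime value (seconds since the SAS epoch) to a datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)

def python_to_sas_datetime(moment):
    """Convert a datetime to a SAS datetime value in seconds."""
    return (moment - SAS_EPOCH).total_seconds()

# Hypothetical reading at 2 p.m. on a November 1989 day (not real AIR data).
reading = datetime(1989, 11, 15, 14, 0)
raw_value = python_to_sas_datetime(reading)

# Round trip: the hour recovered here corresponds to the HOUR variable.
print(sas_datetime_to_python(raw_value).hour)
```

The same conversion also recovers the date, from which a day of the week like the DAY variable could be derived.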
The BASEBALL data set contains
performance measures and salary levels for regular hitters and
leading substitute hitters in major league baseball
for the year 1986 (Collier 1987).
There is one observation per hitter. Variables are
- NAME
- the player's name
- NO_ATBAT
- number of times at bat in 1986
- NO_HITS
- number of hits in 1986
- NO_HOME
- number of home runs in 1986
- NO_RUNS
- number of runs in 1986
- NO_RBI
- number of runs batted in in 1986
- NO_BB
- number of bases on balls in 1986
- YR_MAJOR
- years in the major leagues
- CR_ATBAT
- career at bats
- CR_HITS
- career hits
- CR_HOME
- career home runs
- CR_RUNS
- career runs
- CR_RBI
- career runs batted in
- CR_BB
- career bases on balls
- LEAGUE
- player's league at the end of 1986
- DIVISION
- player's division at the end of 1986
- TEAM
- player's team at the end of 1986
- POSITION
- positions played in 1986
- NO_OUTS
- number of put outs in 1986
- NO_ASSTS
- number of assists in 1986
- NO_ERROR
- number of errors in 1986
- SALARY
- salary in thousands of dollars
The POSITION variable in the BASEBALL data set is encoded
as follows:
Code | Position | Code | Position |
13 | first base, third base | CS | center field, shortstop |
1B | first base | DH | designated hitter |
1O | first base, outfield | DO | designated hitter, outfield |
23 | second base, third base | LF | left field |
2B | second base | O1 | outfield, first base |
2S | second base, shortstop | OD | outfield, designated hitter |
32 | third base, second base | OF | outfield |
3B | third base | OS | outfield, shortstop |
3O | third base, outfield | RF | right field |
3S | third base, shortstop | S3 | shortstop, third base |
C | catcher | SS | shortstop |
CD | center field, designated hitter | UT | utility |
CF | center field | | |
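The POSITION codes cannot be decoded one character at a time, because C stands for catcher on its own but for center field in CD, CF, and CS. A full lookup table is therefore the safer approach. The following Python sketch copies the encoding above into a dictionary; the names POSITION_CODES and describe_position are hypothetical.

```python
# Lookup table copied from the POSITION encoding listed above.
POSITION_CODES = {
    "13": "first base, third base",
    "1B": "first base",
    "1O": "first base, outfield",
    "23": "second base, third base",
    "2B": "second base",
    "2S": "second base, shortstop",
    "32": "third base, second base",
    "3B": "third base",
    "3O": "third base, outfield",
    "3S": "third base, shortstop",
    "C": "catcher",
    "CD": "center field, designated hitter",
    "CF": "center field",
    "CS": "center field, shortstop",
    "DH": "designated hitter",
    "DO": "designated hitter, outfield",
    "LF": "left field",
    "O1": "outfield, first base",
    "OD": "outfield, designated hitter",
    "OF": "outfield",
    "OS": "outfield, shortstop",
    "RF": "right field",
    "S3": "shortstop, third base",
    "SS": "shortstop",
    "UT": "utility",
}

def describe_position(code):
    """Return the description for a POSITION code, or None if it is not listed."""
    return POSITION_CODES.get(code.strip().upper())

print(describe_position("1O"))
```

Codes are trimmed and uppercased before lookup, so values padded with trailing blanks still match.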
The BUSINESS data set contains information on publicly held
German, Japanese, and U.S. companies in the
automotive, chemical, electronics, and oil refining industries.
There is one observation for each company.
Variables are
- NATION
- the nationality of the company
- INDUSTRY
- the company's principal business
- EMPLOYS
- the number of employees
- SALES
- sales for 1991 in millions of dollars
- PROFITS
- profits for 1991 in millions of dollars
The DRUG data set contains results of an experiment
to evaluate drug effectiveness (Afifi and Azen 1972).
Four drugs were tested against three diseases on six subjects;
there is one observation for each test.
Variables are
- DRUG
- the drug used in treatment
- DISEASE
- the disease present
- CHANG_BP
- the change in systolic blood pressure due to treatment
The GPA data set contains data collected to determine which
applicants at a large midwestern university were likely to succeed
in its computer science program (Campbell and McCabe 1984).
There is one observation per student. Variables are
- GPA
- the grade point average of students in the computer science program
- HSM
- the average high school grade in mathematics
- HSE
- the average high school grade in English
- HSS
- the average high school grade in science
- SATM
- the score on the mathematics portion of the SAT exam
- SATV
- the score on the verbal portion of the SAT exam
- SEX
- the student's gender
The IRIS data set is Fisher's Iris data (Fisher 1936).
Sepal and petal size were measured for fifty specimens
from each of three species of iris.
There is one observation per specimen. Variables are
- SEPALLEN
- sepal length in millimeters
- SEPALWID
- sepal width in millimeters
- PETALLEN
- petal length in millimeters
- PETALWID
- petal width in millimeters
- SPECIES
- the species
The MINING data set
contains results of an experiment to determine
whether drilling time was faster for wet drilling or dry drilling
(Penner and Watts 1991).
Tests were replicated three times for each method at different test holes.
There is one observation per five-foot interval for each replication.
Variables are
- DRILTIME
- the time in minutes to drill the last five feet of the current depth
- METHOD
- the drilling method, wet or dry
- REP
- the replicate number
- DEPTH
- the depth of the hole in feet
The MININGX data set is a subset of the MINING data set.
It contains data from only one of the test holes.
The PATIENT data set contains data collected
on cancer patients (Lee 1974). There is one observation
per patient. Variables are
- REMISS
- 1 if remission occurred and 0 otherwise
- CELL, SMEAR, INFIL, LI, TEMP, BLAST
- measures of patient characteristics
The SHIP data set contains data from an investigation
of wave damage to cargo ships (McCullagh and Nelder 1989).
The purpose of the investigation was to set standards for future
hull construction. There is one observation per ship.
Variables are
- Y
- the number of damage incidents
- YEAR
- year of construction
- TYPE
- the type of ship
- PERIOD
- the period of operation
- MONTHS
- the aggregate months of service
Choose Help:Create Samples to create the sample
data sets in your sasuser directory.
When you have created the sample data sets, turn to the
Techniques part of this manual to learn how to enter your data
and begin exploring it with SAS/INSIGHT software.
Note: If you have an existing data set in your sasuser library
with the same name as a sample data set, it will be overwritten
if you create the sample.
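Before creating the samples, it can help to check which of your own data sets share a name with one of them. The Python sketch below assumes you have already obtained the list of data set names in your sasuser library by some other means; the function name names_at_risk is hypothetical. SAS data set names are not case sensitive, so the comparison ignores case.

```python
# Names of the sample data sets described in this section.
SAMPLE_NAMES = {
    "AIR", "BASEBALL", "BUSINESS", "DRUG", "GPA",
    "IRIS", "MINING", "MININGX", "PATIENT", "SHIP",
}

def names_at_risk(existing_names):
    """Return the existing data set names that creating the samples would overwrite."""
    normalized = {name.strip().upper() for name in existing_names}
    return sorted(normalized & SAMPLE_NAMES)

# Hypothetical library contents, for illustration only.
print(names_at_risk(["iris", "Sales", "ship"]))
```

Any name this returns should be renamed or copied elsewhere before you choose Help:Create Samples.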
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.