Introduction to Categorical Data Analysis Procedures

Overview

Several procedures in SAS/STAT software can be used for the analysis of categorical data:

CATMOD: fits linear models to functions of categorical data, facilitating such analyses as regression, analysis of variance, linear modeling, log-linear modeling, logistic regression, and repeated measures analysis. Maximum likelihood estimation is used for the analysis of logits and generalized logits, and weighted least squares analysis is used for fitting models to other response functions.
CORRESP: performs simple and multiple correspondence analyses, using a contingency table, Burt table, binary table, or raw categorical data as input. For more on PROC CORRESP, see Chapter 6, "Introduction to Multivariate Procedures," and Chapter 24, "The CORRESP Procedure."
FREQ: builds frequency tables or contingency tables and produces numerous tests and measures of association, including chi-square statistics, odds ratios, correlation statistics, and Fisher's exact test for two-way tables of any size. In addition, it performs stratified analysis, computing Cochran-Mantel-Haenszel statistics and estimates of the common relative risk. It also performs tests of binomial proportions and computes tests and measures of agreement such as McNemar's test, kappa, and weighted kappa.
GENMOD: fits generalized linear models with maximum-likelihood methods. This family includes logistic, probit, and complementary log-log regression models for binomial data, Poisson regression models for count data, and multinomial models for ordinal response data. It performs likelihood ratio and Wald tests for type I, type III, and user-defined contrasts. It analyzes repeated measures data with generalized estimating equation (GEE) methods.
LOGISTIC: fits linear logistic regression models for binary or ordinal response data with maximum-likelihood methods. It performs stepwise regression and provides regression diagnostics. The logit link function in the logistic regression models can be replaced by the normit function or the complementary log-log function.
PROBIT: computes maximum-likelihood estimates of regression parameters and optional threshold parameters for binary or ordinal response data.

Other procedures that perform analyses for categorical data are the TRANSREG and PRINQUAL procedures. PROC PRINQUAL is summarized in Chapter 6, "Introduction to Multivariate Procedures," and PROC TRANSREG is summarized in Chapter 3, "Introduction to Regression Procedures."

A categorical variable is defined as one that can assume only a limited number of discrete values. The measurement scale for such a variable is unrestricted. It can be nominal, which means that the observed levels are not ordered. It can be ordinal, which means that the observed levels are ordered in some way. Or it can be interval, which means that the observed levels are ordered and numeric and that any interval of one unit on the scale of measurement represents the same amount, regardless of its location on the scale. One example of a categorical variable is litter size; another is the number of times a subject has been married. A variable that lies on a nominal scale is sometimes called a qualitative or classification variable.

Categorical data result from observations on multiple subjects where one or more categorical variables are observed for each subject. If there is only one categorical variable, then the data are generally represented by a frequency table, which lists each observed value of the variable and its frequency of occurrence.
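As a language-neutral illustration of the one-variable frequency table described above, the following minimal Python sketch counts a made-up sample of litter sizes. The data values and the names litter_sizes and frequency_table are assumptions chosen for the example; they are not SAS syntax or SAS output.

```python
from collections import Counter

# Hypothetical observations of one categorical variable (litter size),
# one value per subject. The values are invented for illustration.
litter_sizes = [3, 5, 4, 3, 6, 4, 3, 5, 4, 4]

# Count how many times each distinct value was observed.
counts = Counter(litter_sizes)

# The frequency table: each observed value paired with its frequency,
# listed in increasing order of the value.
frequency_table = sorted(counts.items())

for value, freq in frequency_table:
    print(f"{value:>5} {freq:>5}")
```

Only values that actually occur appear in the table, which matches the phrase "each observed value" in the text.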

If there are two or more categorical variables, then a subject's profile is defined as the subject's observed values for each of the variables. Such categorical data can be represented by a frequency table that lists each observed profile and its frequency of occurrence.
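A profile can be thought of as a tuple of a subject's observed values. The sketch below, written in Python purely for illustration, builds such tuples and counts them. The variable names (sex, married) and the records are hypothetical.

```python
from collections import Counter

# Hypothetical subjects, each observed on two categorical variables.
subjects = [
    {"sex": "F", "married": 1},
    {"sex": "M", "married": 0},
    {"sex": "F", "married": 1},
    {"sex": "F", "married": 2},
    {"sex": "M", "married": 0},
    {"sex": "M", "married": 1},
]

# The order of the variables fixes the layout of every profile.
variables = ("sex", "married")

def profile(subject):
    """Return the subject's observed values as a tuple, one per variable."""
    return tuple(subject[v] for v in variables)

# Frequency table of profiles: each observed profile and its count.
profile_counts = Counter(profile(s) for s in subjects)

for prof, freq in sorted(profile_counts.items()):
    print(prof, freq)
```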

If there are exactly two categorical variables, then the data are often represented by a two-dimensional contingency table, which has one row for each level of variable 1 and one column for each level of variable 2. The intersections of rows and columns, called cells, correspond to variable profiles, and each cell contains the frequency of occurrence of the corresponding profile.
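The row-by-column layout can be sketched in a few lines of Python. This is an illustration of the data structure only, under the assumption of two invented variables with levels low/high and yes/no; it does not reproduce the output of any SAS procedure.

```python
from collections import Counter

# Hypothetical (variable 1, variable 2) observations, one pair per subject.
pairs = [
    ("low", "yes"), ("low", "no"), ("high", "yes"), ("low", "yes"),
    ("high", "no"), ("high", "no"), ("low", "yes"),
]

# One row per level of variable 1, one column per level of variable 2.
row_levels = sorted({r for r, _ in pairs})
col_levels = sorted({c for _, c in pairs})

# Each cell holds the frequency of the corresponding (row, column) profile.
cell_counts = Counter(pairs)
table = [[cell_counts[(r, c)] for c in col_levels] for r in row_levels]

print("      ", *(f"{c:>5}" for c in col_levels))
for r, row in zip(row_levels, table):
    print(f"{r:<6}", *(f"{n:>5}" for n in row))
```

A profile that never occurs still gets a cell in the table, with a frequency of zero, because the cells are defined by the levels of the two variables rather than by the observed profiles.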

If there are more than two categorical variables, then the data can be represented by a multidimensional contingency table. There are two commonly used methods for displaying such tables, and both require that the variables be divided into two sets.

In the first method, one set contains a row variable and a column variable for a two-dimensional contingency table, and the second set contains all of the other variables. The variables in the second set are used to form a set of profiles. Thus, the data are represented as a series of two-dimensional contingency tables, one for each profile. This is the data representation used by PROC FREQ. For example, if you request tables for RACE*SEX*AGE*INCOME, the FREQ procedure represents the data as a series of contingency tables: the row variable is AGE, the column variable is INCOME, and the combinations of levels of RACE and SEX form a set of profiles.
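The first method can be sketched as follows, in Python and with invented records: the last two variables act as the row and column variables, and every combination of the remaining variables selects its own two-way table. The data values and the name tables are assumptions for illustration; the sketch mirrors the representation described above, not PROC FREQ's implementation or output.

```python
from collections import Counter, defaultdict

# Hypothetical records in the order (race, sex, age, income),
# mirroring a request for RACE*SEX*AGE*INCOME.
records = [
    ("A", "F", "young", "low"),
    ("A", "F", "young", "high"),
    ("A", "F", "old", "high"),
    ("A", "M", "young", "low"),
    ("B", "F", "old", "low"),
    ("B", "F", "old", "low"),
    ("A", "F", "young", "low"),
]

# One two-way table (age by income) per profile of (race, sex).
tables = defaultdict(Counter)
for *stratum, age, income in records:
    tables[tuple(stratum)][(age, income)] += 1

for stratum_profile in sorted(tables):
    print("Table for", stratum_profile)
    for (age, income), freq in sorted(tables[stratum_profile].items()):
        print(f"  age={age:<6} income={income:<5} count={freq}")
```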

In the second method, one set contains the independent variables, and the other set contains the dependent variables. Profiles based on the independent variables are called population profiles, whereas those based on the dependent variables are called response profiles. A two-dimensional contingency table is then formed, with one row for each population profile and one column for each response profile. Since any subject can have only one population profile and one response profile, the contingency table is uniquely defined. This is the data representation used by PROC CATMOD.
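The second method can be sketched in the same way. In the Python illustration below, the first two fields of each invented record are treated as independent variables and the last field as the dependent variable. The variable roles, data, and names (cells, populations, responses) are all assumptions for the example and are not PROC CATMOD syntax.

```python
from collections import Counter

# Hypothetical records: (treatment, sex) are the independent variables,
# and the final field (outcome) is the dependent variable.
records = [
    ("drug", "F", "improved"),
    ("drug", "F", "improved"),
    ("drug", "M", "none"),
    ("placebo", "F", "none"),
    ("placebo", "M", "improved"),
    ("placebo", "F", "none"),
    ("drug", "M", "improved"),
]
n_independent = 2  # number of leading fields treated as independent

# Each subject contributes to exactly one (population, response) cell.
cells = Counter()
for rec in records:
    population = rec[:n_independent]   # population profile
    response = rec[n_independent:]     # response profile
    cells[(population, response)] += 1

# One row per population profile, one column per response profile.
populations = sorted({p for p, _ in cells})
responses = sorted({r for _, r in cells})
table = [[cells[(p, r)] for r in responses] for p in populations]

for p, row in zip(populations, table):
    print(p, row)
```

Each row total is the number of subjects in that population, which is why every subject lands in exactly one cell.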
