Computational Resources

The GLM Procedure

Memory

For large problems, most of the memory is required to hold the X'X matrix of sums and cross products. The section "Parameterization of PROC GLM Models" describes how columns of the X matrix are allocated for various types of effects. For each level that occurs in the data for a combination of class variables in a given effect, a row and a column of X'X are needed.

The following example illustrates the calculation. Suppose A has 20 levels, B has 4 levels, and C has 3 levels. Then consider the model

   proc glm;
     class A B C;
     model Y1 Y2 Y3=A B A*B C A*C B*C A*B*C X1 X2;
   run;

The X'X matrix (bordered by X'Y and Y'Y) can have as many as 425 rows and columns:

1: for the intercept term
20: for A
4: for B
80: for A*B
3: for C
60: for A*C
12: for B*C
240: for A*B*C
2: for X1 and X2 (continuous variables)
3: for Y1, Y2, and Y3 (dependent variables)

The matrix has 425 rows and columns only if all combinations of levels occur for each effect in the model. For m rows and columns, 8m² bytes are needed for cross products. In this case, 8·425² = 1,445,000 bytes, or about 1,445,000 / 1024 = 1411K.
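The tally above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, run outside SAS, and is not part of PROC GLM. The names `max_columns` and `crossproduct_bytes` are hypothetical helpers introduced only for this illustration, and the count is the upper bound that applies when every combination of levels occurs in the data.

```python
# Levels of each CLASS variable in the example (taken from the text above).
levels = {"A": 20, "B": 4, "C": 3}

# Classification effects in the MODEL statement, as tuples of CLASS variables.
class_effects = [
    ("A",), ("B",), ("A", "B"), ("C",),
    ("A", "C"), ("B", "C"), ("A", "B", "C"),
]


def max_columns(levels, class_effects, n_continuous, n_dependent):
    """Upper bound on the rows/columns of X'X bordered by X'Y and Y'Y."""
    total = 1  # intercept
    for effect in class_effects:
        cols = 1
        for var in effect:
            cols *= levels[var]  # one column per combination of levels
        total += cols
    return total + n_continuous + n_dependent


def crossproduct_bytes(m):
    """8 bytes for each of the m * m cross-product entries."""
    return 8 * m * m


m = max_columns(levels, class_effects, n_continuous=2, n_dependent=3)
print(m, crossproduct_bytes(m), round(crossproduct_bytes(m) / 1024))
```

Changing the entries of `levels` or dropping tuples from `class_effects` shows how quickly the bound shrinks when an interaction is removed or a factor is recoded.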

The required memory grows as the square of the number of columns of X; most of the memory is for the A*B*C interaction. Without A*B*C, you have 185 columns and need about 267K for X'X. Without either A*B*C or A*B, you have 105 columns and need about 86K. If A is recoded to have ten levels, then the full model has only 225 columns (counted in the same way as the 425 above) and requires about 396K.

A large amount of memory is needed a second time when Type III, Type IV, or contrast sums of squares are calculated. This memory requirement is a function of the number of degrees of freedom of the model being analyzed and the maximum degrees of freedom for any single source. Let Rank equal the sum of the model degrees of freedom, MaxDF be the maximum number of degrees of freedom for any single source, and N_y be the number of dependent variables in the model. Then the memory requirement in bytes is

$\left(8 \times \frac{\mathrm{Rank} \times (\mathrm{Rank} + 1)}{2}\right) + (N_y \times \mathrm{Rank}) + \left(\frac{\mathrm{MaxDF} \times (\mathrm{MaxDF} + 1)}{2}\right) + (N_y \times \mathrm{MaxDF})$

Unfortunately, these quantities are not available when the X'X matrix is being constructed, so PROC GLM may occasionally request additional memory even after you have increased the memory allocation available to the program.
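Once Rank, MaxDF, and N_y are known for a fitted model, the expression above can be evaluated directly. The Python sketch below is only an illustration: `type3_memory_bytes` is a hypothetical helper, the input values are made up rather than taken from the example, and the factor of 8 is applied only to the first term because that is how the expression is grouped as printed, which should be treated as an assumption.

```python
def type3_memory_bytes(rank, max_df, n_y):
    """Evaluate the memory expression term by term, grouped as printed."""
    term_rank = 8 * (rank * (rank + 1) // 2)   # 8 x Rank x (Rank + 1) / 2
    term_y_rank = n_y * rank                   # N_y x Rank
    term_max_df = max_df * (max_df + 1) // 2   # MaxDF x (MaxDF + 1) / 2
    term_y_max_df = n_y * max_df               # N_y x MaxDF
    return term_rank + term_y_rank + term_max_df + term_y_max_df


# Hypothetical inputs: Rank = 100, MaxDF = 20, three dependent variables.
print(type3_memory_bytes(rank=100, max_df=20, n_y=3))
```

The first term dominates, so this requirement also grows roughly as the square of the model degrees of freedom.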

If you have a large model that exceeds the memory capacity of your computer, these are your options:

Eliminate terms, especially high-level interactions.
Reduce the number of levels for variables with many levels.
Use the ABSORB statement for parts of the model that are large.
Use the REPEATED statement for repeated measures variables.
Use PROC ANOVA or PROC REG rather than PROC GLM, if your design allows.

CPU Time

For large problems, two operations consume a lot of CPU time: the collection of sums and cross products and the solution of the normal equations.

The time required for collecting sums and cross products is difficult to calculate because it is a complicated function of the model. For a model with m columns and n rows (observations) in X, the worst case occurs if all columns are continuous variables, involving nm²/2 multiplications and additions. If the columns are levels of a classification, then only m sums may be needed, but a significant amount of time may be spent in look-up operations. Solving the normal equations requires time for approximately m³/2 multiplications and additions.
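To get a feel for the two operation counts, the sketch below evaluates nm²/2 and m³/2 in Python. The helper names are hypothetical, and the 10,000 observations are an assumed value chosen for illustration; only m = 425 comes from the memory example. The counts are worst-case approximations from the text, not measured timings.

```python
def crossproduct_ops_worst_case(n, m):
    """Worst case (all columns continuous): about n * m^2 / 2 multiply-adds."""
    return n * m ** 2 / 2


def normal_equation_ops(m):
    """Solving the normal equations: about m^3 / 2 multiply-adds."""
    return m ** 3 / 2


n, m = 10_000, 425  # n is an assumed number of observations
print(f"cross products:   {crossproduct_ops_worst_case(n, m):,.0f}")
print(f"normal equations: {normal_equation_ops(m):,.1f}")
```

When n is much larger than m, collecting the cross products is the more expensive step in this worst case; the solution step overtakes it only when m approaches n.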

Suppose you know that Type IV sums of squares are appropriate for the model you are analyzing (for example, if your design has no missing cells). You can specify the SS4 option in your MODEL statement, which saves CPU time by requesting the Type IV sums of squares instead of the more computationally burdensome Type III sums of squares. This proves especially useful if you have a factor in your model that has many levels and is involved in several interactions.
