Chapter 9 - Multiple Regression

Last chapter, we looked at simple regression - one X (independent) variable.  Now we look at multiple regression - more than one independent variable.

In multiple regression, we can look at the effects of several variables simultaneously and isolate the effect of one variable while controlling for the influence of the others.  Like a controlled experiment - we can invoke the ceteris paribus condition, using as many control (independent) variables as the sample size allows.  OLS does that automatically.  It allows us to say, "after controlling for the effects of X1, X2, X3, etc., X4 has a significantly positive (negative) effect on Y."

Multiple regression is as close as we can get in policy research/social sciences to the control of an experimental research design (laboratory experiment).

Example: After controlling for age, race, education, number of years continuous employment, marital status, number of children, IQ, union status, geographic differences in earnings, etc. etc., we find that women make X% less (more) than men, and the results are significant at the X% level (1, 5 or 10%).

Multiple regression also allows us to test/assess the joint significance of a group of indpt variables.
 

MULTIPLE REGRESSION:

Y = a + b1 X1 + b2 X2 + ... + bk Xk + e, where X1, X2, ..., Xk are independent variables, k = number of independent variables and Y is the dependent variable.  We can theoretically have hundreds of variables: an econometric model of the U.S. economy, for example, might have hundreds of independent variables.

The estimated b's (slope coefficients) are considered "partial" measures, like a partial derivative in calculus.

Interpretation of b1: shows the influence of X1 on Y, holding X2 CONSTANT.

Interpretation of b2: shows the influence of X2 on Y, holding X1 CONSTANT.

If b1 = 1.5, it means that for every one unit change in X1, Y will change by 1.5 units of Y, holding X2 constant.

Extension of juvenile delinquency example from Chapter 8, adding two additional variables:

Unit of analysis: ____________________

Y (teen drug arrests) = a + b1 UN + b2 % SINGLE PARENT FAMILIES + b3 POPULATION + e

b1 = the effect of UN on teen arrests, holding % single parents and pop constant.
b2 = the effect of % Single Parents (SP) on teen arrests, holding UN and POP constant
b3 = the effect of POP on teen arrests, holding UN and % SP constant.

Assume that OLS shows that b1 = +25.6, b2 = +6.3 and b3 = -.000006

Means that for every one percentage point increase in a city's teen unemployment (UN) rate, the number of teen arrests will increase by 25.6, after controlling for the effects of %SP and POP.  Note that the coefficient was 26.7 in the single-variable regression and is now 25.6, meaning that after controlling for percent of single-parent families and population, the estimated effect of teen unemployment is slightly smaller than before.

For every one percentage point increase in a city's %SP, teen arrests will increase by 6.3, controlling for POP and UN.

For every 1,000,000 increase in a city's POP, teen arrests will fall by 6 (-.000006 x 1,000,000), controlling for UN and %SP.
 

If we accept the model, then we could use the estimated coefficients to predict teen arrests (the juvenile delinquency measure) in other cities.

Example: City A has UN=10%, %SP=12%, and POP=250,000.

Estimated Teen Arrests in City A = 346 (see page 238).
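A minimal sketch of the City A prediction, using the slope coefficients quoted above. The intercept a is not reported in these notes, so it is left as an argument; the function names are illustrative assumptions, not anything from the book.

```python
# Slope coefficients quoted in the notes (b1, b2, b3).
B_UN, B_SP, B_POP = 25.6, 6.3, -0.000006


def slope_contribution(un, sp, pop):
    """Sum of b1*UN + b2*%SP + b3*POP, without the intercept."""
    return B_UN * un + B_SP * sp + B_POP * pop


def predict_arrests(intercept, un, sp, pop):
    """Predicted teen arrests for a city; intercept must be supplied."""
    return intercept + slope_contribution(un, sp, pop)


# City A: UN = 10%, %SP = 12%, POP = 250,000
city_a_slopes = slope_contribution(10, 12, 250_000)
print(round(city_a_slopes, 1))
```

The three slope terms sum to about 330.1, so the figure of 346 from page 238 would be consistent with an intercept near 15.9, if the book's prediction uses these same coefficients.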

We would also look at R2, the standard error of the estimate, t-tests on individual variables, and the F-statistic.  In multiple regression the interpretation of R2 is the same: Explained Variation / Total Variation, or the % of Total Variation explained by the regression.  However, we are not measuring the fit around a line, but the fit around a multidimensional plane reflecting the number of independent variables.

F-test is used in multiple regression as another measure of "goodness of fit," to do a joint test of the significance of the indpt vars as a group.

In this case:

Ho: b1= b2= b3 = 0
Ha: at least one of b1, b2, b3 n.e. (not equal) 0

There are F-tables on pages 354-355.  Critical F-values depend on degrees of freedom (d.f.), complicated formula.  Just look at the probability.  If the F prob. < .05 in EVIEWS output, then the group of variables is statistically significant at the 5% level.  If F prob < .01, the variables are significant at the 1% level.

F-stat = Ratio of Explained Variation / Unexplained Variation, where both the numerator and the denominator are adjusted for d.f.

Remember R2 = Explained Variation / Total Variation.
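Because both the F-stat and R2 are built from explained and unexplained variation, the F-stat can be written in terms of R2. A small sketch, assuming a regression with an intercept, k slope coefficients and N observations, so that the d.f. are k and N - k - 1; the input values are hypothetical.

```python
def f_statistic(r2, n, k):
    """F = (R2 / k) / ((1 - R2) / (n - k - 1)).

    Numerator: explained share of variation per numerator d.f. (k).
    Denominator: unexplained share per denominator d.f. (n - k - 1).
    """
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical regression: R2 = 0.50, N = 24 cities, k = 3 indpt vars
f_value = f_statistic(0.50, 24, 3)
print(f_value)
```

The same R2 gives a larger F-stat when N is larger or k is smaller, which is why the F-test can attach a significance level where R2 alone cannot.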

Advantage of F-test over R2: F-test gives us a specific level of stat significance, R2 doesn't.

See sample regression output on page 242-243, F-stat.
 

Standardized Partial Regression Coefficients (Beta):

Transforms the regression coefficients into standardized units (standard deviation units), which allows comparison/assessment of the relative effect/impact of each variable.  The interpretation of a standardized coefficient: the amount of change in Y (measured in std. deviations) produced by a one standard deviation change in X.  The transformation imposes the same units on each variable, which can be useful when the X variables are all measured in different units (years, dollars, percentages, etc.).

Example: page 239, transforming the coefficients from 25.6 / 6.3 / -.000006 measured in original units (%, Pop) to .78 / .08 / -.13 measured in standard deviation units.  Allows us to say that the teen un rate is much more important than the other two variables.  NOT part of EVIEWS output.
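Since standardized coefficients are not part of EVIEWS output, they have to be computed by hand. A sketch of the usual conversion, Beta = b x (std. dev. of X / std. dev. of Y); the standard deviations below are hypothetical, because the notes do not report them.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized coefficient b to standard deviation units."""
    return b * sd_x / sd_y


# Hypothetical values: b = 2.0, std. dev. of X = 3.0, std. dev. of Y = 12.0
beta = standardized_beta(2.0, 3.0, 12.0)
print(beta)
```

Here a one standard deviation change in X would move Y by half a standard deviation.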

T-statistic: Statistical test of the Ho that b = 0.   The formula is: t-stat = b / se(b), where b is the unstandardized regression coefficient and se(b) is the standard error of b.  When b is large compared to its standard error, the t-stat will be large and significant.  Since the critical t-value for the 5% level of significance (2-tailed) is 1.96 in large samples, or approximately 2, we can say that in general, when b is more than 2X its std error in absolute value, the t-stat will be greater than 2 in absolute value, and the Ho will be rejected at the 5% level.  The X variable will then have a significant effect on Y.
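The rule of thumb can be sketched in a few lines. The coefficient 25.6 comes from the example earlier in these notes, but the standard error of 8.0 is a hypothetical value chosen only for illustration, and 1.96 is the large-sample critical value.

```python
def t_statistic(b, se_b):
    """t-stat for Ho: b = 0."""
    return b / se_b


def significant_at_5pct(b, se_b, critical=1.96):
    """Two-tailed test: reject Ho when |t| exceeds the critical value."""
    return abs(t_statistic(b, se_b)) > critical


t_un = t_statistic(25.6, 8.0)  # se of 8.0 is an assumption
print(t_un, significant_at_5pct(25.6, 8.0))
```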
 

See table 9-1, p. 245 for a summary of regression output.  We are most interested in 1, 2, 4, 5, 7, 9, 10.
 

Exercise 9-1 on page 247.  Unit of analysis is ________________,   N = 50

MODEL:

Avg hospital stay in DAYS = f (% over 65, Percent covered by private insurance, percent covered by HMO, median income, # hospital beds)

Predicted signs of coefficients?

Questions:

a.

b.

c.

d.
 
 

DUMMY VARIABLES

Dummy variables are used to evaluate qualitative variables that can only be expressed as YES/NO - Male Y/N, College degree Y/N, Race-White Y/N.

Dummies take on the value of either zero or one depending on whether some condition holds.

1 if Male
0 if Female.

1 if White
0 if not white

1 if person has BA
0 if no degree.

Example: page 82 on handout. Yi - Salary of High School teacher i.

We have a model: Teacher salary = f ( # years experience and whether teacher has a masters degree).

X1i = Master's degree or Not
X2i = Number of years teaching experience
 

X1i is a dummy variable that = 1 if teacher has a masters, 0 if no masters (BA only).

Equation: Yi = B0 + B1 X1 + B2 X2 + e

The coefficients can be interpreted as:

1) If the teacher has a BA Only, X1 = 0 so Y = B0 + B2 X2 and

2) If the teacher has a masters: X1 = 1 and Y = B0 + B1 + B2 X2.

B1 represents the additional average earnings gained by having the master's degree compared to a BA degree, HOLDING EXPERIENCE CONSTANT.

The dummy variable is called an "intercept dummy" because it actually changes the intercept of the regression, depending on whether the teacher has a masters degree.

See page 83.  When a teacher has a B.A. only, X1 = 0, and the second term drops out because it equals 0, and the intercept is B0.

If the teacher has a masters, X1 = 1, so the term B1 X1 is just the constant B1, and the intercept for teachers with a masters degree will be B0 + B1.  THE SLOPE WILL BE THE SAME!  B1 represents the additional average earnings for teachers with master's degrees.

Note: We only needed one dummy variable to represent two conditions.  The event NOT represented by a dummy variable is the omitted condition, and the event represented by the dummy is the included condition.  The coefficient is interpreted as the effect of the included condition relative to the omitted condition.  The effect of an MA relative to no MA/(BA only).
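The intercept-dummy idea in the teacher salary example can be checked numerically. This is a sketch with hypothetical coefficient values (B0, B1, B2 are not given in these notes); it shows that the masters/BA gap equals B1 at every level of experience.

```python
def predicted_salary(has_masters, years_exp, b0, b1, b2):
    """Y = B0 + B1*X1 + B2*X2, where X1 is the masters dummy."""
    x1 = 1 if has_masters else 0
    return b0 + b1 * x1 + b2 * years_exp


# Hypothetical coefficients, for illustration only
B0, B1, B2 = 30000.0, 4000.0, 800.0

gap_at_5 = predicted_salary(True, 5, B0, B1, B2) - predicted_salary(False, 5, B0, B1, B2)
gap_at_20 = predicted_salary(True, 20, B0, B1, B2) - predicted_salary(False, 20, B0, B1, B2)
print(gap_at_5, gap_at_20)
```

Both gaps come out equal to B1, which is what "same slope, different intercept" means.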
 

Example: The effect of sorority/fraternity membership on GPA in college.

We can't just take the average GPA of frat members vs. non-frat members.  Why?  There may be other relevant variables besides Greek membership that are important.  We want to control for these other variables and look at the effect of membership, holding the other variables constant. We need a model:

CG = College GPA (dependent variable)

HG = high school GPA (independent variable)
S = SAT score (independent variable)

G = dummy variable (independent variable)
1 = membership in sorority or frat
0 = not a member

We have the equation on page 84.  The coefficient of interest is the one on G = -.38.  This means that, holding GPA in high school and SAT scores constant, Greek membership is associated with a .38 lower GPA in college on a 4 point scale (0 to 4), about 1/3 of a grade point lower for Greek members.

Interpretation of dummy variable coefficient:  A dummy variable is usually EITHER/OR, 0 or 1, so the interpretation is always IF....THEN.  IF a person meets condition X, THEN...... e.g., IF black, IF female, IF married, IF Urban, IF Hispanic, IF a member of a union, IF from the South, etc...THEN............. In the case above, IF a person is a member of a fraternity, THEN their GPA is on average .38 points lower.

Incorrect interpretation of dummy variable: IF members of a fraternity increase by one, IF the number of females increases by one, etc...... 
 

EXAMPLE IN BOOK ON PAGE 255

We want to account for possible regional variation in the relationship between teen un rate and teen drug arrests.  We classify cities by regions of the country (N or Midwest, S, E and W).  We have four conditions, so we use three dummy variables in the equation:
 

Y = a + b1 UN RATE + D1 X1 + D2 X2 + D3 X3 + e
X1 = 1 for East, 0 otherwise
X2 = 1 for West, 0 otherwise
X3 = 1 for Midwest, 0 otherwise

When X1 X2 X3 are all 0, then the South will be represented by the constant term a.   We could then test for regional differences by seeing whether the coefficients on the dummy variables are significant.
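A sketch of how the three regional dummies could be built from a region label, with the South as the omitted condition. The function name and the list of cities are hypothetical.

```python
# Regions that get their own dummy; South is the omitted condition.
REGIONS_WITH_DUMMY = ("East", "West", "Midwest")


def region_dummies(region):
    """Return (X1, X2, X3) for East, West, Midwest; South gives all zeros."""
    return tuple(1 if region == r else 0 for r in REGIONS_WITH_DUMMY)


cities = ["South", "East", "Midwest", "West"]  # hypothetical city regions
rows = [region_dummies(r) for r in cities]
for city, row in zip(cities, rows):
    print(city, row)
```

Each city has at most one dummy equal to 1, and a city with all three equal to 0 is in the South.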

Alternative specification: Have 4 dummy variables for 4 conditions and NO constant.  Example: Have one dummy variable for Male and one dummy variable for Female and NO constant.
 

EVENT STUDIES USING DUMMY VARIABLES

We can use dummy variables to do event studies. Examples:

1. The effect of Three Mile Island nuclear accident in April 1979 on General Public Utilities stock return. We will look at one month.  Dummy variable = 1 in April 79 and 0 for all other months.

2. The effect of a takeover event on the relevant companies' stock returns.  Dupont and Dow both bid for Conoco in 1981, and Dupont was eventually the successful takeover bidder. We will look at four month period during the bidding activity.

Dummy variable = 1 during the relevant time period.

                          = 0 in all other periods

3. Effect of regulation/deregulation, e.g. effect of coal mining act on safety in mines.  Effect of deregulation of airlines on airline profits.  Dummy would equal 0 BEFORE the event and 1 AFTER the event.

4. Effect of Democratic president on economy. D=0 for years 1981-1992, D=1 for years 1993-2000.

We can have more than one dummy variable.

Example: Look at the effect of differences in education (HS, BA and MBA) on earnings. Three conditions = 2 dummies.

D1 = 1 if MBA, 0 if no MBA

D2 = 1 if BA, 0 if no BA

Then when both dummies are 0, we have the omitted condition - HS degree only. You can look at the effects of having an MBA vs. HS and having a BA vs HS.
 
 

INTERACTION TERMS

Creating a new variable by multiplying one variable times another. Measures the interaction between the two variables.

Example: Y = a + B1 X1 + B2 D1 + B3 (X1*D1) + e

X1*D1 is an interaction term.  It allows the change in Y with respect to X1 to depend on D1.
 

Example: Earnings as a function of experience. Dummy D1 = 1 if worker is Male, 0 = female.

Income = a + B1 EXP + B2 D1 + B3 D1* EXP.

We might assume that men earn more than women after controlling for experience.  B2 should be pos and significant.  We might also assume that men increase their income more per year of experience than women - get promoted to better jobs, etc.

In that case we would expect B3 to be pos and signif.  The earnings advantage of being male would increase with experience.
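A sketch of the earnings equation with the interaction term, using hypothetical coefficients. It shows that the return to a year of experience is B1 for women and B1 + B3 for men, so the male/female gap widens with experience when B3 is positive.

```python
def income(exp_years, male, a, b1, b2, b3):
    """Income = a + B1*EXP + B2*D1 + B3*(D1*EXP), with D1 = 1 if male."""
    d1 = 1 if male else 0
    return a + b1 * exp_years + b2 * d1 + b3 * d1 * exp_years


# Hypothetical coefficients, for illustration only
A, B1, B2, B3 = 20000.0, 1000.0, 3000.0, 500.0

slope_women = income(11, False, A, B1, B2, B3) - income(10, False, A, B1, B2, B3)
slope_men = income(11, True, A, B1, B2, B3) - income(10, True, A, B1, B2, B3)
gap_at_0 = income(0, True, A, B1, B2, B3) - income(0, False, A, B1, B2, B3)
gap_at_10 = income(10, True, A, B1, B2, B3) - income(10, False, A, B1, B2, B3)
print(slope_women, slope_men, gap_at_0, gap_at_10)
```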

Example: you have one dummy variable for Male/Female (D1 = 1 for M, 0 for F) and another dummy variable for marital status (D2 = 1 for married, 0 for single).  We create a new interaction dummy variable D3 = D1 * D2, where D3 would be equal to 1 for married men, 0 otherwise.  We could measure what effect being married has on male earnings by looking at the sign and significance of the coefficient for the variable D3.  We could create another interaction dummy variable for married women.  Or we could create an interaction term for being African American or Hispanic or Asian and Female, etc.

A model without interaction terms assumes no statistical interaction between indpt vars - the effect of one X on Y does not depend on the level of another X. Creating an interaction term allows us to model and test for interaction between two variables.
 
 

ALTERNATIVE SPECIFICATIONS

NONLINEAR RELATIONSHIPS

1. Polynomial/Quadratic

Sometimes we want to represent a non-linear relationship between Y and X like on page 258.  We can represent this relation with a polynomial functional form such as a quadratic equation, where one of the variables is raised to a power other than one. Example:

Y = a + b1 X + b2 X^2

Example: earnings as a function of age or experience, and other factors, over a lifetime.  The effect of Age on earnings?  Up to a point age would be a positive influence on earnings, but later it would lower earnings, because of retirement, working fewer hours, or obsolescence of skills.

Examples:

Lifetime earnings would increase, level off, and then fall with age. We could model this with a quadratic equation:

Earnings = a + B1 Age + B2 Age^2 + e

B1 would be positive and B2 would be negative if earnings rise and then fall over a person's lifetime.  This is exactly what labor economists have found.
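With B1 positive and B2 negative, the quadratic peaks where the slope B1 + 2 x B2 x Age equals zero, that is at Age = -B1 / (2 x B2). A sketch with hypothetical coefficients (the notes do not report estimates):

```python
def peak_age(b1, b2):
    """Age at which a + B1*Age + B2*Age^2 reaches its maximum (needs B2 < 0)."""
    if b2 >= 0:
        raise ValueError("no interior maximum unless B2 is negative")
    return -b1 / (2 * b2)


def earnings(age, a, b1, b2):
    return a + b1 * age + b2 * age ** 2


# Hypothetical coefficients, for illustration only
A, B1, B2 = 5000.0, 2000.0, -20.0
peak = peak_age(B1, B2)
print(peak)
```

With these made-up numbers, predicted earnings rise until age 50 and fall afterwards.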
 

2. DOUBLE LOG FORM

The most common functional form that is nonlinear in the variables but linear in the coefficients.  Some economists use double-logs as the default specification.
Double log means the dependent and all indpt vars are all logged. Also called log-log form or specification.

EVIEWS command: genr logx = log(x), will generate a new variable "logx" which is equal to the log of x.

ln Y = a + B1 Ln X1 + B2 Ln X2 + e

This is used to model constant elasticity: the elasticity is constant and the slope is not.  In contrast, in a linear equation the slope is constant (a line) and the elasticity changes along the line.

If elasticity is assumed to be constant, then the double log form is used and:

Ey,x = % change in Y / % change in X = B = constant.

The coefficient B is the elasticity, and it is constant.

In double log form, an individual regression coefficient (B) can be interpreted as an elasticity.

B = elasticity means that for a one percent change in X, Y changes by B% (holding the other Xs constant).

Advantage of double log: it is easy to interpret the coefficients - they are measured in terms of percentage change, not the units of the X or Y variable.

If b1= +1.5, then regardless of the units that Y and X are measured in, we can say that: for a 1% increase in X, Y will increase by 1.5%.

If b1= -.5, then we can say that for a one percent increase in X, Y will fall by .5%.
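The elasticity reading of the double log form can be checked directly: ln Y = a + B ln X is the same as Y = A x X^B. The sketch below computes the exact percentage change in Y for a given percentage change in X; for small changes it is very close to B% per 1%. The function name is an illustrative assumption.

```python
def pct_change_in_y(elasticity, pct_change_in_x):
    """Exact % change in Y when Y = A * X**elasticity and X changes by pct_change_in_x %."""
    return ((1 + pct_change_in_x / 100) ** elasticity - 1) * 100


up = pct_change_in_y(1.5, 1)     # b1 = +1.5, X rises 1%
down = pct_change_in_y(-0.5, 1)  # b1 = -.5, X rises 1%
print(round(up, 3), round(down, 3))
```

The result does not depend on the starting level of X or on the units of X and Y, which is the sense in which the elasticity is constant.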
 

3. SEMI-LOG FORM or LOG-LINEAR FORM

Only one side is in logs.

Example: Ln Y = a + B X + e

Interpretation of B: for a one unit change in X, Y changes by approximately 100 x B percent.

Example: The handout from last class.  The dependent variable was the Log of Income, so the interpretation of the coefficients was:

a. For every one year increase in education, earnings increase by 10% (coefficient = .10), not by dollars.
b. For every one point increase in IQ, earnings go up by .5%
c. For every year of experience, earnings go up by 2.9%.
d. Being married increases male earnings by 44% (interaction term)
e. Being married decreases female earnings by 9.2% (interaction term).
f. Having children decreases female earnings by 20%. (interaction term)
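Reading 100 x B as a percentage is an approximation that works well for small coefficients. The exact percentage change in Y for a one unit change in X (or for a dummy switching from 0 to 1) is 100 x (exp(B) - 1). The sketch below compares the two, assuming the percentages listed above are the raw coefficients (.10 and .44); the handout itself is not reproduced in these notes.

```python
import math


def approx_pct_effect(b):
    """Quick reading of a semi-log coefficient: 100 * B percent."""
    return 100 * b


def exact_pct_effect(b):
    """Exact percentage change in Y: 100 * (exp(B) - 1)."""
    return 100 * (math.exp(b) - 1)


for b in (0.10, 0.44):
    print(b, approx_pct_effect(b), round(exact_pct_effect(b), 2))
```

The gap between the two readings is small for a coefficient like .10 but noticeable for one as large as .44.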
 

Example: Y = a + B ln X + e

Interpretation of B: for a 1% change in X, Y changes by approximately B/100 units.
 

PROBLEMS IN MULTIPLE REGRESSION

For OLS to have the desirable qualities of an estimator - unbiased, efficient and consistent - certain assumptions have to be met.  When they hold, OLS is BLUE (Best Linear Unbiased Estimator). If the assumptions are violated, then it is possible that another estimator is better - more efficient, less biased, or consistent where OLS is not.

Think of an estimator as a rifle firing at a target.  OLS is an estimator, GLS is an estimator, MLE is an estimator, FLS, WLS, TSLS, nonlinear least squares are all estimators.  In certain circumstances, some of these other estimators may be better.

Think of the rifle as the estimator, each bullet is a point estimate, like an individual estimated beta from a single OLS regression.  The bullseye is like the population parameter that we are trying to estimate from a sample.

Consistency - as the sample size increases, the estimates from a consistent estimator close in on the true population parameter.  Like a rifle whose shots land closer and closer to the bullseye as you move closer to the target - lower variance, less dispersion.

Efficiency - how tightly clustered are the bullets/point estimates? The more efficient the estimator/rifle, the tighter the dispersion of estimates/bullets (lower variance).

Unbiasedness - is the cluster of shots/estimates centered on the bullseye?  With an unbiased estimator, the estimates are on average equal to the population parameter, even though any single shot may miss.

It would be possible to have an estimator that was more efficient than OLS but biased - centered around a point away from the bullseye, but with a tighter cluster pattern.  Like a rifle with an off-center sight.

Under certain assumptions, OLS is BLUE - the Best Linear Unbiased Estimator. Among all linear unbiased estimators, none is more efficient (has a lower variance).

There are 7 "classical assumptions" that are necessary for OLS to be the most desirable estimator. If the assumptions are violated, then 1) OLS may not be BLUE and 2) OLS results may be misleading.  Variables may show up as stat significant when they really aren't.  Or significant variables may show up as being insig.
 

SEVEN ASSUMPTIONS FOR OLS TO BE THE BEST ESTIMATOR

1. The error will have a mean of zero.

2. The error will be distributed normally.

3. The error term will be uncorrelated with the indpt vars.

4. The error terms are uncorrelated with each other - no serial correlation or autocorrelation.

5. The error will have a constant variance - homoskedastic errors.

           e ~ N (0, sigma^2)  Summarizes assumptions 1, 2 and 5

6. There is no measurement error. (Would apply for all estimators)

7. No indpt var. is perfectly correlated with any other indpt var. No multicollinearity.

The first five assumptions involve the error terms or residuals, the variation in the dependent variable NOT accounted for by the independent variables.
 

ASSUMPTIONS:

1.  Errors will have a zero mean.  The errors are assumed to be drawn from a random distribution with a mean of zero and a constant variance.

In a small sample, the mean of the errors actually drawn will not be exactly zero, but as N approaches infinity, it will approach zero.

The intercept term (constant) actually forces the mean of the regression residuals to be zero. If the mean of the error term is NOT zero, then this nonzero amount is subtracted from each error term and added to the intercept instead.

As long as there is a constant term, the mean of the residuals is forced to be zero, so we don't have to worry about violation of this assumption.
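A small demonstration of the role of the constant, using a one-variable regression fitted with the textbook formulas and hypothetical data. It checks two things: the residuals average to zero when an intercept is included, and adding a constant amount to every Y shifts the intercept but not the slope.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) for Y = a + bX by ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.1, 13.9, 16.2, 17.8, 20.1]

a, b = fit_simple_ols(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)

# Add 10 to every Y, as if the error term had a mean of 10
a_shift, b_shift = fit_simple_ols(x, [yi + 10 for yi in y])
print(mean_residual, a_shift - a, b_shift - b)
```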
 

2.  The error term is normally distributed

Not really a strict requirement for OLS, but it is a requirement for hypothesis testing. The t-test and F-tests do assume that the error term is normally distributed.

As N gets bigger, the sampling distribution of the estimated coefficients tends to approach the normal distribution even if the errors themselves are not exactly normal - Central Limit Theorem.  The normal distribution is a symmetrical, continuous, bell-shaped curve.  It is described by the mean (measure of central tendency) and variance (measure of dispersion).

With large sample sizes, we don't have to worry about violating Assumption #2, but it could be a potential problem with very small samples.
 

3.  All Explanatory Variables are Uncorrelated with the Error Term.

The X variables are assumed to be determined independently of the values of the dependent variable and the error term.  If one of the Xs and the error term are correlated, then OLS will attribute some of the variation in Y to X when it actually came from the error term.  If X and the error term are positively correlated, the coefficient on X will be biased upward (higher than it otherwise would be); if they are negatively correlated, it will be biased downward.

Example: a system of equations, or leaving out important variables, may violate the assumption.

Inc = a + B Educ + e1

Educ = a + B Inc + e2

Assume e1 is pos due to some change in technology. That positive shock raises Inc in Eqn. 1, which then raises Educ in Eqn. 2, which feeds back into Inc in Eqn. 1. So e1 and Educ ARE correlated. Mostly a problem in a system of equations.
 

4.  Error terms are uncorrelated with each other. No serially correlated or autocorrelated errors. There is no pattern in the SIGN of the errors. Serial correlation exists when there is a pattern in the sign of the errors - a positive error is likely to be followed by a positive error and a negative error is likely to be followed by a negative error.

The errors are assumed to be drawn independently from each other from a distribution. The assumption of independent errors means that we shouldn't find any pattern in the errors. It is more difficult for OLS to get precise estimates when the errors are NOT independent.

Example: first-order serial correlation, 1st order autocorrelation:

e_t = f(e_{t-1})

EVIEWS command: LS resid c resid(-1)

If there is a pattern, the coefficient on resid(-1) will be significant.

Most serious problem in time-series models. Suppose there is a shock: money supply increase/decrease, negative oil price shock, pos technology shock, deregulation, tax increase/decrease, price control, war, election, etc. It is possible that the shock will last more than one period, or affect more than one period. If a pos shock affects more than one period, the residuals are likely to be positive in more than one period. Pos error likely to follow a pos error. There is a pattern in the residuals. In this case, e_t and e_{t+1} are correlated.

Consequence of serial correlation: with positive serial correlation, OLS understates the standard errors and overstates the t-statistics. T-stats are inflated. A coefficient may appear sig when it really isn't.

Test for serial correlation: Durbin-Watson stat that is part of the regression output. The D-W test is for 1st order serial correlation. The D-W statistic can range from 0-4. If the D-W test stat is 2, then there is no serial correlation. The closer to 2, the less the serial correlation. Values well below 2 (toward 0) indicate positive serial correlation; values well above 2 (toward 4) indicate negative serial correlation.
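The D-W statistic is the sum of squared changes in the residuals divided by the sum of squared residuals. A sketch of the calculation with two hypothetical residual series, one with runs of the same sign and one that alternates in sign:

```python
def durbin_watson(residuals):
    """D-W = sum of (e_t - e_{t-1})^2 divided by sum of e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: positive errors follow positive, negative follow negative
persistent = [1.0, 1.2, 0.8, 1.1, -1.0, -0.9, -1.2, -0.8]
# Hypothetical residuals that flip sign every period
alternating = [1.0, -1.0, 1.0, -1.0]

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
print(round(dw_persistent, 2), dw_alternating)
```

The persistent series gives a value well below 2 and the alternating series a value above 2, matching the reading of the statistic described above.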
 

5.  Error Term has a Constant Variance

The error term is homoskedastic - constant error variance over the sample or over time.  No heteroskedasticity.  No pattern in the size of the residuals.

In time series, if there is a pattern in the size of errors, then e_t^2 = f(e_{t-1}^2).

A large error is likely to be followed by a large error and a small error is likely to be followed by a small error.

Volatility clustering.  Alternating periods of high volatility (large errors of either sign) and low volatility ( small errors of either sign).

In cross-section data, there could be a pattern like on page 252.  Assume that the model is for consumption as a function of income.  At higher levels of income there is a greater variance of errors around the regression line, compared to lower levels of income.

If the error variance is NOT constant, then OLS will be inefficient, and will generate imprecise estimates of the coefficients of the indpt variables. Generally, standard errors will be too small, and t-stats will be too large. T-stats and sig levels will be INFLATED, so that we might find variables to be significant when they are NOT.
 

6.  No Measurement Error - Assumes that there are no errors in data, and assumes that the variables are a correct measure.

Example: we assume that poverty can be measured by the percentage of school children who receive free or reduced price lunches.

We assume that SAT score is a valid and reliable measure of academic achievement.
 

7.  No Multicollinearity - No explanatory variable is a linear transformation of another explanatory variable. Examples:  X1 * 2 = X2, or X1 + 3 = X2.

OLS will be incapable of determining how much of the variation in Y is attributed to X1 and X2. X1 and X2 move together perfectly, so there is no way to distinguish the independent effect of each variable.  General rule: When correlation between two variables > .70, there may be a problem with multicollinearity.

Solution: one variable must be dropped.

Error message in EVIEWS when multicollinearity is strong: "Near singular matrix."

Example: Including both real and nominal Int Rates in a period when inflation and expected inflation are constant.  Nominal Interest = real + Inf Exp.

Example: including two measures of the same thing. Temperature in Fahrenheit and Celsius, weight in both kilos and pounds.

Age and work experience.

Income and wealth.

Urban highway miles and motor vehicle registrations in each state.  Both are a measure of size of state.

Special case: an X variable has variance = 0, meaning that it is just a constant - for example, a dummy variable that is always 0 or always 1 in the sample. In that case, it is perfectly collinear with the constant/intercept.

Consequence of multicollinearity: standard errors are inflated and t-statistics are lowered. T-stats are lower than they would be without the collinearity. Variables may not be sig by t-test even when they really do affect Y.
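The .70 rule of thumb and the Fahrenheit/Celsius example above can be checked with a simple correlation coefficient. This sketch computes the Pearson correlation by hand; the temperature readings are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


celsius = [0.0, 10.0, 20.0, 30.0, 40.0]          # hypothetical readings
fahrenheit = [c * 9 / 5 + 32 for c in celsius]   # exact linear transformation

r = pearson_r(celsius, fahrenheit)
possible_problem = abs(r) > 0.70  # rule of thumb from the notes
print(round(r, 6), possible_problem)
```

Because Fahrenheit is an exact linear transformation of Celsius, the correlation is 1 and only one of the two variables can stay in the regression.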
 

POTENTIAL PROBLEMS IN MULTIPLE REGRESSION:
See page 259.

EVIEWS:
1.  After you perform a linear regression, you can click on "View," then "Residual Tests" and there are a series of diagnostic tests, mostly for serial correlation and heteroskedasticity in the residuals (errors).

2.  Also, after a regression, you can click on "View," then "Stability Tests,"  which includes the "Chow Breakpoint Test" for structural stability over time.