@1007477689 2020-05-01T15:32:00.000000Z 字数 35550 阅读 598

Chapter 3 Analysis of Cross-Sectional Data

学习

Note

The primary reference text for these notes is Hayashi (2000). Other comprehensive treatments are available in Greene (2007) and Davidson & MacKinnon (2003).

3.0 Introduction

Linear regression is the foundation of modern econometrics. While the importance of linear regression in financial econometrics has diminished in recent years, it is still widely employed. More importantly, the theory behind least squares estimators is useful in broader contexts and many results of this chapter are special cases of more general estimators presented in subsequent chapters. This chapter covers model specification, estimation, inference, under both the classical assumptions and using asymptotic analysis, and model selection.

Linear regression is the most basic tool of any econometrician and is widely used throughout finance and economics. Linear regression's success is owed to two key features: the availability of simple, closed form estimators and the ease and directness of interpretation. However, despite superficial simplicity, the concepts discussed in this chapter will reappear in the chapters on time series, panel data, Generalized Method of Moments (GMM), event studies and volatility modeling.

3.1 Model Description

Linear regression expresses a dependent variable as a linear function of independent variables, possibly random, and an error

$y_{i} = \beta_{1}^{i}x_{1}^{i} + \beta_{2}^{i}x_{2}^{i} + ... + \beta_{k}^{i}x_{k}^{i} + \epsilon_{i}$

where $y^{i}$ is known as the regress and, dependent variable or simply the left-hand-side variable.
The $k$ variables, $x_{1}^{i}$ , $...$ , $x_{k}^{i}$ , are known as the regressors, independent variables orright-hand-side variables. $\beta_{1}^{i}$ , $\beta_{2}^{i}$ , $...$ , $\beta_{k}^{i}$ , are the regression coefficients, $\epsilon_{i}$ is known as the innovation, shock or error and $i = 1,2,...,n$ index the observation. While this representation clarifies the relationship between $y^{i}$ and the $x$ s, matrix notation will generally be used to compactly describe models:

$\epsilon_{i}$

where $X$ is an $n$ by $k$ matrix, $\beta$ is a $k$ by $1$ vector, and both $y$ and $\epsilon$ are $n$ by $1$ vectors.

Two vector notations will occasionally be used: row

$\epsilon_{i}$

and column,

$\textbf{y}_{i} = \beta_{1}^{i}\textbf{x}_{1} + \beta_{2}^{i}\textbf{x}_{2} + ... + \beta_{k}^{i}\textbf{x}_{k} + \epsilon_{i}$

Linear regression allows coefficients to be interpreted all things being equal. Specifically, the effect of a change in one variable can be examined without changing the others. Regression analysis also allows for models which contain all of the information relevant for determining $y^{i}$ whether it is directly of interest or not. This feature provides the mechanism to interpret the coefficient on a regressor as the unique effect of that regressor (under certain conditions), a feature that makes linear regression very attractive.

3.1.1 What is A Model?

What constitutes a model is a difficult question to answer. One view of a model is that of the data generating process (DGP). For instance, if a model postulates

$y_{i} = \beta_{i}x_{i} + \epsilon_{i}$

one interpretation is that the regressand, $y_{i}$ , is exactly determined by $x_{i}$ and some random shock. An alternative view, one that I espouse, holds that $x_{i}$ is the only relevant variable available to the econometrician that explains variation in $y_{i}$ . Everything else that determines $y_{i}$ cannot be measured and, in the usual case, cannot be placed into a framework which would allow the researcher to formulate a model.

Consider monthly returns on the S&P 500, a value weighted index of 500 large firms in the United States. Equity holdings and returns are generated by individuals based on their beliefs and preferences. If one were to take a (literal) data generating process view of the return on this index, data on the preferences and beliefs of individual investors would need to be collected and formulated into a model for returns. This would be a daunting task to undertake, depending on the generality of the belief and preference structures.

On the other hand, a model can be built to explain the variation in the market based on observable quantities (such as: oil price changes, macroeconomic news announcements, etc.) without explicitly collecting information on beliefs and preferences. In a model of this type, explanatory variables can be viewed as inputs individuals consider when forming their beliefs and, subject to their preferences, taking actions which ultimately affect the price of the S&P 500. The model allows the relationships between the regressand and regressors to be explored and is meaningful even though the model is not plausibly the data generating process.

In the context of time-series data, models often postulate that the past values of a series are useful in predicting future values. Again, suppose that the data were monthly returns on the S&P 500 and, rather than using contemporaneous explanatory variables, past returns are used to explain present and future returns. Treated as a DGP, this model implies that average returns in the near future would be influenced by returns in the immediate past. Alternatively, taken an approximation, one interpretation postulates that changes in beliefs or other variables that influence holdings of assets change slowly (possibly in an unobservable manner). These slowly changing "factors" produce returns which are predictable. Of course, there are other interpretations but these should come from finance theory rather than data. The model as a proxy interpretation is additionally useful as it allows models to be specified which are only loosely coupled with theory but that capture interesting features of a theoretical model.

Careful consideration of what defines a model is an important step in the development of an econometrician, and one should always consider which assumptions and beliefs are needed to justify a specification.

3.1.2 Example: Cross-section Regression of Returns on Factors

The concepts of linear regression will be explored in the context of a cross-section regression of returns on a set of factors thought to capture systematic risk. Cross sectional regressions in financial econometrics date back at least to the "Capital Asset Pricing Model" (CAPM, Markowitz (1959), Sharpe (1964) and Lintner(1965)), a model formulated as a regression of individual asset's excess returns on the excess return of the market. More general specifications with multiple regressors are motivated by the "Intertemporal CAPM" (ICAPM, Merton (1973)) and "Arbitrage Pricing Theory" (APT, Ross (1976)).

The basic model postulates that excess returns are linearly related to a set of systematic risk factors. The factors can be returns on other assets, such as: the market portfolio, or any other variable related to intertemporal hedging demands, such as: interest rates, shocks to inflation or consumption growth or more compactly,

$r^{i} − r^{i}_{f} = \textbf{f}^{i}\mathbf{\beta} + \epsilon^{i}$

where $r_{e}^{i} = r^{i} − r^{i}_{f}$ is the excess return on the asset and $\textbf{f}^{i}= [f_{1}^{i} , ... , f_{k}^{i}]$ are returns on factors that explain systematic variation.

VARIABLE	DESCRIPTION
VWM	Returns on a value-weighted portfolio of all NYSE, AMEX and NASDAQ stocks
SMB	Returns on the Small minus Big factor, a zero investment portfolio that is long small market capitalization firms and short big caps.
HML	Returns on the High minus Low factor, a zero investment portfolio that is long high BE/ME firms and short low BE/ME firms.
UMD	Returns on the Up minus Down factor (also known as the Momentum factor), a zero investment portfolio that is long firms with returns in the top 30% over the past 12 months and short firms with returns in the bottom 30%.
SL	Returns on a portfolio of small cap and low BE/ME firms.
SM	Returns on a portfolio of small cap and medium BE/ME firms.
SH	Returns on a portfolio of small cap and high BE/ME firms.
BL	Returns on a portfolio of big cap and low BE/ME firms.
BM	Returns on a portfolio of big cap and medium BE/ME firms.
BH	Returns on a portfolio of big cap and high BE/ME firms.
RF	Risk free rate (Rate on a 3 month T-bill).
DATE	Date in format YYYYMM.

Table 3.1: Variable description for the data available in the Fama-French data-set used throughout this chapter.

Linear factors models have been used in countless studies, the most well known by Fama and French (Fama & French(1992) and Fama & French(1993)) who use returns on specially constructed portfolios as factors to capture specific types of risk. The Fama-French data set is available in Excel (ff.xls) or MATLAB (ff.mat) formats and contains the variables listed in table 3.1.

All data, except the interest rates, are from the CRSP database and were available monthly from January 1927 until June 2008. Returns are calculated as 100 times the logarithmic price difference (100(ln(pi )−ln(pn−1))). Portfolios were constructed by sorting the firms into categories based on market capitalization, Book Equity to Market Equity (BE/ME), or past returns over the previous year. For further details on the construction of portfolios, see Fama & French(1993) or Ken French's website:

http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.

A general model for the $BH$ portfolio can be specified

$BH^{i}_{e} - RF^{i}_{e} = \beta_{1} + \beta_{2}(VWM_{e}^{i} - RF^{i}_{e}) + \beta_{3}SMB_{e}^{i} + \beta_{4}HML_{e}^{i} +\beta_{5}UMD^{i} + \epsilon^{i}$

or, in terms of the excess returns,

$BH^{i}_{e} = \beta_{1} + \beta_{2}VWM_{e}^{i} + \beta_{3}SMB_{e}^{i} + \beta_{4}HML_{e}^{i} +\beta_{5}UMD^{i} + \epsilon^{i}$

The coefficients in the model can be interpreted as the effect of a change in one variable holding the other variables constant. For example, $\beta_{3}$ captures the effect of a change in the $SMB_{e}^{i}$ risk factor holding $VWM_{e}^{i}$ , $HML_{e}^{i}$ and $UMD^{i}$ constant. Table 3.2 contains some descriptive statistics of the factors and the six portfolios included in this data set.

3.2 Functional Form

A linear relationship is fairly specific and, in some cases, restrictive. It is important to distinguish specifications which can be examined in the framework of a linear regression from those which cannot. Linear regressions require two key features of any model: each term on the right hand side must have only one coefficient that enters multiplicatively and the error must enter additively.1 Most specifications satisfying these two requirements can be treated using the tools of linear regression.2 Other forms of “nonlinearities” are permissible. Any regressor or the regressand can be nonlinear transformations of the original observed data.

Double log (also known as log-log) specifications, where both the regressor and the regressands are log transformations of the original (positive) data, are common.

$lny_{i} = \beta_{1} + \beta_{2}lnx_{i} + \epsilon_{i}$

In the parlance of a linear regression, the model is specified:

$\widetilde{y}_{i} = \beta_{1} + \beta_{2}\widetilde{x}_{i} + \epsilon_{i}$

where $\widetilde{y}_{i} = ln(y_i)$ and $\widetilde{x}_{i} = lnx_{i}$ . The usefulness of the double log specification can be illustrated by a Cobb-Douglas production function subject to a multiplicative shock

$Y_{i} = \beta_{1}K^{\beta_{2}}_{i}L^{\beta_{3}}_{i}\epsilon_{i}.$

Using the production function directly, it is not obvious that, given values for output ( $Y_i$ ), capital ( $K_i$ ) and labor ( $L_i$ ) of firm $i$ , the model is consistent with a linear regression. However, taking logs,

$lnY_{i} =ln\beta_{1} + \beta_{2}lnK_{i} + \beta_{3}lnL_{i} + ln\epsilon_{i}$

the model can be reformulated as a linear regression on the transformed data. Other forms, such as semi-log (either log-lin, where the regressand is logged but the regressors are unchanged, or lin-log, the opposite) are often useful to describe certain relationships.

Linear regression does, however, rule out specifications which may be of interest. Linear regression is not an appropriate framework to examinea model of the form

$y_{i} = \beta_{1}x^{\beta_{2}}_{1,t} + \beta_{3}x^{\beta_{4}}_{2,t} + \epsilon_{i}$

Fortunately, more general frameworks, such as "generalized method of moments" (GMM) or "maximum likelihood estimation" (MLE), topics of subsequent chapters, can be applied.

Two other transformations of the original data, dummy variables and interactions, can be used to generate nonlinear (in regressors) specifications.

A dummy variable is a special class of regressor that takes the value $0$ or $1$ . In finance, dummy variables (or dummies) are used to model calendar effects, leverage (where the magnitude of a coefficient depends on the sign of the regressor), or group-specific effects.
Variable interactions parameterize nonlinearities into a model through products of regressors. Common interactions in clude powers of regressors $\left( x^{2}_{1,i}, x^{3}_{1,i} \right)$ , $...$ , cross-products of regressors $\left( x_{1}^{i} , x_{2}^{i} \right)$ and interactions between regressors and dummy variables.

Considering the range of nonlinear transformation, linear regression is surprisingly general despite the restriction of parameter linearity.

The use of nonlinear transformations also change the interpretation of the regression coeffi- cients. If only unmodified regressors are included,
yi =xiβ+εi
and ∂ yi = βk . Suppose a specification includes both xk and x 2 as regressors,
∂ xk,i k yi =β1xi +β2xi2+εi
In this specification, ∂ yi =β1+β2xi and the level of the variable enter sits partial effect. Similarly, ∂ xi
in a simple double log model and

lnyi =β1lnxi +εi ,∂y β1=∂lnyi = y =%∆y
∂ lnxi ∂ x %∆x x

Thus, β1 corresponds to the elasticity of yi with respect to xi . In general, the coefficient on a variable in levels corresponds to the effect of a one unit changes in that variable while the coefficient on a variable in logs corresponds to the effect of a one percent change. For example, in a semi-log model where the regressor is in logs but the regressand is in levels,

yi =β1lnxi +εi ,

β1 will correspond to a unit change in yi for a % change in xi . Finally, in the case of discrete regressors, where there is no differential interpretation of coefficients, β represents the effect of a whole unit change, such as a dummy going from 0 to 1.

3.2.1 Example: Dummy variables and interactions in cross section regression

Two calendar effects, the January and the December effects, have been widely studied in finance. Simply put, the December effect hypothesizes that returns in December are unusually low due to tax-induced portfolio rebalancing, mostly to realized losses, while the January effect stipulates returns are abnormally high as investors return to the market. To model excess returns on a portfolio (B Hie ) as a function of the excess market return (V W Mie ), a constant, and the January and December effects, a model can be specified

$BHe=β+βVWMe+βI +βI +ε i 1 2 i 31i 412i i$

where I1i takes the value 1 if the return was generated in January and I12i does the same for December. The model can be reparameterized into three cases:

$BHe =(β +β )+β VWMe+ε i132ii January$

$BHe =(β +β )+β VWMe+ε i142ii December$

$BHe =β +β VWMe+ε i12ii Otherwise$

Similarly dummy interactions can be used to produce models with both different intercepts and different slopes in January and December,

$BHe=β+βVWMe+βI +βI +βI VWMe+βI VWMe+ε. i 1 2 i 31i 412i 51i i 612i i i$

If excess returns on a portfolio were nonlinearly related to returns on the market, a simple model can be specified

$BHe =β +β VWMe +β (VWMe)2+β (VWMe)3+ε . i12i3i4ii$

Dittmar (2002) proposed a similar model to explain the cross-sectional dispersion of expected returns.

3.3 Estimation

Linear regression is also known as ordinary least squares (OLS) or simply least squares, a moniker derived from the method of estimating the unknown coefficients. Least squares minimizes the squared distance between the fit line (or plane if there are multiple regressors) and the regressand. The parameters are estimated as the solution to

$min (\textbf{y} - \textbf{X}\textbf{\beta})^{T}(\textbf{y} − \textbf{X}\textbf{\beta}) = min \sum_{i=1}^{n}(y_{i} − \textbf{x}_{i}\textbf{X})^2 (3.6)$

First order conditions of this optimization problem are

$−2\textbf{X}'(\textbf{y} - \textbf{X}\textbf{\beta}) = −2(\textbf{X}'\textbf{y} - \textbf{X}'\textbf{X}\textbf{\beta}) = −2\sum_{i=1}^{n} \textbf{x}_i(\textbf{y} - \textbf{X}\textbf{\beta})) = \textbf{0} (3.7)$

and rearranging, the least squares estimator for $\beta$ , can be defined.

Definition3.1 - OLS Estimator

The ordinary least squares estimator, denoted $\hat{\beta}$ , is defined

$\hat{\beta} =(\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}\textbf{y} \tag{3.8}$

Clearly this estimator is only reasonable if $\textbf{X}^{T}\textbf{X}$ is invertible which is equivalent to the condition that $rank(\textbf{X}) = k$ . This requirement states that no column of $\textbf{X}$ can be exactly expressed as a combination of the $k − 1$ remaining columns and that the number of observations is at least as large as the number of regressors ( $n ≥ k$ ). This is a weak condition and is trivial to verify in most econometric software packages : using a less than full rank matrix will generate a warning or error.

Dummy variables create one further issue worthy of special attention. Suppose dummy variables corresponding to the 4 quarters of the year, I1i ,...,I4i , are constructed from a quarterly data set of portfolio returns. Consider a simple model with a constant and all 4 dummies

rn =β1+β2I1i +β3I2i +β4I3i +β5I4i +εi.

It is not possible to estimate this model with all 4 dummy variables and the constant because the constant is a perfect linear combination of the dummy variables and so the regressor matrix would be rank deficient. The solution is to exclude either the constant or one of the dummy variables. It makes no difference in estimation which is excluded, although the interpretation of the coefficients changes. In the case where the constant is excluded, the coefficients on the dummy variables are directly interpretable as quarterly average returns. If one of the dummy variables is excluded, for example the first quarter dummy variable, the interpretation changes. In this parameterization, rn =β1+β2I2i +β3I3i +β4I4i +εi,
β1 istheaveragereturninQ1,whileβ1+βj istheaveragereturninQj. It is also important that any regressor, other the constant, be nonconstant. Suppose a regression included the number of years since public floatation but the data set only contained assets that have been trading for exactly 10 years. Including this regressor and a constant results in perfect collinearity, but, more importantly, without variability in a regressor it is impossible to deter- mine whether changes in the regressor (years since float) results in a change in the regressand
or whether the effect is simply constant across all assets. The role that variability of regressors plays be revisited when studying the statistical properties of βˆ .
The second derivative matrix of the minimization,
2X′ X,
ensures that the solution must be a minimum as long as X′X is positive definite. Again, positive
definiteness of this matrix is equivalent to rank(X) = k .
Once the regression coefficients have been estimated, it is useful to define the fit values, yˆ = Xβˆ
and sample residuals εˆ = y − yˆ = y − Xβˆ . Rewriting the first order condition in terms of the explanatory variables and the residuals provides insight into the numerical properties of the residuals. An equivalent first order condition to eq. (3.7) is
X′εˆ =0. (3.9) This set of linear equations is commonly referred to as the normal equations or orthogonality conditions. This set of conditions requires that εˆ is outside the span of the columns of X. Moreover, considering the columns of X separately, X′j εˆ = 0 for all j = 1,2,...,k . When a column contains aconstant(aninterceptinthemodelspecification),ι′εˆ=0(􏰖ni=1εˆi =0),andthemeanofthe
residuals will be exactly 0.3
The OLS estimator of the residual variance, σˆ 2, can be defined.4
Definition3.2(OLSVarianceEstimator). TheOLSresidualvarianceestimator,denotedσˆ2,is
defined
σˆ2= εˆ′εˆ (3.10) n−k
Definition3.3(StandardErroroftheRegression). Thestandarderroroftheregressionisdefinedas √
σˆ= σˆ2 (3.11)
The least squares estimator has two final noteworthy properties. First, nonsingular transforma- tions of the x ’s and non-zero scalar transformations of the y ’s have deterministic effects on the estimated regression coefficients. Suppose A is a k by k nonsingular matrix and c is a non-zero scalar.Thecoefficientsofaregressionofcyi onxiAare
β ̃ =[(XA)′(XA)]−1(XA)′(cy) = c (A′X′XA)−1A′X′y
= c A−1(X′X)−1A′−1A′X′y = c A−1(X′X)−1X′y =cA−1βˆ.
(3.12)
3ι is an n by 1 vector of 1s.
4The choice of n −k in the denominator will be made clear once the properties of this estimator have been examined.
Portfolio Constant
VWMe SMB HML UMD σ
BHe BMe BLe SHe SMe SLe
-0.06 -0.02 0.09 0.05 0.06 -0.10
1.08 0.02 0.80 0.99 -0.12 0.32 1.02 -0.10 -0.24 1.02 0.93 0.77 0.98 0.82 0.31 1.08 1.05 -0.19
-0.04 1.29 -0.03 1.25 -0.02 0.78 -0.04 0.76
0.01 1.07 -0.06 1.24
Table3.3:Estimatedregressioncoefficientsfromthemodelrpi =β1+β2VWMe +β3SMBi + ii
β4H M Li +β5U M Di +εi , where r pi is the excess return on one of the six size and BE/ME sorted i
portfolios. The final column contains the standard error of the regression.
Second, as long as the model contains a constant, the regression coefficients on all terms ex- cept the intercept are unaffected by adding an arbitrary constant to either the regressor or the regressands. Consider transforming the standard specification,
to
yi = β1 +β2 x2,i +...+βk xk ,i +εi y ̃i = β1 +β2 x ̃2,i +...+βk x ̃k ,i +εi
wherey ̃i =yi+cy andx ̃j,i =xj,i+cxj .Thismodelisidenticalto yi = β ̃1 +β2 x2,i +...+βk xk ,i +εi
whereβ ̃1=β1+cy −β2cx2 −...−βkcxk .

3.3.1 Estimation of Cross-Section regressions of returns on factors

Table 3.3 contains the estimated regression coefficients as well as the standard error of the regression for the 6 portfolios in the Fama-French data set in a specification including all four factors and a constant. There has been a substantial decrease in the magnitude of the standard error of the regression relative to the standard deviation of the original data. The next section will formalize how this reduction is interpreted.

3.4 Assessing Fit

Once the parameters have been estimated, the next step is to determine whether or not the model fits the data. The minimized sum of squared errors, the objective of the optimization, is an obvious choice to assess fit. However, there is an important limitation drawback to using the sum of squared errors: changes in the scale of $y_{i}$ alter the minimized sum of squared errors without changing the fit. In order to devise a scale free metric, it is necessary to distinguish between the portions of $\textbf{y}$ which can be explained by $\textbf{X}$ from those which cannot.

Two matrices, the projection matrix, ${\textbf{P}}_{\textbf{X}}$ and the annihilator matrix, ${\textbf{M}}_{\textbf{X}}$ , are useful when decomposing the regressand into the explained component and the residual.

Definition3.4 - Projection Matrix

The projection matrix, asymmetric idempotent matrix that produces the projection of a variable onto the space spanned by $\textbf{X}$ , denoted ${\textbf{P}}_{\textbf{X}}$ , is defined

${\textbf{P}}_{\textbf{X}} = \textbf{X}(\textbf{X}'\textbf{X})^{-1} \textbf{X}' (3.13)$

Definition3.5 - Annihilator Matrix

The annihilator matrix, asymmetric idempotent matrixthat produces the projection of a variable onto the null space of $\textbf{X}'$ , denoted ${\textbf{M}}_{\textbf{X}}$ , is defined

${\textbf{M}}_{\textbf{X}} = \textbf{I}_{n} - \textbf{X}(\textbf{X}'\textbf{X})^{-1} \textbf{X}'(3.14)$

These two matrices have some desirable properties. Both the fit value of y (yˆ) and the estimated errors, εˆ , can be simply expressed in terms of these matrices as yˆ = PX y and εˆ = MX y respectively. These matrices are also idempotent: PXPX = PX and MXMX = MX and orthogonal: PXMX = 0. The projection matrix returns the portion of y that lies in the linear space spanned by X, while the annihilator matrix returns the portion of y which lies in the null space of X′. In essence, MX annihilates any portion of y which is explainable by X leaving only the residuals.

Decomposing $\textbf{y}$ using the projection and annihilator matrices,

$y = {\textbf{P}}_{\textbf{X}}\textbf{y} + {\textbf{M}}_{\textbf{X}}\textbf{y}$

which follows since ${\textbf{P}}_{\textbf{X}} + {\textbf{M}}_{\textbf{X}} = {\textbf{I}}_{\textbf{n}}$ . The squared observations can be decomposed

y′y=(PXy+MXy)′(PXy+MXy) =y′PXPXy+y′PXMXy+y′MXPXy+y′MXMXy =y′PXy+0+0+y′MXy
=y′PXy+y′MXy
notingthatPX andMX areidempotentandPXMX=0n.Thesethreequantitiesareoftenreferred
to as5
n
y′y=􏰚y2 UncenteredTotalSumofSquares(TSS )
(3.15)
(3.16)
iU i=1
n
y′ PX y = 􏰚(xi βˆ )2 Uncentered Regression Sum of Squares (RSSU )
i=1
5

There is no consensus about the names of these quantities. In some texts, the component capturing the fit portion is known as the Regression Sum of Squares (RSS) while in others it is known as the Explained Sum of Squares (ESS), while the portion attributable to the errors is known as the Sum of Squared Errors (SSE), the Sum of Squared Residuals (SSR) ,the Residual Sum of Squares (RSS) or the Error Sum of Squares (ESS). The choice to use SSE and RSS in this text was to ensure the reader that SSE must be the component of the squared observations relating to the error variation.
n y′MXy=􏰚(yi −xiβˆ)2
i=1 Dividing through by y′y
or
which is not (y′MXy). The portion of the total variation explained by X is known as the uncentered R2 (R2U),
Definition3.6(UncenteredR2(R2U)). TheuncenteredR2,whichisusedinmodelsthatdonot include an intercept, is defined
R2U=RSSU =1−SSEU (3.18) TSSU TSSU
While this measure is scale free it suffers from one shortcoming. Suppose a constant is added to y, so that the TSSU changes to (y+c )′(y+c ). The identity still holds and so (y+c )′(y+c ) must increase (for a sufficiently large c ). In turn, one of the right-hand side variables must also grow larger. In the usual case where the model contains a constant, the increase will occur in the RSSU (y′PXy), and as c becomes arbitrarily large, uncentered R2 will asymptote to one. To overcome this limitation, a centered measure can be constructed which depends on deviations from the mean rather than on levels.
Lety ̃=y−y ̄ =MιywhereMι =In −ι(ι′ι)−1ι′ ismatrixwhichsubtractsthemeanfromavector of data. Then
UncenteredSumofSquaredErrors(SSEU).
y′PXy + y′MXy =1 y′y y′y
RSSU +SSEU =1.
(3.17)
TSSU TSSU
This identity expresses the scale-free total variation in y that is captured by X (y′PXy) and that
or more compactly
y′MιPXMιy+y′MιMXMιy=y′Mιy y′MιPXMιy + y′MιMXMιy =1
y′Mιy y′Mιy
y ̃′PXy ̃ + y ̃′MXy ̃ =1. y ̃ ′ y ̃ y ̃ ′ y ̃
Centered R2 (R2C ) is defined analogously to uncentered replacing the uncentered sums of squares with their centered counterparts.

Definition3.7 - Centered $R^2$ ,(R2C)

The uncentered R2, used in models that include an intercept, is defined

R2C=RSSC =1−SSEC (3.19) TSSC TSSC

where

n y′MιMXMιy=􏰚(yi −xiβˆ)2
i=1 andwherex ̄=n−1􏰖ni=1xi.
n y′Mιy=􏰚(yi −y ̄)2
i=1 n
y′MιPXMιy=􏰚(xiβˆ −x ̄βˆ)2 i=1
CenteredTotalSumofSquares(TSSC) CenteredRegressionSumofSquares(RSSC) CenteredSumofSquaredErrors(SSEC).
(3.20) (3.21) (3.22)

The expressions R2 , SSE, RSS and TSS should be assumed to correspond to the centered version unless further qualified. With two versions of R2 available that generally differ, which should be used? Centered should be used if the model is centered (contains a constant) and uncentered should be used when it does not. Failing to chose the correct R2 can lead to incorrect conclusions about the fit of the model and mixing the definitions can lead to a nonsensical R2 that falls outside of [0,1]. For instance, computing R2 using the centered version when the model does not contain a constant often results in a negative value when

$R^2 = 1 - \frac{SSE_{C}}{TSS_{C}}$

Most software will return centered $R^2$ and caution is warranted if a model is fit without a constant.

$R^2$ does have some caveats. First, adding an additional regressor will always (weakly) increase the $R^2$ since the sum of squared errors cannot increase by the inclusion of an additional regressor. This renders $R^2$ useless in discriminating between two models where one is nested within the other. One solution to this problem is to use the degree of freedom adjusted $R^2$ .

Definition3.8 - Adjusted $R^2$ ( $\overline{R}^2$ )

The adjusted $\overline{R}^2$ , which adjusts for the number of estimated parameters, is defined

$\overline{R}^2 = 1 - \frac{\frac{SSE}{n-k}}{\frac{TSS}{n-1}} = 1 - \frac{SSE}{TSS} \frac{n - 1}{n - k}$

$\overline{R}^2$ will increase if the reduction in the $SSE$ is large enough to compensate for a loss of 1 degree of freedom, captured by the $n − k$ term. However, if the $SSE$ does not change, $\overline{R}^2$ will decrease. $\overline{R}^2$ is preferable to ${R}^2$ for comparing models, although the topic of model selection will be more formally considered at the end of this chapter. $\overline{R}^2$ , like ${R}^2$ 2, should be constructed from the appropriate versions of the $RSS$ , $SSE$ and $TSS$ (either centered or uncentered) .

Second, ${R}^2$ is not invariant to changes in the regressand. A frequent mistake is to use ${R}^2$ to compare the fit from two models with different regressands, for instance $y_i$ and $ln(y_i)$ . These numbers are incomparable and this type of comparison must be avoided. Moreover, ${R}^2$ is even sensitive to more benign transformations. Suppose a simple model is postulated,

$y_i =β_1 + β_2x_i + ε_i$

of factors models. Bold indicates the correct version (centered or uncentered) for that model. R2 is monotonically increasing in larger models, while R ̄ 2 is not. and a model logically consistent with the original model,

$yi − xi = β1 + (β2−1)xi + εi$

is estimated. The R2s from these models will generally differ. For example, suppose the original coefficientonxi was 1. Subtracting xi will reduce the explanatory power of xi to 0,rendering it useless and resulting in a R2 of 0 irrespective of the R2 in the original model.

3.4.1 Example: $R^2$ and $\overline{R}^2$ in Cross-Sectional Factor models

To illustrate the use of $R^2$ , and the problems with its use, consider a model for $BH^e$ which can depend on one or more risk factor.

The $R^2$ values in Table 3.4 show two things.

First, the excess return on the market portfolio alone can explain $80%$ of the variation in excess returns on the big-high portfolio.

Second, the $HML$ factor appears to have additional explanatory power on top of the market evidenced by increases in $R^2$ from $0.80$ to $0.96$ . The centered and uncentered $R^2$ are very similar because the intercept in the model is near zero. Instead, suppose that the dependent variable is changed to $10+BH^e$ or $100+BH^e$ and attention is restricted to the CAPM. Using the incorrect definition for $R^2$ can lead to nonsensical (negative) and misleading (artificially near 1) values. Finally, Table 3.5 also illustrates the problems of changing the regressand by replacing the regressand $BH_i^e$ with $BH_i^e − VWM_i^e$ . The $R^2$ decreases from a respectable $0.80$ to only $0.10$ , despite the interpretation of the model is remaining unchanged.

3.5 Assumptions

Thus far, all of the derivations and identities presented are purely numerical. They do not indicate whether $\hat{\beta}$ is a reasonable way to estimate $\beta$ . It is necessary to make some assumptions about the innovations and the regressors to provide a statistical interpretation of $\hat{\beta}$ . Two broad classes of assumptions can be used to analyze the behavior of $\hat{\beta}$ : the classical framework (also known as small sample or finite sample) and asymptotic analysis (also known as large sample).

Neither method is ideal. The small sample framework is precise in that the exact distribution of regressors and test statistics are known. This precision comes at the cost of many restrictive assumptions – assumptions not usually plausible in financial applications. On the other hand, asymptotic analysis requires few restrictive assumptions and is broadly applicable to financial data, although the results are only exact if the number of observations is infinite. Asymptotic analysis is still useful for examining the behavior in finite samples when the sample size is large enough for the asymptotic distribution to approximate the finite-sample distribution reasonably well.

This leads to the most important question of asymptotic analysis: How large does n need to be before the approximation is reasonable? Unfortunately, the answer to this question is "It depends". In simple cases, where residuals are independent and identically distributed, as few as 30 observations may be sufficient for the asymptotic distribution to be a good approximation to the finite-sample distribution. In more complex cases, anywhere from 100 to 1,000 may be needed, while in the extreme cases, where the data is heterogenous and highly dependent, an asymptotic approximation may be poor with more than 1,000,000 observations.

The properties of $\hat{\beta}$ will be examined under both sets of assumptions. While the small sample results are not generally applicable, it is important to understand these results as the lingua franca of econometrics, as well as the limitations of tests based on the classical assumptions, and to be able to detect when a test statistic may not have the intended asymptotic distribution. Six assumptions are required to examine the finite-sample distribution of $\hat{\beta}$ and establish the optimality of the OLS procedure ( although many properties only require a subset ).

Assumption 3.9 — Linearity

$y_{i} = \textbf{x}_{i} \textbf{\beta} + \epsilon_{i}$

This assumption states the obvious condition necessary for least squares to be a reasonable method to estimate the $\textbf{\beta}$ . It further imposes a less obvious condition, that $\textbf{x}_{i}$ must be observed and measured without error. Many applicationsin financial econometricsinclude latent variables. Linear regression is not applicable in these cases and a more sophisticated estimator is required. In other applications, the true value of $x_{k,i}$ is not observed and a noisy proxy must be used, so that $\hat{x}_{k,i} = x_{k,i} + ν_{k,i}$ where $νk,i$ is an error uncorrelated with $x_{k,i}$ . When this occurs, ordinary least squares estimators are misleading and a modified procedure (two-stage least squares (2SLS) or instrumental variable regression (IV)) must be used.

Assumption 3.10 — Conditional Mean

$E[\epsilon_{i} |\textbf{X}] = 0, i = 1 , 2 , ... , n$

This assumption states that the mean of each $epsilon_{i}$ is zero given any $x_{k, i}$ , any function of any $x_{k,i}$ or combinations of these. It is stronger than the assumption used in the asymptotic analysis and is not valid in many applications (e.g. Time-Series Data). When the regressand and regressor consist of time-series data, this assumption may be violated and $E[εi |xi+j ]6 = 0$ for some $j$ . This assumption also implies that the correct form of $x_{k,i}$ enters the regression, that $E[\epsilon_{i} = 0$ (through a simple application of the law of iterated expectations), and that the innovations are uncorrelated with the regressors, so that $E[\epsilon_{i} |X] = 0 , i' = 1, 2, ... , n, j =1 , 2 , ... ,n , j = 1 , 2 , ... , k$ .

Assumption 3.11 — Rank

The rank of $\textbf{X}$ is $k$ with probability $1$ .

This assumption is needed to ensure that $\hat{\beta}$ ** is identified and can be estimated. In practice, it requires that the no regressor is perfectly co-linear with the others, that the number of observations is at least as large as the number of regressors ( $n ≥ k$ ) and that variables other than a constant have non-zero variance.

Assumption 3.12 — Conditional Homoskedasticity

$V[\epsilon_{i} |\textbf{X}] = \sigma^{2}$

Homoskedasticity is rooted in homo (same) and skedannumi (scattering) and in modern English means that the residuals have identical variances. This assumption is required to establish the optimality of the OLS estimator and it specifically rules out the case where the variance of an innovation is a function of a regressor.

Assumption 3.13 — Conditional Correlation

$E[\epsilon_{i}\epsilon_{j} |\textbf{X}] = 0, i = 1, 2, ... , n, j = i + 1, ... , n$

Assuming the residuals are conditionally uncorrelated is convenient when coupled with the homoskedasticity assumption: the covariance of the residuals will be $\sigma^{2}\textbf{I}_{n}$ . Like homoskedasticity, this assumption is needed for establishing the optimality of the least squares estimator.

Assumption 3.14 — Conditional Normality

$\mathbf{\epsilon}|\mathbf{X} ∼ N (\mathbf{0}, \mathbf{\Sigma})$

Assuming a specific distribution is very restrictive – results based on this assumption will only be correct is the errors are actually normal – but this assumption allows for pre

3.13 Model Selection and Specification Checking

Econometric problems often begin with a variable whose dynamics are of interest and a relatively large set of candidate explanatory variables. The process by which the set of regressors is reduced is known as model selection or building.

Model building inevitably reduces to balancing two competing considerations: congruence and parsimony. A congruent model is one that captures all of the variation in the data explained by the regressors. Obviously, including all of the regressors and all functions of the regressors should produce a congruent model. However, this is also an infeasible procedure since there are infinitely many functions of even a single regressor. Parsimony dictates that the model should be as simple as possible and so models with fewer regressors are favored. The ideal model is the parsimonious congruent model that contains all variables necessary to explain the variation in the regressand and nothing else.

Model selection is as much a black art as science and some lessons can only be taught through experience. One principal that should be universally applied when selecting a model is to rely on economic theory and, failing that, common sense. The simplest method to select a poorly performing model is to try any and all variables, a process known as data snooping that is capable of producing a model with an arbitrarily high $R^{2}$ even if there is no relationship between the regressand and the regressors.

There are a few variable selection methods which can be examined for their properties. These include

General to Specific modeling (GtS)
Specific to General modeling (StG)
Information criteria (IC)

3.13.1 Model Building

3.13.1.1 General to Specific

General to specific (GtS) model building begins by estimating largest model that can be justified by economic theory (and common sense). This model is then pared down to produce the smallest model that remains congruent with the data. The simplest version of GtS begins with the complete model. If any coefficients have p-values (of t -stats) less than some significance level α (usually 5 or 10%), the least significant regressor is dropped from the regression. Using the remaining regressors, the procedure is repeated until all coefficients are statistically significant, always dropping the least significant regressor.

One drawback to this simple procedure is that variables which are correlated but relevant are often dropped. This is due to a problem known as multicollinearity and individual t -stats will be small but joint significance tests that all coefficients are simultaneously zero will strongly reject. This suggests using joint hypothesis tests to pare the general model down to the specific one. While theoretically attractive, the scope the of possible joint hypothesis tests is vast even in a small model, and so using joint test is impractical.

GtS suffers from two additional issues.

First, it will include an irrelevant variable with positive probability (asymptotically) but will never exclude a relevant variable.
Second, test statistics do not have standard distributions when they are used sequentially (as is the case with any sequential model building procedure).

The only viable solution to the second problem is to fit a single model, make variable inclusions and exclusion choices, and live with the result. This practice is not typically followed and most econometricians use an iterative procedure despite the problems of sequential testing.

3.13.1.2 Specific to General

Specific to General (StG) model building begins by estimating the smallest model, usually including only a constant. Variables are then added sequentially based on maximum $t$ -stat until there is no excluded variable with a significant $t$ -stat at some predetermined $\alpha$ (again, usually $5$ or $10\%$ ).

StG suffers from the same issues as GtS. First it will asymptotically include all relevant variables and some irrelevant ones and second tests done sequentially do not have correct asymptotic size. Choosing between StG and GtS is mainly user preference, although they rarely select the same model. One argument in favor of using a GtS approach is that the variance is consistently estimated in the first step of the general specification while the variance estimated in the first step of the an StG selection is too large. The leads StG processes to have $t$ -stats that are smaller than GtS $t$ -stats and so StG generally selects a smaller model than GtS.

3.13.1.3 Information Criteria

A third method of model selection uses Information Criteria (IC). Information Criteria reward the model for producing smaller SSE while punishing it for the inclusion of additional regressors. The two most frequently used are the Akaike Information Criterion (AIC) and Schwartz Information Criterion(SIC)or Bayesian Information Criterion(BIC).23 Most Information Criteriaareoftheform
−2l +P
where l is the log-likelihood value at the parameter estimates and P is a penalty term. In the case of least squares, where the log-likelihood is not known (or needed), IC’s take the form
lnσˆ 2+P
where the penalty term is divided by n