Friday, May 15, 2015

Simple Linear Regression (2012)

Definition


Linear Regression is a data analysis tool that allows us to model a linear relationship between two variables. It is most frequently used to make predictions based on observed data (Johnson & Christensen, 2012; Kohout, 1974; Hatch & Lazaraton, 1991).

History


The first use of linear regression is attributed to Sir Francis Galton, cousin of Charles Darwin, in 1886. Much of this work focused on relationships between the sizes of vegetables and on correlations between people’s heights (Klugh, 1986).

Assumptions



Linear Regression assumes certain conditions in order to model your data. These assumptions can also help you decide whether Linear Regression is a suitable tool for your research. The better your data satisfy the assumptions, the more robust your model will be. Some of the most fundamental and frequently used assumptions are the following:

I. Ratio/Interval data in your variables
    • Data measured on a numeric scale with equal intervals; ratio data additionally hold a true zero, so you can establish proportions with them (Krieg, 2012).
    • E.g. test scores, age, time, number of words, etc.
II. Data are distributed normally
    • Most of the data are clustered around a central value, the distribution is roughly symmetric, and fewer data points lie at the extremes.
    • Some tests you can run to verify normality are the Shapiro-Wilk test, the Anderson-Darling test, entropy-based methods, etc.

III. Variables are correlated
    • One variable depends on the other.
    • If one variable increases, the other decreases or increases correspondingly.
    • Pearson’s r is close to 1 or -1.
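The correlation assumption can be checked numerically. What follows is a minimal sketch in plain Python of how Pearson's r might be computed by hand; the function name pearson_r and the small data set (hours of study and test scores) are hypothetical and serve only as an illustration.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours of study (x) and test score (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 71]

r = pearson_r(hours, scores)
print(f"Pearson's r = {r:.3f}")
```

A value of r near 1 or -1 suggests a strong linear relationship; whether a given r is useful still depends on the sample size, which is why critical values are consulted.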


Why do we use Linear Regression?

    • Because it is the simplest way to model your data, and therefore a good first approach.
    • Because most relationships between variables in Social Sciences are roughly linear.
    • Because linear regression helps to approach research questions that involve the relationship between two variables.

How do we use Linear Regression?

I. Decide on your independent variable (X)
    • This choice depends mainly on your background knowledge, hypothesis, and control.
II. Verify that the data in each variable are normally distributed
    • Use the tests suggested above.
III. Plot the data 
    • Place your independent variable on the X-axis and the dependent variable on the Y-axis of the scatter plot.
    • Each point on the scatterplot will have a (Xi, Yi) coordinate.
    • A first glance at the data will help you decide if it is convenient to continue using linear regression or if you need another model (Diamond & Jefferies, 2001).
IV. Calculate correlation coefficient
    • Calculate Pearson’s r coefficient.
    • Use critical values as a guide for determining the usefulness of the obtained r.
V. Trace the regression line or "best fitting line". 
    • Use least squares method (recommended).
    • This method provides the parameters a and b with their uncertainties, a ± ∆a and b ± ∆b, to build the linear equation ŷ = (a ± ∆a)x + (b ± ∆b), where a is the rate of change of ŷ per unit of x, and b is the value of ŷ when x = 0.
    • On the scatterplot, a is the slope of the line, and b is the point where the line crosses the y-axis.

VI. Determine the error of the regression line.
    • Calculate the vertical difference between each actual data point and the regression line; these distances are called residuals.
    • Verify that the residuals are normally distributed: plot a histogram of the residuals and check their normality.
    • Calculate the mean of the residuals and check that this value is close to zero.
    • Calculate the standard deviation of the residuals.


    • Use the regression line equation and its error to make predictions within the range of the X data set.
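Steps V and VI can be sketched in a few lines of plain Python. This is an illustrative sketch, not a definitive implementation: it assumes ordinary least squares, it reads ∆a and ∆b as the standard errors of the slope and intercept (an assumption, since other definitions of the uncertainty are possible), and the names fit_line and predict as well as the data are hypothetical.

```python
import math
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b.

    Returns (a, b, da, db, residuals). Here da and db are the standard
    errors of slope and intercept, an assumed reading of the uncertainties.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need equal-length samples with at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                # slope
    b = mean_y - a * mean_x      # intercept
    # Residuals: vertical distance between each point and the line.
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    # Residual variance with n - 2 degrees of freedom.
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    da = math.sqrt(s2 / sxx)
    db = math.sqrt(s2 * (1 / n + mean_x ** 2 / sxx))
    return a, b, da, db, residuals

# Hypothetical data: hours of study (x) and test score (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 71]

a, b, da, db, residuals = fit_line(hours, scores)
print(f"y_hat = ({a:.2f} +/- {da:.2f}) x + ({b:.2f} +/- {db:.2f})")

# Step VI: the mean of the residuals should be close to zero.
print(f"mean of residuals = {statistics.mean(residuals):.2e}")
print(f"std. dev. of residuals = {statistics.stdev(residuals):.2f}")

def predict(x):
    """Predict y only inside the observed range of x (no extrapolation)."""
    if not min(hours) <= x <= max(hours):
        raise ValueError("x is outside the range of the observed data")
    return a * x + b

print(f"predicted score for 3.5 hours = {predict(3.5):.1f}")
```

The predict helper refuses values outside the observed X range on purpose, since the fitted line says nothing reliable about regions where no data were collected.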

Note: It is recommended to calculate the confidence intervals for your parameters a and b using the “bootstrap method” (Efron & Tibshirani, 1986).
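One simple variant of the bootstrap is the percentile method applied to resampled (x, y) pairs. The sketch below assumes that variant, which is only one of several described in the bootstrap literature; the function names, the number of resamples, the fixed seed, and the data are all hypothetical choices made for illustration.

```python
import random

def fit_slope_intercept(pairs):
    """Least squares slope a and intercept b for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def bootstrap_ci(pairs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence intervals for slope and intercept."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    slopes, intercepts = [], []
    while len(slopes) < n_resamples:
        # Resample the observed pairs with replacement.
        sample = rng.choices(pairs, k=len(pairs))
        # Skip degenerate resamples in which every x is identical.
        if len({x for x, _ in sample}) < 2:
            continue
        a, b = fit_slope_intercept(sample)
        slopes.append(a)
        intercepts.append(b)
    slopes.sort()
    intercepts.sort()
    # Indices of the lower and upper percentiles of the sorted estimates.
    lo = int((1 - level) / 2 * n_resamples)
    hi = int((1 + level) / 2 * n_resamples) - 1
    return (slopes[lo], slopes[hi]), (intercepts[lo], intercepts[hi])

# Hypothetical data: hours of study (x) and test score (y).
pairs = list(zip([1, 2, 3, 4, 5, 6], [52, 55, 61, 64, 70, 71]))

slope_ci, intercept_ci = bootstrap_ci(pairs)
print(f"95% CI for a: {slope_ci[0]:.2f} to {slope_ci[1]:.2f}")
print(f"95% CI for b: {intercept_ci[0]:.2f} to {intercept_ci[1]:.2f}")
```

With a sample as small as this one the intervals are rough; the approach becomes more informative as the number of observations grows.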

Advantages and Limitations


Pros
    • "A linear relationship is the most elementary form and hence a reasonable first approximation" (Knoke, Bohrnstedt, & Potter Mee, 2002).
    • If data are not visually linear, you can often still “rectify” (linearize) them with a transformation and then use linear regression.
    • Most statistics packages run Linear Regression automatically.
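As an illustration of the "rectify" point in the list above, here is a minimal sketch that assumes rectifying means transforming a variable so that the relationship becomes linear. The data are hypothetical and follow y = 2·exp(0.5x) exactly; taking the logarithm of y turns the curve into a straight line that least squares can fit. The helper name least_squares is also hypothetical.

```python
import math

def least_squares(xs, ys):
    """Return slope a and intercept b of the least squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical, noise-free data following y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4, 5]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Rectify: ln(y) = ln(2) + 0.5 * x is linear in x.
log_ys = [math.log(y) for y in ys]
a, b = least_squares(xs, log_ys)

# Transform the fitted parameters back to the original scale.
growth_rate = a
scale = math.exp(b)
print(f"growth rate = {growth_rate:.3f}, scale = {scale:.3f}")
```

Keep in mind that after such a transformation the residuals, and therefore the normality checks, refer to the transformed variable rather than the original one.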
Cons
    • Its results are only as robust as its assumptions.
    • Account for the "regression to the mean" effect (Klugh, 1986; Marascuilo & Serlin, 1988); to avoid being misled by this phenomenon, make sure your research design is suited to linear regression analysis.
    • Linear Regression does not determine causality between variables.

Sample Study


Source: Stricker, L. (2004). The performance of native speakers of English and ESL speakers on the computer-based TOEFL and GRE General Test. Language Testing, 21(2), 146-173.

Summary 

The study attempts to verify the construct validity of the computer-based TOEFL. Two studies were conducted. The first examined whether native speakers and ESL speakers perform similarly. The second investigated the relationship between TOEFL scores and GRE scores for both native and ESL speakers.

Regarding the second study, previous research had found that the verbal components of admission tests for adults are highly correlated with ESL test-takers’ TOEFL scores.

Methods

Participants were adults taking both tests for the first time, all of them pursuing graduate education. Native speakers were recruited from all over the U.S., while participants in the ESL group came from different parts of the globe.

Scores were pooled from ETS databases. Sampling followed certain restrictions, such as having taken both tests no more than 15 days apart.

Results

Study 1: Native speakers performed slightly better than ESL speakers, also with respect to the maximum possible score. The native speaker group also showed less variance than the ESL group.

Study 2: Linear regression analysis showed that computer-based TOEFL scores were moderately or highly correlated with scores on the GRE, and that this relationship was linear for the analytical and quantitative components. However, the regression analysis revealed that the relationship between the verbal section of the GRE and the TOEFL was nonlinear. The shape of this relationship showed that even ESL speakers with low proficiency can obtain high scores on the GRE verbal portion, suggesting that the verbal portion of the GRE General Test might not be making sufficient use of verbal components.


Bibliography and Recommended Readings


Diamond, I., & Jefferies, J. (2001). Beginning Statistics: An Introduction for Social Scientists. London: Sage.

Efron, B., & Tibshirani, R. (1986). Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1(1), 54-75.

Hatch, E., & Lazaraton, A. (1991). The Research Manual: Design and Statistics for Applied Linguistics. New York: Newbury House Publishers.

Johnson, B., & Christensen, L. (2012). Educational Research: Quantitative, Qualitative, and Mixed Approaches (4th ed.). California: Sage.

Klugh, H. E. (1986). Statistics: The essentials for research. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Knoke, D., Bohrnstedt, G. W., & Potter Mee, A. (2002). Statistics for Social Data Analysis (4th ed.). Itasca, IL: F. E. Peacock Publishers.

Kohout, F. J. (1974). Statistics for Social Scientists: A Coordinated Learning System. Malabar, FL: Robert E. Krieger.

Krieg, E. J. (2012). Statistics and Data Analysis for Social Sciences. Boston: Pearson Education.

Marascuilo, L. A., & Serlin, R. C. (1988). Statistical Methods for the Social and Behavioral Sciences. New York: W. H. Freeman and Company.

Stricker, L. (2004). The performance of native speakers of English and ESL speakers on the computer-based TOEFL and GRE General Test. Language Testing, 21(2), 146-173.

...and the help of my great friend Néstor Espinoza.