The correlation coefficient, r, tells us about the strength of the linear relationship between x and y. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient r and the sample size n together.
We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.
The sample data are used to compute r, the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.
The symbol for the population correlation coefficient is ρ, the Greek letter "rho".
ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)
The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to 0" or "significantly different from 0". We decide this based on the sample correlation coefficient r and the sample size n.
If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is "significant".
If the test concludes that the correlation coefficient is not significantly different from 0 (it is close to 0), we say that the correlation coefficient is "not significant".
There are two methods to make the decision. Both methods are equivalent and give the same result.
Method 1: Using the p-value
Method 2: Using a table of critical values
In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05.
Note: Using the p-value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, α = 0.05. (If we wanted to use a significance level other than 5% with the critical value method, we would need different tables of critical values, which are not provided in this textbook.)
The linear regression t-test LinRegTTest on the TI-83+ or TI-84+ calculators calculates the p-value.
On the LinRegTTest input screen, on the line prompt for β or ρ, highlight "≠ 0".
The output screen shows the p-value on the line that reads "p =".
(Most computer statistical software can calculate the p-value.)
You will use technology to calculate the p-value. The following describes the calculations used to compute the test statistic and the p-value:
The p-value is calculated using a t-distribution with n − 2 degrees of freedom.
The formula for the test statistic is $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$. The value of the test statistic, t, is shown in the computer or calculator output along with the p-value. The test statistic t has the same sign as the correlation coefficient r.
The p-value is the combined area in both tails.
An alternative way to calculate the p-value (p) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
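The calculation described above can be sketched in Python using only the standard library. This is a minimal illustration, not what the calculator does internally: the function names are hypothetical, the tail area is approximated by integrating the t density with Simpson's rule, and the sample size n = 11 is an assumption inferred from the critical values ±0.602 (df = 9) used for the third exam example. It mirrors the calculator command 2*tcdf(abs(t),10^99, n-2).

```python
import math

def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def t_pdf(u, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t, df, steps=2000):
    """Combined area in both tails beyond |t| (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3          # Simpson's rule on [0, |t|]
    return max(0.0, 1 - 2 * area_0_to_t)  # remove the central area, keep the tails

# Third exam example: r = 0.6631 from the text; n = 11 is inferred (df = 9).
r, n = 0.6631, 11
t = t_statistic(r, n)
p = two_tailed_p_value(t, n - 2)
print(f"t = {t:.4f}, p-value = {p:.3f}")
```

With these inputs the sketch should reproduce a p-value of about 0.026, the value used in the decision that follows. A statistics library's t-distribution functions would normally replace the hand-written integration.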
The p-value, 0.026, is less than the significance level of α = 0.05.
Decision: Reject the null hypothesis H0.
Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from 0.
Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
The 95% Critical Values of the Sample Correlation Coefficient Table at the end of this chapter (before the Summary) may be used to give you a good idea of whether the computed value of r is significant or not. Compare r to the appropriate critical value in the table. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.
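One way to see why the critical value method agrees with the p-value method is to solve the test statistic formula for r. The short derivation below is a sketch; the symbol r_crit is notation introduced here, and 2.306 is the standard two-tailed 5% critical value of the t-distribution with 8 degrees of freedom, which is not listed in this section.

```latex
\begin{align*}
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \\
t^{2}\,(1 - r^{2}) &= r^{2}\,(n-2) \\
t^{2} &= r^{2}\,\bigl(t^{2} + n - 2\bigr) \\
|r| &= \frac{|t|}{\sqrt{t^{2} + df}}, \qquad df = n - 2 \\
r_{\text{crit}} &= \frac{2.306}{\sqrt{2.306^{2} + 8}} \approx 0.632 \qquad (df = 8)
\end{align*}
```

So a sample correlation coefficient lies outside the critical values for r exactly when its t statistic lies outside the critical values for t.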
Suppose you computed r = 0.801 using n = 10 data points. df = n − 2 = 10 − 2 = 8. The critical values associated with df = 8 are −0.632 and +0.632. If r < negative critical value or r > positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is significant and the line may be used for prediction. If you view this example on a number line, it will help you.
Figure 1. r is not significant between −0.632 and +0.632. r = 0.801 > +0.632. Therefore, r is significant.
Suppose you computed r = −0.624 with 14 data points. df = 14 − 2 = 12. The critical values are −0.532 and 0.532. Since −0.624 < −0.532, r is significant and the line may be used for prediction.
Figure 2. r = −0.624 < −0.532. Therefore, r is significant.
Suppose you computed r = 0.776 and n = 6. df = 6 − 2 = 4. The critical values are −0.811 and 0.811. Since −0.811 < 0.776 < 0.811, r is not significant and the line should not be used for prediction.
Figure 3. −0.811 < r = 0.776 < 0.811. Therefore, r is not significant.
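The comparison used in these three examples can be written as a short Python sketch. The function and dictionary names are hypothetical, and the dictionary holds only the critical values quoted in this section rather than the full table from the end of the chapter. The inputs are r = 0.801 with n = 10, r = −0.624 with n = 14, and r = 0.776 with n = 6.

```python
# 5% critical values of r quoted in this section, keyed by df = n - 2.
# This is only an excerpt; the full table is at the end of the chapter.
CRITICAL_VALUES_95 = {4: 0.811, 8: 0.632, 9: 0.602, 12: 0.532}

def is_significant(r, n, table=CRITICAL_VALUES_95):
    """Return True when r falls outside the interval (-critical, +critical)."""
    df = n - 2
    critical = table[df]  # raises KeyError if df is not in the excerpt
    return abs(r) > critical

# The three examples: (r, n)
examples = [(0.801, 10), (-0.624, 14), (0.776, 6)]
for r, n in examples:
    verdict = "significant" if is_significant(r, n) else "not significant"
    print(f"r = {r}, n = {n}, df = {n - 2}: {verdict}")
```

Checking abs(r) against the positive critical value is equivalent to testing whether r is below the negative critical value or above the positive one, because the two critical values are symmetric about 0.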
Use the "95% Critical Value" table for r with df = n − 2 = 11 − 2 = 9.
The critical values are −0.602 and +0.602.
Since 0.6631 > 0.602, r is significant.
Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from 0.
Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if r is significant and whether the line of best fit associated with each r can be used to predict a y value. If it helps, draw a number line.
Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence for us to conclude that there is a linear relationship between x and y in the population.
The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatter plot and testing the significance of the correlation coefficient help us determine if it is appropriate to do this.
Figure 4. The y values for each x value are normally distributed about the line with the same standard deviation. For each x value, the mean of the y values lies on the regression line. More y values lie near the line than are scattered further away from the line.