Coefficient of Determination (Line Best Fit Error)

sherrellbc · Jul 1, 2013

https://www.khanacademy.org/math/probability/regression/regression-correlation/v/r-squared-or-coefficient-of-determination

Homework Statement

So, when determining how effectively a best-fit line describes the variance of a given set of measured data, the Coefficient of Determination is the value that represents this information. Essentially, we look at the total error associated with our measured data, and find out the percentage of error that is present that our line doesn't describe. We then subtract this value from 1, which gives the percentage of variance our line does describe.

Homework Equations

That is, r² = 1 - (Total squared error of the measured values from the line)/(Total error)
** Where r² is the Coefficient of Determination. Just notation.
The actual formula requires a bit more background information, which would make this post very, very long.
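
For reference, the usual textbook form of this ratio can be written compactly. The symbols ##\hat{y}_i## (the line's value at ##x_i##), ##SS_{res}## and ##SS_{tot}## are labels introduced here for convenience, not notation taken from the video:

$$r^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad \hat{y}_i = m x_i + b$$

Here ##\bar{y}## is the average of the measured y values, and ##m## and ##b## are the slope and intercept of the best-fit line.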

The Attempt at a Solution

It struck me as nonsense that we can determine the total error associated with our measurements (y-values) given only the difference between them and a seemingly arbitrary value such as the average of the Y values.

This would make sense if the y value was a constant, say 6. You could measure the total error by taking the difference of each measured y and the value 6. The average, at least to me, really does not represent anything. So, how can the difference between a measured value of y and the average of all measured y's represent an error of anything? If the measured y's were for the same x value, then a variation in y could be measured as an error. But if y has a relationship with x such that it increases as x increases, how does y - y_bar represent error in any sense?

-----------------------------------------
For example:

You are given an unknown resistance. You decide to experimentally determine the resistance of the component by measuring its I-V (current-voltage) curve.

Given that X is voltage, and Y is current, you may measure something like this:

_In an ideal case:_
X = 10 V, Y = 1 A
X = 20 V, Y = 2 A
X = 30 V, Y = 3 A
If you plot this curve, there is quite obviously a linear relationship. And, if you are familiar with Ohm's relationship (LAW, if you like), we have the resistance = 10 Ohms.

-- The point is, as Voltage increases, current increases as well for any constant resistance R. So, we have a positively sloping linear relationship.

So, from the ideal case above:
y_bar = 2 Amps.
So, given what we have in this video, the total error associated with our measured values (current, Y) is given by:
(y1-y_bar)^2 + (y2-y_bar)^2 + (y3-y_bar)^2 = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2

Given an ideal world, where the resistance was EXACTLY equal to 10 Ohms and we measured precisely the expected values of current, how can we say that our measured values of current had a total error equal to 2?

haruspex · Jul 2, 2013

sherrellbc said:

So, when determining how effectively a best-fit line describes the variance of a given set of measured data,

It doesn't describe the variance of the data. It describes a correlation.

the Coefficient of Determination is the value that represents this information. Essentially, we look at the total error associated with our measured data, and find out the percentage of error that is present that our line doesn't describe.

Not at all. There need not be any error in the data. The coefficient states the error in taking the straight line to be a match for the data. If the straight line happens to be exactly what the data should have looked like, and all the discrepancies came from the measurements, then it would represent the error in the data.

sherrellbc · Jul 2, 2013

haruspex said:

It doesn't describe the variance of the data. It describes a correlation.

Not at all. There need not be any error in the data. The coefficient states the error in taking the straight line to be a match for the data. If the straight line happens to be exactly what the data should have looked like, and all the discrepancies came from the measurements, then it would represent the error in the data.

I didn't mean variance in terms of actual "variance" (sigma squared), but rather how much it varied. If you watch the video, the whole thing is confusing with how it's worded.

And by "Measure of error not described by the line," I mean how much error there is between our data points and the line's values once the line is in place.
-If all data points were exactly on the line, the error would be zero.

My whole confusion was how the "entire" error can be summed up by the difference between each y and the average of all y values.
If you watch the video, Sal writes and explains this at about 6:20. Normally these videos are quite informative, but this particular video was just confusing and did not make sense.
-I understand the logic, but not how the total error is the difference between each y and y_bar (the average y value).

Would you mind watching the video? Or at least from 6:20 forward a little bit to see what I am missing?

haruspex · Jul 3, 2013

OK, I think I see your problem. The ##\Sigma (y_i-\bar{y})^2## is not a measure of error in the data. It probably shouldn't be called an 'error' at all in this context, just the variation in Y, but the formula arises so often in relation to statistical measures of error that it is often referred to as 'standard error'. You could think of it as the minimum error that would apply if you were to try to represent the Y values as constant (instead of depending on X).
The video then discusses how much that error is reduced by using a sloped line instead of a horizontal one. The greater the proportion of the error that is eliminated, the more confident you can be that the mx+b fit is appropriate.
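
As a rough check of this reading, here is a minimal Python sketch using the ideal data from the first post. The function name `r_squared` and the variable names `ss_tot` and `ss_res` are illustrative choices, not taken from the video or the thread, and the fit uses the ordinary least-squares formulas for slope and intercept.

```python
# Ideal data from the first post: X is voltage (V), Y is current (A).
xs = [10.0, 20.0, 30.0]
ys = [1.0, 2.0, 3.0]


def r_squared(xs, ys):
    """Return (ss_tot, ss_res, r2) for a least-squares line y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Ordinary least-squares slope and intercept.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar

    # Squared deviation of Y from a horizontal line at y_bar.
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    # Squared deviation of Y from the sloped line m*x + b.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    return ss_tot, ss_res, 1.0 - ss_res / ss_tot


ss_tot, ss_res, r2 = r_squared(xs, ys)
print("sum about the mean:", ss_tot)
print("sum about the line:", ss_res)
print("r squared:", r2)
```

For these three points the sum about the mean should come out as 2 and the sum about the fitted line as essentially zero, so r² is essentially 1. The 2 is only the spread of Y about its average, and the sloped line accounts for all of it.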
