Coefficient of Determination (Line Best Fit Error)

In summary, the coefficient of determination is the value that represents the error in taking the straight line to be a match for the data. If the straight line happens to be exactly what the data should have looked like, and all the discrepancies came from the measurements, then it would represent the error in the data.
  • #1
sherrellbc
https://www.khanacademy.org/math/probability/regression/regression-correlation/v/r-squared-or-coefficient-of-determination

Homework Statement


So, when determining how effectively a best-fit line describes the variance of a given set of measured data, the Coefficient of Determination is the value that represents this information. Essentially, we look at the total error associated with our measured data, and find out the percentage of error that is present that our line doesn't describe. We then subtract this value from 1, which gives the percentage of the variance that our line does describe.

Homework Equations


That is, r² = 1 - (Total error of the measured values from the line)/(Total error)
** Where r² is the Coefficient of Determination. Just notation.
The actual formula requires a bit more background information, which would make this post very, very long.
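
For reference, the shorthand above is usually written out as follows. This is a sketch of the standard textbook form, not something stated in the post; the symbols ##\hat{y}_i## (the line's value at ##x_i##) and the names ##SS_{\text{res}}##, ##SS_{\text{tot}}## are notation assumed here.

```latex
r^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
    = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad \hat{y}_i = m x_i + b
```

The numerator is the squared distance of each measured value from the line; the denominator is the squared distance of each measured value from the average ##\bar{y}##.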

The Attempt at a Solution



It struck me as nonsense that we can determine the total error associated with our measurements (y-values) given only the difference between them and a seemingly arbitrary value such as the average of the Y values.

This would make sense if the y value was a constant, say 6. You could measure the total error by taking the difference of each measured y and the value 6. The average, at least to me, really does not represent anything. So, how can a measured value of y over the average of all measured y's represent an error of anything? If the measured y's were for the same x value, then a variation in y could be measured as an error. But if the y has a relationship with x such that it increases as x increases, how does y/y_bar represent error in any sense?

-----------------------------------------
For example:

You are given an unknown resistance. You decide to experimentally determine the resistance of the component by measuring its i-V (current, voltage) curve (response).

Given that X is voltage, and Y is current, you may measure something like this:

_In an ideal case:_
X = 10V, Y = 1Amp
X = 20V, Y = 2Amp
X = 30V, Y = 3Amp
If you plot this curve, there is quite obviously a linear relationship. And, if you are familiar with Ohm's relationship (law, if you like), the resistance is 10 Ohms.

-- The point is, as Voltage increases, current increases as well for any constant resistance R. So, we have a positively sloping linear relationship.

So, from the ideal case above:
y_bar = 2 Amps.
So, given what we have in the video, the total error associated with our measured values (current, Y) is given by:
(y1-y_bar)^2 + (y2-y_bar)^2 + (y3-y_bar)^2 = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2

Given an ideal world, where the resistance was EXACTLY equal to 10 Ohms, and we measured precisely the expected values of current, how can we say that the total error associated with our measured values of current is equal to 2?
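
The numbers in this example can be checked with a short sketch. It assumes the usual least-squares formulas for the slope and intercept, which the post itself does not spell out; the variable names are illustrative only.

```python
# Ideal resistor data from the example: voltage (X) in volts, current (Y) in amps.
xs = [10.0, 20.0, 30.0]
ys = [1.0, 2.0, 3.0]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# The quantity the video calls "total error": squared distance of each y from y_bar.
ss_tot = sum((y - y_bar) ** 2 for y in ys)

# Least-squares slope and intercept for the line y = m*x + b (assumed formulas).
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - m * x_bar

# Squared distance of each y from the fitted line.
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - ss_res / ss_tot
print("y_bar =", y_bar, "ss_tot =", ss_tot)
print("m =", m, "b =", b, "ss_res =", ss_res, "r^2 =", r_squared)
```

The sum about the average comes out as 2, matching the hand calculation, while the sum about the fitted line is zero up to floating-point rounding, so r² is 1 for this ideal data.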
 
  • #2
sherrellbc said:
"So, when determining how effectively a best-fit line describes the variance of a given set of measured data, ..."
It doesn't describe the variance of the data. It describes a correlation.
"... the Coefficient of Determination is the value that represents this information. Essentially, we look at the total error associated with our measured data, and find out the percentage of error that is present that our line doesn't describe."
Not at all. There need not be any error in the data. The coefficient states the error in taking the straight line to be a match for the data. If the straight line happens to be exactly what the data should have looked like, and all the discrepancies came from the measurements, then it would represent the error in the data.
 
  • #3
haruspex said:
"It doesn't describe the variance of the data. It describes a correlation.

Not at all. There need not be any error in the data. The coefficient states the error in taking the straight line to be a match for the data. If the straight line happens to be exactly what the data should have looked like, and all the discrepancies came from the measurements, then it would represent the error in the data."

I didn't mean variance in terms of actual "variance" (sigma squared), but rather how much it varied. If you watch the video, the whole thing is confusing in how it's worded.

And by "measure of error not described by the line," I mean how much error there is between our data points and the line's values once the line is in place.
-If all data points were exactly on the line, the error would be zero.

My whole confusion was how the "entire" error can be summed up by the difference between each y and the average of all y values.
If you watch the video, Sal writes and explains this at about 6:20. Normally these videos are quite informative, but this particular video was just confusing and did not make sense.
-I understand the logic, but not how the total error is the difference between each y and y_bar (the average y value).

Would you mind watching the video? Or at least from 6:20 forward a little bit to see what I am missing?
 
  • #4
OK, I think I see your problem. The ##\Sigma (y_i-\bar{y})^2## is not a measure of error in the data. It probably shouldn't be called an 'error' at all in this context, just the variation in Y (its usual name is the total sum of squares), but the formula arises so often in relation to statistical measures of error that it is often loosely referred to as an 'error'. You could think of it as the minimum squared error that would apply if you were to try to represent the Y values as a constant (instead of depending on X); that minimum is reached when the constant is ##\bar{y}##.
The video then discusses how much that error is reduced by using a sloped line instead of a horizontal one. The greater the proportion of the error that is eliminated, the more confident you can be that the mx+b fit is appropriate.
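
A rough sketch of this idea, using made-up noisy readings that are not from the thread: the sum of squares about the average is the best any horizontal line can do, and the sloped least-squares line removes most of it. The helper names `sum_sq_about` and `fit_line` are hypothetical.

```python
def sum_sq_about(ys, c):
    """Sum of squared differences between each y and a constant guess c."""
    return sum((y - c) ** 2 for y in ys)


def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


# Hypothetical noisy current readings (amps) at 10, 20, 30, 40 volts.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.1, 1.9, 3.2, 3.8]

y_bar = sum(ys) / len(ys)

# Error of the best horizontal line (the constant y_bar).
ss_tot = sum_sq_about(ys, y_bar)

# Any other horizontal line does worse than the one at y_bar.
other_constants = [sum_sq_about(ys, c) for c in (y_bar - 0.5, y_bar + 0.5)]

# Error left over after switching to the sloped line.
m, b = fit_line(xs, ys)
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - ss_res / ss_tot
print("horizontal line error:", ss_tot)
print("other horizontal lines:", other_constants)
print("sloped line error:", ss_res)
print("fraction of error eliminated (r^2):", r_squared)
```

Here the sloped line eliminates roughly 98% of the error that the horizontal line leaves, which is what r² reports.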
 

Related to Coefficient of Determination (Line Best Fit Error)

What is the Coefficient of Determination?

The Coefficient of Determination, also known as R-squared, is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s).

How is the Coefficient of Determination calculated?

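For a least-squares straight-line fit with a single independent variable, the Coefficient of Determination equals the square of the correlation coefficient (r) between the dependent and independent variables. More generally, it is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. For such a fit the value ranges from 0 to 1, where 0 indicates no linear relationship between the variables and 1 indicates a perfect linear relationship.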
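
As a minimal sketch of how the two routes to the value agree for a least-squares line with one independent variable, the following computes the squared correlation coefficient and 1 minus the ratio of sums of squares on the same data. The data are made up for illustration and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared_of_least_squares_line(xs, ys):
    """1 - SS_res / SS_tot for the least-squares line y = m*x + b."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    b = y_bar - m * x_bar
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data, not taken from the thread.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.1, 1.9, 3.2, 3.8]

r = pearson_r(xs, ys)
r2_fit = r_squared_of_least_squares_line(xs, ys)
print("r squared:", r ** 2)
print("1 - SS_res/SS_tot:", r2_fit)
```

The two printed values agree up to floating-point rounding.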

What does a high Coefficient of Determination indicate?

A high Coefficient of Determination, closer to 1, indicates that a larger proportion of the variance in the dependent variable is explained by the independent variable(s). This suggests that the line of best fit is a good representation of the relationship between the variables, although a high value alone does not prove that a straight line is the right model.

Can the Coefficient of Determination be negative?

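For a least-squares line with an intercept, no: in that case the Coefficient of Determination equals the squared correlation coefficient, so it cannot be negative. It can be close to 0, indicating a weak linear relationship between the variables. When it is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares for a line that was not fitted by least squares (or was fitted without an intercept, or is evaluated on new data), the result can be negative, meaning the line fits worse than a horizontal line at the mean.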
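
One caveat can be sketched in code: if the value is computed as 1 minus the ratio of residual to total sum of squares for a line that is not the least-squares fit, it can drop below zero. The example below reuses the ideal resistor data from the thread; the "bad" slope of 0.2 is a deliberately wrong, hypothetical choice, and `r_squared` is an illustrative helper.

```python
def r_squared(ys, predictions):
    """1 - SS_res / SS_tot for arbitrary predictions of ys."""
    y_bar = sum(ys) / len(ys)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return 1 - ss_res / ss_tot


# Ideal resistor data: voltage in volts, current in amps.
xs = [10.0, 20.0, 30.0]
ys = [1.0, 2.0, 3.0]

# Predictions from the correct line (slope 0.1 A/V, through the origin).
good_predictions = [0.1 * x for x in xs]

# Predictions from a deliberately wrong line (hypothetical slope 0.2 A/V).
bad_predictions = [0.2 * x for x in xs]

r2_good = r_squared(ys, good_predictions)
r2_bad = r_squared(ys, bad_predictions)
print("correct line:", r2_good)
print("wrong line:", r2_bad)
```

The wrong line misses the data by more than a horizontal line at the average would, so its value is negative.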

What are the limitations of the Coefficient of Determination?

The Coefficient of Determination only measures the strength of the linear relationship between two variables and does not account for non-linear relationships. It is also sensitive to outliers, and with only a few data points it tends to overstate how well the line fits.
