Best straight line fit to a set of data

In summary, minimizing the sum of the squared deviations is a well-defined math problem: it picks out the straight line that, on average, has the smallest vertical deviations from the data points.
  • #1
ahmed markhoos
Hello,

I didn't know where to put my question, but I think here is the best section for it.

http://im60.gulfup.com/apkrpJ.png

The problem isn't that I can't solve it -- I actually did -- but I don't understand the concept! I don't remember anything from my high school statistics, and I haven't taken college statistics yet.

To be more specific: what does the square of the deviation mean, and how does taking the sum of the squares give me the result I want?
 
  • #2
Some of the data points will fall above the straight line, and some will fall below it. The difference ##y_n - y## for the points above will be positive, and the same difference for the points that fall below will be negative. If you just add all these differences up, they will cancel each other on average and you will get zero. That's not very useful! So instead, you square the vertical distance between ##y_n## and ##y##, to get a positive number whether the data point falls above or below the line. Adding these numbers up will only give zero if all the data points are exactly on the line. Minimizing the sum of the squared deviations from the line will give you the line that, on average, has the smallest vertical deviations from the data points. It may help to draw a picture and try to visualize this argument.
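Here is a minimal numerical sketch of that argument (the data points and the candidate line are made up for illustration): the raw deviations sum to roughly zero even though the fit is imperfect, while the squared deviations do not.

Python:
import numpy as np

# Hypothetical data scattered around the candidate line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

m, b = 2.0, 1.0                 # candidate line
deviations = y - (m * x + b)    # y_n - y for each data point

print(deviations.sum())         # ~0: positives cancel negatives
print((deviations ** 2).sum())  # > 0 unless every point is on the line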
 
  • #3
ahmed markhoos said:
Hello,

I didn't know where to put my question, but I think here is the best section for it.

http://im60.gulfup.com/apkrpJ.png

The problem isn't that I can't solve it -- I actually did -- but I don't understand the concept! I don't remember anything from my high school statistics, and I haven't taken college statistics yet.

To be more specific: what does the square of the deviation mean, and how does taking the sum of the squares give me the result I want?
As stated, the problem has nothing to do with statistics; it is just a well-defined math problem. Whether the sum of squared deviations has something to do with probability and/or statistics is a separate issue; in some cases it does, and in other cases it does not.

Anyway, you said that you "actually did solve it" but did not understand what you were doing. First show us your work, so we can tell where you might need some assistance.

Why the sum of squares? Here are some reasons:

(1) We (usually) want a "goodness-of-fit" measure that somehow has in it all the errors ##e_i = y_i - (m x_i + b)## for ##i = 1,2, \ldots, n##.

(2) We do not just want to add up all the errors (algebraically), because the positive ones may cancel out the negative ones, leaving us with a highly erroneous error measure of 0 (or something very small), even when the fit is not very good at all. So, for that reason, we should use a function of the magnitudes ##|e_i|##, rather than the ##e_i## themselves.

(3) Taking the sum of squares (which does involve ##|e_i|^2 = e_i^2##) is convenient, because it allows us to use calculus methods to arrive at a simple solution involving more-or-less straightforward arithmetical calculations (a sketch of the derivation follows this list). Furthermore, the method has been around for more than 200 years, and so is familiar. Finally, IF certain types of statistical assumptions are made about the nature of the ##(x,y)## data points, THEN numerous interesting statistical facts and measurements can be derived from the solution. However, just to be clear: even if we are not doing statistics, the least-squares fit can still be useful.

(4) Other, sometimes more "robust" intercept-slope estimates can be obtained using alternative measures of error, such as ##S_1 = \sum_{i=1}^n |e_i|## (total absolute error) or ##S_3 = \max (|e_1|, |e_2|, \ldots, |e_n|)## (largest single error), and finding the lines that minimize those measures instead (a small numerical sketch follows below). Such problems are doable nowadays using relatively recently developed tools (Linear Programming, for example). They would not have been known to Gauss or Legendre, and probably would not have been solvable by them, either. I believe that the resulting statistical issues in these cases are much less well understood (and harder to deal with) than in the least-squares case. Nevertheless, these types of fits are nowadays pretty widely used and are often preferred to least-squares fits; and sometimes the resulting statistical issues (if any) are handled using Monte Carlo simulation methods, for example.
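To sketch the calculus mentioned in point (3): writing ##S_2 = \sum_{i=1}^n e_i^2## for the sum of squares (the label ##S_2## is chosen here to match ##S_1## and ##S_3## above), setting its partial derivatives to zero gives
$$\frac{\partial S_2}{\partial m} = -2\sum_{i=1}^n x_i\,(y_i - m x_i - b) = 0, \qquad \frac{\partial S_2}{\partial b} = -2\sum_{i=1}^n (y_i - m x_i - b) = 0,$$
and solving this pair of linear equations yields the familiar formulas
$$m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \bar{y} - m\,\bar{x}.$$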
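And here is a small numerical sketch of the alternative measures in point (4), using made-up data with one outlier. A general-purpose minimizer is used for simplicity; it is not guaranteed to find the exact optimum of these non-smooth objectives (the linear-programming formulation mentioned above would be the rigorous route), but it illustrates how the choice of error measure changes the fitted line.

Python:
import numpy as np
from scipy.optimize import minimize

# Hypothetical data; the last point is an outlier
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 15.0])

def errors(params):
    m, b = params
    return y - (m * x + b)

S1 = lambda p: np.abs(errors(p)).sum()   # total absolute error
S3 = lambda p: np.abs(errors(p)).max()   # largest single error

fit_S1 = minimize(S1, x0=[2.0, 1.0], method='Nelder-Mead')
fit_S3 = minimize(S3, x0=[2.0, 1.0], method='Nelder-Mead')

print(fit_S1.x)  # (slope, intercept) minimizing S1: barely moved by the outlier
print(fit_S3.x)  # minimax fit: pulled strongly toward the outlier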
 
  • #4
As a technical note, the differences ##y_i - \hat{y}_i## between the observed values and the values on the fitted line are usually called the residuals; we want to minimize the sum of squares of the residuals. There are additional tests you may want to run in order to get an idea of how well the line fits the data: a hypothesis test with ##H_0: a = 0## versus ##H_1: a \neq 0##, where ##a## is the slope. If ##a = 0## is accepted (not rejected), the test tells you there is little, if any, linear dependence between ##y## and ##x##. You may also want to test the correlations among the predictors, especially if you have a multiple linear regression ##y = b + a_1 x_1 + a_2 x_2 + \cdots##: test the correlations between the pairs ##x_j, x_k## and drop predictors where the correlation is high. Finally, you should also consider the coefficients ##r## and ##r^2##, which measure the strength of the linear dependence, i.e., the extent to which your independent variable explains the variation in your dependent variable.
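For a concrete illustration of these quantities (with made-up data), SciPy's linregress reports the fitted slope and intercept together with the p-value for exactly this ##H_0: a = 0## test and the coefficient ##r##:

Python:
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.8, 3.1, 5.2, 6.9, 9.1, 10.8])

result = stats.linregress(x, y)
print(result.slope, result.intercept)  # fitted slope a and intercept b
print(result.pvalue)                   # p-value for H0: a = 0
print(result.rvalue ** 2)              # r^2: fraction of variance in y explained by x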
 

Related to Best straight line fit to a set of data

1. What is the purpose of finding the best straight line fit to a set of data?

The purpose of finding the best straight line fit is to determine the relationship between two variables and to predict future values based on the given data.

2. How is the best straight line fit determined?

The best straight line fit is determined by a method called linear regression (specifically, least squares), which calculates the equation of the line that minimizes the sum of the squared vertical distances between the line and the data points.
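As a minimal sketch (with made-up numbers), NumPy's polyfit computes exactly this least-squares line:

Python:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

m, b = np.polyfit(x, y, deg=1)  # degree-1 least-squares fit
print(m, b)                     # slope and y-intercept
print(m * 5.0 + b)              # predicted y at x = 5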

3. What is the significance of the slope and y-intercept in a best straight line fit?

The slope represents the rate of change of the dependent variable with respect to the independent variable, while the y-intercept is the value of the dependent variable when the independent variable is zero. Both play a crucial role in describing the relationship and making predictions.

4. Can the best straight line fit be used for any type of data?

No, the best straight line fit is only suitable for data that shows a linear relationship between the two variables. If the data shows a curved relationship, a different method, such as polynomial regression, should be used.
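For example (again with made-up, roughly quadratic data), the same routine sketched above fits a curved relationship simply by raising the degree:

Python:
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.1, 5.3, 10.2, 16.9])

a, b, c = np.polyfit(x, y, deg=2)  # least-squares fit of y = a*x^2 + b*x + c
print(a, b, c)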

5. How accurate is the best straight line fit in predicting future values?

The accuracy of the best straight line fit depends on the strength of the relationship between the two variables and the consistency of the data. It is always important to consider the margin of error and other factors that may affect the accuracy of the predictions.
