Multiple least squares regression

  • #1
zzmanzz

Homework Statement



Design a regression model that will use the dataset:

y trial x1 x2 x3

0.08536, 1, -1, -1, -1.00000
0.09026, 2, -1, -1, -1.00000
0.10188, 1, -1, -1, -0.33333
0.09301, 2, -1, -1, -0.33333
0.10362, 1, -1, -1, 0.33333
0.09920, 2, -1, -1, 0.33333
0.11033, 1, -1, -1, 1.00000
0.10744, 2, -1, -1, 1.00000
0.10172, 1, -1, 0, -1.00000
0.09360, 2, -1, 0, -1.00000
0.10800, 1, -1, 0, -0.33333
0.11685, 2, -1, 0, -0.33333
0.11002, 1, -1, 0, 0.33333
0.11221, 2, -1, 0, 0.33333
0.11533, 1, -1, 0, 1.00000
0.12328, 2, -1, 0, 1.00000
0.21908, 1, -1, 1, -1.00000
0.19675, 2, -1, 1, -1.00000
0.22744, 1, -1, 1, -0.33333
0.21138, 2, -1, 1, -0.33333
0.28118, 1, -1, 1, 0.33333
0.26413, 2, -1, 1, 0.33333
0.32416, 1, -1, 1, 1.00000
0.30590, 2, -1, 1, 1.00000
0.32390, 1, 1, -1, -1.00000
0.34938, 2, 1, -1, -1.00000
0.13669, 1, 1, -1, -0.33333
0.12953, 2, 1, -1, -0.33333
0.07987, 1, 1, -1, 0.33333
0.07884, 2, 1, -1, 0.33333
0.05959, 1, 1, -1, 1.00000
0.06172, 2, 1, -1, 1.00000
0.21624, 1, 1, 0, -1.00000
0.21925, 2, 1, 0, -1.00000
0.11777, 1, 1, 0, -0.33333
0.11127, 2, 1, 0, -0.33333
0.07338, 1, 1, 0, 0.33333
0.07354, 2, 1, 0, 0.33333
0.05601, 1, 1, 0, 1.00000
0.05622, 2, 1, 0, 1.00000
0.69966, 1, 1, 1, -1.00000
1.58131, 2, 1, 1, -1.00000
0.18522, 1, 1, 1, -0.33333
0.17043, 2, 1, 1, -0.33333
0.09530, 1, 1, 1, 0.33333
0.10060, 2, 1, 1, 0.33333
0.06655, 1, 1, 1, 1.00000
0.06814, 2, 1, 1, 1.00000



Homework Equations



I loaded the dataset and calculated

c = (X'*X)^(-1)*X'*y

where

X = [ones X1 X2 X3]

is a 48×4 data matrix, y is a 48×1 column vector, and I solve for the column vector c = [c_0 c_1 c_2 c_3]'.
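In code, that computation is roughly the following (a minimal Octave/MATLAB sketch, assuming the dataset has been loaded into a 48×5 matrix D with columns [y trial x1 x2 x3]; the names D, yhat and resid are only illustrative):

% build the design matrix and solve the least-squares problem
y = D(:,1);
X = [ones(48,1) D(:,3) D(:,4) D(:,5)];   % intercept, x1, x2, x3
c = (X'*X) \ (X'*y);                     % same result as inv(X'*X)*X'*y, but better conditioned
yhat  = X*c;                             % fitted values
resid = y - yhat;                        % residuals; compare their size to the spread of y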

The Attempt at a Solution



I got the regression coefficients but the predictions are terrible for my model. Am I doing something wrong?
 
  • #2
Your method looks correct to me, and it is not surprising that the predictions are not very good: even if the method is applied correctly, the model can still be terrible at making predictions. In this case there are 3 'dimensions', and the input variables seem to take on only a few different possible values. Maybe you can try plotting the data, looking at one dimension at a time, to see intuitively whether it looks linear or not.

Edit: when I say 3 'dimensions', I mean the 3 input variables; for example, (temperature, size, colour) might be the three 'dimensions', i.e. the 3 input variables which correspond to a particular value of y. I mention this because I am not sure how widely the word 'dimensions' is used in this context.
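A quick way to do that plotting (an Octave/MATLAB sketch, reusing the illustrative D and y names from the sketch in post #1) might be:

% scatter y against each input variable in turn
figure;
for k = 1:3
    subplot(1, 3, k);
    plot(D(:, k+2), y, 'o');          % columns 3-5 of D are x1, x2, x3
    xlabel(sprintf('x%d', k));
    ylabel('y');
end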
 
  • #3
zzmanzz said:
I got the regression coefficients but the predictions are terrible for my model. Am I doing something wrong?

When you have such limited ranges of variables (values like -1, 0, 1, etc.) it starts to look like an experimental design problem for a *quadratic* fit. I suggest you re-run the model with added columns ##x_2^2, x_3^2, x_1 x_2, x_1 x_3, x_2 x_3.## That will give you a total of 1 + 3 + 2 + 3 = 9 terms in your expression for y. If you have the x-values already, you can (depending on the software you use) calculate those extra columns to add to the data set.

Note: the data set has only the two values -1 and +1 for ##x_1##, so ##x_1^2 = 1## for every observation and cannot be distinguished from the constant column; that is why ##x_1^2## is omitted.
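Building those extra columns is a one-liner in most environments. A sketch in Octave/MATLAB, reusing the illustrative D and y from post #1 (the names x1, x2, x3, Xq, cq are mine, not from the thread):

% extended design matrix: intercept, linear, squared and cross-product columns (9 in total)
x1 = D(:,3); x2 = D(:,4); x3 = D(:,5);
Xq = [ones(48,1) x1 x2 x3 x2.^2 x3.^2 x1.*x2 x1.*x3 x2.*x3];
cq = (Xq'*Xq) \ (Xq'*y);                 % quadratic/interaction fit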
 
Last edited:
  • #4
mm, it depends on what kind of behaviour we would believe the underlying system actually has. Using a quadratic fit might give no better results than a linear fit (maybe even worse). Trying a quadratic fit is a good way to extend the homework though.

(@zzmanzz) Also, when you said the predictions are 'terrible', they shouldn't be like ridiculously far off. How bad are the predictions compared to the range of the y data?
 
  • #5
BruceW said:
mm, it depends on what kind of behaviour we would believe the underlying system actually has. Using a quadratic fit might give no better results than a linear fit (maybe even worse). Trying a quadratic fit is a good way to extend the homework though.

(@zzmanzz) Also, when you said the predictions are 'terrible', they shouldn't be like ridiculously far off. How bad are the predictions compared to the range of the y data?

The best quadratic fit cannot be worse than a linear fit; if the best quadratic happens to have zero coefficients for the squared or product terms, it will reduce to linear.

Anyway, no matter what the exact form is, when the x-values are limited one cannot tell the difference between a more general model and a quadratic. The variable ##x_1## just takes the two values +1 and -1, so any ##f(x_1)## is indistinguishable from a linear function. The variable ##x_2## takes only the three values -1, 0 and +1, so any ##f(x_2)## is indistinguishable from a quadratic. The variable ##x_3## takes 4 values, so one can think of going up to a cubic, but any other function ##f(x_3)## would give the same results as that cubic.

Where we have some wriggle room is in the "interaction" terms: we could include terms like ##x_1 x_2^2, x_1 x_3^2, x_2 x_3^2, x_2^2 x_3, x_1 x_2 x_3,## etc. Including them would improve the fit, but whether or not the coefficients would be statistically significant is another matter.
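One standard way to check that significance (not necessarily what was intended here, just the usual OLS t-ratios) is sketched below in Octave/MATLAB, reusing the Xq and cq names from the earlier sketch:

% t-ratios for the fitted coefficients under the usual OLS assumptions
[n, p] = size(Xq);
res = y - Xq*cq;                          % residuals of the extended fit
s2  = (res'*res) / (n - p);               % estimate of the error variance
se  = sqrt(s2 * diag(inv(Xq'*Xq)));       % standard errors of the coefficients
t   = cq ./ se;                           % |t| well above about 2 suggests significance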
 
  • #6
Ray Vickson said:
The best quadratic fit cannot be worse than a linear fit
Yeah, I wasn't clear about what I meant. I mean that the prediction accuracy of the best quadratic fit can be worse than the prediction accuracy of the best linear fit (and prediction accuracy is the most meaningful way to test a model, I think). For example, if the underlying data are ##y = mx + c## with some noise added, then the linear fit will give closer predictions, on average.

Ray Vickson said:
Anyway, no matter what the exact form is, when the x-values are limited one cannot tell the difference between a more general model and a quadratic.
Ah, yeah, that is a good point. As long as we treat the dimensions independently: if the data in a certain dimension only take on n distinct values, then a polynomial fit of degree n (for that dimension) will give the same predictions as a polynomial of any higher order (assuming the data values we have already seen are the only possible values; I don't know where his data is coming from, but it does look that way).

So in this case, using a polynomial fit of higher order never makes the model worse: it will always give at least as good a prediction as a lower-order polynomial (under the conditions in my last paragraph). I guess this is a consequence of the fact that, in this situation, a discrete distribution should really be used. But the problem was to use a regression model, and AFAIK that implies a continuous distribution, so I guess he should not use a discrete one.
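A rough way to compare predictive accuracy in the spirit of this discussion (an Octave/MATLAB sketch, reusing the D, X, Xq and y names from the earlier sketches; splitting on the trial column is just one possible hold-out scheme, not something prescribed in the thread):

% fit on the trial-1 rows, predict the held-out trial-2 rows
tr = (D(:,2) == 1);  te = ~tr;
c_lin  = X(tr,:)  \ y(tr);                % linear model on the training half
c_quad = Xq(tr,:) \ y(tr);                % quadratic/interaction model on the training half
rmse_lin  = sqrt(mean((y(te) - X(te,:)*c_lin ).^2));
rmse_quad = sqrt(mean((y(te) - Xq(te,:)*c_quad).^2));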
 

Related to Multiple least squares regression

1. What is Multiple Least Squares Regression?

Multiple least squares regression is a statistical method used to identify and quantify the relationship between multiple independent variables and a single dependent variable. It is a form of linear regression in which a linear equation in several predictors (geometrically, a plane or hyperplane rather than a single line) is fitted to the data points in order to make predictions and to understand the strength and direction of the relationships between the variables.

2. When should Multiple Least Squares Regression be used?

Multiple least squares regression is commonly used when there are two or more independent variables that may have an impact on the dependent variable. It can be used to analyze data from various fields, such as economics, psychology, and environmental science.

3. What is the difference between Simple and Multiple Least Squares Regression?

Simple least squares regression involves only one independent variable while multiple least squares regression involves two or more independent variables. Simple regression is used when there is a single relationship between two variables, whereas multiple regression is used when there are multiple factors that may influence the dependent variable.

4. How is the best-fit line or curve determined in Multiple Least Squares Regression?

In multiple least squares regression, the best-fit model is determined by minimizing the sum of the squared differences between the observed data points and the predicted values. This is done by calculating the least squares estimates of the regression coefficients, which represent the intercept and the slope with respect to each independent variable.
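In matrix notation, the criterion and its closed-form solution (the same formula used in post #1, with ##x_i## denoting the i-th row of the design matrix ##X##) are

$$\min_{c}\;\sum_{i=1}^{n}\left(y_i - x_i^{\mathsf T}c\right)^2 \qquad\Longrightarrow\qquad \hat{c} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}y.$$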

5. What are the assumptions of Multiple Least Squares Regression?

The assumptions of multiple least squares regression include that the relationship between the independent variables and the dependent variable is linear, the errors are normally distributed, the errors have constant variance, and the errors are independent of each other. It is important to check these assumptions before using the regression model to ensure the validity of the results.
