Should I Ignore Data Driven Models or Use Bayesian Methods for Model Selection?

  • #1
FallenApple
So it was mentioned that one should avoid data-driven models because this can inflate the Type I error rate as a result of overfitting. So the model should be decided even before looking at the data set.

But at the same time, other sources say we should look at the scatterplot matrix to determine whether there is a confounder, because confounders should be adjusted for in the regression model.

Which should I do? Or are both involved in the process of data analysis?
 
  • #2
Be judicious. You should propose a logical model that may reasonably fit the data and use a statistical process that will not over-fit the data. There are stepwise multiple regression procedures that will not throw everything into the model just to get a better fit. If you use those procedures, you should not be too concerned about over-fitting.
 
  • #3
FactChecker said:
Be judicious. You should propose a logical model that may reasonably fit the data and use a statistical process that will not over-fit the data. There are stepwise multiple regression procedures that will not throw everything into the model just to get a better fit. If you use those procedures, you should not be too concerned about over-fitting.

Are you talking about something like the AIC?

Can I do both? That is, set up a regression equation from the scientific question alone and then use the stepwise process to see if there are any possible confounders I may have missed?
 
  • #4
FallenApple said:
Are you talking about something like the AIC?
Yes, I believe so. Stepwise multiple regression can help you avoid over-fitting the data. There are three general approaches: forward, backward, and "both". Forward adds variables one at a time, in order of their statistical benefit to the model, until none of the remaining variables adds enough to the model. Backward starts with a model including all independent variables and removes statistically insignificant variables one at a time. The third approach, "both", starts with forward and then does backward to see if some combination of variables has made another variable statistically insignificant and removable.
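
To make that concrete, here is a minimal base-R sketch of the three directions. The data frame `dat` and variables `y`, `x1`..`x4` are hypothetical, and note that R's `step()` selects on AIC rather than on p-values, but the forward/backward/both mechanics are the same:

```r
# Hypothetical data frame `dat` with response y and candidate predictors x1..x4
null <- lm(y ~ 1, data = dat)                  # intercept-only starting point
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)  # all candidate variables

# Forward: add variables one at a time until no remaining variable helps enough
fwd <- step(null, scope = formula(full), direction = "forward")

# Backward: start from the full model and drop unhelpful variables one at a time
bwd <- step(full, direction = "backward")

# "Both": allow adding and removing a variable at each step
both <- step(null, scope = formula(full), direction = "both")
```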
Can I do both? That is, set up a regression equation from the scientific question alone and then use the stepwise process to see if there are any possible confounders I may have missed?
I don't think so. As I understand confounders, they are variables that are the true cause of both some of the independent variables and the dependent variable. That is a cause-and-effect issue that is separate from the statistical analysis of correlation. Multiple linear regression will not help determine cause and effect, only correlation. It will give you predictors, not necessarily causes.

PS. I am afraid that your questions are more technical than I originally thought and that I cannot help any more. Perhaps a professional statistician will be able to give you more authoritative answers.
 
  • #5
@FallenApple, there are of course methods for learning the model from the data. That is what machine learning is all about. If you want to use them, then you need to really study the methods. In any sort of data-driven model you want to use k-fold cross validation.

Another approach would be to use Bayesian methods. With Bayesian methods there is no multiple-comparisons issue, so you can just use all the models that come to mind. The Bayes factor allows you to compare models, although it has some issues of its own.
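
As a rough illustration (not a full Bayesian analysis), a Bayes factor between two models is often approximated from BIC as BF10 ≈ exp((BIC0 - BIC1)/2). A sketch with hypothetical models `m0` and `m1` fit to a hypothetical data frame `dat`:

```r
# Two hypothetical competing models for the same response
m0 <- lm(y ~ x1, data = dat)
m1 <- lm(y ~ x1 + x2, data = dat)

# Crude BIC approximation to the Bayes factor in favour of m1.
# Values well above 1 favour m1; values well below 1 favour m0.
bf10 <- exp((BIC(m0) - BIC(m1)) / 2)
bf10
```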
 
  • #6
FactChecker said:
Yes, I believe so. Stepwise multiple regression can help you avoid over-fitting the data. There are three general approaches: forward, backward, and "both". Forward adds variables one at a time, in order of their statistical benefit to the model, until none of the remaining variables adds enough to the model. Backward starts with a model including all independent variables and removes statistically insignificant variables one at a time. The third approach, "both", starts with forward and then does backward to see if some combination of variables has made another variable statistically insignificant and removable.
I don't think so. As I understand confounders, they are variables that are the true cause of both some of the independent variables and the dependent variable. That is a cause-and-effect issue that is separate from the statistical analysis of correlation. Multiple linear regression will not help determine cause and effect, only correlation. It will give you predictors, not necessarily causes.

PS. I am afraid that your questions are more technical than I originally thought and that I cannot help any more. Perhaps a professional statistician will be able to give you more authoritative answers.

It should be noted that neither stepwise regression nor fit indices like the AIC really protect against overfitting. Stepwise regression in particular is kind of a joke among statisticians, though it's still (unfortunately) popular in the social/biological sciences. If you're concerned about overfitting, then there's really no substitute for cross validation and regularization (or, if you're a fan of Bayesian models, some kind of shrinkage prior). A sensible approach would be to identify a class of sensible models (based on subject knowledge), and then to select the model with the best predictive performance.
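
For example, a cross-validated regularized fit might look like the following sketch, using the `glmnet` package (the data frame `dat` and its variables are hypothetical):

```r
library(glmnet)  # assumes the glmnet package is installed

# Hypothetical predictors and response
X <- model.matrix(bp ~ weight + age + smoking + exercise, data = dat)[, -1]
y <- dat$bp

# Lasso (alpha = 1) with 10-fold cross validation to choose the penalty;
# alpha = 0 would give ridge regression instead
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

coef(cvfit, s = "lambda.min")  # coefficients at the CV-selected penalty
```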
 
  • #7
FactChecker said:
Yes, I believe so. Stepwise multiple regression can help you avoid over-fitting the data. There are three general approaches: forward, backward, and "both". Forward adds variables one at a time, in order of their statistical benefit to the model, until none of the remaining variables adds enough to the model. Backward starts with a model including all independent variables and removes statistically insignificant variables one at a time. The third approach, "both", starts with forward and then does backward to see if some combination of variables has made another variable statistically insignificant and removable.

I don't think so. As I understand confounders, they are variables that are the true cause of both some of the independent variables and the dependent variable. That is a cause-and-effect issue that is separate from the statistical analysis of correlation. Multiple linear regression will not help determine cause and effect, only correlation. It will give you predictors, not necessarily causes.

PS. I am afraid that your questions are more technical than I originally thought and that I cannot help any more. Perhaps a professional statistician will be able to give you more authoritative answers.

I think I was told before that it's more optimal to use one of forward or backward. I believe the example was some data set with a very large number of variables, where doing backward might not work as well as forward. I forgot what the reason was, though.

Oh, I see. So if I just use the stepwise methods, they alone would not account for any possible confounders?
 
  • #8
Dale said:
@FallenApple, there are of course methods for learning the model from the data. That is what machine learning is all about. If you want to use them, then you need to really study the methods. In any sort of data-driven model you want to use k-fold cross validation.

Another approach would be to use Bayesian methods. With Bayesian methods there is no multiple-comparisons issue, so you can just use all the models that come to mind. The Bayes factor allows you to compare models, although it has some issues of its own.

Would the stepwise procedures account for possible confounders or variables in the causal pathway? For example, say I want to see whether weight is associated with blood pressure. Then the response would be blood pressure. But if diabetes is in the data set and is included in the model, then the effect of weight on blood pressure could be explained by diabetes, since weight leads to diabetes and diabetes leads to high blood pressure. So we would not want to include diabetes in the model.

Would the stepwise procedure or k-fold validation account for such a situation? If I want to see whether weight is associated with blood pressure, I would likely want to adjust for all the possible confounders but at the same time exclude variables in the causal pathway to blood pressure.
 
  • #9
Number Nine said:
It should be noted that neither stepwise regression nor fit indices like the AIC really protect against overfitting. Stepwise regression in particular is kind of a joke among statisticians, though it's still (unfortunately) popular in the social/biological sciences. If you're concerned about overfitting, then there's really no substitute for cross validation and regularization (or, if you're a fan of Bayesian models, some kind of shrinkage prior). A sensible approach would be to identify a class of sensible models (based on subject knowledge), and then to select the model with the best predictive performance.
All statistical methods must be used judiciously. The subject matter expert must protect against overfitting. Stepwise multiple regression allows you to specify a reasonable set of allowable independent variables and to set a lower limit on the statistical significance of variables that end up in the model. If the result is overfitting, then an unreasonably large number of independent variables were allowed and the limit was set too low.
 
  • #10
FallenApple said:
But if diabetes is in the data set and is included in the model, then the effect of weight on blood pressure could be explained by diabetes, since weight leads to diabetes
I think you are (rightly) concerned about multicollinearity. When two explanatory variables are correlated, the overall model can still be fit with confidence, but the relative contribution of the two correlated variables cannot be determined.

Consider an extreme case where the two explanatory variables are temperature in Celsius and temperature in Fahrenheit. The overall amount of variance explained by temperature will be correct, but the partitioning into how much is explained by Celsius and how much is explained by Fahrenheit will just be based on noise. I think that is a fundamental limitation of regression, or at least I don't know of a solution.
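
A quick simulated sketch of that extreme case (hypothetical data): R reports the second temperature variable's coefficient as NA because it is an exact linear function of the first.

```r
set.seed(1)
celsius    <- runif(100, 0, 30)
fahrenheit <- celsius * 9 / 5 + 32      # exact linear function of celsius
response   <- 2 * celsius + rnorm(100)  # hypothetical response

fit <- lm(response ~ celsius + fahrenheit)
summary(fit)  # fahrenheit shows as NA: perfectly collinear with celsius
```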

FallenApple said:
or k-fold validation account for such a situation?
K-fold cross validation addresses a different problem. Basically, it is "cheating" to use the same data to determine the model and also to test the model. So you split the data into subsets, where you generate the model with one subset and test it against a different subset.
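
A minimal base-R sketch of k-fold cross validation, assuming a hypothetical data frame `dat` with response `y` and predictors `x1`, `x2`:

```r
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  # random fold labels

# For each fold: fit on the other k-1 folds, score on the held-out fold
mse <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

mean(mse)  # average out-of-sample error across the k folds
```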
 
  • #11
@FallenApple, there is no statistical procedure that can prove a cause-effect relationship. That is up to the subject matter expert. Statistics is a general numerical process that knows nothing about the physics, chemistry, genetics, etc. that would imply cause and effect. It only knows what tends to go together, not what caused what.
That being said, there is often a timing relationship between two variables that time-series analysis can identify. That might suggest that the earlier event caused the later one, but not necessarily. It would be up to the subject matter expert to determine whether the timing meant anything.

Consider your example of weight => diabetes => high blood pressure. There may be a stronger statistical correlation between weight and blood pressure, or there might be a stronger correlation between diabetes and blood pressure. A statistical process will pick the stronger correlating variable and will not, on its own, account for any physics or logic that makes you prefer weight to diabetes. It would be up to you, as a subject matter expert, to remove diabetes from the list of independent variables and re-run the statistical analysis without it. Of course, you have the problem that there are mixed cases -- some where weight caused the high blood pressure and others where there is no weight problem but diabetes still occurred and caused high blood pressure. So by eliminating diabetes from the model, you can expect to get a weaker predictor of high blood pressure based only on weight.

The bottom line is that you, as the subject matter expert, will have to make those decisions and see what the resulting statistical models are. You cannot avoid that by leaving it up to a statistical procedure.
 
  • #12
FactChecker said:
@FallenApple, there is no statistical procedure that can prove a cause-effect relationship. That is up to the subject matter expert. Statistics is a general numerical process that knows nothing about the physics, chemistry, genetics, etc. that would imply cause and effect. It only knows what tends to go together, not what caused what.
That being said, there is often a timing relationship between two variables that time-series analysis can identify. That might suggest that the earlier event caused the later one, but not necessarily. It would be up to the subject matter expert to determine whether the timing meant anything.

Consider your example of weight => diabetes => high blood pressure. There may be a stronger statistical correlation between weight and blood pressure, or there might be a stronger correlation between diabetes and blood pressure. A statistical process will pick the stronger correlating variable and will not, on its own, account for any physics or logic that makes you prefer weight to diabetes. It would be up to you, as a subject matter expert, to remove diabetes from the list of independent variables and re-run the statistical analysis without it. Of course, you have the problem that there are mixed cases -- some where weight caused the high blood pressure and others where there is no weight problem but diabetes still occurred and caused high blood pressure. So by eliminating diabetes from the model, you can expect to get a weaker predictor of high blood pressure based only on weight.

The bottom line is that you, as the subject matter expert, will have to make those decisions and see what the resulting statistical models are. You cannot avoid that by leaving it up to a statistical procedure.

Would leaving out diabetes matter too much if the only concern is to find out whether weight=>high blood pressure? If I do that and get weight as a weak predictor, then wouldn't that just mean that weight isn't a good predictor, which answers the question?

What about this strategy?

Run a regression for bloodpressure ~ weight. Then run bloodpressure ~ diabetes, then bloodpressure ~ any other variable. So basically a univariate regression on each variable in the data set. Then see which ones are associated with bloodpressure. Those that are could possibly be confounders. So I adjust for them while leaving out the one that is suspected to be in the causal pathway between weight and high blood pressure.

From the univariate bloodpressure ~ weight, if weight is significant, it might be due to a confounder, so adjust for all other confounders to check, using bloodpressure ~ weight + all_possible_confounders. If weight is still significant after adjustment, then we know that weight=>blood pressure and we are done with the analysis (it's possible that some_variable=>weight=>high blood pressure, but that doesn't matter, since we only care about whether weight=>high blood pressure). If weight is not significant after adjustment, then we know that the original relation was due to confounding, so we can conclude weight!=>high blood pressure and we are done with the analysis.

From the univariate bloodpressure ~ weight, if weight is NOT significant, then still do the analysis while adding in the possible confounders. If weight is significant after adjustment, then it means some other variable confounds the relationship, and hence any effect of weight can be explained by confounders. If weight is not significant after adjustment, then it could be due to confounding or to something in the causal pathway. Check for the latter.
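
In R, that strategy might look like the following sketch (`dat` is a hypothetical data frame; `age` and `smoking` stand in for whatever candidate confounders the univariate screen flags):

```r
# Univariate screens: one regression per candidate variable
summary(lm(bloodpressure ~ weight,   data = dat))
summary(lm(bloodpressure ~ diabetes, data = dat))  # suspected causal-pathway variable
summary(lm(bloodpressure ~ age,      data = dat))
summary(lm(bloodpressure ~ smoking,  data = dat))

# Adjusted model: keep the candidate confounders, leave out the pathway variable
summary(lm(bloodpressure ~ weight + age + smoking, data = dat))
```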
 
  • #13
Dale said:
I think you are (rightly) concerned about multicollinearity. When two explanatory variables are correlated, the overall model can still be fit with confidence, but the relative contribution of the two correlated variables cannot be determined.

Consider an extreme case where the two explanatory variables are temperature in Celsius and temperature in Fahrenheit. The overall amount of variance explained by temperature will be correct, but the partitioning into how much is explained by Celsius and how much is explained by Fahrenheit will just be based on noise. I think that is a fundamental limitation of regression, or at least I don't know of a solution.

K-fold cross validation addresses a different problem. Basically, it is "cheating" to use the same data to determine the model and also to test the model. So you split the data into subsets, where you generate the model with one subset and test it against a different subset.

Ah, I see. Then I would just pick either Celsius or Fahrenheit, since one is just a substitute for the other.

So cross validation is like a generative approach to see what explains the response variable. So what would happen if the final model doesn't include the predictor I'm interested in?
 
  • #14
FallenApple said:
Would leaving out diabetes matter too much if the only concern is to find out whether weight=>high blood pressure? If I do that and get weight as a weak predictor, then wouldn't that just mean that weight isn't a good predictor, which answers the question?
That sounds right.
What about this strategy?

Run a regression for bloodpressure ~ weight. Then run bloodpressure ~ diabetes, then bloodpressure ~ any other variable. So basically a univariate regression on each variable in the data set. Then see which ones are associated with bloodpressure. Those that are could possibly be confounders. So I adjust for them while leaving out the one that is suspected to be in the causal pathway between weight and high blood pressure.
That is the right idea. You should use statistics as a tool to help you analyse the relationships. It does the calculations, but you have to use your expertise to decide on the preferred model. There are multiple regression packages where a single run will give you information on all the correlations between individual variables and combinations.

One thing I would caution you about is the danger of assuming cause and effect in one direction when the causal relationship is often the reverse. I see that a lot. For instance, you often hear that regular exercise improves a person's health. That is a treacherous assumption. Sick people often do not feel like exercising, so the causal relationship is often reversed. It takes very careful research to get it right.
From the univariate bloodpressure ~ weight, if weight is significant, it might be due to a confounder, so adjust for all other confounders to check, using bloodpressure ~ weight + all_possible_confounders. If weight is still significant after adjustment, then we know that weight=>blood pressure and we are done with the analysis (it's possible that some_variable=>weight=>high blood pressure, but that doesn't matter, since we only care about whether weight=>high blood pressure). If weight is not significant after adjustment, then we know that the original relation was due to confounding, so we can conclude weight!=>high blood pressure and we are done with the analysis.
If I understand you correctly, that is sort of what forward stepwise regression would do. Suppose variable X1 shows the strongest correlation with variable Y. Then it would go into the model first. When looking at the value of adding another variable, X2, to the model, it would remove the influence of X1 on both X2 and Y and then check how the residual part of X2 correlates with the residual part of Y. It will only add X2 to the model if the correlation of the residuals justifies it. Sometimes a few of the later variables added predict Y so well that they do not leave enough residual correlation between X1 and Y to justify leaving X1 in the model. In that case, the "both" option of stepwise regression would remove X1.
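
That residual-on-residual step can be checked directly (it is the Frisch-Waugh result for least squares); a sketch with a hypothetical data frame `dat` containing `y`, `x1`, and `x2`:

```r
# Part of y and of x2 that x1 does not explain
r_y  <- resid(lm(y  ~ x1, data = dat))
r_x2 <- resid(lm(x2 ~ x1, data = dat))

cor(r_y, r_x2)  # the residual correlation that forward stepwise assesses

# The slope of the residual regression equals x2's coefficient in the full model
coef(lm(r_y ~ r_x2))[2]
coef(lm(y ~ x1 + x2, data = dat))["x2"]
```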
 
  • #15
FactChecker said:
That sounds right. That is the right idea. You should use statistics as a tool to help you analyse the relationships. It does the calculations, but you have to use your expertise to decide on the preferred model. There are multiple regression packages where a single run will give you information on all the correlations between individual variables and combinations.

One thing I would caution you about is the danger of assuming cause and effect in one direction when the causal relationship is often the reverse. I see that a lot. For instance, you often hear that regular exercise improves a person's health. That is a treacherous assumption. Sick people often do not feel like exercising, so the causal relationship is often reversed. It takes very careful research to get it right. If I understand you correctly, that is sort of what forward stepwise regression would do. Suppose variable X1 shows the strongest correlation with variable Y. Then it would go into the model first. When looking at the value of adding another variable, X2, to the model, it would remove the influence of X1 on both X2 and Y and then check how the residual part of X2 correlates with the residual part of Y. It will only add X2 to the model if the correlation of the residuals justifies it. Sometimes a few of the later variables added predict Y so well that they do not leave enough residual correlation between X1 and Y to justify leaving X1 in the model. In that case, the "both" option of stepwise regression would remove X1.

Ah, OK. So if we are not sure whether being sick causes one not to exercise or exercise causes one to be healthier, should I do two analyses?
Analysis 1: health ~ exercise, followed by health ~ exercise + sickness
Analysis 2: health ~ sickness, followed by health ~ exercise + sickness
 
  • #16
FallenApple said:
So cross validation is like a generative approach to see what explains the response variable.
I am not sure what a generative approach is, but that doesn't sound right. K-fold cross validation is just a method to avoid using the same data for model selection as for model validation.
 
  • #17
FallenApple said:
Ah, OK. So if we are not sure whether being sick causes one not to exercise or exercise causes one to be healthier, should I do two analyses?
Analysis 1: health ~ exercise, followed by health ~ exercise + sickness
Analysis 2: health ~ sickness, followed by health ~ exercise + sickness
There are a lot of occurrences of both directions. I don't think that a simple analysis will do. I think you would have to investigate each case and find out in detail what happened. I assume that the original research did that and included the results in published papers, but a lot of news articles gloss over the details.
 
  • #18
Dale said:
@FallenApple, there are of course methods for learning the model from the data. That is what machine learning is all about. If you want to use them, then you need to really study the methods. In any sort of data-driven model you want to use k-fold cross validation.

Another approach would be to use Bayesian methods. With Bayesian methods there is no multiple-comparisons issue, so you can just use all the models that come to mind. The Bayes factor allows you to compare models, although it has some issues of its own.
By k-fold, do you mean doing the validation k times and averaging the results?
 

