Analyzing Curvature with Confidence Bands?

  • Thread starter: FallenApple
  • Tags: Curvature
In summary: when dealing with non-linear relationships between variables, it is important to have a theoretical reason to expect a certain type of response; otherwise, fitting a complicated model without a strong theoretical motivation can lead to overfitting. To address non-linearity, one approach is to fit a non-linear function to the data, such as cubic splines, and then include this function in the regression model. This can improve the model and provide meaningful confidence intervals.
  • #1
FallenApple
So let's say that there are several different treatments for being underweight. One way to deal with this is to make repeated measurements of each patient's weight across time. So I can fit a model to the panel data, regressing weight on treatment as an indicator variable, and then plot the different fitted mean trajectories along with their pointwise confidence bands. Basically a plot of response vs time for all groups, but accounting for the uncertainty of the paths as well. Also, say that the scatterplot suggests that the curves might not be linear, so I have to do some sort of polynomial fit on the right-hand side of the equation.

Moreover, say that I notice that the bands are convex for some treatment groups but not for others. Can I interpret this as some of the treatments having accelerating returns and others having diminishing returns?

I'm thinking that even if a confidence band for a particular trajectory is convex, I can imagine the true path being concave as a possible path within the band, if the band is particularly wide. Even if a convex-shaped band were not wide, the true path might still be concave over short intervals of time.
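For concreteness, here is a rough sketch of the kind of fit and plot I have in mind, with made-up long-format data (columns weight, time, treatment) rather than the real study data, and ignoring the within-patient correlation that a mixed model or GEE would properly handle:

Code:
# rough sketch with made-up long-format data; ignores within-patient correlation
set.seed(1)
dat <- expand.grid(time = 1:12, patient = 1:30)
dat$treatment <- factor(rep(c("base", "trt1", "trt2"), each = 120))
dat$weight <- with(dat, 45 + 0.7 * time +
                        2 * (treatment == "trt1") * sqrt(time) +
                        rnorm(nrow(dat), sd = 1.5))

# quadratic in time, allowed to differ by treatment group
fit <- lm(weight ~ treatment * poly(time, 2), data = dat)

# pointwise 95% confidence bands for each group's mean trajectory
newd <- expand.grid(time = seq(1, 12, by = 0.25), treatment = levels(dat$treatment))
ci   <- predict(fit, newdata = newd, interval = "confidence")

plot(dat$time, dat$weight, col = "grey", xlab = "time", ylab = "weight")
for (g in seq_along(levels(dat$treatment))) {
  rows <- newd$treatment == levels(dat$treatment)[g]
  lines(newd$time[rows], ci[rows, "fit"], col = g, lwd = 2)   # fitted mean trajectory
  lines(newd$time[rows], ci[rows, "lwr"], col = g, lty = 2)   # pointwise band
  lines(newd$time[rows], ci[rows, "upr"], col = g, lty = 2)
}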
 
  • #2
Do you have any theoretical reason to expect a quadratic response? It is usually a bad idea to fit a complicated model without a theoretical motivation.
 
  • #3
If I understand you correctly you are doing an OLS regression of weight (continuous variable) against time (continuous variable) and ##n-1## logical variables, being flags to indicate which treatment is being used, where there are ##n## treatments and one is chosen as the base. Let these variables be ##W_t^j## for weight at time ##t## for patient ##j##, and ##R_k^j## is true iff patient ##j## is receiving treatment ##k##. Let ##\vec R^j## be the vector with components ##R_1^j,...,R_{n-1}^j##, that specifies which treatment patient ##j## is receiving.

If the relationship of weight to time appears significantly non-linear, any confidence bands for ##E\left[W_t^j\ |\ \vec R^j=\vec r\right]## generated by basic OLS regression will be of little use.

A way to remedy this is to fit a reasonably nice non-linear function ##f_k## to the map
$$t\mapsto \hat W_{t,k}\triangleq E\left[W_t^j\ |\ \vec R^j=\vec r_k\right]$$
for each ##k\in\{0,...,n-1\}##, where ##\vec r_k## is the vector of flags for which only the ##k##th flag is True (this identifies Treatment ##k##), for ##k>0##, and ##\vec r_0=\vec 0## is the vector in which all flags are False (this identifies the Base treatment).

A common way to fit the functions ##f_k## is cubic splines.

We then regress ##W_t^j## against ##\vec R^j## and ##g(t,j)##, where ##g(t,j)=f_{K(j)}(t)## and ##K(j)## is the treatment being applied to patient ##j##, with 0 denoting the Base treatment.

If the splining process has succeeded, this regression will give reasonably linear-looking maps

$$f_k(t) \mapsto \hat W_{t,k}$$

and the confidence intervals should be meaningful.

EDIT: But having posted that (it took a while to write the latex), I agree with Dale's much more succinct response, which has come in while I was writing this.
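A minimal sketch of this two-stage idea on made-up data (so the numbers themselves mean nothing), using smooth.spline for the ##f_k## and then including the resulting ##g(t,j)## as a regressor:

Code:
# made-up long-format data: weight, time, treatment (factor, "base" is the reference)
set.seed(2)
dat <- expand.grid(time = 1:12, patient = 1:30)
dat$treatment <- factor(rep(c("base", "trt1", "trt2"), each = 120))
dat$weight <- with(dat, 50 + 4 * log(time) +
                        1.5 * (treatment == "trt2") * time +
                        rnorm(nrow(dat), sd = 1))

# stage 1: fit a cubic smoothing spline f_k to each group's mean trajectory
f <- lapply(split(dat, dat$treatment), function(d) {
  m <- tapply(d$weight, d$time, mean)              # mean weight at each time point
  smooth.spline(as.numeric(names(m)), m, df = 4)   # f_k
})

# stage 2: g(t, j) = f_{K(j)}(t), then regress weight on treatment and g
dat$g <- mapply(function(tt, tr) predict(f[[tr]], tt)$y,
                dat$time, as.character(dat$treatment))
fit <- lm(weight ~ treatment + g, data = dat)
summary(fit)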
 
  • #4
Dale said:
Do you have any theoretical reason to expect a quadratic response? It is usually a bad idea to fit a complicated model without a theoretical motivation.

Not particularly. So I guess if there is no reason, then we would just use linear? For parsimony out of concern for overfitting? What about the results from a scatterplot beforehand?

Also, what if I changed the question and am interested in the serum hormone level in the bloodstream over time? Then we know that the levels go up during treatment, then go down after stopping treatment, so the spike and decrease can be modeled by a combination of linear splines or a concave quadratic.

Or say I am doing a study on THC levels in cannabis smokers over time (hypothetically; not sure how ethical/common it is to get healthy people to volunteer for something like this). They start smoking and then quit during the middle of the study. In that case, we expect from theory that THC would stay in the system and then, over time, go down. In this case, would we fit a concave quadratic curve? From theory, we wouldn't expect to overfit, since future values presumably would follow the same trend based on scientific theory.
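For the spike-then-decline shape, I imagine a linear spline (broken stick) with a knot at the time treatment stops would look something like this (all numbers made up):

Code:
# broken-stick sketch: the slope changes at the knot where treatment stops
set.seed(7)
t      <- 0:20
t_stop <- 10                                   # treatment stops at t = 10
mu     <- 5 + 0.8 * t - 1.5 * pmax(t - t_stop, 0)
y      <- mu + rnorm(length(t), sd = 0.5)

fit <- lm(y ~ t + pmax(t - t_stop, 0))         # coefficient on pmax() = change in slope after the knot
summary(fit)
plot(t, y)
lines(t, fitted(fit), col = "blue", lwd = 2)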
 
  • #5
andrewkirk said:
If I understand you correctly you are doing an OLS regression of weight (continuous variable) against time (continuous variable) and ##n-1## logical variables, being flags to indicate which treatment is being used, where there are ##n## treatments and one is chosen as the base. Let these variables be ##W_t^j## for weight at time ##t## for patient ##j##, and ##R_k^j## is true iff patient ##j## is receiving treatment ##k##. Let ##\vec R^j## be the vector with components ##R_1^j,...,R_{n-1}^j##, that specifies which treatment patient ##j## is receiving.

If the relationship of weight to time appears significantly non-linear, any confidence bands for ##E\left[W_t^j\ |\ \vec R^j=\vec r\right]## generated by basic OLS regression will be of little use.

A way to remedy this is to fit a reasonably nice non-linear function ##f_k## to the map
$$t\mapsto \hat W_{t,k}\triangleq E\left[W_t^j\ |\ \vec R^j=\vec r_k\right]$$
for each ##k\in\{0,...,n-1\}##, where ##\vec r_k## is the vector of flags for which only the ##k##th flag is True (this identifies Treatment ##k##), for ##k>0##, and ##\vec r_0=\vec 0## is the vector in which all flags are False (this identifies the Base treatment).

A common way to fit the functions ##f_k## is cubic splines.

We then regress ##W_t^j## against ##\vec R^j## and ##g(t,j)##, where ##g(t,j)=f_{K(j)}(t)## and ##K(j)## is the treatment being applied to patient ##j##, with 0 denoting the Base treatment.

If the splining process has succeeded, this regression will give reasonably linear-looking maps

$$f_k(t) \mapsto \hat W_{t,k}$$

and the confidence intervals should be meaningful.

EDIT: But having posted that (it took a while to write the latex), I agree with Dale's much more succinct response, which has come in while I was writing this.

So, from theory, I can determine whether to include a curve or not. Say I fit a curve or splines in the model so that it is not linear. Can I just directly obtain the estimated variance-covariance matrix of the model (vcov(model) in R), and then from that build up the confidence bands with loops across the time points, and do this for each of the treatment groups? That way I get to see the type of curvature that is manifested by the bands. Say there are 12 time points. From the variance-covariance matrix and the fitted estimates, I can get the CI for the response at time 1 for the base group; then I would do this again for all the other times; finally, I plot the CIs vertically at each time point.

Next, I do this for each group.

Is there a reason why the CIs constructed manually would be off?

I think from your method, I could determine which of the spline coefficients are significant and get inference on whether there is any curvature at all. But would it tell me the direction of the curvature, i.e. whether it is accelerating or not? So if I fit a concave cubic and it turns out to be statistically significant for, say, group 2 (by including a group-by-time interaction), does that mean that group 2's estimated response levels are decelerating?

But if everything is linearized, then how would I visualize the trajectories from the model?
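Something like this is what I have in mind for checking group-specific curvature through an interaction, with made-up data; a negative and significant quadratic interaction would then indicate a decelerating trajectory for that group relative to the base:

Code:
# made-up data: one group gets a decelerating (concave) trajectory
set.seed(3)
dat <- expand.grid(time = 1:12, patient = 1:30)
dat$treatment <- factor(rep(c("base", "trt1", "trt2"), each = 120))
dat$weight <- with(dat, 50 + 0.6 * time +
                        (treatment == "trt2") * (-0.04) * time^2 +
                        rnorm(nrow(dat), sd = 1))

fit <- lm(weight ~ treatment * (time + I(time^2)), data = dat)
summary(fit)   # with these made-up numbers, treatmenttrt2:I(time^2) should come out
               # negative and significant, i.e. group trt2 is decelerating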
 
  • #6
FallenApple said:
So I guess if there is no reason, then we would just use linear? For parsimony?
Yes, that would be my recommendation.

FallenApple said:
Also, what if I changed the question and am interested in the serum hormone level in the bloodstream over time? Then we know that the levels go up during treatment, then go down after stopping treatment, so the spike and decrease can be modeled by a combination of linear splines or a concave quadratic.
I am not an expert in pharmacokinetics, but I know that they have some fairly standard models that are used. Using one of those would give a good theoretical justification for a more complicated model.
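Purely as an illustration (I can't vouch for the details), one commonly used form is a one-compartment model with first-order absorption and elimination; a minimal sketch with made-up parameters, fitted by nonlinear least squares:

Code:
# one-compartment, first-order absorption model; all numbers made up
set.seed(42)
t    <- seq(0.5, 24, by = 0.5)
conc <- 10 * (exp(-0.25 * t) - exp(-1.2 * t)) + rnorm(length(t), sd = 0.2)

# fit the same functional form by nonlinear least squares
fit <- nls(conc ~ A * (exp(-ke * t) - exp(-ka * t)),
           start = list(A = 8, ka = 1, ke = 0.2))
summary(fit)
plot(t, conc)
lines(t, predict(fit), col = "blue", lwd = 2)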
 
  • #7
Here is what I mean. This is an example with simulated data, building the confidence bands from scratch from the model estimates and the estimated variance-covariance matrix of the model.

[Attached image: PlotOfConfidenceBands.png]
Code:
# fitting curve
x  <- c(1:24)
mu <- 10 + 5 * sin( x * pi / 24 ) - 2 * cos( (x - 6) * 4 / 24 )

# errors
set.seed(2010)
eee <- rnorm( length(mu) )

# data
y <- mu + eee

x.squared <- x^2
x.cubed   <- x^3

fit <- lm( y ~ x + x.squared + x.cubed )

print( summary( fit ) )

# Look at the fitted model:
fitted.mean <- predict( fit )

# preliminary plot
plot( x, y )
lines( x, mu, col = "red" )
lines( x, fitted.mean, col = "blue", lwd = 2 )
title( "Data, true mean curve (red), and fitted (blue) using cubic" )

# estimated variance-covariance matrix of the model
vcov( fit )

# vector of regression parameter estimates
B <- as.numeric( fit$coefficients )

# function for getting the estimated response variable at any x
estimate <- function( x ) {
  return( B[1] + B[2] * x + B[3] * x^2 + B[4] * x^3 )
}

# building vector of estimates to plot
v <- NULL
for ( i in 1:24 ) {
  v <- c( v, estimate( i ) )
}

n <- 24              # number of observations
p <- 4               # number of regression parameters
qt( .975, n - p )    # t critical value used for the bands

# function for getting the confidence interval for the mean response at any x
CI <- function( x ) {
  k      <- c( 1, x, x^2, x^3 )
  center <- as.numeric( t( k ) %*% B )                                               # point estimate
  half   <- qt( .975, n - p ) * sqrt( as.numeric( t( k ) %*% vcov( fit ) %*% k ) )   # half-width
  return( center + c( -1, 1 ) * half )                                               # lower and upper limits
}
CI( 5 )

# building matrix of CIs to plot
conf <- NULL
for ( i in 1:24 ) {
  conf <- rbind( conf, CI( i ) )
}
conf[1, ]

# plotting it all together
plot( x, y )
lines( x, mu, col = "red" )
lines( x, v, lwd = 3 )
lines( x, conf[, 1], col = "darkred", lwd = 2 )
lines( x, conf[, 2], col = "darkred", lwd = 2 )

title( "Data, fitted using cubic" )
legend( "topleft", c( "fitted mean curve", "confidence band", "true mean curve" ),
        col = c( "black", "darkred", "red" ), lty = 1, cex = .5 )
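As a sanity check on the hand-built band (using the same fit, x and conf objects as above), R's built-in predict() with interval = "confidence" should reproduce the same pointwise limits:

Code:
# cross-check the hand-built band against R's built-in pointwise confidence band
band <- predict( fit, interval = "confidence", level = 0.95 )
head( conf )                        # hand-built lower/upper limits
head( band[, c("lwr", "upr")] )     # built-in limits -- should match
lines( x, band[, "lwr"], col = "blue", lty = 2 )
lines( x, band[, "upr"], col = "blue", lty = 2 )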
 
  • #8
Dale said:
Yes, that would be my recommendation.

I am not an expert in pharmacokinetics, but I know that they have some fairly standard models that are used. Using one of those would give a good theoretical justification for a more complicated model.

Ok, got it. Then from there, would I just tweak the model depending on details such as residuals, as usual?

Also, is the reason people often suggest putting emphasis on theoretical concerns for model building, over, say, building the model from just what you see in the scatterplot and the plot of sample means, that the behavior of unseen data often depends on theory and prior knowledge, and that how well the model extrapolates depends on this?
 
  • #9
FallenApple said:
Also, is the reason people often suggest putting emphasis on theoretical concerns for model building, over, say, building the model from just what you see in the scatterplot and the plot of sample means, that the behavior of unseen data often depends on theory and prior knowledge, and that how well the model extrapolates depends on this?
It is more of a statistical concern. Suppose that you have data with a true linear relationship with some random noise. Now, if you sample four data points and do a linear fit you should get something fairly close to the line but with some residuals. However, if you fit a cubic polynomial then you will get a perfect fit with no residuals. This is called fitting to the noise.

Then, suppose you want to test how well the model fits the data. If you use the same data to test (validate) the model as you used to select the model then you will unavoidably find that it fits the data well, because the noise is not independent between model selection and validation.

So you can do a model selection from the data, but then you need to use a different set of data for the model validation. This is essentially the reason why "big data" requires large data sets.
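A quick way to see this in R, with made-up numbers: four noisy points from a true straight line, comparing a sensible linear fit with a cubic that uses up all four degrees of freedom:

Code:
# four noisy points from a true straight line
set.seed(1)
x <- 1:4
y <- 2 + 3 * x + rnorm(4, sd = 1)

lin <- lm(y ~ x)              # 2 parameters: leaves honest residuals
cub <- lm(y ~ poly(x, 3))     # 4 parameters for 4 points: interpolates the noise exactly
round(resid(lin), 3)
round(resid(cub), 3)          # essentially all zero -- "fitting to the noise"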
 
  • #10
Dale said:
It is more of a statistical concern. Suppose that you have data with a true linear relationship with some random noise. Now, if you sample four data points and do a linear fit you should get something fairly close to the line but with some residuals. However, if you fit a cubic polynomial then you will get a perfect fit with no residuals. This is called fitting to the noise.

Then, suppose you want to test how well the model fits the data. If you use the same data to test (validate) the model as you used to select the model then you will unavoidably find that it fits the data well, because the noise is not independent between model selection and validation.

So you can do a model selection from the data, but then you need to use a different set of data for the model validation. This is essentially the reason why "big data" requires large data sets.

That makes sense. So the noise not being independent means that I'm using the same noise as before.
And because I'm using the pattern in the data to generate the hypothesis, this becomes circular and may not generalize. So for another dataset, model selection may give a different result because the data could be different by chance.

So for big data, the issue of p-hacking is not there? That is, the dataset is so large that model selection is done on a set that might as well be considered the population of interest. Or is it that the validation set would be large enough that analysis could be conducted on it in its own right, avoiding the risk of resampling the same noise, which could happen in a smaller data set?
 
  • #11
FallenApple said:
That is, the dataset is so large that model selection is done on a set that might as well be considered the population of interest.
No, the dataset is so large that you can split it into separate subsets and use one subset for selecting the model and the other subset for validating it.
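As a minimal sketch of what that split might look like, on a hypothetical long-format data frame with weight, time and treatment columns (in a real repeated-measures study you would split by patient rather than by row, but the idea is the same):

Code:
# hypothetical long-format data set standing in for the real one
set.seed(123)
dat <- expand.grid(time = 1:12, patient = 1:60)
dat$treatment <- factor(rep(c("base", "trt1", "trt2"), each = 240))
dat$weight <- with(dat, 50 + 0.6 * time + rnorm(nrow(dat), sd = 1))

# use one subset to select/fit the model, the other only to validate it
idx   <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[idx, ]
valid <- dat[-idx, ]

fit  <- lm(weight ~ treatment * poly(time, 2), data = train)   # chosen on train only
pred <- predict(fit, newdata = valid)
sqrt(mean((valid$weight - pred)^2))      # out-of-sample RMSE on untouched data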
 
  • #12
Dale said:
No, the dataset is so large that you can split it into separate subsets and use one subset for selecting the model and the other subset for validating it.

Ok, got it. If the test set is too small, then we could end up with less power and mathematical issues such as having too many parameters relative to observations.

So proper scientific analysis using statistics requires not only testing hypotheses using a priori knowledge to construct the model, but also seeing if the model is of any use. So a model that can explain should also be able to predict.

I want to see if my understanding of data analysis is correct. Let's say that there's plenty of data, the model fitted on the training data shows no statistical problems (fits well, homogeneous variance checked, etc.), scientific guidelines for model structure are followed, and finally the model performs well on the test data, which ultimately implies that the model explains well and predicts well.
Does that imply that the statistics is done and the only concerns left are scientific? For example, missing a potential confounder that the science has not yet considered?
 

Related to Analyzing Curvature with Confidence Bands?

1. What is the purpose of analyzing curvature with confidence bands?

The purpose of analyzing curvature with confidence bands is to quantify and visualize the uncertainty in the curvature estimation of a function. This allows for a more accurate and reliable analysis of the curvature and can help in identifying important features of the function.

2. How are confidence bands for curvature calculated?

Confidence bands for curvature are typically calculated using statistical methods, such as bootstrapping or Monte Carlo simulations. These methods involve repeatedly sampling from the data and estimating the curvature, resulting in a distribution of curvature values. The confidence bands are then constructed using the upper and lower limits of this distribution.
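As a rough illustration of the bootstrap approach (made-up data, a loess smoother, and pointwise 95% limits taken from the resampled curves):

Code:
# bootstrap pointwise confidence band for a smoothed curve (illustrative only)
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.3)
grid <- seq(0, 10, length.out = 100)

boot_curve <- function() {
  i <- sample(length(x), replace = TRUE)            # resample (x, y) pairs
  d <- data.frame(x = x[i], y = y[i])
  predict(loess(y ~ x, data = d), newdata = data.frame(x = grid))
}
B <- replicate(500, boot_curve())                   # 100 x 500 matrix of resampled fits
band <- apply(B, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)

plot(x, y)
lines(grid, apply(B, 1, median, na.rm = TRUE), lwd = 2)
lines(grid, band[1, ], lty = 2)
lines(grid, band[2, ], lty = 2)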

3. What information can be obtained from the confidence bands for curvature?

The confidence bands for curvature provide information about the level of uncertainty in the curvature estimation. They can also show the range of curvature values that are statistically significant, indicating areas of significant curvature in the function.

4. How can confidence bands for curvature be used in data analysis?

Confidence bands for curvature can be used to compare the curvature of different functions, identify regions of significant curvature, and assess the overall shape of a function. They can also be used for hypothesis testing to determine if the curvature is significantly different from a specified value.

5. Are there any limitations to using confidence bands for curvature?

Yes, there are limitations to using confidence bands for curvature. These include assumptions about the underlying distribution of the data, as well as the potential for bias in the curvature estimation. Additionally, confidence bands may not accurately reflect the true uncertainty in the curvature if the data is highly non-linear or if there are outliers present.
