# Heteroscedasticity

#### MarkFL

Staff member
Hello all,

A friend of mine on another forum, knowing I am involved in the math help community, approached me regarding a question in statistics. Here's what he said:

I'm involved in a debate.

I have a set of data x and y.

I did a simple linear regression using Excel and it shows Heteroscedasticity.

As it turns out, calculating Heteroscedasticity is beyond me since I'm not a Statistician.

What I need to know is: with the Heteroscedasticity and a low R^2 value, can the correlation between the two variables be considered strong, medium, weak or invalid?

The two variables are for social science.

I just learned that the thresholds for social science (like economics) and natural science (like engineering) are completely different. As it turns out, in social science an R^2 value of 0.25-0.3 (max value is 1) is acceptable, but in engineering that value is unacceptable.

To avoid ideological bias, I'm not going to tell what the two variables represent.

But the more statisticians assessing the data the better since I'll be relying on what others said here.

I told him I would pass along his data, along with his question, to a site where I know several folks knowledgeable in statistics participate. Here's a link to the data:

raw data.xlsx (hosted on File Dropper)

Thanks to anyone who takes the time to visit the above link, download the data, and consider the question above.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Hi friend of MarkFL, welcome to MHB if you make it here!

I'm involved in a debate.

I have a set of data x and y.

I did a simple linear regression using Excel and it shows Heteroscedasticity.
How did you get Heteroscedasticity from Excel?
I'm not aware of Excel having a test for Heteroscedasticity.
Or are you using a special add-in?

As it turns out, calculating Heteroscedasticity is beyond me since I'm not a Statistician.
Looking at your data visually (an advanced statistical technique that is also called eyeballing), it seems to me there is no Heteroscedasticity.
Instead it looks perfectly Homoscedastic (constant variance across the range of x) as required for a linear regression.
To be fair, I do not have the tools readily available to execute a test for Heteroscedasticity.
What kind of significance value do you have for it?
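
For anyone who wants to go beyond eyeballing, here is a minimal sketch of one common test for heteroscedasticity, the Breusch-Pagan test in Koenker's studentized form. It is not what Excel does and it is not run on the data from this thread: the two data sets below are made up for illustration, and the helper names (`ols_fit`, `r_squared`, `breusch_pagan`) are hypothetical. The idea is to regress the squared residuals on x and check whether x explains them.

```python
import math
import random

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """R^2 of the simple linear regression of y on x."""
    a, b = ols_fit(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test with a single predictor.

    Returns (LM statistic, p-value). Under constant variance, LM is
    approximately chi-square distributed with 1 degree of freedom.
    """
    a, b = ols_fit(x, y)
    squared_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # LM = n * R^2 of the auxiliary regression of squared residuals on x.
    lm = max(0.0, len(x) * r_squared(x, squared_resid))
    # Upper tail of chi-square(1): P(Z^2 > lm) = erfc(sqrt(lm / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))

random.seed(0)
x = [random.uniform(1, 10) for _ in range(500)]
# Made-up data: the noise is constant in y_const and grows with x in y_fan.
y_const = [2 + 0.5 * xi + random.gauss(0, 1) for xi in x]
y_fan = [2 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]

lm_const, p_const = breusch_pagan(x, y_const)
lm_fan, p_fan = breusch_pagan(x, y_fan)
print(f"constant variance: LM = {lm_const:.2f}, p = {p_const:.3g}")
print(f"fanning variance:  LM = {lm_fan:.2f}, p = {p_fan:.3g}")
```

A small p-value here is evidence against constant variance. A large one does not prove homoscedasticity; it only means this particular test found nothing.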

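What I need to know is: with the Heteroscedasticity and a low R^2 value, can the correlation between the two variables be considered strong, medium, weak or invalid?
If there is significant Heteroscedasticity, the fitted line and its $R^2$ can still be computed, but the usual standard errors and $p$-values of an ordinary linear regression are no longer reliable, so the significance would have to be re-checked (for example with robust standard errors).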
However, as I said, that does not seem to be the case here.

The two variables are for social science.

I just learned that the thresholds for social science (like economics) and natural science (like engineering) are completely different. As it turns out, in social science an R^2 value of 0.25-0.3 (max value is 1) is acceptable, but in engineering that value is unacceptable.
The $R^2$ value of $0.25$ that we have here corresponds to a correlation coefficient of $|r| = \sqrt{0.25} = 0.5$, which is considered to show medium correlation.

And a test to evaluate if there is a significant correlation says, yes, there is a significant correlation between x and y with a $p$-value of $1.88\cdot 10^{-12}$.
The $p$-value is the probability of finding a correlation at least this strong in a sample if x and y were in fact uncorrelated, so a tiny value means that chance alone is a very unlikely explanation.
Both the social sciences and engineering generally ask for a $p$-value less than $0.05$ to be considered significant.
It's just that in the social sciences we must carefully apply these statistical tests, since it's often pretty hard to achieve that level of significance, and it's much easier to make a subjective statement that is worthless without support from the numbers.
In engineering it's usually just obvious that there is a correlation and there's no real need to dive into careful significance tests.
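
To make the two numbers concrete, here is a small sketch that computes the correlation coefficient $r$, the corresponding $R^2$, and a $p$-value. The data are synthetic (constructed so that the true correlation is 0.5), not the data from this thread, and the $p$-value is estimated with a permutation test rather than the test used above: we shuffle y many times and count how often the shuffled data show a correlation at least as strong as the observed one. The function names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=1000, seed=1):
    """Two-sided permutation p-value for the null hypothesis of no correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Synthetic data: y = x + noise, with noise variance 3, so the true r is 0.5.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [xi + math.sqrt(3) * rng.gauss(0, 1) for xi in x]

r = pearson_r(x, y)
p = permutation_p_value(x, y)
print(f"r = {r:.3f}, R^2 = {r * r:.3f}, permutation p-value = {p:.4f}")
```

With 1000 shuffles the smallest $p$-value this sketch can report is about $0.001$, so it cannot reproduce a value as small as $1.88\cdot 10^{-12}$; it only illustrates what the $p$-value measures.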

#### loot

##### New member
Hi I like Serena,

I've decided to just register instead of making Mark go back and forth with my questions. I've received several replies from other statisticians at another place that just confuse me even more.

The crux of the debate is whether x causes y, that is to say whether an increase of x will decrease y. So it's not just a question of correlation but of causation.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Hi loot,

I'm afraid that's a common pitfall in statistics.
Statistics does not say anything about cause-effect. It only says if there's a correlation.
And a correlation can mean many things, such as:
• x causes y.
• y causes x.
• some unknown z causes both x and y, but x and y do not cause each other.
• both x and y cause some unknown z, which we accidentally conditioned on.
• x and y mutually cause each other.
• and so on with other possible causal relationships.
This is why we try to choose an x that is, as we call it, an independent variable: something that cannot possibly be caused by y.
For instance the result of a math test in high school, which cannot possibly be caused by the result of a statistics test in college. The causality the other way around is quite plausible though.
Or the result of a specific training for a test. The test result cannot cause the training, but the training might affect the test result.
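
The third possibility in the list above (some z causes both x and y) is easy to simulate. The sketch below uses made-up data in which x and y never influence each other, yet they end up clearly correlated because both depend on z. Once z is accounted for with a partial correlation, the link between x and y disappears. All variable and function names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(7)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]    # the common cause
x = [zi + rng.gauss(0, 1) for zi in z]     # x depends on z only
y = [zi + rng.gauss(0, 1) for zi in z]     # y depends on z only

r_xy = pearson_r(x, y)
r_given_z = partial_r(r_xy, pearson_r(x, z), pearson_r(y, z))
print(f"correlation of x and y:         {r_xy:.3f}")
print(f"correlation of x and y given z: {r_given_z:.3f}")
```

By construction the raw correlation is close to 0.5 (an $R^2$ of about 0.25), although neither variable causes the other. This only works when z is known and measured; an unmeasured common cause cannot be removed this way.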

#### loot

##### New member
Hi,

According to a certain ideology, x is dogmatically believed to be independent of y, and y is believed to be a side effect of x.

I'm looking at the following possibilities:

1. x causes y
2. a known z causes both x and y, but x and y do not cause each other

I strongly believe in 2. but for now the debate can be temporarily settled as 1. if the correlation is very strong. The dogma itself asserts that x and y are strongly correlated, where x is the cause of y.

I'm kinda confused by your explanation;

An R^2 value of 0.25 is considered to show medium correlation.

The p-value is $1.88\cdot 10^{-12}$, so there is a significant correlation between x and y.

My questions are:

a. What does significant mean? Medium correlation (same as indicated by R^2) or strong correlation (much stronger than indicated by R^2)?

b. For the p-value: does less than 0.05 mean we're likely to be right that there is a strong correlation, and above 0.05 that we're likely to be wrong and there is little to no correlation?

Thanks in advance for taking your time to explain these to me.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Hi,

According to a certain ideology, x is dogmatically believed to be independent of y, and y is believed to be a side effect of x.

I'm looking at the following possibilities:

1. x causes y
2. a known z causes both x and y, but x and y do not cause each other

I strongly believe in 2. but for now the debate can be temporarily settled as 1. if the correlation is very strong. The dogma itself asserts that x and y are strongly correlated, where x is the cause of y.
Just saying, we need to be a bit careful with dogmas.

After all, a typical causal chain is:
2. We ignore all correlations that contradict the dogma.
3. We selectively pick the one correlation that is aligned with the cause-effect implied in the dogma.
4. We conclude that the causal relationship in the dogma is true.

I hope it is clear from my previous posts that this is wrong: we cannot conclude a causal relationship based on a correlation.
Not to mention that evidence to the contrary should not be ignored.

I'm kinda confused by your explanation;

An R^2 value of 0.25 is considered to show medium correlation.

The p-value is $1.88\cdot 10^{-12}$, so there is a significant correlation between x and y.

My questions are:

a. What does significant mean? Medium correlation (same as indicated by R^2) or strong correlation (much stronger than indicated by R^2)?
Significant means that a result this extreme would be very unlikely if there were no real effect: the probability of it arising by chance alone is below a chosen threshold, typically 0.05.
We have to take into account what the statement is though.

In your case there is definitely a correlation, which is what the p-value shows beyond doubt.

However, the medium $R^2$ value indicates that we cannot accurately predict, for instance, y from x: only about 25% of the variance in y is explained by x.
There is some 'noise'. Possibly other factors that we did not take into account that influence how x and y correlate to each other. Or perhaps we simply cannot measure x and y very accurately.
They are still definitely correlated though.

b. For the p-value: does less than 0.05 mean we're likely to be right that there is a strong correlation, and above 0.05 that we're likely to be wrong and there is little to no correlation?
There's a distinction between whether there is a correlation and how strong it is. The p-value addresses the first question, and $R^2$ addresses the second.
As I said, there is a correlation, but it is of medium strength.
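
A short sketch of why these are separate questions: the same medium correlation of $r = 0.5$ can be insignificant or overwhelmingly significant depending only on the number of data points. The $p$-value below uses Fisher's z-transformation, which is a large-sample approximation and an assumption of this sketch, not necessarily the test behind the $p$-value quoted earlier. The sample sizes are arbitrary.

```python
import math

def fisher_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis of no correlation.

    Uses Fisher's z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal when the true correlation is zero.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

r = 0.5  # same strength of correlation (R^2 = 0.25) in every row
for n in (10, 30, 100, 200):
    print(f"n = {n:4d}: r = {r}, R^2 = {r * r}, p ~ {fisher_p_value(r, n):.3g}")
```

The strength stays at $R^2 = 0.25$ throughout, while the $p$-value shrinks rapidly as the sample grows. A very small $p$-value therefore says the correlation is very unlikely to be a fluke, not that it is strong.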

#### loot

##### New member
Thanks a lot for the great explanation!

x is actually Economic Freedom Index https://www.heritage.org/index/ranking

y is actually Press Freedom Index https://rsf.org/en/ranking#

The [Right Wing] Libertarian ideology implies that an expansion of economic freedom also causes an expansion of free speech. Instinctively I believe this implication to be false and that there is no actual causal relationship between the two.

While manually inputting the data I intuitively realized that both x and y actually have a causal relationship with z (type of government), but this intuition needs to be proved empirically first.

The conclusion that I derived from all this is that economic freedom uber alles, which is the foundation of American Libertarianism, does not guarantee freedom of speech, and thus the ideology is susceptible to becoming a tyranny, a result that completely contradicts the ideology's dogma of liberty.

#### Ackbach

##### Indicium Physicus
Staff member
Just as an addition to the conversation: to determine causality, as has already been mentioned, there must be much more than correlation. In science in general, Mill's Methods are what are generally used to determine causation. What is common to all five of Mill's Methods is this: variable manipulation in carefully-designed controlled experiments. If you do not have that, or if you cannot do that, you do not have causation. You cannot get causation from observational studies, nor can you get causation from computer models.

But I will say this: if you want causation, correlations are absolutely the best place to start. You can think of them as clues.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Here's a nice and surprising article about a chase for causation after establishing a correlation.
Moreover, the author is using only observational results.
Developers Who Use Spaces Make More Money Than Those Who Use Tabs

The author makes a careful analysis of the various possible confounding factors... and finds none.
Still, he is very careful in pointing out that correlation is not causation up to and including in his conclusion, as he should.
Either way, the evidence that he presents is compelling.

For the record, I work as a professional developer... and I intend to stick with spaces just to be sure.

#### loot

##### New member
Everyone, thanks very much!
I learned a lot from these exchanges (including the methods to determine causation).