# [SOLVED] Symmetry of distribution

#### mathmari

##### Well-known member
MHB Site Helper
Hey!!

We are given a list of $300$ data points, which are the areas of houses in square meters. I have calculated the mean and the median. Next we have to say something about the symmetry of the distribution. Do we have to make a diagram from the given data for that? Is there a program to do that?


#### Klaas van Aarsen

##### MHB Seeker
Staff member
Hey mathmari !!

That sounds as if you want to make a histogram of the given data.
Excel can do that for you, and so can TikZ.
If you want to go on and apply a statistical test for symmetry, you might consider R (free and online) or SPSS.
They can draw a histogram as well.
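
Since the mean and the median are already known, a quick numeric check can complement the histogram. The sketch below computes Pearson's second skewness coefficient, $3(\text{mean} - \text{median})/\text{standard deviation}$, which is near $0$ for a roughly symmetric sample. The function name and the sample values are made up for illustration and are not the data from the exercise.

```python
import statistics

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.stdev(values)  # sample standard deviation
    return 3 * (mean - median) / spread

# Hypothetical areas in square meters, NOT the data from the exercise.
symmetric_sample = [60, 80, 100, 120, 140]
right_skewed_sample = [60, 65, 70, 80, 150]

print(pearson_median_skewness(symmetric_sample))     # mean equals median here
print(pearson_median_skewness(right_skewed_sample))  # mean lies above the median
```

A value close to zero only suggests symmetry; it does not replace looking at the histogram.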

#### mathmari

##### Well-known member
MHB Site Helper
> That sounds as if you want to make a histogram of the given data.
> Excel can do that for you, and so can TikZ.
> If you want to go on and apply a statistical test for symmetry, you might consider R (free and online) or SPSS.
> They can draw a histogram as well.

Could you explain to me how I could use Excel or R for that, since I haven't done that before?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> Could you explain to me how I could use Excel or R for that, since I haven't done that before?

Here is an explanation for Excel.

#### mathmari

##### Well-known member
MHB Site Helper
> Here is an explanation for Excel.

Ok! I also have another question. To check the symmetry, do we make the histogram from the given data, or do we have to sort the data in increasing order first and then make the histogram?

If I have applied that correctly, the histogram of the ordered data is this one.

And the histogram of the given data is this one.

By which of these two do we check the symmetry?


#### Klaas van Aarsen

##### MHB Seeker
Staff member
> Ok! I also have another question. To check the symmetry, do we make the histogram from the given data, or do we have to sort the data in increasing order first and then make the histogram?
>
> If I have applied that correctly, the histogram of the ordered data is this one.
>
> And the histogram of the given data is this one.
>
> By which of these two do we check the symmetry?

Those are not histograms. They appear to be plots of the data itself. And indeed they have 300 points.
A histogram categorizes the data in bins and makes a bar graph of them.
It means that the data is effectively sorted in those bins, and we should have only 10 or 20 bars or so.

How did you make those graphs?

#### mathmari

##### Well-known member
MHB Site Helper
> Those are not histograms. They appear to be plots of the data itself. And indeed they have 300 points.
> A histogram categorizes the data in bins and makes a bar graph of them.
> It means that the data is effectively sorted in those bins, and we should have only 10 or 20 bars or so.

The minimum value is $42.075$ and the maximum value is $153.574$. So the bins could have the interval length $11$, and we would get the intervals $42-53$, $53-64$, $64-75$, $75-86$, $86-97$, $97-108$, $108-119$, $119-130$, $130-141$, $141-152$, $152-163$, right?

> How did you make those graphs?

I selected all the $300$ points and then I created the graph.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> The minimum value is $42.075$ and the maximum value is $153.574$. So the bins could have the interval length $11$, and we would get the intervals $42-53$, $53-64$, $64-75$, $75-86$, $86-97$, $97-108$, $108-119$, $119-130$, $130-141$, $141-152$, $152-163$, right?

That is a possible choice for the bins, yes.

> I selected all the $300$ points and then I created the graph.

I guess you created a general bar graph instead of an actual histogram.
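
To make the difference concrete, here is a minimal sketch of the binning step that a histogram performs, using the bins proposed above (start $42$, width $11$, $11$ bins). The function name and the sample values are hypothetical; only the minimum $42.075$ and the maximum $153.574$ are taken from the thread, and each bin is assumed to include its lower bound and exclude its upper bound.

```python
def frequency_table(values, start, width, n_bins):
    """Count values in half-open bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:  # values outside the bins are ignored
            counts[index] += 1
    return counts

# Hypothetical sample, NOT the 300 values from the exercise.
sample = [42.075, 58.3, 61.0, 77.7, 90.2, 95.5, 101.4, 118.9, 153.574]
counts = frequency_table(sample, start=42, width=11, n_bins=11)

# A text-only "bar graph": one bar per bin, not one bar per data point.
for i, count in enumerate(counts):
    lower = 42 + 11 * i
    print(f"{lower}-{lower + 11}: {'#' * count} ({count})")
```

The plot of the raw data has one bar per value, whereas this table has one bar per bin regardless of how many values there are.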

#### mathmari

##### Well-known member
MHB Site Helper
> That is a possible choice for the bins, yes.

I got the following:

That means that the distribution is symmetric, or not?

> I guess you created a general bar graph instead of an actual histogram.

Ahh ok!

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I got the following:
>
> That means that the distribution is symmetric, or not?

Yep. All correct.

#### mathmari

##### Well-known member
MHB Site Helper
> Yep. All correct.

Great!!

In the next question we have to create the frequency distribution of the sale prices. The given data is the square meters of the houses for sale, so how can we get the frequency distribution of the prices? I am stuck right now. Isn't some information missing?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> In the next question we have to create the frequency distribution of the sale prices. The given data is the square meters of the houses for sale, so how can we get the frequency distribution of the prices? I am stuck right now. Isn't some information missing?

If we only have data about the square meters, we can only make a histogram of those.
Perhaps that is intended?
Prices are correlated to square meters after all.
Still, without price information, we can indeed not say anything about prices.

#### mathmari

##### Well-known member
MHB Site Helper
> If we only have data about the square meters, we can only make a histogram of those.
> Perhaps that is intended?
> Prices are correlated to square meters after all.
> Still, without price information, we can indeed not say anything about prices.

If the histogram of the square meters is intended, did we have to check the symmetry in another way, since the histogram is only asked for in the next question?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> If the histogram of the square meters is intended, did we have to check the symmetry in another way, since the histogram is only asked for in the next question?

A histogram is the bar graph of a frequency distribution table.
So first we make the table and then we create the graph.

#### mathmari

##### Well-known member
MHB Site Helper
> A histogram is the bar graph of a frequency distribution table.
> So first we make the table and then we create the graph.

I got stuck now. How do we create the frequency distribution table?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I got stuck now. How do we create the frequency distribution table?

Take a look at your previous histogram. Doesn't it have a table on the left? A table with columns titled Class and Frequency?
That is the frequency distribution table.

#### mathmari

##### Well-known member
MHB Site Helper
> Take a look at your previous histogram. Doesn't it have a table on the left? A table with columns titled Class and Frequency?
> That is the frequency distribution table.

Ahh so we get this table also automatically from Excel.

I have a question. For the intervals, is it correct that the upper bound of one interval is equal to the lower bound of the next, or should the next interval start at the next number?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I have a question. For the intervals, is it correct that the upper bound of one interval is equal to the lower bound of the next, or should the next interval start at the next number?

If the next interval starts at the next number, doesn't that mean we have 'gaps' between the intervals?
Whatever we do, there must not be gaps!

The classes must cover all possible values. And yes, that means there is some ambiguity at the boundaries.
Different conventions are used here.

If we are talking about integers, it is quite common that upper bounds are 1 less than the next lower bound.
This also happens with age groups.
So we might have for instance age groups 18-24, 25-29, 30-34. Note that in this case age 24 also covers people that are 1 day before their 25th birthday.

If we are talking about real numbers, the lower boundaries must be equal to the upper boundaries, since otherwise there would be gaps.
Of course we have a problem now with a number that is exactly on a boundary. Which interval should it belong to?
Then we need to make a consistent choice to put the number either in the interval below or in the interval above.
The classes are then for instance [1.1, 2.2), [2.2, 3.3), [3.3, 4.4), [4.4, 5.5].
This is more explicit than writing 1.1-2.2, 2.2-3.3, 3.3-4.4, 4.4-5.5, which does not address the ambiguity.
Note that different programs use different conventions.
Excel identifies each class with the upper bound of the corresponding interval, and additionally introduces the extra class 'Larger'.
So with bins 1.1, 2.2, 3.3, 4.4, 5.5, we get the classes ($-\infty$, 1.1], (1.1, 2.2], (2.2, 3.3], (3.3, 4.4], (4.4, 5.5], Larger.

Btw, if we are talking about continuous probability distributions, the chance that a value is exactly on a boundary is supposedly infinitely small (up to machine precision), so there should be no need to worry about it too much.
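
The two boundary conventions can be sketched with the standard `bisect` module. The example reuses the edges 1.1, 2.2, ..., 5.5 from above; the function names are made up for illustration, and the handling of values below the first edge or at the last edge is left out on purpose.

```python
from bisect import bisect_left, bisect_right

edges = [1.1, 2.2, 3.3, 4.4, 5.5]

def left_closed_class(x):
    """Index i of the class [edges[i], edges[i+1]) that contains x."""
    return bisect_right(edges, x) - 1

def right_closed_class(x):
    """Index i of the class (edges[i], edges[i+1]] that contains x."""
    return bisect_left(edges, x) - 1

# A value strictly inside a class gets the same index under both conventions.
print(left_closed_class(3.0), right_closed_class(3.0))

# A value exactly on a boundary lands in different classes.
print(left_closed_class(2.2))   # class [2.2, 3.3)
print(right_closed_class(2.2))  # class (1.1, 2.2], the Excel-style choice
```

Either convention works as long as it is applied consistently to every boundary.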

#### mathmari

##### Well-known member
MHB Site Helper
I got it!!

In the next question we have to estimate the mean value and the median from the data of the frequency distribution.

We get the following, don't we?

The first mid-point is $(0+42)/2=21$, or not? And we cannot calculate the median of the class Larger, can we?

Therefore the mean value is $\frac{30739}{300}=102.463$.

At the beginning of the exercise I calculated the mean value of the square meters to be $102.307$. So the estimated mean value $102.463$ is close to it, right?

For the estimated median do we use the formula $$\text{lower boundary of group of median}+\frac{\frac{\text{total number of values}}{2}-\text{sum of frequencies before median}}{\text{frequency of the median group}}\cdot \text{group width}$$ ?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I got it!!
>
> In the next question we have to estimate the mean value and the median from the data of the frequency distribution.
>
> We get the following, don't we?
>
> The first mid-point is $(0+42)/2=21$, or not?

We have a fixed bin size of 11, don't we?
Shouldn't we pick the first mid-point then at $42 - \frac{11}2 = 36.5$ for consistency?
It doesn't really matter though, since the corresponding frequency is 0. So it doesn't contribute to the calculation of the mean. Good.

> And we cannot calculate the median of the class Larger, can we?

We might calculate its midpoint by using the fixed bin size of 11 again.
There is no need though, as this bin should be empty. And it is.

> Therefore the mean value is $\frac{30739}{300}=102.463$.
>
> At the beginning of the exercise I calculated the mean value of the square meters to be $102.307$. So the estimated mean value $102.463$ is close to it, right?

Yep.
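
As a sketch of the estimate being checked here: each class is represented by its mid-point, weighted by its frequency. The small table below is hypothetical (the real frequency table from the exercise is not reproduced in this thread), and the function name is made up for illustration.

```python
def grouped_mean(lower_edges, width, frequencies):
    """Estimate the mean from a frequency table using class mid-points."""
    total = sum(frequencies)
    weighted_sum = sum(
        (lower + width / 2) * frequency
        for lower, frequency in zip(lower_edges, frequencies)
    )
    return weighted_sum / total

# Hypothetical table with bin width 11, NOT the table from the exercise.
lower_edges = [42, 53, 64, 75]
frequencies = [2, 5, 2, 1]

estimate = grouped_mean(lower_edges, 11, frequencies)
print(estimate)  # (47.5*2 + 58.5*5 + 69.5*2 + 80.5*1) / 10
```

A class with frequency 0 adds nothing to the weighted sum, which is why the choice of its mid-point does not matter.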

> For the estimated median do we use the formula $$\text{lower boundary of group of median}+\frac{\frac{\text{total number of values}}{2}-\text{sum of frequencies before median}}{\text{frequency of the median group}}\cdot \text{group width}$$ ?

That looks correct to me yes.
We can compare it with the real median, which is the average of the 2 values in the middle after sorting.

#### mathmari

##### Well-known member
MHB Site Helper
> That looks correct to me yes.

So, for that formula do we need to know the real median? Or do we assume in which interval the median will be?


#### Klaas van Aarsen

##### MHB Seeker
Staff member
> So, for that formula do we need to know the real median? Or do we assume in which interval the median will be?

Can't we find the interval with the median uniquely?

Suppose we add a column with the partial sums of the frequencies that came before.
Then the median is in the interval where that partial sum grows beyond $\frac{\text{total number of values}}{2}$ or $50\%$, isn't it?
The $\text{sum of frequencies before the median}$ is that partial sum before we cross $\frac{\text{total number of values}}{2}$.
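
The procedure can be sketched as follows: walk through the classes while keeping the partial sum, stop at the first class where the partial sum reaches half the total, and apply the interpolation formula there. The table is hypothetical (not the one from the exercise) and the function name is made up for illustration.

```python
def grouped_median(lower_edges, width, frequencies):
    """Estimate the median from a frequency table by linear interpolation."""
    half = sum(frequencies) / 2
    cumulative = 0  # sum of frequencies before the current class
    for lower, frequency in zip(lower_edges, frequencies):
        if frequency > 0 and cumulative + frequency >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / frequency * width
        cumulative += frequency
    raise ValueError("frequency table is empty")

# Hypothetical table with bin width 11, NOT the table from the exercise.
lower_edges = [42, 53, 64, 75]
frequencies = [2, 5, 2, 1]

median_estimate = grouped_median(lower_edges, 11, frequencies)
print(median_estimate)  # 53 + (5 - 2) / 5 * 11
```

No knowledge of the real median is needed; the median class follows from the frequencies alone.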

#### mathmari

##### Well-known member
MHB Site Helper
> Can't we find the interval with the median uniquely?
>
> Suppose we add a column with the partial sums of the frequencies that came before.
> Then the median is in the interval where that partial sum grows beyond $\frac{\text{total number of values}}{2}$ or $50\%$, isn't it?
> The $\text{sum of frequencies before the median}$ is that partial sum before we cross $\frac{\text{total number of values}}{2}$.

Ahh ok!! Thank you very much for your help!!