Do I have enough information to create a normal distribution

alpha754293 · Mar 13, 2015

Given five data points (minimum, 25th percentile, 50th percentile, 75th percentile, maximum), do I have enough information to be able to construct what a normal (Gaussian) distribution would look like?

I have no data on any other statistical information (population size, mean, median, mode, standard deviation, etc...) nor do I have any means of collecting or obtaining that data.

If it isn't possible, that's okay. Just thought that I would ask.

Thank you in advance for all your help.

Stephen Tashi · Mar 13, 2015

Statistical estimation is the problem of estimating properties of the population distribution from a sample of data. Apparently this is the type of problem you have since a normally distributed population wouldn't have a maximum or minimum value, but a sample from that distribution would.

You can't "solve" such a problem of estimation in the sense of finding an estimate that is guaranteed to be correct. (For example, you can't "solve" for the mean of a population by using data from a sample. All you can do is compute an estimate of the mean of the population.)

You can certainly estimate a normal distribution that fits the data you have. However, there are different ways of doing estimation and different definitions of what makes an estimate "good" or "best". Are you looking for a "textbook method" - one that everybody in your field would acknowledge as respectable? I don't know of such a method. Your best bet would be to look at papers that have been published in your field and see if you can find one.

On the other hand, if you don't need precedent and the weight of authority, you could fit a normal distribution to the data in various ways. This would require some further thinking. For example, you could estimate the mean of the distribution from the 50% value and the standard deviation from the difference in the 50% and 75% values, but what can we say about the properties of such an estimation method?
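As a rough sketch of that idea (not a standard or endorsed method), the snippet below takes the mean from the 50% value and the standard deviation from the gap between the 50% and 75% values, using the fact that the 75th percentile of a normal distribution lies about 0.6745 standard deviations above its mean. The quartile values are made up for illustration, and the second estimate based on the full interquartile range is an added comparison, not something from the post above.

```python
from statistics import NormalDist

# Hypothetical summary values, made up for illustration only.
q1, median, q3 = 7.0, 10.0, 12.0

# z-score of the 75th percentile of a standard normal (about 0.6745).
z75 = NormalDist().inv_cdf(0.75)

mu = median                          # a normal's mean equals its median
sigma_upper = (q3 - median) / z75    # uses only the 50% and 75% values
sigma_iqr = (q3 - q1) / (2 * z75)    # uses the full interquartile range

# 25th percentile implied by the first estimate, to compare with q1.
implied_q1 = NormalDist(mu, sigma_upper).inv_cdf(0.25)

print(f"mu={mu:.3f} sigma_upper={sigma_upper:.3f} sigma_iqr={sigma_iqr:.3f}")
print(f"implied 25% value={implied_q1:.3f}, observed 25% value={q1:.3f}")
```

With asymmetric quartiles like these, the two standard deviation estimates disagree and the implied 25% value misses the observed one, which is one way to see that the choice of method matters.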

alpha754293 · Mar 13, 2015

Well...it was my understanding that if you were to plot a normal distribution, you will still have like a 0th (or 1st percentile) value and also either a 100th (or 99th percentile) which would be your min and max respectively. (Please correct me if I am wrong, but from what I remember/understand, what's colloquially termed as "tails" of the distribution describes those max/min values).

So I didn't know if there was a way to construct the plot using the data that I have (however minimal) -- which is the premise of the question. (Or put another way, construct the distribution such that it has to pass through those data points. I might not be able to describe the parameters that determine the "shape" of the distribution (in layman's terms, how "narrow" or "wide" the "bell" is), but because we know that it has to pass through those points, that might give some indication of it.)

Like I said, I don't know. And I can totally be talking outta my butt, but that's why I posted the question, to see if there are other people who know more than I do and can help me answer/figure this question out.

I'm just looking to "visualise" the data that I've got. That's all. It doesn't have to be perfect and it isn't going to be used for anything else, by anyone else.

Thanks.

Stephen Tashi · Mar 13, 2015

alpha754293 said:

Well...it was my understanding that if you were to plot a normal distribution, you will still have like a 0th (or 1st percentile) value and also either a 100th (or 99th percentile) which would be your min and max respectively.

You are correct if maximum and minimum refer to percentiles.
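A small sketch of the distinction, using Python's standard-library `statistics.NormalDist` on a standard normal (the choice of mean 0 and standard deviation 1 is just for illustration): the 1st and 99th percentiles are finite numbers, while the 0th and 100th percentiles do not exist as finite values, so the library refuses to compute them.

```python
from statistics import NormalDist, StatisticsError

std = NormalDist(mu=0.0, sigma=1.0)

# Percentiles strictly between 0% and 100% are finite.
p01 = std.inv_cdf(0.01)
p99 = std.inv_cdf(0.99)
print(f"1st percentile: {p01:.3f}, 99th percentile: {p99:.3f}")

# The 100th percentile would be +infinity; inv_cdf rejects p = 1.0.
try:
    std.inv_cdf(1.0)
    has_finite_max = True
except StatisticsError:
    has_finite_max = False
print("finite 100th percentile:", has_finite_max)
```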

(Or put another way, construct the distribution such that it has to pass through those data points. I might not be able to describe the parameters that determine the "shape" of the distribution (in layman's terms, how "narrow" or "wide" the "bell" is), but because we know that it has to pass through those points, that might give some indication of it.)

If you are lucky enough to find a normal distribution that matches all the data points exactly, then you are in business. However, the usual situation in estimation by curve fitting is that there is some mismatch between the data and the curve, so the fitting method depends on how you define the "error" in the fit. A common method of fitting distributions to data is to fit the cumulative distribution to the data instead of fitting the probability density function to the data. This means you would plot the CDF of a normal distribution against the data points of the cumulative histogram. You can leave the question of which normal distribution best fits the data to "eyeball" judgement. However, if you want to make any specific claims about your estimation method, you will have to define what measure of "error" your curve fit minimizes. For example, does it minimize the least squares error between the data points and the curve, the absolute error, or some weighted measure of error where the larger percentile data points are given more weight than the smaller ones?
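To make the three error measures concrete, here is a minimal sketch that scores one candidate normal distribution against cumulative data points. The quartile values, the candidate parameters, the weights, and the helper name `fit_error` are all hypothetical, chosen only for illustration. The minimum and maximum are left out because, without the sample size, it is unclear which cumulative probabilities they correspond to.

```python
from statistics import NormalDist

# Hypothetical (value, cumulative probability) pairs for the three quartiles.
points = [(7.0, 0.25), (10.0, 0.50), (12.0, 0.75)]

def fit_error(mu, sigma, points, measure="squares", weights=None):
    """Mismatch between a normal CDF and the cumulative data points."""
    dist = NormalDist(mu, sigma)
    if weights is None:
        weights = [1.0] * len(points)
    # Vertical gap between the candidate CDF and each data point.
    residuals = [dist.cdf(x) - p for x, p in points]
    if measure == "squares":
        return sum(w * r * r for w, r in zip(weights, residuals))
    if measure == "absolute":
        return sum(w * abs(r) for w, r in zip(weights, residuals))
    raise ValueError(f"unknown measure: {measure}")

candidate = (10.0, 3.0)  # an "eyeballed" mean and standard deviation
sse = fit_error(*candidate, points)
sae = fit_error(*candidate, points, measure="absolute")
# Weighted squares, giving larger percentiles more weight.
weighted = fit_error(*candidate, points, weights=[1.0, 2.0, 3.0])

print(f"squares={sse:.5f} absolute={sae:.5f} weighted={weighted:.5f}")
```

Different measures can rank the same candidates differently, so the "best" fit depends on which one is chosen.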

alpha754293 · Mar 13, 2015

Yeah, I'm not sure if I would be able to calculate the error because I don't have any access to the underlying data behind what gave those numbers. (If I did/could get access to it, then I would have processed the raw data instead, but I can't.)

But as a first pass/crack at it, it will be "good enough". If I really need to/want to refine it, then I can make the pitch that I need more data to be able to do a better job at it. And then I can get other people's help to see if they can get me the underlying data (although I don't think that they'll ever release it), but that's okay too.

But the question was really about the feasibility of it - a kind of "gut" check if you will - about whether I can do it, knowing that the result can't, by definition, be "accurate", only "accurate enough".

Thanks!

Stephen Tashi · Mar 13, 2015

You can certainly do a trial fit by trial and error. It is a reasonable idea. It's "not completely absurd", if that's all the assurance you need.
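One way to automate the trial and error is a coarse grid search: try many (mean, standard deviation) pairs and keep the one whose CDF comes closest to the quartile points. This is only a sketch under assumptions: the quartile values are made up, the grid ranges are arbitrary, and the squared-error measure is one choice among several.

```python
from statistics import NormalDist

# Hypothetical quartile values paired with their cumulative probabilities.
points = [(7.0, 0.25), (10.0, 0.50), (12.0, 0.75)]

def squared_error(mu, sigma):
    """Sum of squared gaps between the normal CDF and the data points."""
    dist = NormalDist(mu, sigma)
    return sum((dist.cdf(x) - p) ** 2 for x, p in points)

# Candidate grids (arbitrary ranges and step size, chosen for illustration).
mus = [6.0 + 0.05 * i for i in range(161)]     # roughly 6.0 to 14.0
sigmas = [0.5 + 0.05 * j for j in range(191)]  # roughly 0.5 to 10.0

# Try every pair and keep the one with the smallest error.
best_mu, best_sigma = min(
    ((m, s) for m in mus for s in sigmas),
    key=lambda ms: squared_error(*ms),
)

print(f"best mu={best_mu:.2f}, best sigma={best_sigma:.2f}")
print(f"error={squared_error(best_mu, best_sigma):.5f}")
```

The resulting curve is good enough for a visual first pass, but it carries no guarantee about how close it is to the underlying population.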
