Do I have enough information to create a normal distribution

In summary, the conversation discusses the problem of statistical estimation and whether it is possible to construct a normal distribution with only five data points. The participants also discuss the different methods of estimating a normal distribution and the importance of defining the measure of error in the fit. Additionally, they mention the possibility of fitting the cumulative distribution instead of the probability density function.
  • #1
alpha754293
Given five data points (minimum, 25th percentile, 50th percentile, 75th percentile, maximum), do I have enough information to be able to construct what a normal (Gaussian) distribution would look like?

I have no data on any other statistical information (population size, mean, median, mode, standard deviation, etc...) nor do I have any means of collecting or obtaining that data.

If it isn't possible, that's okay. Just thought that I would ask.

Thank you in advance for all your help.
 
  • #2
Statistical estimation is the problem of estimating properties of the population distribution from a sample of data. Apparently this is the type of problem you have since a normally distributed population wouldn't have a maximum or minimum value, but a sample from that distribution would.

You can't "solve" such a problem of estimation in the sense of finding an estimate that is guaranteed to be correct. (For example, you can't "solve" for the mean of a population by using data from a sample. All you can do is compute an estimate of the mean of the population. )

You can certainly estimate a normal distribution that fits the data you have. However, there are different ways of doing estimation and different definitions of what makes an estimate "good" or "best". Are you looking for a "textbook method" - one that everybody in your field would acknowledge as respectable? I don't know of such a method. Your best bet would be to look at papers that have been published in your field and see if you can find one.

On the other hand, if you don't need precedent and the weight of authority, you could fit a normal distribution to the data in various ways. This would require some further thinking. For example, you could estimate the mean of the distribution from the 50% value and the standard deviation from the difference in the 50% and 75% values, but what can we say about the properties of such an estimation method?
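
A minimal sketch of that quartile-based estimate, using only Python's standard library. The five numbers are made up for illustration, and `normal_from_quartiles` is a hypothetical helper name. It assumes the 50% value is the mean (true for a normal distribution) and that the 75% value lies about 0.6745 standard deviations above it.

```python
from statistics import NormalDist

def normal_from_quartiles(q2, q3):
    """Estimate (mu, sigma) from the median and the 75th percentile."""
    z75 = NormalDist().inv_cdf(0.75)   # about 0.6745 for a standard normal
    mu = q2                            # median equals mean for a normal
    sigma = (q3 - q2) / z75
    return mu, sigma

# Hypothetical five-number summary, for illustration only.
minimum, q1, q2, q3, maximum = 12.0, 41.0, 50.0, 58.0, 95.0

mu, sigma = normal_from_quartiles(q2, q3)
fitted = NormalDist(mu, sigma)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")

# The fitted curve reproduces q2 and q3 by construction, so the
# reported 25th percentile is the only quartile left as a check.
print(f"fitted 25th percentile = {fitted.inv_cdf(0.25):.2f} (reported: {q1})")
```

This ignores the 25% value, the minimum and the maximum entirely, which is one reason the properties of the method are an open question.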
 
  • #3
Well...it was my understanding that if you were to plot a normal distribution, you will still have like a 0th (or 1st percentile) value and also either a 100th (or 99th percentile) which would be your min and max respectively. (Please correct me if I am wrong, but from what I remember/understand, what's colloquially termed as "tails" of the distribution describes those max/min values).

So I didn't know if there was a way to construct the plot using the data that I have (however minimal) -- which is the premise of the question. (Or put in another way, construct the distribution such that it will have to pass through those data points. I might not describe the parameters that describe the "shape" of the distribution (in layman's terms, how "narrow" or "wide" the shape of the "bell" is), but because we know that it has to pass through those points, that might give some indication of it.)

Like I said, I don't know. And I can totally be talking outta my butt, but that's why I posted the question: to see if there are other people who know more than I do and can try to help me answer/figure this question out.

I'm just looking to "visualise" the data that I've got. That's all. It doesn't have to be perfect and it isn't going to be used for anything else, by anyone else.

Thanks.
 
  • #4
alpha754293 said:
Well...it was my understanding that if you were to plot a normal distribution, you will still have like a 0th (or 1st percentile) value and also either a 100th (or 99th percentile) which would be your min and max respectively.

You are correct if maximum and minimum refer to percentiles.

alpha754293 said:
(Or put in another way, construct the distribution such that it will have to pass through those data points. I might not describe the parameters that describe the "shape" of the distribution (in layman's terms, how "narrow" or "wide" the shape of the "bell" is), but because we know that it has to pass through those points, that might give some indication of it.)

If you are lucky enough to find a normal distribution that matches all the data points exactly, then you are in business. However, the usual situation in estimation by means of curve fitting is that there is some mismatch between the data and the curve. So the fitting method depends on defining how you measure the "error" in the fit. A common method of fitting distributions to data is to fit the cumulative distribution to the data instead of fitting the probability density function to the data. This means you would plot the CDF of a normal distribution vs data points of the cumulative histogram. You can leave the question of which normal distribution best fits the data to "eyeball" judgement. However, if you want to make any specific claims about your estimation method, you will have to define what measure of "error" your curve fit minimizes. For example, does it minimize least squares error between the data points and the curve? Or does it minimize the absolute error? Or does it minimize some weighted measure of error where the larger percentile data points are given more weight than the smaller ones?
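
As a rough sketch of fitting the CDF under a chosen error measure, the standard-library code below tries every (mu, sigma) pair on a coarse grid and keeps the one with the smallest error. The data values, the grid ranges and the function names `fit_error` and `grid_fit` are assumptions for illustration. Only the three quartiles are used, because the minimum and maximum of a sample do not correspond to a fixed cumulative fraction unless the sample size is known.

```python
from statistics import NormalDist

def fit_error(mu, sigma, points, kind="squared", weights=None):
    """Mismatch between a normal CDF and (value, cumulative fraction) points."""
    dist = NormalDist(mu, sigma)
    weights = weights or [1.0] * len(points)
    total = 0.0
    for (x, p), w in zip(points, weights):
        residual = dist.cdf(x) - p
        total += w * (residual ** 2 if kind == "squared" else abs(residual))
    return total

def grid_fit(points, mus, sigmas, kind="squared", weights=None):
    """Return the (mu, sigma) pair on the grid with the smallest error."""
    return min(
        ((m, s) for m in mus for s in sigmas),
        key=lambda ms: fit_error(ms[0], ms[1], points, kind, weights),
    )

# Hypothetical quartiles: (value, cumulative fraction).
points = [(41.0, 0.25), (50.0, 0.50), (58.0, 0.75)]
mus = [40 + 0.1 * i for i in range(201)]     # candidate means, 40 to 60
sigmas = [5 + 0.1 * i for i in range(151)]   # candidate std devs, 5 to 20

for kind in ("squared", "absolute"):
    mu, sigma = grid_fit(points, mus, sigmas, kind)
    print(f"{kind:8s} error: mu = {mu:.1f}, sigma = {sigma:.1f}")
```

Passing a `weights` list such as `[1, 1, 2]` would give the 75% point more influence, which corresponds to the weighted option mentioned above. The different error measures can pick different curves, which is why the choice has to be stated.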
 
  • #5
Stephen Tashi said:
You are correct if maximum and minimum refer to percentiles.
If you are lucky enough to find a normal distribution that matches all the data points exactly, then you are in business. However, the usual situation in estimation by means of curve fitting is that there is some mismatch between the data and the curve. So the fitting method depends on defining how you measure the "error" in the fit. A common method of fitting distributions to data is to fit the cumulative distribution to the data instead of fitting the probability density function to the data. This means you would plot the CDF of a normal distribution vs data points of the cumulative histogram. You can leave the question of which normal distribution best fits the data to "eyeball" judgement. However, if you want to make any specific claims about your estimation method, you will have to define what measure of "error" your curve fit minimizes. For example, does it minimize least squares error between the data points and the curve? Or does it minimize the absolute error? Or does it minimize some weighted measure of error where the larger percentile data points are given more weight than the smaller ones?

Yeah, I'm not sure if I would be able to calculate the error because I don't have any access to the underlying data behind what gave those numbers. (If I did/could get access to it, then I would have processed the raw data instead, but I can't.)

But as a first pass/crack at it, it will be "good enough". If I really need to/want to refine it, then I can make the pitch that I need more data to be able to do a better job at it. And then I can get other people's help to see if they can get me the underlying data (although I don't think that they'll ever release it), but that's okay too.

But the question was really about the feasibility of it - a kind of "gut" check, if you will - about whether I can do it, knowing that it can't, by definition, be "accurate", only "accurate enough".

Thanks!
 
  • #6
You can certainly do a trial fit by trial and error. It is a reasonable idea. It's "not completely absurd", if that's all the assurance you need.
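
One quick check before the trial fit, offered as a sketch with made-up numbers: a normal distribution is symmetric, so the 75% value should sit about as far above the 50% value as the 25% value sits below it. The helper name `quartile_skew` is hypothetical; the quantity it computes is the usual quartile (Bowley) skewness.

```python
def quartile_skew(q1, q2, q3):
    """Quartile (Bowley) skewness: 0 when the quartiles are symmetric."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

# Hypothetical quartiles, for illustration only.
q1, q2, q3 = 41.0, 50.0, 58.0
skew = quartile_skew(q1, q2, q3)
print(f"quartile skewness = {skew:+.3f}")
```

A value near zero is consistent with a normal shape but does not prove it; a value far from zero suggests no normal curve will pass close to all three quartiles.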
 

Related to Do I have enough information to create a normal distribution

1. Do I need a large sample size to create a normal distribution?

No, a normal distribution can be fitted to a small sample. However, estimates of the mean and standard deviation from larger samples tend to be more accurate.

2. What is the minimum amount of data necessary to create a normal distribution?

There is no specific minimum amount of data required to fit a normal distribution. However, a common rule of thumb is to have at least 30 data points.

3. Can I create a normal distribution from non-numerical data?

No, a normal distribution can only be created from numerical data. Non-numerical data such as categories or words cannot be used to create a normal distribution.

4. Does my data need to be normally distributed to create a normal distribution?

No, a normal distribution can be fitted to any numerical data. However, the fitted curve will describe the data poorly if the data are not approximately normally distributed.

5. Can I use any statistical software to create a normal distribution?

Yes, most statistical software programs have the capability to create a normal distribution. However, it is important to ensure that the software is using the correct parameters and assumptions for creating a normal distribution.
