# Determine the size of sample

#### mathmari

##### Well-known member
MHB Site Helper
Hey!! The daily turnover $X$ of Cafes has the expected value $\mu_X = 600$ Euro and the standard deviation $\sigma_X = 30$ Euro.

(a) At least how many cafes must be surveyed in a random sample so that, with a probability of at least $95\%$, $\overline{X}_n$ deviates from $\mu_X$ by less than $12$ euros?

(b) After a survey of $500$ cafes, the arithmetic mean is $690$. Is this result surprising in light of part (a)?

I have done the following:

(a) From Chebyshev's inequality we have that \begin{equation*} P\left (\left |\overline{X}_n-E(\overline{X}_n)\right |< \epsilon\right )\geq 1-\frac{V(\overline{X}_n)}{\epsilon^2}\end{equation*} with $E(\overline{X}_n)=\mu_X=600$ and $V(\overline{X}_n)=\frac{\sigma_X^2}{n}=\frac{30^2}{n}=\frac{900}{n}$.

It must hold \begin{align*} P\left (\left |\overline{X}_n-600\right |< 12\right )\geq 95\% &\Rightarrow 1-\frac{V(\overline{X}_n)}{12^2}\geq 95\% \\ & \Rightarrow 1-\frac{\frac{900}{n}}{144}\geq 0.95 \\ & \Rightarrow 1-0.95\geq \frac{\frac{900}{n}}{144} \\ & \Rightarrow 0.05\geq \frac{25}{4n} \\ & \Rightarrow n\geq \frac{25}{4\cdot 0.05} \\ & \Rightarrow n\geq 125\end{align*}

That means that at least $125$ Cafes have to be surveyed.
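As a quick numerical check of the derivation above, here is a minimal Python sketch of the Chebyshev bound. The function name `chebyshev_sample_size` and its parameters are made up for illustration; the values $\sigma_X = 30$, $\epsilon = 12$ and the $95\%$ level come from the problem statement.

```python
import math


def chebyshev_sample_size(sigma: float, eps: float, confidence: float) -> int:
    """Smallest n with 1 - sigma^2 / (n * eps^2) >= confidence (Chebyshev bound)."""
    alpha = 1.0 - confidence
    bound = sigma ** 2 / (alpha * eps ** 2)
    # Round first so that floating-point noise around an integer
    # (e.g. 124.9999999999) does not change the result of the ceiling.
    return math.ceil(round(bound, 9))


# Values from the problem: sigma_X = 30, eps = 12, probability at least 95 %
n_chebyshev = chebyshev_sample_size(sigma=30.0, eps=12.0, confidence=0.95)
print(n_chebyshev)
```

The bound makes no assumption about the shape of the distribution of $X$, which is why it tends to ask for a comparatively large sample.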

Is this correct?

(b) Why does this not hold when the number of surveyed cafes is $500$?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Hey mathmari !!

(a) Unfortunately Chebyshev's inequality only gives a very wide result.
We can do better. Since both the expected value and the standard deviation are given, we can take the sample mean to be (approximately) normally distributed, $\overline X_n \sim N(\mu_X, \frac{\sigma_X}{\sqrt n})$, where the second parameter denotes the standard deviation.
Therefore we have:
$$P(|\overline X_n - \mu_X| < 12) = P\left(\frac{|\overline X_n - \mu_X|}{\sigma_X/\sqrt n} < \frac{12}{\sigma_X/\sqrt n}\right) = P\left(|Z| < \frac{12}{\sigma_X/\sqrt n}\right)$$
The critical z-value for a confidence interval of $95\%$ is $z^*=1.96$.
Therefore the critical sample size $n^*$ satisfies:
$$z^* = \frac{12}{\sigma_X/\sqrt {n^*}}=1.96\quad\Rightarrow\quad n^*=24.0$$
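The critical sample size above can be checked numerically. The following is a minimal sketch that assumes the sample mean is normally distributed; the helper name `normal_sample_size` is hypothetical, and `statistics.NormalDist` from the Python standard library supplies the critical z-value.

```python
import math
from statistics import NormalDist


def normal_sample_size(sigma: float, eps: float, confidence: float):
    """Return (z*, n*, smallest integer n) for P(|mean - mu| < eps) >= confidence."""
    # Two-sided critical value: P(|Z| < z*) = confidence
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Solve z* = eps / (sigma / sqrt(n*)) for n*
    n_star = (z_crit * sigma / eps) ** 2
    return z_crit, n_star, math.ceil(n_star)


z_crit, n_star, n_normal = normal_sample_size(sigma=30.0, eps=12.0, confidence=0.95)
print(round(z_crit, 2), round(n_star, 2), n_normal)
```

Since $n^*$ is slightly above $24$, the smallest whole number of cafes that satisfies the requirement is $25$.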
Thus we need a sample size $n>24$, that is, at least $25$ cafes.

(b) Are we surprised to see the much larger deviation $\overline x - \mu_X = 690-600=90$ when the number of surveyed cafes is $n=500$?
Yes! It means that our assumed expected value and standard deviation are likely not correct!
Can we calculate how confident we are that they are incorrect?

#### mathmari

##### Well-known member
MHB Site Helper
(a) I understand!! Are we using the Central Limit Theorem here?

(b) How can we calculate that probability?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Regarding the Central Limit Theorem: that's one way to look at it.

Then again, suppose that $X_1, ..., X_n$ are independent random variables with the same distribution $N(\mu_X,\sigma_X)$.
Then:
$$\sigma^2\left(\frac 1n(X_1+...+X_n)\right) = \frac 1{n^2}\left(\sigma^2(X_1) + ... + \sigma^2(X_n)\right) = \frac {\sigma_X^2}n \quad\Rightarrow\quad \sigma(\overline X) = \frac{\sigma_X}{\sqrt n}$$
In other words, the standard deviation of the sample mean, $\sigma_{\overline X}$, which is also called the standard error $SE$, is:
$$SE=\sigma_{\overline X} = \frac{\sigma_X}{\sqrt n}$$
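The formula for the standard error can also be checked empirically. The sketch below is only an illustration under assumed settings: it draws normally distributed turnovers with $\mu_X = 600$ and $\sigma_X = 30$, and the sample size `n = 25`, the number of trials and the helper name `simulated_standard_error` are arbitrary choices, not part of the problem.

```python
import random
import statistics


def simulated_standard_error(mu: float, sigma: float, n: int,
                             trials: int, seed: int = 0) -> float:
    """Standard deviation of `trials` simulated sample means of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


n = 25
theoretical_se = 30.0 / n ** 0.5  # sigma_X / sqrt(n)
empirical_se = simulated_standard_error(600.0, 30.0, n, trials=5000)
print(theoretical_se, round(empirical_se, 2))
```

With enough trials the simulated value should land close to $\sigma_X/\sqrt n$; the agreement is statistical, not exact.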

As for calculating that probability: pick the null hypothesis $H_0: \mu=600, \sigma=30$, and the alternative hypothesis $H_1: \mu\ne 600 \lor \sigma\ne 30$.
Then assuming $H_0$, $\overline X$ has the normal distribution $N(600, \frac{30}{\sqrt{500}})$ for a sample size of $n=500$.
What is the following probability?
$$P(|\overline X-600| > |690 - 600|)$$

#### mathmari

##### Well-known member
MHB Site Helper
I see!!

Do we calculate this probability using the significance level? Or am I thinking about it in the wrong way?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Sure. Let's pick $\alpha=0.05$, although I'm really interested in the so-called p-value that we can compare with $\alpha$.

#### mathmari

##### Well-known member
MHB Site Helper
We have that $$P(|\overline X-600| > |690 - 600|)=P(|\overline X-600| > 90)=1-P(|\overline X-600| \leq 90)$$ Do we use the distribution function of the normal distribution here?

I am stuck at this point.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
Yep.
We have:
$$p = P(|\overline X-600| > |690 - 600|)=P\left(\frac{|\overline X-600|}{30/\sqrt{500}} > \frac{90}{30/\sqrt{500}}\right) \approx P\left(|Z| > 67.1\right) \approx 0.00000$$
Since $p<\alpha$, we can reject $H_0$.
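For reference, the z-score and the two-sided p-value can be reproduced with a few lines of Python. This is a minimal sketch under $H_0$ as stated above; the function name `two_sided_p_value` is hypothetical, and the identity $P(|Z| > z) = \operatorname{erfc}(z/\sqrt 2)$ for a standard normal $Z$ is used in place of a table lookup.

```python
import math


def two_sided_p_value(sample_mean: float, mu0: float, sigma0: float, n: int):
    """Return (z, p) for a two-sided z-test of the sample mean under H0."""
    se = sigma0 / math.sqrt(n)            # standard error under H0
    z = abs(sample_mean - mu0) / se       # standardized deviation
    p = math.erfc(z / math.sqrt(2.0))     # P(|Z| > z) for standard normal Z
    return z, p


z_obs, p_value = two_sided_p_value(sample_mean=690.0, mu0=600.0, sigma0=30.0, n=500)
print(round(z_obs, 1), p_value)
print(p_value < 0.05)
```

For a z-score this large, the true p-value is smaller than the smallest positive double-precision number, so the computed value underflows to zero. That matches the $\approx 0.00000$ above.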

Conclusion
The daily turnover $X$ of cafes has an expected value $\mu_X\ne 600$ Euro and/or a standard deviation $\sigma_X\ne 30$ Euro ($p < 0.00001$).

#### tkhunny

##### Well-known member
MHB Math Helper
Let's also not forget the Finite Population Correction Factor. 125/600 = 21%ish. The standard rule is about 5% max. Something must be done. If you have the right circumstances, your analysis can suggest sampling more than the entire population unless you correct it for the absence of an infinite population.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
If we're sampling a significant fraction of the population, doesn't $\sigma_{\overline X}$ just go down even further?
That would only make the rejection of the null hypothesis more significant.
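To illustrate the effect of the finite population correction on the standard error, here is a small Python sketch. It uses the usual correction factor $\sqrt{(N-n)/(N-1)}$ for sampling without replacement. The thread does not state the population size, so $N = 2000$ cafes is a purely hypothetical value, and `standard_error_with_fpc` is a made-up helper name.

```python
import math


def standard_error_with_fpc(sigma: float, n: int, population_size: int) -> float:
    """Standard error of the mean for sampling without replacement from N units."""
    if not 1 <= n <= population_size:
        raise ValueError("sample size must be between 1 and the population size")
    se = sigma / math.sqrt(n)
    if population_size == 1:
        return 0.0  # a census of a single unit has no sampling error
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return se * fpc


N_HYPOTHETICAL = 2000  # assumed population size, not given in the thread
se_plain = 30.0 / math.sqrt(500)
se_corrected = standard_error_with_fpc(30.0, 500, N_HYPOTHETICAL)
print(round(se_plain, 3), round(se_corrected, 3))
```

Under this assumption the corrected standard error is smaller than $\sigma_X/\sqrt n$, and it drops to zero when the whole population is surveyed.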

#### tkhunny

##### Well-known member
MHB Math Helper
You cannot extend this argument forever with a finite population.

Even without the practical concerns, with a finite population, the assumption of Normality is less clearly met. Worrying about Bias in your estimator leads to a different and unbiased test statistic. Guess what one such test statistic is? 