Determining Sample Size for Statistical Significance

In summary: jessica (smudge) has visitor and conversion counts for two web pages and would like to state, with some probability, that the ratio of visitors to conversions will hold as the number of visitors grows. She is currently using a chi-square test; the replies below discuss parametric vs. non-parametric tests, regressing the conversion ratio on visitor counts and time, and a sample-size formula based on the binomial variance.
  • #1
smudge
Hello everybody,

I'm a bit stuck on a statistical significance problem. I have the following data: The number of visitors for each of 2 web pages, and a number of conversions for each web page. (A conversion could be the number of visitors who completed the web form.) I would like to be able to say that there is X probability that the ratio of visitors to conversions will remain the same as number of visitors increases. Is this possible? Intuitively it seems impossible, since I could imagine a scenario where the composition of visitors changes drastically due to a link from a popular site. In other words, I don't know how representative my sample is of the population of possible visitors.

I am currently using the chi-square method to determine the probability that my data is not randomly distributed, but would like have a stronger significance measurement.
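
For concreteness, a rough sketch of the kind of test I mean (placeholder counts, not my real data; assuming Python with SciPy):

```python
# Rough sketch of a chi-square test on two pages' conversion counts.
# The numbers are placeholders; requires SciPy.
from scipy.stats import chi2_contingency

# Page A: 1000 visitors, 50 conversions; Page B: 1200 visitors, 78 conversions.
table = [[50, 1000 - 50],   # page A: [converted, did not convert]
         [78, 1200 - 78]]   # page B: [converted, did not convert]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")  # small p => the two rates likely differ
```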

If anyone can point me in the right direction with a link or the name of an algorithm, I would appreciate it.

Thanks,

jessica
 
  • #2
You seem to be addressing two separate but related issues:
1. Non-randomness: chi-sq. is a test of "randomness" in the sense that if your observations are not uniformly distributed then they probably have an underlying probability distribution which is different from the uniform distribution. (I think in your case you can also use the binomial test.)
2. Significance: suppose you hypothesized that your observations (a.k.a. sample or data) have a Poisson distribution (as in "customer arrival" data, which is usually assumed to be Poisson). See also. Poisson has the peculiar characteristic that mean = variance. Let v represent the mean (and the variance). Now, you can use a "best fit" method (e.g. maximum likelihood) to estimate the value of the v parameter. See also. Suppose your estimated v turned out to be 3. This means that based on your data, you expect the population mean to be 3. Then you can obtain the probability with which, e.g., 2 < v < 4, or 2.9 < v < 3.1, or 2.99 < v < 3.01 (or any other interval centered on 3), by using the t test. Note that in the t-test your measure of the standard deviation (s.d.) is the sample s.d. divided by the square root of the number of observations (a.k.a. sample size, or n), because s.d.(of v) = (sample s.d.)/(square root of n) = square root of (v/n). The s.d. of v (or of any estimated parameter) is a.k.a. the standard error (s.e. for short).
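
To make the mechanics concrete, here is a small sketch (invented arrival counts, assuming Python with NumPy and SciPy) of estimating v by maximum likelihood and then computing the probability of an interval around it with the t distribution:

```python
# Sketch of the Poisson "best fit" idea; the counts below are invented.
import numpy as np
from scipy import stats

arrivals = np.array([2, 4, 3, 5, 1, 3, 2, 4, 3, 3])  # e.g. conversions per hour
n = len(arrivals)
v_hat = arrivals.mean()                 # MLE of the Poisson parameter v
se = np.sqrt(v_hat / n)                 # standard error, since variance = mean for Poisson

# probability that the true mean lies in an interval centered on v_hat,
# using the t distribution as described above
lo, hi = v_hat - 1.0, v_hat + 1.0       # e.g. the interval (2, 4) when v_hat = 3
prob = stats.t.cdf((hi - v_hat) / se, df=n - 1) - stats.t.cdf((lo - v_hat) / se, df=n - 1)
print(f"v_hat = {v_hat:.2f}, P({lo} < v < {hi}) = {prob:.3f}")
```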
 
  • #3
Thanks for your post, Enuma. It was exactly the type of thing I was looking for. I have spent some time researching these links, and have a few follow-up questions:

The t-test is the most commonly used method to evaluate the differences in means between two groups.
-- http://www.statsoft.com/textbook/stbasic.html#t-test (on "t-test for independent samples")

Isn't this the same thing that a Chi-Square measures? It seems to me that the main difference between the t-test and chi-square is that chi-square is a non-parametric test, meaning that it is applicable without making as many assumptions about the distribution. Can somebody confirm or deny this?

Now, assuming the above is true, I still haven't found a method that will allow me to say: there is X probability that the ratio of visitors to conversions will remain the same as the number of visitors increases. I have had a gut feeling all along that it isn't mathematically possible to make such a statement, but I would appreciate anyone who can convince me that I'm wrong.

Thanks,

jessica
 
  • #4
I agree with your statement that Chi-sq. is a non-parametric test. Like the t-test, Chi-sq. measures the degree of closeness between two distributions. The difference is that in a t-test the two distributions are parametrized by their means (or locations), whereas in a Chi-sq. test the distributions are represented by the number of items falling into each category.

http://www.statsoft.com/textbook/stbasic.html#spearson
http://en.wikipedia.org/wiki/Pearson's_chi-square_test

See also: http://en.wikipedia.org/wiki/Chi-square_distribution
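
A quick numerical illustration of the relationship (hypothetical counts, Python with SciPy and statsmodels): on a 2x2 table, Pearson's chi-square and a two-proportion z-test are two views of the same comparison, with chi2 = z^2 when no continuity correction is applied.

```python
# Compare a two-proportion z-test and a Pearson chi-square test on the same 2x2 data.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

conv = np.array([50, 78])          # conversions on page A and page B (made up)
visits = np.array([1000, 1200])    # visitors on page A and page B (made up)

z, p_z = proportions_ztest(conv, visits)                 # pooled two-proportion z-test
table = np.column_stack([conv, visits - conv])           # [converted, not converted] per page
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.3f}, chi2 = {chi2:.3f}")             # these should match
print(f"p (z-test) = {p_z:.4f}, p (chi-square) = {p_chi2:.4f}")
```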

My first interpretation of your 2nd question was in terms of a confidence interval, i.e. you'd like to say something like "with X probability, the expected visitor/conversion ratio is between a and b" or Prob(a < E[#V/#C] < b) = X%. I guess that interpretation is not entirely accurate, wouldn't you say? I guess you'd like to say something like "the bounds a and b are invariant with respect to the sample size, with a certain probability."

One idea is to calculate growth rates over time, then test that they are zero. More formally, you could estimate the regression equation #V/#C = A + B1 n [+ B2 n^2 + ... etc.], where n is the number of visitors, then test B1 = [B2 = ... =] 0 (which will be given by the regression F-statistic). [The square brackets indicate optional terms.] Caution: #V/#C may not have a normal distribution, and regression analysis (e.g. the Ordinary Least Squares routine in canned software) assumes normally distributed error terms, and hence a (conditionally) normal left-hand-side variable. See this on non-normal error terms of a regression.
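
A hedged sketch of that regression idea (made-up daily figures, Python with statsmodels):

```python
# Regress the ratio #V/#C on the number of visitors and read off the F-statistic
# for the hypothesis that the slope is zero. The data are invented.
import numpy as np
import statsmodels.api as sm

visitors = np.array([120, 150, 180, 200, 260, 300, 350, 400])          # n per day
ratio    = np.array([10.1, 9.8, 10.5, 10.2, 9.9, 10.4, 10.0, 10.3])    # #V/#C per day

X = sm.add_constant(visitors)        # model: A + B1 * n
model = sm.OLS(ratio, X).fit()
print(model.summary())               # the regression F-statistic tests B1 = 0
print(f"F = {model.fvalue:.3f}, p = {model.f_pvalue:.4f}")
```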
 
  • #5
the bounds a and b are invariant with respect to the sample size, with a certain probability.
Yes, exactly.

After googling for "statistical unbiasedness" and "statistical consistency", I suspect what I am trying to show is that a and b are statistically consistent with some probability. However, I wasn't able to find any pages that demonstrate how to calculate statistical consistency. Do you know of any pages that might help?

jessica
 
  • #6
See, e.g., http://en.wikipedia.org/wiki/Consistent_estimator#Consistency

For example, suppose a researcher is taking random sample(s) from a normally distributed population, and calculating the arithmetic average of the sample. The formula with which the arith. aver. is being calculated is a consistent estimator of the population mean, because we can picture that as the sample size approaches the population size (here, infinity), the sample average will approach the true mean.
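
A toy simulation of that picture (my own illustration, Python with NumPy): the sample average of normal draws settles toward the true mean as n grows.

```python
# Consistency illustration: the sample mean approaches the true mean as n increases.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 3.0
for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    print(f"n = {n:>9}: sample mean = {sample.mean():.4f}")
```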

Unbiasedness, combined with a variance that shrinks to zero as the sample grows, implies consistency. Also see #3 here.

Under "mild" assumptions, the ordinary least squares (see also) estimators of the A and the B(k) coefficients in the regression equation y = A + B(1) n [+ ... + B(K) n^K], where k = 1, ..., K, are unbiased (ergo, consistent) estimators of the true relationship between the LHS and the RHS.

BTW, a and b aren't estimators, they are just bounds (constants), so they are not subject to consistency -- they are consistent by definition. What you need to show consistent is the estimator that goes between those bounds, i.e. the average #V/#C.
 
  • #7
After talking this problem through, someone else has suggested the following formula, which he says will allow me to calculate the sample size needed in order to say that my conversion rate will remain the same (with a certain confidence) as time increases. This method seems much simpler than showing that the estimator of the conversion rate is consistent.

Can anyone verify that the following reasoning is correct?
N = {(t)^2 * (p)(q)}/ (d)^2 (Sorry about the formatting, I'm not sure how to make the superscripts)

where I am solving for N, the sample size

t is the value for selected alpha level of .005 in each tail = 2.57

(p)(q) is the estimate of variance = .25 (maximum possible proportion (.5) * (1 - maximum possible proportion (.5)))

d is the acceptable margin of error for proportion being estimated = .01

which gives (2.57)^2 (.5)(.5) / (.01)^2 = 16,512.25

It seems to me that the variance estimate was pretty much pulled out of thin air. And I don't know what the formula is called, so I wasn't able to find reference to this formula elsewhere. But hey, what do I know? I'm a programmer, not a statistician.
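
Here is the arithmetic as I understand it, as a quick script (assuming Python with SciPy; please correct me if I've misread the formula):

```python
# Quick check of the suggested sample-size formula.
# z is the critical value for alpha = .005 in each tail (99% confidence).
from scipy.stats import norm

z = norm.ppf(1 - 0.005)   # about 2.576 (the 2.57 above is this value, rounded)
p = 0.5                   # maximum possible proportion, so worst-case variance p*q = .25
d = 0.01                  # acceptable margin of error

N = (z**2 * p * (1 - p)) / d**2
print(f"z = {z:.3f}, required N = {N:,.0f}")   # about 16,600 (16,512 with z = 2.57)
```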

Any help would be greatly appreciated.

jessica
 
  • #8
smudge -- First, if you disregard autocorrelations -- associations over time -- then you have a simple binomial distribution situation. If you consider associations over time, then life is a bit more tricky. For example, a first visit might not result in a conversion, but a second might. If, in fact, the sample was totally composed of the same visitors at each measurement, then the tested samples are not independent, and paired comparisons will be required. (This problem is a standard part of t-test theory; just do a Google search or hit a basic textbook. Note also that for a sample of 100 or more, the binomial test and the t-test are virtually identical, as the binomial distribution tends toward the normal distribution.) In this case, the smartest thing to do, in my opinion, is to look at both pooled and paired comparisons; if they are similar in results, you are pretty much home free. If not, it will be time to scramble a bit.
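
As a rough illustration of the pooled vs. paired point (invented daily rates, Python with SciPy):

```python
# Run both an independent-samples (pooled) and a paired t-test on the same data
# and compare. The daily conversion rates are made up.
import numpy as np
from scipy import stats

period1 = np.array([0.050, 0.047, 0.052, 0.049, 0.051, 0.048])   # same site, period 1
period2 = np.array([0.053, 0.049, 0.055, 0.050, 0.054, 0.051])   # same site, period 2

t_pooled, p_pooled = stats.ttest_ind(period1, period2)   # treats the samples as independent
t_paired, p_paired = stats.ttest_rel(period1, period2)   # pairs the days
print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"paired: t = {t_paired:.3f}, p = {p_paired:.4f}")
```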

If you compare more than 2 cases, the t-test must be replaced by a 1-way Analysis of Variance approach.
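
For example, a minimal sketch of that 1-way ANOVA for three pages (hypothetical rates, Python with SciPy):

```python
# One-way ANOVA across more than two pages; the conversion rates are made up.
from scipy.stats import f_oneway

page_a = [0.050, 0.047, 0.052, 0.049]   # daily conversion rates, page A
page_b = [0.053, 0.049, 0.055, 0.050]   # page B
page_c = [0.046, 0.044, 0.048, 0.045]   # page C

F, p = f_oneway(page_a, page_b, page_c)
print(f"F = {F:.3f}, p = {p:.4f}")       # small p => at least one page's rate differs
```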

Still without worrying about autocorrelations, there's a simple way to do many cases: estimate a regression equation of conversions against time and visitor number. If the coefficient of time is not significant, then there's no statistically important variation in conversion rates over time.

To do all this "right" you'll need to do something like the 5-Gold-Star approach outlined below. If not, then take your statistical lumps about bias, inefficient estimators, non-normal distributions, and so on, and do as most business analysts do: admit the approximations and assumptions, and get to the results. In fact, people have worried greatly about the efficacy of the practical assumptions used in day-to-day statistics -- in many instances, standard approximations work quite well. However, the issue of correlations over time, autocorrelations, is huge, important, and studied under the "Time Series" label. There's also a field, Robust Statistics, that studies better ways of dealing with less-than-perfect samples.

The 5-Gold-Star approach would be something like: take multiple measurements of conversion rates. Then remove the autocorrelations -- this gets you into basic time series, and Box-Jenkins or other models, which is important to help make the individual samples statistically independent. Then do an Analysis of Variance on the corrected points. This approach is, in fact, closely related to testing sales for the influence of spot advertising, of coupons or price promotions, testing the effectiveness of a change in quarterbacks on winning football games, and so on.

I've had a couple of bouts with non-parametric tests for survey research. After 30 years of doing and teaching statistics and analysis, starting with a PhD in physics, I found non-parametric tests to be difficult, although not impossible. There are a lot of such tests, sometimes the differences are subtle, and I found it a challenge to get the right tests for my work -- not to mention that authors are not always in agreement about the names of some of the NP tests. I used SAS, which has a good selection of NP tests, and fairly good documentation.

The moral: use big samples and the Law of Large Numbers -- in the promised land, all distributions are Gaussian.

After all that, I'll say that yours is a standard problem in business work. You could well find useful info in the business literature -- analysis, market research, forecasting, business statistics, ...; worth a Google. I've done 25-30 such problems over time. Chances are that binomial or t-tests will do the job. Simple is best.

Good luck and regards,
Reilly Atkinson
 
  • #9
Solving the confidence limits with respect to the sample size is a thoughtful idea. (See especially.) You could say that "as long as my sample exceeds this minimum sample size, the C/V ratio will stay within a lower bound a and an upper bound b, with probability X." The variance (and hence the standard deviation) is something that you can calculate as the sum of squared deviations from the average C/V. For each individual visitor V is always 1, and C is either 0 or 1, so you will be averaging a bunch of (0 - your average)^2 terms and another bunch of (1 - your average)^2 terms. In your post you used the binomial variance formula, which is pq = p(1-p), where p is your average C/V. I think that would be the preferred method in this case.
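
A tiny check of that variance point (synthetic 0/1 data, Python with NumPy): the sum-of-squared-deviations variance of the conversion indicators is exactly p(1-p).

```python
# For 0/1 conversion indicators, the plain variance equals the binomial formula p*(1-p).
import numpy as np

conversions = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])  # one entry per visitor (made up)
p_hat = conversions.mean()
print(np.var(conversions))        # average squared deviation of the 0/1 data
print(p_hat * (1 - p_hat))        # binomial formula pq -- identical value
```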

"PRO's": easy to calculate, defensible as "straight out of a textbook formula."

"CON": may not really address visitor heterogeneity over time. E.g., suppose "converters" are positively correlated with the cumulative number of visits (or time), or alternatively, suppose "converters" are negatively correlated with the cumulative visits (or time) -- either way, a large sample size will not help -- you will observe a systematic increase or alternatively a systematic decrease in conversions.

reilly said:
Still without worrying about autocorrelations, there's a simple way to do many cases: estimate a regression equation of conversions against time and visitor number. If the coefficient of time is not significant, then there's no statistically important variation in conversion rates over time.
I guess one way to write this would be (Equation 1):

conversions = a + b1 time [+ b2 time^2 + ...] + c1 visitors [+ c2 visitors^2 + ...] + d1 time * visitors [+ d2 time * visitors^2 + d3 time^2 * visitors + d4 time^2 * visitors^2 + ...]

When the objective is to test the conversion rate, the equation is a little trickier, as the # of visitors is now in the denominator on the left-hand side:

conversions/visitors = a + b1 time [+ b2 time^2 + ...],

so now let's multiply both sides by visitors, so that we get (Equation 2):

conversions = a visitors + b1 time * visitors [+ b2 time^2 visitors + ...]

Notice that Eq. 2 is structurally different from Eq. 1: the "correct" model has no intercept term and does not include higher powers of the visitors variable on its own. The relevant test statistic is the t-test on the "b1 * visitors" term [or the joint F-test of the "bk * visitors" terms for k = 1, 2, ...]; this corresponds to the total derivative with respect to the powers of time, NOT the partial derivative with respect to time, although the two coincide when no higher powers of time are included and a single t-test of "b1 * visitors" suffices. In all of these tests, the "visitors" factor is conventionally evaluated at the sample average number of visitors (e.g. per day).
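
A hedged sketch of estimating Equation 2 (invented daily data, Python with statsmodels): regress conversions on visitors and time*visitors with no intercept, then look at the t-test on the time*visitors term.

```python
# Equation 2: conversions = a*visitors + b1*time*visitors, fitted with no intercept.
# The daily data are invented.
import numpy as np
import statsmodels.api as sm

time        = np.arange(1, 11)                                            # day index
visitors    = np.array([120, 150, 180, 200, 260, 300, 350, 400, 420, 450])
conversions = np.array([6, 8, 9, 10, 13, 15, 17, 20, 21, 23])

X = np.column_stack([visitors, time * visitors])   # no constant, per Eq. 2
model = sm.OLS(conversions, X).fit()
print(model.summary())
print(f"t (b1) = {model.tvalues[1]:.3f}, p = {model.pvalues[1]:.4f}")  # test on time*visitors
```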

A further issue is whether to use cumulative conversions and cumulative visits, or C and V per unit time (e.g. per day). Using cumulatives may introduce a spurious-correlation problem, in that two entirely unrelated quantities will appear correlated when both accumulate over time. One example is "the cumulative amount of rainfall in Seattle" and "the cumulative amount of garbage collected in New York City." Although the two events are unrelated, they may appear correlated because both accumulate over time (the keyword being "cumulative"). For this reason, a "per unit time" measurement might be preferred.

Keep in mind that a linear regression can do anything that ANOVA does, and more -- although user preferences might differ. See this link for an "okay" primer; although it is not entirely accurate -- e.g., predictor variables (aka the X variables or "the X matrix") in a linear regression do NOT have to be continuous -- the trick is to figure out which binary (aka "indicator" or "dummy" or "zero/one") variables should be included as (or among) the predictors such that they exactly identify the categories being analyzed.
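
A small illustration of that claim (hypothetical rates, my own example in Python with statsmodels and SciPy): a regression on an intercept plus 0/1 dummies for the page categories reproduces the one-way ANOVA F-test.

```python
# Regression with category dummies reproduces the one-way ANOVA F-test.
import numpy as np
import statsmodels.api as sm
from scipy.stats import f_oneway

rates = np.array([0.050, 0.047, 0.052, 0.049,    # page A (made up)
                  0.053, 0.049, 0.055, 0.050,    # page B
                  0.046, 0.044, 0.048, 0.045])   # page C
page  = np.repeat(["A", "B", "C"], 4)

dummies = np.column_stack([(page == "B").astype(float),
                           (page == "C").astype(float)])   # page A is the baseline
ols = sm.OLS(rates, sm.add_constant(dummies)).fit()

F, p = f_oneway(rates[page == "A"], rates[page == "B"], rates[page == "C"])
print(f"regression F = {ols.fvalue:.3f}, ANOVA F = {F:.3f}")   # these agree
print(f"regression p = {ols.f_pvalue:.4f}, ANOVA p = {p:.4f}")
```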
 
  • #10
smudge said:
Can anyone verify that the following reasoning is correct?
N = {(t)^2 * (p)(q)}/ (d)^2 (Sorry about the formatting, I'm not sure how to make the superscripts)

where I am solving for N, the sample size

t is the value for selected alpha level of .005 in each tail = 2.57

(p)(q) is the estimate of variance = .25 (maximum possible proportion (.5) * (1 - maximum possible proportion (.5)))

d is the acceptable margin of error for proportion being estimated = .01

which gives (2.57)^2 (.5)(.5) / (.01)^2 = 16,512.25

It seems to me that the variance estimate was pretty much pulled out of thin air. And I don't know what the formula is called, so I wasn't able to find reference to this formula elsewhere.
jessica,

The formula is correct. See the Z-ratio formula here. If you solve Z = d/(sigma/sqrt(N)) for N, you get N = sigma^2 Z^2 / d^2. (d is the difference "X bar" - "mu", where the Greek letter "mu" rhymes with "hue" and denotes the "true" population mean.) Now, sigma^2 is the variance and is given by pq = p(1-p) = "X bar" times (1 - "X bar") for the Bernoulli distribution. However, the person you talked to must have been thinking "let us assume the worst that can happen in terms of the variance, i.e. suppose the variance has the largest possible value, and see what kind of N will be needed," and that must be the reason that your formula assumes a variance of 0.25. If you replace it with your actual estimated variance, "X bar" times (1 - "X bar") (which is necessarily < 0.25 for a Bernoulli distribution), your formula will give a smaller N.
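
For instance (hypothetical estimated proportion, Python with SciPy), replacing the worst-case variance 0.25 with an estimated p(1-p) shrinks the required N considerably:

```python
# Required sample size with the worst-case proportion vs. an estimated proportion.
from scipy.stats import norm

z = norm.ppf(1 - 0.005)     # two-tailed 99% critical value, about 2.576
d = 0.01                    # acceptable margin of error

for p in (0.5, 0.05):       # worst case vs. a hypothetical estimated 5% conversion rate
    N = z**2 * p * (1 - p) / d**2
    print(f"p = {p:.2f}: required N = {N:,.0f}")
```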

However, I am not sure whether you should be using a t distribution (the "t" in your formula) or a normal distribution (the "Z" in the "Z formula"). You should be using a "t" distribution (a.k.a. Student's t distribution) when you have an estimated standard deviation in the denominator -- however, in your case you are assuming a constant variance (0.25), and hence a constant s.d. (0.5). In effect, you are multiplying a normally distributed random variable d by the constant 1/(sigma/sqrt(N)), which produces a new normally distributed random variable, Z (a.k.a. the standard normal [random] variable, or the standard normal distribution). This indicates using a "standard normal table" rather than a "t table" when you look up the critical value (2.57 in your post).

This whole approach does not address visitor heterogeneity over time. If you are "okay" with the assumption that conversions are essentially random (i.e. visitors are homogeneous over time, and their preferences are time-invariant), then you might go with this formula.

I hope this is responsive and useful.
 

Related to Determining Sample Size for Statistical Significance

What is statistical significance?

Statistical significance is a measure of how unlikely the observed results of an experiment or study would be if chance alone were at work. It is usually expressed as a p-value, which represents the probability of obtaining results at least as extreme as those observed if the null hypothesis (the idea that there is no real difference between the groups being compared) is true. A p-value of less than 0.05 is typically considered statistically significant.

Why is statistical significance important?

Statistical significance is important because it allows researchers to determine whether the results of a study are likely to be true and reliable, or if they are simply due to chance. If a study's results are statistically significant, it suggests that the findings are not a fluke and can be generalized to a larger population.

How is statistical significance calculated?

Statistical significance is typically calculated using a statistical test, such as a t-test or ANOVA, which compares the observed data to what would be expected if the null hypothesis were true. The resulting p-value is then compared to a predetermined level of significance (usually 0.05) to determine if the results are statistically significant.

What is the difference between statistical significance and practical significance?

While statistical significance refers to the likelihood of obtaining results by chance, practical significance refers to the real-world importance or usefulness of those results. A study may have statistically significant results, but if the effect size is very small or the findings have no practical application, they may not be considered practically significant.

What are some limitations of statistical significance?

One limitation of statistical significance is that it does not indicate the strength or size of the effect being measured. Additionally, it is important to remember that statistical significance does not necessarily mean that a relationship or effect is causally related. Other factors, such as confounding variables, could be influencing the results. Finally, statistical significance is dependent on the sample size and may not accurately reflect the larger population being studied.
