# calculating probablity that random subset of population contains duplicates

##### New member
Hi,

Apologies that this is basic question but I have to start somewhere! (-:

The problem is succinctly stated in the msg title but, in greater detail; I'm working with some biological data from which samples have been taken. The sampling should have been at random. The samples include duplicates. What I need to know is how to calculate the expected number of duplicates in a sample size drawn from a population size.

For example, if I have a population size, p, of 3 million, and take 3 million samples, s, then the extent of duplicates within the samples s would be expected to be greater than if I take 300thousand samples.

But how do I calculate the expected rate given various values of p and s?
I have access to R & should be able to find my way to any libraries which might be helpful in answering this. Thanks

m

#### Mr Fantastic

##### Member
Hi,

Apologies that this is basic question but I have to start somewhere! (-:

The problem is succinctly stated in the msg title but, in greater detail; I'm working with some biological data from which samples have been taken. The sampling should have been at random. The samples include duplicates. What I need to know is how to calculate the expected number of duplicates in a sample size drawn from a population size.

For example, if I have a population size, p, of 3 million, and take 3 million samples, s, then the extent of duplicates within the samples s would be expected to be greater than if I take 300thousand samples.

But how do I calculate the expected rate given various values of p and s?
I have access to R & should be able to find my way to any libraries which might be helpful in answering this. Thanks

m
If I understand the problem correctly, then I think you should take a look at the hypergeometric distribution (use your prefered search engine).