Calculating probability that random subset of population contains duplicates

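In summary: The expected number of duplicates in a sample is not simply the sample size divided by the population size; it depends on how the sampling was done. If the s samples are drawn independently and uniformly at random with replacement from a population of size p, the expected number of distinct items is p(1 - (1 - 1/p)^s), so the expected number of duplicates (draws that repeat an item already drawn) is s - p(1 - (1 - 1/p)^s). In the example from the thread, with p = 3 million, taking s = 3 million samples gives about 1.1 million expected duplicates, while taking s = 300 thousand samples gives about 14,500.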
  • #1
mads1
Hi,

Apologies that this is a basic question but I have to start somewhere! (-:

The problem is succinctly stated in the message title but, in greater detail: I'm working with some biological data from which samples have been taken. The sampling should have been at random. The samples include duplicates. What I need to know is how to calculate the expected number of duplicates in a sample of a given size drawn from a population of a given size.

For example, if I have a population size, p, of 3 million, and take 3 million samples, s, then the number of duplicates within the samples would be expected to be greater than if I take 300 thousand samples.

But how do I calculate the expected rate given various values of p and s?
I have access to R & should be able to find my way to any libraries which might be helpful in answering this. Thanks

m
 
  • #2
If I understand the problem correctly, then I think you should take a look at the hypergeometric distribution (use your preferred search engine).
 
  • #3
Hi Mads,

What do you mean by a "duplicate"? Do you mean it's like you caught a fish, threw it back into the lake, and then caught the same fish again? Or is it like catching another fish of the same species? And to pursue the fishing analogy further, do you return the fish to the lake ("sampling with replacement"), or do you keep it ("sampling without replacement")?
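
If it turns out to be the first case (the same fish caught again, with the fish returned to the lake each time), the expected number of duplicates has a closed form. The sketch below is only an illustration under that assumption: it takes s independent, uniform draws with replacement from p items and counts a "duplicate" as any draw that repeats an item already drawn. The function name `expected_duplicates` is made up for this example, and Python is used here rather than R.

```python
import math


def expected_duplicates(p: int, s: int) -> float:
    """Expected number of draws that repeat an already-drawn item.

    Assumption (not confirmed in the thread): s independent, uniform
    draws WITH replacement from a population of p distinct items.
    """
    if p <= 0 or s < 0:
        raise ValueError("need p > 0 and s >= 0")
    if p == 1:
        # Only one item exists, so every draw after the first is a repeat.
        return float(max(s - 1, 0))
    # P(a given item is never drawn) = (1 - 1/p)**s.
    # Expected distinct items = p * (1 - (1 - 1/p)**s), computed with
    # log1p/expm1 to avoid rounding trouble when p is large.
    expected_distinct = -p * math.expm1(s * math.log1p(-1.0 / p))
    return s - expected_distinct


print(expected_duplicates(3_000_000, 3_000_000))  # about 1.10 million
print(expected_duplicates(3_000_000, 300_000))    # about 14,500
```

If the sampling was without replacement, the same item cannot come up twice at all, so any duplicates in the data would have to come from somewhere else.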
 

Related to Calculating probability that random subset of population contains duplicates

1. What is the formula for calculating the probability of a random subset containing duplicates?

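The formula for calculating the probability of a random subset containing duplicates is: P(duplicates) = 1 - (n! / (n^k * (n-k)!)), where n is the size of the population and k is the size of the subset. This assumes the k items are drawn uniformly at random with replacement and that k is at most n; for k greater than n, the probability is 1.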
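
For large n the factorials in this formula overflow quickly, so in practice it is easier to multiply the k factors (n - i) / n directly. The following is a minimal sketch of that idea, assuming uniform draws with replacement; the function name `prob_any_duplicate` is hypothetical, and the classic birthday problem (n = 365, k = 23) is used as a sanity check.

```python
def prob_any_duplicate(n: int, k: int) -> float:
    """P(at least one repeated item) in k uniform draws with replacement
    from a population of n distinct items."""
    if n <= 0 or k < 0:
        raise ValueError("need n > 0 and k >= 0")
    if k > n:
        return 1.0  # more draws than items: a repeat is certain
    # P(all k draws distinct) = n/n * (n-1)/n * ... * (n-k+1)/n,
    # which equals n! / (n**k * (n-k)!) without computing factorials.
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct


print(round(prob_any_duplicate(365, 23), 4))  # 0.5073
```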

2. Can you provide an example of how to use the formula to calculate the probability?

For example, if we have a population of 10 people and we randomly select 5 people with replacement, the probability of that selection containing duplicates would be: P(duplicates) = 1 - (10! / (10^5 * (10-5)!)) = 1 - 0.3024 = 0.6976, or about 70%.

3. How does the size of the population and subset affect the probability of duplicates?

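The two sizes work in opposite directions. For a fixed subset size, a larger population lowers the probability of duplicates, because there are more unique options and less chance of selecting the same item more than once. For a fixed population, a larger subset raises the probability, because every extra draw is another chance to hit an item that has already been selected.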
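
One way to check how the two sizes affect the probability is to tabulate it for a few combinations. The sketch below uses the standard approximation P(duplicates) ≈ 1 - exp(-k(k-1)/(2n)), which is reasonable when k is much smaller than n and assumes draws with replacement; the function name and the particular values of n and k are chosen only for illustration.

```python
import math


def approx_prob_any_duplicate(n: int, k: int) -> float:
    """Approximate P(at least one repeat) in k draws with replacement
    from n items, using 1 - exp(-k*(k-1) / (2*n)).

    This is an approximation, most accurate when k is much smaller than n.
    """
    if n <= 0 or k < 0:
        raise ValueError("need n > 0 and k >= 0")
    return -math.expm1(-k * (k - 1) / (2.0 * n))


# Print a small grid: population size n, subset size k, approximate probability.
for n in (1_000, 1_000_000):
    for k in (10, 100, 1_000):
        print(f"n={n:>9,} k={k:>5,} P~{approx_prob_any_duplicate(n, k):.4f}")
```

In this sketch the probability rises with k for a fixed n and falls as n grows for a fixed k.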

4. Is there a way to decrease the probability of duplicates in a random subset?

Yes, one way to decrease the probability is by decreasing the size of the subset. Another way is by increasing the size of the population, as this will increase the number of unique options to choose from. If the sampling procedure can be changed, sampling without replacement removes duplicates entirely.
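
The difference between the two sampling schemes can be seen with Python's standard `random` module, where `choices` draws with replacement and `sample` draws without. This is a small illustrative sketch; the population of 1,000 integer labels, the subset size of 100, the fixed seed, and the helper `count_duplicates` are arbitrary choices for the example.

```python
import random


def count_duplicates(draws) -> int:
    """Number of draws that repeat an item already seen earlier."""
    return len(draws) - len(set(draws))


rng = random.Random(0)    # fixed seed so the run is repeatable
population = range(1000)  # hypothetical population of 1,000 labelled items
k = 100

with_replacement = rng.choices(population, k=k)   # repeats are possible
without_replacement = rng.sample(population, k)   # repeats are impossible

print("with replacement:   ", count_duplicates(with_replacement))
print("without replacement:", count_duplicates(without_replacement))
```

The second count is always zero, whatever the seed, because `sample` never returns the same position in the population twice.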

5. Are there any real-world applications for calculating the probability of duplicates in a random subset?

Yes, this concept is often used in statistical analysis and data mining in order to determine the likelihood of duplicate data points in a sample. This can be helpful in identifying errors or anomalies in a dataset, or in predicting the accuracy of a sample in representing the larger population.
