Probability of Duplicate Identifiers in a Set of Components

In summary, the probability of having two components with the same identifier (duplicate) is 1-1/Nc, assuming N>>C. The probability of having one component assigned to an identifier is L*e^{-L}, where L is the average number of components per identifier. The number of identifiers with exactly one component is N1=N*L*e^{-L}. The percentage of components uniquely assigned to an identifier is given by e^{-L}.
  • #1
DaanV
26
0
Mod note: Removed "Basic calculus" from thread title, as the question doesn't seem to have anything to do with calculus

Homework Statement


Real world kind of thing.

Assume I have a set of C components that I want to investigate.
I want to give each component a unique identifier. I have N identifiers.

What is the probability of having two components with the same identifier (duplicate), given N and C?

Homework Equations


What I'm looking for!

The Attempt at a Solution


So let's say one component gets some identifier. The probability that none of the other components get the same identifier would be:
(1- 1/N)C, assuming N>>C (how much bigger would N have to be than C for this to be a valid assumption, btw?)

However, by my logic this doesn't yet exclude the possibility that any of the other components have duplicate identifiers to one another. This is where I'm getting stuck.

Thanks in advance for any help provided!
 
Last edited by a moderator:
Physics news on Phys.org
  • #2
DaanV said:

Homework Statement


Real world kind of thing.

Assume I have a set of C components that I want to investigate.
I want to give each component a unique identifier. I have N identifiers.

What is the probability of having two components with the same identifier (duplicate), given N and C?

Homework Equations


What I'm looking for!

The Attempt at a Solution


So let's say one component gets some identifier. The probability that none of the other components get the same identifier would be:
(1- 1/N)C, assuming N>>C (how much bigger would N have to be than C for this to be a valid assumption, btw?)

However, by my logic this doesn't yet exclude the possibility that any of the other components have duplicate identifiers to one another. This is where I'm getting stuck.

Thanks in advance for any help provided!

You might want to look at this. https://en.wikipedia.org/wiki/Birthday_problem
 
  • Like
Likes DaanV
  • #3
Well.. That just goes to show I should probably dive into Wikipedia first next time. :)
Thanks for your help, that solved it perfectly.
 
  • #4
DaanV said:
I want to give each component a unique identifier.
If the "birthday solution" applies, then didn't you mean, "I want to give each component a randomly chosen identifier"?
 
  • #5
Sorry, yes, I do suppose I meant that.
More completely:
"I am giving each component a randomly chosen identifier. I would like to give each component a unique identifier. How can I determine the number of unique identifiers N, so that I have at least 99% odds of having no duplicate identifiers?"

I think I have that figured out though. Thanks for clearing it up. :)
 
  • #6
Thread revival!

So I’ve been mulling this one over a bit more. I have come to the conclusion that the previous answer, although correct, actually wasn’t what I was looking for. The requirement of X% odds of having zero duplicates is too stringent for my application. So please allow me to introduce a new (and entirely related!) problem.

Mind you, I think I have the correct answer. I’m asking for feedback on if I’m making stupid assumptions. So here goes, thanks in advance!

Homework Statement


Real world kind of thing.

Assume I have a set of C components that I want to investigate.
I am giving each component a random identifier. I have N identifiers.
The ‘loading’ (L) is the average number of components per identifier.
(e.g. if there are 500 components and 100 identifiers, L=C/N=5)

The problem:
How small must L be (or how much greater the number of identifiers than the number of components) for >= 99% of components to be assigned to a unique identifier?

Homework Equations


The partitioning of components into identifiers is described by a Poisson distribution (assumption! Is this likely to be correct?), where P(k;L) is the probability for there to be k components in one identifier given an average loading L.
##P(k;L) = \frac{L^k*e^{-L}}{k!}##

The Attempt at a Solution


Fraction of identifiers with exactly one component assigned to them is
##P(1;L) = \frac{L^1*e^{-L}}{1!} = L * e^{-L}##

To get number of identifiers with exactly 1 component (N1), multiply by N
##N1 = N * L * e^{-L} = C * e^{-L}##

The number of identifiers with exactly 1 component corresponds to the number of components that are uniquely assigned to an identifier (N1 = Cunique). The percentage of components uniquely assigned to an identifier is then given by
##\frac{Cunique}{C} = e^{-L}##

So that the critical value of 99% unique is obtained when:
##L = -ln(0.99) = 0.01005## components/identifier
In other words, as long as there are at least ~100x as many identifiers as components, I can expect 99% of the components to be uniquely assigned to an identifier.

Is this correct?
The outcome looks.. surprisingly simple. Did I overlook something somewhere? Did I take a tremendous detour to get to this answer?
Thanks in advance for taking the time!

Regards, DaanV
 
  • #7
DaanV said:
Thread revival!

So I’ve been mulling this one over a bit more. I have come to the conclusion that the previous answer, although correct, actually wasn’t what I was looking for. The requirement of X% odds of having zero duplicates is too stringent for my application. So please allow me to introduce a new (and entirely related!) problem.

Mind you, I think I have the correct answer. I’m asking for feedback on if I’m making stupid assumptions. So here goes, thanks in advance!

Homework Statement


Real world kind of thing.

Assume I have a set of C components that I want to investigate.
I am giving each component a random identifier. I have N identifiers.
The ‘loading’ (L) is the average number of components per identifier.
(e.g. if there are 500 components and 100 identifiers, L=C/N=5)

The problem:
How small must L be (or how much greater the number of identifiers than the number of components) for >= 99% of components to be assigned to a unique identifier?

Homework Equations


The partitioning of components into identifiers is described by a Poisson distribution (assumption! Is this likely to be correct?), where P(k;L) is the probability for there to be k components in one identifier given an average loading L.
##P(k;L) = \frac{L^k*e^{-L}}{k!}##

The Attempt at a Solution


Fraction of identifiers with exactly one component assigned to them is
##P(1;L) = \frac{L^1*e^{-L}}{1!} = L * e^{-L}##

To get number of identifiers with exactly 1 component (N1), multiply by N
##N1 = N * L * e^{-L} = C * e^{-L}##

The number of identifiers with exactly 1 component corresponds to the number of components that are uniquely assigned to an identifier (N1 = Cunique). The percentage of components uniquely assigned to an identifier is then given by
##\frac{Cunique}{C} = e^{-L}##

So that the critical value of 99% unique is obtained when:
##L = -ln(0.99) = 0.01005## components/identifier
In other words, as long as there are at least ~100x as many identifiers as components, I can expect 99% of the components to be uniquely assigned to an identifier.

Is this correct?
The outcome looks.. surprisingly simple. Did I overlook something somewhere? Did I take a tremendous detour to get to this answer?
Thanks in advance for taking the time!

Regards, DaanV
That's not really valid because the probabilities of how many get assigned to a given label are not independent.
Also, you seem to be missing a parameter. You want the probability that 99% are unique.

This is quite a hard problem. Let's start with a simpler question. Given R distinct objects and N distinct buckets, how many ways are there of assigning objects to buckets such that no bucket gets exactly one object?
Do you know the principle of inclusion and exclusion?
 
  • Like
Likes DaanV
  • #8
Thinking about this some more, I suggest making an approximation appropriate to the question asked.
Since you are interested in the case where there are very few duplicates, the probability of three or more getting the same label is vanishingly small.
So consider this:
r distinct objects are placed in n distinct buckets. In how many ways can t <= r/2 buckets get exactly two, and no bucket get more than two?
 

Related to Probability of Duplicate Identifiers in a Set of Components

What is the "probability of duplicates"?

The probability of duplicates refers to the likelihood that two or more objects, events, or individuals will have the same value, characteristic, or outcome.

How is the probability of duplicates calculated?

The probability of duplicates is calculated by dividing the number of possible duplicate outcomes by the total number of possible outcomes. For example, if there are 10 objects and 2 of them have the same value, the probability of duplicates would be 2/10 or 20%.

What factors affect the probability of duplicates?

The probability of duplicates can be affected by various factors such as sample size, number of possible outcomes, and the likelihood of each outcome occurring. It can also be influenced by external factors such as randomness and chance.

Why is understanding the probability of duplicates important?

Understanding the probability of duplicates is important because it can help in making informed decisions and predicting outcomes. It is also a fundamental concept in statistics and plays a crucial role in many scientific fields such as genetics, epidemiology, and economics.

How can the probability of duplicates be used in research or experiments?

The probability of duplicates can be used in research or experiments to design sampling strategies, estimate sample sizes, and evaluate the reliability of results. It can also be used to assess the validity of study findings and identify potential sources of error or bias.

Similar threads

  • Precalculus Mathematics Homework Help
Replies
7
Views
1K
  • Precalculus Mathematics Homework Help
2
Replies
58
Views
3K
  • Precalculus Mathematics Homework Help
Replies
29
Views
2K
  • Special and General Relativity
Replies
15
Views
640
Replies
12
Views
779
  • Set Theory, Logic, Probability, Statistics
Replies
6
Views
1K
  • Precalculus Mathematics Homework Help
Replies
16
Views
2K
  • Precalculus Mathematics Homework Help
Replies
3
Views
1K
  • Introductory Physics Homework Help
Replies
18
Views
1K
  • Precalculus Mathematics Homework Help
Replies
12
Views
3K
Back
Top