N% uniqueness of x columns * y rows of output

thallstd · Apr 4, 2011

Hi all. I need more brain power than I can muster.

This is a two part question for a software function I need to write.

I'm trying to

1) pre-determine how many n% unique ways there are to intermix items from various lists and
2) produce output of only combinations that are n% unique

For example, assume 8 list with varying items:

Size list (3 items): small, medium, large
Color list (7 items): orange, blue, red, purple, yellow, green, black
Weight list (7 items): 4oz, 12oz, 1lb, 4lb, 10lb, 15lb
Age list (5 items): ancient, antique, very old, old, new
Temp list (3 items): cold, warm, hot
Animal list (6 items): dog, cat, mouse, bear, horse, zebra
Thing list (4 items): ball, chair, TV, floor, lamp
Action list (3 items): run, jump, hop

With these 8 lists there are (3*7*7*5*3*6*4*3) 158,760 combinations of possible unique output. ie

1) small orange 4oz ancient cold dog ball run
2) small orange 4oz ancient cold dog ball jump
3) small orange 4oz ancient cold dog ball hop
4) small orange 4oz ancient cold dog chair run
5) small orange 4oz ancient cold dog chair jump
6) small orange 4oz ancient cold dog chair hop
... etc

Each list occupies one column in the ouput. So each column reflects 12.5% uniqueness or sameness.

Comparing any of the first 3 outputs to each other, they are all only 12.5% unique since they are identical except for the Action column - run, jump, hop. The same is true if you compare outputs 4 thru 6 with each other.

If we compare output 1 to output 4 it is again only 12.5% unique since it varies only by the Thing column - ball vs chair. If we compare output 1 to outputs 5 or 6, it is 25% unique - the first 6 columns are identical, only the last 2 of output 1 differ from the last 2 of outputs 5 & 6 - ball run vs chair jump and chair hop.

If I were looking for combinations that were at least 25% unique, from these first 6 I could choose (1,5), (1,6), (2,4), (2,6), (3,4) or (3,5).

The first question: Is there is a formula, recursive or not, that can determine how many combinations there are that would be at least n% unique? It would need to work with varied data sets, each having a different number of lists numbering in the hundreds with each list containing from 1 to hundreds, possibly thousands of items.

The second question: Is there an algorithm that would allow the generation of only n% unique output? So instead of generating all possible combinations then going back and comparing them (as I did manually in the example above), to determine as each output row is being generated which columns should be populated with which item from that list.

My fall-back is to generate all possible combinations then go through and programmatically compare each output to all the others, which could take quite a long time to run. Any assistance that could speed up that effort would be greatly appreciated.

Stephen Tashi · Apr 7, 2011

You've done a better job than most people who ask real world questions in math forums, but a few things need to be clarified.

When you ask for the number of "combinations" of these property lists that meet the requirement that they differ by exactly k columns, do you mean "combinations of two"? Or are you interested in larger groupings ? 3? 4? etc.?

( It's simpler to talk about the number of entries where these lists differ than to use a percentage measure.)

thallstd · Apr 7, 2011

Hi Stephen,

By combinations I mean output rows, where each row has one column for each list and contains one element from that list.

The # of columns that must be unique will vary depending on how many lists there are and what % of uniqueness is required.

If there are 100 lists and 40% uniqueness is asked for, then each output row would have to differ from every other output row by at least 40 columns. If 200 lists and 40% then 80 columns... etc.

Stephen Tashi · Apr 7, 2011

thallstd said:

Hi Stephen,

By combinations I mean output rows, where each row has one column for each list and contains one element from that list.

You still aren't making it clear whether "output rows" means two rows (i.e. a pair of rows, as in your example) or whether it might mean a group of 3 rows, each of which differed by, say, 2 columns from the other rows in the group.

thallstd · Apr 7, 2011

It is n rows.

If I have 100 lists that average 25 entries each and someone want to generate all possible rows that are 40% unique, how many rows is that?

Question 1: how to compute n.

Question 2. can each column of each row be algorithmically determined?

Stephen Tashi · Apr 8, 2011

I don't know if you intended to answer my question.

As I understand your problem, there is no simple algorithm to determine the number of combinations of lists that have a given property since we don't know in advance what types of lists will exist. The lists that you get as input presumably won't have an example of every possible list that could exist.

A computer program would have to duplicate the work that you did manually. This is certainly possible. Any sane implementation of it is going to use a variety of different functions, so it won't look like one simple algorithm.

N% uniqueness of x columns * y rows of output

Related to N% uniqueness of x columns * y rows of output

What does "N% uniqueness of x columns * y rows of output" mean?

Why is it important to measure the uniqueness of data?

How is the N% uniqueness of x columns * y rows of output calculated?

What factors can affect the uniqueness of data?

Can the N% uniqueness of x columns * y rows of output be improved?

Similar threads

Hot Threads

Recent Insights