N% uniqueness of x columns * y rows of output

In summary: If I wanted to generate all possible rows that are 80%... unique from the other rows, it would require 16000 rows.
  • #1
thallstd
3
0
Hi all. I need more brain power than I can muster.

This is a two part question for a software function I need to write.

I'm trying to


  • 1) pre-determine how many n% unique ways there are to intermix items from various lists and
    2) produce output of only combinations that are n% unique

For example, assume 8 list with varying items:

  • Size list (3 items): small, medium, large
    Color list (7 items): orange, blue, red, purple, yellow, green, black
    Weight list (7 items): 4oz, 12oz, 1lb, 4lb, 10lb, 15lb
    Age list (5 items): ancient, antique, very old, old, new
    Temp list (3 items): cold, warm, hot
    Animal list (6 items): dog, cat, mouse, bear, horse, zebra
    Thing list (4 items): ball, chair, TV, floor, lamp
    Action list (3 items): run, jump, hop

With these 8 lists there are (3*7*7*5*3*6*4*3) 158,760 combinations of possible unique output. ie

  • 1) small orange 4oz ancient cold dog ball run
    2) small orange 4oz ancient cold dog ball jump
    3) small orange 4oz ancient cold dog ball hop
    4) small orange 4oz ancient cold dog chair run
    5) small orange 4oz ancient cold dog chair jump
    6) small orange 4oz ancient cold dog chair hop
    ... etc

Each list occupies one column in the ouput. So each column reflects 12.5% uniqueness or sameness.

Comparing any of the first 3 outputs to each other, they are all only 12.5% unique since they are identical except for the Action column - run, jump, hop. The same is true if you compare outputs 4 thru 6 with each other.

If we compare output 1 to output 4 it is again only 12.5% unique since it varies only by the Thing column - ball vs chair. If we compare output 1 to outputs 5 or 6, it is 25% unique - the first 6 columns are identical, only the last 2 of output 1 differ from the last 2 of outputs 5 & 6 - ball run vs chair jump and chair hop.

If I were looking for combinations that were at least 25% unique, from these first 6 I could choose (1,5), (1,6), (2,4), (2,6), (3,4) or (3,5).

The first question: Is there is a formula, recursive or not, that can determine how many combinations there are that would be at least n% unique? It would need to work with varied data sets, each having a different number of lists numbering in the hundreds with each list containing from 1 to hundreds, possibly thousands of items.

The second question: Is there an algorithm that would allow the generation of only n% unique output? So instead of generating all possible combinations then going back and comparing them (as I did manually in the example above), to determine as each output row is being generated which columns should be populated with which item from that list.

My fall-back is to generate all possible combinations then go through and programmatically compare each output to all the others, which could take quite a long time to run. Any assistance that could speed up that effort would be greatly appreciated.
 
Physics news on Phys.org
  • #2
You've done a better job than most people who ask real world questions in math forums, but a few things need to be clarified.

When you ask for the number of "combinations" of these property lists that meet the requirement that they differ by exactly k columns, do you mean "combinations of two"? Or are you interested in larger groupings ? 3? 4? etc.?

( It's simpler to talk about the number of entries where these lists differ than to use a percentage measure.)
 
  • #3
Hi Stephen,

By combinations I mean output rows, where each row has one column for each list and contains one element from that list.

The # of columns that must be unique will vary depending on how many lists there are and what % of uniqueness is required.

If there are 100 lists and 40% uniqueness is asked for, then each output row would have to differ from every other output row by at least 40 columns. If 200 lists and 40% then 80 columns... etc.
 
  • #4
thallstd said:
Hi Stephen,

By combinations I mean output rows, where each row has one column for each list and contains one element from that list.

You still aren't making it clear whether "output rows" means two rows (i.e. a pair of rows, as in your example) or whether it might mean a group of 3 rows, each of which differed by, say, 2 columns from the other rows in the group.
 
  • #5
It is n rows.

If I have 100 lists that average 25 entries each and someone want to generate all possible rows that are 40% unique, how many rows is that?

Question 1: how to compute n.

Question 2. can each column of each row be algorithmically determined?
 
  • #6
I don't know if you intended to answer my question.

As I understand your problem, there is no simple algorithm to determine the number of combinations of lists that have a given property since we don't know in advance what types of lists will exist. The lists that you get as input presumably won't have an example of every possible list that could exist.

A computer program would have to duplicate the work that you did manually. This is certainly possible. Any sane implementation of it is going to use a variety of different functions, so it won't look like one simple algorithm.
 

Related to N% uniqueness of x columns * y rows of output

What does "N% uniqueness of x columns * y rows of output" mean?

N% uniqueness of x columns * y rows of output refers to the percentage of distinct or unique data in a table or matrix that has x number of columns and y number of rows.

Why is it important to measure the uniqueness of data?

Measuring the uniqueness of data helps to determine the quality and reliability of information. It can also help identify any potential errors or duplications in the data.

How is the N% uniqueness of x columns * y rows of output calculated?

The N% uniqueness is calculated by dividing the number of distinct data entries by the total number of entries and then multiplying it by 100. For a table or matrix with x columns and y rows, the total number of entries would be x * y.

What factors can affect the uniqueness of data?

The uniqueness of data can be affected by factors such as data collection methods, data entry errors, and data processing techniques. It can also be influenced by the type and nature of the data itself.

Can the N% uniqueness of x columns * y rows of output be improved?

Yes, the uniqueness of data can be improved by implementing quality control measures during data collection and processing, using advanced data cleaning techniques, and regularly reviewing and updating the data.

Similar threads

  • Programming and Computer Science
Replies
7
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
1K
  • Mechanical Engineering
Replies
8
Views
2K
  • Programming and Computer Science
Replies
34
Views
3K
  • Linear and Abstract Algebra
Replies
1
Views
1K
  • MATLAB, Maple, Mathematica, LaTeX
Replies
2
Views
1K
  • General Math
Replies
3
Views
3K
  • Set Theory, Logic, Probability, Statistics
Replies
7
Views
1K
  • Programming and Computer Science
Replies
23
Views
2K
  • Programming and Computer Science
Replies
12
Views
1K
Back
Top