Tinderbox User-to-User Forum (for formal tech support please email: info@eastgate.com)
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi
Tinderbox Users >> Agent, Actions, Rules & Automation >> Finding set values that frequently appear together
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi?num=1455829305

Message started by Derek Van Ittersum on Feb 18th, 2016, 4:01pm

Title: Finding set values that frequently appear together
Post by Derek Van Ittersum on Feb 18th, 2016, 4:01pm

I have 500 notes, each with a set attribute $Codes. $Codes draws on about 20 possible values. Each of the 500 notes could have 0, 1, or more values. I would like to know which values frequently occur together.

E.g.,

$Name         $Codes
Apple         fruit;red;tree
Strawberry    fruit;red;plant
Orange        fruit;orange;tree
Cucumber      veg;green;plant
Pumpkin       veg;orange;vine

In the list above, the values "fruit" and "red" appear frequently together, as do "fruit" and "tree".

From my understanding of the intersect operator, it doesn't quite match what I'm looking for. For one, it looks for matches between different sets, not within one. While I suppose I could create many individual sets based on each value and intersect them against each other, this seems like a roundabout method I'm not sure would even work in the end. Am I missing an operator or technique? Anyone have experience doing something similar? Is this something better suited to a different tool?
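For what it's worth, the "roundabout" idea of building one set per value and intersecting them does work in principle, at least outside Tinderbox. Below is a minimal Python sketch of it, assuming the data has already been exported somehow (the export step is not shown, and the names `notes`, `notes_by_code` and `shared` are made up for illustration):

```python
from itertools import combinations

# Sample data copied from the table above.
notes = {
    "Apple": {"fruit", "red", "tree"},
    "Strawberry": {"fruit", "red", "plant"},
    "Orange": {"fruit", "orange", "tree"},
    "Cucumber": {"veg", "green", "plant"},
    "Pumpkin": {"veg", "orange", "vine"},
}

# Invert the data: for each code, the set of note names that carry it.
notes_by_code = {}
for name, codes in notes.items():
    for code in codes:
        notes_by_code.setdefault(code, set()).add(name)

# Intersect every pair of per-code sets. The size of each intersection
# is the number of notes in which both codes appear.
shared = {
    (a, b): notes_by_code[a] & notes_by_code[b]
    for a, b in combinations(sorted(notes_by_code), 2)
}

# Report only the pairs that appear together in more than one note.
for (a, b), names in shared.items():
    if len(names) > 1:
        print(a, b, sorted(names))
```

With the five sample notes this reports fruit/red and fruit/tree, each shared by two notes.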

Thanks!

Title: Re: Finding set values that frequently appear together
Post by Mark Anderson on Feb 18th, 2016, 5:42pm

A basic example like:

$MySet = $MySetA.intersect($MySetB)

…might show two different sets, as that was (IIRC) the use case behind the operator being added. However, the action is note-scoped. So instead of testing the values of two different sets for Note X, it could test the same set in two different notes:

$MySet = $MySetA("Note X").intersect($MySetA("Note Y"))  // (tested in v6.4.0)

Your overall problem is, I believe, a bigger task. You actually want to iterate over every pair of values to look for co-occurrences across all the notes. I don't believe there is a built-in method specifically for this.

Assume the unique list of values for your attribute, across all notes, is A, B and C. You need to check for co-occurrences of AB, AC and BC; luckily we can ignore dupes (AA) and ordering, as AB==BA in this context. Add more values and the list of pairings grows quickly: with your 20 or so values there are already 190 possible pairs. More than two items co-occurring? The task scales rapidly. My hunch is that this is something where you're better off pushing the data out to the shell and running a script there (language of your choice). That said, I'm happy to be wrong about this. Someone with more formal coding expertise may see an achievable way to do this in TB (without overtaxing it).
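As a rough idea of what such an external script might look like, here is a minimal Python sketch. It assumes the notes and their $Codes values have already been exported from Tinderbox (how is left open), and the function name `co_occurrences` and its `size` parameter are invented for this example:

```python
from collections import Counter
from itertools import combinations

# Sample data copied from the first post; real data would come from
# an export of $Name and $Codes (not shown here).
notes = {
    "Apple": {"fruit", "red", "tree"},
    "Strawberry": {"fruit", "red", "plant"},
    "Orange": {"fruit", "orange", "tree"},
    "Cucumber": {"veg", "green", "plant"},
    "Pumpkin": {"veg", "orange", "vine"},
}


def co_occurrences(notes, size=2):
    """Count how many notes contain each combination of `size` codes."""
    counts = Counter()
    for codes in notes.values():
        # Sorting first means AB and BA always end up under the same key.
        counts.update(combinations(sorted(codes), size))
    return counts


pair_counts = co_occurrences(notes)
for pair, n in pair_counts.most_common():
    if n > 1:
        print(pair, n)
```

Because each note is handled separately, only the combinations that actually occur get counted, rather than every possible pairing. Passing `size=3` counts triples in the same way.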

Title: Re: Finding set values that frequently appear together
Post by Mark Bernstein on Feb 19th, 2016, 12:03pm

The usual term for this is co-occurrence (hyphen optional). You'll find this discussed in the information retrieval literature and in library science.

If you're not using $Text, you could copy the sets to $Text and then use note similarity, which uses a reasonable metric (tf-idf) for co-occurrence. The "tf" stands for "term frequency": how often a term appears in a note, so two notes that share many terms score as more similar. The "idf" is "inverse document frequency", which measures how rare the terms are. Since persimmons are rare and apples are common, it's more remarkable that pears and persimmons are found together than pears and apples.
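To make the tf-idf idea concrete, here is a small Python sketch that treats each note's codes as its "terms" and compares notes by cosine similarity. This is only an illustration of the general technique, not Tinderbox's actual note-similarity implementation. The function names are made up, and the idf formula used (natural log of notes divided by notes containing the term) is one common variant among several. Note that it scores how alike two notes are, not how often two codes go together.

```python
import math
from collections import Counter

# Sample data from the first post, standing in for each note's $Text.
notes = {
    "Apple": {"fruit", "red", "tree"},
    "Strawberry": {"fruit", "red", "plant"},
    "Orange": {"fruit", "orange", "tree"},
    "Cucumber": {"veg", "green", "plant"},
    "Pumpkin": {"veg", "orange", "vine"},
}


def tfidf_vectors(notes):
    """Weight each code by idf = log(N / number of notes containing it)."""
    n = len(notes)
    df = Counter(code for codes in notes.values() for code in codes)
    # A code appears at most once per note, so tf is 1 and the
    # tf-idf weight reduces to the idf value alone.
    return {
        name: {code: math.log(n / df[code]) for code in codes}
        for name, codes in notes.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tfidf_vectors(notes)
print(cosine(vectors["Apple"], vectors["Strawberry"]))
print(cosine(vectors["Strawberry"], vectors["Cucumber"]))
```

In this sample, "fruit" appears in three notes and so carries less weight than "red" or "tree", which appear in two. That is the rarity effect described above.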
