IMPORTANT MESSAGE! This forum has now been replaced by a new forum at http://forum.eastgate.com and no further posting or member registration is allowed. The forum is still accessible via read-only access for reference purposes. If you wish to discuss content here, please use the new forum. N.B. - posting in the new forum requires a fresh registration in the new forum (sorry - member data can't be ported).
Extracting #hashtags and @mentions (Read 15974 times)
Sumner Gerard
Full Member
*
Offline



Posts: 359

Extracting #hashtags and @mentions
Jan 28th, 2013, 8:07pm
 
Threadnote (iOS) uses Twitter-style #hashtags and @mentions within note text for painless filtering and search within filters. It exports CSV with these columns: Note (text), Month, Day, Year, Hour, Minute, Place Name, Latitude, Longitude. Easy to paste into Tinderbox. But how do I extract the #hashtags and @mentions from the text into user attributes?

If a note's text is something like: Suggested @mention1 ask @mention2 to explain #topic1 and #topic2 before 30Jan deadline.

Want in attribute $Hashtags: topic1;topic2
Want in attribute $Mentions:  mention1;mention2

Remembering this thread I set up a GetHashtags agent with this query:

  $Text.contains("#([a-zA-Z0-9_]*)")

And this action:

  $Hashtags=$1

And a similar GetMentions agent, substituting @ for # in the regex.

This, of course, only grabs the first #hashtag and the first @mention.

How would one collect them all?

« Last Edit: Jan 29th, 2013, 9:05am by Sumner Gerard »

Mark Anderson
YaBB Administrator
*
Offline

User - not staff!

Posts: 5689
Southsea, UK
Re: Extracting #hashtags and @mentions
Reply #1 - Jan 29th, 2013, 4:34am
 
Firstly, there is currently no action to get regex-based sub-string(s) back from a string, although I've proposed such a feature (off-forum) for v6. This leaves 2 options:
  • Use runCommand() and scripting language of your choice to recover the tags from text
  • A brute-force method in TB.
I commend the former for those who can write such a script (not my area), as such languages have better tools for this. However, if you only have TB, there is a way. If going the latter route, I'd call the code on demand so as not to be running it all the time. That part of the exercise we can discuss later if need be. So, the approach:
  • replace all spaces/punctuation in $Text with a common delimiter. We'll replace comma/exclamation/full stop/question/space with '|||' as our word delimiter
  • Split $Text to a list (of words). ASSUMPTION: the $Text here is short, a few words, not numerous paragraphs. Why? Performance.
  • Iterate the list via .each().
  • Each list item is a word with no leading/trailing space or punctuation. So test first character of the word and assign to attributes as appropriate
So:

$MyString = $Text.replace("[,\.\!\? ]","|||");
$MyList = $MyString.split("\|\|\|");
$MyList.each(X){
  if(X.substr(0,1)=="#"){
     $Hashtags = $Hashtags + X;
  } else {
     if(X.substr(0,1)=="@"){
        $Mentions = $Mentions + X;
     };
  };
};
$MyString=;
$MyList=;


Notes (code tested in v5.12.1 in both a rule and an agent action):
  • Note the need to escape the pipe (|) symbol in the split() step. The split delimiter is a regex, in which a pipe is a control character. If all that confuses you, use "QQQQ" as the replace and split delimiter, as 4 uppercase Qs are unlikely to occur as an actual substring in the $Text.
  • You could do 2 successive if() tests on X, but as a word can't be both a hashtag and a mention it's only worth doing the second test if the first fails. Thus a mention is tested for in the 'else' branch of the hashtag test.
  • The tests deliberately only look at the first character of word X. This means we don't have to worry if the rest of the word is in non-Roman characters (where an [A-Za-z] might otherwise not match). IOW, Arabic or Chinese names should still work OK.
  • The last two steps are just clean up. It can be useful to leave them out for initial testing so you can see the in-process variables. When done, it's better to tidy up so those variables aren't being stored for each note (for which you'd otherwise be storing $Text data three times: $Text, $MyString and $MyList).
  • Given all the code, you are probably best setting this up via a code note - it saves squinting into a Rule or Action box.
In a small file like the (v5.12.1) one used for my test of the above, the result is instantaneous, but do consider scale. Thus I'd run it in an agent so I could switch the agent (and its action) 'off' when not needed, cutting the overhead of extra code.
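
For anyone who wants to compare the brute-force method with a scripting language, here is a minimal Python 3 sketch of the same split-then-test-first-character idea. It is not part of the Tinderbox solution; the function name extract_tags is made up for illustration, and the delimiter set is the same comma/full stop/exclamation/question/space used in the action code.

```python
import re

def extract_tags(text):
    """Split text into words, then sort words by their first character."""
    # Treat comma, full stop, exclamation mark, question mark and space
    # as word delimiters, and drop any empty strings left by the split.
    words = [w for w in re.split(r"[,.!? ]+", text) if w]
    hashtags = [w for w in words if w.startswith("#")]
    mentions = [w for w in words if w.startswith("@")]
    return hashtags, mentions

sample = ("Suggested @mention1 ask @mention2 to explain "
          "#topic1 and #topic2 before 30Jan deadline.")
hashtags, mentions = extract_tags(sample)
print(";".join(hashtags))
print(";".join(mentions))
```

As in the action code, the # and @ prefixes are kept in the collected values, and only the first character of each word is inspected.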

« Last Edit: Jan 29th, 2013, 4:35am by Mark Anderson »

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
WWW shoantel   IP Logged
Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #2 - Jan 29th, 2013, 1:18pm
 
Well, who said Tinderbox was hard?! Take a knotty (at least for non-technical types) problem, ask on the forum, get a short action code snippet back same day, paste it into a code note, run the action code, problem solved. Thank you, Mark, for the brilliant solution and also the helpful explanation!

Since this only needs to run once (after an import) I put it in a code note named cGetHashMen and ran it with a stamp with this action: action($Text("cGetHashMen")).

For anyone using the very interesting Threadnote iOS app, the first column labeled "Note" in the csv export automatically maps to $Title in Tinderbox, so need to first apply a stamp (or run an agent) with action $Text=$Name before running cGetHashMen.

As for performance, I tested with 800+ notes, some deliberately bulked up to be longer than the usual few sentences to see how it would do. On my MBA it was done in about a second. So in my current case the convenience of staying within Tinderbox outweighs performance considerations.

I am curious, though, if performance were to become an issue (with, say, lots of longer notes) how one would pursue option one: run an external script from Tinderbox.  I have been able to figure out how to pass attribute values through runCommand() to osascript to run an AppleScript (example), but I'm guessing AppleScript would be no speed demon here. How would one run, say, this presumably faster Python script?

« Last Edit: Jan 29th, 2013, 1:44pm by Sumner Gerard »

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #3 - Jan 29th, 2013, 2:27pm
 
Funny thing, I tried my first attempt at Python today. My script is hashtag.py (stored in ~ home directory):

Code:
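import re
import sys
# Collect every run of characters that starts with '#' and continues up to
# the next !, ?, full stop, comma or space in the first command-line argument.
hashlist = re.findall(r"#[^\!\?\., ]+", sys.argv[1])
print(";".join(hashlist))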



Called in Terminal like so:

Code:
echo | python hashtag.py '@mention1 ask @méntion2 to explain #topic1 and #topic2 before 30Jan @dead_line.' 



I get a stdout value of:  #topic1;#topic2

So if the source text is from $Text, this should work (but doesn't):

$MySet = runCommand("echo | python hashtag.py '"+$Text+"')");

It's not the command, I've tested that other ways. It just seems that runCommand won't read stdout or the Python print command looks like stdout, but isn't. Perhaps someone with better Unix skills can explain.

Read on as Mark Bernstein explains why…


« Last Edit: Jan 29th, 2013, 3:48pm by Mark Anderson »

Mark Bernstein
YaBB Administrator
*
Offline

designer of
Tinderbox

Posts: 2871
Eastgate Systems, Inc.
Re: Extracting #hashtags and @mentions
Reply #4 - Jan 29th, 2013, 3:35pm
 
When you place text on the command line, the shell (in this case bash) assigns special meanings to certain characters.  For example, "*.txt" is a wild card that's expanded to a list of files with the "txt" extension.

The "#" symbol introduces a comment; all remaining characters on that line will be ignored.  This only applies to the non-interactive shell.  See COMMENTS section in man bash in your terminal window for details if you like, or do a web search for "bash special characters".

I believe you could fix this by escaping the # sign, preceding it with a backslash.

In general, though, it is MUCH easier to pass the results on stdin than to dance around the shell's escape characters!

    runCommand("python hashtag.py",$Text)

rather than putting $Text on the command line.
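
The same point can be seen from the scripting side. The following Python 3 sketch is only an analogy: subprocess.run stands in for runCommand(), and the one-line CHILD script is a hypothetical stand-in for a real script that reads stdin. It says nothing about how Tinderbox implements runCommand() internally.

```python
import subprocess
import sys

# Hypothetical child script: echo back whatever arrives on stdin.
CHILD = "import sys; sys.stdout.write(sys.stdin.read())"

text = "ask @mention2 to explain #topic1 and 'quoted' $HOME *.txt"

# The text travels on stdin, so no shell ever parses it and
# characters such as # ' $ * need no escaping at all.
result = subprocess.run(
    [sys.executable, "-c", CHILD],
    input=text,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```

The child receives the text exactly as sent, which is the convenience of stdin over the command line.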

Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #5 - Jan 29th, 2013, 6:09pm
 
Escaping for the shell (actually, I'm beginning to understand, lots of different shells: interactive, etc., etc.) is awfully fussy! I understand the two-argument form of runCommand() is the way to go.  I've left out something here, though.

I have the above hashtag.py script at Macintosh HD/Users/sumnerg/hashtag.py. It gives the same results from the command line in Terminal as MarkA describes.

But I get no results from this in a stamp:

     $MySet=runCommand("python hashtag.py",$Text)

where hashtag.py is exactly as in MarkA's post, and $Text contains the same string that "worked" when typed into the command line.

What should I be looking to adjust?

« Last Edit: Jan 29th, 2013, 6:11pm by Sumner Gerard »

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #6 - Jan 30th, 2013, 5:52am
 
In case it was Python I tried a Ruby script, with the same result. The script runs fine in Terminal, and the script permissions are also correct (they aren't by default - I needed to set 755 but also tried 777).

I'm aware different processes can run in different shells. I wonder (free of expertise in this area!) if the script output is being passed to a different shell's stdout than that which runCommand is monitoring.

Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #7 - Jan 30th, 2013, 10:49am
 
Not knowing what I was doing, I tried the whole shebang line thing and all that.  No joy.  Perhaps Mark B can steer us in the right direction.  The idea of being able to run pithy Python scripts like hashtag.py from Tinderbox without too much Unix anguish has appeal.

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #8 - Jan 30th, 2013, 12:24pm
 
This is being followed up off forum. There are two puzzles, the first solved.

1. Paths for shell scripts. TB's working directory is '/' i.e. root. Thus you need to indicate the path from there.  So for a user with account short name 'mary' and a script (text.py) stored in her home folder (~):

BAD:  runCommand("python text.py")

Use either of:
    runCommand("python /Users/mary/text.py")
    runCommand("python ~/text.py")

Of course if she's stored it elsewhere:

    runCommand("python ~/Documents/text.py")

Put a space in the script name or in any path folders? Then quote the part after the tilde (the shell does not expand ~ inside quotes):

    runCommand("python ~/'My Scripts/text.py'")

[I'll add this to the aTbRef notes in due course]
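
For those building such command lines from a script rather than by hand, here is a small Python 3 sketch of the two rules above: resolve the path so it does not depend on the working directory, and quote it if it contains spaces. The helper name script_command is hypothetical; os.path.expanduser and shlex.quote are standard library calls.

```python
import os
import shlex

def script_command(interpreter, script_path):
    """Build a shell command line with an absolute, safely quoted path."""
    # expanduser turns "~/..." into the full home folder path, so the
    # result no longer depends on the caller's working directory.
    absolute = os.path.abspath(os.path.expanduser(script_path))
    # shlex.quote adds single quotes only when the path needs them.
    return f"{interpreter} {shlex.quote(absolute)}"

print(script_command("python", "~/text.py"))
print(script_command("python", "~/My Scripts/text.py"))
```

Because the tilde is expanded before quoting, the quoted result is a full path from root and the "~ inside quotes" problem does not arise.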

2. Encoding of arguments. Not sure whether this is a TB problem, Python, TB+Python, or generic, but it looks like Python isn't happy with the attribute strings it is passed, possibly an (assumed) encoding issue. This issue is being followed up off forum. Who'd have guessed - see reply #10 (two down from this).

I'll report back when I have more on the latter.

« Last Edit: Jan 31st, 2013, 1:01pm by Mark Anderson »

Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #9 - Jan 31st, 2013, 12:49am
 
This worked!

$MyString=runCommand("python ~/HelloWorld.py")

Where HelloWorld.py was:

x="Hello world!"
print(x)


But when HelloWorld.py is:

x="@mention1 ask @méntion2 to explain #topic1 and #topic2 before 30Jan @dead_line."
print(x)


Get no result running from runCommand()...

...And this error when running from the command line:

Non-ASCII character '\xc3' in file HelloWorld3.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Same error from command line when HelloWorld.py is:

x="@mention1 ask @méntion2 to explain #topic1 and #topic2 before 30Jan @dead_line."
y="hello world"
print(y)


So it seems at a minimum a declaration needs to be made in the script though not sure how to do that and whether that will allow non-ascii arguments to be passed through runCommand().  (With AppleScript I've found runCommand() to be fussy about characters in the script or arguments that AppleScript Editor accepts with no problem, in particular straight apostrophes.)
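
For reference, the declaration that the error message asks for (PEP 263) is a special comment on the first or second line of the script. A minimal sketch follows; it only addresses the "no encoding declared" error for non-ASCII characters in the script file itself, and whether it also helps with arguments passed through runCommand() is not something this example shows.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 encoding declaration. Python 2 needs it
# (on the first or second line) when the source holds non-ASCII bytes;
# Python 3 assumes UTF-8 by default, so there it is optional.
x = "@mention1 ask @méntion2 to explain #topic1 and #topic2 before 30Jan @dead_line."
print(x)
```

The script file itself must of course be saved as UTF-8 for the declaration to be truthful.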

Really happy with the Tinderbox action code solution upthread. Thanks!  Look forward to guidance in due course on running an external Python script.

« Last Edit: Jan 31st, 2013, 12:50am by Sumner Gerard »

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #10 - Jan 31st, 2013, 5:05am
 
As so often one small error in assumption falls foul of flawed documentation. I did read up references on Python and Ruby but neither made clear that input from stdin is not treated the same as the first script input argument. Not guessable before the fact if you're not a computer science grad. In each of those languages (and doubtless others) reading stdin uses a different syntax from reading script arguments (or 'parameters' if you prefer). In fairness, those documenting the languages think this all obvious to the Unix-user [sic]. But Unix is egregiously bad at offering up resources on concepts you need to know. Still, this is a lesson I won't need to learn again.

Re Python, it transpires that it's not a good choice here as stdin input is assumed to be a file so you have to go through all sorts of extra stuff 'just' to run the simple regex. (Of course, if you use Python a lot, then it's probably easy!).
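
For readers who do use Python, a stdin-reading variant might look something like the Python 3 sketch below. It is untested with Tinderbox; the function name tags_from_stream is invented for illustration, and io.StringIO is used here only as a stand-in for stdin so the example runs on its own. In a real script one would import sys and pass sys.stdin instead.

```python
import io
import re

def tags_from_stream(stream, marker):
    """Read all text from a file-like stream and join the tags found."""
    text = stream.read()
    # A tag is the marker followed by everything up to the next
    # !, ?, full stop, comma or space.
    pattern = re.escape(marker) + r"[^!?., ]+"
    return ";".join(re.findall(pattern, text))

# In a real script: print(tags_from_stream(sys.stdin, "#"))
demo = io.StringIO(
    "@mention1 ask @méntion2 to explain #topic1 and #topic2 "
    "before 30Jan @dead_line."
)
print(tags_from_stream(demo, "#"))
```

The "stdin is a file" behaviour is what makes this work: stdin is read like any other open file, in one call.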

Picking something that should be installed (I tested on a Mac running OS 10.6.8), Ruby looks a better choice.

Demo follows ere long.

« Last Edit: Jan 31st, 2013, 5:55am by Mark Anderson »

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #11 - Jan 31st, 2013, 6:22am
 
Do-it-yourself demo.
  • Open Terminal (Applications -> Utilities folder).
  • In Terminal type "cd ~" (without quotes) and hit the Return key to ensure your home folder is the current focus.
  • One at a time, copy/paste to Terminal each of the following lines of code, hitting the Return key after pasting in each line (you're running 4 separate commands in succession):
    • touch hashtag.rb
    • chmod 755 hashtag.rb
    • touch mention.rb
    • chmod 755 mention.rb
  • You have now made 2 new empty Ruby script files in your home folder and changed their permissions so they are executable as scripts.
  • Now add code. Depending on your system it may open in some specialist code editor, otherwise in TextEdit. N.B. you want the file to be in plain text format.
  • Open hashtag.rb. Paste in the following, and then save/close: $stdin.read.scan(/#[^\!\?\., ]+/) {|w| print "#{w};"}
  • Open mention.rb. Paste in the following, and then save/close: $stdin.read.scan(/@[^\!\?\., ]+/) {|w| print "#{w};"}
  • To your Tinderbox add 2 Set-type attributes $HashtagSet and $MentionSet.
  • To a rule or agent action, add this code:
    $HashtagSet=runCommand("ruby ~/hashtag.rb",$Text);$MentionSet=runCommand("ruby ~/mention.rb",$Text);
  • Save the TBX.
Any notes using the action code above will parse the current note's $Text and, via the command line, call the Ruby scripts and populate $HashtagSet and $MentionSet accordingly. Code tested in v5.12.1. The method works with accented characters, non-letter characters and other alphabets.

I've also tried Cyrillic and Chinese text too. The latter test works with the code but if the results are displayed as Key Attributes, any non-Roman text displays as a series of question marks as it is non-Unicode (a known v5.x limitation); the attribute data itself is unharmed.

If you want to use differently named attributes just edit the rule code accordingly. Set attributes make more sense than List-type here as sets de-dupe (i.e. don't allow duplicate values).

If you want to store your scripts other than in the home folder, then amend the paths in the rule accordingly.  If your path contains spaces, quote the part after the tilde (the shell does not expand ~ inside quotes): "ruby ~/'My Scripts/hashtag.rb'". If you store the script outside your home folder you'll need to give the full path from root (i.e. from "/"); if you do the latter, likely you're not a novice so will know what to do!

If I get round to making a file of this I'll put it in my demo bank (a then-current copy of which is included in the Eastgate Tutorial CDs).

« Last Edit: Jan 31st, 2013, 6:24am by Mark Anderson »

Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #12 - Jan 31st, 2013, 10:53pm
 
What a fantastic tutorial! Successfully applied to 840 dummy notes, each mostly two to three sentences with Chinese mixed in just a little. The (to me) astonishing result: The Tinderbox action code approach in this context leaves Ruby in its dust, taking less than two seconds to Ruby's twelve seconds or so.

On my machine Ruby speeds up a little, to ten seconds or so, if the scripts are placed in Tinderbox code notes instead of external files. I put the contents of hashtag.rb in a code note named cHashtagRb and those of mention.rb in a note named cMentionRb, then, after selecting the target notes, ran the scripts from a stamp with this action:

Code:
$HashtagSet=runCommand("ruby -e '"+$Text("cHashtagRb")+"'",$Text);$MentionSet=runCommand("ruby -e '"+$Text("cMentionRb")+"'",$Text); 



As I found with osascript and AppleScript, the only way to pass values of Tinderbox attributes via runCommand() as command line arguments (as opposed to STDIN) seems to be to surround them with spaces and straight single quotes within double quotes. Ugly and easy to get wrong, and any single straight quotes within the scripts and arguments being passed through runCommand() have to be hunted down and converted to curlies first. On the other hand, that way the performance can be a little better plus one doesn't have to worry about mysterious Unix paths and permissions and, probably the greatest advantage for me, the TBX is self-contained. So I guess it's pick the poison.  
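
On the quoting point: converting straight single quotes to curlies changes the script text, whereas the usual shell idiom keeps it intact by closing the quoted string, adding a quoted apostrophe, and reopening. The Python 3 sketch below shows that rule using the standard library's shlex.quote; the one-line Ruby script is a made-up example, and doing the equivalent replacement inside Tinderbox action code is left as an untested assumption.

```python
import shlex

script = "puts 'it works'"   # hypothetical one-line Ruby script
command = "ruby -e " + shlex.quote(script)
print(command)

# Round trip: a POSIX shell would split the command line back into
# exactly these three words, handing the original script to ruby.
assert shlex.split(command) == ["ruby", "-e", script]
```

Each apostrophe inside the script comes out as the five-character sequence '"'"' in the quoted command.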

It would be much more convenient for this user if in addition to being able to pass STDIN via the two-argument form of runCommand() one could also pass command line arguments without sweating about quoting for the shell and all that, with something like:

       runAScript(script, arg1, arg2, arg3…)

In any case, despite the complexity, doing something like this in Tinderbox is already **much** easier than in, say, Excel with VBA.  Especially with the expert guidance here.  Thanks!

« Last Edit: Jan 31st, 2013, 10:58pm by Sumner Gerard »

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #13 - Feb 1st, 2013, 6:58am
 
Interesting notes on the performance - I didn't have a big dataset to test on so it's good to hear how the 'internal' TB code fares.

Like @Sumner, I like TBXs to have no external file dependencies (scripts, templates). In truth, this habit has been acquired as I'm developing solutions for other people (who are often not of a technical nature). However, there comes a point where complexity works against simplicity, and script code needing to use both single and double quotes is a case in point. I think that needing both types of argument in a script is Nature's way of saying you should store/use the script outside TB.

Also bear in mind that Ruby's -e option doesn't appear to work with multi-line scripts. It so happens the script cited above is a single-line one so doesn't show the problem.

The weakest link here, as displayed by my mistakes up-thread, is the user's Unix smarts - or lack of them. Although the answer to more complex transforms may well be to push it out to the command line, it assumes the user knows how to use Unix. Of course, most looking for help in this area don't have such expertise. Catch-22. Still the forum's here to try and help. I've certainly learned a fair bit about stdin in the last few days.

« Last Edit: Feb 1st, 2013, 8:41am by Mark Anderson »

Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #14 - Feb 1st, 2013, 10:21am
 
Here's another useful little building block for creating script files from within TB. Assumptions:
  • You are going to add scripts to your home (~) OS folder. If you want to use a subfolder within home, amend code accordingly.
  • Your new script files need to be made executable before calling from TB.
  • Your script file names are only letters/numbers/underscore (and recommend an appropriate file extension)
  • You have a note whose $Name is the name of the script file to be  created and whose $Text is the script's code.
Create a new stamp. I called mine "Make Script File". Stamp code:

runCommand("cd ~; touch "+$Name+"; chmod 755 "+$Name+"; echo '"+$Text+"' | cat >> "+$Name)

Now, create your script note (as per assumptions). Select it and apply the above stamp. Your multi-line script is 'deployed' and permissions set ready for access from TB via runCommand as up-thread. Viewed in BBEdit the files created have Mac (CR, '\r') type line ends and there is a trailing mystery control character. However, the scripts seem to run OK. Note that we start by setting home (~) as the current working folder as TB's default is root (/).

We 'deployed' our script to the OS from TB. Why not delete it when no longer needed? New stamp "Delete Script File". Code:

runCommand("cd ~; rm "+$Name)

Same method: select the 'script' note for the unwanted deployed script and apply the stamp. The OS script file's gone (check in Finder if you doubt it) and all without leaving Tinderbox.
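
For comparison, the create and delete steps can be sketched in Python 3 as below. This is an illustration, not a replacement for the stamps: the function names deploy_script and remove_script are invented, the demo writes to a throwaway temporary folder rather than the home folder, and unlike the stamp (which appends via >>, so applying it twice duplicates the code) this version overwrites the file.

```python
import stat
import tempfile
from pathlib import Path

def deploy_script(folder, name, code):
    """Write code to folder/name and make it executable (mode 755)."""
    path = Path(folder) / name
    path.write_text(code, encoding="utf-8")  # overwrites, does not append
    path.chmod(0o755)
    return path

def remove_script(folder, name):
    """Delete a deployed script; a missing file is not an error."""
    path = Path(folder) / name
    if path.exists():
        path.unlink()

# Demo in a temporary folder so nothing real is touched.
with tempfile.TemporaryDirectory() as folder:
    deployed = deploy_script(folder, "hello.rb", 'puts "hello"\n')
    print(oct(stat.S_IMODE(deployed.stat().st_mode)))
    remove_script(folder, "hello.rb")
    print(deployed.exists())
```

Overwriting rather than appending is a deliberate difference, so re-deploying an edited script does not leave the old code in the file.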

Expectation management: DO NOT just dive into this using your only copy of your key TB document. Keep back-ups. Do small tests first. Check you are happy you get the expected results before you move on. You break anything, you have to clean up: garbage in, garbage out!