Tinderbox
Extracting #hashtags and @mentions
Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #15 - Feb 1st, 2013, 1:17pm
 
A point I missed earlier, re performance, is that the command lines I describe above use a separate runCommand() call from TB for each tag type (hashtag and mention). This was simply so that each script returns one attribute's worth of data. In real-world use it might make sense to make one runCommand() call to a single script that locates both types of substring and concatenates the two results with a delimiter that can then be found in TB. It does mean an extra attribute in TB (no overhead!) to receive the runCommand() output. Thus, if the script joins the two results with "ZZZZ", use something like:

$MyString = runCommand(...etc.
$HashtagSet = $MyString.split("ZZZZ").at(0);
$MentionSet = $MyString.split("ZZZZ").at(1);
$MyString=;
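
A minimal sketch of what the single script might look like, written in Python rather than the Ruby one-liners discussed upthread. The function name extract_both is hypothetical, the "ZZZZ" delimiter is simply the one suggested above, and the pattern uses \s instead of a literal space so that tags also end at line breaks. It is an illustration under those assumptions, not code from the thread:

```python
import re

DELIMITER = "ZZZZ"  # assumed marker; it must never occur in the note text

def extract_both(text):
    """Return the hashtags and the mentions, each semicolon-joined,
    with DELIMITER between the two lists."""
    hashtags = re.findall(r"#[^!?.,\s]+", text)
    mentions = re.findall(r"@[^!?.,\s]+", text)
    return ";".join(hashtags) + DELIMITER + ";".join(mentions)

# In a script called from runCommand() the text would arrive on stdin:
#     import sys
#     sys.stdout.write(extract_both(sys.stdin.read()))
print(extract_both("ask @bob about #topic1, then #topic2!"))
# prints: #topic1;#topic2ZZZZ@bob
```

The part before the delimiter then goes to the hashtag attribute and the part after it to the mention attribute, as in the action code above.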


As so often, more than one approach is possible. Remember, with things like runCommand(), if you don't need to run them all the time, either use an agent or, if the source $Text never changes, use |= to set the result of runCommand().

[Later] aTbRef updated here.
« Last Edit: Feb 01st, 2013, 1:35pm by Mark Anderson »  

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #16 - Feb 1st, 2013, 4:26pm
 
The aTbRef update and pointers upthread are great. Not sure if permissions are needed for a script stored in a TB code note... I didn't set any because I didn't know any better, and it still worked :-). Still puzzling through what the ruby code might be for use with only one runCommand() call.

I marvel at how the two approaches for extracting #hashtags and @mentions are both real champs in doing something pretty hard for someone without a technical background to do elsewhere.

Though some Unix smarts (and expert advice) are needed for the ruby solution, the amazing thing is that in the end it boils down to a "simple" line of Tinderbox action code paired with a line of ruby code.

Unfortunately, some things that are (relatively) easy on the command line in Terminal become daunting in runCommand(). It seems that osascript, for example, will only accept parameters as command line arguments, not as STDIN, resulting in quoting headaches in runCommand(), no matter whether the script itself is stored inside TB or outside TB.

And what to do if there is more than one attribute's value (not just one as we have here) to pass to Ruby? Is there a not-too-painful way to send more than one Tinderbox attribute via STDIN and have a script pick them up as separate parameters?

BTW, my understanding is that the -e option with osascript is supposed to be used before each line of code something like osascript -e 'first line' -e 'second line' ... on the command line. But, happily, "only one -e" works when running a multiline AppleScript stored in a Tinderbox code note with something like this:

Code:
runCommand("osascript -e '"  +$Text("MyAppleScriptCodeNote") + "' '" + $MyTbAttr1AsArg1+"' '"+$MyTbAttr2AsArg2+"'") 


 
Perhaps the ruby -e is similar?
« Last Edit: Feb 1st, 2013, 4:42pm by Sumner Gerard »  
Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #17 - Feb 1st, 2013, 6:14pm
 
Quote:
Still puzzling through what the ruby code might be for use with only one runCommand() call.


Use the two existing one-liners in one script and concatenate their results (using a delimiter that appears in neither of the strings you're joining). Note this won't end up as a 'one-liner', at least not unless you want to try out a Ruby forum and find a friendly expert, and perhaps not even then. Read the runCommand() output (i.e. stdout) into a TB string attribute and use String.split() with whatever delimiter you used in the script.

Quote:
And what to do if there is more than one attribute's value (not just one as we have here) to pass to Ruby? Is there a not-too-painful way to send more than one Tinderbox attribute via STDIN and have a script pick them up as separate parameters?


Don't stop at the first roadblock. Do the reverse of the above: concatenate the attributes into a single script input parameter, then split them apart in your chosen shell scripting environment. Not tried, but with care you can likely code AppleScript lists, etc., as required. Avoid getting too focused on the 'ease' of one-line commands. They suit small, simple things. For more complex tasks you need a slightly more complex solution. TB has simple string split/join features, so it's not hard to route around the limits you describe.
« Last Edit: Feb 1st, 2013, 6:15pm by Mark Anderson »  

Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #18 - Feb 2nd, 2013, 1:52am
 
I like the idea of concatenating multiple attributes on the TB side, passing the string to a script via STDIN in the two-argument form of runCommand(), then parsing it on the script side. When you or Mark B say "it's not hard" you're probably right.  But, alas, for the non-technical, understanding an idea is not the same as actually getting it to work. Navigating the runCommand() shell quoting ugliness can work better because it's closer to the command line examples one can find and study on the net.

For example, as discussed, the Python script in reply #3 upthread (like many Python script examples I've tried to study) expects input via command line arguments, not STDIN.

Rather than rejigger it somehow to take STDIN, I found that when the script is revised to this:

Code:
import re, sys
# coding: utf-8
# Find each # followed by characters that are not ! ? . , or space
hashlist = re.findall(r"#[^\!\?\., ]+", sys.argv[1])
print(";".join(hashlist))



it will run from a TB code note named cHashtagPy with this action code:

Code:
$HashtagSet=runCommand("python -c '" +$Text("cHashtagPy")+ "' '"+$Text+"'") 



or (without setting any permissions) from an external file hashtag.py in my "home" directory with this:

Code:
$HashtagSet=runCommand("python ~/hashtag.py '"+$Text+"'"); 
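
Why a single straight quote in $Text trips up commands built this way can be seen outside Tinderbox. The sketch below uses Python's shlex module to mimic POSIX shell word-splitting. It assumes (this is not confirmed anywhere in the thread) that the string handed to runCommand() is parsed along those lines, and the helper name naive_command is hypothetical:

```python
import shlex

def naive_command(text):
    # Mirrors the action code above: wrap the note text in single quotes.
    return "python ~/hashtag.py '" + text + "'"

# Text without a single quote splits into the three expected words.
print(shlex.split(naive_command("plain #tag")))

# A single quote inside the text ends the quoted argument early, leaving
# the command line with an unbalanced quote.
try:
    shlex.split(naive_command("don't miss #tag"))
except ValueError as err:
    print("broken:", err)

# shlex.quote shows the escaping a POSIX shell expects instead.
safe = "python ~/hashtag.py " + shlex.quote("don't miss #tag")
print(shlex.split(safe))
```

Passing the text on STDIN, as in the two-argument form of runCommand() mentioned at the top of this reply, sidesteps the problem because the text never becomes part of the command line.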



Here Python is no performance champ though. On my sample set of 800+ notes, it took 50 seconds vs. 12 for Ruby and 2 for Tinderbox action code. It doesn't run any faster after compiling (just loads a little faster).

Though slow, Python does have the attraction of being implemented on iOS with the rave-reviewed Pythonista.

The comment about the dangers of always searching for one-liner "ease" is spot on. A little like Twitter: a great medium for masters of pithiness but also a false refuge for those who don't really know how to compose a paragraph! When it comes to scripts I'm afraid I'm in the latter camp, though, after a helpful thread like this, I at least can read a few paragraphs now and then.
« Last Edit: Feb 2nd, 2013, 1:54am by Sumner Gerard »  
Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #19 - Feb 2nd, 2013, 9:59am
 
The simple answer with Python & stdin is to stop hammering the nail that happens to be in front of you. Use Perl, Ruby, awk or shell scripting, which don't have such a limitation. I write that as one who learned by falling into that trap. I'm not a coder, but having hit a dead end with Python I was able (with some Googling) to quickly knock up Perl and Ruby alternatives. Counter-intuitive as it seems at the outset, trying to engineer one-click TB 'extensions' via runCommand() becomes an exercise in diminishing returns.

Python can read stdin, just not via argv. Thus:

Code:
#coding: utf8
import re
import sys
# Read the whole of stdin (the text passed in by runCommand)
source = sys.stdin.read()
hashlist = re.findall(r"#[^\!\?\., ]+", source)
print(";".join(hashlist))


To avoid faffing around with quotes on arguments, assemble the inputs in TB using a delimiter to make a single string. Pass that string as stdin. Then:

Code:
#coding: utf8
import re
import sys
source = sys.stdin.read()
# break stdin value into discrete attribute inputs
# do work on each - just as you would with discrete script argument inputs.
# concatenate results as a variable, e.g. resultstring
print(resultstring)


The contents of resultstring go to stdout, becoming the return value of your runCommand(). Action code then splits the attribute values back out of the returned string. Surely that's simpler than messing about with quoting.
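
One way the skeleton above might be filled in, as a sketch only. The "ZZZZ" delimiter is borrowed from reply #15, and the choice of "work" (collecting the hashtags of each input) and the function name process are assumptions made purely for illustration:

```python
import re

DELIMITER = "ZZZZ"  # assumed; it must not occur in any attribute value

def process(source):
    # Break the single stdin string into the discrete attribute inputs.
    inputs = source.split(DELIMITER)
    # Do the work on each input: here, collect its hashtags.
    results = [";".join(re.findall(r"#[^!?.,\s]+", value)) for value in inputs]
    # Concatenate the per-input results with the same delimiter.
    return DELIMITER.join(results)

# In a script called from runCommand() the last line would instead be:
#     import sys
#     sys.stdout.write(process(sys.stdin.read()))
print(process("Plan #alpha nowZZZZSee #beta and #gamma."))
# prints: #alphaZZZZ#beta;#gamma
```

The output keeps the inputs' order, so the first piece after splitting in TB belongs to the first attribute sent, the second to the second, and so on.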

Bottom line, it's very easy to fall into the trap of conflating things we don't know (and haven't looked up) with things that are genuinely hard. The former generally require more effort (than we intuited was needed); the latter often require deeper understanding of the process/tools. I don't think string manipulation falls into the latter.

Performance. I'd hazard a guess (not having run a test) that exporting all N items as a single block of data to a shell script and unpacking the results is faster than making N per-note runCommand() calls. I'd suspect that would have a greater performance impact than language A vs B. Plus, experts in the various languages (as researched in fora pertinent to the language) may well show faster code than the examples above. Also, unless one's doing the task daily, the milliseconds of performance shaved are probably massively less than the hours spent researching the solution. If I, personally, had to do this data extraction task, I'd find a script that seems to work, then go off to a forum for that language (or even Stack Overflow or such) and ask whether it's efficient.
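
The batching idea could look something like the following sketch. It has not been benchmarked, and the record layout is an assumption: each note is sent as one line of the form name, tab, text, so note names and text must not themselves contain tabs or line breaks. The function name batch_hashtags is hypothetical:

```python
import re

def batch_hashtags(block):
    """Turn each 'name<TAB>text' line into 'name<TAB>tag;tag;...'."""
    out_lines = []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        name, _, text = line.partition("\t")
        tags = re.findall(r"#[^!?.,\s]+", text)
        out_lines.append(name + "\t" + ";".join(tags))
    return "\n".join(out_lines)

sample = ("Note A\tbuy #milk and #eggs.\n"
          "Note B\tnothing here\n"
          "Note C\tcall @sam re #budget")
print(batch_hashtags(sample))
```

On the TB side the returned block would then need to be split on line breaks and each result assigned back to the note named at the start of its line.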
Mark Bernstein
designer of
Tinderbox

Re: Extracting #hashtags and @mentions
Reply #20 - Feb 2nd, 2013, 12:00pm
 
Further to Mark Anderson's point on performance: setting up to perform a runCommand is indeed a fairly complex and slow operation -- one that can take lots of milliseconds.  Doing it once is much faster than doing it thousands of times.
Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #21 - Feb 2nd, 2013, 8:14pm
 
So the tradeoffs seem to be ...
  • The faster performance of TB action code unencumbered by the processing overhead of setting up runCommand() vs. the flexibility and wide availability of examples and debugging environments for the various external scripting languages.

  • The convenience of keeping scripts within a TBX vs. fewer quoting headaches in runCommand() if the scripts are kept in external files (though the latter introduces other complexities: paths, unix permissions and such).

  • The convenience of just calling runCommand() once  for each note (actually twice here!) vs. the time spent figuring out how to concatenate everything on the TB side, pass it through runCommand() just once, process it in a script, return it back through runCommand() and explode it back into individual notes. Faced with that, I think I'd take my chances with Excel and VBA!

  • The quoting simplicity of passing *one* value to an external script via STDIN vs. navigating the runCommand() quoting complexity to pass *multiple* command line arguments for the promise of (much) easier sailing on the script end.

For me runCommand() is truly a distinguishing feature for Tinderbox, something few, if any, competitors offer. It opens up a whole new world to the semi-technical user for the many situations like this where performance speed is not critical. Perhaps it would be a smart investment to make runCommand() a *little* easier to use. Does it really *have* to make a hash of single straight quotes?  Or (apparently) simply not allow some of the kinds of escaping that can be done on the command line in the Terminal?

In any case, thanks for the explanations and thanks to Mark A for that last script showing how to get STDIN into Python. (I wish it were that easy for AppleScript, but unfortunately osascript wants command line arguments and doesn't think in terms of STDIN, and even when that is hacked around, one usually is dealing with more than one TB attribute anyway). The revised Python script in a code note cHashtagPySTDIN now runs with this:

Code:
$HashtagSet=runCommand("python -c '"+$Text("cHashtagPySTDIN")+"'",$Text) 





As a note to myself, here is my attempt at a plain English description of the valuable techniques Mark A demonstrated in scripts upthread:

Approach A. The "Humpty-Dumpty." Replace, split, select, reassemble (Tinderbox action code)

1) Take a string, say:

@mention1 ask @méntion2,etc. to explain #topic1 and #topic2 before 30Jan #dead_line xxx@gmail.com.

2) Replace each comma, period, exclamation mark, question mark and space with something that won't otherwise appear in the string, e.g. three successive pipe characters, resulting in something like this:

@mention1|||ask|||@méntion2|||etc||||||to|||explain|||#topic1|||and|||#topic2|||before|||30Jan|||#dead_line|||xxx@gmail|||com|||


3) Split the string into a separate chunk wherever there is |||. Here there would be 16 chunks (assuming one is supposed to count the empty one between etc and to and the one at the end).

4) Loop through the chunks and grab just those that begin with # and put them back together again adding semicolons in-between, giving:

#topic1;#topic2;#dead_line

and then grab just those chunks that begin with @, giving:

@mention1;@méntion2
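
The four steps can be traced in a few lines of Python. This is only a rendering of the steps as described here, not the Tinderbox action code from reply #1, and the function name humpty_dumpty is hypothetical:

```python
import re

def humpty_dumpty(text, prefix):
    # Step 2: replace each , . ! ? and space with a marker not in the text.
    marked = re.sub(r"[,.!? ]", "|||", text)
    # Step 3: split into chunks wherever the marker occurs.
    chunks = marked.split("|||")
    # Step 4: keep chunks starting with the prefix, rejoin with semicolons.
    return chunks, ";".join(c for c in chunks if c.startswith(prefix))

sample = ("@mention1 ask @méntion2,etc. to explain #topic1 and #topic2 "
          "before 30Jan #dead_line xxx@gmail.com.")
chunks, hashtags = humpty_dumpty(sample, "#")
_, mentions = humpty_dumpty(sample, "@")
print(len(chunks))  # 16, counting the two empty chunks
print(hashtags)     # #topic1;#topic2;#dead_line
print(mentions)     # @mention1;@méntion2
```

Because only chunks that begin with @ are kept, the e-mail address contributes nothing to the mentions.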


Approach B. The "Fishnet." Collect regex matches (in Ruby or Python)

1) Scan the string looking for matches on a pattern of the character # followed by one or more characters that are not an exclamation mark, question mark, period, comma, or space. The regex is #[^\!\?\., ]+ where the caret inside the square brackets means do not match any of the characters following within the square brackets, the backslashes are simply there to escape characters that have a special meaning in regex so they are treated literally, and the + following the brackets means require at least one character following the # (that doesn't match any of the characters inside the square brackets).

2) Collect the matches and "print" them (i.e., send them to STDOUT) separated by semicolons so runCommand() can assign the result to the desired TB set attribute.

3) Same drill looking for matches on a pattern starting with @ ...
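
Run against the same sample string as in Approach A, the pattern behaves as in the sketch below. It is written for Python 3, and the function name fishnet is hypothetical:

```python
import re

def fishnet(text, prefix):
    # The prefix followed by one or more characters that are
    # not ! ? . , or space (the pattern quoted in step 1).
    pattern = re.escape(prefix) + r"[^\!\?\., ]+"
    return ";".join(re.findall(pattern, text))

sample = ("@mention1 ask @méntion2,etc. to explain #topic1 and #topic2 "
          "before 30Jan #dead_line xxx@gmail.com.")
print(fishnet(sample, "#"))  # #topic1;#topic2;#dead_line
print(fishnet(sample, "@"))  # @mention1;@méntion2;@gmail
```

Note the extra @gmail: the pattern matches an @ anywhere, including inside an e-mail address, whereas Approach A only keeps chunks that begin with @. Whether that matters depends on the text being processed.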
« Last Edit: Feb 3rd, 2013, 1:11am by Sumner Gerard »  
Mark Bernstein
Re: Extracting #hashtags and @mentions
Reply #22 - Feb 2nd, 2013, 9:37pm
 
Quote:
For me runCommand() is truly a distinguishing feature for Tinderbox


It's really nice to hear this; it was an absolute bear to implement.
Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #23 - Feb 5th, 2013, 7:02pm
 
Trying to implement a "non-capturing group" in regex to allow extraction of just the part of the tags after the # and @ (e.g. thisTag rather than #thisTag) got ugly. But Python's "list comprehensions" together with runCommand() power turned out to be a real friend, enabling me to come up with this:

Code:
import sys
# -*- coding: utf-8 -*-
source=sys.stdin.read()
# Keep each whitespace-separated word that starts with #, minus the # itself
hashtags=set([word[1:] for word in source.split() if word.startswith("#")])
print(";".join(hashtags))



The line of code beginning with hashtags= splits the source text into "words", loops through the list filtering for "words" that start with #, and then puts them in a set, which, just as with sets in Tinderbox, eliminates duplicates.

If one really does want #thisTag not thisTag, then just change the 1 to 0.

And if things like thisTag, and thisTag. are showing up as separate tags in the results, just append .rstrip("insert unwanted trailing punctuation"), giving something like this:

Code:
hashtags=set([word[0:].rstrip(".,") for word in source.split() if word.startswith("#")]) 
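
Because a Python set has no guaranteed order, the tags may come back in a different order from one run to the next. A small variation, sketched here with the hypothetical helper name tags_from and on the assumption that alphabetical order is acceptable, sorts the set before joining and lets one function serve both # and @:

```python
def tags_from(source, prefix, strip_chars=".,!?"):
    """Return the unique words starting with prefix, without the prefix
    and without trailing punctuation, sorted and semicolon-joined."""
    found = set(
        word[1:].rstrip(strip_chars)
        for word in source.split()
        if word.startswith(prefix)
    )
    found.discard("")  # a bare "#" or "@" leaves nothing behind
    return ";".join(sorted(found))

text = "#beta first, then #alpha. Ping @zoe and @amy, re #beta!"
print(tags_from(text, "#"))  # alpha;beta
print(tags_from(text, "@"))  # amy;zoe
```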



I put the script in the text of a Tinderbox code note named cHashPy and run it with this:

Code:
$HashtagSet=runCommand("python -c '"+$Text("cHashPy")+"'",$Text) 



Having the script within a self-contained TBX is convenient for me and performs well as long as I don't have a single straight quote anywhere in the script. But if the script unavoidably were to have a single straight quote somewhere, I could put it in an external file hash.py and run it with something like this:

Code:
$HashtagSet=runCommand("python ~/hash.py ",$Text) 



On my machine Python doesn't make me set Unix file permissions and all that, but maybe I've just gotten lucky or somewhere along the line unknowingly changed some setting that one day will lead to a disastrous Unix security breach.

Anyway, runCommand() and Python can provide a friendly alternative for getting some things done (debugging in particular seems easier; Pythonista in iOS is pretty handy) though Mark A's native Tinderbox action code example upthread remains by far the speediest solution here... and there is surely an easy way to strip those leading #s and @s if desired.
« Last Edit: Feb 5th, 2013, 7:03pm by Sumner Gerard »  
Mark Anderson
Re: Extracting #hashtags and @mentions
Reply #24 - Feb 5th, 2013, 8:07pm
 
Or:

$HashtagSet = runCommand(code);
$HashtagSet = $HashtagSet.replace("#","");


Or (not tested):

$HashtagSet = runCommand(code).replace("#","")

It might require wrapping runCommand in extra parentheses to ensure runCommand output is evaluated correctly before the replace:

$HashtagSet = (runCommand(code)).replace("#","")

Whatever, for the non-scripter, the above might be an easier tweak to returned content than writing a new script regex.
« Last Edit: Feb 5th, 2013, 8:27pm by Mark Anderson »  

Sumner Gerard
Re: Extracting #hashtags and @mentions
Reply #25 - Feb 5th, 2013, 10:36pm
 
Yes indeed. Thanks! The .replace() does nicely so no further wrestling with regex. Didn't realize that to replace() sets must just look like strings with semicolons in them. On my machine no need to wrap the runCommand() in extra parentheses.

Restudy of the native Tinderbox action code in reply #1 finally yielded a solution there too for dropping the leading # and @ from the extracted tags (if needed; it's interesting that Common Words view drops them automatically).  Easy. Well it's easy, anyway, with an understanding of how .substr() works when given only one argument.

Instead of:

    $Hashtags=$Hashtags + X;

Just use:

    $Hashtags=$Hashtags + X.substr(1);

And similar with $Mentions.
« Last Edit: Feb 5th, 2013, 11:06pm by Sumner Gerard »  