Tinderbox
Make a Set from $Text via agent back-reference
Mark Anderson
Make a Set from $Text via agent back-reference
Oct 30th, 2012, 11:51am
 
So, I've several hundred notes whose $Text contains sequences like so:

Quote:
Present in Version:
7, 8, 8.5, 9, 9.5, 10

I'd like to collect all the version 'numbers' as Set values in $VersionSet.  An added complication is some entries are like this, with added (variable) text on the second line of the intended match:

Quote:
Present in Version:
8.5, 9, 9.5, 10 (though only if module Z installed)


Having made $VersionSet and a test String $MyTest, it's time for an agent query:

$Text.contains("Present in Version:\n((\d|, |\.)*).*\n")

The substring match is that anywhere within $Text we find:
- Literal string 'Present in Version:'
- a single line break '\n'
- open back reference '('
- a sequence of zero or more '*' of any of the pipe-delimited '|' sequences in the parentheses '()'
-- a digit '\d'
-- a literal comma+space ', '
-- a literal full stop '\.' (as '.' regex matches any character)
- close the back reference ')'
- zero or more of any character plus a line break '.*\n'

Set an action: $MyTest = $1

Where '$1' is the back-reference we made in the query. We now get values like "8.5, 9, 9.5, 10" and, importantly, any trailing text or whitespace on the source line is gone. This test is a good precursor to doing the full transform, as this string is the source of the Set data. By using an interim test attribute we can easily ditch the results (and the attribute itself) when done.
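For anyone wanting to try the pattern outside Tinderbox, here is a minimal sketch using Python's `re` module. Python is only a stand-in here: Tinderbox uses a different regex engine, so treat this as an illustration of the pattern's logic rather than of Tinderbox's exact behaviour. The names `PATTERN`, `extract_versions`, `plain` and `noted` are made up for the example.

```python
import re

# The same pattern as the agent query above; group 1 plays the role of $1.
PATTERN = re.compile(r"Present in Version:\n((\d|, |\.)*).*\n")


def extract_versions(text):
    """Return the captured version string, or None if there is no match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical note texts modelled on the two quoted samples.
plain = "Intro\nPresent in Version:\n7, 8, 8.5, 9, 9.5, 10\nMore text\n"
noted = "Present in Version:\n8.5, 9, 9.5, 10 (though only if module Z installed)\n"

print(extract_versions(plain))  # 7, 8, 8.5, 9, 9.5, 10
print(extract_versions(noted))  # 8.5, 9, 9.5, 10
```

The capture stops at the first character that is not a digit, a comma+space or a full stop, which is why the trailing bracketed note is left out of the second result.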

Last piece of the jigsaw, change the agent action to this:

$VersionSet = $1.replace(", ",";")

Here we take the string returned via $1 and use String.replace() to replace the comma+space sequences with semi-colons. The overall result is passed to $VersionSet. In this case, as it's a one-off task, a simple '=' assignment works. If you need to ensure other pre-existing Set values aren't lost, use:

$VersionSet = $VersionSet  + $1.replace(", ",";")
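The two assignments can be sketched in Python, again only as an analogy. A Python `set` stands in for a Tinderbox Set, on the assumption that a Set is a de-duplicated, semicolon-delimited list; the function names are hypothetical.

```python
def to_version_set(captured):
    """Mimic $1.replace(", ",";") being assigned to a Set attribute."""
    items = captured.replace(", ", ";").split(";")
    # Drop empty items so an empty capture gives an empty set.
    return {item for item in items if item}


def merge_versions(existing, captured):
    """Mimic $VersionSet = $VersionSet + $1.replace(", ",";")."""
    return existing | to_version_set(captured)


print(sorted(to_version_set("8.5, 9, 9.5, 10")))
print(sorted(merge_versions({"7", "9"}, "8.5, 9")))
```

The first function overwrites, the second keeps whatever was already there and adds only the new values.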

In this scenario it so happens that the match sequence is known to occur only once, which allows one other trick. If we enhance the query to make 3 back-references (all the text before the match, the match itself, and all the text after it), it is possible to set $Text to the first and last of those, effectively deleting the matched text (which is now in an attribute): set $Text to $1 + $3. In practice, I find it useful to delay this latter step until after doing some checks on the data ingest, as it's easier to roll back if something went wrong.
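A rough Python sketch of the three back-reference idea follows. The pattern is an assumption, not the query actually used in Tinderbox. One thing to watch: the inner group from the earlier query is written here as non-capturing, '(?:...)', so that the groups number 1, 2, 3 as described; left as a plain '()' it would take a number of its own and push the 'after' text along by one. `SPLIT` and `split_out_versions` are hypothetical names.

```python
import re

# Group 1: everything before the marker; group 2: the version list;
# group 3: everything after the matched line.
SPLIT = re.compile(
    r"([\s\S]*?)Present in Version:\n((?:\d|, |\.)*)[^\n]*\n([\s\S]*)"
)


def split_out_versions(text):
    """Return (text without the matched lines, captured version string)."""
    match = SPLIT.match(text)
    if match is None:
        return text, None  # leave the text alone if the marker is absent
    return match.group(1) + match.group(3), match.group(2)


sample = "Intro line\nPresent in Version:\n8.5, 9 (note)\nTail line\n"
new_text, versions = split_out_versions(sample)
print(repr(new_text))  # 'Intro line\nTail line\n'
print(versions)        # 8.5, 9
```

Returning the untouched text when nothing matches mirrors the caution above about checking the data before altering $Text.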
--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
Sumner Gerard
Re: Make a Set from $Text via agent back-reference
Reply #1 - Oct 30th, 2012, 3:59pm
 
Thanks for this. Now understand regular expressions a little better.  I was having trouble making this work, until I realized I (obviously) needed to add a line break at the end of the sample text quoted above because the expression was expecting one.
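The line break point is easy to reproduce in Python (used here purely as a test bed; Tinderbox's engine may differ in detail). `STRICT` follows the query above, with the inner group made non-capturing. `TOLERANT` is a hypothetical variant, not something tested in Tinderbox, that accepts either a line break or the end of the text.

```python
import re

STRICT = re.compile(r"Present in Version:\n((?:\d|, |\.)*).*\n")
# Assumed variant: the tail may end with a line break or the end of the text.
TOLERANT = re.compile(r"Present in Version:\n((?:\d|, |\.)*)[^\n]*(?:\n|$)")

sample = "Present in Version:\n7, 8, 8.5, 9, 9.5, 10"  # no final line break

print(STRICT.search(sample))                   # None
print(STRICT.search(sample + "\n").group(1))   # 7, 8, 8.5, 9, 9.5, 10
print(TOLERANT.search(sample).group(1))        # 7, 8, 8.5, 9, 9.5, 10
```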
Mark Anderson
Re: Make a Set from $Text via agent back-reference
Reply #2 - Oct 30th, 2012, 8:01pm
 
An exasperating aspect of regex, for a comparative starter like me, is how nuanced it is in ways you wouldn't guess at the outset. I think (because it's hidden inside the app) that Tb's use of the Boost regex library matches, by default, on a per-line (== paragraph) basis. IOW, a match will run from the start of the match to the next line break. I think it can match more widely but I'm not 100% sure.

Let's say you want to match from "XYZ" through to the end of the line. You might guess "XYZ.*" but that seems to match to the end of the text (cue my confusion above). More certain is "XYZ[^\n]*". This equates to 'XYZ' followed by zero or more characters up to the next line break. The '[]' encloses a set of characters to match. The '^' inside it flips the match, i.e. match anything not listed. A line break is matched by the special sequence '\n'. So '[^\n]' means match any character that is not a new line (line feed) character. The asterisk in '[^\n]*' provides the scoping of zero or more matches. Notice it sits outside the square brackets as it applies to their contents.
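The difference between the two patterns can be seen in a small Python sketch. In Python, '.' only crosses line breaks when the `re.DOTALL` flag is set, so the flag is used here to imitate the 'matches to the end of the text' behaviour described above; whether Tinderbox behaves exactly this way is an assumption. The sample text is made up.

```python
import re

text = "abc XYZ rest of line\nnext line\n"

# With '.' allowed to match line breaks, '.*' runs on to the end of the text.
greedy = re.search(r"XYZ.*", text, re.DOTALL).group(0)

# '[^\n]*' stops at the first line break whatever the flag setting.
bounded = re.search(r"XYZ[^\n]*", text, re.DOTALL).group(0)

print(repr(greedy))   # 'XYZ rest of line\nnext line\n'
print(repr(bounded))  # 'XYZ rest of line'
```

The negated character class is the safer choice because its behaviour does not depend on how the engine treats '.'.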