Tinderbox User-to-User Forum (for formal tech support please email: info@eastgate.com)
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi
Tinderbox Users >> Agent, Actions, Rules & Automation >> Using a regex to get the domain name of a URL
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi?num=1405248422

Message started by Paul Atlan on Jul 13th, 2014, 6:47am

Title: Using a regex to get the domain name of a URL
Post by Paul Atlan on Jul 13th, 2014, 6:47am

I'm trying to get the subtitle of my notes to carry the domain name of the URL they are linked to.
I've hacked together a kludgy way of doing this, by lifting a regex off the web:

1) I have an agent gather all my notes with a URL
2) Said agent has an action that reads:
   $URL.contains("^(https?://)?[^/.]+(\.[^/.]+)+");$Subtitle=$0;

I have a few problems with this:
It gives me the protocol (http://) before the domain. Unfortunately as it is, the regex spits out the protocol as $1, and the extension as $2; the meat of the domain is only included in $0, the fully matched expression.
I've tried parenthesising like this:
   "^(https?://)?([^/.]+)(\.[^/.]+)+"

but it doesn't seem to work (and I think I've run into a v6 bug where a malformed regex will crash TBX).

My second issue is that it feels inelegant: is there any way to directly assign the regex-matched string to $Subtitle?
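
The grouping behaviour described above can be reproduced outside Tinderbox. The sketch below uses Python's re module, not Tinderbox's own regex engine, so treat it only as an illustration of how capture groups are numbered. The names show_groups and HOST_PATTERN and the example.com URLs are made up for the example. HOST_PATTERN is one possible regrouping (a non-capturing protocol plus a single group around the whole host) and has not been tested in Tinderbox.

```python
import re

# The pattern from the post above, unchanged.
PATTERN = re.compile(r"^(https?://)?[^/.]+(\.[^/.]+)+")

# Hypothetical regrouping: the protocol is matched but not captured,
# and one capturing group wraps the entire host name.
HOST_PATTERN = re.compile(r"^(?:https?://)?([^/.]+(?:\.[^/.]+)+)")


def show_groups(url):
    """Return (whole match, group 1, group 2) for the original pattern."""
    m = PATTERN.match(url)
    if m is None:
        return None
    return m.group(0), m.group(1), m.group(2)


def host_of(url):
    """Return the host captured by the regrouped pattern, or None."""
    m = HOST_PATTERN.match(url)
    return m.group(1) if m else None


# A repeated group such as (\.[^/.]+)+ keeps only its last repetition,
# which is why group 2 holds just the final ".com" part.
print(show_groups("http://www.example.com/page"))
print(host_of("http://www.example.com/page"))
```

With the original pattern the host is only available as part of the whole match, protocol included. Wrapping the host in its own group and making the protocol non-capturing puts the host in group 1.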

Title: Re: Using a regex to get the domain name of a URL
Post by Mark Anderson on Jul 13th, 2014, 8:47am

(Testing in v6.0.2) Rule:

$MyString = $URL.contains("^https?://([^/:?#]+)(?:[/:?#]|$)");$MyStringA = $1;

If the URL is the source of this regex, http://stackoverflow.com/questions/8498592/extract-root-domain-name-from-string, $1 is 'stackoverflow.com'. I don't think you can do the whole process in one expression: .contains() returns the string offset of the first match in the attribute being tested, which is not the value you want.

Noting you're using the outcome in a display expression, don't put the .contains call in the $DisplayExpression's expression itself, as that has an effect on performance. It is better to 'calculate' the extracted domain string and place it in a user attribute, then use that value ($MyStringA in the above example) in your display expression.

My solution won't strip subdomains, so a URL for a page at stuff.example.com will be returned as 'stuff.example.com' and not just 'example.com' as some might presume; regexes do exactly what you tell them to.

Kudos for the regex go to the answer by 'gilly3' on the above StackOverflow page. As the article context was JavaScript regex, I stripped some unneeded backslashes from the source answer. URL syntax, once you step away from the basics, can get very complex. Since c.2010 domains don't need to use the Roman alphabet. The above should cope, but bear in mind that regexes are very precise, sometimes with unenvisaged results.
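
To see what this pattern captures for different URLs, here is a minimal sketch in Python's re module. It is an illustration only, not Tinderbox action code. The helper name domain_of and the example.com URLs are assumptions made for the example.

```python
import re

# The pattern from the rule above, unchanged.
DOMAIN_RE = re.compile(r"^https?://([^/:?#]+)(?:[/:?#]|$)")


def domain_of(url):
    """Return the host part captured as group 1, or "" if there is no match."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else ""


urls = [
    "http://stackoverflow.com/questions/8498592/extract-root-domain-name-from-string",
    "http://stuff.example.com/page",   # subdomain is kept
    "https://example.com:8080/x",      # port is excluded by the ':' in the class
    "example.com/page",                # no protocol, so no match
]
for u in urls:
    print(repr(domain_of(u)))
```

Note that the protocol is required here, so a URL stored without http:// or https:// yields no match at all.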

Title: Re: Using a regex to get the domain name of a URL
Post by Paul Atlan on Jul 14th, 2014, 3:12am

Thanks Mark, I'd found some Stack Overflow answers, but this one is much better.

I need to get my head around something in TB, though:
Based on your reply, I have this as an AgentAction:
   $URL.contains("^https?://([^/:?#]+)(?:[/:?#]|$)");$Subtitle = $1;

To my mind, here's what happens:
a) the regex matching happens, populating the $0 (whole string) and $1 (domain) tokens. We discard the boolean result of the .contains() operator by not setting any variable to it.

b) we set the $Subtitle variable to our required precalculated token, and voila.

I'm not sure what you mean by "putting the .contains call in the $DisplayExpression itself". I didn't think that's what I was doing, and if anything, I thought my solution, by avoiding TWO variable settings, would actually be lighter on performance. Am I missing something?

Title: Re: Using a regex to get the domain name of a URL
Post by Mark Anderson on Jul 14th, 2014, 3:53am

Sorry, I misread your reference to a 'subtitle' as using a display expression. If you're using $Subtitle for output you should be fine as described. A niche issue arises if the target is an evaluated expression (like $DisplayExpression) but don't worry about that until you need to use it.
