Using a regex to get the domain name of a URL
Paul Atlan
Jul 13th, 2014, 6:47am
 
I'm trying to get the subtitle of my notes to carry the domain name of the URL they are linked to.
I've hacked together a kludgy way of doing this, by lifting a regex off the web:

1) I have an agent gather all my notes with a URL
2) Said agent has an action that reads:
   $URL.contains("^(https?://)?[^/.]+(\.[^/.]+)+");$Subtitle=$0;

I have a few problems with this:
It gives me the protocol (http://) before the domain. Unfortunately as it is, the regex spits out the protocol as $1, and the extension as $2; the meat of the domain is only included in $0, the fully matched expression.
I've tried parenthesising like this:
   "^(https?://)?([^/.]+)(\.[^/.]+)+"

but it doesn't seem to work (and I think I've run into a v6 bug where a malformed regex will crash TBX).

My second issue is that it feels inelegant: is there any way to directly assign the regex-matched string to $Subtitle?
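To make the capture-group behaviour described above concrete, here is a minimal sketch in Python's `re` module. This is an assumption-laden stand-in: Tinderbox's regex engine may differ in details, and the sample URL is hypothetical. It illustrates why a repeated group such as `(\.[^/.]+)+` only keeps its last repetition, and one possible way to capture the whole host by wrapping it in a single group and making the protocol group non-capturing.

```python
import re

# Pattern from the post: optional protocol, first label, then one or more ".label" groups.
ORIGINAL = re.compile(r"^(https?://)?[^/.]+(\.[^/.]+)+")
# The parenthesised variant from the post, which also captures the first label.
PARENTHESISED = re.compile(r"^(https?://)?([^/.]+)(\.[^/.]+)+")
# Illustrative alternative (not from the thread): non-capturing protocol,
# and a single capturing group around the whole host.
HOST_ONLY = re.compile(r"^(?:https?://)?([^/.]+(?:\.[^/.]+)+)")

url = "http://www.example.com/page"  # hypothetical URL

m = ORIGINAL.match(url)
whole = m.group(0)       # the full match, protocol included
protocol = m.group(1)    # the optional protocol group
last_label = m.group(2)  # a repeated group keeps only its final repetition

m2 = PARENTHESISED.match(url)
first_label = m2.group(2)   # only the first label
final_label = m2.group(3)   # only the last ".label"; middle labels are not captured

host = HOST_ONLY.match(url).group(1)  # the whole host in one group

print(whole, protocol, last_label, first_label, final_label, host)
```

In this sketch the middle label ("example") never lands in a numbered group of its own in either of the original patterns, which matches the symptom described in the post.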
Mark Anderson
Reply #1 - Jul 13th, 2014, 8:47am
 
(Testing in v6.0.2) Rule:

$MyString = $URL.contains("^https?://([^/:?#]+)(?:[/:?#]|$)");$MyStringA = $1;

If the URL is the source of this regex, http://stackoverflow.com/questions/8498592/extract-root-domain-name-from-string, $1 is 'stackoverflow.com'. I don't think you can do the whole process in one expression. As explained here, .contains() returns the string offset of the first match in the attribute being tested, which you don't want.
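The two-step shape of the rule (run the match, then read the captured group) can be sketched outside Tinderbox. The following uses Python's `re` as a stand-in, so treat it as an illustration of the regex only, not of Tinderbox's own behaviour; the helper name `extract_host` is hypothetical.

```python
import re

# The pattern from the rule above: protocol, then everything up to the
# first "/", ":", "?" or "#" (or the end of the string) is captured as group 1.
HOST_PATTERN = re.compile(r"^https?://([^/:?#]+)(?:[/:?#]|$)")

def extract_host(url):
    """Return the captured host of an http(s) URL, or "" if there is no match."""
    m = HOST_PATTERN.search(url)
    # The match itself and the captured group are separate things,
    # much as the .contains() result and $1 are separate in the rule.
    return m.group(1) if m else ""

url = "http://stackoverflow.com/questions/8498592/extract-root-domain-name-from-string"
host = extract_host(url)
print(host)
```

A URL with a protocol other than http or https does not match, so the helper returns an empty string for it.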

Noting you're using the outcome in a display expression, don't put the .contains call in the $DisplayExpression's expression itself as that has an effect on performance. Better is to 'calculate' the extracted domain string and place it in a user attribute, then use that value ($MyStringA in the above example) in your display expression.

My solution won't strip subdomains, so a URL for a page at stuff.example.com will be returned as 'stuff.example.com' and not just 'example.com' as some might presume; regex do exactly what you tell them to.

Kudos for the regex go to the answer by 'gilly3' in the above StackOverflow page. As the article context was JavaScript regex, I stripped some unneeded back-slashes from the source answer. URL syntax, once you step away from the basics, can get very complex. Since c.2010 domains don't need to use the Roman alphabet. The above should cope, but bear in mind regex are very precise - sometimes with unenvisaged results.
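As a rough illustration of the points about subdomains and "unenvisaged results", the sketch below runs the same pattern over a few sample URLs using Python's `re`. The URLs and the helper name `captured_host` are made up for the example, and Tinderbox's regex engine may not behave identically in every case.

```python
import re

HOST_PATTERN = re.compile(r"^https?://([^/:?#]+)(?:[/:?#]|$)")

def captured_host(url):
    """Return group 1 of the pattern, or None if the URL does not match."""
    m = HOST_PATTERN.search(url)
    return m.group(1) if m else None

# Hypothetical sample URLs, chosen to probe different parts of URL syntax.
samples = [
    "http://stuff.example.com/page",        # subdomain is kept
    "https://example.com:8080/index.html",  # port is dropped
    "http://пример.рф/path",                # non-Roman host name
    "http://user:secret@example.com/",      # userinfo trips the pattern up
    "example.com/page",                     # no protocol, so no match
]

results = {u: captured_host(u) for u in samples}
for u, h in results.items():
    print(u, "->", h)
```

The userinfo case is the surprising one: the capture stops at the first colon, so the group holds the user name rather than the host. If such URLs can occur in your notes, the pattern would need extending.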
« Last Edit: Jul 13th, 2014, 8:55am by Mark Anderson »  

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
Paul Atlan
Reply #2 - Jul 14th, 2014, 3:12am
 
Thanks Mark, I'd found some stack overflow answers, but this one is much better.

I need to get my head around something in TB, though:
Based on your reply, I have this as an AgentAction:
   $URL.contains("^https?://([^/:?#]+)(?:[/:?#]|$)");$Subtitle = $1;

To my mind, here's what happens:
a) the regex matching happens, populating the $0 (whole string) and $1 (domain) tokens. We discard the boolean result of the .contains() operator by not setting any variable to it.

b) we set the $Subtitle variable to our required precalculated token, and voila.

I'm not sure what you mean by "putting the .contains call in the $DisplayExpression itself". I didn't think that's what I was doing, and if anything, I thought my solution, by avoiding TWO variable settings, would actually be lighter on performance. Am I missing something?
Mark Anderson
Reply #3 - Jul 14th, 2014, 3:53am
 
Sorry, I misread your reference to a 'subtitle' as using a display expression. If you're using $Subtitle for output you should be fine as described. A niche issue arises if the target is an evaluated expression (like $DisplayExpression) but don't worry about that until you need to use it.