Tinderbox
  News:
IMPORTANT MESSAGE! This forum has now been replaced by a new forum at http://forum.eastgate.com and no further posting or member registration is allowed. The forum is still accessible via read-only access for reference purposes. If you wish to discuss content here, please use the new forum. N.B. - posting in the new forum requires a fresh registration in the new forum (sorry - member data can't be ported).
Finding links between AutoFetched HTML pages
Pat Maddox
Finding links between AutoFetched HTML pages
Feb 4th, 2016, 5:23am
 
I want to use Tinderbox to map out the structure of a website. Using AutoFetch, I can get the HTML content. So I'm wondering if Tinderbox can automatically find links for me.

Let's say that http://example.org/page1.html links to http://example.org/page2.html with an <a href='http://example.org/page2.html'>go to page 2</a>

I create two notes, setting the URLs to http://example.org/page1.html and http://example.org/page2.html respectively. Then I set AutoFetch=true to fetch the content.

Now what I would absolutely LOVE is for Tinderbox to somehow tell me that the note for page1 links to the note for page2.

I think it would need to do something like:

* scan the note's Text for HTML links
* look for notes in the Tinderbox document that have a URL matching one of the HTML links

Is it possible? Please say yes... :)
« Last Edit: Feb 04th, 2016, 5:23am by Pat Maddox »  
Mark Bernstein
Designer of Tinderbox, Eastgate Systems, Inc.
Re: Finding links between AutoFetched HTML pages
Reply #1 - Feb 4th, 2016, 9:56am
 
If the links were fully-qualified and canonical URLs, I suppose you could use regex to locate the URLs in the text, and then search for the back-references.
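
For illustration only, the regex idea above might look roughly like this outside Tinderbox. It is a minimal Python sketch, not Tinderbox action code: the `notes` dictionary is a hypothetical stand-in for the document (note name mapped to its URL and fetched text), and the pattern only catches quoted, fully-qualified http(s) links.

```python
import re

# Hypothetical stand-in for the Tinderbox document:
# note name -> (URL, fetched HTML text).
notes = {
    "page1": ("http://example.org/page1.html",
              "<p><a href='http://example.org/page2.html'>go to page 2</a></p>"),
    "page2": ("http://example.org/page2.html", "<p>No links here.</p>"),
}

# Matches href="..." or href='...' holding an absolute http(s) URL.
HREF_RE = re.compile(r"""href\s*=\s*["'](https?://[^"']+)["']""", re.IGNORECASE)

def find_note_links(notes):
    """Return (source_note, target_note) pairs for links between known notes."""
    url_to_note = {url: name for name, (url, _) in notes.items()}
    pairs = []
    for name, (_, html) in notes.items():
        for url in HREF_RE.findall(html):
            target = url_to_note.get(url)
            # Ignore links to pages with no note, and self-links.
            if target is not None and target != name:
                pairs.append((name, target))
    return pairs

print(find_note_links(notes))  # [('page1', 'page2')]
```

The lookup is an exact string match on the URL, which is why it only works when every link is written in the same canonical form as the note's URL.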

In practice, relative URLs, redirects, and other non-canonical forms are going to be a significant problem. There used to be a lot of interest in automatically-constructed web site maps, back in the day, and I imagine there's decent spidering software you can get off the shelf. If so, that reduces your problem to translating the spider's output format to Tinderbox; I bet that's not too hard.
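
To make the relative-URL part of that problem concrete, here is a small Python sketch using only the standard library's `urllib.parse`. The `canonicalize` helper is a hypothetical name, and the normalization rules it applies (drop fragments, lower-case the host, treat an empty path as `/`) are assumptions about what "canonical" should mean for a simple static site. It does nothing about redirects, which can only be discovered by actually requesting the page.

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def canonicalize(base_url, href):
    """Resolve href against the page it appeared on, then normalize it."""
    absolute = urljoin(base_url, href)          # handles relative paths and ../
    absolute, _fragment = urldefrag(absolute)   # drop #section anchors
    parts = urlsplit(absolute)
    # Scheme and host are case-insensitive; the path is not, so leave it alone.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

print(canonicalize("http://example.org/docs/page1.html", "../page2.html#top"))
# http://example.org/page2.html
```

Running every extracted link through a step like this before comparing it with a note's URL lets `page2.html`, `../page2.html` and `http://example.org/page2.html#top` all match the same note.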

But much depends on how badly you need this!
Mark Anderson
User - not staff!
Re: Finding links between AutoFetched HTML pages
Reply #2 - Feb 4th, 2016, 10:42am
 
Amen. Besides the issue @MB states, you also need to consider whether the site is all static pages or whether some/all are created from a query (e.g. a WordPress blog page). It's probably worth clarifying what eventual info you need. Is it just a set of notes in an outline reflecting the site, with an internal link for each inter-page link? Links to external pages? Is page content required?

I wonder if the spidering tool mentioned in this article might help. The tool is free for up to 500 URIs (pages) per session and the data can be exported to CSV or Excel, meaning it can likely be pulled into TB for your actual analysis (I've taken a quick look at the site but not used the tool). Anyway, using that or a similar tool and then importing the data to TB should save you a lot of time and regex head-scratching.
--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
Pat Maddox
Re: Finding links between AutoFetched HTML pages
Reply #3 - Feb 4th, 2016, 12:12pm
 
Okay, thanks. I was hoping Tinderbox would be able to figure out its own links from the HTML (something like outboundWebLinks, which I think is just for export and applies to links created within Tinderbox).

It sounds like I'd need to do a custom AutoFetchCommand to parse the HTML; something like nokogiri would give me way more control than regular expressions.
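
As a rough illustration of the parser-over-regex idea, here is a sketch that uses Python's standard `html.parser` as a stand-in for nokogiri (a swapped-in library, chosen only so the example has no dependencies). The `LinkCollector` and `extract_links` names are hypothetical, and how such a script would be wired into AutoFetchCommand is not shown.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return all <a href> values found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

sample = ("<p><a href='page2.html'>go to page 2</a> "
          "<A HREF=\"http://example.org/\">home</A> "
          "<a name='x'>anchor only</a></p>")
print(extract_links(sample))  # ['page2.html', 'http://example.org/']
```

Unlike a regex, the parser copes with mixed-case tags, either quote style, and anchors that have no href at all, and it returns relative links as written so they can be resolved separately.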

To answer Mark A's questions...

I'm analyzing my own static site (not built with TB). I want to get a better idea of how it's structured right now, and then I'll make changes and additions.

And after messing with Tinderbox for generating a simple site, I'm seriously considering moving my site into Tinderbox... but it seems daunting for right now.
Mark Anderson
Re: Finding links between AutoFetched HTML pages
Reply #4 - Feb 4th, 2016, 12:58pm
 
^outboundWebLinks^ will list the external links in a note's $Text, i.e. links whose target lies outside the Tinderbox document; the internal count of these is $WebLinkCount. In the context of your project, they would represent links outside the site, whereas intra-site links are represented by $OutboundLinkCount.

The Mac install instructions for nokogiri place it a bit beyond my comfort zone, but I can see how it might help if you pass $Text out to the command line via runCommand. I think it fair to note two things here re analysing auto-fetched data. Firstly, runCommand wasn't originally envisaged for always-on heavy lifting, but using it to run across a small set of notes' content shouldn't be problematic. Secondly, AutoFetch is one of the older, less-used features (judging by how often it's discussed), so it's not much documented and you may encounter behaviour you didn't expect. (Aside: I do use it OK in a very light way in the aTbRef version checker.)

One last thought from a different perspective. I note your site is static and thus likely doesn't have issues of complex paths or duplicate page names (in different parts of the site), etc. If you can use some tool to list the (unique) target page (file)names or titles and save those per page, then you should be able to import such a table to TB and use linkTo() to create your TB links. Tip: if doing this, use rules or, better still, edicts rather than an agent, to avoid linking aliases rather than originals.
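
One possible shape for such a per-page table, sketched in Python. Everything here is an assumption for illustration: the `edges` data is made up, `LinkTargets` is a hypothetical user attribute name, and the choice of a tab-delimited file with a header row and a semicolon-separated list column should be checked against how your version of Tinderbox imports tabular data.

```python
import csv
import io

# Hypothetical link data: (source page name, target page name) pairs,
# e.g. as produced by a spidering tool or a link-extraction script.
edges = [
    ("page1", "page2"),
    ("page1", "page3"),
    ("page2", "page1"),
    ("page1", "page2"),  # duplicate link on the same page
]

def build_table(edges):
    """Return tab-delimited text: one row per page, targets joined by ';'."""
    targets = {}
    for source, target in edges:
        # Make sure every page gets a row, even if it links nowhere.
        targets.setdefault(source, [])
        targets.setdefault(target, [])
        if target not in targets[source]:
            targets[source].append(target)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "LinkTargets"])
    for name in sorted(targets):
        writer.writerow([name, ";".join(targets[name])])
    return out.getvalue()

print(build_table(edges))
```

With one row per page and the targets held in a single list column, a rule or edict on each imported note could then walk that list and call linkTo() for each name, provided the page names really are unique.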
« Last Edit: Feb 4th, 2016, 1:02pm by Mark Anderson »  

Pat Maddox
Re: Finding links between AutoFetched HTML pages
Reply #5 - Feb 4th, 2016, 1:47pm
 
Good ideas there... thanks Mark!