Tinderbox User-to-User Forum (for formal tech support please email: info@eastgate.com)
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi
Tinderbox Users >> Agent, Actions, Rules & Automation >> Agent/Script for annotating by paragraph
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi?num=1346083940

Message started by Peter100 on Aug 27th, 2012, 12:12pm

Title: Agent/Script for annotating by paragraph
Post by Peter100 on Aug 27th, 2012, 12:12pm

I am looking for a way to automate annotations/notes at the paragraph level. I am new to Tinderbox. I have no programming / scripting skills. I am a PhD student trying to finish my thesis.

Scenario: I have ca 4000 docs/pdfs that I've collected over the past three years. Some of these contain the keys to my thesis. These vary in terms of quality and depth. The ideas expressed in them do not always cohere or flow, but there might be some good bits at paragraph level. I could of course search/tag everything but I would still need to hunt through each one separately to find the good bits.

I am inspired by Tom Webster's video on how he uses Tinderbox for qualitative research http://brandsavant.com/processing-qualitative-research-data-with-tinderbox/ - especially how he takes an interview transcription and "explodes" it at the sentence level and then codes/tags these sentences.

Elsewhere on this forum I was advised that I shouldn't use the "explode" feature for other kinds of documents, especially at the sentence level, and I concur. However, the possibilities of this kind of automation still haunt me. I wonder if it is advisable to "explode" a doc/ocr'd pdf at the paragraph level instead, and use the first 50 (or so) characters of the first sentence as the note title but create a link back to the original doc so the bits can also be viewed in context? Of course I would not want to perform this on all the docs/pdfs but a big handful.

Does this make any sense? I'm grateful for any and all input from those more skilled than I!

(Note: I have left a similar post on the DevonThink forum.)

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Aug 27th, 2012, 12:57pm

Explode at paragraph breaks is the default setting for the Explode feature (also see the Explode dialog). So far so good.

Next, you only want the first 50 or so characters as the title. In that case I'd choose to use either one or two sentences as your title and let the $Text be the whole paragraph.

Lastly, you want the note to link to source. Do you mean the source TB note - i.e. the one being exploded - or the document from which the text comes?  If the latter do you already have links to these?

In short, the rough process is:
  • Select the note to Explode
  • Note menu -> Explode.
    • Before you ask, there is no automated way to invoke Explode via either a shortcut or action code.
  • On the Explode dialog, set the desired choices (if not already the defaults).
  • Click 'Explode' button.
  • Use TB action code to set the back-links to source (insufficient info as yet to give a more detailed answer).
So, apart from not being able to automate the 4k+ separate explode actions, all this sounds do-able. More info is needed (questions above) re the last linking phase.
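
To make the intended result concrete, here is a rough sketch in Python (not Tinderbox action code) of what a paragraph-level explode with short titles amounts to. It assumes paragraphs are separated by blank lines and that the title is simply the first 50 characters of each paragraph; the function name explode_paragraphs and the dictionary keys are invented for illustration and are not part of Tinderbox.

```python
import re

def explode_paragraphs(text, title_length=50):
    """Split text at blank lines and return one dict per paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    notes = []
    for position, paragraph in enumerate(paragraphs, start=1):
        # Collapse internal line breaks so the title is a single line.
        one_line = " ".join(paragraph.split())
        notes.append({
            "title": one_line[:title_length],  # hypothetical title rule
            "text": paragraph,                 # the whole paragraph
            "position": position,              # order within the source
        })
    return notes

sample = (
    "First paragraph of the source.\n\n"
    "Second paragraph, which is a little longer than fifty characters in total."
)
notes = explode_paragraphs(sample)
for note in notes:
    print(note["position"], repr(note["title"]))
```

Keeping the position alongside each piece means a paragraph can still be traced to its place in the source after unwanted pieces are thrown away.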

Title: Re: Agent/Script for annotating by paragraph
Post by Peter100 on Aug 27th, 2012, 2:43pm

Thanks for the quick reply.


Quote:
Lastly, you want the note to link to source. Do you mean the source TB note - i.e. the one being exploded - or the document from which the text comes?  If the latter do you already have links to these?


I'm still not clear about the import/export process and how it will merge with my workflow. After some initial searching and collating I suspect I'll be importing from DevonThink. It could be nice to have both linking options: a link to the DevonThink doc and a link to the primary TB "note" (i.e. the explode one) but if only one option is possible then please let me know how to set it up (TB action code?). Does this clarify?

I have another q about working with pdf refs stored in a citation manager like Sente. Do people generally just drag and drop these or is there a more sophisticated way of targeting specific passages within the PDF, short of copying all the OCRed text and working with it as a TB note?

Title: Re: Agent/Script for annotating by paragraph
Post by Sumner Gerard on Aug 27th, 2012, 3:29pm


Quote:
link to the primary TB "note" (i.e. the explode one)


You may find some ideas on how this is quite easily done in this thread. Jean Goodwin's cascade of one-off OnAdd actions is probably simpler than the agent/self-canceling rule approach, though Mark A may have more recent thoughts (thread is a bit old but think it still applies.) I found that setting up the linking was much easier than it sounds.

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Aug 28th, 2012, 10:06am

DT, when a single item is selected, allows you (Edit menu or right-click) to export a DT link which will look like:

x-devonthink-item://F2CA8FC3-FD65-43BE-85F7-3572CE530893

If you add such to a URL attribute in Tinderbox then clicking TB's open link button for it will open the item from within DT (the degree of preview depending on the source doc's format - e.g. PDF, TXT, DOC, etc.). Note that these links only work on the Mac with DT installed and the relevant DT database present.

If you have DTPro (and thus access to AppleScript) you should be able to export your 4000+ filenames and their DT links to the clipboard such that pasting to TB gives you 4000 notes, each named for the document name and holding its DT local link. You'd think such functionality - exporting a tab-delim list of data - would be built-in but DT seems to be a roach motel for data: data checks in but has no way to leave. That said I'm not a deep DT user - perhaps one such can step forward and correct me on this.

Not tested, but I assume you can also - with DTPro or higher - export all your source docs' plain text (or the bits of them you want) to TB.

Let's now jump forward. You have 4,000 TB notes, each with some text data and with $URL set to the DT local link. You can explode each note in turn but as at TB v5.x this is a manual process.  This scenario is a good reason why. Let's assume each document has 30 paragraphs. With all exploded, you'll have c.128,000 notes (30 x 4,000 + 4,000 existing notes and 1 x "Exploded Text" container per explode). TB's OK with that though it won't want to try and show all of those in a single view. Just today, I've made a 440k+ TBX and on my fast 2011 MBPro it runs fine except there's way too much data for intensive agent use. So, I don't think you want to assume you can dump every paragraph from 4k+ articles and start analysing it.

You'll need to chunk the data. I'd explode one or a few documents at a time, throw away the obvious rubbish and save the good bits to a single core TBX.
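
As a quick check of that estimate, here is a small Python sketch of the arithmetic. It assumes, as above, 4,000 source notes with 30 paragraphs each and one "Exploded Text" container per exploded note; the function name is made up for illustration.

```python
def notes_after_explode(documents, paragraphs_per_document):
    """Estimate the total note count once every source note is exploded."""
    paragraph_notes = documents * paragraphs_per_document
    source_notes = documents   # the notes being exploded remain
    containers = documents     # one "Exploded Text" container per explode
    return paragraph_notes + source_notes + containers

total = notes_after_explode(4000, 30)
print(total)
```

Changing the paragraph count shows how quickly the total grows with longer documents.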

~~~~~~~~~
Separate issue, linking post install.  Assuming the exploded notes are still in their Exploded Text containers, then their grandparent note's $URL will hold the DT source link and the grandparent note $Text will be the immediate text source.

So, make an agent:

Query:   inside("Exploded Text")
Action:   $URL=$URL(grandparent(original)); linkTo(grandparent(original))

The result? Any explode result note will have a TB link to the note from which it was created and have that note's same DT back link.

Does that help...?

Title: Re: Agent/Script for annotating by paragraph
Post by David Bertenshaw on Aug 28th, 2012, 12:34pm

I have managed to get information semi-automatically out of DTP and into Tinderbox, including DTP tags, DTP-links and the text, but it does involve a bit of hacking. It's some time ago now, so the details are a bit hazy but it went something like this:

  • Hack the script Listing (which is included in DT) to include the text of each selected document, its DTP-link and its tags. (By default it only provides a list of the titles of all documents in the database.)

  • Include distinctive markers at the beginning of each DTP document, and round each DTP-link and its tags, so TBX can work on them later.

  • Run the script on the documents you want to export, and save the result text file, which looked something like this:


    Code:
    @@@ <A>First Document name </A>
    <tg>Tag1;Tag2;Tag3</tg>
    <ln>/x-devonthink-item://EF28548A-C596-461D-BA19-D37A80F077C5</ln>
    This is the text of the first document

    @@@ <A>Second Document name </A>
    <tg>Tag1;Tag2;Tag3</tg>
    <ln>/x-devonthink-item://etc</ln>
    This is the text of the second document


  • Import this document into TBX. Explode it using @@@ as the delimiter.

  • Run agents on the exploded documents to strip out the tags and the dtp-link and put them into attributes within each new note. E.g. in the Agent Query field, use

    Code:
    Text((<tg>)(.+)(</tg>))


    And in the Agent Action field use

    Code:
    Tags = $2

    (This procedure is described in the TBX menu.)

  • Eventually you end up with a single note per DTP document, including the content in the note body and the tags and link added automatically into the relevant attributes.


As I said it's a bit clunky, and it only produces plain text, but it seemed to work OK. Unfortunately, I've lost the script which I used to do it, otherwise I'd post it for real Applescript coders to laugh at.
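
For anyone who wants to check such an export file before importing it, here is a rough Python sketch that pulls the same pieces out of the marked-up text shown above. It is not the lost AppleScript and not Tinderbox code; it assumes each record starts with @@@ and carries exactly one <A>, <tg> and <ln> element in that order, and the names parse_export and RECORD are invented for the example.

```python
import re

# One record: name, semicolon-separated tags, link, then the document text.
RECORD = re.compile(
    r"<A>(?P<name>.*?)</A>\s*"
    r"<tg>(?P<tags>.*?)</tg>\s*"
    r"<ln>(?P<link>.*?)</ln>\s*"
    r"(?P<text>.*)",
    re.DOTALL,
)

def parse_export(export_text):
    """Split the export at the @@@ markers and return one dict per document."""
    notes = []
    for chunk in export_text.split("@@@"):
        match = RECORD.search(chunk)
        if match is None:
            continue  # skip empty or malformed chunks
        notes.append({
            "name": match.group("name").strip(),
            "tags": [t for t in match.group("tags").split(";") if t],
            "link": match.group("link").strip(),
            "text": match.group("text").strip(),
        })
    return notes

sample = """@@@ <A>First Document name </A>
<tg>Tag1;Tag2;Tag3</tg>
<ln>/x-devonthink-item://EF28548A-C596-461D-BA19-D37A80F077C5</ln>
This is the text of the first document

@@@ <A>Second Document name </A>
<tg>Tag1;Tag2;Tag3</tg>
<ln>/x-devonthink-item://etc</ln>
This is the text of the second document
"""
notes = parse_export(sample)
for note in notes:
    print(note["name"], note["tags"], note["link"])
```

The sketch breaks if a document's own text contains @@@, which is the same reason the markers need to be distinctive in the first place.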

Title: Re: Agent/Script for annotating by paragraph
Post by Sumner Gerard on Aug 28th, 2012, 1:29pm


Quote:
DT seems to be a roach motel for data: data checks in but has no way to leave

Don't know about DT, but DTPro is more like The Eagles' Hotel California than a roach motel. "You can check out any time you like but you can never leave." Lots of ways to check your data out while still keeping your room. One easy way (no AppleScripting needed) to get a list of titles (and text) into TB is to select the items you want in DTPro, choose 'File/Export/as Outliner Processor Markup Language' and save. Open the exported OPML file in TB. That's it.

The URLs brought into the individual TB notes (which will automatically display URL in Key Attributes) are the external URLs of the original sources, not the DT local link. If you have notes you've written yourself in DTPro or items in DTPro for which you haven't captured an external URL (there usually aren't many of these, as DTPro is very good at bringing in URLs automatically when you save things there from the web) you can first populate the URL field in DTPro by selecting each item, choosing 'Edit/Copy Item Link' and pasting that into the URL field in DTPro.  That way the DT local link will then be brought into TB.  You can then just click it in TB to open up the item in DTPro.

I don't speak AppleScript but there's no doubt a way to export the local DT link in the OPML file if the above involves too much manual populating of the URL field in DTPro.

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Aug 28th, 2012, 6:31pm

Yeah. Didn't mean to be harsh about DT - amazing app.  I had some success with this:


Code:
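-- Build a tab-delimited list (name, DT link) of the items selected in DEVONthink.
-- The first row is a header row for the two columns.
set dataString to "Name\tURL\n"

tell application "DEVONthink Pro"
  -- The items currently selected in the frontmost DT window
  set itemList to selection of front window
  tell front window
     repeat with anItem in itemList
        set itemName to ""
        set itemLink to ""
        set itemName to name of anItem
        -- reference URL is the x-devonthink-item:// link for the item
        set itemLink to reference URL of anItem
        -- One row per item: name, tab, link, newline
        set dataString to dataString & itemName & "\t" & itemLink & "\n"
       
     end repeat
  end tell
 
end tell
-- Put the finished list on the clipboard, ready to paste into TB
tell application "Finder"
  set the clipboard to dataString as Unicode text
end tell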


The DT forum improved it (I've not tested this):


Code:
set dataString to "Name\tURL\n"

tell application "DEVONthink Pro"
  set itemList to selection of front window
  tell front window
     repeat with anItem in itemList
        set itemName to ""
        set itemLink to ""
        set itemName to name of anItem
        set itemLink to reference URL of anItem
        set dataString to dataString & itemName & "\t" & itemLink & "\n"
       
     end repeat
  end tell
 
end tell
tell application "Finder"
  set the clipboard to dataString as Unicode text
end tell


Title: Re: Agent/Script for annotating by paragraph
Post by Peter100 on Aug 29th, 2012, 12:33am

Thanks to all who are jumping in here but I must admit you've dusted me. I think I'm still back at the hotel in California.

Could someone please recap? I pretty much got lost after...


Quote:
Select the note to Explode
Note menu -> Explode.
Before you ask, there is no automated way to invoke Explode via either a shortcut or action code.
On the Explode dialog, set the desired choices (if not already the defaults).
Click 'Explode' button.
Use TB action code to set the back-links to source (insufficient info as yet to give a more detailed answer).


I might as well throw in my own curve ball: I suppose the other alternative is to do the breaking up (exploding) of the documents/pdfs in DT and then import the best bits to TB. I believe this is possible. I received some feedback here: http://forum.devontechnologies.com/viewtopic.php?f=20&t=15865&p=73527#p73514. This could make use of DT's annotation template to see Humpty Dumpty in one piece - at least in DT - but I'm a newbie with that one too. Perhaps those of you using both apps have the solution?

Cheers!

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Aug 29th, 2012, 7:06am

Are you saying you don't understand how Explode works? What happened when you tried? Meanwhile I've made a short tutorial on using Explode - see this thread.

A missing part of the analysis is whether some aspects are practical. In the DT forum you noted this:


Quote:
Let me give an example: Say I have a PDF with 300 paragraphs. 30 of these are mildly interesting, 20 very interesting and 10 are outstanding. The rest I don't think are relevant.


As you go on to point out there's a lot of potential wastage, in terms of making unneeded extra assets. Exploding 4000 300-paragraph notes would generate c.1.2 million notes for you to review. Given the above quote you don't even want most of those. An en masse import/split process will be wasteful and an overload. Therefore I'd consider trialling, using some of your content you know well, either/both of these two methods to compare how well they fit your needs:
  • Create the desired paragraphs in DT and export them with DT back-links**. The TB end is to generate a set of notes that are text paragraphs linked back to source in DT.
  • Use custom OPML export to export whole source doc texts with their DT back-link. You would then explode these manually and as soon as possible delete any obviously unwanted paragraphs. An agent can link these notes to their TB source note and copy the latter's DT back-link to the paragraph notes. If you save the $SiblingOrder of the exploded paragraphs to a custom attribute before weeding, you'll have the source paragraph number in the note.
Once you find a method that's a good fit we can help look at doing things in more volume.

** A cool feature of these I just discovered is that for PDFs you can even add a page number parameter so if the reference is on page #14 of the target doc, the link will open it in DT scrolled to the right page.  Nice touch, though note that the user needs to add this extra parameter to the default link.


Title: Re: Agent/Script for annotating by paragraph
Post by Mark Bernstein on Aug 29th, 2012, 9:38am

Stepping back from the mechanics, let's think a bit about how we want to use these notes, and what that suggests for how we want to divide the texts and manipulate them.  Let's take two examples from two fields.

Suppose we are studying medical care in the Tudor era by exploring the account books of William Cecil/Lord Burghley during the months of his last illness.  We have a list of expenditures with some annotation; so much to an apothecary, so much to a grocer, so much to an upholsterer. At the outset, it's mostly a jumble. But every line once made sense: everything that was bought was bought for a reason. So, we want to keep everything, and maintain sequence and metadata for everything. But we also want to break things down by individual transaction, explore repeated transactions with the same vendors, or for the same things, or for things that turn out to be related.  Explode is our friend here, and we're bound to use maps (for informal clustering) and agents (for more formal groups) as our analysis proceeds.

Alternatively, suppose we have been reading everything we can find on the policy intentions of Nero, starting with Gibbon and proceeding through to the most recent studies. Our interest is not so much in history -- what happened -- as in historiography -- the ways in which "Nero" has been used by political and intellectual movements in the recent past.  Here, we've got thousands of pages of reading.  But much of it is not very much to the point. Our need is not to marshall all the available evidence; rather, we need insight and we need telling examples, chosen from a great array of evidence.  Here, we don't really need or want to Explode; instead, we're probably better off copying specific passages that seem useful and adding commentary or metadata.


Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Aug 29th, 2012, 10:50am

Building on Mark B's comments. Recalling your 4000 items as being a mix of your writing and research, it strikes me you're more likely to want to do a paragraph tear down of your own work whilst using TB's Find (and Find Next) or Agents to do textual analysis of full-text research articles. Thus it is likely you'll want to consider at least 2 primary discrete collections of material coming across from DT: your writing and the research.

Title: Re: Agent/Script for annotating by paragraph
Post by Sumner Gerard on Aug 29th, 2012, 1:24pm

Back on the mechanics, for those (RTF, without too much fancy formatting or extra line returns) items currently in DT that need to be exploded into paragraphs and explored in TB, I've come across a script devised by Korm and Christian Grunenberg and Charles Turner linked to at the bottom of this post that takes selected items from DT and exports them to an OPML file that opens in TB already exploded by paragraph and including the DT back-link and original URL (plus tags and comments, if any). To activate the links in TB change the attribute type for imported user attributes 'DTurl' and 'OriginalURL' to 'url'.

At the risk of overcrowding this particular room in Hotel California as deadlines loom, would love to learn more (either here or in another thread) about:


Quote:
 A cool feature of these I just discovered is that for PDFs you can even add a page number parameter so if the reference is on page #14 of the target doc, the link will open it in DT scrolled to the right page.
 

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Aug 29th, 2012, 3:32pm

@Sumner, see this thread re DT syntax for inbound URLs.

Title: Re: Agent/Script for annotating by paragraph
Post by Peter100 on Aug 29th, 2012, 3:59pm

Super thread.. why stop now?

I'm learning from the sidelines ... cheering and experimenting

New theme (thread) song?

http://www.youtube.com/watch?v=KR9Hi4wjC3Y

Explode or implode
We will take care of it...

Title: Re: Agent/Script for annotating by paragraph
Post by Peter100 on Sep 5th, 2012, 10:46am


Quote:
it strikes me you're more likely to want to do a paragraph tear down of your own work whilst using TB's Find (and Find Next) or Agents to do textual analysis of full-text research articles. Thus it is likely you'll want to consider at least 2 primary discrete collections of material coming across from DT: your writing and the research.

Performing an explode on my pdf collection, even at the page level, is a BAD idea. I realize that now. I don't know what I was thinking. I suppose I was feeling like a kid in a candy store. The suggestion to focus on my own texts, combined with focused keyword/tag searches in the pdf articles, makes much more sense.

So here is my next quest: automated pdf search/annotation... is there a way to get Tinderbox (or DevonThink) to automatically create highlighted annotations/notes in the pdfs? I would like to create an agent (is that the correct TB term?) to find, for example, all the paragraphs that match given search criteria (e.g. 5-10 search terms) and then have that paragraph automatically highlighted/annotated with a note that indicates/lists the key terms. This way the same paragraph could get different notes. Perhaps something like this is possible in Scrivener or another app? I could then pull all the annotations together that match into a smart group for each search string. Is this overzealous?

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Sep 5th, 2012, 11:21am


Quote:
I would like to create an agent (is that the correct TB term?) to find, for example, all the paragraphs that match a given search criteria (e.g. 5-10 search terms) and then have that paragraph automatically highlighted/annotated with a note that indicates/lists the key terms.

An agent can indeed find the items though using c.10 terms in the query, in a large corpus of notes, might be slow. Action code cannot highlight text, make text links, footnotes or new notes. The latter is deliberate to avoid ill-considered actions trying to generate millions of notes (whereupon at some point TB would get overloaded).

TB queries can't search on rich text features, e.g. bold or highlighted text, so bear that in mind also. TB's Find view - albeit with more restricted query potential - will underline all matching strings if the note's window is opened from the find view list (see more).

DEVONThink's 'Pro' and higher versions [sic] have AppleScript support so might be able to highlight text (assuming PDFs aren't un-OCR-ed scans) of matching terms. You'd do better to follow that angle up in the DEVONThink support forums as I suspect you'll need to talk to those expert in the scripting side of DT.

Recalling the point up thread about generating millions of notes, I do wonder if in your quest for a single 'does-everything' feature that you'll generate more data than you'll be inclined to review once done. If a process will create more data than needed, that is a good reason to review one's strategy, even if one then continues; at least that way there are no unpleasant surprises. Thought: perhaps this scaling issue is one possible reason that there aren't lots of previous examples of the workflow being discussed?


Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Sep 5th, 2012, 11:59am

This post, in another thread here, re DEVONThink might help with your highlighting issue.

Title: Re: Agent/Script for annotating by paragraph
Post by Peter100 on Sep 5th, 2012, 12:08pm

Thanks! I'll have a look!


Quote:
that is a good reason to review one's strategy, even if one then continues; at least that way there are no unpleasant surprises. Thought: perhaps this scaling issue is one possible reason that there aren't lots of previous examples of the workflow being discussed?


Hmm. I need to visualize and understand how I can use an app like TB before I fully embrace it. It's like flying a plane. I would never just hop in and see how it goes! I would be good and ready with the simulator first. I'm curious about TB and trying to develop a mental model of what it can do for me and how I might use it with other apps like DevonThink. In other words, I need to understand the limits before I can judge the appropriate level/scale at which TB works best. This is what drives my more "hypothetical" questions. I certainly appreciate all the generous feedback!

I am not necessarily interested in "generating millions of notes" (only money ha ha) if there is no way for TB to serve these up in a meaningful way, for example sorting out a few dozen "greatest hits" based on their relevance score. I suppose this is where DT might come in (at least with a couple of thousand). So I am reflecting on how I might turn a million, or probably a few thousand (if it's only my own work) into piles of a few dozen per search string. The suggestion of an initial filtered search seems the most obvious (hopefully scripted). If I did this over on the DT side I could then copy the notes/chunks into TB and then fine-tune their relations/outline.

I suppose this is the workflow you have been suggesting all along. ;) I'll hop over to DT now and see what I find there...

Title: Re: Agent/Script for annotating by paragraph
Post by Mark Anderson on Sep 5th, 2012, 12:59pm


Quote:
I need to visualize and understand how I can use an app like TB before I fully embrace it.

I do understand, but short of doing your PhD analysis or showing someone else's full workings it's rather hard. Instead, we're left with hypothesised questions that are hard to answer. Whilst my earlier answer listed some things TB can't do, in the context of the hypothesised workflow, the app is remarkably capable of text analysis though I think its design premise starts from a different place.  So far we're trying to develop an automated workflow that finds and creates an annotation for every instance of the target term. So, you'll have lots of actual note items/annotation/bits-of-data created in TB, DT, etc., that likely you don't need and simply clog up analysis and add to review time. At the same time, you need to allow for word stemming, homonyms, misspellings, indirect references, etc., which this process will get wrong either by annotating incorrect matches or missing correct ones.

A technique I've seen used successfully in a number of different contexts in TB is 'tagging'.  Indeed, that's essentially the heart of Tom Webster's process, where this thread started. Tom was exploding data pre-review because his data suited that and was likely written (laid out) with such later use in mind. However, you don't have to Explode everything.

For distinct terms, Agents can rapidly search a whole Tinderbox with 100,000s of notes and hold a reference to each one (the aliases 'in' the agent when looking at the UI). The agent's action can then, well, do all sorts of things, including adding specific terms to an attribute.  As a note's content might refer to more than one topic of interest, use a Set attribute (essentially a de-duped list that will only hold one instance of any value added). Assume you have items of interest to your study A, B and C. You set up your notes (or use prototypes) so your set is shown in the note's Key Attributes table. Now as you read the long form text you can type the 'tag' values into your set and even use auto-completion of terms. See a word/phrase/passage warranting a deliberate footnote - select it and use one of the footnote options to make a new footnote (annotation) to the TB note. The footnote is linked by a defined type of link which can be queried by agents too.
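
The agent-plus-Set idea can be modelled outside TB to see the effect. The following Python sketch is an analogy only, not action code: it assumes each note is a dictionary with a "text" string and a "tags" set standing in for a Set attribute, and the function name tag_notes is invented for the example.

```python
def tag_notes(notes, terms):
    """Add each search term found in a note's text to that note's tag set."""
    for note in notes:
        lowered = note["text"].lower()
        for term in terms:
            # Plain substring match: no stemming, homonyms or misspellings.
            if term.lower() in lowered:
                note["tags"].add(term)  # a set never stores duplicates
    return notes

notes = [
    {"name": "para 1", "text": "Agents can search note text.", "tags": set()},
    {"name": "para 2", "text": "Explode splits text; agents tag it.", "tags": {"explode"}},
    {"name": "para 3", "text": "Nothing relevant here.", "tags": set()},
]
tag_notes(notes, ["agents", "explode"])

# Gather only the notes that picked up at least one tag.
tagged = [note["name"] for note in notes if note["tags"]]
print(tagged)
```

The substring test is deliberately naive, so it misses variant spellings and can match unrelated words, which is exactly the matching problem described earlier in this post.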

Agents don't have to be permanent.  If you don't need them other than to find and review a particular set of notes, that's fine - delete the agent; it and its aliases leave but the notes it matched are intact and still retain the changes (if any) made by the agent.

If you can offer up some specimen data, it would be easy to illustrate this in more concrete terms. In a hypothetical context and with so many unfixed assumptions it's hard to give more detail. Time spent testing workflow is not, as often assumed, time wasted. Rather, it wastes less time downstream and leads to a generally better process as it forces us to see some of the edge cases to our reasoning before they strike at a less opportune moment.

Anyway, we can chip away at this - eventually you'll run out of reasons to not get started.  ;)

[Later] To help you experiment, I've just added an aTbRef article on pre-populating the lists for key attribute values.
