Tinderbox User-to-User Forum (for formal tech support please email: info@eastgate.com)
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi
http://www.eastgate.com/Tinderbox/forum//YaBB.cgi?num=1349106771

Message started by Marisa Antonaya on Oct 1st, 2012, 11:52am

Title: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 1st, 2012, 11:52am

I've recently started training to become an indexer, and will have to decide on some tools in the near future. While there's dedicated indexing software out there to help indexers with some of the drudgery so they can get down to analyzing the text (Cindex http://indexres.com/soft_feat.php is one such program), as a long-time Tinderbox user I immediately thought about how I could tweak Tb for the job.

Has anyone done this type of work with Tb? If so, I'd love to hear how it worked for you!

Title: Re: Tinderbox for Indexing Work?
Post by Mark Bernstein on Oct 1st, 2012, 2:52pm

One of the world's great indexers, Rosemary Simpson, is a big Tinderbox fan.  She has strong opinions!  I'm not sure that she reads the forum all the time, but I'll call her attention to this thread.

Title: Re: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 1st, 2012, 4:56pm

Thanks, Mark! I have a vague idea of how I can use Tinderbox to help construct an index, but if someone else has already done it I'd be very interested in hearing about it.

Title: Re: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 10th, 2012, 4:21pm

I think I'm making good progress in setting up Tinderbox for creating an index.  :)

So far, I have:

1. A prototype called protoEntry, which in turn has a boolean attribute I created, called Edit. I also wrote a rule so that, when Edit is checked, the red flag badge will appear next to the entry.

2. A separator called Index, which is the main list of entries. It has an "on add" action that assigns the protoEntry prototype to any note I put in it. Notes in this separator are organized alphabetically (using the Name attribute).

3. An agent called "Entries by date of creation," which simply gathers notes with the protoEntry prototype and organizes them according to the Created attribute. This helps me see how I have progressed through a document, in terms of the words/concepts I've chosen to index.

I still have a lot of work to do on this, of course: I need to figure out how to handle cross-references, double-posting, locators, and export formats. But it's a start, and I thought this description might help others doing similar work.

Title: Re: Tinderbox for Indexing Work?
Post by pierfranco on Oct 10th, 2012, 4:57pm

Marisa,

your work sounds very interesting.
I think it would be helpful if you posted a sample file when you think you have completed your prototype.
Pierfranco

Title: Re: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 10th, 2012, 5:01pm

Thank you! I certainly will. I'm at the point in my training where I need to do some small sample indexes, so I'll be tweaking the prototype soon.

Title: Re: Tinderbox for Indexing Work?
Post by Paul Walters on Oct 10th, 2012, 5:44pm

@Marisa Antonaya, I appreciate your description of the work in progress and look forward to more.

I'm curious why you'd put notes in a "separator called Index" rather than a container?  Separators don't appear in some contexts -- maps in particular.

Title: Re: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 10th, 2012, 7:02pm

Hi Paul,

Thanks for your input and interest. I chose a separator for purely visual reasons, since I'm just working in outline view at the moment. In fact, I didn't know separators didn't show up in Map view until you mentioned it; if I do find myself wanting a map view at some point, I might change it to a regular parent note.

Title: Re: Tinderbox for Indexing Work?
Post by Mark Anderson on Oct 11th, 2012, 5:27am

Finding and toggling a lot of separators is actually quite trivial, as the $Separator boolean attribute is set to true in any note that is a separator. So, let's assume you've got to the point of wanting to turn a whole load of separators back into normal notes/containers, too many to do quickly by hand. Let's also assume you don't want to affect all separators, just those in/descended from the root container "Index". Make an agent:

Query: descendedFrom("/Index") & $Separator
Action: $Separator=;

The query finds all (and only) separators in the desired scope within the document. The action then resets $Separator to its default value, which is false - the value for normal notes. Or make a stamp:

Name: Remove Separators
Action: $Separator=;

Thus, separator use isn't an issue to worry about. As a primarily outline-focused user myself, I find separators make a lot of sense in that context for helping indicate organisation.

Title: Re: Tinderbox for Indexing Work?
Post by Mark Anderson on Oct 11th, 2012, 5:31am

Tools that can help with indexing:
  • Common Words view [Cmd]+[Opt]+[w].
  • Stop word list, for use with common words.
  • Explode the $Text of a note, e.g. a list of common words or manually written index into per-word notes.
I've never done any indexing, but I could see a possible workflow like this:
  • Import each source page of the document as a separate note within a container
  • Ensure the latter container has no text and has no descendants other than the source page texts
  • Ensure each source page note has the page number stored in the $Name or in a user attribute, e.g. a number attribute $PageNumber.
  • Select the container holding all the source data and open Common Words, switch it to 'section' scope (i.e. this container and all descendants). Use Cmd+C to copy the words into a new note elsewhere.
  • Review the list and adjust the stop words (see link above) as required.
  • Explode the note on the space character removing the delimiter
  • Add a Set-type attribute, $PageRefs, and use this rule on the exploded items (where /Book is the path of my source page data container):
      $PageRefs = collect_if(find(inside("/Book")),($Text.icontains($Name(that))>0),$PageNumber).unique.nsort
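
For readers who don't have Tinderbox to hand, the logic of the workflow above can be sketched roughly in Python. This is only an illustration under stated assumptions, not Tinderbox code: the `pages` dictionary, the `STOP_WORDS` set and the function names are all hypothetical, and plain substring matching stands in for what `icontains` does in the rule.

```python
import re
from collections import Counter

# Hypothetical sample data: page number -> page text
# (stands in for the per-page notes inside /Book).
pages = {
    1: "The treaty covers the river basin.",
    2: "Residents discussed the Treaty and the plan.",
    3: "The plan affects the river.",
}

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "of"}


def common_words(pages, stop_words):
    """Count lowercase words across all pages, skipping stop words."""
    counts = Counter()
    for text in pages.values():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


def page_refs(term, pages):
    """Return sorted page numbers whose text contains term, ignoring case."""
    needle = term.lower()
    return sorted(num for num, text in pages.items() if needle in text.lower())


# One entry per candidate word, mirroring the per-word notes with $PageRefs.
index = {word: page_refs(word, pages) for word in common_words(pages, STOP_WORDS)}
```

Because the match is a simple substring test, a term such as "plan" would also pick up pages that only contain "planning", so the output still needs a human review.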

This of course meant I just had to give it a try! (v5.11.2) I didn't look at export but that strikes me as the trivial part - it's mostly a matter of what format you wish to use. I made much use of prototypes to ease the process and saw a few obvious shortcomings. For instance:
  • The recovered common words are all lowercase though using String.capitalize and String.uppercase (for acronyms) on them could help.
  • The process doesn't cope with multi-word phrases, though these can be added manually alongside the mechanically generated index words and use the same prototype/rule to recover page references.
  • Page ranges e.g. '72-74' rather than '72, 73, 74' would need to be manually found and edited.
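
As a rough sketch of how the page-range shortcoming might be handled outside Tinderbox, the following Python function collapses runs of consecutive page numbers into ranges. The function name is hypothetical, and it assumes that consecutive pages should always be merged, which an indexer may not want when the pages carry separate mentions rather than one continuous discussion.

```python
def collapse_page_ranges(page_numbers):
    """Turn page numbers into locator text, merging consecutive runs."""
    pages = sorted(set(page_numbers))
    if not pages:
        return ""
    parts = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a run of consecutive pages.
            prev = page
            continue
        # The run has ended: emit it as a single page or a range.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)
```

For example, `collapse_page_ranges([72, 73, 74])` gives `'72-74'`, while isolated pages are left as they are.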
I couldn't use the above rule as an agent action - as first intended - because $Name(that) doesn't work as I intuited**. Still, deploying the code as a rule (via a prototype for ease of tweaking the code) is a perfectly good workaround. I'd tested an agent first as it's easier to turn on and off, but an agent could be used to find all the index words and toggle their $RuleDisabled so we can still turn off a lot of action overhead when we don't need it. A current consulting project using (admittedly oversize) TB datasets shows that in very big files turning off rules/actions that aren't needed can help with app performance. Here, once each index word note has found its source page references we don't need to keep running that code. I figure for a real-world project the data could be large, so it makes sense to take these considerations on board from the outset even though my test only used 3 pages of source data.

Whilst I know nothing about indexing, I'm pretty impressed at where I got to quite quickly with TB. I guess I took about an hour (much less if I discount the $Name(that) issue that had me stumped for a good while). I'd post my test file but it's both too messy and the source text data isn't mine to share. Still, I hope this encourages others. Whilst I'm sure there are specialist indexing tools, it strikes me TB can do a lot of the basics.

[edit][Later] ** That limitation applies as at v5.11.2 but I understand a fix for it is already tested and so subsequent releases should allow agent-based use of this technique.[/edit]

Title: Re: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 11th, 2012, 9:27am

A lot of food for thought there, Mark! Thanks for taking the time to experiment.

In terms of workflow, I'm not really looking at importing documents at this stage. From what I've seen of other indexers, and in my own experience as I practice, it's actually more convenient to have either a print copy of the book, or a PDF that can be highlighted. I prefer a print version myself, so that I can write notes on the margins and highlight as I read.

For page numbers, I'd either just use the text field of the note, or create a string attribute; that way the format would come out the way I wanted.

I'll be doing more practice today, so I'll post further thoughts later on.

And if you're interested in seeing what indexing software actually does, you can download a free trial of Cindex (it will do a maximum of 100 entries), which runs on Mac.

Title: Re: Tinderbox for Indexing Work?
Post by Mark Anderson on Oct 11th, 2012, 10:30am

I'd agree that pure machine indexing won't have the semantic nuance of a human, but reading paper copy doesn't negate doing some heavy lifting digitally. My demo above was just an exploration of the possible. Of course, in my earlier method, instead of using a digital source the per-page notes might just contain index terms (perhaps use underscore joins for source phrases - TB can easily automate this). You could then collect all the words/phrases from the notes and use those in the above method instead of a mechanically derived word list. Even then the result only adds to, rather than replaces, manual assignment of index terms.

Title: Re: Tinderbox for Indexing Work?
Post by Rosemary on Oct 11th, 2012, 1:11pm

Hi Marisa,

Sorry to be late jumping into the discussion about using Tinderbox for indexing.  

In the past I've used a Filemaker flat file database to generate and edit entries, exporting the results as tab-separated text to some small programs I wrote for formatting.  Recently, however, I've been experimenting with the use of Tinderbox for index creation - after all I use Tinderbox for everything else in my life.  I find, unsurprisingly, enormous riches in Tinderbox that simply don't exist in the simpler database format.

There's a lot of detail I can go into, but first you might find an article I wrote some years ago about indexing to be useful:
http://www.cs.brown.edu/~rms/IndexingPrinciples.html

Would you describe the parameters of your current indexing job, i.e., size of book, target audience, domain, maximum number of pages for the index, time frame?  That will give me some sense of scope.

Thanks,

Rosemary

Title: Re: Tinderbox for Indexing Work?
Post by Marisa Antonaya on Oct 11th, 2012, 1:50pm

Hi Rosemary,

Thank you for jumping in! I'd actually had a look at your article when Mark Bernstein mentioned your name at the beginning of this thread; lots of things to think about there, especially in the deep nesting section.

My current job is an index I volunteered to do for a regional government agency my husband works for. They recently published a policy document in response to a development plan in the region by the provincial government. The document has no index, and I suggested one because I felt it might make a good resource for researchers interested in the region. The 140-page document discusses the current state of the region, what the provincial plan would mean for its residents, and alternatives to this plan. Nothing too technical, though there are many mentions of previous treaties and agreements.

There's no deadline, since I offered to do it for practice (I'm taking a training course, and have reached the point where I need to do something like this), so I have time to experiment with Tinderbox. I'd checked out Cindex, but decided to use Tb instead.
