Tinderbox User-to-User Forum (for formal tech support please email: info@eastgate.com)

Message started by Derek Van Ittersum on Oct 30th, 2012, 2:53pm

Title: Importing Day One data
Post by Derek Van Ittersum on Oct 30th, 2012, 2:53pm

I've recently begun using the Day One app as a journal. It exports plain text easily enough, and I've been playing with getting it into Tinderbox. Before I dive in too deeply, I'm hoping to find some best practices for tagging and such to make subsequent Tinderbox analysis easy.

Here's an example of Day One export:

     Date:      October 15, 2012 2:29 PM


Testing the tagging and such.

     Date:      October 23, 2012 9:34 PM


# A Markdown heading

Day One uses markdown for formatting

     Date:      October 28, 2012 2:30 PM


More testing of tagging

The date line is indented four spaces. The rest of the posts come through as typed. There is no tagging in Day One as of yet; I'm just typing my own tags manually at the top of a post.

I've had some luck with explode by using BBEdit to do some light editing of the export file. First, I added a "$$" before every date line and used that as a delimiter to break notes by. I also removed the "Date:" text from the file. Then, I added a period to the "PM" or "AM", selected the first line as the title, and opted to remove the title from the text. That got me individual notes with titles like: "October 15, 2012 2:29 PM." and $Text of the remaining entry.

I tried to create an agent that would copy the $Name to $StartDate, but that ended up with all the notes renamed to "never" (action in the agent was: $Name=$StartDate).

I'd also like to try and pull out the @tag and create a user attribute for that data. I'm only planning to add one tag per post, so it can easily be a string, rather than set, attribute. I've no idea really of how to go about doing this.

Any other people using Day One? In conjunction with Tinderbox?

Any suggestions on working with explode (I'd like to learn more about this for other applications so advice on best practices would be appreciated)? Or pulling out data from the Text and Name?  


Title: Re: Importing Day One data
Post by Mark Anderson on Oct 30th, 2012, 3:07pm

$Name=$StartDate is setting the title to the start date. I think you meant $StartDate=$Name, in which case you're assuming TB will guess the correct date from your $Name string. I'm just heading out so can't test but try:

$StartDate = date($Name)

Here $Name stands in for the string argument of date(string).

As to agents, earlier today I posted an example in the Actions/Rules forums about using agents to harvest data. Must dash...

Title: Re: Importing Day One data
Post by Derek Van Ittersum on Oct 30th, 2012, 3:30pm

It never fails. Every time I post a question here, at least part of it is just a dumb error on my part. Will review your latest post in the Action forum to see what I can pull from it.

Title: Re: Importing Day One data
Post by Derek Van Ittersum on Oct 30th, 2012, 4:00pm

Ah, Mark you were right. Your explanation of the regex was enough for me to pull out the bit I needed.

So, for others, here's what I've done:

1. Created "date maker" agent with this action:
$StartDate = $Name

This puts the date in there perfectly. I found that $StartDate = date("$Name") was altering the date for some reason.

2. Created "tag maker" agent with this query:

and this action:
$Tags = $1

This takes the "@tag" value from the entry and puts "tag" in the $Tags attribute.

After reading through the post a couple more times, I'm still not sure how to get the agent to then delete the "@tag" text and double line break from the entry, so that the $Text begins with the first line of the entry (in the above example data, I'd like $Text to begin with "Testing the tagging and such" with no spaces or line breaks before it). I don't quite understand how to use the $1 $2 $3 values. Will review aTbRef shortly and post back if I have success ....

Title: Re: Importing Day One data
Post by Mark Anderson on Oct 30th, 2012, 8:23pm

A quick stab at this on my way to bed. I used your data, and then exploded it as described, then used an agent with the query:

inside(Exploded Text) & $Text.contains("(\A.*)\n{2}@([^\n]*)\n{2}(.*\z)")

And action:

$Tags = $2; $Text = $1+"\n"+$3

Want multiple tags per note? Separate each with a semi-colon. I changed your first note's tags to "@tag1;tag 4" and $Tags turned up as "tag1;tag 4". So, it works and spaces in tag names survive.

Another route is to avoid rebuilding $Text from back-references and use $Text.replace() instead. Here's a quick stab (with minimal testing). Use the query:


…and then this action:

$Tags = $1; $Text = $Text.replace("\n@[^\n]*\n\n","");

Tip: if manipulating $Text like this, be very careful to test first and keep back-ups. Also, having tested your code, stick rigorously to the layouts tested. Regexes can be very picky, so feeding them differently structured data can have unexpected results.

[edit]Admin's note. Moved to the Actions & Rules section[/edit]

Title: Re: Importing Day One data
Post by Mark Anderson on Oct 31st, 2012, 5:48am

It was a late finish last night and I left out an explanation of the replace regex: $Text.replace("\n@[^\n]*\n\n","");

The overall replace() is to replace instances of our regex pattern in $Text with nothing, thus removing them. However, if you remove just the tag data you end up with 3 blank lines - the one the data was on and those before/after it.

Side note: the line-spacing applies to @Derek's data, not all data. As said in my last post this aspect of work needs to tie in closely with your style of data layout. The more flexible you want your regex to be in detecting only the right stuff, the harder & more complex your regex will become (if even possible). This is a place where some self-imposed formatting constraints can make the data transform task much easier. Plus, test and test again before committing code to your live work if it actually removes data as with replace().

Back to the replace. The original match was "@(.*)". This works because in this instance the dot '.' match-all metacharacter doesn't match the line-end ('\n') character, although it does match other non-printing characters. Furthermore, the regex parser (under the hood here) can be set to make dot match an \n. As users we can't see that modifier, if used, and from experimentation I think it's set in some parts of TB but not others (I'd be very happy to have a more authoritative view on that). As a result, it may not be safe to assume .* will match just to the end of the current line (paragraph).

There's also an unstated assumption here that tags are on a new line with @ as the first character. Inattention in the source doc and putting a space before the @ would throw out my test replace as written.

So what does "\n@[^\n]*\n\n" match? It looks for:
- a line return
- a single '@' character
- zero or more successive characters that are not a line break
- two successive line returns

Let's address some shortcomings. Let's assume we don't know how many blank lines precede/follow the tag line but do know the number may vary. We also want to allow for data entry errors that put spaces before our @ tag marker. Thus we get a replace search pattern:

"\n+ *@[^\n]*\n+"

It looks for:
- one or more successive line returns
- zero or more successive space characters
- a single '@' character
- zero or more successive characters that are not a line break
- one or more successive line returns

You then need to adjust the replace to put back the desired number of line breaks. For @Derek's line-spaced data, that is two. So (tested in v5.11.2):

$Text = $Text.replace("\n+ *@[^\n]*\n+","\n\n")
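For readers who want to try this pattern outside Tinderbox, here is a minimal sketch using Python's re module as a stand-in. It assumes Python's regex dialect behaves like Tinderbox's for this pattern, which this thread does not confirm; the sample text and the function name strip_tag_line are made up for illustration.

```python
import re

# Pattern from the post: line breaks, optional spaces, '@',
# the rest of that line, then more line breaks.
TAG_LINE = re.compile(r"\n+ *@[^\n]*\n+")

def strip_tag_line(text: str) -> str:
    """Replace the tag line and its surrounding blank lines with one blank line."""
    return TAG_LINE.sub("\n\n", text)

# Hypothetical entry: three line breaks, an indented tag, two line breaks.
sample = "October 15, 2012 2:29 PM\n\n\n  @journal\n\nTesting the tagging and such.\n"
cleaned = strip_tag_line(sample)
print(repr(cleaned))
```

Because the pattern starts with \n+, a tag on the very first line of the text (with no line break before it) is left untouched in this sketch, so it is worth checking how your own notes begin.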

We could tidy a little more. The source is line-spaced as it comes from a plain text environment. But TB $Text allows word-processor-like paragraph spacing so we don't necessarily need a blank line as a paragraph marker. The last search regex has all the lessons we need; we now chain on a second replace to replace all sequences of more than one line break with a single line break:

$Text = $Text.replace("\n+ *@[^\n]*\n+","\n\n").replace("\n{2,}","\n")

The '\n{2,}' bit means match a sequence of two or more of the preceding character, in this case a line break. {2} would match a sequence of exactly two. {2,6} would match a sequence of 2-6 inclusive. {0,4} would match zero to 4 inclusive, etc.

We can simplify the last replace by moving the \n handling from the first replacement into the second (tested in v5.11.2):

$Text = $Text.replace(" *@[^\n]*","").replace("\n{2,}","\n")

To keep line-spaced paragraphs, consistently separated by one blank line, you'd make the last replacement string "\n\n".
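The two-step version can be sketched in Python as follows. This is only an approximation of the Tinderbox action, assuming the two regex dialects agree here; the function name tidy, its paragraph_break parameter and the sample text are hypothetical.

```python
import re

def tidy(text: str, paragraph_break: str = "\n") -> str:
    """Drop the '@tag' text, then collapse runs of 2+ line breaks."""
    # Step 1: remove optional spaces, '@', and the rest of that line.
    without_tag = re.sub(r" *@[^\n]*", "", text)
    # Step 2: replace every run of two or more line breaks.
    return re.sub(r"\n{2,}", paragraph_break, without_tag)

sample = "\n\n@journal\n\nFirst paragraph.\n\n\nSecond paragraph."
single_spaced = tidy(sample)          # one line break between paragraphs
line_spaced = tidy(sample, "\n\n")    # one blank line between paragraphs
print(repr(single_spaced))
print(repr(line_spaced))
```

Two things show up in this sketch: a line break remains at the very start of the text, and the first pattern matches any '@', so a line containing an e-mail address would lose everything from the '@' onwards.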

Some more to do, but I'll start a new post.

Title: Re: Importing Day One data
Post by Mark Anderson on Oct 31st, 2012, 6:58am

In the case of @Derek's data (remember your data may differ in structure), I think we can skip the BBEdit step.

An aside: this pre-cleaning in a text editor is a technique I use a lot - doing some cleaning before import can help a lot. Don't have BBEdit or a similar tool? Try TextWrangler, its free sibling app; it has all the features you might want for this sort of task.

Back on track, let's try the import without the BBEdit pre-edit. First we need to set up the imported data with our Explode markers (a "$$"). You could use a rule or agent, but you only need to do this once, so it is a good candidate for a stamp. So, I made a stamp called "Clean Imported Text" with this action:

$Text = $Text.replace("\tDate:\t","\$\$")

This replaces a tab+"Date:"+tab sequence with "$$". Note that in the replacement string you need to use '\$' because you want to insert a literal dollar sign. In this context '$' is a regex special character and not what we want.

Now, explode the note, on custom delimiter ($$ - no \ needed), opting to delete the delimiter and to use first paragraph (i.e. line) as the title.  Tip: using 'First sentence' doesn't work as expected if the first paragraph is one sentence and/or doesn't end in a full stop, question mark or exclamation mark.
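As a rough model of what the stamp plus Explode do to the text, here is a Python sketch. Explode is a Tinderbox feature and this is not its implementation; the function name explode, the tuple layout and the tab-delimited sample are assumptions made for illustration.

```python
def explode(text: str, delimiter: str = "$$") -> list[tuple[str, str]]:
    """Split text on the delimiter; the first line of each chunk becomes the title."""
    notes = []
    for chunk in text.split(delimiter):
        if not chunk.strip():
            continue  # skip the empty chunk before the first delimiter
        title, _, body = chunk.partition("\n")
        notes.append((title.strip(), body))
    return notes

# Hypothetical export, assuming tab + "Date:" + tab before each date.
export = (
    "\tDate:\tOctober 15, 2012 2:29 PM\n\n\nTesting the tagging and such.\n\n"
    "\tDate:\tOctober 23, 2012 9:34 PM\n\n\n# A Markdown heading\n"
)
marked = export.replace("\tDate:\t", "$$")   # the "Clean Imported Text" step
notes = explode(marked)
titles = [title for title, _ in notes]
print(titles)
```

The body of each note still starts with the blank lines that followed the date line, which is what the later replace() steps clean up.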

The data's exploded correctly. We now need to set $StartDate and add the full stop at the end of the title. For good measure we'll set a prototype called "pMyTask", which we've configured to show $StartDate and $Tags as key attributes. To our existing agent's action code (previous post) we now add:

$StartDate=date($Name);$Name=$Name+"."; $Prototype="pMyTask";

We do the title -> date conversion before adding the full stop (that's why @Derek had a problem using date($Name) before). Then we tidy $Name with a full stop at the end. Lastly we set the prototype. Above code tested in v5.11.2.

Important side note. Because we strip the tag data from $Text, the action results in the note no longer meeting the agent query. It is thus safe to write $Name=$Name+"." in the action. If the note were to still meet the query, the title would get an extra full stop every agent cycle - not what you want.
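The ordering point can be illustrated by analogy in Python. Python's strptime is much stricter than Tinderbox's date() and is used here only to show why the date is read before the title is decorated; the format string and the function name parse_title are assumptions based on the sample titles in this thread.

```python
from datetime import datetime

# Assumed format of a Day One title such as "October 15, 2012 2:29 PM".
DAY_ONE_FORMAT = "%B %d, %Y %I:%M %p"

def parse_title(title: str) -> datetime:
    """Parse a Day One style title; raises ValueError on extra characters."""
    return datetime.strptime(title, DAY_ONE_FORMAT)

title = "October 15, 2012 2:29 PM"
start_date = parse_title(title)  # 1. read the date while the title is clean
title = title + "."              # 2. only then add the full stop

try:
    parse_title(title)           # the decorated title no longer parses here
    reparse_ok = True
except ValueError:
    reparse_ok = False

print(start_date, title, reparse_ok)
```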

To recap the agent. Query:

inside(Exploded Text) & $Text.contains("@([^\n]*)")

Action:

$Tags = $1; $Text = $Text.replace(" *@[^\n]*","").replace("\n{2,}","\n");$StartDate=date($Name);$Name=$Name+"."; $Prototype="pMyTask";

I guess you could add a terminating full stop in $Text as well. In which case the action is:

$Tags = $1; $Text = $Text.replace(" *@[^\n]*","").replace("\n{2,}","\n");$StartDate=date($Name);$Name=$Name+".";$Text = $Text.replace("(AM|PM)\n","$1.\n");$Prototype="pMyTask";

Notice how the extra replace uses a back-reference as with the agent query. It is the only other place TB allows this trick, noting that in replace() the back-reference is to a match in the search string (the first string in the parentheses). Note also how the $1 back-reference is inserted inside the quoted replace string ("$1.\n").

Another side note. Had you imported/exploded the data leaving the "Date" part still in the source text, you could clean the title with:

$Name = $Name.replace("( |\t*Date:\t","");

The above finds: any number of spaces or tabs+"Date:"+any number of tabs. It then replaces the matched string with nothing, i.e. deletes it.

In closing I'd repeat my previous point that the code here depends on a number of assumptions about the layout of the source data; in this case being from the Day One app. You can do similar to this with data from other sources but almost certainly will need to adjust some regular expressions in the code to match the source.

Title: Re: Importing Day One data
Post by Derek Van Ittersum on Oct 31st, 2012, 8:00am

Wow, this is incredibly comprehensive and useful. No time today to review in full, but will certainly be exploring it in detail soon.

Mark--your explanations of the regex are really helpful. I'm finding it much less intimidating now and even beginning to understand it a bit. It will still be awhile before I'm writing my own, I think, but I've got enough down now I think to liberally grab what others have done and remix it somewhat.

Title: Re: Importing Day One data
Post by Sumner Gerard on Dec 19th, 2012, 11:43am

Really appreciate the valuable detail in this thread. I put the ideas in stamps. I like stamps because I can more easily test them on selected notes first and I only want the code to run once anyway. My import from Day One went well doing the following (I had no tags to extract):

Step 1. Paste the Day One exported text into a note.

Step 2. Prep the Explode
    Stamp action: $Text = $Text.replace("\tDate:\t","\$\$")

Step 3. Explode the note (Note menu)
    opting to delete delimiter $$, set title to first paragraph

Step 4. Set StartDate
    Stamp action: $StartDate=date($Name)

Step 5. Remove leading 2 lines, no longer wanted
    Stamp action: $Text = runCommand("tail -n+3",$Text)

Step 6. Set $Name to first paragraph
    Stamp action: $Text.contains("(^[^\n]*)");$Name=$1;
    Stamp action: $Name=$Text.paragraphs(1).replace("\n+$","")

Step 7. Truncate $Name to first 10 words (where needed)
    Stamp action: $Name = runCommand("cut -d' ' -f1-10",$Name).replace("\n+$","")


-- There *seems* to be a missing closing parenthesis in the last expression in Reply #6 above.

-- aTbRef seems to indicate here that regex backreferences for substring extraction or replacement can only be used in the context of an agent query. But they work well in my stamps too, so perhaps a little more elaboration there would help.


-- Are there more Tinderbox "native" ways to do 5 and 7?

-- Is the regex in 6 the way to grab the first line, or was I just lucky? What I believe it is doing is anchoring at the start of the line, then matching characters that aren't linefeeds until it reaches a linefeed character, which it conveniently does not include in the match, with the resulting match then referenced with $1. What is (I think) an equivalent paragraphs() approach leaves a trailing linefeed that has to be removed with .replace ("\n+$",""), which looks for one or more linefeeds (not sure whether really need that $ end-of-line anchor).

Title: Re: Importing Day One data
Post by Mark Anderson on Dec 19th, 2012, 12:39pm

@Sumner. I've not checked this (busy elsewhere) but it looks like in reply #6 I meant to write:

$Name = $Name.replace(" |\t*Date:\t","");

IOW the second '(' inside the quotes was a typo. If not, please give me a summary in the narrowest context so I can check just that part.
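One thing worth testing before relying on the corrected expression: in most regex dialects the '|' splits the whole pattern, so it reads as "a single space" OR "tabs+Date:+tab". The Python sketch below shows the effect; whether Tinderbox behaves identically is not confirmed in this thread, and the character-class variant is a hypothetical alternative, not something posted here.

```python
import re

title = "\tDate:\tOctober 15, 2012 2:29 PM"

# As written: matches EITHER one space OR tabs + "Date:" + tab,
# so every space in the title is deleted as well.
as_written = re.sub(" |\t*Date:\t", "", title)

# Hypothetical alternative: a character class keeps the space-or-tab
# choice local to the start of the pattern.
grouped = re.sub("[ \t]*Date:\t*", "", title)

print(repr(as_written))
print(repr(grouped))
```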

Backrefs in stamps? If they work, that's great but it's not documented and so I can't be sure. I simply can't see that far under the hood! Perhaps Mark B can comment?

Step #5. You want to remove the two leading blank lines, if present, in $Text?

$Text = $Text.replace("^\n\n","")

In a replace regex, the caret (^) means the start of the string.

Step #7. Same as above - just use a regex starting with ^ and then regex code to define 10 words. Watch for edge cases if your text includes things that aren't obvious 'words' - code, acronyms, etc. With the dot operators, .replace() has got pretty powerful.

Step #6. Skimming - yes I think your regex analysis is correct. paragraphs() includes the trailing line break as in that context you'd normally want it. As so often there's no one right way.

Title: Re: Importing Day One data
Post by Sumner Gerard on Dec 20th, 2012, 11:42am

**meant to write: $Name = $Name.replace(" |\t*Date:\t","");**

That seems right.  I was thinking there needed to be a set of inner parentheses for a back reference, but that's not the case here.

**Backrefs in stamps? If they work, that's great but it's not documented**

Backrefs seem to work in the .contains() and .replace() dot operators, whether the operators are in an agent or in the action code of a stamp. I was trying to find where use of backrefs is documented and found them described (without using the term back reference) in Tinderbox Help/Appendix 2. The example under "Parentheses" features an agent query, but doesn't suggest backrefs can only be used in agents. Are there other places they are documented (other than aTbRef)?

**You want to remove the two leading blank lines, if present, in $Text?**

Actually the two unwanted leading lines are not both blank. There's always one line with a date followed by a blank line, as in Derek's example. I couldn't figure out the regex to match that so it could be replaced by null.

**Step #7. Same as above - just use a regex starting with ^ and then regex code to define 10 words.**

So I gather a more "native" Tinderbox approach might be to use replace() with regex in the search string and a backreference in the replace string, as in Reply #6. I couldn't figure out the regex to grab the first 10 words of a paragraph or all the words of a paragraph if the paragraph is shorter than 10 words. Is that easy?

Title: Re: Importing Day One data
Post by Mark Anderson on Dec 20th, 2012, 7:08pm

It's late so not a great time for figuring regex, but I think the first 10 words are ^((\w+\s+){,10}) or possibly ^((\w+\s+){10}).

I think back references are discussed in the release notes.

To find the first two lines where one contains data and the second doesn't, I think you want ^.*[^\n]\n\n. Or if the dates in line one follow an exact form, e.g. NN/NN/NNNN, then you could try ^\d{2}/\d{2}/\d{4}\n\n.

Title: Re: Importing Day One data
Post by Sumner Gerard on Dec 21st, 2012, 10:34am

Many thanks for the pointers. They saved me from a lot of stumbling around. Finally beginning to "get" regex a little.

For matching the first 10 words, your second suggestion of ^((\w+\s+){10}) works here within the contains() operator in an agent query. In my stamp action for some reason \w doesn't seem to be recognized in this context. But the following works here, where the uppercase S means any non whitespace character:

Stamp action: $Text.contains("^((\S+\s+){10})");$Name=$1.replace("\n+$","")

To remove the first two unwanted lines by matching them and replacing the match by null your suggestion worked after adding an *.

Stamp action: $Text.contains("(^[^\n]*\n\n)");$Text=$Text.replace($1,"")
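For comparison, both steps can be sketched in Python. This is a sketch under assumptions rather than a translation of the stamps: the function names are hypothetical, the word pattern uses {1,limit} so that shorter titles also match (the earlier question about paragraphs under 10 words), and the header is removed by matching the pattern directly instead of feeding the matched text back into a second replace.

```python
import re

def first_words(text: str, limit: int = 10) -> str:
    """Return up to `limit` leading words, without trailing whitespace."""
    match = re.match(r"(?:\S+\s*){1,%d}" % limit, text)
    return match.group(0).rstrip() if match else ""

def drop_date_header(text: str) -> str:
    """Remove a leading line plus the blank line after it, once."""
    return re.sub(r"\A[^\n]*\n\n", "", text, count=1)

entry = "October 15, 2012 2:29 PM\n\nTesting the tagging and such."
body = drop_date_header(entry)
long_title = first_words(
    "one two three four five six seven eight nine ten eleven twelve"
)
short_title = first_words("short title")
print(repr(body), repr(long_title), repr(short_title))
```

If replace() treats its first argument as a regex, as the earlier posts suggest, then passing the matched text in $1 back as the search string could misfire when that text contains characters such as '.' or '('; matching the pattern directly avoids that question.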

I found some interesting things in the release notes by searching for "regular" but did not manage to find anything about back references.
