Welcome, Guest. Please Login
Tinderbox
  News:
IMPORTANT MESSAGE! This forum has now been replaced by a new forum at http://forum.eastgate.com and no further posting or member registration is allowed. The forum is still accessible via read-only access for reference purposes. If you wish to discuss content here, please use the new forum. N.B. - posting in the new forum requires a fresh registration in the new forum (sorry - member data can't be ported).
  HomeHelpSearchLogin  
 
Pages: 1 2 
Send Topic Print
Practical tips for spreadsheet import (Read 13946 times)
Sumner Gerard
Full Member
*
Offline



Posts: 359

Practical tips for spreadsheet import
Mar 19th, 2013, 12:37pm
 
[There's an export section but no import section, so I'm guessing this is the appropriate place to post this].

Spreadsheet import is great. Paste data onto a Tinderbox view and Tinderbox automatically creates new notes with key attributes corresponding to the header row labels of the source data.

Take, for example, this data arranged in tab-separated-values (tsv) format:

  Name<tab>MyString<tab>Text<newline>
  Note1<tab>Some value A<tab>This is text of Note1.<newline>
  Note2<tab>Some value B<tab>This is text of note2.<newline>


When typed into an editor and pasted onto a Tinderbox view this results in two notes with $MyString already set up as a key attribute and the values appearing where expected. NB: <tab> and <newline> here mean typing the actual tab and return keys.

What to do, though, if the text (or any other column) contains within it tabs or newlines? Say source text in one or more of the rows is itself two lines, something like:

  This is <tab> nice.<newline>
  But I don't like it.

That confuses the import parser, which will think that the imbedded tab is telling it to go to the next column (attribute) and that the newline within the text means start another row (note). So the result is a mess.

One way to handle this (I'm sure there are others) is to replace the tabs and newlines within the source text column with temporary placeholders, say _HT_ and _LF_, paste the data as usual onto a Tinderbox view, and then, if all has gone well, in Tinderbox replace the placeholders in the text with their "real" equivalents. This is easier than it sounds, quite simple to automate at both ends.

So the source, after swapping in the placeholders, will look something like this:

  Name<tab>MyString<tab>Text<newline>
  Note1<tab>Some value A<tab>This is _HT_ nice._LF_But I don't like it.<newline>
  Note2<tab>Some value B<tab>This is text of note2.<newline>  


And after the import into Tinderbox, the contents of $Text can easily be restored to resemble the original source text with:

  $Text=$Text.replace("_HT_","\t").replace("_LF_","\n")

In some cases source data still won't import in Tinderbox even though the data pastes correctly into Excel and perhaps other programs. One likely cause, as discovered here, is messy free-form text in the source that has typos such as double quotation marks, which Tinderbox in its current version seems to expect to come in matching pairs whereas Excel is more lenient (I'm talking tsv only, not csv). In that case, try replacing all double quotations within free-form text with a placeholder too, say _QQ_.

Then after the import into Tinderbox, restore the text to resemble the source with:

    $Text=$Text.replace("_HT_","\t").replace("_LF_","\n").replace("_QQ_","\x22")


See here for Mark A's explanation of "\x22".
Back to top
 
« Last Edit: Mar 19th, 2013, 9:53pm by Sumner Gerard »  
  IP Logged
Mark Anderson
YaBB Administrator
*
Offline

User - not staff!

Posts: 5689
Southsea, UK
Re: Practical tips for spreadsheet import
Reply #1 - Mar 19th, 2013, 6:34pm
 
Thanks for this, I'm sure this will helps others ploughing this same furrow.

Point taken re an import forum. Actually, his forum is good as any for import and it involves action to do the heavy lift!
Back to top
 
 

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
WWW shoantel   IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #2 - Mar 21st, 2013, 10:44pm
 
Inspired by this forum discussion started by rtalexander, below is a Python script that "web scrapes" Tinderbox forum search results and creates a "tsv" file that can simply be dragged onto a Tinderbox view to automatically create notes. Forum content can then be rearranged, annotated, searched, colored, mapped, timelined and otherwise explored to one's heart's content. Leveraging Tinderbox's spreadsheet import, this general approach can be useful in other situations that require getting data into Tinderbox from web pages, xml files, etc.

# Setup --
1. Install the BeautifulSoup library if you don't already have it. Free and easy. Nicely documented. That's the only "dependency" here, unavoidable. Python itself comes pre-installed on the Mac.
2. Save the Python script with a .py extension where desired.

# Usage --
1. Do a Tinderbox forum search for a desired topic.
2. With the search results displayed in the browser, "Save as" the page source (i.e., not as a web archive), noting the file location.
3. Open the saved source file in an editor like TextEdit, TextWrangler, Taco, etc and (if necessary) resave it as utf-8 encoded plain text.
4. Run the script. I do that from Terminal with $ python ~/path/to/myscriptname.py. The script will prompt for navigation to the source file location.
5. If all goes well the script will create a ".tsv" file in the same folder as the source file. Using the Finder drag that file (no need to open it, but be sure to get the right one, the one with a .tsv extension) onto a Tinderbox view. Multiple notes should appear in Tinderbox with attributes all set up and values in the right place.
6. In Tinderbox run this action code with a stamp (or container $OnAdd, etc):

       $Text=$Text.replace("_HT_","\t").replace("_LF_","\n").replace("_QQ_","\x22")

# What it does --
The script iterates through all tables of the right class (i.e. ignoring those tables in the html that don't contain forum posts) and, with BeautifulSoup doing the heavy lifting, extracts the name, forum location, date, writer, url, and text of each forum post, substituting placeholders for special characters where necessary. Then it puts the data into a Tinderbox-friendly tab-separated-value format using some of the practical tips described here.

# Limitations --
Links *within* the text of forum posts are lost, but can be recovered manually if needed by clicking $URL and checking the original post on the forum. With BeautifulSoup it should be possible to import those links too, but that's above my pay grade.

This particular script only works for pages displaying results of a *search* of the Tinderbox forum. The html tables in pages displaying not search results but instead, say, a particular thread, are slightly different, thus requiring a few scripting changes.

Handling unicode in Python apparently confuses even those with far more experience than a rank novice like me working a simple script like this.  See a veteran's vivid account of the need for a "unicode sandwich" to avoid engaging in whack-a-mole. Lots of moles whacked in this little example!


Hope to hear how this works on other machines, plus questions, corrections, improvements, etc.

Script follows in next post ...
Back to top
 
« Last Edit: Mar 22nd, 2013, 12:40pm by Sumner Gerard »  
  IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #3 - Mar 21st, 2013, 10:48pm
 
... And here is the Python script that creates the tsv file for import... If it immediately generates a syntax error (after installing BeautifulSoup) double-check that the indents came over correctly into your editor. Spacing in code posted here looks a little distorted sometimes, and Python can be fussy about indents!

Code:
# -*- coding: utf-8 -*-
# http://www.eastgate.com/Tinderbox/forum//YaBB.pl?num=1363711078/0#3
import codecs # standard Python 2.7.2 default
from Tkinter import Tk  # standard
from tkFileDialog import askopenfilename # standard
from bs4 import BeautifulSoup # download http://www.crummy.com/software/BeautifulSoup/
Tk().withdraw # suppress full gui
path = askopenfilename()
raw = codecs.open(path, encoding='utf8').read()
raw = raw.replace(u'<br />',u'_LF_') # placeholder for YaBB breaks- note space in tag
soup = BeautifulSoup(raw)
def tsvify(s):  # insert placeholders for tsv special chars
	s = s.replace(u'\t',u'_HT_').replace(u'\n',u'_LF_').replace(u'\n',u'_LF_')
	s = s.replace(u'"',u'_QQ_') #  double quotes - Tinderbox-specific
	return s
# col_heads map to Tinderbox attributes - Name must be first:
col_heads = ['Name','ForumLoc','PostDate','Participants','URL','Text']
tsv_data = unicode.join(u'\t',map(unicode,col_heads)) + u'\n'
for a_table in soup('table', class_="bordercolor"):  # list tables filtered by class
	for a_link in a_table('a')[2:3]:                 # slice 3rd link in table
		post_name = a_link.text.strip()              # ---------------> $Name
		post_link = a_link['href']                   # ---------------> $URL
	for a_link in a_table('a')[1:2]:                 # slice 2nd link in table
		forum_loc = a_link.text.rstrip()             # ---------------> $ForumLoc
	table_rows = a_table('tr')                       # list rows in table
	for a_row in table_rows[0:1]:                    #   slice row 1 in table
		row_lines = a_row.text.split('\n')           #   list lines in row 1
		for a_line in row_lines[3:4]:                #   slice line 4
			dt = a_line.strip()[4:]                  #     slice char 5 on
			post_date=dt.replace('th','').replace('rd','')  # --------> $PostDate
	for a_row in table_rows[1:2]:                    #   slice row 2 in table
		post_partic = a_row.text.strip()             # ---------------> $Participants
	for a_row in table_rows[2:3]:                    #   slice row 3 in table
		post_text = tsvify(a_row.text.strip())       # ---------------> $Text
	# in same order as col_heads above:
	col_vals = [post_name, forum_loc, post_date, post_partic, post_link, post_text]
	data_row = unicode.join(u'\t',map(unicode,col_vals)) + u'\n'
	tsv_data = tsv_data + data_row                   # add row, start next table

# write data to 'tsv' file
file_data = tsv_data.encode('utf8','replace')
f = open(path +'.tsv','w')  # in same folder as source file
f.write(file_data)

# print instructions to terminal
print '\n' + '*' * 60 + '\n'
print 'DRAG THIS FILE (NOTE THE TSV EXTENSION) ONTO A TINDERBOX VIEW:\n'
print  path +'.tsv\n'
print 'THIS SHOULD PRODUCE MULTIPLE NOTES IN TINDERBOX.\n'
print 'IF ALL IS WELL, RUN THIS ACTION IN STAMP (OR CONTAINER ON ADD, ETC):\n'
print '$Text=$Text.replace("_HT_","\\t").replace("_LF_","\\n").replace("_QQ_","\\x22")\n'
print '*' * 60 + '\n'

# uncomment the following three lines to send data to Terminal for copy/paste:
# print '\n\n' + '*' * 50 + '\nCOPY TO CLIPBOARD WHAT FOLLOWS\n' + '*' * 50 + '\n'
# print tsv_data
# print '\n' + '*' * 50 + '\nCOPY ALL TEXT FROM INDICATED AREA ABOVE\nTHEN PASTE ONTO TINDERBOX VIEW\n' + '*' * 50 + '\n' 

Back to top
 
« Last Edit: Mar 21st, 2013, 11:19pm by Sumner Gerard »  
  IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #4 - Mar 31st, 2013, 11:12pm
 
Here is another script. This one converts the currently visible page of a forum thread (as opposed to forum search results in the previous script) into tab-separated-values (tsv) for Tinderbox "spreadsheet import."

#Usage:
1. Run the script (it assumes BeautifulSoup has been installed).
2. When prompted, drag the Tinderbox forum thread url from the browser onto the Terminal window and hit <return>.
3. Copy the text where indicated from the Terminal window and paste onto a Tinderbox view.
4. After Tinderbox has automatically created the new notes together with any necessary user attributes, run the tsv placeholder cleanup action code in Tinderbox per post upthread.

#Notes:
The date-time format used by the YaBB forum can confuse Tinderbox. Hence the cleandate() function. This removes the 'Reply #d - ' prefix (fairly obvious, done here through regex). But it also must remove the st, rd, or th after the day (I forgot st in the previous script). Otherwise Tinderbox seems to guess that the year is always the current year, e.g. 2013.

#A question:
I haven't had success getting Tinderbox to import the correct forum timestamp. What particular time format does Tinderbox need for spreadsheet import?


Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# http://www.eastgate.com/Tinderbox/forum//YaBB.pl?num=1363711078
import codecs, urllib2, re 		# standard Python 2.7.2 default
from bs4 import BeautifulSoup 	# install from crummy.com/software/BeautifulSoup/
url = raw_input('Type, drag or paste in url of forum thread and hit <return>:\n')
raw = urllib2.urlopen(url).read().decode('latin1','replace') # YaBB charset=ISO-8859-1
raw = raw.replace(u'<br />',u'_LF_') # placeholder for YaBB breaks- note space in tag
soup = BeautifulSoup(raw)
def tsvify(s):  # insert placeholders for tsv special chars
	s = s.replace(u'\t',u'_HT_').replace(u'\n',u'_LF_').replace(u'\n',u'_LF_')
	s = s.replace(u'"',u'_QQ_') #  double quotes - Tinderbox-specific
	return s
def cleandate(s): # remove day suffixes that confuse Tinderbox parser
	s = s.replace('th','').replace('rd,','').replace('nd','').replace('st,','').replace(',','')
	s = re.sub(r'^Reply[^\n]*-\s','',s)   # regex to remove 'Reply #d - ' (if present)
	return s
# Column headers map to Tinderbox attributes - Name must be first:
col_heads = ['Name','ForumLoc','PostDate','Participants','URL','Text']
tsv_data = unicode.join(u'\t',map(unicode,col_heads)) + u'\n'
for a_table in soup ('table')[4:5]: 				# slice 5th table - no filter
	for a_link in a_table('a')[2:3]: 		        # slice 3rd link in table
		forum_loc = a_link.text.strip()				# -------> $ForumLoc
for a_table in soup('table', class_="bordercolor"):	        # list tables filtered by class
	for a_link in a_table('a')[1:2]:			# slice 2nd link in table
		post_partic = a_link.text.strip()			# -------> $Participants
	for a_link in a_table('a')[2:3]:			# slice 3rd link in table
		post_link = a_link['href']			        # --------> $URL
	for a_span in a_table('span')[2:3]:                     # slice 3rd span in table
		for a_bold in a_span('b')[0:1]: 		# slice 1st <b> in that span
			post_name = a_bold.text.strip()			# ---------> $Name
	for a_span in a_table('span')[3:4]:			# slice 4th span in table
		post_date = cleandate(a_span.text.strip())		# ------- > $PostDate
	for a_div in a_table('div')[2:3]:			# slice 3rd div in table
		post_text = tsvify(a_div.text.strip())			# --------> $Text
	# in same order as col_heads:
	col_vals = [post_name, forum_loc, post_date, post_partic, post_link, post_text]
	data_row = unicode.join(u'\t',map(unicode,col_vals)) + u'\n'
	tsv_data = tsv_data + data_row				# add row, start next table
# print instructions and data to terminal for copy/paste:
print '\n\n' + '*' * 50 + '\nCOPY TO CLIPBOARD WHAT FOLLOWS\n' + '*' * 50
print tsv_data
print '*' * 50 + '\nCOPY ALL TEXT AS INDICATED ABOVE'
print 'THEN PASTE ONTO A TINDERBOX VIEW\n' + '*' * 50 + '\n'
print 'After import into Tinderbox, clean up tsv placeholders there with:\n'
print ' $Text=$Text.replace("_HT_","\\t").replace("_LF_","\\n").replace("_QQ_","\\x22")\n' 

Back to top
 
« Last Edit: Apr 1st, 2013, 3:54pm by Sumner Gerard »  
  IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #5 - Apr 1st, 2013, 4:47pm
 
Quote:
I haven't had success getting Tinderbox to import the correct forum timestamp. What particular time format does Tinderbox need for spreadsheet import?

Well, in the extremely rare absence of any guidance, more stabs in the dark on my setup *seem* to lead to to the following conclusions:

With original source like this to be imported into a date type attribute:

  Reply #3 - Mar 21st, 2013, 10:48pm

Tinderbox will get the month and day right but can get confused about the year and time when this is trimmed to:

  Mar 21st, 2013, 10:48pm

But will get the year right too when the 'st' is removed while remaining confused on the time:

  Mar 21, 2013, 10:48pm

But will get the time right too when the comma between date and time is removed:

  Mar 21, 2013 10:48pm

And still seems to get everything right when all commas are removed:

  Mar 21 2013 10:48pm

So it *seems* (as I've now incorporated into the script) that when importing date/times into a Tinderbox date-type attribute always removing 'st,' or 'nd' or 'rd,' and all commas is the most reliable approach, and that there is no need to convert yesterday and today as Tinderbox recognizes these.

I looked at such documentation as I could find on date-type data but couldn't really tell how it might apply to tsv import. Are my conclusions here accurate?  
Back to top
 
« Last Edit: Apr 1st, 2013, 4:52pm by Sumner Gerard »  
  IP Logged
Mark Bernstein
YaBB Administrator
*
Offline

designer of
Tinderbox

Posts: 2871
Eastgate Systems, Inc.
Re: Practical tips for spreadsheet import
Reply #6 - Apr 1st, 2013, 4:51pm
 
Removing ordinals (1st, 2nd, 3rd) is indeed essential.

The *best* format will always be the system "short" format as specified in System Preferences.  Tinderbox does its best to part common alternative formats, as well as terms like "today" and "yesterday".
Back to top
 
 
WWW   IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #7 - Apr 1st, 2013, 5:31pm
 
Thanks for the instantaneous clarification!  In ML 10.8.3 deep in preferences (System Preferences/Language & Text/Region) I see a "Short" within "Customize..." for "Dates", and a separate "Short..."  within "Customize..." for "Times".  There doesn't seem to be displayed anywhere there one "short" format that combines the two.  So the *best* format, I gather, is to use those two "Shorts" (without any ordinals) *and also* be sure to remove any comma that the source (an example being the Tinderbox Forum YaBB software) may have inserted between date and time and replace it with a space if there isn't already a space.

(It turns out my "short" formats are indeed set to something completely different from the Tinderbox forum format but Tinderbox still seems to recognize the source easily, if the ordinals and comma between date and time are first removed).
Back to top
 
« Last Edit: Apr 1st, 2013, 9:55pm by Sumner Gerard »  
  IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #8 - Apr 1st, 2013, 9:57pm
 
On my machine the correct date "short" format and time "short" format, if separated by a comma, still result in an unexpected value for time.  

So it seems the most critical datetime formatting steps prior to spreadsheet import are:

1) remove any ordinals (st,nd,rd,th), and
2) have only a space or spaces (no comma allowed) between date and time.

Aside from those two (undocumented, I think) requirements, Tinderbox is very smart and flexible about recognizing when a date type attribute is needed and populating it with the intended date and time.
Back to top
 
« Last Edit: Apr 1st, 2013, 9:58pm by Sumner Gerard »  
  IP Logged
Mark Anderson
YaBB Administrator
*
Offline

User - not staff!

Posts: 5689
Southsea, UK
Re: Practical tips for spreadsheet import
Reply #9 - Apr 2nd, 2013, 6:15am
 
The underlying process can be simplified as this: give TB a format where it doesn't have to guess unduly. As has Mark B has pointed out TB's effective 'default' input is your (i.e. the current user's) Mac's 'short' formats for date and time. As Sumner has noted the date and time are set separately in the OS, but the setting are there and can be verified - or even changed. On my UK Mac, the short format date is dd/mm/yyyy and the short format time is hh:mm. Thus in parsing a string input for a date or date+time I presume it looks for a "dd/mm/yyyy" format sub-string and a possible "hh:mm" one as well. If the date is found but not time, TB inserts Mac's current OS time for the time element.

So the aim in the script, might I suggest, is to pass data to your Mac in the form used by your Mac's 'short' date (and time). You can choose leave TB to guess, but if so don't complain if it guesses wrong! It matters less what the OS data settings are but rather that the Python script knows how to create them. This means, too, that users in different locales may need to adjust part of the script so the date components are correct for their system - e.g. a script written for a mm/dd US system likely won't work for anyone outside the US as their systems will use dd/mm order and TB on their system will assume likewise. Also some countries use the old AM/PM time suffix and others the 24-hr clock.

Thus to take data like [Reply #\d+ - ]Mar 5th, 2013, 10:48pm and put it in a form TB understands I would make my external script:
  • Match and strip the optional replay preamble"^Reply #\n+ - ". This helps set up for the remaining steps.
  • Match exactly three consecutive letters "\w{3}" - the month - and turn them into a zero-padded number, e.g. 'Mar' -> '03'. This is the month. If you've stripped the preamble you could match "^\w{3}" as these now come at the start of the string.
  • Match one/two numbers followed by two letters, strip the letters and zero pad if one digit, e.g. '5th' -> '05'. This is the day.
  • Match 4 consecutive digits, "\d{4}". This is the year.
  • Match the last 7 characters, i.e. ".{7}$" to collect the hours, minutes and AM/PM. If necessary transform the hours to 24-hr clock
  • Reassemble the collected date segments (years, months, days, hours, minutes) for your locale. For my UK system, the above would need to return "05/03/2013 22:48". For a US system it would need to return be something like "05/03/2015 10:48 PM". Note: This is the step users will need to 'localise' for their own OS settings.
This way, there's no ambiguity once the data reaches Tinderbox.

YaBB forums allow for a number of date formats. Having looked, none are any more useful here than the default). The YaBB output date format would only change if the forum admin were to subsequently change the format setting - unlikely for most forums.

Later: as this page shows, date formats are anything but simple in an international context.

Quote:
Typo in suggested regex - I put \n for digit when I meant \d. An \n is in fact a marker for a new line character.
Back to top
 
« Last Edit: Apr 3rd, 2013, 8:27am by Mark Anderson »  

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
WWW shoantel   IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #10 - Apr 2nd, 2013, 1:37pm
 
Thanks for the great pointers on the regex. And what an interesting link on date formats! I now understand the delight and amazement of the Molière character upon realizing he had been speaking "prose" for 40 years without even knowing it. My particular short date format, adopted because it seemed logical and simple, turns out to be a "big-endian" "Basque Country" variation, which happens, it seems, to be "Mongolian" as well.  

And yet on my machine Tinderbox *still* imports correctly the Tinderbox Forum date and time in their original format stripped of the ordinals and the comma between date and time. It also imports "Yesterday at 5:31pm" perfectly (very happy not to have to script that, and instead just let Tinderbox do the work) though to avoid mangling "Yesterday" when replacing the ordinals I found in Python I needed to use:

Code:
.replace('rd,','').replace('st,','').replace('nd','').replace('th','').replace(',','') 



instead of the more obvious:

Code:
.replace('rd','').replace('st','').replace('nd','').replace('th','').replace(',','') 



Unfamiliar with the intricacies of datetime strings and all that, I found that avoiding (in this case removing) a comma between date and time was the least obvious ironclad requirement, deserving mention in documentation somewhere. On my machine, the short date format and short time format themselves can be exactly right, but a comma between them is guaranteed to sow confusion.
Back to top
 
 
  IP Logged
Mark Anderson
YaBB Administrator
*
Offline

User - not staff!

Posts: 5689
Southsea, UK
Re: Practical tips for spreadsheet import
Reply #11 - Apr 2nd, 2013, 3:26pm
 
I realised in my quick answer this morning, I forgot the today/yesterday branches.  But, those are easy to test for in your script as string literals and if found to simply replace, via python, as an appropriate day+month+year date after which you just need to pull out the time.

Quote:
I found that avoiding (in this case removing) a comma between date and time…


The risk with that is in making a hard assumption based on no real data as to how TB parses what might be a string containing a date.  In comparison, you know it will expect to do a correct match is the string matches your OS short date & time formats. After all the work put into the Python script, it seems to me to make much more sense to make it generate a date/time string in a format that Eastgate have confirmed will work - i.e. your OS short date and time strings. Otherwise you're doing all this work and then balancing precariously on an assumption that as a user can't reliably prove.

Thus, assuming this is your OS format, to pass a string like "05/03/2015 10:48 PM" which we can then regard as a known, supported format as opposed to "YaBB-minus-the-comma" which seems to work.

I'm not poking fun here, but simply replaying experience learned the hard way from past errors. Lessening the immediate work by using the easy but unproved shortcut, has invariably turned round to bite in inter-app work. IOW, the solution works perfectly … until the day it doesn't, leaving one none the wiser as to why and having to back up and write a better script. The later time spent diagnosing the error is better invested up front on a solution that's more provably correct.

Still, if the script is working for your immediate needs that's great and it only later readers of this thread who might perhaps consider a more worked solution.
Back to top
 
 

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
WWW shoantel   IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #12 - Apr 2nd, 2013, 6:07pm
 
All of these points are great advice. Taking the easy way out in the beginning can turn into the hard way in the end. "YaBB-format-minus-the-ordinals-and-minus-comma-between-date-and-time" *does* work on my current setup but may not on others; I have no data on that.

I now "grok" the regex (thanks for that) and it's a question of strengthening a tenuous grip on Python to implement the rest.  For me converting "Yesterday" and "Today" to the right dates is not as trivial a project as it would seem to experts, but I'm going to give it a stab and, if successful, will post... unless someone else already has a solution.

The "best" approach, I now understand, is to extract the various segments of the date and time and then reassemble the segments into machine-specific short date and short time formats, *and, crucially,* do not have any comma *between* those two known, supported formats before importing them together into a Tinderbox date-type attribute.

Those who truly understand datetime strings (or whatever the correct term is) probably *just know* that a comma has absolutely no business being between date and time even though a comma *can* belong *within* a date (especially if one's short date format has one). But for others like me a pointer to that effect in the documentation could save a lot of time and headscratching.

This has got me thinking. Mark B mentioned upthread that Tinderbox is designed to parse "common alternative formats." Where are those formats described?
Back to top
 
 
  IP Logged
Sumner Gerard
Full Member
*
Offline



Posts: 359

Re: Practical tips for spreadsheet import
Reply #13 - Apr 3rd, 2013, 12:12am
 
Here's a cleandate() function that should be substitutable into the latest script to convert the original "YaBB format" to the "YaBB-format-minus-the-ordinals-and-minus-comma-between-date-and-time" format it already produced then into any machine-specific 'short' date and time formats.

Code:
def cleandate(s):      # remove ordinals, commas; convert to system 'short' formats
	s = re.sub(r'^Reply[^\n]*-\s','',s)   # regex to remove 'Reply #d - ' (if present)
	s = s.replace('th','').replace('rd,','').replace('nd','').replace('st,','').replace(',','')
	if any(substr in s for substr in ['Today','Yesterday']):
		return s
	else:
		s=datetime.datetime.strptime(s,"%b %d %Y %I:%M%p")  # this does not change
		return s.strftime("%Y.%m.%d %I:%M%p") # a machine-specific short date time format
	# 	return s.strftime("%d/%m/%Y %H:%M")   # try this instead for UK-style 'short' formats
	# 	return s.strftime("%m/%d/%Y %I:%M%p") # or this for US-style 'short' formats
  



The % "directives" that need to go into the one "machine-specific" line are explained here.

This needs Python's standard datetime module, so the import line near the top of the script has to be changed to:

import codecs, urllib2, re, datetime

I don't have enough of a grip on Python yet to tackle "Yesterday" and "Today", so the script just passes those on to Tinderbox, which handles them as expected, no matter how I set my 'short' format on my machine.

For some reason I couldn't get the suggestion for the cleaner-looking regex for removing Reply # … to work so I had to leave in my ugly but functioning version. To my surprise, Python's strptime() and strftime() turned out to be so handy at extracting and manipulating the various components that the other regex wasn't needed here.

Would be curious to know if this works properly with British-style 'short' formats when uncommented where indicated in the code; it seemed to do the right thing on my machine when I temporarily changed my format in system preferences.
Back to top
 
« Last Edit: Apr 3rd, 2013, 12:15am by Sumner Gerard »  
  IP Logged
Mark Anderson
YaBB Administrator
*
Offline

User - not staff!

Posts: 5689
Southsea, UK
Re: Practical tips for spreadsheet import
Reply #14 - Apr 3rd, 2013, 8:24am
 
Quote:
into machine-specific short date and short time formats, *and, crucially,* do not have any comma *between* those two known, supported formats before importing

Yes, but no - I think your deduction re commas is wrong and it is the guessed-at assumption I was referring to up-thread. If, as per your last code alteration above, your script outputs a date/time string in the form your Mac uses for its short date and time then TB will (should!) have no problem in setting a Date-attribute value correctly. If your Mac's short date uses commas, then you'd want the script output date string to include a comma. Otherwise, a comma is a red herring.

I think the confusion has crept in because the task started with source data that happened to have a comma in it and the script was effectively trying to do the minimum tweaking to the source string so as to allow Tb to interpret it as desired. Put another way, if the YaBB source didn't have a comma you'd not be inserting one in your Python script.

The date formatting of the (YaBB) source data doesn't matter! You are using only to extract discrete year/month/day/hour/minute components from it so as to be able to generate a completely different format that you know matches your Mac's short date and time formats.  To concatenate the latter, simply use a single space (I suspect using spaces more won't matter but there's little point in doing so). Let's rewrite that as:

Ideally, for an input date/time string of a Tinderbox Date-type attribute, Tinderbox will look for data in the format of your Mac's System 'short' date and time format; such formats will vary by locale and user - so if in doubt check system preferences. If both date and time are supplied, leave a space character between the date and the time.

You ask:

Quote:
…that Tinderbox is designed to parse "common alternative formats." Where are those formats described?

I don't believe they are formally documented, if at all. Looking back to Tinderbox's early days it did far fewer things but on the upside could safely guess more about the input it received. IOW, you just typed in your 'code' and it would figure the result (no need to explicitly quote strings, etc.). Of course the app now does more and the apparent magic of the app guessing your intent - doubtless some fiendish regex work behind the curtain - doubtless becomes harder. Perhaps for v6 the range of 'known' formats should be documented somewhere - I assume they must be known for unit testing for TB builds.

Does this make the underlying issue clearer?



Sidenote. The left/right parentheses, comma, single/double quote and semi-colon all have some degree of delimiter function within TB. If TB is passed a string with comma in it, where is posits none, it might assume the comma is a deliberate delimiter between two parameter values. A semi-colon might imply the string is a list of strings, and so on. The user isn't expected to know or care about this and that is why simply ascribing good/bad to the presence of a comma is a flawed assumption in your script's design.
Back to top
 
 

--
Mark Anderson
TB user and Wiki Gardener
aTbRef v6
(TB consulting - email me)
WWW shoantel   IP Logged
Pages: 1 2 
Send Topic Print