Anyone got a good text-parser for tagged text?

Hey, Lazyweb! I've been making tons of notes from the various books I've been reading in preparation for writing my next novel, For the Win, which is about kids who work in special economic zones as gold-farmers forming a global trade union. Now I need a tool to help me manage the notes, and I figure someone out there must have already built it, though I can't find it.

The notes look like this:

Tuile a house -- knock it down, without the occupant's permission, as when the government takes down a house in punishment for violating the one-child policy. River 354. @china @idiom @corruption @authoritarianism

The form's simple: a note with the book and page reference (this one comes from Peter Hessler's excellent River Town), followed by some tags. Notes are separated from each other by a double carriage-return. All the notes are in a single text file.

I'm looking for something that'll parse out the tags at the end of the lines, make a tag-cloud out of them, and let me click on a tag to retrieve the notes that carry it, as well as search the full text of all the notes.

This is such a bog-standard way of using tags that I figure there must be something on the web that can handle it. Do you know of one? Discuss it in the comments below. Thanks!


Discussion

Take a look at this
#1 posted by jmck , August 12, 2008 5:39 AM

On the Mac, Taskpaper would do exactly what you want. For Linux etc., if you like vim, you could use the Taskpaper.vim syntax plugin which replicates some of the functions of Taskpaper. I haven't used the latter...

Take a look at this

If you find a tool you like which actually manages tag clouds (the database/search/render end of things), I could hack you a Perl script which converts your notes into whatever input format it prefers. (Free, natch) :-)


Take a look at this

Probably not exactly what you are looking for, but this is really simple to do in python and with a bit of trickery you can get it to parse a whole folder for notes. You could even get it to insert the results into a database.

##################################################
# Let's add some strings to the list
notes = ["string -- a piece of thread. River 354. thread @string", "cake -- basically raw awesome. Book of Cake. @jazz @camel"]

for index, note in enumerate(notes):  # For every one of the notes
    for word in note.split():  # For every space-separated word in the note
        if word.startswith("@"):  # Tags start with "@"
            print("Tag " + word + " found on note " + note + " which has an index in the array of " + str(index))
##################################################

This results in:

Tag @string found on note string -- a piece of thread. River 354. thread @string which has an index in the array of 0

Tag @jazz found on note cake -- basically raw awesome. Book of Cake. @jazz @camel which has an index in the array of 1

Tag @camel found on note cake -- basically raw awesome. Book of Cake. @jazz @camel which has an index in the array of 1

I'm sure heaps of people around here would be able to write such a script for you (myself included).
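
The "whole folder" idea mentioned above could be sketched roughly like this. It assumes the notes live in .txt files and are separated by blank lines; the function name find_tags_in_folder and the folder layout are made up for illustration.

```python
from pathlib import Path


def find_tags_in_folder(folder):
    """Return (filename, note_index, tag) triples for every @tag found.

    Assumes each *.txt file holds notes separated by blank lines.
    """
    found = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Split on blank lines and drop empty chunks.
        notes = [n.strip() for n in text.split("\n\n") if n.strip()]
        for index, note in enumerate(notes):
            for word in note.split():
                if word.startswith("@"):
                    found.append((path.name, index, word))
    return found
```

Calling find_tags_in_folder("notes") would then give one entry per tag, which is easy to feed into a database or a report.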

Take a look at this

This could be just what you need:

http://www.gnome.org/projects/tomboy/

I understand you use Ubuntu. This is available in the repositories.

Take a look at this

This won't help much since you already have created the notes, but you could consider TiddlyWiki in the future. It works with almost every browser and almost every OS.

See http://www.tiddlywiki.com/

Take a look at this

I was thinking about this again. The best way I can think of is to write a script that parses your notes for the tags and uploads the notes to a blank del.icio.us account. That way you get: full-text search, tag listing and tag clouds. Del.icio.us uses a terribly simple web API: http://delicious.com/help/api#posts_add

Such a simple script should only take 15 minutes tops.

Take a look at this

Writing a parser script in python, perl, ruby or whatever is close to trivial and if you settle for a list or a relational database that's easy.

A visual tag cloud is tricky. But if you find some app that specializes in such displays, converting your notes to that format is probably easy.
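
For a very plain cloud, one option is to emit HTML links whose font size grows with the tag count. The following is only a sketch under that assumption: the function name tag_cloud_html, the pixel range and the "#tag" link targets are all invented for the example.

```python
from collections import Counter
from html import escape


def tag_cloud_html(tags, min_px=12, max_px=36):
    """Render a list of tag names (with repeats) as a simple HTML tag cloud.

    Font size scales linearly with how often each tag occurs.
    """
    counts = Counter(tags)
    if not counts:
        return ""
    low, high = min(counts.values()), max(counts.values())
    spread = (high - low) or 1  # avoid dividing by zero when all counts match
    links = []
    for tag in sorted(counts):
        size = min_px + (counts[tag] - low) * (max_px - min_px) // spread
        links.append(
            '<a href="#{0}" style="font-size:{1}px">{0}</a>'.format(escape(tag), size)
        )
    return " ".join(links)
```

Anything fancier (layout, colours, weighting) is where the real work of a visual cloud goes.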

Take a look at this
#9 posted by Anonymous , August 12, 2008 6:33 AM

This may not be the best lead, but http://todotxt.com/ does something similar to this, and if you felt like hacking yours up custom it could be a good start.

Take a look at this

Well, I've got a tagcloud widget (written with gtkmm) (I haz t3h tagcloudz)

Parsing your text should also be a simple matter, either with Python, or with C++/boost string algorithm libraries. Adding a database for it should be close to trivial next to the parsing, and a GUI for it would take perhaps a (work-)day or so to get right.

Take a look at this

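#!/usr/bin/awk -f

# Lower-case a string and strip leading/trailing spaces.
# "s" is listed as an extra parameter so that it stays local to the function.
function trim(s0,    s) {
    s = tolower(s0);
    sub(/^ */, "", s);
    sub(/ *$/, "", s);
    return s
}

# For each input line: everything before the first "@" is the note text,
# and every later "@"-separated piece is a tag.
{
    n = split($0, tags, "@");

    for (tag_index = 2; tag_index <= n; tag_index++)
    {
        tagfilename = trim(tags[tag_index]) ".txt"

        # Append the note text, then blank lines as a separator, to the tag's file
        print tags[1] >> (tagfilename)
        print "\n" >> (tagfilename)
    }
}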

Take a look at this

This will sort your indexed text file into a load of other text files, eg.

china.txt contains all lines with an @china, police.txt contains all lines with an @police

cat input_text_file | ./coryparse

http://dl.getdropbox.com/u/27509/coryparse

I guess it'd be simple enough to make that HTMLified.

Take a look at this
#13 posted by JGB , August 12, 2008 6:56 AM

Years ago....

Mid 90's, I was responsible for taking web generated email and doing much the same. We had something like 500 responses a week and my job was to offer information support.

Unfortunately, my "web gurus" at the time were very smug and resisted the request to offer me data that was clean of tags and put into proper form.

Since my manager felt very insecure about "challenging" our IT department (and I admit I always try to give IT latitude as well), it was my responsibility to find a solution.

I AM NOT A PROGRAMMER, but I used Word and a few self-designed macros. In the space of an afternoon, I was easily doing what everyone at our facility said was impossible. Made no friends that day.

Wonderful and easy tailor made solution.

Good luck,

-JGB

Take a look at this

@penjuin had an interesting idea w/ posting to a delicious account.

If you already have Wordpress set up somewhere you could use Wordpress instead (google: perl post wordpress). A benefit of using Wordpress instead of delicious is that the data could live on your server instead of Yahoo's servers.

Take a look at this
#15 posted by Anonymous , August 12, 2008 7:11 AM

maybe this? - http://chir.ag/phernalia/preztags/

it's time line based, not sure if that will get in the way, but worth a look

Take a look at this

Just learn regular expressions :P
http://xkcd.com/208/

Take a look at this

be glad you don't have to have the system email the data to your own webmail, create a cronjob to copy the whole inbox to a public folder and delete the current contents daily, wget the file back to the same system, then decode the base64 email to ascii so you actually have some plain text to awk/sed/grep through.

still, a modded firmware sorted that lot out. now my call stats and dialplan changer is much more complicated. ...i mean efficient. :o)

Take a look at this

So, umm, does anyone have any suggestions if you have a similar collection of notes, in block print, on a wide variety of bar napkins, coasters, grocery store receipts, old paychecks, and bookmarks? :)

Take a look at this

evernote? devonthink pro (mac)?

sorry...i don't know what goes on in the world of linux...but the above two help me to organize all the web pages, research, and clippings that i use to write.

Take a look at this

i'll second the 'learn regex' suggestion

i haven't had coffee yet, but i think something like this...

(.*)\.\s?([\w\s]*)\s(\d+)\.@(\w*)*

would do
1 - quote
2 - book
3 - page
4+ - tags
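
A variant of that pattern, as a Python sketch rather than a tested drop-in: it assumes every note ends with "Book Page. @tag @tag ...", that the book name contains no full stop, and that there is whitespace before the tags. NOTE_RE and parse_note are names invented here.

```python
import re

NOTE_RE = re.compile(
    r"^(?P<quote>.*)\.\s+"       # note text, up to the full stop before the reference
    r"(?P<book>[^.]+?)\s+"       # book name (no full stops allowed in this sketch)
    r"(?P<page>\d+)\.\s+"        # page number followed by a full stop
    r"(?P<tags>(?:@\w+\s*)+)$",  # one or more @tags at the end
    re.DOTALL,
)


def parse_note(note):
    """Split one note into quote, book, page and tags; return None if it does not fit."""
    match = NOTE_RE.match(note.strip())
    if match is None:
        return None
    return {
        "quote": match.group("quote"),
        "book": match.group("book"),
        "page": int(match.group("page")),
        "tags": re.findall(r"@(\w+)", match.group("tags")),
    }
```

Notes that don't follow the pattern come back as None, so they can be listed and fixed by hand.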

Take a look at this
#21 posted by Bloo , August 12, 2008 7:36 AM

I'll barter a conversion effort (your file => html files with the tag cloud links) for a copy of the book when it comes out. I'll give you the source code if you need or want it, but warn you that conversions are usually "one off" (i.e. work is required to repurpose the program later) and sometimes I use Rexx which may not be a language available on your platform although it has been ported to many.

Take a look at this
#23 posted by Anonymous , August 12, 2008 7:45 AM

It may be overkill for just this simple task, but if you're looking for an editor to create, organize, and search through these sorts of notes, emacs org mode does it out of the box, in a very simple way (the files you end up with are still simple text files, and can be read/edited as such.)

the org mode home page:

http://orgmode.org/

and an excellent tutorial that shows off some of its amazing features at a recent google tech talk:

http://orgmode.org/GoogleTech.html

-- eric casteleijn

Take a look at this

This shouldn't be a very difficult web application. If I have some time this weekend, I'll put something together.

Take a look at this

Penguin @3 is right.
This is child's play in python. You don't even need regexes, just a few split methods coupled with a dictionary. And HTML renderers already exist for displaying the results.
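
Roughly, the split-plus-dictionary approach might look like the sketch below. It assumes notes are separated by blank lines and that tags are whitespace-separated words starting with "@"; index_notes is just an illustrative name.

```python
from collections import defaultdict


def index_notes(text):
    """Map each tag (without the '@') to the list of notes carrying it.

    Assumes notes are separated by blank lines and tags are
    whitespace-separated words starting with '@'.
    """
    tagged = defaultdict(list)
    for note in text.split("\n\n"):
        note = note.strip()
        if not note:
            continue
        for word in note.split():
            if word.startswith("@") and len(word) > 1:
                tagged[word[1:]].append(note)
    return dict(tagged)
```

The resulting dictionary gives both the tag counts (for a cloud) and the notes behind each tag.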

Provide a sample file w/ a few dozen entries and I'd knock out the needed code in a few minutes.

Frankly, for data entry, I'd use Zotero, and then dump to an open format. Zotero handles tags and does great work tracking bibliographic data.

Take a look at this

I use Sandy - http://iwantsandy.com/ - to email myself notes that look exactly like yours, same tagging system. I can then retrieve the notes on the web or via email; if I email Sandy with a subject line of *lookup @corruption*, I get an email back within a minute of all my notes with the @corruption tag. Sandy has lots of other features for calendaring and contacts, but I find it really shines best for miscellaneous short notes. Also, Sandy integrates with Jott, so I can leave myself a voicemail note and it gets typed up and filed for me as tagged text. The Jott transcription isn't perfect, but I find it helpful.

@anotheraaron - I think Evernote - http://www.evernote.com/ - is the only thing that can easily organize your napkins and coasters!

Take a look at this

Interesting.....I'm so playing with Evernote tonight when I get home. Thanks!

Take a look at this

Here's something really cheap and dirty: http://randomcrap42.googlepages.com/cloud.html

It's based on code found here: http://www.tocloud.com/javascript_cloud_generator.html

You paste your tagged text into the text box and generate a (rather rudimentary) cloud. Clicking on a tag displays the associated text. No idea how it'll work for large pieces of text, though. The source is all in the one html file. It's not pretty but it appears to work.

Take a look at this

cory, have you looked into tinderbox? it's pretty powerful and should be able to do what you want.

http://www.eastgate.com/Tinderbox/

Take a look at this

Perfect for perl, and O'Reilly has a nice short PDF on building tag clouds.

http://oreilly.com/catalog/9780596527945/index.html

Take a look at this

Yep, Python to the rescue:

Let's say you have a 'notes' file in your home dir, with one line per note, like this:

Tuile a house -- knock it down, without the occupant's permission, as when the government takes down a house in punishment for violating the one-child policy. River 354. @china @idiom @corruption @authoritarianism
foo bar baxz @quux @china

Just run python (comes by default on Ubuntu) and type:

notes = open('notes', 'r')
tagged_items = {}  # maps each tag to the list of note texts that carry it

for note_items in map((lambda n: n.split('@')), notes):
    for tag in note_items[1:]:
        tag = tag.strip()
        try:
            tagged_items[tag].append(note_items[0])
        except KeyError:
            tagged_items[tag] = []
            tagged_items[tag].append(note_items[0])

# to see notes tagged 'china':
tagged_items['china']

Take a look at this

Wow. This is a pile of really great suggestions. Don't know if Cory has found what he's looking for, but I definitely just got a lot better organized in my notes.

Take a look at this

Argh, it just ate my indentation. I should use the preview... Anyway, here it is:
http://dpaste.com/hold/71022/

Take a look at this

BTW, Cory, I know you want something to organize the notes you *already* have, but in Ubuntu you have the Tomboy applet installed. Just right-click on the desktop panel and choose 'Add to Panel...', and you will find "Tomboy Notes" there.

Take a look at this
#35 posted by Anonymous , August 12, 2008 1:43 PM

Taskpaper on the Mac. Although it's a To Do application, it stores everything as a text file and then parses tags prefaced with an "@"


Take a look at this
#36 posted by w000t , August 12, 2008 3:27 PM

There are some good suggestions here, but I'll add my 3 cents worth (inflation, dontchaknow) anyway. I doubt there's a ready-made solution that works as desired, so some scripting is going to be needed regardless. Since that's the case, a good idea might be to parse into a SQLite database (or csv to import to SQLite) - Firefox 2 & 3 have SQLite built in and the SQLite Manager add-on is really easy to use:

Manage any SQLite database on your computer. An intuitive hierarchical tree showing database objects. Helpful dialogs to manage tables, indexes, views and triggers. You can browse and search the tables, as well as add, edit and delete the records. Facility to execute any sql query. A dropdown menu helps with the sql syntax thus making writing sql easier. Easy access to common operations through menu, toolbars, buttons and context-menu. Export tables/views in csv or xml format.
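
To give an idea of the scripting involved, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (notes, tags, body, note_id, tag) and the function names are illustrative choices, not taken from any existing tool.

```python
import sqlite3


def load_notes(db_path, notes):
    """Store notes and their @tags in a SQLite database and return the connection.

    `notes` is a list of note strings; the schema is an assumption of this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS tags (note_id INTEGER, tag TEXT)")
    for body in notes:
        cursor = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
        note_id = cursor.lastrowid
        for word in body.split():
            if word.startswith("@") and len(word) > 1:
                conn.execute(
                    "INSERT INTO tags (note_id, tag) VALUES (?, ?)",
                    (note_id, word[1:].lower()),
                )
    conn.commit()
    return conn


def notes_for_tag(conn, tag):
    """Return the bodies of all notes carrying the given tag."""
    rows = conn.execute(
        "SELECT notes.body FROM notes JOIN tags ON tags.note_id = notes.id "
        "WHERE tags.tag = ? ORDER BY notes.id",
        (tag.lower(),),
    )
    return [row[0] for row in rows]
```

With the data in that shape, a query such as SELECT tag, COUNT(*) FROM tags GROUP BY tag gives the numbers a tag cloud needs, and the file can be browsed with any SQLite front end.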

Take a look at this

TiddlyWiki will definitely do you right for the simple task(s) you describe, especially as you then want to connect specific notes to each other, independently of the tags they share.

If you want a tool that goes a step farther than anything else, take a look at Ceryle (http://www.altheim.com/ceryle/). It's written by Murray Altheim (who did the modularization of HTML) and is intended not just for notes you've collected, but to establish the connections of a whole lot of people to each other and to events transpiring over the course of four or five books.

The visualization is very nicely done, btw.

Take a look at this

If you want something quick while pondering all these alternatives, your Linux may already include "agrep". It finds matching "records." E.g.,

agrep -d '$$' "@idiom;house" notes.txt

The "-d '$$'" means, "find records separated by double carriage returns." The second part means containing both @idiom and house. Here's a little tutorial that starts, "If you collect bibliographies, addresses, quotations, or other notes which you need to search..."

http://www.math.ufl.edu/help/tips/totw-Feb-23-1997
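
If agrep isn't installed, the same kind of record search can be approximated in a few lines of Python. This is a sketch of the idea, not a reimplementation of agrep: it only does plain, case-insensitive substring matching on blank-line-separated records, and search_records is a made-up name.

```python
def search_records(text, *terms):
    """Return blank-line-separated records that contain every term (case-insensitive)."""
    records = [r.strip() for r in text.split("\n\n") if r.strip()]
    wanted = [t.lower() for t in terms]
    return [r for r in records if all(t in r.lower() for t in wanted)]
```

For example, search_records(open("notes.txt").read(), "@idiom", "house") would return the notes that mention both.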

Sometimes I like really modest tools.

Btw to the people who said these three variations on the same thing:

"Such a simple script should only take 15 minutes tops."
"Parsing your text should be also a simple matter..."
"This is child's play in python."

As far as I can tell, this is the right encouragement to offer to someone who wakes up every morning thinking, "What programming language should I become really fluent in today!?" I.e. a very rare (not to say nonexistent) person.

Take a look at this

@novalis

Very nicely done. tips of the hat

Take a look at this
#41 posted by natch , August 12, 2008 8:34 PM

If you can't get the python one to run, here's a Perl one that actually works! Sorry, we $#@*!! Perl programmers can't avoid getting our digs in. ;-)

Good old Perl. Does more, fewer lines, more readable.

http://sial.org/pbot/31890

You do have to install a Perl module, HTML::TagCloud, as described in the link, to use this.

I love all the comments from people who say this should be easy. Saying and doing are two different things. Enjoy!

Take a look at this
#42 posted by natch , August 12, 2008 8:54 PM

Ouch. Found and fixed a bug - the html anchors had an extra character in one case, which would prevent links from being followed on some browsers.

http://sial.org/pbot/31891

Take a look at this

Natch, I'm an occasional perl hacker myself. But it's not exactly fair to say that your code does more, or that it is more readable.

I'm not sure your full-text search (that is, the browser's), is sufficient. I think it's probably not, since Cory specifically asked for that as a feature. That's where a lot of my code is going.

Your regular expression commenting leaves me a bit cold -- it tells me what each bit of the regex does, but either I already know that, or telling me won't help. I would rather see "# Match some text ($1) followed by a set of @-prefixed tags ($2)." Also, I think your regex mistakenly treats the domain part of email addresses as tags, because it does not require a tag to be preceded by whitespace.
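
To illustrate the whitespace point (in Python, not as a patch to the Perl script): a lookbehind can require that the "@" is not glued to a preceding non-space character. TAG_RE and extract_tags are names invented for this sketch.

```python
import re

# A tag is '@' plus word characters, and must not be preceded by a non-space character.
TAG_RE = re.compile(r"(?<!\S)@(\w+)")


def extract_tags(note):
    """Return the tags in a note, ignoring '@' inside things like email addresses."""
    return TAG_RE.findall(note)
```

With this pattern, the "@example" in an address like bob@example.com is skipped, while "@china" after a space or at the start of a line still counts.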

The other missing feature is that you have to re-run it every time you change the text file. That one should be easy to fix -- Linux::Inotify2 will do it for you, more or less.

Take a look at this

You're right Novalis, I missed the search part of the request. Good catches on the other points too.

That verbose regexp commenting isn't my normal style. Just thought I'd put it in for any newbies looking at the code.

Yours is hanging for me. Not sure why. No error message. This is with Python 2.5.1. If I figure it out I'll let you know.

It's great to read your code though and see how things are done in Python - it's sometimes mind-numbing reading about it in a book, and much easier to get something out of it by looking at code that is solving a similar problem to something I've just done. Thanks.

Take a look at this

@Novalis: great piece of code there! Although it is longer, I think that the python version is better than the perl one because it has less ^="/}*;(~{' going on. That is just my opinion though :).

Take a look at this

#26 mentions I Want Sandy, which is a great tool, as is Stikkit, made by the same folks, and possibly better suited to your needs. I use both, often! I know Stikkit has an API and if you feel adventurous you could probably hack something up to submit all your notes for you.

Take a look at this

Nice work, Novalis.

Take a look at this

Cory,

Zotero is a fantastic free-beer XUL tool (runs in Firefox as an extension, and can also be run in its own window) for managing all kinds of references. It does web bookmarks, web snapshots as well as standalone notes, which is what you need now.

It also has neat little tricks like "related items". For instance, you could import your bibliography into Zotero, and the notes would have that as a related bibliographical reference. This would be more important if you were writing nonfiction than fiction, but still nifty.

Zotero stores its database in sqlite, and I don't know how to import your format into it, but I am sure some of the other boingers will be able to do that in a couple of extra lines of python.

Take a look at this

My mistake above: I meant Zotero is free-speech (and free-beer also, but that's incidental). I should have stuck to 'libre' and 'gratis'.

Take a look at this
#50 posted by Anonymous , August 13, 2008 5:41 AM

Wow, not only a whole passel of great suggestions here, but many folks offering free purpose-built code!

What a lovely readership you folks have. :)=

This is the kind of thing that might make a good example in a talk, I would think. Gift economies, or crowdsourcing or some such.

Take a look at this

Natch, works for me under 2.5.1. Hate to ask, but are you sure it's hanging and not just serving? (that is, did you point your browser at it? And the throbber just spins? Or something else?). Can you hit ctrl-c, and paste the traceback?

P.S. If anyone downloaded it in the first ten or so minutes it was up, they should re-download, because the early version bound to 0.0.0.0 instead of localhost, and thus anyone who could connect to your machine could read your notes. New versions don't have this problem.

Take a look at this
#52 posted by Anonymous , August 13, 2008 8:27 AM

@Novalis This Python script works great for me on Windows, but on Linux it returns every note regardless of what is clicked (same for the search). Any idea why that might be?

Obviously I'm not good at Python. If I wanted to change the default note delimiter from two carriage returns to something else (for longer, multi-paragraph notes) like "----", where would that go?

Thanks!

Take a look at this

@#52, I hypothesize that you are using your Windows-based note file on your GNU/Linux system. On Windows, possibly, Python treats \r\n as \n, but on Unix-based systems, perhaps it does not. This would cause the code to not recognize the delimiter between notes. I have changed the code to recognize any line with only whitespace as a delimiter. See line 160 to change this.
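
The general idea of treating any whitespace-only line as a delimiter can be sketched as follows. This is not the code from the script under discussion, just an illustration of the technique; split_notes is a hypothetical name.

```python
def split_notes(text):
    """Split text into notes, treating any whitespace-only line as a delimiter.

    str.splitlines() handles Unix (LF), Windows (CRLF) and old Mac (CR)
    line endings alike, so a stray carriage return cannot hide a blank line.
    """
    notes, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            notes.append("\n".join(current))
            current = []
    if current:
        notes.append("\n".join(current))
    return notes
```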

Take a look at this
#54 posted by Anonymous , August 13, 2008 10:07 AM

Thanks for the speedy reply. At first I had copied the Windows file over (as you rightly presumed) but then I got the same idea as you, deleted the file, and built from scratch on Linux.

Then I went back and realized that I hadn't put an extra blank line in. That fixed it, thanks!


Would it be possible to use a different delimiter, such as "----" for multi-paragraph notes? Such as:

"
This is a long note, please use your imagination.

This is the second paragraph. @long @note
----
This is another long note.

With a second paragraph. @example @understood
----
"

Take a look at this

Just replace line 160 with
if not line == "----":
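
As a standalone illustration of that delimiter idea (again a sketch, not the script's actual code), a splitter that cuts only on lines consisting of "----" keeps blank lines inside a note, so multi-paragraph notes survive. The name split_on_delimiter and its default argument are assumptions.

```python
def split_on_delimiter(text, delimiter="----"):
    """Split text into notes on lines consisting only of the delimiter.

    Blank lines inside a note are kept, so notes may span paragraphs.
    """
    notes, current = [], []
    for line in text.splitlines():
        if line.strip() == delimiter:
            note = "\n".join(current).strip()
            if note:
                notes.append(note)
            current = []
        else:
            current.append(line)
    tail = "\n".join(current).strip()
    if tail:
        notes.append(tail)
    return notes
```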

Take a look at this

I'm a bit late to the game but I've written a Java version that will upload notes to http://del.icio.us

The compilation dependencies are a bit demanding but it's for my own personal learning experience. Hope it may be useful to others.

See screenshot

Thanks

Take a look at this
#57 posted by McD , August 27, 2008 6:43 PM

Seems like everyone has a way to solve this with their favorite tool or language. It took a little while to polish to a final version, but here it is: Perl FTW!

Take a look at this

This might no longer be watched by the person I want to see it, but it's worth a shot!

@Novalis: I tried out your program and it works beautifully for the most part, but I'm having a really curious issue with it. Namely, the search function at the top of the page stops working once the number of entries that fit the search criterion is greater than 17. Is there any way to fix this, do you know?

Thanks in advance,
Insomniac
