Thursday, April 24, 2008

Side Project: Automating Document Interestingness

For the last week and a half I have been hacking together a simple "document interestingness" application (while not making corrections to my PhD dissertation). Basically, the user outlines a set of sources (directories of PDFs, web pages, blogs, etc.) and the application builds a probabilistic model of the user's interests, which in turn is used to automatically assess the interestingness of artifacts. The application is currently restricted to an academic domain, addressing the specific question: is this paper interesting with regard to my research? Primitive, sure, but it's just a proof of concept.

My approach was rudimentary: context-insensitive extraction of whitespace-separated symbols (words), followed by modeling of the extracted text. Interestingness was computed through a series of calculations over word sets and some weighted vector algebra. I tested it on myself (naturally) using my own technical report archive. Models were serialised to disk and I created some visualisations (tag clouds in HTML and corpus graphs in JUNG). Relevance measures were as expected across my test datasets (papers from my field, related fields and unrelated fields). It's a minimal solution: all user interaction goes through the command line or the API, with some Swing for visualisations (yes, it's written in Java).
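A minimal sketch of that kind of whitespace-based word extraction and counting might look like the following. It is an illustration only, not the code from the application: the class name `TermCounts`, the method name `count`, and the lower-casing step are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not taken from the application described in this post.
final class TermCounts {
    // Split text on runs of whitespace and count how often each token occurs.
    // Lower-casing is an assumption here; punctuation is left attached to tokens.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            counts.merge(token.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `TermCounts.count("the cat saw the other cat")` gives a map in which "the" and "cat" each have a count of 2 and the remaining words a count of 1. Because there is no stemming or punctuation handling, "cat" and "cat." would be counted as different words, which is roughly what "context-insensitive" costs you.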

My methodology while working on the project was to get something working as quickly as possible (just hack), and I explicitly avoided reading up on the topic (information retrieval). NLP is not my area, so I was surprised today, when I finally did do a little research, to discover that my first pass at the problem was quite reasonable.

The basic modeling approach may be referred to as a Term Count Model (word frequencies), and the general document interaction measures fit nicely into the so-called Vector Space Model, along with tf-idf and similar weighting schemes. Basically, the approach involves modeling the word frequencies in a vector space and using simple algebra (distances, magnitudes, angles) to assess the relevance of indexed documents.
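To make the vector space idea concrete, here is a rough sketch of tf-idf weighting and cosine similarity over sparse word-count maps. It is a textbook-style illustration, not the measures my application uses: the class name `VectorSpace`, its method names, and the particular idf formula, log(N / df), are assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; one common tf-idf variant, not the post's own measures.
final class VectorSpace {
    // Inverse document frequency per term: log(N / df), where N is the number of
    // documents and df is the number of documents containing the term.
    static Map<String, Double> idf(List<Map<String, Integer>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : corpus) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) corpus.size() / e.getValue()));
        }
        return idf;
    }

    // Scale raw term counts by idf; terms missing from the idf table get weight 0.
    static Map<String, Double> weigh(Map<String, Integer> counts, Map<String, Double> idf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weighted.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return weighted;
    }

    // Cosine of the angle between two sparse vectors; 0 if either has zero magnitude.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Two documents that share no words score 0, and two whose weighted vectors point in the same direction score 1 regardless of document length, which is why the angle is often a more useful relevance measure than raw distance. Note that with this idf formula a word appearing in every document gets weight 0 and drops out of the comparison entirely.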

Digging a little deeper reveals a discussion of such classical document retrieval approaches in the context of search and SEO on an interesting site called Information Retrieval Intelligence. The site provides a detailed look at the term count model, the vector space model, a consideration of word densities in such models, and much more. I also came across a good paper on term weighting titled Term Weighting Approaches in Automatic Text Retrieval (1987). Finally, digging deeper still reveals Lucene, a Java library that does pretty much all of these things in what seems like a best-practices manner (Apache products are typically solid).

Although my hacked application and its rudimentary indexing and interestingness measures are minimal, they are functional. I'm not so sure I would have something functional if I had taken a deep research approach first. Specifically, I doubt a simple VSM approach with word weightings would have satisfied me, and I doubt I would have considered a suitable base application (academic research paper relevance). Anyway, the next step for this side project will be to consider whether there is a market (web/desktop) for my chosen problem (or variations thereof), or whether I should graft a different problem onto my codebase (a post-filtering application?).
