How to build a web app to retrieve and rank scientific RSS feeds: part 2
The advantage of working with scientific RSS feeds lies in the fact that titles and abstracts are highly informative and structured. Ideally we could read just the last sentence of the abstract to get a sense of the background and conclusion of the paper. Embedded in every abstract is the condensed structure of the paper (introduction, methods, conclusion). Moreover, there is a limited set of sentences and words that scientists use in the abstract to ‘announce’ their conclusion: “taken together”, “in conclusion”, “these data show”, “these data suggest”...
All these features make text analysis (condensation) surprisingly easy to do! We can write a few lines of code to extract a set of features and build a document-term matrix: a rectangular matrix with one row per document and one column per frequent word, holding the relative frequency of each word in each document. Then we generate an n-dimensional space where every document has its own location (Latent Semantic Analysis) and we group the documents together by similarity. This is one of the simplest ways of analyzing text information (this google search is enough to get you started with R and its excellent text mining libraries, tm and lsa).
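To make the pipeline concrete, here is a minimal sketch in Python (the post itself works in R with tm and lsa): build a raw-count document-term matrix by hand, run a truncated SVD (the core of LSA), and compare documents in the reduced space. The toy abstracts and the choice of k = 2 dimensions are made up for illustration.

```python
from collections import Counter
import numpy as np

# Three toy "abstracts"; the first two are about memory, the third is not
docs = [
    "these data suggest a role for synaptic plasticity in memory",
    "taken together these data show memory consolidation requires sleep",
    "a new catalyst improves solar cell efficiency",
]

# Document-term matrix: one row per document, one column per vocabulary word
tokens = [d.split() for d in docs]
vocab = sorted({w for t in tokens for w in t})
dtm = np.array([[Counter(t)[w] for w in vocab] for t in tokens], dtype=float)

# LSA: a truncated SVD places every document in a low-dimensional space
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
k = 2
doc_coords = U[:, :k] * s[:k]  # one point per document in the semantic space

def cos(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two memory abstracts end up closer together than memory vs. solar cells
print(cos(doc_coords[0], doc_coords[1]))
print(cos(doc_coords[0], doc_coords[2]))
```

Real code would weight the matrix (e.g. tf-idf) and stem the words first; the raw counts here just keep the example short.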
This whole process works pretty well on scientific text! I was amazed by the ability of this clustering technique to group together similar papers. Sometimes it also promotes the discovery of new connections between distant papers.
Ideally we could have a set of interesting papers and build a space out of this set; every new feed is then mapped onto this space. Our intelligent feed reader should visualize only the papers that fit in this semantic space.
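The mapping step is the standard LSA "fold-in": project a new abstract onto the space built from the interesting papers and keep it only if it lands close to them. A hedged Python sketch, where the vocabulary, the tiny training matrix, and the 0.5 threshold are all illustrative choices, not values from the actual app:

```python
from collections import Counter
import numpy as np

# Illustrative vocabulary and a tiny "interesting papers" term matrix
VOCAB = ["data", "memory", "plasticity", "sleep", "catalyst", "solar"]
train_dtm = np.array([[1, 1, 1, 0, 0, 0],
                      [1, 1, 0, 1, 0, 0]], dtype=float)

# Build the semantic space once from the interesting papers
U, s, Vt = np.linalg.svd(train_dtm, full_matrices=False)
k = 2
space = U[:, :k]  # coordinates of the interesting papers

def fold_in(abstract):
    """Project a new abstract onto the existing LSA space (fold-in)."""
    counts = Counter(abstract.split())
    v = np.array([counts[w] for w in VOCAB], dtype=float)
    return v @ Vt[:k].T / s[:k]

def is_relevant(abstract, threshold=0.5):
    """Keep a feed only if it sits near some interesting paper."""
    q = fold_in(abstract)
    if np.linalg.norm(q) < 1e-12:
        return False  # no overlap with the space at all
    sims = space @ q / (np.linalg.norm(space, axis=1) * np.linalg.norm(q))
    return float(sims.max()) >= threshold

print(is_relevant("sleep and memory data"))    # overlaps the space -> True
print(is_relevant("a better solar catalyst"))  # no overlap -> False
```

The key property is that the space is built once from the interesting set and never rebuilt per feed; each incoming abstract is just projected and scored.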
One possible approach to discovering interesting RSS feeds may also incorporate a “spam-filtering” technique: interesting papers are good (ham), uninteresting papers are bad (spam). But is it really so? I don’t see uninteresting papers as spam, especially if we focus our search on highly specialized journals (e.g. neurobiology of learning and memory), where 99% of the papers may be of interest to me. A different scenario applies to journals like Nature, Science or Cell: most of the papers are not related to my field.
Right now the code (see it here) ranks papers on the basis of sensitive keywords extracted from a limited set of papers I classified as interesting. Every keyword has its own weight (its frequency in the set of interesting papers). The rank of a feed is the sum of the weights of the keywords detected in it. This is pretty naive, but it is a first attempt. It works very well when the paper is really interesting (several keywords are detected in the abstract). The major disadvantage is that we may miss important links: if only one keyword is detected in a feed, a critical link between that keyword and a new [undetected] keyword may be missed.
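The ranking rule above fits in a few lines. A Python sketch (my actual code is in R), where the keywords and weights are invented for illustration rather than taken from my real interesting-papers set:

```python
# Hypothetical keyword weights: each is the keyword's frequency in a
# hand-classified set of interesting papers (values made up here)
KEYWORD_WEIGHTS = {
    "memory": 0.12,
    "plasticity": 0.08,
    "synaptic": 0.07,
    "hippocampus": 0.05,
}

def rank_feed(abstract):
    """Rank = sum of the weights of the keywords detected in the abstract."""
    words = set(abstract.lower().split())
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in words)

rank_feed("Synaptic plasticity in the hippocampus supports memory")  # all four hit
rank_feed("A new catalyst improves solar cells")                     # 0
```

Note that each keyword counts once per abstract here (detected or not), which matches the detect-and-sum rule described above.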
The next step is to extract the feeds with the highest rank from each journal: setting a global threshold has the disadvantage of excluding journals with minimal abstracts (like Science), which tend to have very low ranks.
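In other words: group by journal, sort by rank, and take the top N within each group, so that a journal whose abstracts are short (and hence low-ranked) still contributes its best papers. A small sketch, with made-up journals and ranks:

```python
from collections import defaultdict

# Illustrative feeds: Science ranks are low only because its abstracts are short
feeds = [
    {"journal": "Science", "title": "A", "rank": 0.03},
    {"journal": "Science", "title": "B", "rank": 0.01},
    {"journal": "Neuron",  "title": "C", "rank": 0.40},
    {"journal": "Neuron",  "title": "D", "rank": 0.25},
    {"journal": "Neuron",  "title": "E", "rank": 0.10},
]

def top_per_journal(feeds, n=1):
    """Keep the n highest-ranked feeds within each journal."""
    by_journal = defaultdict(list)
    for f in feeds:
        by_journal[f["journal"]].append(f)
    picked = []
    for fs in by_journal.values():
        picked.extend(sorted(fs, key=lambda f: f["rank"], reverse=True)[:n])
    return picked

[f["title"] for f in top_per_journal(feeds)]  # Science's top paper survives
```

A global threshold of, say, 0.05 would have discarded both Science papers; the per-journal rule keeps paper A.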
I have several ideas I want to implement in the near future. It is a stimulating playground!
Soon I will also post a clean version of the code I use for extracting keywords and their frequency.
posted by LR