Bob's 2009 Research Blog: 2009

Wednesday, April 8, 2009

New research tack

I've begun working with Dr. Peter Nagy in the Pathology department at UIHC. He's doing CGH analysis on patients who may have degenerative muscular diseases (like ALS) - he compares patient genome expression data against averaged data from "normal" genomes to find possible anomalies, and then manually examines the candidate locations identified. It works pretty well, but instead of using, say, a program to do this analysis, he's got a big honkin' Excel spreadsheet doing the work. I'm helping him to develop a Python program that will do this instead, and teaching him the language at the same time. The code I'm producing isn't very "Pythonic", but it's easy to read and understand from a beginner's standpoint, and is quite liberally commented. If everything goes as planned, Dr. Nagy will publish this and I'll get my name on the publication.

Monday, March 23, 2009

Writing update

Continuing to write up my submission for the Information Retrieval Journal's special issue on learning to rank. Submission will include results from the ECML paper as well as the most recent results from the nonrandom seeding experiments. I'll post the submission later this week.

Tuesday, March 10, 2009

A Meta-Learning Approach for Robust Rank Learning

Back from St. Louis. As promised, paper reviews.

Carvalho et. al. 2008. "A Meta-Learning Approach for Robust Rank Learning". In proceedings of SIGIR 2008 LR4IR - Workshop on Learning to Rank for Information Retrieval. [pdf]

The main crux of interest for me in this paper was its focus on how outliers affect learning ranking functions, specifically the effect of mislabeling on pairwise ranking functions. Mislabeling of a document pair propagates into a quadratic increase in the number of outliers, "outliers" in this case referring to mislabeled and non-relevant documents.

Central to the identification of outliers in this work is the pairwise decision score between two documents. This score is

P_t = z_ql * (s(d_iq) - s(d_jq)

which is difference in score given to a document pair d_i, d_j for a query q from a scoring system s(), multiplied by z_ql which is +1 or -1 depending on the true preference of d_iq to d_jq. P_t is positive if the documents are correctly ranked by s(), negative otherwise. The authors show that by training a model, calculating pairwise decision scores for the training data, and then retraining the model without documents with a P_t below some cutoff value, performance is increased. This is straightforward, and to be expected - if you remove mislabeled instances, the errors they introduce will not propogate to other pairs.

In order to capitalize on this finding, the authors propose a meta-learning method for optimizing linear rankers. After learning a base ranker (e.g. perceptron, RankSVM), the ranker is re-optimized by incorporating a sigmoid loss function in the optimization. This reduces the effects of the outliers on the learned model. By differentiating this new optimization, a gradient descent algorithm can be devised.

Experiments were carried out on a number of standard test corpora, showing performance for perceptron, RankSVM and ListNet, along with performance for these algorithms using the sigmoid meta-learning. The addition of the sigmoid resulted in superior performance in nearly every case, with around two-thirds of the performance gains being statistically significant.

Digging deeper into the results, we see that the significant performance gains come mostly from the perceptron and ListNet. Only half of the SVM gains were statistically significant, and even there the gains were not staggering; the greatest significant gain was from a MAP (mean average precision) of 0.472 to 0.480.

My main takeaway from the paper is that metalearning is a terrific thing to do if you're using a perceptron, a good thing to do if you're using ListNet, and an okay thing to do if you're using RankSVM. This is, of course, assuming that any performance gains are worth the extra time and resources expended in the optimization. I am unconvinced that doing so would be a good idea for my research. I do, however, think that I might use the pairwise decision score to try to identify potentially mislabeled documents for the user to look at again.

Tuesday, February 24, 2009

More of the same

More thesis and article writing this week. In order to make this space less boring, I'll be posting reviews of some papers I'm reading in the coming days as well.

Tuesday, February 17, 2009

Week's work

This week has been/will be spent on writing. I've got three chapters more or less in the can, most of the writing done for the fourth, and the data for the fifth put together. The fifth chapter will be put together as an article for an upcoming special issue of the Information Retrieval Journal on learning to rank for IR. The paper will be based on my most recent results with nonrandom seeding.

Friday, February 13, 2009

Presentation at DMIG

I presented at DMIG today, detailing my recent results and the process I used to get them - specifically, the steps I went through to in order to go from a bunch of weird data that was giving me a headache, to finding a better measure for my data. I modeled the process on the 5 stages of grieving, and even included suggested a drink for each stage, just to help you get through it. PDFs of the slides are here.

Wednesday, February 11, 2009

Argument for IQ Analysis

The argument is pretty simple - means work well for describing trends of normal distributions, and my distributions are not normal. Take, for example, the distribution of NDCGs for L4 at the 0.9 convergence threshold...

Clearly, the distribution is not normal, and has a very long tail. When measuring skew across all thresholds (below), we see that this tendency holds as the majority of the distribution meets the performance ceiling. Hence, an interquartile mean analysis is more appropriate.

Tuesday, February 10, 2009

Interquartile Mean Analysis

Not willing to leave well enough alone, I went ahead and did the analysis of the interquartile mean (IQ) values for rounds and NDCGs, and I'm quite glad I did. While the IQ rounds were not much better than the mean rounds, the IQ NDCGs were consistently around 4% better than the mean NDCGs. The graph below shows the difference between mean and IQ NDCGs vs. the total number of examples seen for LETOR seeding using l10.

The mean over all examples ("All") is clearly dominated by the interquartile means. This bump in performance means that performance using LETOR seeds is now more pronounced, as seen below.

Again, the graph shows NDCGs vs. total examples seen, in this case for IQ NDCGs for l10 and the baseline.

Later today or tomorrow I'm going to do an analysis of the distribution of NDCGs across all queries. I'm hoping to show that there are tight standard deviations with a very large total range, which will allow me to argue that discounting outliers is indeed valid.

Friday, February 6, 2009

More evidence of nada

Per the suggestion of my advisor, I took a look at median rounds instead of mean rounds, and discovered something quite interesting...

Again, I've left the graph at 0.9 larger than the others. It seems that even my baseline jumps around in the rounds department, and the outliers were smoothing out the mean. If I were really ambitious I would take a look at the interquartile means, but I think that for now I'm going to just take the money and run, as it were.

Thursday, February 5, 2009

Much ado about nothing?

So I've plotted the rounds vs. examples per round chart as total examples vs. examples per round, and I think that I can safely claim that the odd behavior of the rounds chart is an artifact of the methods consuming more total examples as the number of examples per round increases. The graphs below show that the difference in total examples between the seeding methods is nearly nonexistent, and that the graphs very closely resemble the baseline graph. So, this may indeed have been much ado about nothing.

Wednesday, February 4, 2009

The Problem

Below you'll see the graphs that detail the oddity that I've uncovered. You'll likely want to read my ECML paper (here) for the background info.

I'm experimenting on using LETOR scores to choose the documents shown to the user in the first round of feedback instead of choosing randomly. I identified four LETOR scores to experiment with, these being the scores that gave the highest NDCG@10 averaged over all queries if one were to rank based solely on that score. All subsequent rounds use top sampling, choosing the top n (1 to 5) examples as ranked by the model built from previous feedback for each new feedback round.

Previous results with randomly seeded top sampling yielded results that showed monotonically increasing NDCGs and monotonically decreasing rounds to convergence as number of examples per round increased. The expectation was that LETOR seeding would result in perhaps slightly better NDCGs, but definitely fewer rounds to convergence (as we're providing a ranking in the first round as opposed to using the first round for ranking). While NDCGs had a moderate increase, the rounds to convergence showed somewhat bizarre behavior. While they were indeed lower than random seeding in all cases, they did not monotonically decrease, and in fact bounced around quite a bit.

The convergence threshold at 0.9 was the most bizarre, and I left its graph slightly larger than the others so you can see just how odd it is. The current thought is that perhaps the LETOR seeds are markedly different from each other. I'm pretty sure that this is not the case, however; I've calculated the pairwise cosine similarity scores between the first five documents for each query for each of the four LETOR scores, and there isn't a huge amount of deviation, certainly not enough to account for this weirdness.

Tuesday, February 3, 2009

New results

New results from running experiments using LETOR scores for initial document selection instead of picking randomly. As expected, using any score is better than choosing randomly; NDCGs were uniformly higher, and number of rounds was uniformly lower, both of these across all numbers of examples per round and stopping criteria. Unexpectedly, not all measures produced a monotonically decreasing number of rounds as examples per round increased. This bears further investigation (graphs posted later).

Monday, January 26, 2009

Last week's work

Last week was spent largely getting ready for the new semester, and applying for postdocs. I'm applying for a number of positions, mostly in bio- and health informatics. My current CV is up at http://www.cs.uiowa.edu/~rarens/cv-us.pdf

Tuesday, January 20, 2009

First post

I intend to blog my daily research activities here for the next semester or so.

Bob's 2009 Research Blog