I'm experimenting with using LETOR scores to choose the documents shown to the user in the first round of feedback, instead of choosing them randomly. I picked four LETOR scores to experiment with: the ones that gave the highest NDCG@10, averaged over all queries, when documents are ranked solely on that score. All subsequent rounds use top sampling: each new feedback round shows the top n (1 to 5) examples as ranked by the model built from the previous feedback.
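
To make the selection step concrete, here is a minimal sketch of ranking on a single score and averaging NDCG@10 over queries. It is an illustration, not the experiment code: the data layout (a dict mapping query id to a list of `(feature_vector, relevance)` pairs), the function names, and the exponential-gain form of DCG are all assumptions.

```python
import math

def dcg_at_k(relevances, k=10):
    # Assumed DCG variant: gain 2^rel - 1, discount log2(rank + 1) with ranks starting at 1.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mean_ndcg_for_feature(queries, feature_index, k=10):
    # queries: {query_id: [(feature_vector, relevance), ...]} (hypothetical layout)
    scores = []
    for docs in queries.values():
        # Rank the query's documents on this one score alone, highest first.
        ranked = sorted(docs, key=lambda d: d[0][feature_index], reverse=True)
        scores.append(ndcg_at_k([rel for _, rel in ranked], k))
    return sum(scores) / len(scores)

def best_features(queries, num_features, top=4, k=10):
    # Return the indices of the `top` scores with the highest mean NDCG@k.
    means = {f: mean_ndcg_for_feature(queries, f, k) for f in range(num_features)}
    return sorted(means, key=means.get, reverse=True)[:top]
```

Calling `best_features(queries, num_features)` with the default `top=4` would return four candidate seed scores under these assumptions.
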
Earlier experiments with randomly seeded top sampling showed monotonically increasing NDCG and monotonically decreasing rounds to convergence as the number of examples per round increased. The expectation was that LETOR seeding would give perhaps slightly better NDCG, but definitely fewer rounds to convergence, since the first round now starts from a ranking rather than being spent learning one. NDCG did increase moderately, but the rounds to convergence behaved somewhat bizarrely. They were indeed lower than with random seeding in all cases, but they did not decrease monotonically, and in fact bounced around quite a bit.
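
For reference, the two properties discussed here can be checked mechanically. The sketch below assumes that "rounds to convergence" means the first feedback round whose NDCG reaches the threshold; that definition and both function names are assumptions for illustration only.

```python
def rounds_to_convergence(ndcg_per_round, threshold=0.9):
    # Assumed definition: the first round (1-based) whose NDCG meets the threshold.
    for round_number, ndcg in enumerate(ndcg_per_round, start=1):
        if ndcg >= threshold:
            return round_number
    return None  # never reached the threshold

def is_monotonically_decreasing(values):
    # Non-strict check: each value is no larger than the one before it.
    return all(later <= earlier for earlier, later in zip(values, values[1:]))
```

Collecting `rounds_to_convergence` for n = 1 through 5 and passing the list to `is_monotonically_decreasing` would flag the bouncing behavior described above.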

The results at a convergence threshold of 0.9 were the most bizarre, and I left that graph slightly larger than the others so you can see just how odd it is. One hypothesis is that the LETOR seeds are markedly different from each other. I'm fairly sure this is not the case, however: I calculated the pairwise cosine similarity between the first five documents for each query under each of the four LETOR scores, and there isn't a huge amount of deviation, certainly not enough to account for this weirdness.
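
A minimal sketch of one way to run that similarity check for a single query is shown below. It assumes each seed set is a list of document feature vectors keyed by a hypothetical score name, and it averages the cosine similarity over all cross-set document pairs; the actual comparison may have been set up differently.

```python
import math
from itertools import combinations

def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def seed_set_similarity(set_a, set_b):
    # Mean cosine similarity over every (doc in set_a, doc in set_b) pair.
    sims = [cosine(u, v) for u in set_a for v in set_b]
    return sum(sims) / len(sims)

def compare_seed_sets(seed_sets):
    # seed_sets: {score_name: [feature_vector, ...]} for one query (hypothetical layout).
    # Returns the mean similarity for each unordered pair of score names.
    return {(a, b): seed_set_similarity(seed_sets[a], seed_sets[b])
            for a, b in combinations(sorted(seed_sets), 2)}
```

With four scores this yields six pairwise values per query; values close to 1 everywhere would support the claim that the seeds do not differ much.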