Introduction

In the last post, we covered the basic concepts of s2s (sentence-to-sentence) and s2p (sentence-to-passage) tasks. Now I have run some experiments comparing the different methods to verify their effectiveness.

The experiments only evaluate the models at inference time, without any fine-tuning.

Dataset

Here I used two types of datasets: a classification dataset and a passage retrieval dataset. The details are as follows.

| Name | Task | Size | Link |
| --- | --- | --- | --- |
| toutiao-text-classification-dataset | classification | 7k | https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset/tree/master |
| DuReader-retrieval | passage retrieval | queries: 5k, passages: 40k | https://www.luge.ai/#/luge/dataDetail?id=55 |

Results

Results of s2s retrieval only

Here I used the BGE M3 model for cosine similarity matching; the results are shown below.
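For reference, this first stage can be sketched with the FlagEmbedding package as follows. This is a minimal sketch, not the exact experiment code: the query/passage strings are toy examples, and the variable names are my own.

```python
# Minimal sketch of the retrieval-only stage with BGE M3.
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

queries = ["how to train a retrieval model"]                     # toy query
passages = ["an intro to training retrieval models",
            "the weather is nice today"]                         # toy candidates

# Dense embeddings; normalize explicitly so the dot product is cosine similarity.
q = model.encode(queries)["dense_vecs"]
p = model.encode(passages)["dense_vecs"]
q = q / np.linalg.norm(q, axis=1, keepdims=True)
p = p / np.linalg.norm(p, axis=1, keepdims=True)

scores = q @ p.T                            # (n_queries, n_passages)
top5 = np.argsort(-scores, axis=1)[:, :5]   # indices of the top-5 candidates
```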

Results of s2s retrieval (top 5) & rerank

Then I added a second-stage reranker over the top 5 retrieved candidates; the results are as follows.
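A minimal sketch of the second stage is below. The post does not name the reranker, so BAAI/bge-reranker-v2-m3 is an assumption used purely for illustration; the `rerank` helper is hypothetical.

```python
# Cross-encoder reranking over the first-stage candidates.
# The reranker model is an assumption; swap in whichever model was actually used.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(query, candidates, k=5):
    # Score every (query, candidate) pair, then re-sort candidates by that score.
    scores = reranker.compute_score([[query, c] for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]
```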

Results of s2s retrieval (top 100) & rerank

I also tried retrieving the top 100 candidates and reranking them; the results are as follows.

Results of s2p

| Method | top1 | top3 | top5 | top10 |
| --- | --- | --- | --- | --- |
| retrieval-only | 0.3309 | 0.5967 | 0.7151 | 0.8386 |
| retrieval-5 + rerank | 0.4064 | 0.6530 | 0.7151 | \ |
| retrieval-20 + rerank | 0.4266 | 0.7245 | 0.8280 | 0.8992 |

The table shows that reranking improves retrieval accuracy significantly: top-1 accuracy increased by about 7.5 percentage points (33.09% → 40.64%) when reranking the top 5 retrieved results, and top-3 accuracy increased from 59.67% to 72.45% when reranking the top 20.
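For clarity, top-k accuracy here means the fraction of queries whose gold candidate appears among the first k ranked results. A minimal sketch (function and argument names are hypothetical):

```python
def topk_accuracy(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold id appears in the top-k ranked ids."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)
```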

However, reranking is computationally expensive, so in practice it is often better to either skip it or rerank only a small number of retrieved candidates.