Recently I evaluated the E5 embeddings (they come in small, base, and large versions). I got impressive results, so I thought I should read the paper behind them. I’m writing my summary here for myself and anyone else who might find it useful.

Title: Text Embeddings by Weakly-Supervised Contrastive Pre-training

(Image: DALL-E illustration for this post)

This image was generated by DALL-E. The following text is human generated for the most part :)

Idea

This is a nice data paper. The main idea is to use data wisely to train a strong embedding model. There are many different approaches and flavors for training embedding models; here we’re dealing with models that, given a piece of text, produce an embedding vector. For models to do well on semantic similarity tasks, we typically train them by creating text pairs that are related in some way. For instance, one method involves selecting a random sentence from a paragraph and pairing it with the rest of the sentences in that paragraph. There’s no shortage of unlabeled data we can use for this to build a strong general-purpose model. However, since this data is often “noisy”, the results aren’t always ideal. It’s common practice to fine-tune these embeddings using higher-quality labeled text pairs that are more pertinent to the specific task at hand.

In this part I’m focusing on the unsupervised pre-training step, which is also the main contribution of the paper. In Part 2 I’ll go over the supervised fine-tuning part. The idea is simple: how can we improve the quality of text pairs generated from data in an unsupervised (or semi-supervised) way? The authors create <query, passage> pairs by utilizing datasets that contain some useful structure. For example, scientific paper titles (q) and abstracts (p), Reddit posts (q) and comments (p), entity name + section title (q) and passages (p) from Wikipedia, and questions (q) and upvoted answers (p) from StackExchange. This concept is not new. Sites like Reddit and Stack Overflow have been widely used for text pair mining (for example, by the Universal Sentence Encoder). The important detail, however, which most papers usually gloss over, is the quality of the data pipeline. There is often a lot of effort that goes into the generation of the data. The authors name the dataset CCPairs (Colossal Clean text Pairs). BTW, the name E5 stands for EmbEddings from bidirEctional Encoder rEpresentations, which is a generic name that ignores the main idea of the paper (but regardless, E5 is a nice and short name that is easy to remember).
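To make this concrete, here is a minimal Python sketch of how such pairs could be mined from structured sources. The record fields and the example data are my own illustration; the paper’s actual pipeline (and its source-specific cleaning) is much more involved.

```python
# Hypothetical record schemas ("title", "abstract", "post", "comments");
# the real pipeline includes much heavier source-specific cleaning.
def pairs_from_papers(papers):
    """(title, abstract) -> (query, passage)"""
    return [(rec["title"], rec["abstract"]) for rec in papers if rec.get("abstract")]

def pairs_from_reddit(threads):
    """(post, comment) -> (query, passage)"""
    return [(t["post"], c) for t in threads for c in t["comments"]]

papers = [{"title": "Text Embeddings by Weakly-Supervised Contrastive Pre-training",
           "abstract": "This paper presents E5, a family of text embeddings ..."}]
print(pairs_from_papers(papers))
```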

Consistency Filtering

This is an important step to improve the data quality. Once they have the full dataset (1.3B pairs), they train a model on it and then use that model to rank each pair’s “true” passage p against 1M random passages, keeping the pair only if p falls in the top k (k=2 in their case). They explain that this works because neural networks tend to memorize clean labels first, so the pairs the trained model still ranks highly are likely the less noisy ones. This step reduces the dataset from 1.3B pairs to ~270M. That’s a significant reduction.
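Here is a rough sketch of what the filtering step could look like, assuming we already have embeddings from the first-round model. The function name and the NumPy scoring are my own; in practice the 1M-passage pool would be scored in chunks or with an approximate nearest-neighbor index.

```python
import numpy as np

def consistency_filter(query_embs, pos_embs, pool_embs, k=2):
    """Keep a <q, p> pair only if its 'true' passage ranks in the top-k
    when scored against a pool of random passages.

    query_embs: (N, d) query embeddings, L2-normalized
    pos_embs:   (N, d) embeddings of each query's 'true' passage
    pool_embs:  (M, d) embeddings of random passages (~1M in the paper)
    """
    # A dense (N, M) score matrix is used here only for clarity;
    # a real pipeline would score the pool in chunks or via an ANN index.
    pool_scores = query_embs @ pool_embs.T              # (N, M) cosine scores vs. the pool
    pos_scores = np.sum(query_embs * pos_embs, axis=1)  # (N,) score of each true passage
    # Rank of the true passage = 1 + number of pool passages that score higher.
    ranks = 1 + np.sum(pool_scores > pos_scores[:, None], axis=1)
    return ranks <= k
```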

Training

The model is first pre-trained on the ~270M pairs and then fine-tuned on smaller labeled datasets such as MS MARCO. The fine-tuning part will be covered in Part 2.

Methodology

Their pre-training methodology is a common one: a contrastive objective with the InfoNCE loss, which works well with multiple negatives per example. The goal, as always, is to pull similar pairs closer together in the embedding space (in this case, by assigning higher probabilities to positive pairs). The common “bi-encoder” architecture (with a shared encoder) is used to generate the query and passage embeddings separately. The generated CCPairs dataset supplies the “positive” pairs. For each positive pair, the negatives are all the other examples in the batch (each query q has a single positive passage, and all other passages in the batch are its negatives).

We start with a pre-trained encoder such as BERT. We pass each text through the model and apply mean pooling over the token representations to get the embedding. We compute the cosine similarity between query and passage embeddings (scaled by a 0.01 temperature parameter). Now we can compute the loss and update the parameters of the BERT model. “query: “ and “passage: “ are used as prefixes to distinguish between the two types of input.
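Here is a minimal sketch of the encoding side in PyTorch/HuggingFace, with prefixes, mean pooling, and the temperature-scaled cosine score. The checkpoint choice, the helper name, and the example texts are mine, just for illustration; the paper initializes from MiniLM and BERT checkpoints.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Any BERT-style encoder works for this sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    """Embed a list of texts with mean pooling over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (B, T, d) token representations
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1) 1 for real tokens
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling
    return F.normalize(emb, dim=-1)                       # unit norm -> dot product = cosine

# Prefixes distinguish the two input types.
q = encode(["query: how are text embeddings pre-trained?"])
p = encode(["passage: E5 is pre-trained contrastively on web-mined text pairs."])
score = (q @ p.T) / 0.01                                  # cosine scaled by temperature 0.01
```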

Here are the general steps:

  1. Start with a pre-trained transformer encoder
  2. Run a forward pass to generate the embeddings for the query, positive, and negative samples (with mean pooling)
  3. Compute scaled cosine similarity between the query and positive
  4. Compute scaled cosine similarity between the query and all negatives in the batch (computationally expensive with a large batch size)
  5. Compute cross entropy for each example (a positive pair with negatives) and average the loss over the batch

Basically, since we have a single positive example with multiple negatives, we compute the -log of the predicted probability of the positive pair (the negative log-likelihood), like we often do in multiclass classification (minimizing the negative log-likelihood of softmax-normalized scores). In this context we apply a softmax where the numerator is the exponentiated cosine score of the positive pair and the denominator is the sum of that score and the exponentiated scores of the query with each negative. The cosine scores serve as the unnormalized logits. See Section 4.1 in the paper.
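As a sketch, the in-batch-negatives loss boils down to a cross entropy over the B×B matrix of scaled cosine scores, with the diagonal entries as the positives. The function name is mine, and only the query→passage direction is shown.

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(q_embs, p_embs, temperature=0.01):
    """InfoNCE with in-batch negatives.

    q_embs, p_embs: (B, d) L2-normalized query / passage embeddings,
    where passage i is the positive for query i.
    """
    logits = (q_embs @ p_embs.T) / temperature             # (B, B) scaled cosine scores
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross entropy = -log softmax of the diagonal (positive) score in each row.
    return F.cross_entropy(logits, targets)
```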

For step #1 they trained 3 models using different encoders: E5-small, E5-base (110M params), and E5-large (330M params), initialized from MiniLM, bert-base-uncased, and bert-large-uncased-whole-word-masking, respectively.

As for the “query: “ and “passage: “ prefixes, they are not strictly required but often help in IR settings. Make sure to add the prefixes when using or fine-tuning the model for better performance.
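For example, reusing the encode() sketch from above, a retrieval call would look something like this (the texts are made up):

```python
# Prefix queries and passages before embedding, both at inference time and when fine-tuning.
queries = ["query: what is consistency filtering?"]
passages = [
    "passage: Consistency filtering keeps a pair only if the trained model "
    "ranks the true passage in the top-k among ~1M random passages.",
    "passage: BM25 is a classic lexical ranking function.",
]
scores = encode(queries) @ encode(passages).T   # higher score = more relevant
```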

They report that training took {16, 32, 64} V100 GPUs and {1, 1, 2} days for the {small, base, large} models.

Results

A few tasks are used for evaluation: zero-shot retrieval, zero- and few-shot classification, semantic textual similarity, and clustering. I’m focusing here mostly on the retrieval results. One of the key choices in their training configuration is a batch size of 32K. The larger the batch size, the more negatives are used for each example, and larger batch sizes have been shown to produce better results when training with in-batch negatives. Here’s Table 5 from the paper, which shows a similar trend. It looks like 8K is not a bad choice either.

Impact of batch size

They also show that filtering helps, especially when training on less data. See Table 7 in the paper.

Retrieval Results

They evaluate on 15 datasets from the BEIR benchmark (retrieval) and on the MTEB benchmark, which contains 56 English datasets across various tasks. The MTEB benchmark actually contains all 15 retrieval datasets they evaluate on in BEIR. I’m not a big fan of BEIR since BM25 is a strong baseline on many of its datasets, which is quite different from “in the wild” datasets in my experience. Besides BM25, the other strong unsupervised baseline they compare against is Contriever (contrastive retriever), which uses random cropping to create text pairs. They show that all their models beat Contriever by a margin, and that their base and large models also beat BM25. Like in many papers in the field, seeing the results is useful, but I wouldn’t put much emphasis on the comparison. It’s often the case that these comparisons are not apples to apples. For example, it’s not clear where they get their Contriever results from. The original Contriever model was trained with a much smaller batch size, for example, and the data sources it generates text pairs from are not the same. Also, the Contriever paper reports higher results on BEIR, but it uses a different subset of the benchmark. This is not to criticise the authors; it’s difficult and expensive to reproduce other systems for such comparisons.

Another observation about their unsupervised results is that E5-base (unsupervised) is competitive on clustering and classification with models such as GTR-xxl, which has 4.8B parameters (about 40 times more). For the full results, check Section 5 in the paper.

Fine-tuning on Labeled Data

I will cover this part later. For now I can say that, as expected, fine-tuning the model on labeled data boosts its performance nicely. Fine-tuning works in a similar manner, but since the labeled datasets are often much smaller than the pre-training data, it’s common to use more sophisticated and/or expensive techniques to generate negatives, which tends to lead to better performance; for example, using BM25 to rank passages and picking negatives from the top. They do not elaborate on the procedure and cite other papers, the key one being SimLM from the same authors. The fine-tuning results seem more relevant to the SimLM paper than this one, which is why I’m leaving it for a future post.
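For illustration, here is a hypothetical sketch of BM25-based hard-negative mining using the rank_bm25 package. Since the paper doesn’t spell out its exact procedure, the function, parameters, and corpus below are my own assumptions.

```python
from rank_bm25 import BM25Okapi

corpus = ["passage about embeddings ...", "passage about BM25 ...", "unrelated passage ..."]
bm25 = BM25Okapi([doc.split() for doc in corpus])   # whitespace tokenization for simplicity

def mine_hard_negatives(query, positive, top_n=50, num_negatives=5):
    """Rank the corpus with BM25 and keep top-ranked passages that are not the labeled positive."""
    ranked = bm25.get_top_n(query.split(), corpus, n=top_n)
    return [doc for doc in ranked if doc != positive][:num_negatives]

print(mine_hard_negatives("what is bm25", positive="passage about BM25 ..."))
```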

That’s it for now. Happy holidays!