From Vision Transformers to innovative large language model finetuning techniques, the AI community has been very active with lots of interesting research this past month.
Here’s a snapshot of the highlights I am covering in this article:
- In the paper ConvNets Match Vision Transformers at Scale, Smith et al. invest significant computational resources to conduct a thorough comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), challenging the prevailing notion that ViTs outperform CNNs in image classification tasks.
- The Mistral 7B paper introduces a compact yet powerful language model that, despite its relatively modest size of 7 billion parameters, outperforms larger counterparts such as the 13B Llama 2 model on various benchmarks. This surprisingly good performance may largely be attributable to its training data.
- Zephyr: Direct Distillation of LM Alignment presents a fresh approach to training language models, showcasing the Zephyr 7B model’s remarkable performance in both conversational and knowledge benchmarks. The authors employed distilled Direct Preference Optimization (DPO), which is much less complex than Reinforcement Learning with Human Feedback (RLHF).
- In their paper NEFTune: Noisy Embeddings Improve Instruction Finetuning, Jain, Chiang, Wen, Kirchenbauer et al. present a simple method to enhance the performance of language models: injecting uniform random noise into the token embeddings during finetuning. This technique, called NEFTune, significantly improves performance in conversational tasks without compromising knowledge in question-answer tasks (a minimal sketch of the idea follows this list).
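To make the idea behind NEFTune more concrete, here is a minimal PyTorch sketch of the noise-injection step, assuming the noise is added to the embedding-layer output during finetuning only; the function name and the default alpha value are my own illustrative choices rather than code from the paper:

```python
import torch

def add_neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add scaled uniform noise to token embeddings, NEFTune-style.

    embeddings: (batch, seq_len, hidden_dim) output of the embedding layer.
    alpha: noise-scale hyperparameter (illustrative default).
    """
    _, seq_len, hidden_dim = embeddings.shape
    # Uniform noise in [-1, 1], scaled by alpha / sqrt(seq_len * hidden_dim)
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise
```

At inference time no noise is added, so the deployed model differs from a regular finetuned model only through its weights.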
PS: Readers of the monthly research highlights may notice that I am changing up the format, selecting a handful of papers for more detailed summaries and discussions. In addition, I also included very short summaries of 20+ additional papers that piqued my interest. I hope you like the new format!
In this paper, researchers invested compute budgets of up to 110k TPU hours to do a fair comparison between ViTs and CNNs.
Their findings revealed that when CNNs are pretrained with a compute budget similar to what is typically used for ViTs, they can match the performance of ViTs. For this, they pretrained on 4 billion labeled images from JFT and subsequently finetuned the models on ImageNet.
Personally, I’ve observed that it’s easier to achieve good image classification performance when finetuning ViTs compared to finetuning CNNs. For instance, a small ViT can be finetuned for a few minutes on a single GPU and achieve approximately 96% accuracy on CIFAR-10. In my teaching experience, obtaining such good results was always challenging with pretrained CNNs. In retrospect, this might be because ViTs benefited from a larger pretraining budget.
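For readers who want to try this themselves, below is a minimal sketch of what such a finetuning run can look like, using torchvision’s ImageNet-pretrained vit_b_16 as an example; the hyperparameters (learning rate, batch size, number of epochs) are illustrative guesses rather than a tuned recipe:

```python
import torch
from torch import nn
from torchvision import datasets, transforms, models

# Load an ImageNet-pretrained ViT and swap in a 10-class head for CIFAR-10
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

# Resize 32x32 CIFAR-10 images to the 224x224 resolution the pretrained ViT expects
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a few epochs usually suffice when starting from pretrained weights
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

The main point is that the pretrained backbone does most of the work; only the resizing transform and the new classification head are specific to CIFAR-10.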
Inference
One aspect I wish had been addressed in the paper is inference performance. While it’s feasible to match the performance of finetuned ViTs with finetuned CNNs, I wonder what advantages one offers over the other in terms of memory footprint and inference speed for the exact models used in the study.
However, such an analysis may also be a bit unfair since ViT architectures are relatively new compared to CNNs, which have been heavily optimized over the years. I can also understand that such a study is beyond the scope of this work, as it would have to be fairly comprehensive, taking both TPUs and GPUs into account. For instance, TPUs are known to be more optimized for the matrix multiplications that dominate ViTs, whereas GPUs are more optimized for convolutions.
Beyond classification
While the main takeaway of this paper — CNNs can match ViTs at scale — is super interesting, the paper only focused on image classification. A natural question is whether this also applies to object detection and image segmentation, which would be interesting follow-up work.
Paper reference
- ConvNets Match Vision Transformers at Scale by Smith, Brock, Berrada, and De (25 Oct), https://arxiv.org/abs/2310.16764
The Mistral 7B paper introduces a new “small” LLM with 7 billion parameters. The paper is relatively short on technical details, but it is still worth covering here since the openly available Mistral 7B LLM has been among the most popular models in the past few weeks. Moreover, the Mistral 7B base model also forms the basis for finetuning Zephyr 7B, which will be covered in the next section.
Mistral performs beyond its size
The main reason why Mistral 7B was so popular was that it outperforms the 13B Llama 2 model, which is almost twice as large, on most benchmarks.
Why exactly it is so good is unclear, but it is likely due to its training data. Neither Llama 2 nor Mistral discloses its training data, so we can only speculate.
Architecture-wise, the model shares grouped-query attention with Llama 2. One interesting addition to the Mistral architecture is sliding window attention, which saves memory and improves computational throughput for faster training. (Sliding window attention was previously proposed in Child et al. 2019 and Beltagy et al. 2020.)
Sliding window attention
The sliding window attention mechanism used in Mistral is essentially a fixed-size attention window that allows the current token to attend to only a fixed number of previous tokens (instead of all previous tokens), as illustrated in the figure below.
In the specific case of Mistral 7B, the attention window spans 4,096 tokens, and the researchers trained the model with context sizes of up to 100k tokens.
To provide a concrete example: in regular self-attention, the model at the 50,000th token can attend to all 49,999 previous tokens. In sliding window self-attention, the Mistral model can only attend to tokens 45,904 to 50,000 (since 50,000 – 4,096 = 45,904).
However, note that sliding window attention is mainly meant to let the model handle longer sequences, not to improve benchmark performance per se. (Most benchmark tasks are either multiple-choice tasks or rely on short answers.)
In other words, sliding window attention is mainly used to improve computational performance. The fact that Mistral outperforms larger Llama 2 models is likely not because of sliding window attention but rather despite sliding window attention.
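To make the masking pattern explicit, here is a small, self-contained sketch of a causal sliding-window attention mask. This is an illustration of the concept only, not Mistral’s actual implementation, which relies on optimized attention kernels and a rolling buffer cache:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Boolean mask where True means query position i may attend to key position j.

    Each token attends to itself and to at most `window_size` preceding tokens.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    causal = j <= i                          # no attending to future tokens
    within_window = (i - j) <= window_size   # only the most recent tokens
    return causal & within_window

# Small example; with window_size=4096, token 50,000 could attend to
# tokens 45,904 through 50,000 but not to anything earlier.
print(sliding_window_causal_mask(seq_len=8, window_size=4).int())
```

In an attention layer, the masked-out positions would be set to negative infinity in the score matrix before the softmax, so each query aggregates information only from its local window.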
Paper reference
- Mistral 7B by Jiang, Sablayrolles, Mensch, Bamford et al. (10 Oct), https://arxiv.org/abs/2310.06825
This paper introduces Zephyr 7B, which is currently one of the most exciting open-source LLMs. The reasons for this are twofold:
- Zephyr 7B outperforms models of similar size as well as several larger models in both conversational and knowledge benchmarks.
- The authors trained Zephyr using distilled Direct Preference Optimization (DPO) in a fully automated fashion, which is much less complex than Reinforcement Learning with Human Feedback (RLHF); a brief sketch of the DPO objective follows this list.
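As a quick preview of why DPO is so much simpler than RLHF, here is a generic sketch of the standard DPO objective. This is not Zephyr’s training code; the variable names and the default beta are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO loss over summed per-response log-probabilities.

    Each argument is a tensor of sequence log-probabilities under the trainable
    policy model or the frozen reference model; beta controls how far the policy
    may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

There is no separate reward model and no reinforcement learning loop: the loss only compares the policy’s and a frozen reference model’s log-probabilities on preferred versus rejected responses. In the distilled setup used for Zephyr, the preference labels come from stronger teacher models rather than from human annotators, which is what makes the pipeline fully automated.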
Zephyr Performance
Let’s start with a discussion of Zephyr’s performance before we take a brief look at the DPO and distillation processes used in this paper.
The authors included a representative mix of LLMs in their benchmarks, ranging from 7B-parameter models trained with distilled supervised learning (more on distillation later) to 70B-parameter models trained with RLHF.
MT-Bench and AlpacaEval are benchmarks that evaluate the conversational abilities of LLMs. As the performance table below reveals, the 7B-parameter Zephyr model outperforms all other models in its size class. Even more impressively, Zephyr surpasses