In this blogpost, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens.
Evaluation results show that the long-range retrieval accuracy of LongChat-13B is up to 2x higher than other long context open models such as MPT-7B-storywriter (65K), MPT-30B-chat (8K), and ChatGLM2-6B (32k).
LongChat shows promising results in closing the gap between open models and proprietary long context models such as Claude-100K and GPT-4-32K.
Figure 1: Comparing LongChat to other models on the long-range topic retrieval task.
Not only can LongChat models handle such a long context length, but they also precisely follow human instructions in dialogues and demonstrate strong performance in the human preference benchmark MT-Bench.
Their preview versions are available at HuggingFace: lmsys/longchat-13b-16k and lmsys/longchat-7b-16k.
You can try them immediately in CLI or web interface using FastChat:
python3 -m fastchat.serve.cli --model-path lmsys/longchat-7b-16k
There has been a significant surge of interest within the open-source community in developing language models with longer context or extending the context length of existing models like LLaMA.
This trend has led to interesting observations and extensive discussions in various sources, such as Kaiokendev’s blog and this arXiv manuscript.
Meanwhile, several models have been released that claim to support a much longer context than LLaMA; notable ones include:
- MPT-7B-storywriter supports 65K context length and extrapolates to 80K.
- MPT-30B-chat supports 8K context length.
- ChatGLM2-6B supports 32K context.
At LMSYS Org, we have been concurrently exploring various techniques to lengthen the context of our models like Vicuna.
In this blogpost, alongside the release of the LongChat series, we share our evaluation tools to verify the long-context capability of LLMs.
Using our evaluation tools in combination with various academic long-context evaluation benchmarks, we conduct a thorough comparison of several open-source and commercial models that claim to support long context.
Through this analysis, we examine how well these models deliver on their promised context length.
We found that while commercial models like GPT-3.5-turbo perform well on our tests, many open-source models do not deliver the expected results at their promised context length.
The data and code used to reproduce the results in the blog post are available in our LongChat repo.
We provide a visualization in this notebook.
LongChat Training Recipe
LongChat is finetuned from LLaMA models, which were originally pretrained with 2048 context length.
The training recipe can be conceptually described in two steps:
Step 1: Condensing rotary embeddings
Rotary position embedding is a type of positional embedding that injects position information into the Transformer.
It is implemented in Hugging Face transformers as:
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
where position_ids are indices such as 1, 2, 3, … that denote the position of a token in the sentence. For instance, the token “today” in the sentence “today is a good day” has position_ids 1. The apply_rotary_pos_emb() function then applies a transformation based on the provided position_ids.
The LLaMA model is pre-trained with rotary embedding on sequence length 2048, which means that it has not observed scenarios where position_ids > 2048 during the pre-training phase.
Instead of forcing the LLaMA model to adapt to position_ids > 2048, we condense position_ids > 2048 to be within 0 to 2048.
Intuitively, we conjecture this condensation can maximally reuse the model weights learned in the pre-training stage. See more insights from Kaiokendev’s blog.
We define the condensation ratio as the target new context length y divided by 2048. We then divide every position_id by this ratio and feed it into the apply_rotary_pos_emb() function.
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids / ratio)
In this release, we fine-tune the model to a context length of 16384, and thus the condensation ratio is 8. For instance, a token with position_ids = 10000 becomes position_ids = 10000 / 8 = 1250, and the neighboring token 10001 becomes 10001 / 8 = 1250.125.
This step requires no training.
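For concreteness, here is a minimal, self-contained sketch of how condensed rotary cos/sin tables could be computed. It is illustrative only: the function name condensed_rotary_cos_sin and its arguments are ours, not the actual LongChat patch, which lives in the LongChat repo.

```python
# Illustrative sketch only: compute rotary cos/sin tables with condensed
# position ids (positions divided by the condensation ratio), so that
# positions up to ratio * 2048 fall back into the pre-trained [0, 2048) range.
import torch

def condensed_rotary_cos_sin(seq_len: int, dim: int, ratio: float,
                             base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Condense: positions 0, 1, ..., seq_len - 1 become 0, 1/ratio, 2/ratio, ...
    t = torch.arange(seq_len).float() / ratio
    freqs = torch.outer(t, inv_freq)          # (seq_len, dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)   # (seq_len, dim)
    return emb.cos(), emb.sin()

# With ratio = 8, position 16383 maps to 16383 / 8 ≈ 2047.9, within the range
# the model saw during pre-training; the resulting cos/sin tables would then
# be passed to apply_rotary_pos_emb() as usual.
cos, sin = condensed_rotary_cos_sin(seq_len=16384, dim=128, ratio=8.0)
```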
Step 2: Finetuning on Curated Conversation Data
After condensing the embedding, we perform the finetuning procedure on our curated conversation dataset.
We reuse our collected user-shared conversations previously used for training Vicuna.
We clean the data using the FastChat data pipeline and truncate these conversations so they are no longer than 16K tokens.
We finetune the models using the standard next-token prediction loss; the 7B and 13B models are finetuned on 80k and 18k conversations, respectively.
To save memory, we use PyTorch FSDP and Flash Attention. Assuming an A100 costs $3/hour on the cloud, training the 7B model costs ~$300 and the 13B model ~$700.
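As a point of reference, below is a minimal sketch of the standard next-token prediction loss mentioned above. It is illustrative only; the actual training code with FSDP and Flash Attention is in the LongChat repo, and the function name next_token_loss is ours.

```python
# Minimal sketch of the standard next-token prediction (causal LM) loss.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    # Shift so that the logits at position t predict the token at position t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))
```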
Evaluation toolkits: LongEval
Recently, commercial and open-source models have continued to tout their ability to support expanded context lengths (from 8K and 32K up to 100K) in their latest releases, but how can we verify these claims?
The term “long-context capability” can mean different things for different model providers. For instance, does MPT-7B-StoryWriter’s advertised 65K context length operate at the same capacity as OpenAI’s ChatGPT at 16K?
This issue also arose during the development of our LongChat models: how do we swiftly and effectively confirm whether a freshly trained model can handle the intended context length?
To address this, we can base our evaluations on tasks that require LLMs to process lengthy contexts, such as text generation, retrieval, summarization, and information association in long text sequences.
Inspired by recent discussions, we’ve devised LongEval, a long-context test suite.
This suite incorporates two tasks of varying degrees of difficulty, providing a simple and swift way to measure and compare long-context performance.
Task 1: Coarse-grained Topic Retrieval
In real-world long conversations, users usually talk about and jump between several topics with the chatbot. The Topic Retrieval task mimics this scenario by asking the chatbot to retrieve the first topic in a long conversation consisting of multiple topics. An example task is:
… (instruction of the task)
USER: I would like to discuss <TOPIC-1>
ASSISTANT: Sure! What about xxx of <TOPIC-1>?
… (a multi-turn conversation of <TOPIC-1>)
USER: I would like to discuss <TOPIC-2>
…
USER: I would like to discuss <TOPIC-k>
…
USER: What is the first topic we discussed?
ASSISTANT:
This task tests whether the model can locate a chunk of text and associate it with the right topic name. We design each conversation to be 400 ~ 600 tokens long. Thus, this task is considered coarse-grained because the model may give a correct prediction even if it only roughly locates the first topic, since each topic occupies a chunk of several hundred tokens.
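To make the setup concrete, here is a hedged sketch of how such a topic-retrieval test case could be assembled from per-topic conversations. The function name and structure are illustrative; the actual LongEval prompts and task instructions are in the LongChat repo.

```python
# Illustrative sketch only: assemble a coarse-grained topic-retrieval prompt
# from (topic, conversation) pairs, each conversation roughly 400-600 tokens.
def build_topic_retrieval_prompt(topic_conversations: list[tuple[str, str]]) -> str:
    parts = ["(instruction of the task)"]
    for topic, chat in topic_conversations:
        parts.append(f"USER: I would like to discuss {topic}")
        parts.append(chat)  # the multi-turn conversation about this topic
    parts.append("USER: What is the first topic we discussed?")
    parts.append("ASSISTANT:")
    return "\n".join(parts)

# The expected answer is simply the first topic name:
# expected = topic_conversations[0][0]
```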