Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs (submitted by doener)
[Submitted on 30 Sep 2024 (v1), last revised 15 Oct 2024 (this version, v2)]
Authors:Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla, Pavel Denisov, Nicolo’ Brandizzi, Qasid Saleem, Anirban Bhowmick, Lennard Helmer, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Alex Jude, Lalith Manjunath, Samuel Weinbach, Carolin Penke, Oleg Filatov, Shima Asaadi, Fabio Barth, Rafet Sifa, Fabian Küch, Andreas Herten, René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr
Abstract: We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
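For context, "tokenizer optimization" here means training a custom tokenizer on the multilingual corpus itself, so that non-English text is not split into wastefully many tokens. A minimal sketch of that idea with the Hugging Face tokenizers library (the corpus files, vocabulary size, and special tokens are illustrative assumptions, not the paper's actual settings):

    # Sketch: train a byte-level BPE tokenizer on a multilingual corpus.
    # Corpus files, vocab size, and special tokens are illustrative only.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=100_000,
        special_tokens=["<unk>", "<s>", "</s>"],
    )
    # One text file per language so smaller languages are represented too.
    tokenizer.train(files=["eu_de.txt", "eu_fr.txt", "eu_lv.txt"], trainer=trainer)
    tokenizer.save("multilingual-bpe.json")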
10 Comments
smokel
A paper on languages that begins with a grammatical error in the first sentence does not inspire confidence:
> LLMs represents a disruptive technology
JKolios
More diversity in the LLM space is always good. In my experience, though, speaking as a native speaker of one of the less widely used European languages, Mistral's models already handle it pretty well.
kiru_io
Maybe someone should edit the title to mention this is from 2024: [Submitted on 30 Sep 2024 (v1), last revised 15 Oct 2024 (this version, v2)]
ozgune
I had a related, but orthogonal question about multilingual LLMs.
When I ask a smaller model a question in English, it does well. When I ask the same model a question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, get the answer, and translate the answer back to Turkish, the model again does well.
For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.
Has anyone else observed similar behavior?
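For reference, the pivot workflow I mean looks like this. A minimal sketch assuming an OpenAI-compatible endpoint (the base_url, api_key, and model id are placeholders for whatever serves your local Llama):

    # Sketch of the translate -> answer -> translate-back pivot described above.
    # base_url, api_key, and MODEL are placeholders, not real endpoints.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    MODEL = "llama-3.3-70b-instruct"

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def ask_via_english(question: str, lang: str = "Turkish") -> str:
        # 1. Translate the question into English.
        q_en = ask(f"Translate this {lang} text to English:\n\n{question}")
        # 2. Answer in English, where the model is strongest.
        a_en = ask(q_en)
        # 3. Translate the answer back into the original language.
        return ask(f"Translate this English text to {lang}:\n\n{a_en}")

The two extra calls cost latency, but they route the actual reasoning through the language the model saw most during training.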
miros_love
> European versions of ARC
But this is an image-like benchmark. Has anyone looked at the EU-ARC paper? What is the difference, and why can't you evaluate on the regular one?
I skimmed it and didn't find an answer right away, but judging by their tokenizer, they are training from scratch. In general, I don't like this approach for the task at hand. For high-resource languages, there are already good models that they don't compare against. And for low-resource languages, it is very important to include more languages from the same language family, not necessarily only those of the EU.
YetAnotherNick
They compared against Llama 3.1 and found it to be better on average on their tasks, like European MMLU. And Llama 3.1 is the weakest of that batch, with Qwen 2.5 and Gemma 3 being significantly better.
KronisLV
I also quite liked the EuroLLM project: https://huggingface.co/blog/eurollm-team/eurollm-9b
It was pretty good with Latvian (better than other models of this size, as well as the Llama or Qwen variants I could run), and I assume it probably does well with other EU languages too.
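If anyone wants to try it, here's a minimal sketch with transformers. I'm assuming the model id from the linked blog post is utter-project/EuroLLM-9B:

    # Sketch: load EuroLLM-9B and complete a Latvian prompt.
    # The repo id is an assumption from the linked blog post; adjust if it differs.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "utter-project/EuroLLM-9B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # "Riga is the capital of Latvia and ..."
    prompt = "Rīga ir Latvijas galvaspilsēta un"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))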
tannhaeuser
I mean, Mistral AI is a Paris-based company, and their models were considered on par with or better than other open-weight models such as Llama 3.1 and Qwen 2.5; Mistral's 24B model is currently beating the oh-so-great Gemma 3 27B, depending on the task.
Also, Stable Diffusion was originally developed in Munich (and I believe still is).
It's true, though, that raising capital and finding investors works way better in the US (kind of needless to say on HN), and so did attracting top talent, at least in the past. Don't get me started on energy prices ;) but I don't believe those contribute significantly in the end anyway.
jug
On this topic, don't miss this quite useful benchmark:
https://euroeval.com
NKosmatos
There is also a Greek LLM from 2024.
Meltemi: A Large Foundation Language Model for the Greek Language
https://huggingface.co/ilsp/Meltemi-7B-v1.5