I’m still hard at work on my in-depth 70B model evaluations, but with the recent releases of the first Yi finetunes, I can’t hold back anymore and need to post this now…
Curious about these new Yi-based 34B models, I tested and compared them to the best 70Bs. And to make such a comparison even more exciting (and possibly unfair?), I’m also throwing Goliath 120B and OpenClosedAI’s GPT models into the ring.
Models tested:
- 2x 34B Yi: Dolphin 2.2 Yi 34B, Nous Capybara 34B
- 12x 70B: Airoboros, Dolphin, Euryale, lzlv, Samantha, StellarBright, SynthIA, etc.
- 1x 120B: Goliath 120B
- 3x GPT: GPT-4, GPT-3.5 Turbo, GPT-3.5 Turbo Instruct
Testing methodology
Those of you who already know my testing methodology will notice that this is just the first of the three test series I usually do.
I’m still working on the others (Amy+MGHC chat/roleplay tests), but don’t want to delay this post any longer.
So consider this first series of tests mainly about instruction understanding and following, knowledge acquisition and reproduction, and multilingual capability.
It’s a good test because few models have been able to master it thus far, and it’s not just a theoretical or abstract exercise: it represents a real professional use case, and the tested capabilities are also highly relevant for chat and roleplay.
- 1st test series: 4 German data protection trainings
- I run models through 4 professional German online data protection trainings/exams – the same ones our employees have to pass as well.
- The test data and questions as well as all instructions are in German while the character card is in English. This tests translation capabilities and cross-language understanding.
- Before giving the information, I instruct the model (in German): “I’ll give you some information. Take note of this, but only answer with ‘OK’ as confirmation of your acknowledgment, nothing else.” This tests instruction understanding and following capabilities.
- After giving all the information about a topic, I give the model the exam question. It’s a multiple-choice (A/B/C) question; the last question of each test repeats the first one, but with the answer options reordered and relabeled (X/Y/Z). Each test has 4-6 exam questions, for a total of 18 multiple-choice questions.
- If the model gives a single letter response, I ask it to answer with more than just a single letter – and vice versa. If it fails to do so, I note that, but it doesn’t affect its score as long as the initial answer is correct.
- I sort models according to how many correct answers they give; in case of a tie, I have them go through all four tests again and answer blind, without providing the curriculum information beforehand (see the scoring sketch below). Best models at the top; symbols (✅➕➖❌) denote particularly good or bad aspects.
- All tests are separate units, context is cleared in between, there’s no memory/state kept between sessions.
- SillyTavern v1.10.5 frontend (not the latest as I don’t want to upgrade mid-test)
- koboldcpp v1.49 backend for GGUF models
- oobabooga’s text-generation-webui for HF/EXL2 models
- Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons; see the API sketch below)
- Official prompt format as noted
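
To make the setup reproducible, here’s a minimal sketch of what the deterministic preset boils down to in practice, using koboldcpp’s KoboldAI-compatible /api/v1/generate endpoint. The exact numbers in SillyTavern’s preset may differ slightly, and the Vicuna-formatted prompt is just an illustrative assumption:

```python
import requests

# Local koboldcpp instance (default port); adjust to your setup.
API_URL = "http://localhost:5001/api/v1/generate"

payload = {
    # Vicuna prompt format, as used e.g. for Goliath, Nous Capybara and lzlv:
    "prompt": "USER: Ich gebe dir nun einige Informationen. ...\nASSISTANT:",
    "max_context_length": 4096,
    "max_length": 300,
    # "Deterministic" in essence: top_k=1 forces greedy decoding, so the
    # single most likely token is always picked. The near-zero temperature
    # and neutral top_p/rep_pen remove the remaining random factors.
    "temperature": 0.01,
    "top_k": 1,
    "top_p": 1.0,
    "rep_pen": 1.0,
}

response = requests.post(API_URL, json=payload)
print(response.json()["results"][0]["text"])  # e.g. "OK"
```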
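
And here’s a small sketch of the ranking logic described above. All names are my own, purely illustrative; running the blind pass only for tied models (as I do) yields the same order as always using it as a secondary sort key, which is what this sketch does for simplicity:

```python
# Illustrative scoring/ranking sketch; names are hypothetical, not from any tool.

def score(results):
    """Count correct answers. results: list of (given, correct) letter pairs."""
    return sum(given == correct for given, correct in results)

def rank_models(models):
    """models: dict of model name -> {"normal": [...], "blind": [...]} results."""
    return sorted(
        models,
        key=lambda name: (
            score(models[name]["normal"]),  # primary: score with curriculum given
            score(models[name]["blind"]),   # tie-breaker: blind score (questions only)
        ),
        reverse=True,  # best models at the top
    )

# Tiny example mirroring this round's top results:
models = {
    "GPT-4": {"normal": [("A", "A")] * 18, "blind": [("A", "A")] * 18},
    "lzlv_70B": {"normal": [("A", "A")] * 18,
                 "blind": [("A", "A")] * 17 + [("B", "A")]},
}
print(rank_models(models))  # ['GPT-4', 'lzlv_70B']
```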
1st test series: 4 German data protection trainings
- 1. GPT-4 API:
- ✅ Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 18/18)
- ✅ Consistently acknowledged all data input with “OK”.
- ✅ Followed instructions to answer with just a single letter or more than just a single letter.
- 1. goliath-120b-GGUF Q2_K with Vicuna format:
- ✅ Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 18/18)
- ✅ Consistently acknowledged all data input with “OK”.
- ✅ Followed instructions to answer with just a single letter or more than just a single letter.
- 1. Nous-Capybara-34B-GGUF Q4_0 with Vicuna format and 16K max context:
- ❗ Yi GGUF BOS token workaround applied!
- ❗ There’s also an EOS token issue, but despite that it worked perfectly, and SillyTavern catches and removes the erroneous EOS token!
- ✅ Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 18/18)
- ✅ Consistently acknowledged all data input with “OK”.
- ✅ Followed instructions to answer with just a single letter or more than just a single letter.
- 2. lzlv_70B-GGUF Q4_0 with Vicuna format:
- ✅ Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 17/18)
- ✅ Consistently acknowledged all data input with “OK”.
- ✅ Followed instructions to answer with just a single letter or more than just a single letter.
- 3. chronos007-70B-GGUF Q4_0 with Alpaca format:
- ✅ Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 16/18)
- ✅ Consistently acknowledged all data input with “OK”.
- ✅ Followed instructions to answer with just a single letter or more than just a single letter.