
An LLM Benchmark for Financial Document Question Answering
by AnhTho_FR
Generative Artificial Intelligence (GenAI) has unlocked a multitude of new applications across numerous sectors, including financial services. As part of our continuous efforts to explore the potential and limitations of AI, we at Farsight conducted an experiment to evaluate the capabilities of various popular Large Language Models (LLMs) in the context of financial document analysis, particularly 10-Ks and 10-Qs, the annual and quarterly reports that public companies file regarding their financials.
Our results include three major findings:
- Out-of-the-box LLMs have far from satisfactory performance: Even when provided the necessary context to answer a financial question, out-of-the-box LLMs spectacularly fail to meet the production standards required in the financial sector, with most models incorrectly answering ~40% of questions.
- Rising strength of open-source: While GPT-4 remains the top model, GPT-3.5’s performance has been eclipsed by the open-source SOLAR-10.7b model.
- Calculation tasks remain a major pitfall: 7 of the 8 models evaluated failed on over half of the calculation-related tasks they were presented with. GPT-4 led all models with only 57% accuracy on such tasks.
We constructed our benchmark from a 10-K dataset hosted on Hugging Face. Based on contexts drawn from this dataset, we generated questions covering three categories of interest (calculations, financial domain knowledge, and regulatory knowledge) using GPT-4-Turbo. We then asked a variety of LLMs to answer these questions given the relevant context, and finally evaluated each model’s responses using GPT-4-Turbo to determine correctness based on the information provided in the associated context. Unlike existing financial benchmarks, we emphasized design decisions that enable automation and extensibility throughout this process, making it easy to repurpose this work for benchmarking LLMs in other domains, as well as for other specific financial services use cases.
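For concreteness, here is a minimal sketch of what the question-generation step can look like, assuming the OpenAI Python client; the prompt wording, the `CATEGORIES` list, and the `generate_question` helper are illustrative assumptions, not the exact code behind the benchmark.

```python
# Minimal sketch of the question-generation step, assuming the OpenAI Python
# client; the prompt wording and helper name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["calculations", "financial domain knowledge", "regulatory knowledge"]

def generate_question(context: str, category: str) -> str:
    """Ask GPT-4-Turbo for one benchmark question grounded in a 10-K excerpt."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": (f"Write one {category} question that can be answered "
                         "solely from the 10-K excerpt the user provides.")},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content
```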
Our main objectives of this benchmark were to:
- Understand the outright performance of current-generation LLMs on financial text, given golden context. We explicitly include golden context so performance gaps are not confounded with retrieval errors.
- Compare performance between LLMs (open-source and closed-source) across three types of questions (calculations, financial domain knowledge, and regulatory knowledge).
- Present an efficient and automated method to evaluate various foundation models on domain-specific tasks (our process is 100x more efficient than current practices for evaluation).
Note: Golden context refers to a small snippet of content (2–3 paragraphs) that contains a substantial portion of the information needed to correctly answer a question.
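To make the evaluation loop concrete, the sketch below shows how a question can be posed alongside its golden context and then graded automatically by GPT-4-Turbo. The prompts and the CORRECT/INCORRECT convention are illustrative rather than our exact implementation; open-source models such as SOLAR-10.7b can be driven through the same interface when served behind an OpenAI-compatible endpoint (for example, with vLLM).

```python
# Simplified sketch of the answer-and-grade loop; the prompts and the
# CORRECT/INCORRECT convention are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def answer_with_golden_context(model: str, context: str, question: str) -> str:
    """Have the model under test answer using only the golden context."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

def judge_answer(context: str, question: str, answer: str) -> bool:
    """Use GPT-4-Turbo as an automated grader, as described above."""
    verdict = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Context:\n{context}\n\nQuestion: {question}\n\n"
                f"Proposed answer: {answer}\n\n"
                "Based only on the context, reply with a single word: "
                "CORRECT or INCORRECT."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("CORRECT")
```

Because grading is itself just another LLM call, scoring a new model becomes a batch job rather than a manual review pass, which is where the automation and efficiency gains described above come from.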
We provide our results and analysis first, then elaborate further on our dataset creation and evaluation process.
Results + Analysis
In order to gauge the current state of LLMs in the domain of financial services, we surveyed a broad range of models, including:
For the remainder of this article, we will refer to each of these models by the names shown in parentheses.
Additionally, we attempted to leverage domain-specific models such as Finance-Chat, but found that Finance-Chat’s performance was significantly lower than that of its peers and decided to exclude it from our results.
Current LLMs Perform Poorly on Financial Text: The table above reports the overall results on our benchmark across all models that we surveyed. GPT-4 significantly outperforms all other models with an overall accuracy of 81.5% while most other models fall somewhere in the 60–70% range. All models struggle mightily with calculation questions, but even within the other buckets of questions, error rates across all models are unacceptable for production use cases within the financial sector. This suggests the need for financial institutions to adopt more sophisticated systems built on top of LLMs which are able to pull in domain knowledge and perform high-fidelity calculations (likely through code generation/execution) when necessary.
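As one illustration of what such a system could look like, the sketch below asks GPT-4-Turbo to emit Python that performs the arithmetic and then executes that code, rather than trusting the model’s in-context math. The `calculate_via_code` helper, the prompt, and the bare `exec` are demonstration assumptions, not a production design.

```python
# Hedged sketch of the "code generation/execution" idea: rather than trusting
# the LLM's own arithmetic, ask it to emit Python that computes the figure,
# then run that code. The bare exec() is fine for a demo; a production system
# needs a real sandbox.
from openai import OpenAI

client = OpenAI()

def calculate_via_code(context: str, question: str) -> str:
    """Answer a calculation question by generating and running Python."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Context:\n{context}\n\nQuestion: {question}\n\n"
                "Reply with only Python code (no markdown fences) that "
                "computes the answer and assigns it to a variable `result`."
            ),
        }],
    )
    code = resp.choices[0].message.content.strip()
    # Strip markdown fences in case the model adds them anyway.
    if code.startswith("```"):
        code = code.strip("`\n").removeprefix("python").lstrip()
    namespace: dict = {}
    exec(code, namespace)  # unsandboxed: for illustration only
    return str(namespace["result"])
```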
Open-Source Models have Eclipsed GPT-3.5: SOLAR-10.7b is the best open-source LLM for financial text, showing a significant performance improvement over GPT-3.5 (68.3% avg. accuracy vs. 64.6%), particularly when answering questions in the financial domain and regulatory buckets. While there is still a significant gap between open-source models and GPT-4, this indicates that open-source is quickly catching up.