Hi all!
I've been using various OpenAI and PaLM 2 APIs in our pipeline. The pipeline itself works fine until OpenAI has problems (very frequent 502 Bad Gateway errors with the GPT-4 model, or errors on their side) or the external APIs get slow.
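For the transient 502s, one thing that helps is wrapping every call in retry-with-exponential-backoff so a single flaky response doesn't fail the whole pipeline. A minimal stdlib sketch (the `call` argument is a hypothetical zero-arg function that does one API request; names are mine, not from any SDK):

```python
import random
import time

def call_with_retries(call, max_attempts=4, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter.

    `call` is any zero-arg callable that raises on transient errors
    (e.g. a 502 Bad Gateway from the provider).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the real error
            # back off ~1s, 2s, 4s, ... with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

In practice you'd catch only the provider's transient error classes rather than bare `Exception`, and possibly fall back to a second model after the retries are exhausted.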
In an ideal world we'd reduce external dependencies, but the open-source models just aren't competitive with the closed-source ones yet.
The usual tricks for getting better quality out of LLMs (CoT, self-consistency, etc.) also increase latency in the system: longer prompts take longer to complete, and self-consistency multiplies the number of calls.
I would love to know if:
– are there good open-source models that folks have fine-tuned for their pipelines?
– UX tricks for fighting latency (streaming the output helps, but when the pipeline is a chain of calls it's not trivial to stream intermediate results)
– batching calls (we already do this to an extent, but it depends on how many requests hit the pipeline at the same time)
– any other tricks?
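On the batching point, where calls in the pipeline are independent of each other, fanning them out concurrently (rather than awaiting them one by one) cuts wall-clock latency to roughly the slowest call. A sketch with `asyncio`; `call_llm` is a hypothetical async function taking one prompt, and the semaphore is there so bursts don't blow through rate limits:

```python
import asyncio

async def fan_out(prompts, call_llm, max_concurrency=8):
    """Run independent LLM calls concurrently instead of sequentially.

    `call_llm` is an async callable taking one prompt. The semaphore
    caps in-flight requests to stay under provider rate limits.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with sem:
            return await call_llm(prompt)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(p) for p in prompts))
```

This only helps for the parallel branches of a pipeline; sequential stages (where one call's output feeds the next prompt) still pay full latency per hop.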
I can't go into much detail about the pipeline here, but I'd assume these are common problems for anyone building products on top of these LLMs. I'd appreciate any and all insights and tricks you've used :)
Thanks for your help!