
Freepik started with a simple yet powerful mission: to make finding free visual resources easier than ever before. From those humble beginnings, we’ve kept growing thanks to our users’ feedback, by creating new exclusive content and moving into new territories – photos, icons (flaticon.com) and slides (slidesgo.com). The search system remains central to our interface, a vital component of our success. In the interviews we conduct, users always stress how important it is to keep improving the search experience. When the search engine does its job, you forget about it! You’re free to focus on the content you need.
Our previous-generation search engine was text-based. What does this mean? It means that every image has text describing it: a title and a list of tags. In essence, you type what you want to find, we split your search into words, and we look for images containing those terms. Simple, isn’t it?
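As an illustration only – this is a minimal sketch, not the production implementation – a term-matching search of that kind might look like this:

```python
# Minimal sketch of a term-based search over titles and tags.
# Illustrative only; names and scoring are assumptions, not Freepik's code.
from dataclasses import dataclass, field


@dataclass
class Asset:
    title: str
    tags: list[str] = field(default_factory=list)


def match_score(query: str, asset: Asset) -> int:
    """Count how many query terms appear in the asset's title or tags."""
    terms = query.lower().split()
    haystack = asset.title.lower().split() + [t.lower() for t in asset.tags]
    return sum(term in haystack for term in terms)


def search(query: str, catalog: list[Asset]) -> list[Asset]:
    """Return matching assets ordered by the number of matched terms."""
    scored = [(match_score(query, a), a) for a in catalog]
    return [a for score, a in sorted(scored, key=lambda p: -p[0]) if score > 0]


catalog = [Asset("happy dog running", ["dog", "pet"]), Asset("city skyline", ["urban"])]
print([a.title for a in search("running dog", catalog)])  # ['happy dog running']
```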
A Decade of Improvements
Over the years, the search process became more complex, and more weight was given to words that work well for certain images. We “lemmatized” these words, meaning we normalized them through an analysis of the vocabulary and its morphology, reducing them to their most basic form (unconjugated, singular, with a unified gender, etc.).
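As an illustration of that normalization step (the post doesn’t say which tools are used in production), NLTK’s WordNet lemmatizer does exactly this for English:

```python
# Illustrative lemmatization with NLTK; the actual tooling is an assumption.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dogs"))              # dog (singular form)
print(lemmatizer.lemmatize("running", pos="v"))  # run (unconjugated verb)
```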
User searches were augmented with the most common “next search” available. In languages like Japanese, which don’t use spaces between words, we had to learn how to separate words. And in order to give our users the best possible experience, we continually monitor which tags are most popular in each country – for example, prioritizing content with the “Asian” tag for Japanese users. There is a long list of improvements over the last 10 years that increased our main KPI: the percentage of searches that end in a download (SDR).
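For instance, a morphological tokenizer can split Japanese text into searchable terms; the sketch below uses Janome purely as an illustration – the segmenter actually used in production isn’t named in this post:

```python
# Illustrative Japanese word segmentation with Janome (a pure-Python tokenizer).
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()
query = "かわいい犬の写真"  # "cute dog photo", written without spaces
terms = [token.surface for token in tokenizer.tokenize(query)]
print(terms)  # e.g. ['かわいい', '犬', 'の', '写真']
```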
Despite our best efforts, some searches still don’t turn out the way we – or our users – would like.
The AI Era
As often happens, big improvements require different approaches. After years of struggling with “embeddings” – lists of numbers that neural networks produce as a translation of texts and images – 2021 brought a breakthrough: OpenAI’s CLIP model. With this model, texts and images share the same embedding space, meaning that the text “dog” and a photo of a dog end up with nearly the same sequence of numbers – the embedding – that represents them in that space. Thus, this embedding represents the concept of “dog.”
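To make the idea concrete, here is a small sketch with the publicly released CLIP model – the checkpoint and the image file are placeholders, not our production setup:

```python
# Sketch: with CLIP, texts and images are encoded into the same embedding space.
# Assumes the `clip` package from https://github.com/openai/CLIP and a local dog.jpg.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)

# Cosine similarity: the dog photo should land closest to the "dog" text.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # higher score for "a photo of a dog"
```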
This opened the door to new and exciting possibilities.
As an example, by adding a decoder that converts an embedding into text, you can input an image and automatically get a title for it
(image -> embedding -> text).
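As a purely illustrative example of that pipeline (the post doesn’t say which decoder, if any, is used at Freepik), an off-the-shelf captioning model such as BLIP encodes the image into embeddings and decodes a short description from them:

```python
# Illustrative only: BLIP as an example of image -> embedding -> text.
# The model choice and the file name "photo.jpg" are placeholders.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```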
Besides, with the ability to turn a text embedding into a visual representation, you can build a system that generates images from text descriptions – and that’s exactly how the new AI image generators work, like the one on Wepik. But let’s not forget: the very first application we were interested in was the search engine, where you convert the query text into an embedding and retrieve the images whose embeddings are closest to it.
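Sketching that retrieval step under a couple of assumptions – precomputed, normalized image embeddings and FAISS as the nearest-neighbour index, neither of which this post specifies:

```python
# Sketch of embedding-based retrieval over a precomputed image collection.
# FAISS and the random data are assumptions; any nearest-neighbour index would do.
import faiss
import numpy as np

dim = 512                                                        # e.g. CLIP ViT-B/32 size
image_embs = np.random.rand(10_000, dim).astype("float32")       # stand-in for real embeddings
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)  # normalize to unit length

index = faiss.IndexFlatIP(dim)  # inner product == cosine similarity on unit vectors
index.add(image_embs)

query_emb = np.random.rand(1, dim).astype("float32")             # stand-in for encode_text(query)
query_emb /= np.linalg.norm(query_emb, axis=1, keepdims=True)

scores, ids = index.search(query_emb, 5)                         # top-5 closest images
print(ids[0], scores[0])
```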
AI-based Search Engine
My first job when I joined Freepik was just that: to explore and improve CLIP as a replacement for our existing search engine. To set the scene – just as users in Asian countries expect to find Asian people in pictures without mentioning it explicitly, Freepik users have implicit preferences when they search for content. Since CLIP had been trained on texts and images scraped from the internet – unfiltered, so to speak – we needed to fine-tune it to answer precisely our users’ needs.
Our first task was to create a metric, the SSET – Self Search Error by Text – that measures how successful the search process is. It’s a window into how effectively users can find what they’re looking for, and it lets us compare the performance of different search engines. It measures how close an image comes to being the first result when you search for it using its own title. We verified that a lower SSET correlated with higher quality in the search results. In short, a lower SSET meant a clear improvement in the results returned by the search.
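The exact formula isn’t given here, but one plausible reading of “how close an image comes to being the first result” is the average rank at which each image retrieves itself when searching with its own title. A rough sketch of that interpretation:

```python
# Rough sketch of a self-search error; the real SSET formula may differ.
# An average rank of 0.0 would mean every image is the first result for its own title.
import numpy as np


def self_search_error(title_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """title_embs[i] embeds image i's title; both arrays are L2-normalized, shape (N, d)."""
    sims = title_embs @ image_embs.T                    # (N, N) title-to-image similarities
    # Rank of image i among the results for its own title (0 = first result).
    ranks = (sims > np.diag(sims)[:, None]).sum(axis=1)
    return float(ranks.mean())
```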
We used the new metric to evaluate the standard CLIP and found some weaknesses: the model was pretty good in English, adequate in Spanish and Portuguese, but unusable in languages like Japanese or Korean. Complex searches weren’t a problem, but simple ones seemed to stump it. It even returned results where the search words appeared written inside the images – something that could be solved with further fine-tuning on our own data.
Leveraging our Data
Training began with CLIP, and later on we switched to the fabulous OpenCLIP models. We fine-tuned these models with the texts our users had searched for when an image was downloaded, which increased performance across all the languages we support. In other words, the queries associated with a successful download turned out to be the best material to train the model with.
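A sketch of what one such fine-tuning step could look like with OpenCLIP and a symmetric contrastive loss – the checkpoint, hyperparameters and batch handling below are illustrative placeholders, not the production setup:

```python
# Sketch of contrastive fine-tuning on (search query, downloaded image) pairs.
# Checkpoint, learning rate and data loading are assumptions for illustration.
import open_clip
import torch
import torch.nn.functional as F

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


def training_step(images: torch.Tensor, queries: list[str]) -> float:
    """images: a preprocessed batch; queries: the texts typed before each download."""
    image_emb = F.normalize(model.encode_image(images), dim=-1)
    text_emb = F.normalize(model.encode_text(tokenizer(queries)), dim=-1)
    logits = model.logit_scale.exp() * image_emb @ text_emb.T
    labels = torch.arange(len(queries))  # query i matches image i in the batch
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```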
Our next step was to fine-tune the system using the images and their titles. This showed an improvement in English, and hinted at even better results in other languages.
That was when we ran our first live test, using the brand-new search engine to serve up to 5% of Freepik’s traffic. Although we had made progress, it was clear that the search engine still needed a bit more fine-tuning for users typing short queries. It wasn’t all bad news, though: longer queries brought up excellent results!
The quality of