A new search for a new world

Freepik started with a simple yet powerful mission: to make finding free visual resources easier than ever before. From these humble beginnings, we’ve kept growing thanks to our users’ feedback, by creating new exclusive content and moving into new territories – photos, icons (flaticon.com) and slides (slidesgo.com). The search system remains central to our interface, a vital component of our success. In the interviews we conduct, users always stress how important it is to keep improving the search experience. When the search engine does its job, you forget about it! It’s time to focus on the content you need.

Our previous-generation search engine was text-based. What does this mean? It means that every image has text describing it: a title and a list of tags. In essence, you type what you want to find, we split the words of your search, and we look for images containing these terms. Simple, isn’t it?
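The matching logic can be sketched in a few lines. This is a toy in-memory catalog with a naive tokenizer, purely for illustration – not our real pipeline:

```python
# Toy text-based search: an image matches when its title/tags contain
# every term of the query. Catalog contents are made up for the example.

catalog = {
    "img1": {"title": "red car", "tags": ["car", "vehicle", "red"]},
    "img2": {"title": "black dog", "tags": ["dog", "animal", "pet"]},
    "img3": {"title": "dog in a car", "tags": ["dog", "car", "travel"]},
}

def search(query: str) -> list[str]:
    """Return ids of images whose title or tags contain every query term."""
    terms = query.lower().split()
    results = []
    for image_id, doc in catalog.items():
        words = set(doc["title"].split()) | set(doc["tags"])
        if all(term in words for term in terms):
            results.append(image_id)
    return results

print(search("dog car"))  # ['img3'] – the only image matching both terms
```

The weakness is already visible: the query must literally contain the catalog’s words, or nothing matches.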

A decade of improvements

Over the years, search processes became more complex, and more importance was given to words that work well for certain images. We “lemmatized” these words, meaning we normalized them through an analysis of the vocabulary and its morphology, reducing them to their most basic form (unconjugated, singular, with a unified gender, etc.).
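As a minimal sketch of that normalization step, here is a dictionary-based lemmatizer. Real systems use full morphological analyzers; this lookup table is only a stand-in:

```python
# Toy lemmatizer: map inflected forms to their base form before matching.
# The dictionary below is illustrative, not a real morphological model.

LEMMAS = {
    "cars": "car", "running": "run", "ran": "run",
    "dogs": "dog", "flags": "flag",
}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word.lower(), word.lower())

def normalize_query(query: str) -> list[str]:
    return [lemmatize(w) for w in query.split()]

print(normalize_query("Dogs running"))  # ['dog', 'run']
```

With both queries and tags normalized this way, “dogs running” and “dog run” hit the same images.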

User searches were augmented with the most common “next search” available. In languages like Japanese, which don’t have distinct word divisions, we had to learn how to separate words. And in order to provide our users with the best possible experience, we continually monitor which tags are most popular in each country, for example, by prioritizing content with the “Asian” tag for Japanese users. There is a long list of improvements over the last 10 years that increased our main KPI: the percentage of searches that end in a download (SDR).

Despite our best efforts, some searches still fail to deliver the results users expect.

The AI era

As often happens, big improvements require different approaches. After years of struggling with “embeddings” – lists of numbers that neural networks produce as translations of texts and images – 2020 brought a breakthrough: OpenAI’s CLIP model. With this model, texts and images share the same embedding space, meaning that the text “dog” and a photo of a dog would share nearly the same sequence of numbers – the embedding – that represents them in that space. Thus, this embedding represents the concept of “dog.”

This opened the door to new and exciting possibilities. 

As an example, by adding a decoder that can convert an embedding back to text, you can input an image and automatically get a title for it.

Besides, with the ability to turn text embeddings into visual representations, you can now build a system that generates images from text descriptions – that’s exactly how new AI image generators work, like the one on Wepik. But let’s not forget: the very first application we were interested in was the search engine, where you convert text into an embedding and retrieve the images whose embeddings are closest to it.
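That retrieval idea can be sketched in a few lines. The tiny hand-made vectors below stand in for real CLIP embeddings, which have hundreds of dimensions:

```python
import math

# Embedding-based retrieval sketch: rank images by cosine similarity
# between the query embedding and each image embedding.
# Vectors and names are illustrative stand-ins for CLIP outputs.

image_embeddings = {
    "dog_photo":  [0.9, 0.1, 0.0],
    "car_photo":  [0.1, 0.9, 0.1],
    "tree_photo": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_by_embedding(query_embedding, top_k=2):
    ranked = sorted(image_embeddings,
                    key=lambda img: cosine(query_embedding, image_embeddings[img]),
                    reverse=True)
    return ranked[:top_k]

# Pretend this vector is the CLIP embedding of the text "dog":
print(search_by_embedding([1.0, 0.0, 0.0]))  # dog_photo ranks first
```

Note there is no word matching anywhere: synonyms, other languages, and even typos can land near the right images as long as the encoder maps them close in the space.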

AI-based search engine

My first job when I joined Freepik was just that – to explore and improve CLIP to replace our existing search engine. To set the scene: just as users in Asian countries expect to find Asian people in pictures without explicitly mentioning it, Freepik users have some implicit preferences when they search for content. As CLIP had been trained with texts and images extracted from the internet – unfiltered, so to say – we needed to fine-tune it precisely to answer our users’ needs.

Our first task was to create a metric, the SSET – Self Search Error by Text – that measures success in search engine processes. It’s a window into how effectively users can find what they’re looking for, while helping us compare different search engines’ performance. It measures how close an image is to being the first result when searching for it using its own title. We verified that a lower SSET correlated with higher quality in the search results.
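The exact SSET formula isn’t spelled out here, but the self-search idea can be sketched as follows, assuming we simply average the rank at which each image lands when queried by its own title:

```python
# Sketch of a self-search error metric in the spirit of SSET:
# search each image by its own title and average its (zero-based) rank.
# The aggregation used here is an assumption, not Freepik's exact formula.

def self_search_error(images, search_fn):
    """images: list of (image_id, title); search_fn(title) -> ranked ids."""
    total = 0
    for image_id, title in images:
        results = search_fn(title)
        rank = results.index(image_id) if image_id in results else len(results)
        total += rank
    return total / len(images)

images = [("a", "red car"), ("b", "black dog")]

def toy_search(title):
    # A mock engine that always returns this fixed ordering.
    return ["a", "b"]

print(self_search_error(images, toy_search))  # (0 + 1) / 2 = 0.5
```

A perfect engine scores 0: every image is the first result of its own title. Higher values mean images are buried deeper in the results.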

The new metric was used to evaluate the standard CLIP, and we found some weaknesses: the model was pretty good in English, adequate in Spanish and Portuguese, but unusable in languages like Japanese or Korean. Complex searches weren’t a problem, but simple ones seemed to stump it. It even favored results that included the search words written inside the images, which could be solved thanks to further fine-tuning with our data.

Leveraging our data

The training began with CLIP, and later on we switched to the fabulous OpenCLIP models. We fine-tuned these models with the texts our users had searched for when an image was downloaded, which increased performance across all languages in use. That is, the words associated with a successful download were the best choice to train the model.

Our next step was to fine-tune the system using the images and their titles. This improved results in English and hinted at even better results in other languages.

That was when we ran our first live test, using the brand-new search engine to serve up to 5% of Freepik’s traffic. Although we had made some progress, it was clear that our search engine still needed a little more fine-tuning for users typing short queries. It wasn’t all bad news, as we realized that longer queries brought up excellent results!

The quality of the results was increased by adding more signals to the search: the user’s country, the time of the year, the global quality of the images, etc. Every time we got a model that we felt excited about, we A/B tested it with real traffic. A few months and around 100 training runs later, we got to the point where the new search engine exceeded the performance of the previous one, with a sole exception: searches that consist of a single word.

Exploring the benefits of multilanguage searches

In a late adjustment, we added the OpenCLIP model furnished with XLM-Roberta to our arsenal. XLM-Roberta is a model pre-trained on many languages, and it made our first layer of fine-tuning redundant. Little did we know that by training with image titles, we had mostly been teaching OpenCLIP foreign languages, and not so much how to improve the search itself.

The ability to solve searches in dozens of languages was the most significant advantage of this new model, meaning that Freepik had just opened the gateway to worldwide success.

Enhancing user experience with localized results

People around the world perform similar searches, but the results they expect vary greatly depending on their location. We noticed this and decided to do something about it: adding the user’s country as a localization signal to all searches, which proved extremely relevant in searches like “Independence Day”, “woman”, “food”, or even “flag”. Now everyone, everywhere, had results tailored to their location right up front.

These are two examples of results for “Independence Day” from the US and India

And it did improve the searches! But we soon noticed that quality didn’t improve as much for countries where we didn’t have enough data. We decided to split continents into subcontinents following the UN sub-regional division and incorporate this as a signal. The results improved even further.

There are certain searches that clearly require diverse outcomes for different countries, like “map” or “flag”. We found that these are the ones that improved the most, way above the average improvement.
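One simple way to picture such a localization signal is as an additive boost on top of the text similarity. The weights and popularity data below are entirely made up for illustration; the real system combines signals differently:

```python
# Illustrative re-scoring with a location signal: the same text similarity
# produces different scores in different countries. All data is invented.

COUNTRY_POPULARITY = {
    ("img_us_flag", "US"): 0.9,
    ("img_in_flag", "IN"): 0.9,
    ("img_us_flag", "IN"): 0.1,
    ("img_in_flag", "US"): 0.1,
}

def localized_score(base_similarity, image_id, country, weight=0.3):
    """Blend the embedding similarity with per-country download popularity."""
    popularity = COUNTRY_POPULARITY.get((image_id, country), 0.0)
    return base_similarity + weight * popularity

# Same similarity for "flag", different countries -> different ordering:
print(localized_score(0.8, "img_us_flag", "US"))  # boosted for US users
print(localized_score(0.8, "img_us_flag", "IN"))  # barely boosted for Indian users
```

The same scheme extends naturally to subcontinents: when a country has too little data, you fall back to the popularity of its UN sub-region.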

Lost searches – finding opportunities to rescue them

The previous search system relied solely on the words used in the titles and tags assigned to each image. More specifically, each image was represented by a bag of words, and the task of the search consisted of determining which images matched all the words of the search. But languages are very rich, full of synonyms and metaphors, so when it comes to words, the possibilities are endless. As we know, the same item can be described in a variety of ways.

It was not unusual to come across search expressions that were perfectly valid but composed of a combination of words that didn’t match any existing resource in our catalog, even when we had images relevant to that search. In such cases, the search returned 0 results. Similar situations arose when users made a typo while writing their search query.

Around 5% of searches had this condition – no results shown at all. The new AI search demonstrated the power of the technology when it heroically rescued them! A typo? Don’t worry, the model still understands you. Writing a complex search, including metaphors? Consider it done. Our new system rose to the challenge: with just this one tweak, we saw a total download increase of 1%!

The importance of the relevance

CLIP’s limitation here is simple: its architecture accurately pairs images with their descriptions, but what if you have a jaw-dropping number of photos and they all fit the same caption? Our catalog is bursting at the seams with over 7 million “backgrounds” – how do we know which ones are relevant for the user looking for a “background”? CLIP doesn’t enforce any particular ordering when there are several candidates for a search. However, the order in which the results are shown to the user is vital.

We got down to work. First of all, we needed to know how popular an image was for a certain search. The number of downloads can be used as a simple “proxy” for this popularity parameter. We didn’t go for the bells and whistles right away – instead, we kept it simple. 

Secondly, we needed an evaluation metric, and we chose Spearman’s coefficient. Long story short, it calculates the correlation between two rankings to see whether the search ordering is similar to the ground-truth ordering. If Spearman’s is close to 1, it means that the new search results resemble the optimal ordering, which in our case was based on historical data for the given search. Normalized Discounted Cumulative Gain (nDCG) was also used as an alternative.

In the previous example, the first column shows the optimal ranking and the second column shows the ordering returned by the model. We can compare the positions of each image in each ranking and represent them in a graph, as we have done below. The Spearman coefficient is nothing more than the correlation coefficient that results from that comparison.

The Spearman’s of these two ranks is 0.257. If the model were so good that it generated a ranking equal to the optimal one, the Spearman’s would be 1. The closer the Spearman’s is to 1, the better the match between the optimal position and the position according to the model.
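For rankings without ties, Spearman’s coefficient reduces to a closed formula over the squared rank differences, which fits in a few lines:

```python
# Spearman's rank correlation for two rankings without ties:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
# where d_i is the difference between item i's positions in the two rankings.

def spearman(rank_a, rank_b):
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

optimal = [1, 2, 3, 4, 5]           # ground-truth positions (e.g. by downloads)
print(spearman(optimal, [1, 2, 3, 4, 5]))  # identical rankings -> 1.0
print(spearman(optimal, [5, 4, 3, 2, 1]))  # fully reversed -> -1.0
```

In practice, library implementations (e.g. `scipy.stats.spearmanr`) also handle ties; the toy version above assumes none.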

At this point, we already had two essential ingredients for any machine learning project: an objective signal to refine and optimize (the downloads), and an evaluation metric that would measure the effectiveness with which the model captures it (the Spearman’s). It was time to move on to model training.

Learning the relevance

We modified the loss function of CLIP to enable learning the relevance. The default contrastive loss treats all image-title pairs sharing the same title within a batch as corresponding to different entities (even though they are not) and tries to push them apart.

Let’s see how this behaves when we have several images corresponding to the same search within a batch. 

Five searches ended in a download, hurray! The matrix on the right represents the relationships between searches and images that CLIP would consider in its loss. A value of 0 states that the image is not related to the text, whereas the value 1 means it is. Picture it as a game of pairing cards, with one column being the texts and the other representing images, and every time you can pair two of them, you have a match!

A brief look reveals that the signal is misleading: the matrix states that each car is related to “red car” only once. Contradictory information, as all of them are “red cars”! Modifying the matrix to represent the proportion in which each image belongs to “red car” solves this, and the downloads can be used to compute that proportion.

One car has been downloaded twice as many times as the others, and so it now has twice the weight for the search “red car”. By modifying the CLIP matrix in this way, the model started to learn the relevance. Another important factor was to craft the batches with care, given that the model could only learn the relevance if several images of the same search concur in a batch. To put it simply, the model can only put the pieces of the puzzle together if it has several search-related images to work with.
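The construction of that relevance-weighted target matrix can be sketched directly. The searches and download counts below are invented; in CLIP’s default loss, each row would instead be one-hot:

```python
# Build an n x n target matrix for a batch: row i distributes its weight over
# every item sharing search i, proportionally to downloads, instead of the
# one-hot targets of standard contrastive loss. Data is illustrative.

def relevance_targets(searches, downloads):
    n = len(searches)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        group = [j for j in range(n) if searches[j] == searches[i]]
        total = sum(downloads[j] for j in group)
        for j in group:
            matrix[i][j] = downloads[j] / total
    return matrix

searches  = ["red car", "red car", "red car", "dog"]
downloads = [2, 1, 1, 5]
for row in relevance_targets(searches, downloads):
    print(row)
# Each "red car" row becomes [0.5, 0.25, 0.25, 0.0]; the "dog" row stays one-hot.
```

Fed as soft targets to the contrastive loss, these rows stop pushing apart images that genuinely answer the same search, while still ranking the most-downloaded one highest.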

But this event is not so likely to happen when you have millions of distinct searches. We modified the shuffling function to ensure the desired number of images belonging to the same search were put together within a batch. Witness firsthand how this change improved the relevance for the search “cheese”.

Left panel: before, right panel: after.
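A batch sampler of this kind can be sketched as follows. The chunking strategy (fixed-size runs per search, shuffled together) is an illustrative assumption, not the exact production shuffler:

```python
import random

# Toy batch sampler: keep up to `group_size` images of the same search
# together so the relevance signal can appear within a batch.

def grouped_batches(items, batch_size, group_size, seed=0):
    """items: list of (image_id, search). Returns a list of batches."""
    rng = random.Random(seed)
    by_search = {}
    for item in items:
        by_search.setdefault(item[1], []).append(item)
    chunks = []
    for group in by_search.values():
        rng.shuffle(group)
        for i in range(0, len(group), group_size):
            chunks.append(group[i:i + group_size])  # run of same-search items
    rng.shuffle(chunks)  # mix searches, but keep each run contiguous
    flat = [item for chunk in chunks for item in chunk]
    return [flat[i:i + batch_size] for i in range(0, len(flat), batch_size)]

items = [("a", "cheese"), ("b", "cheese"), ("c", "cheese"),
         ("d", "dog"), ("e", "dog"), ("f", "cat")]
for batch in grouped_batches(items, batch_size=3, group_size=2):
    print(batch)
```

Tuning `group_size` is exactly the relevance-vs-accuracy tradeoff described next: bigger runs mean more relevance signal but fewer distinct concepts per batch.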

A tradeoff was found between relevance and accuracy. The more images belonging to the same search you introduce in a batch, the fewer concepts you are contrasting. Thus, the model becomes very good at differentiating between “red cars”, but also less discriminative between different concepts. There is a simple solution: increase the batch size and train longer. To overcome the challenge, we scaled up our computing power to epic proportions!

A member of the Search team discovered a surprising way to improve the relevance – scaling the image embeddings to be proportional to their global relevance. But this alone is worth another blog post!
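While the details are left for that future post, the intuition fits in a few lines: since retrieval scores are dot products, stretching an image’s embedding by a global relevance factor boosts it for every query. The numbers below are invented:

```python
# Intuition sketch: scaling an image embedding by a global relevance factor
# raises its dot-product score for all queries, acting as a tie-breaker.
# Vectors and factors are illustrative.

def scale_embedding(embedding, relevance):
    return [relevance * x for x in embedding]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
popular = scale_embedding([0.8, 0.6], 1.5)   # globally popular image
obscure = scale_embedding([0.9, 0.4], 1.0)   # slightly closer match, less popular
print(dot(query, popular) > dot(query, obscure))  # True: popularity breaks the tie
```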

Assessing the impact on the service 

Although there’s still plenty of room for improvement and issues to fix, we’re already seeing the rewards of our hard work: the AI search system’s capabilities have taken off, and there are still exciting possibilities waiting to be explored. The probability of success when searching is now approximately 5% higher than before for our registered users. We’re witnessing a significant 84% rise in the variety of daily searches that end in a download. In addition, there was a 43% rise in the number of distinct assets downloaded each day, which made the existing library more useful. This opened a new door for users: identifying relevant content that was impossible to find before is now a reality.

Conclusions & future work

To wrap up: Freepik’s mission was to make finding visual resources a breeze, so after 10 years of an ever-evolving, high-quality search engine, we upped our search game, introducing smarter signals like “next searches”, user location and time of year. Still, a disruptive change was needed, so we turned to artificial intelligence. The team developed the SSET metric, adopted XLM-Roberta models for worldwide use, and trained dozens of models using internal data. And talking about improvements: after modifying the CLIP loss, the batch composition and the scaling of embeddings… boom! A 5% increase in the probability of a download per search? Now that’s what we call taking visual search up a notch, or rather, five! Artificial intelligence has helped us develop some ground-breaking technology.

But this is only the beginning – More challenges are in store for us, and we are determined to take them head-on. Things that will be improved and further researched lie ahead, such as:

  • The system is still not good at discriminating between cities that are visually very similar – think of images of Granada and Málaga, and how tough it can be to tell them apart. We need to incorporate the information in the tags to answer these queries correctly.
  • Add search by image. Users will provide an image, and in return they’ll receive all similar pictures in our collection.
  • The history of the user is not yet taken into account, but the more we understand you, the better we can assist you. We want to do it so that users can opt out and have a neutral search experience.
  • Re-ranking: today, we return the visuals closest to a text, but people want diversity in their results. Avoiding results too similar to those already shown would be an improvement, and a re-ranking model can help us with this task.

The artificial intelligence improvements and challenges have just begun, and we are delighted to have you along for the ride. Stay tuned!