Visual Search has four primary stages.
For a practical demonstration, check out this YouTube video.
1. Understanding images
This is the most crucial stage: it requires an automated process that takes an image
and generates a textual description of what that image is about.
To do so, we leverage a state-of-the-art deep learning architecture trained
on 20M publicly available image-text pairs. We then build a bespoke solution for
each company by efficiently applying this model, in batches, to their dataset of photos.
At the end of this stage, a company has their images and, for each image, text describing its contents.
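To make this concrete, here is a minimal sketch of what stage 1 could look like in code. The post doesn't name the architecture, so the BLIP captioning checkpoint and file names below are purely illustrative:

```python
# Stage 1 sketch: image -> caption. The BLIP checkpoint here is an
# assumption for illustration; the post does not name the architecture.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Return a short textual description of the image at `path`."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Applied in batches across a company's photo dataset (placeholder paths).
captions = {path: caption_image(path) for path in ["sofa.jpg", "beach.jpg"]}
```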
2. Storing understanding
The generations from stage 1 must then be stored efficiently without degrading semantic richness.
To do so, we leverage a second state-of-the-art model, trained on 1B text pairs with a
self-supervised objective that minimises a contrastive loss. That sentence is quite technical, but it
can be read casually as: the model has been trained to place texts with similar meanings close
together and texts with dissimilar meanings far apart. This is exactly what we want, namely a way to
represent each image such that if a user searches for something similar we can accurately identify and return it.
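As an illustration, here is how the stage-1 captions could be embedded with an off-the-shelf contrastively trained text encoder. The specific model name is an assumption, not necessarily the one we use:

```python
# Stage 2 sketch (embedding): map each caption to a vector. The model
# name is an assumption; any contrastively trained text encoder works.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

captions = [
    "a red sofa in a bright living room",   # stage-1 output, one per image
    "a dog running along a sandy beach",
]
embeddings = encoder.encode(captions, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per image
```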
We then store the numerical representations of the images in a data structure that
facilitates fast search and retrieval, even on machines with limited RAM and compute.
This is what enables real-time search. Building such a structure is a challenging
optimisation problem in its own right, and we leverage current state-of-the-art solutions to solve it.
At the end of this stage, a company has their images plus a numerical representation capturing
what is in each image, all contained within an efficient and lightweight data structure.
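One way to realise such a structure is an approximate nearest-neighbour index. The sketch below uses hnswlib and the `embeddings` from the previous sketch; the actual library and parameters we use may differ:

```python
# Stage 2 sketch (storage): put the vectors in an approximate
# nearest-neighbour index. hnswlib is one lightweight option; the post
# does not say which structure is actually used.
import hnswlib
import numpy as np

vectors = np.asarray(embeddings, dtype=np.float32)  # from the previous sketch
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, ids=np.arange(len(vectors)))
index.set_ef(50)  # query-time accuracy/speed trade-off

index.save_index("visual_search.bin")  # small on-disk artefact, cheap to load
```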
3. Facilitating search
This stage involves transforming a user's search query into a meaningful numeric representation
that we can then use to find the most relevant results. This conversion removes any
need for a company to manually categorise their images or maintain keywords and filters.
This task is very similar to what was required in stage 2. In fact, we use the same model to generate
a meaningful representation of a user's search query.
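Continuing the sketch, encoding a query is a one-liner with the same (assumed) encoder from stage 2:

```python
# Stage 3 sketch: the same encoder from stage 2 turns free text into a
# vector in the same space as the image representations.
query_vector = encoder.encode(
    ["cosy armchair by a window"], normalize_embeddings=True
)
```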
At the end of this stage, a company has their images, the data structure storing the understanding,
and a numeric representation of the user's search query.
4. Generating results
The numeric representation of the search query is now compared with the numeric representations of our
understanding of the images. We define a relevance metric that scores how similar the search query
is to each image.
There are several possible choices. The metric we employ for Visual Search ranges from 0.00, meaning
the image is not at all relevant to the query, up to a highest possible relevance of 1.00.
Because of the choices we made in the earlier stages, we can calculate this score in real time across
all of a company's images and return the most relevant ones.
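Putting it together, the sketch below queries the index built in stage 2 with the vector from stage 3. Cosine similarity, clamped to zero, is one metric with the 0.00-to-1.00 property described above; we make no claim that it is the exact metric used in production:

```python
# Stage 4 sketch: retrieve the most relevant images for the query.
# hnswlib's cosine space reports distance = 1 - cosine similarity, so
# relevance = 1 - distance; we clamp at zero to match the 0.00 floor.
labels, distances = index.knn_query(query_vector, k=2)  # k <= indexed items
for image_id, dist in zip(labels[0], distances[0]):
    relevance = max(0.0, 1.0 - float(dist))  # 1.00 = perfect, 0.00 = unrelated
    print(f"image {image_id}: relevance {relevance:.2f}")
```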
At the end of this stage, Visual Search is complete and we have the most relevant images for a user's
query.