Facebook's self-supervised computer vision AI model learns from a billion images

Facebook today announced an AI model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks. Unlike most computer vision models, which learn from labeled datasets, Facebook’s model generates labels from data by exposing the relationships between the data’s parts, a step believed to be critical to someday achieving human-level intelligence.

The future of AI lies in crafting systems that can make inferences from whatever information they’re given without relying on annotated datasets. Provided text, images, or another type of data, an AI system would ideally be able to recognize objects in a photo, interpret text, or perform any of the countless other tasks asked of it.

Facebook claims to have taken a step toward this with a computer vision model called SEER, which stands for SElf-supERvised. SEER contains a billion parameters and can learn from any random group of images on the internet without the need for curation or annotation. Parameters, a fundamental part of machine learning systems, are the part of the model derived from historical training data.
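To give a rough sense of what a parameter count measures, the sketch below tallies the learnable weights and biases in a tiny, made-up stack of layers. The layer shapes and function names are illustrative assumptions and have nothing to do with SEER's actual architecture, which the article does not describe at this level.

```python
def conv_param_count(in_channels, out_channels, kernel_size):
    """Learnable values in a square 2D convolution layer with a bias term."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    biases = out_channels
    return weights + biases


def linear_param_count(in_features, out_features):
    """Learnable values in a fully connected layer with a bias term."""
    return in_features * out_features + out_features


# Hypothetical toy network: two convolutions followed by a classifier layer.
toy_layers = [
    conv_param_count(3, 64, 3),     # RGB input -> 64 feature maps
    conv_param_count(64, 128, 3),   # 64 -> 128 feature maps
    linear_param_count(128, 1000),  # 128 features -> 1000 outputs
]
total_params = sum(toy_layers)
print(f"Toy network parameters: {total_params:,}")
```

Even this toy stack has a few hundred thousand parameters; a model with a billion of them is several thousand times larger.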


Self-supervision for vision is a challenging task. With text, semantic concepts can be broken up into discrete words, but with images, a model must decide for itself which pixel belongs to which concept. Making matters more challenging, the same concept will often vary between images. Grasping the variation around a single concept, then, requires looking at a lot of different images.

Facebook researchers found that scaling AI systems to work with complex image data required at least two core components. The first was an algorithm that could learn from a vast number of random images without any metadata or annotations, while the second was a convolutional network — ConvNet — large enough to capture and learn every visual concept from this data. Convolutional networks, which were first proposed in the 1980s, are inspired by biological processes, in that the connectivity pattern between components of the model resembles the visual cortex.
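The basic operation inside a convolutional network can be sketched in a few lines: a small grid of weights (a kernel) slides over the image, and each output value is a weighted sum of the pixels under it. The example below is a minimal, unoptimized illustration in plain Python, with a hand-picked kernel standing in for weights that a real ConvNet would learn; the function name and the toy image are assumptions made for this sketch.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Weighted sum of the image patch currently under the kernel.
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is nonzero only in the column where the image changes from dark to bright, which is the sense in which a kernel "detects" a local visual pattern.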

In developing SEER, Facebook took advantage of an algorithm called SwAV, which was born out of the company’s investigations into self-supervised learning. SwAV uses a technique called clustering to rapidly group images with similar visual concepts and leverage their similarities, improving over the previous state-of-the-art in self-supervised learning while requiring up to 6 times less training time.
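The clustering idea can be illustrated with a heavily simplified sketch. Here, each image embedding is softly assigned to a set of cluster centers ("prototypes"), and a loss asks two augmented views of the same image to agree on their cluster assignment. This is not Facebook's implementation: the published SwAV method also trains the network and prototypes jointly and balances the assignments across clusters, both of which are omitted here, and every name, embedding, and value below is a hypothetical chosen for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores, temperature=0.1):
    """Turn similarity scores into probabilities; low temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_probs(embedding, prototypes):
    """Soft assignment of one embedding to each prototype (cluster center)."""
    z = normalize(embedding)
    scores = [sum(a * b for a, b in zip(z, normalize(p))) for p in prototypes]
    return softmax(scores)


def swapped_loss(view_a, view_b, prototypes):
    """Cross-entropy asking each view to predict the other view's assignment.

    Simplification: the other view's soft assignment is used directly as the target.
    """
    p_a = cluster_probs(view_a, prototypes)
    p_b = cluster_probs(view_b, prototypes)
    a_predicts_b = -sum(q * math.log(p) for q, p in zip(p_b, p_a))
    b_predicts_a = -sum(q * math.log(p) for q, p in zip(p_a, p_b))
    return 0.5 * (a_predicts_b + b_predicts_a)


# Two hypothetical cluster centers in a 2-D embedding space.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
view_a = [0.9, 0.1]       # one augmentation of an image
view_b = [0.8, 0.2]       # another augmentation of the same image
other_image = [0.1, 0.9]  # a visually different image

loss_same = swapped_loss(view_a, view_b, prototypes)
loss_different = swapped_loss(view_a, other_image, prototypes)
print(f"same image: {loss_same:.4f}, different images: {loss_different:.4f}")
```

The loss is small when both views fall into the same cluster and large when they do not, so minimizing it pushes views of the same image toward a shared cluster without any human-provided label.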

Training models at SEER’s size also required an architecture that was efficient in terms of runtime and memory without compromising on accuracy, according to Facebook. The researchers behind SEER opted to use RegNets, a type of ConvNet model capable of scaling to billions or potentially trillions of parameters while fitting within runtime and memory constraints.

Facebook software engineer Priya Goyal said SEER was trained on 512 NVIDIA V100 GPUs with 32GB of RAM for 30 days.
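For a sense of scale, those figures can be turned into a back-of-the-envelope compute budget. The calculation below assumes all 512 GPUs ran continuously for the full 30 days, which the article does not state, so the result is best read as an upper bound.

```python
NUM_GPUS = 512          # from the article
MEMORY_PER_GPU_GB = 32  # from the article
TRAINING_DAYS = 30      # from the article

# Assumption: every GPU was busy for the entire training period.
gpu_days = NUM_GPUS * TRAINING_DAYS
gpu_hours = gpu_days * 24
total_gpu_memory_gb = NUM_GPUS * MEMORY_PER_GPU_GB

print(f"{gpu_days:,} GPU-days ({gpu_hours:,} GPU-hours)")
print(f"{total_gpu_memory_gb:,} GB of GPU memory across the cluster")
```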

The last piece that made SEER possible was a general-purpose library called VISSL, short for VIsion library for state-of-the-art Self Supervised Learning. VISSL, which Facebook is open-sourcing today, allows for self-supervised training with a variety of modern machine learning methods. The library facilitates self-supervised learning at scale by integrating algorithms that reduce the per-GPU memory requirement and increase the training speed of any given model.
