Breaking the data bottleneck: Salesforce’s ProVision speeds multimodal AI training




As enterprises around the world double down on their AI projects, the availability of high-quality training data has become a major bottleneck. While the public web has largely been exhausted as a data source, major players like OpenAI and Google are securing exclusive partnerships to expand their proprietary datasets, further limiting access for others.

To address this growing concern, Salesforce has taken a major step in the arena of visual training data. The company has just introduced ProVision, a novel framework that programmatically generates visual instruction data. These datasets are systematically synthesized to enable the training of high-performance multimodal language models (MLMs) that can answer questions about images.

The company has already released the ProVision-10M dataset with this approach and is employing it to boost the performance and accuracy of various multimodal AI models.

For data professionals, this framework represents a significant advancement. By programmatically generating high-quality visual instruction data, ProVision alleviates the dependency on limited or inconsistently labeled datasets, a common challenge in training multimodal systems.

Moreover, the ability to systematically synthesize datasets ensures better control, scalability and consistency, enabling faster iteration cycles and reducing the cost of acquiring domain-specific data. This work complements ongoing research in the synthetic data generation domain and comes just a day after Nvidia’s launch of Cosmos, a suite of world foundation models purpose-built for generating physics-based videos from a combination of inputs, like text, image and video, for physical AI training.

Visual instruction data: a key ingredient for multimodal AI

Today, instruction datasets are the core of AI pre-training or fine-tuning. These specialized datasets help models follow and effectively respond to specific instructions or queries. In the case of multimodal AI, the models get the ability to analyze content such as images after learning from a swathe of different data points, accompanied by question-answer pairs — or visual instruction data — describing them.

Now, here’s the thing: Producing these visual instruction datasets is quite a hassle. If an enterprise creates the data manually for each training image, it ends up wasting a lot of time and human resources to complete the project. On the other hand, if it chooses to use proprietary language models for the task, it has to deal with high computational costs and the risk of hallucinations, where the quality and accuracy of the question-answer pairs may not be good enough.

Further, proprietary models operate as a black box, which makes it difficult to interpret the data generation process and to control or customize its outputs precisely.

Enter Salesforce ProVision

To address these gaps, the AI research team at Salesforce has come up with ProVision, a framework that employs scene graphs in conjunction with human-written programs to systematically synthesize vision-centric instruction data.

At the core, a scene graph can be described as a structured representation of image semantics, where the objects in the content are represented as nodes. The attributes of each object — like color or size — are directly assigned to their respective nodes, while the relationships between these objects are depicted as directed edges connecting the corresponding nodes. These representations can be sourced from manually annotated datasets such as Visual Genome, or they can be generated with the help of a scene graph generation pipeline that combines various state-of-the-art vision models covering various aspects of image semantics, from object and attribute detection to depth estimation.
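To make the structure concrete, here is a minimal sketch of a scene graph in Python. The class and field names (`SceneObject`, `SceneGraph`, `add_object`, `add_relation`) are assumptions made for illustration and are not taken from the ProVision codebase; the example objects echo the busy-street scenario the researchers describe.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """A node: one object in the image plus its attributes (hypothetical schema)."""
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class SceneGraph:
    """Objects keyed by id, plus directed (subject, relation, object) edges."""
    objects: dict[str, SceneObject] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, obj_id: str, name: str, **attributes: str) -> None:
        # Attributes such as color or size live directly on the node.
        self.objects[obj_id] = SceneObject(name, dict(attributes))

    def add_relation(self, subject_id: str, relation: str, object_id: str) -> None:
        # Edges are directed, so each relation is stored once, subject first.
        if subject_id not in self.objects or object_id not in self.objects:
            raise KeyError("both endpoints must be existing objects")
        self.relations.append((subject_id, relation, object_id))


graph = SceneGraph()
graph.add_object("o1", "pedestrian")
graph.add_object("o2", "car", color="blue")
graph.add_object("o3", "building", color="red")
graph.add_relation("o1", "next to", "o2")
graph.add_relation("o2", "in front of", "o3")
```

A real pipeline would also carry bounding boxes, depth and segmentation masks on each node; those are left out here to keep the sketch short.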

Once the scene graphs are ready, they power programs written using Python and textual templates that serve as full-fledged data generators capable of creating question-and-answer pairs for AI training pipelines.

“Each [data] generator utilizes hundreds of pre-defined templates, which systematically integrate these annotations to produce diverse instruction data. These generators are crafted to…compare, retrieve, and reason about basic visual concepts of objects, attributes, and relations based on the detailed information encoded in each scene graph,” the researchers behind the framework wrote in a paper.
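The idea of a template-driven generator can be sketched in a few lines of Python. This is an illustrative toy, not ProVision's actual code: the dictionary layout, the template strings and the function names `attribute_qa` and `relation_qa` are all assumptions, and real generators use hundreds of templates rather than two.

```python
import random

# Hypothetical scene graph as plain dicts; the field names are assumptions.
scene_graph = {
    "objects": {
        "o1": {"name": "pedestrian", "attributes": {}},
        "o2": {"name": "car", "attributes": {"color": "blue"}},
        "o3": {"name": "building", "attributes": {"color": "red"}},
    },
    "relations": [("o1", "next to", "o2"), ("o2", "in front of", "o3")],
}

# Illustrative templates only.
ATTRIBUTE_TEMPLATES = [
    "What is the {attr} of the {name}?",
    "What {attr} is the {name}?",
]
RELATION_TEMPLATES = [
    "What is the relationship between the {subj} and the {obj}?",
    "How is the {subj} related to the {obj}?",
]


def attribute_qa(graph, rng):
    """Yield one question-answer pair per attribute stored on a node."""
    for obj in graph["objects"].values():
        for attr, value in obj["attributes"].items():
            template = rng.choice(ATTRIBUTE_TEMPLATES)
            question = template.format(attr=attr, name=obj["name"])
            yield {"question": question, "answer": value}


def relation_qa(graph, rng):
    """Yield one question-answer pair per directed edge in the graph."""
    for subj_id, relation, obj_id in graph["relations"]:
        subj = graph["objects"][subj_id]["name"]
        obj = graph["objects"][obj_id]["name"]
        template = rng.choice(RELATION_TEMPLATES)
        question = template.format(subj=subj, obj=obj)
        yield {"question": question, "answer": f"The {subj} is {relation} the {obj}."}


rng = random.Random(0)  # fixed seed so the template choice is reproducible
pairs = list(attribute_qa(scene_graph, rng)) + list(relation_qa(scene_graph, rng))
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

Because every answer is read straight from the graph rather than produced by a language model, the output is only as accurate as the scene graph itself, which is what makes the process interpretable and controllable.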

Instruction data generation with Salesforce ProVision

ProVision-10M dataset for AI training

In its work, Salesforce used both approaches — augmentation of manually annotated scene graphs and generation from scratch — to set up scene graphs powering 24 single-image data generators and 14 multi-image generators. 

“With these data generators, we can automatically synthesize questions and answers given an image’s scene graph. For example, given an image of a busy street, ProVision can generate questions such as, ‘What is the relationship between the pedestrian and the car?’ or ‘Which object is closer to the red building, [the] car or pedestrian?’” lead researchers Jieyu Zhang and Le Xue noted in a blog post.

With the first approach, which augments Visual Genome’s scene graphs with depth and segmentation annotations from Depth Anything V2 and SAM-2, the data generators produced 1.5 million single-image instruction data points and 4.2 million multi-image instruction data points. The second approach, which builds scene graphs from scratch using 120,000 high-res images from the DataComp dataset and models such as YOLO-World, CoCa, LLaVA-1.5 and Osprey, yielded 2.3 million single-image instruction data points and 4.2 million multi-image instruction data points.
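As a quick check on the figures above, the four splits can be tallied in a few lines of Python. The counts are the rounded numbers quoted in this article, so the total is approximate; the dictionary keys are labels chosen here for illustration.

```python
# Rounded split sizes as quoted in the article (approximate).
splits = {
    ("Visual Genome", "single-image"): 1_500_000,
    ("Visual Genome", "multi-image"): 4_200_000,
    ("DataComp", "single-image"): 2_300_000,
    ("DataComp", "multi-image"): 4_200_000,
}

total = sum(splits.values())
single_image = sum(n for (_, kind), n in splits.items() if kind == "single-image")
multi_image = sum(n for (_, kind), n in splits.items() if kind == "multi-image")

print(f"single-image: {single_image:,}")  # single-image: 3,800,000
print(f"multi-image:  {multi_image:,}")   # multi-image:  8,400,000
print(f"total:        {total:,}")         # total:        12,200,000
```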

In all, the four splits combined make up ProVision-10M, a dataset with more than 10 million unique instruction data points. It is now available on Hugging Face and already proving to be very effective in AI training pipelines.

Specifically, when the company incorporated ProVision-10M into multimodal AI fine-tuning recipes (LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data), it saw notable improvements: the models’ average performance was higher than when they were fine-tuned without ProVision data.

“When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval,” the researchers noted in the paper.

Fine-tuning with ProVision dataset

Synthetic data is here to stay

While there are several tools and platforms, including the new Cosmos world foundation models from Nvidia, for generating different modalities of data (from images to videos) that can be used for multimodal AI training, only a handful have looked at the problem of creating the instruction datasets that pair with that data.

Salesforce is addressing that bottleneck with ProVision, giving enterprises a way to go beyond manual labeling or black-boxed language models. The approach of generating instruction data programmatically ensures interpretability and controllability of the generation process and scales efficiently while maintaining factual accuracy. 

In the long run, the company hopes researchers can build on this work to enhance the scene graph generation pipelines and create more data generators covering new types of instruction data, such as those for videos.


