RESEARCH
The CHAMPAGNE model learns real-world conversations from large-scale internet videos
Researchers with Seoul National University, the Allen Institute for Artificial Intelligence, the University of Washington, and Yonsei University have built 'CHAMPAGNE', a multimodal dialog model. "CHAMPAGNE takes in video frames, a video title, and a dialogue context as input and returns a dialogue response as output."
The idea is that by giving the model access to the visual as well as the verbal context of a scene, it'll be better able to generate dialogue that feels intuitive. In evaluations, this seems to work quite well, with CHAMPAGNE models performing better on a range of open-domain text-conversation benchmarks and on benchmarks involving understanding social interactions.
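To make the input/output contract concrete, here is a minimal sketch of what a CHAMPAGNE-style example might look like. The class and function names, field names, and prompt layout are illustrative assumptions, not the paper's actual schema; a real system would encode frames with a vision encoder rather than inlining placeholder tokens.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DialogueExample:
    """One CHAMPAGNE-style input: video frames + video title + dialogue context.
    Field names are hypothetical; the paper's actual schema may differ."""
    video_frames: List[str]      # e.g. paths to sampled frames
    video_title: str
    dialogue_context: List[str]  # prior turns, oldest first

def build_prompt(example: DialogueExample) -> str:
    """Flatten the multimodal context into a single model-input string.
    A real model would embed the frames; here we reference them by
    placeholder tokens to show the structure of the input."""
    frame_tokens = " ".join(f"<frame:{f}>" for f in example.video_frames)
    turns = "\n".join(f"Speaker: {t}" for t in example.dialogue_context)
    return f"Title: {example.video_title}\n{frame_tokens}\n{turns}\nResponse:"
```

The model's job is then to fill in the text after `Response:`, conditioned on everything above it.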
- Built using the large-scale YTD-18M dataset
- YTD-18M contains 18 million video-based dialogues derived from 20 million YouTube videos
- Uses a language model to convert noisy YouTube transcripts into formatted dialogues
- Associates dialogues directly with specific video frames
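The construction steps above can be sketched in miniature. This is a hypothetical stand-in, not the paper's pipeline: the paper uses a language model to reformat noisy transcripts into dialogues, whereas here a simple silence-gap heuristic merges caption fragments into turns, and frames are matched to turns by nearest timestamp.

```python
from typing import List, Tuple

# A transcript fragment is (start_time_seconds, text).
Fragment = Tuple[float, str]

def segment_transcript(fragments: List[Fragment],
                       gap_threshold: float = 2.0) -> List[Fragment]:
    """Merge caption fragments into dialogue turns: start a new turn
    whenever the gap since the previous fragment exceeds gap_threshold.
    (A toy stand-in for the paper's LM-based reformatting step.)"""
    turns: List[Fragment] = []
    last_time = None
    for start, text in fragments:
        if turns and last_time is not None and start - last_time <= gap_threshold:
            turn_start, turn_text = turns[-1]
            turns[-1] = (turn_start, turn_text + " " + text)
        else:
            turns.append((start, text))
        last_time = start
    return turns

def align_frames(turns: List[Fragment],
                 frame_times: List[float]) -> List[Fragment]:
    """Associate each turn with the sampled frame closest to its start time,
    returning (frame_time, turn_text) pairs."""
    return [(min(frame_times, key=lambda t: abs(t - start)), text)
            for start, text in turns]
```

Even this crude version shows why the alignment matters: each turn ends up paired with the visual moment it was spoken in, which is the signal the model trains on.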
Why this matters:
Models like CHAMPAGNE show that the silent social cues in conversation are, much like every other fuzzy pattern, something that you can teach a machine to understand given a large enough dataset. It also suggests some of the more tantalizing and weird things we can look forward to in the future - AI models that observe you, trying to predict what will satisfy you not only by modeling you as an emitter-of-text, but as an organic form.