The CHAMPAGNE model learns real-world conversation from large-scale internet videos
Researchers with Seoul National University, the Allen Institute for Artificial Intelligence, the University of Washington, and Yonsei University have built 'CHAMPAGNE', a multimodal dialogue model. "CHAMPAGNE takes in video frames, a video title, and a dialogue context as input and returns a dialogue response as output."
The idea is that by giving the model access to the visual as well as the verbal context of a scene, it'll be better able to generate dialogue that feels intuitive. In evaluations, this seems to work quite well, with CHAMPAGNE models doing better on a range of open-domain text conversation tasks and on benchmarks that involve understanding social interactions.
- Built using the large-scale YTD-18M dataset
- YTD-18M contains data from 20 million YouTube videos
- Uses a language model to convert noisy YouTube transcripts into formatted dialogues
- Associates dialogues directly with specific video frames
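
To make the last step concrete, here is a minimal sketch of how formatted dialogue turns might be paired with video frames to produce (frames, title, context) -> response examples. It is an illustration only, not the authors' pipeline: the `Turn` and `Example` classes, the `build_examples` helper, and the "latest frame at or before the turn" rule are all assumptions made for this example, and the language-model cleanup of the raw transcript is assumed to have already happened.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Turn:
    """One cleaned-up dialogue turn (hypothetical structure)."""
    speaker: str
    text: str
    start_sec: float  # when the turn begins in the video


@dataclass
class Example:
    """One training example in the input/output shape the paper describes."""
    title: str
    frame_times: list[float]  # timestamps of the frames shown to the model
    context: list[str]        # dialogue so far
    response: str             # the turn the model should produce


def frame_at_or_before(frame_times: list[float], t: float) -> float:
    """Return the latest sampled frame time <= t (frame_times must be sorted)."""
    i = bisect_right(frame_times, t) - 1
    return frame_times[max(i, 0)]  # fall back to the first frame


def build_examples(title: str, turns: list[Turn],
                   frame_times: list[float]) -> list[Example]:
    """Treat every turn after the first as a response to the turns before it."""
    examples = []
    for k in range(1, len(turns)):
        context_turns = turns[:k]
        # Assumed alignment rule: one frame per context turn, deduplicated.
        frames = sorted({frame_at_or_before(frame_times, t.start_sec)
                         for t in context_turns})
        examples.append(Example(
            title=title,
            frame_times=frames,
            context=[f"{t.speaker}: {t.text}" for t in context_turns],
            response=turns[k].text,
        ))
    return examples
```

Calling `build_examples` on a three-turn dialogue yields two examples: the second turn as a response to the first, and the third turn as a response to the first two, each carrying the frames sampled closest to its context turns.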
Models like CHAMPAGNE show that the silent social cues in conversation are, much like every other fuzzy pattern, something that you can teach a machine to understand given a large enough dataset. They also suggest some of the more tantalizing and weird things we can look forward to in the future: AI models that observe you, trying to predict what will satisfy you by modeling you not only as an emitter-of-text, but also as an organic form.