Imagine a plaid elephant sitting in a teacup. A silly image, to be sure, but you can picture it all the same. This illustrates one of many differences between human language understanding and a machine’s: people are good at understanding sentences that combine words they have never seen together before, composing meanings that are completely novel.

Professor Jacob Andreas of MIT CSAIL is trying to close this gap between current machine learning techniques and human abilities to learn language and learn from language about the rest of the world. His research focuses on natural language processing and building intelligent systems that can communicate effectively, using language to learn from human guidance.

From a machine learning perspective, another remarkable thing about human language use is our ability to learn new words from a single exposure. For example, if someone tells you that they’re about to “dax” and then makes a whistling noise, you can already figure out what “dax” means, and what to do when they ask you to dax three times.

“The important thing there is that for all their current effectiveness, standard machine learning models, and neural networks especially, just cannot do this well at all,” says Prof. Andreas. “They need many, many examples to learn new words, and even more examples to figure out how to use those words in contexts different from any they have seen before.”
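To make the challenge concrete, here is a minimal, hypothetical sketch of the kind of train/test split researchers use to probe this gap, in the spirit of SCAN-style compositional generalization benchmarks. It is not drawn from Prof. Andreas’s work; the commands, action symbols, and the interpret helper are all invented for illustration. The nonce verb “dax” appears exactly once in training, yet the test set asks a learner to combine it with modifiers it has only seen applied to other verbs.

```python
# Illustrative sketch of a compositional-generalization split (SCAN-style).
# Not code from Prof. Andreas's group; all commands and symbols are made up.
# The nonce verb "dax" is seen exactly once in training, but the test set
# requires combining it with modifiers seen only with other verbs.

PRIMITIVES = {"walk": "WALK", "jump": "JUMP", "dax": "DAX"}

def interpret(command):
    """Translate a command like 'jump twice' into a list of action symbols."""
    words = command.split()
    verb, modifiers = words[0], words[1:]
    actions = [PRIMITIVES[verb]]
    for mod in modifiers:
        if mod == "twice":
            actions = actions * 2
        elif mod == "thrice":
            actions = actions * 3
    return actions

# Training data: "dax" appears only once, and only on its own.
train = [("walk", interpret("walk")),
         ("walk twice", interpret("walk twice")),
         ("jump", interpret("jump")),
         ("jump thrice", interpret("jump thrice")),
         ("dax", interpret("dax"))]          # the single exposure

# Test data: the learner must apply familiar modifiers to the new word.
test = [("dax twice", interpret("dax twice")),
        ("dax thrice", interpret("dax thrice"))]

if __name__ == "__main__":
    for command, actions in train + test:
        print(f"{command!r:15} -> {actions}")
```

A person who hears “dax” once generalizes to “dax twice” immediately; a standard sequence model trained only on pairs like those above typically does not.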

In a recent project sponsored by the CSAIL Alliances MachineLearningApplications@CSAIL research initiative, Prof. Andreas and the Embodied Intelligence Community of Research are building new kinds of neural language processing models that maintain the advantages, flexibility, and expressiveness of current models, but that learn more efficiently and understand language more compositionally, including new words put together in new ways.

Humans also learn words and language contextually: we learn words in social settings, and our language learning often involves visual stimuli of some kind. Prof. Andreas says that one of the big open research problems in NLP is how to use information from vision to build better models of language understanding, with training data and paradigms that ground words in real-world context.

“One of the cool things about this Embodied Intelligence research community that I’m in at MIT is that everybody is working on the boundaries of these problems,” he says. “We’re working toward models that have general-purpose concepts and reasoning skills that you can learn from any kind of supervision, whether it’s language, vision, or action.”

A language, vision, and action project that Prof. Andreas and his team recently worked on was a collaboration with a research group at Facebook. The project involved using language to help people do things in the real world, and helping autonomous agents take their own actions.

“One thing we’ve been looking at with Facebook is video games and building models that can more effectively play real-time strategy games with help from language. We’ve shown that if you train these things not just on examples of people playing these games but people playing these games while explaining, ‘Here’s the high-level action that I’m going to take right now,’ or ‘Here’s my long-term goal,’ you can much more efficiently and effectively train models to play these games,” Prof. Andreas says.
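As a rough illustration of that training setup, the sketch below shows one common pattern: a policy network that conditions on both the current game state and a tokenized natural-language annotation of the player’s high-level intent, trained by imitation on annotated demonstrations. This is an assumption-laden sketch, not the actual Facebook/CSAIL system; the dimensions, placeholder tensors, and the LanguageConditionedPolicy class are all invented for illustration.

```python
# Sketch of language-conditioned imitation learning for a strategy game.
# Assumptions throughout; this is NOT the actual Facebook/CSAIL model.
import torch
import torch.nn as nn

STATE_DIM, VOCAB, EMBED, HIDDEN, NUM_ACTIONS = 32, 500, 64, 128, 10

class LanguageConditionedPolicy(nn.Module):
    """Maps (game-state features, instruction tokens) to a distribution over macro-actions."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.instr_encoder = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.state_encoder = nn.Linear(STATE_DIM, HIDDEN)
        self.action_head = nn.Linear(2 * HIDDEN, NUM_ACTIONS)

    def forward(self, state, instr_tokens):
        _, h = self.instr_encoder(self.embed(instr_tokens))   # encode e.g. "build more workers"
        instr_vec = h.squeeze(0)                               # (batch, HIDDEN)
        state_vec = torch.relu(self.state_encoder(state))      # (batch, HIDDEN)
        return self.action_head(torch.cat([state_vec, instr_vec], dim=-1))

# Placeholder "demonstrations": game-state features, tokenized explanations of
# the player's intent, and the macro-action the human actually took.
states = torch.randn(8, STATE_DIM)
instructions = torch.randint(0, VOCAB, (8, 6))
actions = torch.randint(0, NUM_ACTIONS, (8,))

policy = LanguageConditionedPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
optimizer.zero_grad()
loss = nn.functional.cross_entropy(policy(states, instructions), actions)
loss.backward()
optimizer.step()
```

The point of the sketch is only that the language annotation gives the model a compact, reusable summary of intent alongside the raw gameplay, which is what Prof. Andreas describes as making training more efficient.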

He adds that while this is nice to be able to show from a general scientific standpoint, it may also help game developers make better AI opponents to play against in video games. In the long term, it also may help us build robots that can acquire skills more quickly with help from language.

Other NLP applications Prof. Andreas is currently working on include using language to help people build programs and to learn better libraries for solving programming tasks, training models to learn from text alone, and using language for image synthesis. Prof. Andreas says that in the future we may be able to do things like “talk to Photoshop to change the images we’re looking at, or even generate new images.”