Debidatta Dwibedi

I am a Senior Research Scientist at Google DeepMind. I completed my Master's in Robotics at the Robotics Institute at CMU, where I was advised by Martial Hebert. Before that, I completed my undergraduate degree at IIT Kanpur, where I worked with Amitabha Mukerjee.

Email  /  Google Scholar  /  GitHub /  LinkedIn  /  Twitter

Publications  /  Patents  /  Talks /  Theses /  Misc

Research

My research lies at the intersection of machine learning, computer vision, and robotics. Presently, I am working on improving vision-language models.

Publications

FlexCap: Generating Rich, Localized, and Flexible Captions in Images
Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar


Extract diverse information from images using VLMs by spatial, length and prefix conditioning.

paper | project | abstract | bibtex

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi


Robots learn to imitate humans from observation.

paper | project | abstract | bibtex

ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation
Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, Wayne Gramlich, Torr Hage, Alexander Herzog, Jonathan Hoech, Thinh Nguyen, Ian Storz, Baruch Tabanpour, Leila Takayama, Jonathan Tompson, Ayzaan Wahid, Ted Wahrburg, Sichun Xu, Sergey Yaroshenko, Kevin Zakka, Tony Z. Zhao


Next-gen low-cost bi-arm manipulation system.

paper | project | abstract

RT-H: Action Hierarchies Using Language
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh


Robots perform tasks better when they first describe in words what they are about to do.

paper | project | abstract | bibtex

AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, Peng Xu, Steve Xu, Zhuo Xu


Orchestrating a robot fleet with LLMs and VLMs.

paper | project | abstract | bibtex

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao
International Conference on Robotics and Automation (ICRA) 2024

Scaling up embodied VQA data with robots and humans.

paper | project | dataset | abstract | bibtex

Q-Match: Self-Supervised Learning by Matching Distributions Induced by a Queue
Tommy Mulc, Debidatta Dwibedi
Asian Conference on Machine Learning (ACML) 2023

Improved version of contrastive loss to learn features on tabular data.

paper | code | abstract | bibtex

Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations
Negin Heravi, Ayzaan Wahid, Corey Lynch, Pete Florence, Travis Armstrong, Jonathan Tompson, Pierre Sermanet, Jeannette Bohg, Debidatta Dwibedi
International Conference on Robotics and Automation (ICRA) 2023

Use slot attention to learn unsupervised features for robotics.

paper | project | abstract | bibtex

With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
International Conference on Computer Vision (ICCV) 2021

Improve contrastive losses used in self-supervised learning with nearest neighbors.

paper | abstract | bibtex

XIRL: Cross-embodiment Inverse Reinforcement Learning
Kevin Zakka, Andy Zeng, Pete Florence, Jonathan Tompson, Jeannette Bohg, Debidatta Dwibedi
Conference on Robot Learning (CoRL) 2021

Use visual rewards learned by aligning videos to train robots.

paper | abstract | bibtex | project | code

Counting Out Time: Class Agnostic Video Repetition Counting in the Wild
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
Computer Vision and Pattern Recognition (CVPR) 2020

Count repetitions in videos in a class-agnostic manner.

paper | abstract | bibtex | project | teaser video | Google AI blogpost | colab

Temporal Cycle-Consistency Learning
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
Computer Vision and Pattern Recognition (CVPR) 2019

Self-supervised representation learning based on temporal alignment for fine-grained video understanding tasks.

paper | interactive paper | abstract | bibtex | project | poster | Google AI blogpost | code | colab

Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning
Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson
International Conference on Learning Representations (ICLR) 2019

Sample efficient imitation learning using off-policy updates and proper handling of terminal states.

paper | abstract | bibtex | code

Learning Actionable Representations from Visual Observations
Debidatta Dwibedi, Jonathan Tompson, Corey Lynch and Pierre Sermanet
International Conference on Intelligent Robots and Systems (IROS) 2018

Control agents from pixels by learning self-supervised representations from videos.

paper | abstract | bibtex | project

Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
Debidatta Dwibedi, Ishan Misra and Martial Hebert
International Conference on Computer Vision (ICCV) 2017

Generate synthetic data for detecting objects in scenes.

paper | abstract | bibtex | code | poster

Deep Cuboid Detection: Beyond 2D Bounding Boxes
Debidatta Dwibedi, Tomasz Malisiewicz, Vijay Badrinarayanan and Andrew Rabinovich
arXiv preprint, 2016

Cuboid detector using deep learning: finds cuboids in scenes and localizes their corners.

paper | abstract | bibtex

Characterizing Predicate Arity and Spatial Structure for Inductive Learning of Game Rules
Debidatta Dwibedi and Amitabha Mukerjee
ECCV 2014 Workshop on Computer Vision + Ontology Applied Cross-Disciplinary Technologies

Represent videos as dynamic graphs. Learn the rules of games by observing people play them in Kinect videos.

paper | abstract | bibtex | videos

Patents

Deep learning system for cuboid detection

Talks

Temporal Cycle-Consistency Learning
Learning from Unlabeled Videos at CVPR 2019

Temporal Reasoning in Videos Using Convolutional Gated Recurrent Units
2nd Workshop on Brave New Ideas in Video Understanding at CVPR 2018

paper | slides | poster | bibtex

Self-Supervised Representation Learning for Continuous Control
3rd Workshop on Machine Learning in the Planning and Control of Robot Motion at ICRA 2018

Theses

Synthesizing Scenes for Instance Detection
How can we create annotated datasets without humans for tasks like object detection and pose estimation?

Observational Learning of Rules of Games
Can we learn the rules of a game by observing people playing it?

Miscellaneous

Some other unpublished work:

Playing Games with Deep Reinforcement Learning

Towards Pose Estimation of 3D Objects in Monocular Images via Keypoint Detection

HandNet: Using Faster R-CNN to Detect Hands in Egocentric Videos

A Grounded Framework for Gestures and its Applications

