@misc{dwibedi2024flexcap,
title={FlexCap: Generating Rich, Localized, and Flexible Captions in Images},
author={Debidatta Dwibedi and Vidhi Jain and Jonathan Tompson and Andrew Zisserman and Yusuf Aytar},
year={2024},
eprint={2403.12026},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We introduce a versatile flexible-captioning vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this we create large-scale training datasets of image region descriptions of varying length, starting from captioned images.
This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a localize-then-describe approach with FlexCap can be better at open-ended object detection than a describe-then-localize approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog.
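As a rough illustration of the length conditioning described above, the sketch below shows one way length-conditioned training targets could be formed from region descriptions, by prepending a length token to each caption. The token format and the `make_length_conditioned_example` helper are hypothetical, not the paper's actual data pipeline.

```python
# Minimal sketch (assumptions, not the released pipeline): FlexCap-style
# training pairs condition the decoder on a desired caption length. Here we
# simply prepend a length token to each region description so the model can
# learn "length token -> caption of that length".

def make_length_conditioned_example(box, caption):
    """box: (x1, y1, x2, y2) in normalized coordinates; caption: region description."""
    n_words = len(caption.split())
    target = f"<len_{n_words}> {caption}"
    return {"box": box, "target": target}

examples = [
    make_length_conditioned_example((0.1, 0.2, 0.5, 0.8), "dog"),
    make_length_conditioned_example((0.1, 0.2, 0.5, 0.8),
                                    "a brown dog lying on a wooden porch"),
]
for ex in examples:
    print(ex)
```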
@misc{jain2024vid2robot,
title={Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers},
author={Vidhi Jain and Maria Attarian and Nikhil J Joshi and Ayzaan Wahid and Danny Driess and Quan Vuong and Pannag R Sanketi and Pierre Sermanet and Stefan Welker and Christine Chan and Igor Gilitschenski and Yonatan Bisk and Debidatta Dwibedi},
year={2024},
eprint={2403.12943},
archivePrefix={arXiv},
primaryClass={cs.RO}
}
While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: Can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its own physical constraints and environment.
We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human videos and robot trajectories. The model leverages cross-attention mechanisms to fuse prompt video features with the robot's current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations.
We evaluate Vid2Robot on real-world robots, demonstrating a 23% improvement in performance compared to other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another, and long-horizon composition, thus showcasing its potential for real-world applications.
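The central mechanism described above is cross-attention from the robot's current observation to the prompt video. A minimal, hedged sketch of that fusion step is shown below; the dimensions, token counts, pooling, and action head are illustrative assumptions rather than the released Vid2Robot architecture.

```python
# Hedged sketch: robot-state tokens cross-attend to prompt-video tokens, and
# the fused tokens are decoded into an action vector.

import torch
import torch.nn as nn

class PromptConditionedPolicy(nn.Module):
    def __init__(self, dim=256, n_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, state_tokens, prompt_tokens):
        # state_tokens:  (B, S, dim)  tokens of the robot's current observation
        # prompt_tokens: (B, P, dim)  encoded prompt-video tokens
        fused, _ = self.cross_attn(query=state_tokens,
                                   key=prompt_tokens,
                                   value=prompt_tokens)
        fused = self.norm(state_tokens + fused)
        # Pool over state tokens and predict a single action vector.
        return self.action_head(fused.mean(dim=1))

policy = PromptConditionedPolicy()
action = policy(torch.randn(2, 16, 256), torch.randn(2, 64, 256))
print(action.shape)  # torch.Size([2, 7])
```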
ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation
Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, Wayne Gramlich, Torr Hage, Alexander Herzog, Jonathan Hoech, Thinh Nguyen, Ian Storz, Baruch Tabanpour, Leila Takayama, Jonathan Tompson, Ayzaan Wahid, Ted Wahrburg, Sichun Xu, Sergey Yaroshenko, Kevin Zakka, Tony Z. Zhao
Diverse demonstration datasets have powered significant advances in robot learning, but the dexterity and scale of such data can be limited by the hardware cost, the hardware robustness, and the ease of teleoperation. We introduce ALOHA 2, an enhanced version of ALOHA that has greater performance, ergonomics, and robustness compared to the original design. To accelerate research in large-scale bimanual manipulation, we open source all hardware designs of ALOHA 2 with a detailed tutorial, together with a MuJoCo model of ALOHA 2 with system identification.
RT-H: Action Hierarchies Using Language
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh
If a robot first describes in words what it is going to do, it does the task better.
Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions.
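The two-stage action hierarchy above can be summarized as a small inference loop: first query the model for a language motion, optionally let a human override it, then query for the low-level action conditioned on that motion. The sketch below uses hypothetical stand-ins (`predict_language_motion`, `predict_action`, and a simplified `env` interface); it illustrates the control flow, not the RT-H implementation.

```python
# Sketch of the RT-H-style two-stage inference loop, including the
# language-motion correction path. All callables are hypothetical stand-ins.

def run_episode(env, task, predict_language_motion, predict_action,
                human_correction=None, max_steps=100):
    obs = env.reset()
    for _ in range(max_steps):
        # Stage 1: task + image -> fine-grained language motion,
        # e.g. "move arm forward".
        motion = predict_language_motion(obs, task)
        if human_correction is not None:
            # A human can override the predicted language motion.
            motion = human_correction(obs, task, motion) or motion
        # Stage 2: task + image + language motion -> low-level action.
        action = predict_action(obs, task, motion)
        obs, done = env.step(action)  # simplified env: returns (obs, done)
        if done:
            break
```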
@inproceedings{rth2024arxiv,
title={RT-H: Action Hierarchies using Language},
author={Suneel Belkhale and Tianli Ding and Ted Xiao and Pierre Sermanet and Quan Vuong and Jonathan Tompson and Yevgen Chebotar and Debidatta Dwibedi and Dorsa Sadigh},
booktitle={arXiv preprint arXiv:2403.01823},
year={2024}
}
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, Peng Xu, Steve Xu, Zhuo Xu
Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet-scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction-following data-collection robots that can align to human preferences.
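Read as pseudocode, the system above is an orchestration loop: describe each robot's scene with a VLM, ask an LLM for candidate instructions, filter them for safety and feasibility, and dispatch the survivors for teleoperated or autonomous collection. The sketch below is only that loop; every function and the robot interface are hypothetical stand-ins, not the AutoRT codebase.

```python
# Hedged sketch of the orchestration loop described in the abstract.
# describe_scene, propose_instructions, passes_safety_check, and
# choose_collection_mode are hypothetical stand-ins.

def orchestrate(robots, describe_scene, propose_instructions,
                passes_safety_check, choose_collection_mode):
    episodes = []
    for robot in robots:
        scene = describe_scene(robot.camera_image())        # VLM grounding
        candidates = propose_instructions(scene)            # LLM proposals
        tasks = [t for t in candidates if passes_safety_check(t, scene)]
        for task in tasks:
            mode = choose_collection_mode(task)  # e.g. "teleop" or "policy"
            episodes.append(robot.collect_episode(task, mode))
    return episodes
```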
@misc{gdm2024autort,
title={AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents},
author={Michael Ahn and Debidatta Dwibedi and Chelsea Finn and Montse Gonzalez Arenas and Keerthana Gopalakrishnan and Karol Hausman and Brian Ichter and Alex Irpan and Nikhil Joshi and Ryan Julian and Sean Kirmani and Isabel Leal and Edward Lee and Sergey Levine and Yao Lu and Sharath Maddineni and Kanishka Rao and Dorsa Sadigh and Pannag Sanketi and Pierre Sermanet and Quan Vuong and Stefan Welker and Fei Xia and Ted Xiao and Peng Xu and Steve Xu and Zhuo Xu},
year={2024},
eprint={2401.12963},
archivePrefix={arXiv},
primaryClass={cs.RO}
}
We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that for a fixed collection budget it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making it deployable with human oversight even if imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa trained on our dataset that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline and is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs with an average error rate reduction of 19% across all VQA tasks.
@inproceedings{robovqa2023arxiv,
title={RoboVQA: Multimodal Long-Horizon Reasoning for Robotics},
author={Pierre Sermanet and Tianli Ding and Jeffrey Zhao and Fei Xia and Debidatta Dwibedi and Keerthana Gopalakrishnan and Christine Chan and Gabriel Dulac-Arnold and Sharath Maddineni and Nikhil J Joshi and Pete Florence and Wei Han and Robert Baruch and Yao Lu and Suvir Mirchandani and Peng Xu and Pannag Sanketi and Karol Hausman and Izhak Shafran and Brian Ichter and Yuan Cao},
booktitle={arXiv preprint arXiv:2311.00899},
year={2023}
}
In semi-supervised learning, student-teacher distribution matching has been successful in improving performance of models using unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup where we do not have access to any labeled data during pre-training. We introduce our algorithm, Q-Match, and show it is possible to induce the student-teacher distributions without any knowledge of downstream classes by using a queue of embeddings of samples from the unlabeled dataset. We focus our study on tabular datasets and show that Q-Match outperforms previous self-supervised learning techniques when measuring downstream classification performance. Furthermore, we show that our method is sample efficient--in terms of both the labels required for downstream training and the amount of unlabeled data required for pre-training--and scales well to the sizes of both the labeled and unlabeled data.
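A minimal sketch of the queue-induced distribution matching described above: both the teacher view and the student (corrupted) view are compared against a queue of unlabeled embeddings, and the student's distribution over the queue is trained to match the teacher's. Temperatures, normalization, and stop-gradient placement are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged PyTorch sketch of queue-induced distribution matching.

import torch
import torch.nn.functional as F

def q_match_loss(student_emb, teacher_emb, queue, t_student=0.1, t_teacher=0.05):
    # student_emb, teacher_emb: (B, D) embeddings of two views of the batch
    # queue: (Q, D) embeddings of previously seen unlabeled samples
    student_emb = F.normalize(student_emb, dim=1)
    teacher_emb = F.normalize(teacher_emb, dim=1)
    queue = F.normalize(queue, dim=1)

    # Distributions over the queue induced by each embedding.
    student_logits = student_emb @ queue.t() / t_student            # (B, Q)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_emb @ queue.t() / t_teacher, dim=1)

    # Cross-entropy between the teacher (target) and student distributions.
    return -(teacher_probs * F.log_softmax(student_logits, dim=1)).sum(1).mean()

loss = q_match_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(1024, 128))
print(loss.item())
```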
@misc{mulc2023qmatch,
title={Q-Match: Self-Supervised Learning by Matching Distributions Induced by a Queue},
author={Thomas Mulc and Debidatta Dwibedi},
year={2023},
eprint={2302.05444},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task that are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object-agnostic, and we demonstrate that the resulting representations are insufficient for general-purpose robotics tasks as they fail to capture the complexity of scenes with many components. In this paper, we explore the effectiveness of using object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20 percent increase in performance in low-data regimes (1000 trajectories) in policy training using implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines for the task of object localization in multi-object scenes.
@inproceedings{heravi2023visuomotor,
title={Visuomotor control in multi-object scenes using object-aware representations},
author={Heravi, Negin and Wahid, Ayzaan and Lynch, Corey and Florence, Pete and Armstrong, Travis and Tompson, Jonathan and Sermanet, Pierre and Bohg, Jeannette and Dwibedi, Debidatta},
booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
pages={9515--9522},
year={2023},
organization={IEEE}
}
Self-supervised learning algorithms based on instance discrimination train encoders to be invariant to pre-defined transformations of the same instance. While most methods treat different views of the same image as positives for a contrastive loss, we are interested in using positives from other instances in the dataset. Our method, Nearest-Neighbor Contrastive Learning of visual Representations (NNCLR), samples the nearest neighbors from the dataset in the latent space, and treats them as positives. This provides more semantic variations than pre-defined transformations. We find that using the nearest-neighbor as positive in contrastive losses improves performance significantly on ImageNet classification, from 71.7% to 75.6%, outperforming previous state-of-the-art methods. On semi-supervised learning benchmarks we improve performance significantly when only 1% ImageNet labels are available, from 53.8% to 56.5%. On transfer learning benchmarks our method outperforms state-of-the-art methods (including supervised learning with ImageNet) on 8 out of 12 downstream datasets. Furthermore, we demonstrate empirically that our method is less reliant on complex data augmentations. We see a relative reduction of only 2.1% ImageNet Top-1 accuracy when we train using only random crops.
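One direction of the NNCLR objective described above can be sketched as follows: replace the usual positive (the other augmented view) with that view's nearest neighbor in a support queue, then apply a standard InfoNCE loss. The full method symmetrizes the loss, uses a prediction head, and updates the queue every step; the temperature and queue size below are illustrative.

```python
# Hedged sketch of one direction of the nearest-neighbor contrastive loss.

import torch
import torch.nn.functional as F

def nnclr_loss(z1, z2, support, temperature=0.1):
    # z1, z2: (B, D) embeddings of two augmented views; support: (Q, D) queue.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    support = F.normalize(support, dim=1)

    # Nearest neighbor of each z1 in the support set (no gradient through it).
    with torch.no_grad():
        nn_idx = (z1 @ support.t()).argmax(dim=1)
        nn1 = support[nn_idx]                           # (B, D)

    # InfoNCE: NN(z1) should match z2 of the same sample, against the batch.
    logits = nn1 @ z2.t() / temperature                 # (B, B)
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

loss = nnclr_loss(torch.randn(16, 128), torch.randn(16, 128), torch.randn(4096, 128))
print(loss.item())
```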
@InProceedings{Dwibedi_2021_ICCV,
author = {Dwibedi, Debidatta and Aytar, Yusuf and Tompson, Jonathan and Sermanet, Pierre and Zisserman, Andrew},
title = {With a Little Help From My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {9588-9597}
}
We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample efficient than current best methods.
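The reward construction described above is deliberately simple: once the embedding captures task progress, the reward is just the negative distance between the current frame's embedding and a goal embedding computed from demonstration end-frames. A minimal sketch (with a dummy stand-in encoder) follows.

```python
# Minimal sketch of the embedding-space reward; `encoder` stands in for a
# frame encoder trained with temporal cycle-consistency.

import numpy as np

def embedding_reward(encoder, frame, goal_embedding, scale=1.0):
    """Reward = -||phi(frame) - goal|| (larger means closer to the goal)."""
    emb = encoder(frame)
    return -scale * np.linalg.norm(emb - goal_embedding)

# Example with a dummy encoder:
encoder = lambda frame: frame.mean(axis=(0, 1))   # stand-in for phi(.)
goal = np.zeros(3)
print(embedding_reward(encoder, np.random.rand(64, 64, 3), goal))
```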
@article{zakka2021xirl,
title = {XIRL: Cross-embodiment Inverse Reinforcement Learning},
author = {Zakka, Kevin and Zeng, Andy and Florence, Pete and Tompson, Jonathan and Bohg, Jeannette and Dwibedi, Debidatta},
journal = {Conference on Robot Learning (CoRL)},
year = {2021}
}
We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called RepNet, with a synthetic dataset that is generated from a large unlabeled video collection by sampling short clips of varying lengths and repeating them with different periods and counts. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our model substantially exceeds state-of-the-art performance on existing periodicity (PERTUBE) and repetition counting (QUVA) benchmarks. We also collect a new challenging dataset called Countix (~90 times larger than existing datasets) which captures the challenges of repetition counting in real-world videos.
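The bottleneck mentioned above is a temporal self-similarity matrix: pairwise similarities between per-frame embeddings, which the period predictor consumes instead of the raw features. A minimal sketch using negative squared distances with a row-wise softmax is below; the exact similarity function and normalization used in RepNet may differ.

```python
# Hedged sketch of a temporal self-similarity matrix (TSM).

import numpy as np

def self_similarity_matrix(embeddings, temperature=1.0):
    # embeddings: (T, D) per-frame embeddings
    diff = embeddings[:, None, :] - embeddings[None, :, :]     # (T, T, D)
    sim = -np.sum(diff ** 2, axis=-1) / temperature             # (T, T)
    # Row-wise softmax normalization.
    sim = sim - sim.max(axis=1, keepdims=True)
    exp = np.exp(sim)
    return exp / exp.sum(axis=1, keepdims=True)

tsm = self_similarity_matrix(np.random.randn(64, 512))
print(tsm.shape)  # (64, 64)
```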
@InProceedings{Dwibedi_2020_CVPR,
author = {Dwibedi, Debidatta and Aytar, Yusuf and Tompson, Jonathan and Sermanet, Pierre and Zisserman, Andrew},
title = {Counting Out Time: Class Agnostic Video Repetition Counting in the Wild},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}
We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using the nearest-neighbors in the learned embedding space.
To evaluate the power of the embeddings, we densely label the Pouring and Penn Action video datasets for action phases. We show that (i) the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and (ii) TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks. The embeddings are also used for a number of applications based on alignment (dense temporal correspondence) between video pairs, including transfer of metadata of synchronized modalities between videos (sounds, temporal semantic labels), synchronized playback of multiple videos, and anomaly detection.
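One variant of the TCC loss (cycle-back regression) can be written compactly: soft-match frame i of video U into video V, cycle the soft match back into U, and penalize how far the returned index lands from i. The sketch below shows that single-frame term; the paper also describes a variance-normalized version and sums the loss over frames and video pairs.

```python
# Hedged sketch of the cycle-back regression form of the TCC loss.

import torch
import torch.nn.functional as F

def tcc_cycle_back_regression(u, v, i):
    # u: (N, D) frame embeddings of video U; v: (M, D) of video V; i: index into U
    alpha = F.softmax(-torch.cdist(u[i:i + 1], v).squeeze(0) ** 2, dim=0)    # (M,)
    v_tilde = alpha @ v                                                      # (D,)
    beta = F.softmax(-torch.cdist(v_tilde[None], u).squeeze(0) ** 2, dim=0)  # (N,)
    mu = (beta * torch.arange(u.size(0), dtype=u.dtype)).sum()
    return (mu - float(i)) ** 2

loss = tcc_cycle_back_regression(torch.randn(20, 128), torch.randn(24, 128), i=5)
print(loss.item())
```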
@InProceedings{Dwibedi_2019_CVPR,
author = {Dwibedi, Debidatta and Aytar, Yusuf and Tompson, Jonathan and Sermanet, Pierre and Zisserman, Andrew},
title = {Temporal Cycle-Consistency Learning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework. The first is the implicit bias present in the reward functions used in these algorithms. While these biases might work well for some environments, they can also lead to sub-optimal behavior in others. The second is that, even though these algorithms can learn from few expert demonstrations, for many real-world applications they require a prohibitively large number of interactions with the environment in order to imitate the expert. To address these issues, we propose a new algorithm called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity by an average factor of 10. Furthermore, since our reward function is designed to be unbiased, we can apply our algorithm to many problems without making any task-specific adjustments.
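On the reward-bias point above: with a discriminator D, the common GAIL-style rewards -log(1 - D) and log(D) are strictly positive or strictly negative, which implicitly rewards or penalizes episode length. One sign-unbiased choice is log D - log(1 - D), which equals the raw discriminator logit. The sketch below shows only that identity; whether this matches the paper's exact reward (which also involves absorbing-state handling and an off-policy learner) is an assumption.

```python
# Hedged sketch of a sign-unbiased discriminator reward.

import torch

def discriminator_reward(d_logits):
    """d_logits: discriminator logits for (state, action) pairs.

    With D = sigmoid(logits), log D - log(1 - D) is exactly the logit, so the
    reward is simply the raw discriminator output and can take either sign.
    """
    return d_logits  # = log D - log(1 - D)

print(discriminator_reward(torch.tensor([2.0, -1.0, 0.0])))
```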
@inproceedings{kostrikov2018discriminatoractorcritic,
title={Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning},
author={Ilya Kostrikov and Kumar Krishna Agrawal and Debidatta Dwibedi and Sergey Levine and Jonathan Tompson},
booktitle={International Conference on Learning Representations},
year={2019},
url={https://openreview.net/forum?id=Hk4fpoA5Km},
}
In this work we explore a new approach for robots to teach themselves about the world simply by observing it. In particular we investigate the effectiveness of learning task-agnostic representations for continuous control tasks. We extend Time-Contrastive Networks (TCN), which learn from visual observations, by embedding multiple frames jointly in the embedding space as opposed to a single frame. We show that by doing so, we are now able to encode both position and velocity attributes significantly more accurately. We test the usefulness of this self-supervised approach in a reinforcement learning setting. We show that the representations learned by agents observing themselves taking random actions, or watching other agents perform tasks successfully, can enable the learning of continuous control policies using algorithms like Proximal Policy Optimization (PPO), with only the learned embeddings as input. We also demonstrate significant improvements on the real-world Pouring dataset, with a relative error reduction of 39.4% for motion attributes and 11.1% for static attributes compared to the single-frame baseline.
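The multi-frame extension described above can be sketched as an encoder that consumes a short window of per-frame features jointly, so the embedding can carry velocity as well as position. The network below is a generic stand-in for illustration, not the paper's architecture.

```python
# Hedged sketch of a multi-frame encoder.

import torch
import torch.nn as nn

class MultiFrameEncoder(nn.Module):
    def __init__(self, n_frames=4, feat_dim=512, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_frames * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, frame_features):
        # frame_features: (B, n_frames, feat_dim) per-frame CNN features
        return self.net(frame_features.flatten(start_dim=1))

enc = MultiFrameEncoder()
print(enc(torch.randn(2, 4, 512)).shape)  # torch.Size([2, 32])
```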
@inproceedings{dwibedi2018learning,
author = {Dwibedi, Debidatta and Tompson, Jonathan and Lynch, Corey and Sermanet, Pierre},
title = {Learning Actionable Representations from Visual Observations},
booktitle = {2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {1577--1584},
year = {2018},
organization = {IEEE},
url = {https://arxiv.org/abs/1808.00928}
}
A major impediment in rapidly deploying object detection models for instance detection is the lack of large annotated datasets. For example, finding a large labeled dataset containing instances in a particular kitchen is unlikely. Each new environment with new instances requires expensive data collection and annotation. In this paper, we propose a simple approach to generate large annotated instance datasets with minimal effort. Our key insight is that ensuring only patch-level realism provides enough training signal for current object detector models. We automatically "cut" object instances and "paste" them on random backgrounds. A naive way to do this results in pixel artifacts that lead to poor performance for trained models. We show how to make detectors ignore these artifacts during training and generate data that gives competitive performance on real data. Our method outperforms existing synthesis approaches and, when combined with real images, improves relative performance by more than 21% on benchmark datasets. In a cross-domain setting, our synthetic data combined with just 10% real data outperforms models trained on all real data.
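The core data-generation step above is easy to sketch: composite an object cutout onto a background and record the resulting box as the annotation. The paper's key point is to blend the pasted boundary (e.g. with Gaussian or Poisson blending) so detectors cannot latch onto pixel artifacts; in the sketch below a soft alpha mask stands in for that blending.

```python
# Hedged sketch of cut-and-paste synthesis with a box annotation.

import numpy as np

def paste_object(background, obj_rgba, x, y):
    """background: (H, W, 3) uint8; obj_rgba: (h, w, 4) uint8 cutout with alpha.

    Returns the composited image and the pasted object's box (x1, y1, x2, y2).
    """
    h, w = obj_rgba.shape[:2]
    out = background.astype(np.float32).copy()
    alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0   # soft mask -> blending
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * obj_rgba[..., :3] + (1 - alpha) * region
    return out.astype(np.uint8), (x, y, x + w, y + h)

bg = np.zeros((480, 640, 3), dtype=np.uint8)
obj = np.random.randint(0, 255, (50, 40, 4), dtype=np.uint8)
img, box = paste_object(bg, obj, x=100, y=200)
print(img.shape, box)
```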
@InProceedings{Dwibedi_2017_ICCV,
author = {Dwibedi, Debidatta and Misra, Ishan and Hebert, Martial},
title = {Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}
We present a Deep Cuboid Detector which takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects). Contrary to classical approaches which fit a 3D model from low-level cues like corners, edges, and vanishing points, we propose an end-to-end deep learning system to detect cuboids across many semantic categories (e.g., ovens, shipping boxes, and furniture). We localize cuboids with a 2D bounding box, and simultaneously localize the cuboid's corners, effectively producing a 3D interpretation of box-like objects. We refine keypoints by pooling convolutional features iteratively, improving the baseline method significantly. Our deep learning cuboid detector is trained in an end-to-end fashion and is suitable for real-time applications in augmented reality (AR) and robotics.
@article{dwibedi2016deep,
title={Deep cuboid detection: Beyond 2d bounding boxes},
author={Dwibedi, Debidatta and Malisiewicz, Tomasz and Badrinarayanan, Vijay and Rabinovich, Andrew},
journal={arXiv preprint arXiv:1611.10010},
year={2016}
}
Where do the predicates in a game ontology come from? We use RGBD vision to learn a) the spatial structure of a board, and b) the number of parameters in a move or transition. These are used to define state-transition predicates for a logical description of each game state. Given a set of videos for a game, we use improved 3D multi-object tracking to obtain the positions of each piece in games such as 4-peg solitaire or Towers of Hanoi. The spatial positions occupied by pieces over the entire game are clustered, revealing the structure of the board. Each frame is represented as a Semantic Graph with edges encoding spatial relations between pieces. Changes in the graphs between game states reveal the structure of a “move”. Knowledge from the spatial structure and semantic graphs is mapped to first-order logic (FOL) descriptions of the moves and used in an Inductive Logic framework to infer the valid moves and other rules of the game. Discovered predicate structures and induced rules are demonstrated for several games with varying board layouts and move structures.
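The first stage described above, recovering the board from tracked piece positions, amounts to clustering the 3D positions occupied over a game. The sketch below uses DBSCAN as a stand-in for the clustering; the actual method and thresholds in the paper may differ.

```python
# Hedged sketch: cluster tracked piece positions to recover board locations.

import numpy as np
from sklearn.cluster import DBSCAN

def discover_board_locations(piece_positions, eps=0.02, min_samples=5):
    """piece_positions: (N, 3) array of tracked piece positions (meters)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(piece_positions)
    centers = [piece_positions[labels == k].mean(axis=0)
               for k in sorted(set(labels)) if k != -1]   # -1 = noise
    return np.stack(centers)

# Example: noisy observations around 4 board positions.
rng = np.random.default_rng(0)
true = np.array([[0, 0, 0], [0.1, 0, 0], [0.2, 0, 0], [0.3, 0, 0]])
obs = np.concatenate([p + 0.005 * rng.standard_normal((30, 3)) for p in true])
print(discover_board_locations(obs).round(2))
```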
@inproceedings{dwibedi2014characterizing,
title={Characterizing Predicate Arity and Spatial Structure for Inductive Learning of Game Rules},
author={Dwibedi, Debidatta and Mukerjee, Amitabha},
booktitle={European Conference on Computer Vision},
pages={323--338},
year={2014},
organization={Springer}
}
@InProceedings{Dwibedi_2018_CVPR_Workshops,
author = {Dwibedi, Debidatta and Sermanet, Pierre and Tompson, Jonathan},
title = {Temporal Reasoning in Videos Using Convolutional Gated Recurrent Units},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2018}
}