
Satwik Kottur

Research Scientist
Facebook AI
Menlo Park, CA

Email Facebook LinkedIn Github


My interests broadly lie in the fields of computer vision, machine learning and natural language processing. I recently got interested in understanding semantics from vision and language by solving multimodal AI tasks such as visual dialog, visual question answering, and conversational text generation using deep learning tools.


  1. Seungwhan Moon*, Satwik Kottur*, Paul A. Crook^, Ankita De^, Shivani Poddar^, Theodore Levin, David Whitney, Daniel Difranco, Ahmad Beirami, Eunjoon Cho, Rajen Subba, Alborz Geramifard
    *,^ equal contribution
    Situated and Interactive Multimodal Conversations
    International Conference on Computational Linguistics (COLING), 2020
    Oral Presentation

  2. Paul Pu Liang, Jeffrey Chen, Ruslan Salakhutdinov, Louis-Philippe Morency, Satwik Kottur
    On Emergent Communication in Competitive Multi-Agent Teams
    International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2020
    Oral Presentation

  3. Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
    CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
    Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019
    Oral Presentation

  4. Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
    Visual Coreference Resolution in Visual Dialog using Neural Module Networks
    European Conference on Computer Vision (ECCV), 2018.

  5. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan Salakhutdinov, Alex Smola
    Deep Sets
    Oral Presentation
    Conference on Neural Information Processing Systems (NIPS), 2017.

  6. Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra
    Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog
    Best Short Paper Award
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

  7. Abhishek Das*, Satwik Kottur*, José M. F. Moura, Stefan Lee, Dhruv Batra
    * equal contribution
    Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
    Oral Presentation
    International Conference on Computer Vision (ICCV), 2017.

  8. Manzil Zaheer*, Satwik Kottur*, José M. F. Moura, Amr Ahmed, Alex Smola
    * equal contribution
    Canopy – Fast Sampling with Cover Trees
    International Conference on Machine Learning (ICML), 2017.

  9. Satwik Kottur, Vitor Carvalho, Xiaoyu Wang
    Exploring Personalized Neural Conversational Models
    International Joint Conference on Artificial Intelligence (IJCAI), 2017.

  10. Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra
    Visual Dialog
    Spotlight, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  11. Satwik Kottur*, Ramakrishna Vedantam*, José M. F. Moura, Devi Parikh
    * equal contribution
    Visual Word2vec (vis-w2v): Learning Visually grounded Word Embeddings Using Abstract Scenes
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  12. Manzil Zaheer, Michael Wick, Satwik Kottur, Jean-Baptiste Tristan
    Comparing Gibbs, EM and SEM for MAP Inference in Mixture Models
    OPT: NIPS Workshop on Optimization for Machine Learning, 2015.

  13. Evgeny Toropov, Liangyan Gui, Shanghang Zhang, Satwik Kottur, José M. F. Moura
    Traffic Flow from a Low Frame Rate City Camera
    Big Data Processing and Analysis (special session) in IEEE International Conference on Image Processing (ICIP), 2015.

Other projects

  • Spoken Dialog System with Audio and Text
    Topics in Deep Learning (10-807), Fall 2016
    Instructor: Prof. Ruslan Salakhutdinov
    The growth of technology places great importance on human-machine interaction. With the advent of deep learning, conversational agents that engage with humans through free-form natural language have become popular. However, most of these attempts are purely text-based and ignore audio cues. Emotional information in human-human conversations, encoded in audio cues such as intonation and pitch, often plays an important role that is ignored by text-based approaches. In other words, a response depends not only on what you say but also on how you say it. In this work, we therefore explore generative models that are trained jointly on audio and text.

  • Stochastic Expectation Maximization for Latent Variable Models
    Convex Optimization (10-725), Fall 2015
    Instructor: Prof. Ryan Tibshirani
    In this project, we implement and study a type of stochastic optimization. This method, based on expectation-maximization, is asynchronous and embarrassingly parallel, and is thus useful for inference in latent variable models. The motivation comes from a desire to directly design an inference procedure from a “comptastical” (computational + statistical) perspective, capable of leveraging modern computational resources such as GPUs or cloud computing that offer massive parallelism. We also find an interesting connection between stochastic expectation-maximization and stochastic gradient descent, which strengthens the validity of the proposed method.

  • Non-smooth Stochastic Optimization for MCMC
    Probabilistic Graphical Models (10-708), Spring 2015
    Instructor: Prof. Eric Xing
    How do we sample efficiently from the Bayesian Lasso in a high-dimensional problem with a large dataset? Hybrid Monte Carlo (HMC) has grown in popularity because it enables more efficient exploration of the state space in high-dimensional problems. In addition, Stochastic Gradient-HMC has been proposed to enable the application of HMC to large datasets. However, these methods apply only to sampling from smooth energy functions. We propose two ways of dealing with this: (1) SPG-HMC: Stochastic Proximal Gradient-HMC, to enable sampling from non-smooth energy functions without losing the benefits of stochasticity, and (2) Smoothing-SG-HMC. Further, we analyze their properties theoretically and empirically.

  • Movie Recommendation based on Collaborative Topic Modeling
    Machine Learning (10-701), Fall 2014
    Instructor: Prof. Geoff Gordon and Prof. Aarti Singh
    Traditional collaborative filtering relies on ratings provided by viewers in the movie-watching community to make recommendations to the user. In this project, we attempt to combine this approach with probabilistic topic modeling techniques to make recommendations that consist not only of movies that are popular in the community, but also those that are similar in content to movies that the user has enjoyed in the past.

  • Detecting Text in Natural Images
    Computer Vision (16-720), Fall 2014
    Instructor: Prof. Martial Hebert
    Intelligent systems often need to read text in their surroundings. There are multiple aspects that make this a challenging problem. For instance, locating and identifying the part of an image that contains text is in itself difficult. We study a recent approach that uses the stroke width transform, and analyse the success and failure cases to get a clearer understanding.

  • Static Vehicle Detection and Analysis in Aerial Imagery using Depth
    Internship at IRIS, University of Southern California, Summer 2013
    Guide: Prof. Gerard Medioni
    This report proposes an approach to automatically detect static vehicles in an outdoor parking space using depth. The relevant 3D information is generated from a Digital Surface Model (DSM), which is a result of a novel and existing technique to solve camera pose estimation and dense reconstruction simultaneously. Validation using local 2D features, based on existing methods, is then done to ensure better detection rates. Further, performance of the detection system is evaluated by changing the internal parameterization of 3D model generation and the dependence is analyzed.

  • Human Activity Recognition
    B.Tech project-I, IIT Bombay, Fall 2013
    Guide: Prof. Subhasis Chaudhuri
    Human activity recognition is gaining importance, not only in view of security and surveillance but also due to psychological interest in understanding the behavioral patterns of humans. This project is a study of various existing techniques that have been brought together to form a working pipeline to study human activity in social gatherings. Humans are first detected with deformable part models and tracked as feature points in a 2.5D coordinate system using the Lucas-Kanade algorithm. A linear cyclic pursuit model is then employed to predict short-term trajectories and understand behavior.

  • Autonomous Underwater Vehicle (AUV-IITB)
    AUVSI and ONR’s International Robosub Competition, San Diego, USA
    Vision (Spring 2012 - Spring 2013)
    Guides: Dr. Hemendra Arya and Dr. Leena Vachhani
    Designing and developing an unmanned autonomous underwater vehicle (AUV) that localizes itself and performs realistic missions based on feedback from visual, inertial, acoustic and depth sensors using thrusters and pneumatic actuators.
    Matsya (the Sanskrit word for fish) is the AUV from IIT Bombay that participates in the International Robosub competition in San Diego, which draws university teams from countries all over the world.

  • Parallel Simulation of Verilog HDL designs
    Internship, IIT Bombay, Summer 2012
    Guide: Prof. Sachin Patkar
    Digital designs are simulated on a computer platform before synthesis to test their efficiency. Maximizing simulation performance and minimizing overhead is, therefore, a vital area of research. The main focus of this work is to parallelize the simulation of single-clock structural/behavioral hardware designs without any time or resource conflicts, resulting in a multi-fold reduction in execution time. I was awarded the Undergraduate Research Award (URA 01) for contribution to research at IIT Bombay.

You can find my other projects from undergraduate here.

Here is a list of all the courses I have taken, both during graduate and undergraduate studies.