End-to-end Parsing of Procedural Text into Flow Graphs
Dhaivat J. Bhatt, Seyed Ahmad Abdollahpouri Hosseini, Federico Fancellu, Afsaneh Fazly
LREC-COLING 2024
We focus on the problem of parsing procedural text into fine-grained flow graphs that encode actions and entities, as well as their interactions. Specifically, we focus on parsing cooking recipes, and address a few limitations of existing parsers. Unlike SOTA approaches to flow graph parsing, which work in two separate stages, first identifying actions and entities (tagging) and then encoding their interactions via connecting edges (graph generation), we propose an end-to-end multi-task framework that performs tagging and graph generation simultaneously. In addition, the end-to-end nature of our model lets us unify the input representation and use compact encoders, resulting in small models with significantly fewer parameters than SOTA models. Another key challenge in training flow graph parsers is the lack of sufficient annotated data, due to the costly nature of the fine-grained annotations. We address this problem by taking advantage of abundant unlabelled recipes, and show that pre-training on automatically generated noisy silver annotations (derived from unlabelled recipes) yields a large improvement in flow graph parsing.
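A minimal sketch of the joint tagging + edge-prediction idea behind such a multi-task parser (the encoder choice, layer sizes, and label counts below are placeholders, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class JointFlowGraphParser(nn.Module):
    """Shared encoder with two heads trained jointly: token tagging and labelled edge scoring."""
    def __init__(self, vocab=10000, hidden=256, n_tags=10, n_edge_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, 128)
        self.encoder = nn.LSTM(128, hidden // 2, batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(hidden, n_tags)                      # action/entity tags
        self.src_proj = nn.Linear(hidden, hidden)
        self.dst_proj = nn.Linear(hidden, hidden)
        self.edge_head = nn.Bilinear(hidden, hidden, n_edge_labels)    # labelled token-pair edges

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))                     # (B, T, hidden)
        tag_logits = self.tag_head(h)                                  # (B, T, n_tags)
        B, T, H = h.shape
        src = self.src_proj(h).unsqueeze(2).expand(B, T, T, H).reshape(-1, H)
        dst = self.dst_proj(h).unsqueeze(1).expand(B, T, T, H).reshape(-1, H)
        edge_logits = self.edge_head(src, dst).view(B, T, T, -1)       # (B, T, T, n_edge_labels)
        return tag_logits, edge_logits  # summed cross-entropy losses give the multi-task objective
```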
Visual Semantic Parsing: From Images to Abstract Meaning Representation
Mohamed Ashraf Abdelsalam, Zhan Shi, Federico Fancellu, Kalliopi Basioti, Dhaivat J Bhatt, Vladimir Pavlovic, Afsaneh Fazly
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., an image) into a structured representation, where entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. In addition, these formalisms remain limited in the types of entities and relations they can capture. In this paper, we propose to leverage a widely used meaning representation from the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from the visual input. Moreover, they allow us to generate meta-AMR graphs that unify the information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.
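For intuition, a simplified caption-then-parse pipeline that produces AMR graphs from image descriptions (a stand-in for illustration, not the paper's direct image-to-AMR model; it assumes the amrlib package's text-to-AMR interface, and the captions are made up):

```python
# pip install amrlib (plus a downloaded sentence-to-graph model)
import amrlib

stog = amrlib.load_stog_model()                    # pretrained sentence-to-AMR parser
captions = [
    "A woman is slicing a tomato on a cutting board.",
    "Someone cuts vegetables in a kitchen.",
]
graphs = stog.parse_sents(captions)                # one Penman-format AMR string per caption
for g in graphs:
    print(g)
# Merging these per-caption AMRs into a single meta-AMR graph is the unification step
# described in the paper; parsing directly from visual input replaces the captioning stage.
```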
Flow graph to video grounding for weakly-supervised multi-step localization
Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, Allan D Jepson
ECCV 2022
In this work, we consider the problem of weakly-supervised multi-step localization in instructional videos. An established approach to this problem is to rely on a given list of steps. In reality, however, there is often more than one way to execute a procedure successfully, by following the set of steps in slightly varying orders. Thus, for successful localization in a given video, recent works require the actual order of procedure steps in the video to be provided by human annotators at both training and test time. Instead, here, we rely only on generic procedural text that is not tied to a specific video. We represent the various ways of completing the procedure by transforming the list of instructions into a procedure flow graph that captures the partial order of steps. Using flow graphs reduces both training- and test-time annotation requirements. To this end, we introduce the new problem of flow graph to video grounding: we seek the optimal step ordering that is consistent with the procedure flow graph and a given video. To solve this problem, we propose a new algorithm, Graph2Vid, that infers the actual ordering of steps in the video and simultaneously localizes them. To show the advantage of our proposed formulation, we extend the CrossTask dataset with procedure flow graph information. Our experiments show that Graph2Vid is both more efficient than the baselines and yields strong step localization results, without the need for step order annotation.
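To make the grounding objective concrete, here is a brute-force stand-in (not the paper's efficient Graph2Vid algorithm): enumerate step orderings consistent with the flow graph's partial order, align each to the video with plain DTW over a step-vs-frame cost matrix, and keep the cheapest; the graph and cost matrix are assumed inputs.

```python
import itertools
import numpy as np

def topological_orders(steps, edges):
    """All step orderings consistent with the flow graph's partial order (exponential; toy only)."""
    for perm in itertools.permutations(steps):
        pos = {s: i for i, s in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in edges):
            yield list(perm)

def dtw_cost(cost):
    """Monotonic alignment cost of an ordered step list to the frames (plain DTW)."""
    n_steps, n_frames = cost.shape
    acc = np.full((n_steps + 1, n_frames + 1), np.inf)
    acc[0, :] = 0.0                                   # the first step may start at any frame
    for i in range(1, n_steps + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1], acc[i, j - 1])
    return acc[n_steps, 1:].min()                     # trailing frames may be left unassigned

def ground_flow_graph(steps, edges, step_frame_cost):
    """steps: list of step ids; edges: (a, b) pairs meaning a must precede b;
    step_frame_cost: (n_steps, n_frames) array with rows in `steps` order."""
    return min(topological_orders(steps, edges),
               key=lambda order: dtw_cost(step_frame_cost[[steps.index(s) for s in order]]))
```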
f-Cal: Aleatoric uncertainty quantification for robot perception via calibrated neural regression
Dhaivat Bhatt, Kaustubh Mani, Dishank Bansal, Krishna Murthy, Hanju Lee, Liam Paull
ICRA 2022
While modern deep neural networks are performant perception modules, performance (accuracy) alone is insufficient, particularly for safety-critical robotic applications such as self-driving vehicles. Robot autonomy stacks also require these otherwise black-box models to produce reliable and calibrated measures of confidence in their predictions. Existing approaches estimate uncertainty from these neural network perception stacks by modifying network architectures, inference procedures, or loss functions. In general, however, these methods lack calibration, meaning that the predictive uncertainties do not faithfully represent the true underlying uncertainties (process noise). Our key insight is that calibration is only achieved by imposing constraints across multiple examples, such as those in a mini-batch, as opposed to existing approaches, which only impose constraints per sample and often lead to overconfident (thus miscalibrated) uncertainty estimates. By minimizing an f-divergence between the distribution of a neural network's outputs and a target distribution, we obtain significantly better-calibrated models than prior approaches. Our approach, f-Cal, outperforms existing uncertainty calibration approaches on robot perception tasks such as object detection and monocular depth estimation over multiple real-world benchmarks.
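A minimal sketch of the batch-level idea, under simplifying assumptions (Gaussian outputs, a KL divergence standing in for the f-divergence, and an illustrative weight lam); the paper's exact loss and target distribution may differ:

```python
import torch

def calibrated_regression_loss(mu, sigma, y, lam=0.1):
    """Per-sample Gaussian NLL plus a batch-level calibration penalty that pushes the
    normalized residuals z = (y - mu) / sigma toward a standard normal distribution."""
    z = (y - mu) / sigma
    nll = 0.5 * z ** 2 + torch.log(sigma)             # Gaussian NLL (up to a constant)
    m, v = z.mean(), z.var(unbiased=False)
    kl = 0.5 * (v + m ** 2 - 1.0 - torch.log(v))      # KL( N(m, v) || N(0, 1) )
    return nll.mean() + lam * kl                      # the constraint acts across the mini-batch
```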
Probabilistic object detection: Strengths, weaknesses, opportunities
Dhaivat Bhatt, Dishank Bansal, Gunshi Gupta, Hanju Lee, Krishna Murthy Jatavallabhula, Liam Paull
Workshop on AI for Autonomous Driving at the International Conference on Machine Learning (ICML) - 2020
Deep neural networks are the de-facto standard for object detection in autonomous driving applications. However, neural networks cannot be blindly trusted even within the training data distribution, let alone outside it. This has paved the way for several probabilistic object detection techniques that measure uncertainty in the outputs of an object detector. Through this position paper, we serve three main purposes. First, we briefly sketch the landscape of current methods for probabilistic object detection. Second, we present the main shortcomings of these approaches. Finally, we present promising avenues for future research, and proof-of-concept results where applicable. Through this effort, we hope to bring the community one step closer to performing accurate, reliable, and consistent probabilistic object detection. A project page for this work can be found at montrealrobotics.ca/probod
Maplite: Autonomous intersection navigation without a detailed prior map
Teddy Ort, Krishna Murthy, Rohan Banerjee, Sai Krishna Gottipati, Dhaivat Bhatt, Igor Gilitschenski, Liam Paull, Daniela Rus
IEEE Robotics and Automation Letters
In this work, we present MapLite: a one-click autonomous navigation system capable of piloting a vehicle to an arbitrary desired destination point given only a sparse publicly available topometric map (from OpenStreetMap). The onboard sensors are used to segment the road region and register the topometric map in order to fuse the high-level navigation goals with a variational path planner in the vehicle frame. This enables the system to plan trajectories that correctly navigate road intersections without the use of an external localization system such as GPS or a detailed prior map. Since the topometric maps already exist for the vast majority of roads, this solution greatly increases the geographical scope for autonomous mobility solutions. We implement MapLite on a full-scale autonomous vehicle and exhaustively test it on over 15 km of road including over 100 autonomous intersection traversals. We further extend these results through simulated testing to validate the system on complex road junction topologies such as traffic circles.
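As a rough illustration of how a sparse topometric route can feed a local planner (a hypothetical helper, not the MapLite codebase; the look-ahead heuristic and frame conventions are assumptions):

```python
import numpy as np

def local_goal_in_vehicle_frame(route_xy, vehicle_pose, lookahead=20.0):
    """route_xy: (N, 2) OSM route polyline in world coordinates; vehicle_pose: (x, y, yaw).
    Returns a look-ahead goal point in the vehicle frame, which a local (e.g., variational)
    planner can drive toward while staying inside the segmented road region."""
    x, y, yaw = vehicle_pose
    R = np.array([[np.cos(yaw),  np.sin(yaw)],
                  [-np.sin(yaw), np.cos(yaw)]])               # world -> vehicle rotation
    local = (route_xy - np.array([x, y])) @ R.T
    ahead = local[local[:, 0] > 0.0]                          # keep points in front of the car
    if len(ahead) == 0:
        return None
    d = np.linalg.norm(ahead, axis=1)
    return ahead[np.argmin(np.abs(d - lookahead))]            # point closest to the look-ahead radius
```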
Probabilistic obstacle avoidance and object following: An overlap of gaussians approach
Dhaivat Bhatt, Akash Garg, Bharath Gopalakrishnan, K Madhava Krishna
2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Autonomous navigation and obstacle avoidance are core capabilities that enable robots to execute tasks in the real world. We propose a new approach to collision avoidance that accounts for uncertainty in the states of the agent and the obstacles. We first demonstrate that measures of entropy, used in current approaches for uncertainty-aware obstacle avoidance, are an inappropriate design choice. We then propose an algorithm that solves for an optimal control sequence with a guaranteed risk bound, using a measure of overlap between the two distributions that represent the state of the robot and the obstacle, respectively. Furthermore, we provide closed-form expressions that characterize this overlap as a function of the control input. The proposed approach enables a model-predictive control framework to generate bounded-confidence control commands. An extensive set of simulations has been conducted in various constrained environments to demonstrate the efficacy of the proposed approach over the prior art. We demonstrate the usefulness of the proposed scheme in tight spaces, where computing risk-sensitive control maneuvers is vital. We also show how this framework generalizes to other problems, such as object following.
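One standard closed-form overlap measure between two Gaussians is the Bhattacharyya coefficient; the sketch below uses it purely for illustration (the paper's specific overlap measure and risk bound may be defined differently):

```python
import numpy as np

def gaussian_overlap(mu1, cov1, mu2, cov2):
    """Bhattacharyya coefficient between N(mu1, cov1) and N(mu2, cov2), in (0, 1];
    a larger value means more overlap between robot and obstacle state distributions."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    d_b = 0.125 * diff @ np.linalg.solve(cov, diff) \
        + 0.5 * np.log(np.linalg.det(cov) / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return np.exp(-d_b)

# In an MPC setting, the planner would keep gaussian_overlap(...) below a chosen risk
# bound for every predicted robot/obstacle state pair along the planning horizon.
```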
Deep active localization
Sai Krishna, Keehong Seo, Dhaivat Bhatt, Vincent Mai, Krishna Murthy, Liam Paull
IEEE Robotics and Automation Letters
Active localization is the problem of generating robot actions that allow the robot to maximally disambiguate its pose within a reference map. Traditional approaches use an information-theoretic criterion for action selection together with hand-crafted perceptual models. In this work we propose an end-to-end differentiable method for learning to take informative actions that is trainable entirely in simulation and then transferable to real robot hardware with zero refinement. The system is composed of two modules: a convolutional neural network for perception, and a planning module trained with deep reinforcement learning. We introduce a multi-scale approach to the learned perceptual model, since the accuracy needed for action selection with reinforcement learning is much lower than the accuracy needed for robot control. We demonstrate that the resulting system outperforms variants that use the traditional approach for either perception or planning. We also demonstrate our approach's robustness to different map configurations and other nuisance parameters through the use of domain randomization during training. The code is compatible with the OpenAI Gym framework, as well as the Gazebo simulator.
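A toy sketch of the two-module split (the architecture, sizes, and the entropy-reduction reward below are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class ActiveLocPolicy(nn.Module):
    """Small CNN policy mapping the current pose-belief grid to discrete motion actions."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_actions),
        )

    def forward(self, belief):            # belief: (B, 1, H, W), a normalized pose distribution
        return self.net(belief)           # action logits for the RL planner

def entropy_reduction_reward(belief_before, belief_after, eps=1e-9):
    """Reward an action by how much it disambiguates the pose (drop in belief entropy)."""
    ent = lambda b: -(b * (b + eps).log()).sum(dim=(-1, -2, -3))
    return ent(belief_before) - ent(belief_after)
```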
Have I reached the intersection: A deep learning-based approach for intersection detection from monocular cameras
Dhaivat Bhatt, Danish Sodhi, Arghya Pal, Vineeth Balasubramanian, Madhava Krishna
2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Long short-term memory (LSTM) models have shown considerable performance on a variety of problems dealing with sequential data. In this paper, we propose a variant of the Long-term Recurrent Convolutional Network (LRCN) to detect road intersections, which we call IntersectNet. We pose road intersection detection as a binary classification task over a sequence of frames. The model combines a deep hierarchical visual feature extractor with a recurrent sequence model. It is end-to-end trainable and capable of capturing the temporal dynamics of the system, and we exploit this capability to identify road intersections in a sequence of temporally consistent images. The model has been rigorously trained and tested on several datasets. We believe our findings could be useful for modeling the behavior of autonomous agents in the real world.
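A compact LRCN-style sketch of the architecture described above (the backbone, feature size, and use of the final hidden state are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class IntersectionClassifier(nn.Module):
    """Per-frame CNN features -> LSTM over the clip -> binary intersection / no-intersection logit."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)      # any per-frame feature extractor works here
        backbone.fc = nn.Identity()            # expose the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, 1)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, T, -1)   # (B, T, 512)
        out, _ = self.lstm(feats)
        return self.cls(out[:, -1])            # logit for the whole clip
```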
CRF based method for curb detection using semantic cues and stereo depth
Danish Sodhi, Sarthak Upadhyay, Dhaivat Bhatt, K Madhava Krishna, Shanti Swarup
Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing
Curb detection is a critical component of driver assistance and autonomous driving systems. In this paper, we present a discriminative approach to the problem of curb detection under diverse road conditions. We define curbs as the intersection of drivable and non-drivable areas, which are classified using dense conditional random fields (CRFs). In our method, we fuse the output of a neural network used for pixel-wise semantic segmentation with depth and color information from stereo cameras. The CRF combines the output of the deep model with the height information available in the stereo data and provides improved segmentation. Further, we introduce temporal smoothness by using a weighted average of the SegNet output and the output of a probabilistic voxel grid as our unary potential. Finally, we show improvements over current state-of-the-art neural networks. Our proposed method yields accurate results over a large range of variations in curb curvature and appearance, without the need to retrain the model for a specific dataset.
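A rough sketch of the unary-fusion step followed by dense-CRF refinement (the mixing weight, CRF parameters, and the use of the pydensecrf package are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_drivable(segnet_prob, voxel_prob, rgb, w=0.7, iters=5):
    """segnet_prob, voxel_prob: (2, H, W) probabilities for {non-drivable, drivable};
    rgb: (H, W, 3) uint8 image. Returns a refined per-pixel drivable-area label map."""
    fused = w * segnet_prob + (1.0 - w) * voxel_prob          # temporally smoothed unary
    fused /= fused.sum(axis=0, keepdims=True)
    H, W = rgb.shape[:2]
    crf = dcrf.DenseCRF2D(W, H, 2)
    crf.setUnaryEnergy(unary_from_softmax(fused.astype(np.float32)))
    crf.addPairwiseGaussian(sxy=3, compat=3)                  # spatial smoothness
    crf.addPairwiseBilateral(sxy=60, srgb=13, rgbim=np.ascontiguousarray(rgb), compat=10)
    q = np.array(crf.inference(iters)).reshape(2, H, W)
    return q.argmax(axis=0)                                   # 1 = drivable; curbs lie on the boundary
```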