ICVGIP'20 Invited Talk: An Overview of Multimodal Embeddings and Their Applications in Vision-Language Tasks and Beyond

This page provides information about the ICVGIP'20 invited talk I will be presenting on 19th December 2020.

Bio: Dr. Karan Sikka is an Advanced Computer Scientist at the Center for Vision Technologies, SRI International, Princeton, USA. He received his PhD in 2016 from the Machine Perception Lab at UCSD, where he was advised by Dr. Marian Bartlett. Before joining UCSD, he completed his bachelor's degree in ECE at the Indian Institute of Technology Guwahati in 2010. His current research focuses on fundamental problems in Computer Vision and Machine Learning, such as learning with multiple modalities (including vision and language), learning under weak supervision, and few/zero-shot learning. He has successfully applied these methods to problems such as facial expression recognition, action recognition, object detection, visual grounding, and visual localization. The underlying theme of his research has been to improve the generalization of Computer Vision models by providing useful inductive biases in the model design (e.g., better features or interactions), the data (augmentation with knowledge or multiple modalities), or the loss function (e.g., weakly supervised learning), in ways that are applicable across multiple domains. His work has been published at high-quality venues such as CVPR, ECCV, ICCV, and PAMI. He won a best paper honorable mention award at IEEE Face and Gesture 2013 and a best paper award at the Emotion Recognition in the Wild Workshop at ICMI 2013. He serves as a reviewer/program-committee member for venues such as ECCV, CVPR, ICCV, ICML, NeurIPS, AAAI, ACCV, IJCV, IEEE TIP, IEEE TAC, IEEE TM, ICMI, and AFGR, and as an Area Chair for ACM Multimedia 2019 and 2020.
At SRI, he is a co-PI on several government-funded programs (ONR CEROSS, DARPA M3I, and AFRL Mesa) related to understanding and analyzing social media content across multiple modalities and user structures. His work has resulted in the MatchStax API, which allows seamless matching of content with users and of content with content through unsupervised embeddings (paper and demo). He has also been working on incorporating human knowledge, expressed in first-order logic, into neural networks (here). You can get a glimpse of his work in a short interview recorded at SRI. He actively collaborates with researchers from both industry and universities, believing that excellent research cannot be done in isolation.

Abstract: In the last few years, we have witnessed significant progress on learning tasks involving vision and language, such as Visual Question Answering and Phrase Grounding. This growth has been fueled by advances in deep learning in the vision and language domains and by the availability of large-scale multimodal data. In this talk I will focus on building learning models that can jointly understand and reason about multiple modalities. I will begin with an introduction to multimodal embeddings, where the key idea is to align two modalities by embedding them in a common vector space. This alignment ensures that similar entities from both modalities lie closer together, e.g., the word "cow" and images of the concept "cow". I will then discuss some of our past work on using these embeddings to solve problems such as zero-shot recognition [1, 2] and phrase grounding [3]. I will also cover applications of multimodal embeddings to tasks beyond vision-language, such as social media analysis [4] and visual localization [5, 6]. Finally, I will briefly discuss recent work on multimodal transformers [7, 8], which have shown large performance improvements on vision-language tasks by relying on the ability of transformers to learn strong correlations through multi-head attention, large model capacity, large-scale data, and pre-training strategies. I hope this talk will provide a basic introduction to the related concepts and encourage researchers to work in this field.
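To make the alignment idea concrete, here is a minimal, illustrative sketch (not the method from any specific paper above) of how paired image/text embeddings can be aligned in a common vector space with a symmetric contrastive (InfoNCE-style) objective, using only NumPy. All names and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text
    embeddings: each matched pair (the diagonal of the batch similarity
    matrix) should score higher than every mismatched pair."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    diag = np.arange(len(logits))  # item i in one modality matches item i in the other

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # for numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: 4 paired embeddings in a hypothetical shared 8-d space.
img = rng.normal(size=(4, 8))
txt_aligned = img + 0.1 * rng.normal(size=(4, 8))  # text close to its paired image
txt_random = rng.normal(size=(4, 8))               # unrelated text

loss_aligned = contrastive_alignment_loss(img, txt_aligned)
loss_random = contrastive_alignment_loss(img, txt_random)
print(f"aligned: {loss_aligned:.4f}  random: {loss_random:.4f}")
```

Minimizing such a loss over the encoders that produce the embeddings pulls matched image/text pairs together and pushes mismatched pairs apart, which is the alignment property the talk builds on.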

Please check this page for updates as I will be providing more information about the relevant topics and papers.

  1. Sikka, K., Huang, J., Silberfarb, A., Nayak, P., Rohrer, L., Sahu, P., Byrnes, J., Divakaran, A., Rohwer, R. (2020). Zero-Shot Learning with Knowledge Enhanced Visual Semantic Embeddings. [arXiv preprint]
  2. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A. (2018). Zero-Shot Object Detection. European Conference on Computer Vision (ECCV). [arXiv preprint] [project webpage]
  3. Datta, S., Sikka, K., Roy, A., Ahuja, K., Parikh, D., Divakaran, A. (2019). Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. International Conference on Computer Vision (ICCV). [arXiv preprint]
  4. Sikka, K., Bramer, L., Divakaran, A. (2019). Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks. [arXiv preprint]
  5. Seymour, Z., Sikka, K., Chiu, H., Samarasekera, S., Kumar, T. (2019). Semantically-Aware Attentive Neural Embeddings for Image-based Visual Localization. British Machine Vision Conference (BMVC). [arXiv preprint]
  6. Mithun, N., Sikka, K., Chiu, H., Samarasekera, S., Kumar, R. (2020). RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization. ACM Multimedia (ACMM). (Best Paper Candidate) [Paper] [Slides]
  7. Tan, H., Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers. [arXiv preprint arXiv:1908.07490]
  8. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J. (2019). VL-BERT: Pre-training of Generic Visual-Linguistic Representations. [arXiv preprint arXiv:1908.08530]