Multimodal Learning

Zero-shot object detection, visual grounding, and cross-modal understanding

Multimodal learning studies how different types of data (vision, language, etc.) can be combined to improve machine learning systems. This research focuses on developing approaches that leverage multiple modalities to solve challenging computer vision and language understanding tasks.

Research Areas

Zero-Shot Object Detection

Developing methods that can detect objects from categories not seen during training, leveraging semantic relationships and language descriptions to generalize to new concepts.
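As a rough illustration of this idea, the sketch below scores detector region proposals against semantic embeddings of class names, so a region can be assigned to a category never seen during training. The tensor shapes, the name `visual_to_semantic`, and the random placeholder features are illustrative assumptions, not the formulation from the ECCV 2018 paper.

```python
# Minimal sketch: score detector region proposals against class-name embeddings,
# so categories unseen during training can still be recognized at test time.
# All tensors here are random placeholders standing in for real model outputs.
import torch
import torch.nn.functional as F

num_regions, feat_dim, embed_dim = 10, 2048, 300

# Visual features for region proposals (e.g. from a detector backbone).
region_features = torch.randn(num_regions, feat_dim)

# Semantic embeddings of class names (e.g. word vectors); the list can include
# classes the detector never saw during training.
class_names = ["dog", "cat", "zebra", "unicycle"]
class_embeddings = torch.randn(len(class_names), embed_dim)

# A learned projection maps visual features into the semantic space.
visual_to_semantic = torch.nn.Linear(feat_dim, embed_dim)

projected = F.normalize(visual_to_semantic(region_features), dim=-1)
semantic = F.normalize(class_embeddings, dim=-1)

# Cosine similarity between each region and each class name acts as the
# detection score; the argmax gives the predicted (possibly unseen) category.
scores = projected @ semantic.t()                # (num_regions, num_classes)
predicted_class = scores.argmax(dim=-1)
print([class_names[i] for i in predicted_class.tolist()])
```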

Key Publications:

  • Zero-Shot Object Detection (ECCV 2018)

Visual Grounding

Creating systems that can localize objects and regions in images based on natural language descriptions, bridging the gap between vision and language understanding.
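The sketch below shows a generic similarity-based formulation of phrase grounding, assuming precomputed region features and a phrase embedding projected into a shared space. It is an illustrative baseline for region-phrase matching, not the Align2Ground training objective.

```python
# Minimal sketch: ground a phrase by picking the image region whose embedding
# is most similar to the phrase embedding. Boxes and features are placeholders
# for real detector / text-encoder outputs.
import torch
import torch.nn.functional as F

num_regions, region_dim, text_dim, joint_dim = 6, 2048, 768, 256

region_features = torch.randn(num_regions, region_dim)   # pooled box features
region_boxes = torch.rand(num_regions, 4)                # (x1, y1, x2, y2), normalized
phrase_embedding = torch.randn(1, text_dim)              # e.g. "the red umbrella"

# Project both modalities into a shared joint space.
visual_proj = torch.nn.Linear(region_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

v = F.normalize(visual_proj(region_features), dim=-1)
t = F.normalize(text_proj(phrase_embedding), dim=-1)

# The phrase is grounded to the highest-scoring region.
similarity = (v @ t.t()).squeeze(-1)      # (num_regions,)
best_region = similarity.argmax().item()
print("grounded box:", region_boxes[best_region].tolist())
```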

Key Work:

  • Align2Ground: Weakly Supervised Phrase Grounding (ICCV 2019)

Cross-Modal Applications

Applying multimodal learning to diverse domains:

  • Social Media Analysis: Understanding content by combining visual and textual signals
  • Geo-localization: Using multiple modalities to determine geographic locations from images
  • Few-Shot Learning: Enabling models to learn from limited examples by leveraging cross-modal knowledge

Technical Approach

Our research employs the following techniques (a minimal joint-embedding sketch appears after the list):

  1. Semantic Embeddings: Learning joint representations across modalities
  2. Weak Supervision: Developing methods that reduce annotation requirements
  3. Knowledge Transfer: Leveraging pre-trained models and external knowledge sources
  4. Attention Mechanisms: Focusing on relevant cross-modal correspondences
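The sketch below gives a minimal example of learning the kind of joint representation described in item 1, using image-caption pairs and a symmetric contrastive loss. The encoders, dimensions, and temperature value are illustrative assumptions rather than details of any specific paper.

```python
# Minimal sketch: learn a joint image-text embedding from paired data with a
# symmetric contrastive (InfoNCE-style) loss. Only image-caption pairs are
# assumed, i.e. the weak-supervision setting; no box or phrase labels.
import torch
import torch.nn.functional as F

batch, image_dim, text_dim, joint_dim = 32, 2048, 768, 256

image_features = torch.randn(batch, image_dim)   # e.g. CNN global features
text_features = torch.randn(batch, text_dim)     # e.g. caption encoder outputs

image_proj = torch.nn.Linear(image_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

img = F.normalize(image_proj(image_features), dim=-1)
txt = F.normalize(text_proj(text_features), dim=-1)

# Similarity of every image to every caption in the batch; matching pairs
# lie on the diagonal and serve as the positive targets.
temperature = 0.07
logits = img @ txt.t() / temperature
targets = torch.arange(batch)

loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
loss.backward()   # gradients flow into both projection heads
print(loss.item())
```

Because only paired images and captions are required, this setup also illustrates the weak-supervision point in item 2: fine-grained region or phrase annotations are not needed to train the joint space.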

Impact

This work has contributed to:

  • More flexible object detection systems
  • Improved image-text understanding
  • Practical applications requiring minimal training data
  • Cross-domain knowledge transfer techniques
