Multimodal Learning

Zero-shot object detection, visual grounding, and cross-modal understanding

Multimodal learning studies how different types of data (vision, language, etc.) can be combined to improve machine learning systems. This research focuses on developing approaches that leverage multiple modalities to solve challenging computer vision and language understanding tasks.

Research Areas

Zero-Shot Object Detection

Developing methods that can detect objects from categories not seen during training, leveraging semantic relationships and language descriptions to generalize to new concepts.
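As a rough illustration of this idea, the sketch below scores detector region proposals against semantic embeddings of class names, so a region can be assigned to a category never seen during training. The tensor shapes, the name `visual_to_semantic`, and the random placeholder features are illustrative assumptions, not the formulation from the ECCV 2018 paper.

```python
# Minimal sketch: score detector region proposals against class-name embeddings,
# so categories unseen during training can still be recognized at test time.
# All tensors here are random placeholders standing in for real model outputs.
import torch
import torch.nn.functional as F

num_regions, feat_dim, embed_dim = 10, 2048, 300

# Visual features for region proposals (e.g. from a detector backbone).
region_features = torch.randn(num_regions, feat_dim)

# Semantic embeddings of class names (e.g. word vectors); the list can include
# classes the detector never saw during training.
class_names = ["dog", "cat", "zebra", "unicycle"]
class_embeddings = torch.randn(len(class_names), embed_dim)

# A learned projection maps visual features into the semantic space.
visual_to_semantic = torch.nn.Linear(feat_dim, embed_dim)

projected = F.normalize(visual_to_semantic(region_features), dim=-1)
semantic = F.normalize(class_embeddings, dim=-1)

# Cosine similarity between each region and each class name acts as the
# detection score; the argmax gives the predicted (possibly unseen) category.
scores = projected @ semantic.t()                # (num_regions, num_classes)
predicted_class = scores.argmax(dim=-1)
print([class_names[i] for i in predicted_class.tolist()])
```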

Key Publications:

  • Zero-Shot Object Detection (ECCV 2018)

Visual Grounding

Creating systems that can localize objects and regions in images based on natural language descriptions, bridging the gap between vision and language understanding.
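The sketch below shows a generic similarity-based formulation of phrase grounding, assuming precomputed region features and a phrase embedding projected into a shared space. It is an illustrative baseline for region-phrase matching, not the Align2Ground training objective.

```python
# Minimal sketch: ground a phrase by picking the image region whose embedding
# is most similar to the phrase embedding. Boxes and features are placeholders
# for real detector / text-encoder outputs.
import torch
import torch.nn.functional as F

num_regions, region_dim, text_dim, joint_dim = 6, 2048, 768, 256

region_features = torch.randn(num_regions, region_dim)   # pooled box features
region_boxes = torch.rand(num_regions, 4)                # (x1, y1, x2, y2), normalized
phrase_embedding = torch.randn(1, text_dim)              # e.g. "the red umbrella"

# Project both modalities into a shared joint space.
visual_proj = torch.nn.Linear(region_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

v = F.normalize(visual_proj(region_features), dim=-1)
t = F.normalize(text_proj(phrase_embedding), dim=-1)

# The phrase is grounded to the highest-scoring region.
similarity = (v @ t.t()).squeeze(-1)      # (num_regions,)
best_region = similarity.argmax().item()
print("grounded box:", region_boxes[best_region].tolist())
```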

Key Work:

  • Align2Ground: Weakly Supervised Phrase Grounding (ICCV 2019)

Cross-Modal Applications

Applying multimodal learning to diverse domains:

  • Social Media Analysis: Understanding content by combining visual and textual signals
  • Geo-localization: Using multiple modalities to determine geographic locations from images
  • Few-Shot Learning: Enabling models to learn from limited examples by leveraging cross-modal knowledge

Technical Approach

Our research employs the following techniques (a minimal joint-embedding sketch appears after the list):

  1. Semantic Embeddings: Learning joint representations across modalities
  2. Weak Supervision: Developing methods that reduce annotation requirements
  3. Knowledge Transfer: Leveraging pre-trained models and external knowledge sources
  4. Attention Mechanisms: Focusing on relevant cross-modal correspondences
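The sketch below gives a minimal example of learning the kind of joint representation described in item 1, using image-caption pairs and a symmetric contrastive loss. The encoders, dimensions, and temperature value are illustrative assumptions rather than details of any specific paper.

```python
# Minimal sketch: learn a joint image-text embedding from paired data with a
# symmetric contrastive (InfoNCE-style) loss. Only image-caption pairs are
# assumed, i.e. the weak-supervision setting; no box or phrase labels.
import torch
import torch.nn.functional as F

batch, image_dim, text_dim, joint_dim = 32, 2048, 768, 256

image_features = torch.randn(batch, image_dim)   # e.g. CNN global features
text_features = torch.randn(batch, text_dim)     # e.g. caption encoder outputs

image_proj = torch.nn.Linear(image_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

img = F.normalize(image_proj(image_features), dim=-1)
txt = F.normalize(text_proj(text_features), dim=-1)

# Similarity of every image to every caption in the batch; matching pairs
# lie on the diagonal and serve as the positive targets.
temperature = 0.07
logits = img @ txt.t() / temperature
targets = torch.arange(batch)

loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
loss.backward()   # gradients flow into both projection heads
print(loss.item())
```

Because only paired images and captions are required, this setup also illustrates the weak-supervision point in item 2: fine-grained region or phrase annotations are not needed to train the joint space.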

Impact

This work has contributed to:

  • More flexible object detection systems
  • Improved image-text understanding
  • Practical applications requiring minimal training data
  • Cross-domain knowledge transfer techniques
