Multimodal Learning
Zero-shot object detection, visual grounding, and cross-modal understanding
Multimodal learning explores how different types of data (vision, language, etc.) can be combined to enhance machine learning systems. This research focuses on developing approaches that leverage multiple modalities to solve challenging computer vision and language understanding tasks.
Research Areas
Zero-Shot Object Detection
Developing methods that can detect objects from categories not seen during training, leveraging semantic relationships and language descriptions to generalize to new concepts.
Key Publications:
- Zero-Shot Object Detection (ECCV 2018)
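The core idea of embedding-based zero-shot detection can be illustrated with a small sketch: region features are projected into the same semantic space as class-name embeddings, so a region can be scored against classes never seen during training by cosine similarity. The class names, dimensions, and random features below are purely illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one semantic vector per class name (in a real system
# these would come from word embeddings such as word2vec or GloVe).
class_names = ["dog", "cat", "zebra"]   # "zebra" stands in for an unseen class
class_embeds = rng.normal(size=(3, 50))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

def classify_region(visual_feat, class_embeds):
    """Score a region feature against all class embeddings by cosine similarity."""
    v = visual_feat / np.linalg.norm(visual_feat)
    scores = class_embeds @ v
    return scores, int(np.argmax(scores))

# A region feature already projected into the shared 50-d semantic space
# (a learned projection in a real detector; simulated with noise here).
region_feat = class_embeds[2] + 0.1 * rng.normal(size=50)
scores, pred = classify_region(region_feat, class_embeds)
# The region is assigned to a class, seen or unseen, via its name embedding.
```

Because scoring happens in the semantic space rather than over a fixed classifier head, adding a new category only requires its name embedding, not retraining.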
Visual Grounding
Creating systems that can localize objects and regions in images based on natural language descriptions, bridging the gap between vision and language understanding.
Key Work:
- Align2Ground: Weakly Supervised Phrase Grounding (ICCV 2019)
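A minimal sketch of phrase grounding, under the common formulation of attending over candidate region features with a phrase embedding in a shared space (the encoders, dimensions, and data below are assumed for illustration, not the specifics of Align2Ground):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: 4 candidate region features and one phrase feature,
# both already mapped into a shared 32-d space by learned encoders.
region_feats = rng.normal(size=(4, 32))
region_feats /= np.linalg.norm(region_feats, axis=1, keepdims=True)
phrase_feat = region_feats[1] + 0.05 * rng.normal(size=32)  # phrase matches region 1
phrase_feat /= np.linalg.norm(phrase_feat)

def ground_phrase(phrase_feat, region_feats):
    """Softly attend over regions; return attention weights and the best region index."""
    sims = region_feats @ phrase_feat          # cosine similarity per region
    attn = np.exp(sims) / np.exp(sims).sum()   # softmax attention over regions
    return attn, int(np.argmax(attn))

attn, best = ground_phrase(phrase_feat, region_feats)
```

The soft attention weights make the region selection differentiable, which is what allows weakly supervised training from image-caption pairs without box-level annotations.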
Cross-Modal Applications
Applying multimodal learning to diverse domains:
- Social Media Analysis: Understanding content by combining visual and textual signals
- Geo-localization: Using multiple modalities to determine geographic locations from images
- Few-Shot Learning: Enabling models to learn from limited examples by leveraging cross-modal knowledge
Technical Approach
Our research employs:
- Semantic Embeddings: Learning joint representations across modalities
- Weak Supervision: Developing methods that reduce annotation requirements
- Knowledge Transfer: Leveraging pre-trained models and external knowledge sources
- Attention Mechanisms: Focusing on relevant cross-modal correspondences
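The first of these ingredients, learning joint representations across modalities, is often trained with a symmetric contrastive objective that pulls matched image-text pairs together and pushes mismatched pairs apart. The sketch below is one standard formulation (an InfoNCE-style loss with a temperature parameter), not necessarily the exact objective used in this research; batch size, dimensions, and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def info_nce(img_embeds, txt_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    img = img_embeds / np.linalg.norm(img_embeds, axis=1, keepdims=True)
    txt = txt_embeds / np.linalg.norm(txt_embeds, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # pairwise similarity matrix
    labels = np.arange(len(img))         # matched pairs sit on the diagonal

    def xent(l):
        # numerically stable cross-entropy toward the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

batch = rng.normal(size=(8, 64))
loss_aligned = info_nce(batch, batch + 0.01 * rng.normal(size=(8, 64)))  # near-matched pairs
loss_random = info_nce(batch, rng.normal(size=(8, 64)))                  # unrelated pairs
```

Well-aligned pairs yield a lower loss than random pairings, which is the signal that shapes the joint embedding space the other three ingredients (weak supervision, knowledge transfer, attention) build on.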
Impact
This work has contributed to:
- More flexible object detection systems
- Improved image-text understanding
- Practical applications requiring minimal training data
- Cross-domain knowledge transfer techniques