SRI International

Center for Vision Technologies

Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, Ajay Divakaran

Some examples of intent classification of Instagram media.

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might reflect ironically on the image, so neither the caption nor the image is a mere transcript of the other. Instead they combine, via what has been called meaning multiplication, to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. Our dataset offers an important resource for the study of the rich meanings that result from pairing text and image.

This work was presented at EMNLP 2019.

Comparisons of DCNN model experiments using the MDID Dataset

The table above shows results with different DCNN models in experiments with the MDID dataset: image-only (Img), text-only (Txt-emb and Txt-ELMo), and combined models (Img + Txt-emb and Img + Txt-ELMo). Here, emb denotes the model using standard word (token) embeddings, while ELMo denotes the model using pre-trained ELMo word embeddings (Peters et al., 2018).
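This page does not say how the combined models merge the two modalities, so the following is only a minimal sketch of one common option: concatenating an image feature vector with a text feature vector and applying a linear layer with a softmax over the eight intent classes. The function names, the feature dimensions, and the random toy features and weights are all assumptions made for illustration; in practice the features would come from an image encoder and a text encoder (word embeddings or ELMo).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(img_feat, txt_feat, weights, bias):
    """Concatenate image and text features, then apply a linear softmax layer.

    Concatenation fusion is an assumption of this sketch, not a description
    of the models reported in the table.
    """
    fused = list(img_feat) + list(txt_feat)
    for row in weights:
        if len(row) != len(fused):
            raise ValueError("weight row length must match fused feature length")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy setup: dimensions and values are hypothetical placeholders.
rng = random.Random(0)
NUM_INTENTS = 8          # eight intent classes in the taxonomy
IMG_DIM, TXT_DIM = 4, 3  # illustrative feature sizes only

img_feat = [rng.gauss(0, 1) for _ in range(IMG_DIM)]
txt_feat = [rng.gauss(0, 1) for _ in range(TXT_DIM)]
weights = [
    [rng.gauss(0, 0.1) for _ in range(IMG_DIM + TXT_DIM)]
    for _ in range(NUM_INTENTS)
]
bias = [0.0] * NUM_INTENTS

probs = fuse_and_classify(img_feat, txt_feat, weights, bias)
predicted_class = max(range(NUM_INTENTS), key=lambda i: probs[i])
```

An image-only or text-only variant of this sketch would pass an empty list for the other modality and use weight rows sized to the remaining features, which mirrors the single-modality rows of the comparison.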

MDID Dataset

Our dataset, MDID (the Multimodal Document Intent Dataset), consists of 1299 Instagram posts that we collected with the goal of developing a rich and diverse set of posts for each of the eight illocutionary types in our intent taxonomy. For each intent, we collected at least 16 hashtags or users likely to yield a high proportion of posts that could be labeled with that intent.

These are the three taxonomies used to classify the Instagram media in this dataset. The primary taxonomy is that of intent; the two secondary taxonomies evaluate the contextual and the semiotic relationships between the image and text modalities within the post.

Intent Taxonomy

  • advocative: advocate for a figure, idea, movement, etc.
  • promotive: promote events, products, organizations, etc.
  • exhibitionist: create a self-image for the user using selfies, pictures of belongings, etc.
  • expressive: express emotion, attachment, or admiration at an external entity or group.
  • informative: relay information regarding a subject or event using factual language.
  • entertainment: entertain using art, humor, memes, etc.
  • provocative/discrimination: directly attack an individual or group.
  • provocative/controversial: be shocking.

Semiotic Relationship

  • A divergent relationship occurs when the image and text semiotics pull in opposite directions, creating a gap between the meanings suggested by the image and text.
  • A parallel relationship occurs when the image and text independently contribute to the same meaning.
  • An additive relationship occurs when the image and text semiotics amplify or modify each other.

Contextual Relationship

  • Minimal Relationship: The literal meanings of the caption and image overlap very little. For example, a selfie of a person at a waterfall with the caption “selfie”.
  • Close Relationship: The literal meanings of the caption and the image overlap considerably. For example, a selfie of a person at a crowded waterfall with the caption “Selfie at Hemlock Falls on a crowded sunny day”.
  • Transcendent Relationship: The literal meaning of one modality picks up and expands on the literal meaning of the other. For example, a selfie of a person at a crowded waterfall with the caption “Selfie at Hemlock Falls on a sunny and crowded day. Hemlock Falls is a popular picnic spot. There are hiking and biking trails, and a great restaurant 3 miles down the road ...”.
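The three taxonomies above could be represented in code roughly as follows. This is a sketch only: the class names, the record fields, and the file path are hypothetical and do not reflect the format of the released dataset, and the intent and semiotic labels given to the example record are illustrative guesses rather than annotations from MDID.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    ADVOCATIVE = "advocative"
    PROMOTIVE = "promotive"
    EXHIBITIONIST = "exhibitionist"
    EXPRESSIVE = "expressive"
    INFORMATIVE = "informative"
    ENTERTAINMENT = "entertainment"
    PROVOCATIVE_DISCRIMINATION = "provocative/discrimination"
    PROVOCATIVE_CONTROVERSIAL = "provocative/controversial"


class Semiotic(Enum):
    DIVERGENT = "divergent"
    PARALLEL = "parallel"
    ADDITIVE = "additive"


class Contextual(Enum):
    MINIMAL = "minimal"
    CLOSE = "close"
    TRANSCENDENT = "transcendent"


@dataclass(frozen=True)
class LabeledPost:
    """One image-caption pair with a label from each taxonomy.

    Field names are hypothetical; the released files may be organized differently.
    """
    image_path: str
    caption: str
    intent: Intent
    semiotic: Semiotic
    contextual: Contextual


# The caption and the "close" contextual label follow the example in the list
# above; the intent and semiotic labels here are illustrative assumptions.
example = LabeledPost(
    image_path="posts/0001.jpg",  # placeholder path
    caption="Selfie at Hemlock Falls on a crowded sunny day",
    intent=Intent.EXHIBITIONIST,
    semiotic=Semiotic.PARALLEL,
    contextual=Contextual.CLOSE,
)
```

Keeping the three labels as separate fields reflects the description of the taxonomies as orthogonal: each post receives one label per taxonomy, and no label in one taxonomy constrains the others.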

The dataset can be downloaded here