Using Deep Neural Networks to Disentangle Visual and Semantic Information in Human Perception and Memory

Introduction

In cognitive science, how humans recognize people and objects during perception and from memory has long been a central question. Successful recognition depends on matching representations generated by the perceptual system against those stored in memory. These mental representations, however, are not exact replicas of the external world but reconstructions by the brain, and characterizing the content and process of this reconstruction has been a longstanding challenge. This paper uses deep neural networks (DNNs) to reveal the content of mental representations during the perception and memory of familiar faces and objects.

Paper Source

The paper was written by Adva Shoham, Idan Daniel Grossbard, Or Patashnik, Daniel Cohen-Or, and Galit Yovel, all of Tel Aviv University, and was published online in Nature Human Behaviour on February 8, 2024.

Research Background and Objective

Human mental representations contain both visual and semantic information, but disentangling the contributions of the two is difficult because they are typically intertwined in mental representations. Deep neural networks trained solely on images or solely on text can generate purely visual or purely semantic representations, providing a new way to separate these sources of information. This study uses such networks to quantify the contributions of visual, visual-semantic, and purely semantic information about familiar stimuli in perception and memory.

Research Methods

Experimental Design

The study employed three neural network models, a visual model (VGG-16), a visual-semantic model (CLIP), and a semantic model (SGPT), to predict human mental representations in perception and memory. The experiment proceeded in the following steps:

  1. Selection of Study Subjects:

    • Faces: 20 internationally known individuals, including political figures and entertainment celebrities.
    • Objects: 20 familiar objects.
  2. Model Training and Adjustment:

    • Visual Model (VGG-16): Trained on the VGGFace2 dataset and fine-tuned on the 20 familiar identities.
    • Visual-Semantic Model (CLIP): Jointly trained on 400 million image-text pairs collected from the internet.
    • Semantic Model (SGPT): A language-based embedding model applied to the first paragraph of each stimulus's Wikipedia entry.
  3. Similarity Scoring by Participants:

    • Visual Similarity: Participants rated the visual similarity of face and object images presented on screen.
    • Memory Reconstruction: Participants recalled the faces or objects from their names alone and rated their similarity.
  4. Data Analysis and Representational Geometry:

    • Computing pairwise dissimilarities between representations using cosine distance.
    • Constructing representational dissimilarity matrices (RDMs) and visualizing the resulting geometry with t-SNE (see the sketch after this list).
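To make this analysis step concrete, below is a minimal sketch of building an RDM from model embeddings with cosine distance and projecting the geometry to two dimensions with t-SNE. The embeddings here are dummy data standing in for feature vectors from any of the three models; the variable names are illustrative, not the authors' actual pipeline.

```python
# Minimal sketch: cosine-distance RDM + t-SNE visualization.
# `embeddings` is dummy data standing in for per-stimulus feature
# vectors extracted from VGG-16, CLIP, or SGPT.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 512))  # 20 stimuli x 512-dim features

# Pairwise cosine distances, reshaped into a 20 x 20 RDM
rdm = squareform(pdist(embeddings, metric="cosine"))

# 2-D projection of the representational geometry
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
print(rdm.shape, coords.shape)  # (20, 20) (20, 2)
```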

Detailed Experimental Procedure

  1. Facial Representation in Perception and Memory:

    • Stimuli: the 20 internationally known political figures and entertainment celebrities described above.
    • Training and validating the visual neural network model, extracting feature vectors for each face image, and computing pairwise similarities (see the embedding-extraction sketch after this list).
  2. Object Representation in Perception and Memory:

    • Selecting object images and computing their dissimilarities under the visual, visual-semantic, and semantic neural networks.
    • Participants provided similarity ratings for these objects in perception and in memory, followed by statistical analysis and validation.
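As an illustration of the feature-extraction step, the sketch below pulls a penultimate-layer feature vector from a VGG-16 backbone. torchvision ships only an ImageNet-trained VGG-16, so it stands in here for the paper's VGGFace2-trained network, which would be loaded the same way from its own checkpoint; the image path is a placeholder, and CLIP and SGPT embeddings would be extracted analogously with their respective libraries.

```python
# Sketch: extracting a penultimate-layer feature vector with VGG-16.
# Uses torchvision's ImageNet weights as a stand-in for the paper's
# VGGFace2-trained network; the image path is a placeholder.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier = model.classifier[:-1]  # drop the final classification layer
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("faces/identity_01.jpg").convert("RGB"))
with torch.no_grad():
    features = model(image.unsqueeze(0)).squeeze(0)
print(features.shape)  # torch.Size([4096])
```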

Research Results

Representations of Facial Perception and Memory

  1. High Correlation Between Perception and Memory: Participants' similarity judgments for faces were highly correlated between perception and memory reconstruction (r = 0.77, p < 0.001).
  2. Independent Contributions of Visual and Semantic Information (see the variance-partitioning sketch after this list):
    • Visual information contributed more strongly in perception (r = 0.37, t = 11.5, p < 0.001).
    • Semantic information contributed significantly in memory (r = 0.41, t = 6.42, p < 0.001).
    • Unique Contribution of the Visual-Semantic Model (CLIP): CLIP made a significant unique contribution in both perception and memory.
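One common way to estimate such unique contributions is variance partitioning over RDMs: regress the human RDM on all model RDMs jointly, then measure how much explained variance is lost when one model is removed. The sketch below follows that logic; it illustrates the general approach, not necessarily the authors' exact statistic, and all names are ours.

```python
# Sketch: unique contribution of one model RDM via variance partitioning.
# Not necessarily the paper's exact statistic; names are illustrative.
import numpy as np
from scipy.spatial.distance import squareform
from sklearn.linear_model import LinearRegression

def upper(rdm):
    """Flatten a square RDM into its upper-triangle vector."""
    return squareform(rdm, checks=False)

def unique_r2(human_rdm, model_rdms, leave_out):
    """R^2 lost when `leave_out` is dropped from the joint regression."""
    y = upper(human_rdm)
    full = np.column_stack([upper(m) for m in model_rdms.values()])
    reduced = np.column_stack([upper(m) for name, m in model_rdms.items()
                               if name != leave_out])
    r2_full = LinearRegression().fit(full, y).score(full, y)
    r2_reduced = LinearRegression().fit(reduced, y).score(reduced, y)
    return r2_full - r2_reduced

# Usage with dummy RDMs for the three models (20 stimuli -> 190 pairs):
rng = np.random.default_rng(0)
rdms = {name: squareform(rng.random(190)) for name in ("vgg", "clip", "sgpt")}
human = squareform(rng.random(190))
print({name: round(unique_r2(human, rdms, name), 4) for name in rdms})
```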

Representations of Object Perception and Memory

  1. High Correlation Between Object Perception and Memory: Object representations were highly correlated between image presentation and recall (r = 0.78, p < 0.001; see the correlation sketch after this list).
  2. Independent Contributions of Three Types of Information:
    • The visual, visual-semantic, and semantic models each contributed independently to memory (VGG-16: r = 0.15, t = 3.01, p = 0.007; CLIP: r = 0.21, t = 10.9, p < 0.001; SGPT: r = 0.43, t = 7.43, p < 0.001).
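The perception-memory correspondence reported above is a correlation between the two human RDMs. A minimal sketch, using dummy dissimilarity vectors in place of the real ratings:

```python
# Sketch: correlating a perception RDM with a memory RDM.
# Dummy dissimilarity vectors stand in for the human ratings.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_pairs = 20 * 19 // 2  # 20 stimuli -> 190 unique pairs
perception = rng.random(n_pairs)
memory = 0.8 * perception + 0.2 * rng.random(n_pairs)  # correlated dummy data

r, p = pearsonr(perception, memory)
print(f"r = {r:.2f}, p = {p:.1e}")
```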

Conclusions and Value

Conclusions

The study finds that visual, visual-semantic, and semantic information make unique and complementary contributions to human perception and memory. Visual information dominates perception, while semantic information is more critical in memory reconstruction. In addition, CLIP's integration of visual and semantic information allowed it to predict human mental representations especially well, offering new directions for cognitive models.

Research Significance

  • Scientific Value: Reveals the independent and interactive contributions of visual and semantic information to memory and perception, challenging current cognitive models of face and object recognition.
  • Application Value: Provides algorithms for simulating human mental representations, with potential applications in intelligent systems and cognitive training programs.

Research Highlights

  • Innovation: The first study to comprehensively separate and quantify the independent contributions of visual and semantic information using DNNs.
  • Methodology: Combining multiple models verified the integrative contribution of multimodal information to mental representations.

These findings enrich our understanding of how human mental representations are constructed and offer a reference for building AI models that better predict human behavior. Future research can apply these algorithms to mental representations across more categories and domains, furthering the convergence of artificial and human intelligence research.