Exploratory Tasks for LLM Encoders: A Comprehensive Guide

The Hong Kong Polytechnic University
Part of the MM'24 paper "Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval"

Abstract

This webpage serves as a comprehensive guide for exploring and evaluating tasks leveraging Large Language Model (LLM) encoders. It aims to provide insights into the application of LLM encoders across various domains, including video moment retrieval, image classification, 2D/3D visual question answering (VQA), point cloud recognition, action recognition, and motion forecasting. Each section presents detailed analyses and experimental findings, highlighting the effectiveness of LLM encoders in solving these tasks. By integrating our own research alongside findings from other academic papers, this webpage offers a robust reference point for researchers and practitioners interested in the capabilities and versatility of LLM encoders across multimodal scenarios.

Video Moment Retrieval

Experimental validation demonstrates that integrating the LLM encoder yields state-of-the-art VMR performance.

To demonstrate the process of relation refinement in the VMR task, we break down each query into individual concepts and use them as separate queries for VMR. We visualize the model's attention maps, showing its focus on both the single-concept queries and the original queries (composed of the combined concepts). With the LLM encoder, the model demonstrates a better understanding of how the concepts are composed.
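As a concrete illustration, the sketch below probes the model with single-concept queries and collects its cross-attention over video clips for comparison with the full query. This is only a minimal sketch: the interfaces (vmr_model, encode_text, the return_attn flag) are hypothetical stand-ins, not the actual API used in the paper.

# Minimal sketch of concept-level attention probing (hypothetical model API).
import torch

def probe_concept_attention(vmr_model, encode_text, video_feats, query, concepts):
    """Run the full query and each single-concept query through the model
    and collect per-clip attention curves for visualization."""
    attn_maps = {}
    for text in [query, *concepts]:
        text_feats = encode_text(text)  # (num_tokens, dim), assumed interface
        with torch.no_grad():
            # Assumed: the model can return cross-attention weights of shape
            # (num_clips, num_tokens) alongside its predictions.
            _, cross_attn = vmr_model(video_feats, text_feats, return_attn=True)
        # Average over text tokens to obtain a per-clip saliency curve.
        attn_maps[text] = cross_attn.mean(dim=-1)  # (num_clips,)
    return attn_maps

# Example: compare the full query against its individual concepts.
# maps = probe_concept_attention(model, model.encode_text, clip_feats,
#                                "a man plays guitar on stage",
#                                ["a man plays guitar", "on stage"])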

Image Classification

Experiments on ImageNet and its variants (ImageNet-C, ImageNet-A, ImageNet-SK, ImageNet-R) using ViT models demonstrate that adding a single transformer block from LLaMA significantly enhances both accuracy and robustness.
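A minimal PyTorch sketch of this insertion pattern is shown below. It assumes the ViT exposes patch tokens of shape (B, N, vit_dim) and treats llama_block as any module that maps LLM-width token features to the same shape; a real LLaMA decoder layer would additionally need an attention mask and rotary position embeddings, which are omitted here. Dimensions, names, and the pooling choice are illustrative assumptions, not the exact configuration from the referenced paper.

import torch
import torch.nn as nn

class ViTWithFrozenLLaMABlock(nn.Module):
    """Sketch: append a frozen LLaMA transformer block (plus two trainable
    linear projections) after a ViT backbone for image classification."""

    def __init__(self, vit: nn.Module, llama_block: nn.Module,
                 vit_dim: int = 768, llama_dim: int = 4096, num_classes: int = 1000):
        super().__init__()
        self.vit = vit                                 # visual backbone (assumed to output patch tokens)
        self.proj_in = nn.Linear(vit_dim, llama_dim)   # project ViT tokens to the LLM width
        self.llama_block = llama_block                 # pre-trained LLM layer, kept frozen
        for p in self.llama_block.parameters():
            p.requires_grad = False
        self.proj_out = nn.Linear(llama_dim, vit_dim)  # project back to the ViT width
        self.head = nn.Linear(vit_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.vit(images)                      # (B, N, vit_dim)
        x = self.proj_out(self.llama_block(self.proj_in(tokens)))
        return self.head(x.mean(dim=1))                # mean-pool tokens, then classify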


Point Cloud Classification

In 3D point cloud classification, adding the LLaMA transformer block after the final attention block of the Point-BERT model improves accuracy on the ScanObjectNN and ModelNet40 datasets, validating the applicability of LLM transformers across different modalities.
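Under the same assumptions as the sketch above, the snippet below shows one way to splice the projection-wrapped frozen layer in right after the backbone's final attention block; the blocks attribute is a hypothetical name for Point-BERT's list of transformer blocks.

import torch.nn as nn

def append_frozen_llm_block(encoder_blocks: nn.ModuleList, wrapped_llm: nn.Module) -> nn.ModuleList:
    """Return a block list with the frozen, projection-wrapped LLM layer
    placed after the encoder's final attention block."""
    return nn.ModuleList([*encoder_blocks, wrapped_llm])

# Usage (hypothetical attribute name): assuming point_bert.blocks holds the
# backbone's transformer blocks and wrapped_llm is a Linear -> frozen LLaMA
# layer -> Linear stack, the backbone then runs the new block last.
# point_bert.blocks = append_frozen_llm_block(point_bert.blocks, wrapped_llm)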


Action Recognition

Applying the pre-trained LLaMA transformer block to video action recognition, specifically on the Something-Something V2 (SSv2) dataset, shows advantages in multi-frame video understanding, enhancing the accuracy of ViT-S and ViT-B models.


Motion Forecasting

In motion forecasting, incorporating the LLaMA transformer into the VectorNet and mmTransformer models improves trajectory prediction on the Argoverse dataset. Although the improvement is less pronounced than on semantic tasks, it demonstrates the potential of LLM transformers for understanding dynamic information.


2D and 3D Visual Question Answering (VQA)

The LLaMA transformer is also applied to 2D and 3D visual-language tasks, evaluated on the VQAv2 and SQA3D datasets. By integrating the LLaMA block after visual-language fusion, the models show enhanced question-answering capabilities and improved multimodal understanding.
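The sketch below reuses the same projection-wrapping idea but places the frozen layer after the visual-language fusion stage rather than after the visual backbone; the dimensions and the answer-head usage are illustrative assumptions.

import torch.nn as nn

def wrap_frozen_llm_for_fusion(fused_dim: int, llama_dim: int, llama_block: nn.Module) -> nn.Module:
    """Wrap a frozen LLaMA layer with linear projections so it can refine
    fused visual-language tokens of width fused_dim."""
    for p in llama_block.parameters():
        p.requires_grad = False              # keep the pre-trained LLM weights frozen
    return nn.Sequential(
        nn.Linear(fused_dim, llama_dim),     # fused tokens -> LLM width
        llama_block,                         # frozen pre-trained layer
        nn.Linear(llama_dim, fused_dim),     # back to the fusion width
    )

# Usage sketch:
# fused = fusion(visual_tokens, text_tokens)              # (B, N, fused_dim)
# refined = wrap_frozen_llm_for_fusion(768, 4096, llama_layer)(fused)
# logits = answer_head(refined.mean(dim=1))               # pooled answer logits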

Reference: Frozen Transformers in Language Models Are Effective Visual Encoder Layers

BibTeX

@inproceedings{jiang2024prior,
  title={Prior Knowledge Integration via {LLM} Encoding and Pseudo Event Regulation for Video Moment Retrieval},
  author={Yiyang Jiang and Wengyu Zhang and Xulu Zhang and Xiaoyong Wei and Chang Wen Chen and Qing Li},
  booktitle={ACM Multimedia 2024},
  year={2024},
  url={https://arxiv.org/abs/2407.15051}
}