This webpage serves as a comprehensive guide to exploring and evaluating tasks that leverage Large Language Model (LLM) encoders. It provides insights into the application of LLM encoders across domains including video moment retrieval, image classification, 2D/3D visual question answering (VQA), point cloud recognition, action recognition, and motion forecasting. Each section presents detailed analyses and experimental findings that highlight the effectiveness of LLM encoders on these tasks. By integrating our own research with findings from other academic papers, this webpage offers a robust reference for researchers and practitioners interested in the capabilities and versatility of LLM encoders in multimodal scenarios.
To demonstrate the process of relation refinement in the video moment retrieval (VMR) task, we break each query down into its individual concepts and use them as separate queries for VMR. We then visualize the model's attention maps on both the single-concept queries and the original queries (composed of the combined concepts), as sketched below. The maps make it evident that, with the LLM encoder, the model has a better understanding of how the concepts compose.
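A minimal sketch of this probing procedure is given below, assuming a hypothetical vmr_model interface that can return its text-to-video cross-attention map; the concept strings, argument names, and output keys are illustrative, not the project's actual API.

    # Hypothetical relation-refinement probe: run VMR on each single-concept query and on
    # the full composed query, then compare the resulting cross-attention maps.
    concepts = ["a man", "rides a bike", "down the street"]   # single-concept queries (example)
    full_query = " ".join(concepts)                           # the original composed query

    def attention_map(model, video_feats, query):
        """One VMR forward pass; returns the text-to-video cross-attention map."""
        outputs = model(video_feats, query, return_attention=True)  # assumed model signature
        return outputs["cross_attention"]                            # e.g. (num_text_tokens, num_clips)

    # maps = {q: attention_map(vmr_model, video_feats, q) for q in concepts + [full_query]}
    # Plotting these maps side by side reveals whether the model attends to the same moments
    # for the composed query as it does for the individual concepts.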
Experiments on ImageNet and its variants (ImageNet-C, ImageNet-A, ImageNet-Sketch, ImageNet-R) using ViT models demonstrate that adding a single pre-trained LLaMA transformer block significantly enhances both accuracy and robustness.
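A minimal PyTorch sketch of this design follows; it is not the released implementation. It assumes vit_backbone returns patch tokens of shape (batch, tokens, vit_dim) and that llama_block is a single pre-trained LLaMA layer wrapped to take and return a tensor of the same shape; the projection layers, dimensions, and names are illustrative.

    import torch.nn as nn

    class ViTWithLLaMABlock(nn.Module):
        """ViT backbone followed by a single frozen LLaMA transformer block (sketch)."""
        def __init__(self, vit_backbone, llama_block, vit_dim=768, llama_dim=4096, num_classes=1000):
            super().__init__()
            self.vit = vit_backbone                       # assumed to return (B, N, vit_dim) tokens
            self.proj_in = nn.Linear(vit_dim, llama_dim)  # trainable projection into the LLaMA width
            self.llama_block = llama_block                # assumed wrapper: (B, N, llama_dim) -> same shape
            for p in self.llama_block.parameters():       # keep the language-model layer frozen
                p.requires_grad = False
            self.proj_out = nn.Linear(llama_dim, vit_dim) # trainable projection back to the ViT width
            self.head = nn.Linear(vit_dim, num_classes)   # classification head

        def forward(self, images):
            tokens = self.vit(images)                     # (B, N, vit_dim)
            refined = self.proj_out(self.llama_block(self.proj_in(tokens)))
            return self.head(refined[:, 0])               # predict from the [CLS] token

The same pattern of trainable input/output projections around a frozen LLM layer carries over to the other modalities discussed below.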
In 3D point cloud classification, adding the LLaMA transformer after the final attention block of the Point-BERT model improves accuracy on the ScanObjectNN and ModelNet40 datasets, validating the applicability of LLM transformers across different modalities.
Applying the pre-trained LLaMA transformer block to video action recognition, specifically on the Something-Something V2 (SSv2) dataset, shows advantages in multi-frame video understanding, enhancing the accuracy of ViT-S and ViT-B models.
In motion forecasting, incorporating the LLaMA transformer into the VectorNet and mmTransformer models improves trajectory prediction on the Argoverse dataset. Although the improvement is less pronounced than on semantic tasks, it demonstrates the potential of LLM encoders for understanding dynamic information.
The LLaMA transformer is also applied to 2D and 3D visual-language tasks on the VQAv2 and SQA3D datasets. By integrating the LLaMA block after visual-language fusion, the models show enhanced question-answering capability and improved multimodal understanding.
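The placement here differs from the backbone case: the frozen LLaMA layer sits after the multimodal fusion module rather than after the visual encoder. A minimal sketch, assuming the fusion module outputs joint tokens of shape (batch, tokens, fusion_dim); the module names, dimensions, and the 3,129-way answer vocabulary (a common VQAv2 choice) are assumptions, not the released code.

    import torch.nn as nn

    class VQAHeadWithLLaMABlock(nn.Module):
        """Frozen LLaMA block applied to fused visual-language tokens before answering (sketch)."""
        def __init__(self, llama_block, fusion_dim=768, llama_dim=4096, num_answers=3129):
            super().__init__()
            self.proj_in = nn.Linear(fusion_dim, llama_dim)   # trainable projection into the LLaMA width
            self.llama_block = llama_block                    # assumed wrapper: (B, N, llama_dim) -> same shape
            for p in self.llama_block.parameters():           # the LLM layer stays frozen
                p.requires_grad = False
            self.proj_out = nn.Linear(llama_dim, fusion_dim)  # trainable projection back to the fusion width
            self.classifier = nn.Linear(fusion_dim, num_answers)

        def forward(self, fused_tokens):                      # (B, N, fusion_dim) from the VL fusion module
            x = self.proj_out(self.llama_block(self.proj_in(fused_tokens)))
            return self.classifier(x.mean(dim=1))             # pool the refined tokens, predict an answer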
@inproceedings{jiang2024prior,
  title     = {Prior Knowledge Integration via {LLM} Encoding and Pseudo Event Regulation for Video Moment Retrieval},
  author    = {Yiyang Jiang and Wengyu Zhang and Xulu Zhang and Xiaoyong Wei and Chang Wen Chen and Qing Li},
  booktitle = {ACM Multimedia 2024},
  year      = {2024},
  url       = {https://arxiv.org/abs/2407.15051}
}