Workshop Schedule
Morning Session
- --:-- - Opening Remarks
- --:-- - Keynote Talk: Dr. Md. Mamunur Rashid (The King Abdulaziz Center for World Culture - Ithra): Cross-Modal Trust: Evaluating LVLMs for Safeguarding Health Information
- --:-- - Janak Kapuriya: Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
- --:-- - Jiadong Yan (online): Few-shot Anomaly Detection based on Long Short Text Interactive Contrastive Learning
- --:-- - Tun-Yuan Chang: Harvesting Temporal Correlation in Large Vision-Language Models: Using Pose Estimation as a Case Study
- --:-- - Nam Nguyen Xuan (online): StructCon-ST: Connectivity-Aware Spatio-Temporal Fine-Grained Image Analysis
- --:-- - Jun Wan (pre-recorded video): Hierarchical Temporal Views for Policy Optimization in Multimodal Video Reasoning
- --:-- - Keynote Talk: Dr. Seitaro Shinagawa (SB Intuitions, online): Sarashina2-Vision: Toward Vision-Language Models for Understanding Japanese Figures and Conceptual/Explanatory Diagrams
- --:-- - Daichi Sato: LAVA Grand Challenge Introduction
- --:-- - SYSUpporter team: HEAR: A Holistic Extraction and Agentic Reasoning Framework for Document Understanding
- --:-- - Woof team: AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
- --:-- - nsbsk team: Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering
- --:-- - char team: Two-Stage Approach Using a Pretrained Language Model for Question Answering on Japanese Document Images
Accepted Papers
Workshop Proceedings
- Jiadong Yan, Quan Zhang, Yifan Zhou, Tianle Yang, Ke Zhang: Few-shot Anomaly Detection based on Long Short Text Interactive Contrastive Learning
- Anwar Dilawar Shaikh, Janak Kapuriya, Arnav Goel, Medha Hira, Apoorv Singh, Jay Saraf, Sanjana Sanjeev, Vaibhav Nauriyal, Avinash Anand, Zhengkui Wang, Rajiv Ratn Shah: Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
- Tun-Yuan Chang, Kenneth Chandra, Cheng-Hsin Hsu: Harvesting Temporal Correlation in Large Vision-Language Models: Using Pose Estimation as a Case Study
- Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo: Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models
- Song-Li Wu: LAMDA: Leveraging Multi-Scale and Dynamic Alignment for Robust Referring Video Object Segmentation
- Jun Wan, Kexin Lv, An Guo: Hierarchical Temporal Views for Policy Optimization in Multimodal Video Reasoning
- Song-Li Wu: Bridging the Modal Gap: A Targeted Patch Refinement and Residual Preservation Framework for Efficient Referring Expression Segmentation
Call for Papers
We invite submissions on large vision-language models (LVLMs) to The Second Workshop on Large Vision-Language Model Learning and Applications (LAVA 2025). Accepted papers will be presented at the workshop and published in the ACM MM 2025 workshop proceedings. We accept short papers (non-archival) of up to 4 pages and long papers (archival) of up to 8 pages, both in ACM MM format and excluding references. Submissions must adhere to the ACM MM submission policies.
The topics in this workshop will include but are not limited to:
- Data preprocessing and prompt engineering in LVLMs
- Training/Compressing LVLMs
- Self-supervised, unsupervised, and few-/zero-shot learning in LVLMs
- Generative AI
- Trustworthy/explainable LVLM learning
- Security and privacy in LVLMs
- LVLM evaluation and benchmarking
- LVLMs for downstream tasks
- LVLMs in virtual and mixed reality
- Applications of LVLMs
- LVLMs and other modalities
- LVLMs for low-resource settings
Important Dates
- Paper submission deadline: 2025/7/11 (extended from 2025/6/15)
- ACM MM fast track submission: 2025/7/11. For the fast track, please include the main conference reviews, your response, and the meta-review in your submission.
- Acceptance notification: 2025/8/1
- Camera-ready: 2025/8/11
- Workshop date: 2025/10/27-28
Organizers

Duc Minh Vo
SB Intuitions, Japan

Huy H. Nguyen
SB Intuitions, Japan

Trung-Nghia Le
University of Science, Vietnam

Akihiro Sugimoto
National Institute of Informatics, Japan

Hideki Nakayama
University of Tokyo, Japan

Minh-Triet Tran
University of Science, Vietnam

Trung-Hieu Hoang
University of Illinois Urbana-Champaign, US
Contact: lava-workshop(at)googlegroups.com