ACM MM 2026 Grand Challenge
LAVA
Large Vision-Language Model Learning and Applications
Rio de Janeiro, Brazil  ·  10–14 November 2026

The competition is now live!

The LAVA Challenge 2026 Kaggle page is officially open. Join now and compete!

Join on Kaggle →

Overview

We are pleased to announce the 3rd LAVA Grand Challenge, to be held in conjunction with ACM Multimedia 2026. Building on the success of our previous challenges in 2024 and 2025, this year's edition introduces two major extensions:

  • Multilingual Expansion: While previous editions focused on Japanese, the 2026 challenge expands coverage to a broader set of languages, with a particular emphasis on low-resource and underrepresented languages.
  • Evidence-Grounded Answering: In addition to selecting the correct answer, participants are now required to provide evidence for their answer — such as the page number(s) where the supporting information can be found. This reflects the growing real-world demand for AI systems that can not only answer questions but also justify their responses with traceable references.

The challenge targets the document understanding capabilities of Vision-Language Models (VLMs) on multilingual PDF documents and invites researchers, engineers, and practitioners worldwide to participate.

Register your team for LAVA Challenge 2026 here!

Task Details

This competition is a multilingual Document Visual Question Answering (Document VQA) task with evidence grounding. Given a PDF document and a question about its content, participants must:

  1. Answer the question by reading and understanding the document.
  2. Ground the answer — identify the page(s) of the PDF that contain the evidence needed to answer the question.

Each question requires reading one or more pages of the PDF and interpreting a variety of elements such as text paragraphs, tables, figures, and photographs. The dataset contains questions in Japanese and Vietnamese, reflecting the multilingual focus of this challenge.

Document VQA is a challenging task because it demands both visual understanding (interpreting the layout and structure of a rendered page) and language understanding (comprehending the question and formulating a correct answer). The evidence grounding requirement adds a further layer of difficulty: models must not only produce a correct answer but also justify it by pinpointing the exact page(s) from which the answer is derived.

Participants are encouraged to develop and evaluate Vision-Language Models (VLMs) or multimodal pipelines capable of handling multilingual, multi-page PDF documents in an open-ended question answering setting.

  • 📄 Multilingual PDFs: Questions in Japanese & Vietnamese across multi-page documents
  • 🎯 Evidence Grounding: Identify the exact page(s) containing evidence for each answer
  • 🤖 VLM Challenge: Text, tables, figures & photos; requires vision + language understanding

📄 Publication Opportunity

  • 🏆 The top 3 solutions will be invited to submit a paper to the Grand Challenge track: your chance to have your winning method published!
  • 📝 Paper length: 6 pages, plus up to 2 additional pages for references only.
  • 📚 Accepted Grand Challenge papers will be included in the ACM MM 2026 main conference proceedings.
  • 🎟️ At least one main-conference full registration is required per accepted paper.
  • 🎁 Tentative prize: the top 3 winning teams will each receive a conference fee waiver (one per team).

Evaluation Criterion

Each question is evaluated on two aspects: answer correctness (VQA Score) and evidence grounding (Grounding Score). The final score is the average of these two.

  • VQA Score: Answer correctness
  • Grounding Score: Evidence page matching
  • Overall Score: (VQA + Grounding) / 2

1. VQA Score

Answer correctness is evaluated using LLM-as-a-Judge (Gemma-3 1B), which determines whether a predicted answer is semantically equivalent to the ground truth. This approach tolerates minor variations in phrasing, formatting, and representation.

  • string / number: The LLM judge determines whether the predicted answer is semantically equivalent to the correct answer. Minor differences in formatting (e.g., 1000 vs 1,000, presence/absence of units) are tolerated.
  • unordered_list: Each predicted item is matched against a ground truth item using the LLM judge. The overall score is the F1 score computed from item-level matches, regardless of order.
  • ordered_list: Items are evaluated by an LCS (Longest Common Subsequence) score. Each item comparison uses the LLM judge. The score is normalized as:

Ordered List Score = LCS length / max(|predicted|, |ground_truth|)
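As a rough sketch of the list-scoring rules above (the official scorer is not public; `judge_equivalent` here is a hypothetical stand-in that uses normalized exact match, whereas the real evaluation calls the LLM judge):

```python
from typing import List

def judge_equivalent(pred: str, gold: str) -> bool:
    """Hypothetical stand-in for the LLM judge (exact match after
    normalization, for illustration only)."""
    return pred.strip().lower() == gold.strip().lower()

def unordered_list_score(pred: List[str], gold: List[str]) -> float:
    """F1 over item-level matches, ignoring order (greedy one-to-one matching)."""
    if not pred or not gold:
        return 0.0
    remaining = list(gold)
    matches = 0
    for p in pred:
        for g in remaining:
            if judge_equivalent(p, g):
                remaining.remove(g)
                matches += 1
                break
    precision = matches / len(pred)
    recall = matches / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def ordered_list_score(pred: List[str], gold: List[str]) -> float:
    """LCS length (item comparisons via the judge), normalized by the longer list."""
    m, n = len(pred), len(gold)
    if max(m, n) == 0:
        return 0.0
    # Standard dynamic-programming LCS table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if judge_equivalent(pred[i], gold[j]):
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n)
```

For example, predicting ["a", "b", "c"] against ground truth ["b", "a"] gives an unordered F1 of 0.8, while ["a", "b", "c"] against the ordered ground truth ["a", "c"] gives an LCS score of 2/3.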

2. Grounding Score

The predicted and ground truth evidence page numbers are compared as sets. The score is computed as:

Grounding Score = 2 × |predicted ∩ ground_truth| / (|predicted| + |ground_truth|)
  • 1.0: Predicted page set exactly matches the ground truth
  • 0.0: No overlap between predicted and ground truth
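This formula is the Dice coefficient (equivalently, the F1 score) over the two page sets. A minimal sketch:

```python
def grounding_score(pred_pages, gold_pages):
    """Dice coefficient between predicted and ground-truth evidence page sets."""
    pred, gold = set(pred_pages), set(gold_pages)
    denom = len(pred) + len(gold)
    if denom == 0:
        return 0.0  # assumption: behavior for two empty sets is not specified
    return 2 * len(pred & gold) / denom
```

For instance, predicting pages {1, 2} when the ground truth is {2, 3} yields 2 × 1 / 4 = 0.5; duplicate page numbers in a prediction collapse under the set conversion.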

3. Overall Score

Per question: (VQA Score + Grounding Score) / 2
Final score: Mean of all per-question scores across the entire test set
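The aggregation is a simple mean of per-question averages; as a sketch:

```python
def overall_score(vqa_scores, grounding_scores):
    """Mean over questions of (VQA Score + Grounding Score) / 2.
    Assumes both lists are aligned per question."""
    per_question = [(v + g) / 2 for v, g in zip(vqa_scores, grounding_scores)]
    return sum(per_question) / len(per_question) if per_question else 0.0
```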

Rules

1. Use of Open Models and Data

Participants must use publicly available (open) models and datasets only.

If you create a new dataset specifically for this competition, you are required to:

  • Publish it as a Kaggle Dataset.
  • Explicitly announce its existence in the competition Discussion (Issues) tab, so all participants have equal access.

2. Inference Environment Constraints

⚙️ Requirement

Participants must ensure that their inference pipeline completes within 2 hours on a single A100 GPU (40 GB VRAM).

There is no restriction on training — you may use any hardware and time budget for training. The constraint applies to inference only.

Background & Rationale

Ideally, this competition would be hosted as a Code Competition to enforce a unified inference environment for all participants. However, due to Kaggle platform limitations, Code Competitions cannot be held under Community Competitions. As an alternative, we are standardizing the environment by specifying the above inference constraint.

The A100 (40 GB) constraint is based on the hardware the organizers will use to verify submitted code. We understand that not all participants own an A100, but because GPU performance varies significantly across generations, we had to set the constraint based on the organizers' verification environment.

💡 If You Do Not Have an A100

  • Look up the approximate performance ratio between your GPU and an A100.
  • Estimate the inference time budget for your GPU accordingly (e.g., if your GPU is roughly half as fast, aim for ~1 hour of inference time).

The 40 GB VRAM limit was chosen because it is neither too tight nor too loose — it should be achievable for most modern large models without requiring extreme optimization.

3. Code Submission for Top Finishers

After the competition ends, top-ranked participants are required to submit their code to the organizers for reproducibility verification.

To ensure reproducibility, please follow these practices:

  • Set random seeds for all stochastic operations (model initialization, data shuffling, sampling, etc.).
  • Use Docker to containerize your environment. You will be asked to submit a Dockerfile along with your code.

4. Dataset Licenses

The dataset used in this competition contains a mix of Japanese and Vietnamese text.

🇯🇵 Japanese Data

The Japanese PDF annotation data is released under the CC BY 4.0 license.

🇻🇳 Vietnamese Data

The Vietnamese data is primarily sourced from Viet Nam Government News, the Viet Nam Government Portal, Vietnam News Agency, and other copyrighted sources. All content remains fully protected under copyright law.

Participants may freely access, view, cite, download, and print the materials for reference purposes. However, altering or modifying any content or images in any form is strictly prohibited. If you republish or redistribute any information, you must clearly attribute the original source (e.g., "Government Portal", "Viet Nam Government News", or link to www.chinhphu.vn).

© Viet Nam Government Portal. All rights reserved.

© Viet Nam Government News – Viet Nam Government Portal. All rights reserved.

© Vietnam News Agency. All rights reserved.

Important Dates

  • Dataset release on Kaggle page: 2026/4/22 (passed)
  • Challenge closed: 2026/5/31
  • Results, report, docker container submission deadline: 2026/6/7
  • Paper submission deadline: 2026/6/25
  • Notification of results: 2026/7/16
  • Camera-ready submission: 2026/8/6
  • Grand Challenge at ACM MM 2026: TBD

Presentation Policy

⚠️ On-site Attendance Required

ACM Multimedia 2026 is an on-site event only. All papers and contributions must be presented in person by an author on-site; remote presentations will not be hosted or allowed. Papers and contributions not presented on-site will be considered a no-show and removed from the conference proceedings. More details will be provided for unfortunate situations in which none of the authors is able to attend the conference physically.

Organizers

  • Duc Minh Vo, SB Intuitions, Japan
  • Akihiro Sugimoto, National Institute of Informatics, Japan
  • Hideki Nakayama, University of Tokyo, Japan
  • Khan Md Anwarus Salam, SoftBank, Japan
  • Daichi Sato, University of Tokyo, Japan
  • Takara Taniguchi, University of Tokyo, Japan
  • Kaito Baba, University of Tokyo, Japan

Contact: lava-workshop(at)googlegroups.com