
Commit

add MMSG paper
leloykun committed Jun 17, 2024
1 parent f13c502 commit 27a1a94
Showing 39 changed files with 960 additions and 69 deletions.
2 changes: 1 addition & 1 deletion config.yml
Original file line number Diff line number Diff line change
@@ -61,7 +61,7 @@ params:
subtitle:
Building something 👨‍🍳🚀 • Former <span style="color:#011F5B">**Machine Learning (AI) Research Scientist, Full-Stack Software Engineer, & Data Engineer**</span> at [Expedock Software Inc.](https://expedock.com) • <span style="color:#011F5B">**2x IOI & 2x ICPC World Finalist**</span> • Mathematics at the [Ateneo de Manila University](https://www.ateneo.edu/)<br><br>

Multi-Modal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning
Multimodal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning
imageUrl: "picture.png"
imageWidth: 160
imageHeight: 160
Binary file added content/papers/mmsg/cover.png
27 changes: 27 additions & 0 deletions content/papers/mmsg/index.md
@@ -0,0 +1,27 @@
---
title: "Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report"
date: 2024-06-17
tags: ["Machine Learning", "Multimodal Machine Learning", "Structured Generation", "Computer Vision", "Document Information Extraction"]
author: "Franz Louis Cesista"
description: "Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use."
summary: "[Technical Report for CVPR's 2nd MMFM Challenge] This report presents Multimodal Structured Generation, a general framework which constrains the output logits of frozen Multimodal Foundation Models to force them to reason before responding with structured outputs that downstream APIs can parse and use. This approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference."
cover:
image: "cover.png"
alt: "Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report"
---

![cover](cover.png)

Authors: [Franz Louis Cesista](mailto:franzlouiscesista@gmail.com)

Arxiv: [Submitted -- Under Review]

PDF: [Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report](/mmsg.pdf)

Code on GitHub: https://github.com/leloykun/MMFM-Challenge

---

## Abstract

Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use. All of our scripts, deployment steps, and evaluation results can be accessed in [this repository](https://github.com/leloykun/MMFM-Challenge).
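The core mechanism the abstract describes, masking the logits of a frozen model so that only structurally valid next tokens can be emitted, can be sketched in a few lines. This is an illustrative toy, not the report's actual implementation: the `logits_fn` stand-in for a frozen model, the character-level vocabulary, and the `{"total": <digits>}` schema are all assumptions made for the example.

```python
import math
import random

def constrained_greedy_decode(logits_fn, allowed_fn, max_steps=20):
    """Greedy decoding with structure enforced by logit masking:
    at every step, tokens the grammar forbids are treated as -inf."""
    out = []
    for _ in range(max_steps):
        allowed = allowed_fn(out)
        if not allowed:                    # grammar says the output is complete
            break
        logits = logits_fn(out)
        # mask: only structurally valid tokens keep their scores
        out.append(max(allowed, key=lambda t: logits.get(t, -math.inf)))
    return "".join(out)

TEMPLATE = '{"total": '                    # toy schema: {"total": <digits>}

def allowed_fn(out):
    s = "".join(out)
    if len(s) < len(TEMPLATE):             # forced prefix: emit the template verbatim
        return {TEMPLATE[len(s)]}
    body = s[len(TEMPLATE):]
    if body.endswith("}"):
        return set()                       # object closed: stop decoding
    opts = set("0123456789")
    if body:                               # '}' is only legal after at least one digit
        opts.add("}")
    return opts

def logits_fn(out):
    # stand-in for a frozen MMFM: arbitrary but deterministic scores
    rng = random.Random(len(out))
    return {c: rng.random() for c in '{}"total: 0123456789'}

result = constrained_greedy_decode(logits_fn, allowed_fn)
print(result)  # always matches the schema, whatever the raw scores prefer
```

In practice the mask is compiled from a JSON schema or grammar over the model's real token vocabulary rather than hand-written per character, but the decode loop is the same idea.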
3 changes: 2 additions & 1 deletion content/papers/rasg/index.md
@@ -10,13 +10,14 @@ summary: "[Preprint - Accepted @ IEEE 7th International Conference on Multimedia
cover:
image: "cover.png"
alt: "Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]"
weight: 1
---

![cover](cover.png)

Authors: [Franz Louis Cesista](mailto:franzlouiscesista@gmail.com), Rui Aguiar, Jason Kim, Paolo Acilo

Arxiv: [Abstract](https://arxiv.org/abs/2405.20245)

PDF: [Preprint - Accepted @ IEEE MIPR 2024](/rasg.pdf)

Code on GitHub: [Will be available on or before June 17th, 2024]
11 changes: 11 additions & 0 deletions public/archive/index.html
@@ -160,6 +160,17 @@
<div class="archive-year">
<h2 class="archive-year-header">2024
</h2>
<div class="archive-month">
<h3 class="archive-month-header">June
</h3>
<div class="archive-posts">
<div class="archive-entry">
<h3 class="archive-entry-title">Multimodal Structured Generation: CVPR&rsquo;s 2nd MMFM Challenge Technical Report
</h3>
<a class="entry-link" aria-label="post link to Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report" href="https://leloykun.github.io/papers/mmsg/"></a>
</div>
</div>
</div>
<div class="archive-month">
<h3 class="archive-month-header">April
</h3>
2 changes: 1 addition & 1 deletion public/index.html
@@ -94,7 +94,7 @@ <h1><span style="color:#011F5B">Franz Louis Cesista</span></h1>
<span>Building something 👨‍🍳🚀 • Former <span style="color:#011F5B"><strong>Machine Learning (AI) Research Scientist, Full-Stack Software Engineer, &amp; Data Engineer</strong></span> at <a href="https://expedock.com" target="_blank">Expedock Software Inc.</a>
<span style="color:#011F5B"><strong>2x IOI &amp; 2x ICPC World Finalist</strong></span> • Mathematics at the <a href="https://www.ateneo.edu/" target="_blank">Ateneo de Manila University</a>
<br><br>
Multi-Modal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning</span><div class="social-icons">
Multimodal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning</span><div class="social-icons">
<a href="mailto:franzlouiscesista@gmail.com" target="_blank" rel="noopener noreferrer me" title="Email">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 21" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<path d="M4 4h16c1.1 0 2 .9 2 2v12c0 1.1-.9 2-2 2H4c-1.1 0-2-.9-2-2V6c0-1.1.9-2 2-2z"></path>
31 changes: 20 additions & 11 deletions public/index.xml
@@ -10,7 +10,7 @@
</image>
<generator>Hugo -- gohugo.io</generator>
<language>en</language>
<lastBuildDate>Mon, 15 Apr 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/index.xml" rel="self" type="application/rss+xml" />
<lastBuildDate>Mon, 17 Jun 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>ChatGPT May Have Developed Seasonal Depression</title>
<link>https://leloykun.github.io/ponder/chatgpt-seasonal-depression/</link>
@@ -65,16 +65,6 @@
<description>Whether you&amp;#39;re only here for the hype or genuinely interested in the field, you’re in for a wild ride.</description>
</item>

<item>
<title>Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]</title>
<link>https://leloykun.github.io/papers/rasg/</link>
<pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/rasg/</guid>
<description>Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks.
The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs &#43; RASG is oftentimes superior given real-world applications and constraints of BDIE.</description>
</item>

<item>
<title>Flash Hyperbolic Attention Minimal [WIP]</title>
<link>https://leloykun.github.io/personal-projects/flash-hyperbolic-attention-minimal/</link>
@@ -147,6 +137,25 @@ The contributions of this paper are threefold: (1) We show, with ablation benchm
<description>Booking demand prediction for Grab&amp;#39;s Southeast Asia operations. The project involves spatio-temporal forecasting, anomaly detection, and econometric modeling.</description>
</item>

<item>
<title>Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report</title>
<link>https://leloykun.github.io/papers/mmsg/</link>
<pubDate>Mon, 17 Jun 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/mmsg/</guid>
<description>Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method&amp;#39;s ability to generalize to unseen tasks, and that simple engineering can beat expensive &amp;amp; complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use.</description>
</item>

<item>
<title>Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]</title>
<link>https://leloykun.github.io/papers/rasg/</link>
<pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/rasg/</guid>
<description>Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks.
The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs &#43; RASG is oftentimes superior given real-world applications and constraints of BDIE.</description>
</item>


</channel>
</rss>
Binary file added public/mmsg.pdf
Binary file not shown.
17 changes: 17 additions & 0 deletions public/papers/index.html
@@ -125,6 +125,23 @@ <h1>
</h1>
</header>

<article class="post-entry">
<figure class="entry-cover">
<img loading="lazy" srcset="https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_360x0_resize_box_3.png 360w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_480x0_resize_box_3.png 480w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_720x0_resize_box_3.png 720w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_1080x0_resize_box_3.png 1080w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_1500x0_resize_box_3.png 1500w ,https://leloykun.github.io/papers/mmsg/cover.png 2041w"
sizes="(min-width: 768px) 720px, 100vw" src="https://leloykun.github.io/papers/mmsg/cover.png" alt="Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report"
width="2041" height="1119">
</figure>
<header class="entry-header">
<h2>Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report
</h2>
</header>
<div class="entry-content">
<p>[Technical Report for CVPR’s 2nd MMFM Challenge] This report presents Multimodal Structured Generation, a general framework which constrains the output logits of frozen Multimodal Foundation Models to force them to reason before responding with structured outputs that downstream APIs can parse and use. This approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference.</p>
</div>
<footer class="entry-footer"><span title='2024-06-17 00:00:00 +0000 UTC'>June 17, 2024</span>&nbsp;&middot;&nbsp;Franz Louis Cesista</footer>
<a class="entry-link" aria-label="post link to Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report" href="https://leloykun.github.io/papers/mmsg/"></a>
</article>

<article class="post-entry">
<figure class="entry-cover">
<img loading="lazy" srcset="https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_360x0_resize_box_3.png 360w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_480x0_resize_box_3.png 480w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_720x0_resize_box_3.png 720w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_1080x0_resize_box_3.png 1080w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_1500x0_resize_box_3.png 1500w ,https://leloykun.github.io/papers/rasg/cover.png 3972w"
11 changes: 10 additions & 1 deletion public/papers/index.xml
@@ -10,7 +10,16 @@
</image>
<generator>Hugo -- gohugo.io</generator>
<language>en</language>
<lastBuildDate>Mon, 15 Apr 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/papers/index.xml" rel="self" type="application/rss+xml" />
<lastBuildDate>Mon, 17 Jun 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/papers/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report</title>
<link>https://leloykun.github.io/papers/mmsg/</link>
<pubDate>Mon, 17 Jun 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/mmsg/</guid>
<description>Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method&amp;#39;s ability to generalize to unseen tasks, and that simple engineering can beat expensive &amp;amp; complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use.</description>
</item>

<item>
<title>Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]</title>
<link>https://leloykun.github.io/papers/rasg/</link>
Binary file added public/papers/mmsg/cover.png
