
Commit

add MMSG paper
leloykun committed Jun 17, 2024
1 parent f13c502 commit 27a1a94
Showing 39 changed files with 960 additions and 69 deletions.
2 changes: 1 addition & 1 deletion config.yml
Original file line number Diff line number Diff line change
@@ -61,7 +61,7 @@ params:
subtitle:
Building something 👨‍🍳🚀 • Former <span style="color:#011F5B">**Machine Learning (AI) Research Scientist, Full-Stack Software Engineer, & Data Engineer**</span> at [Expedock Software Inc.](https://expedock.com) • <span style="color:#011F5B">**2x IOI & 2x ICPC World Finalist**</span> • Mathematics at the [Ateneo de Manila University](https://www.ateneo.edu/)<br><br>

Multi-Modal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning
Multimodal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning
imageUrl: "picture.png"
imageWidth: 160
imageHeight: 160
Binary file added content/papers/mmsg/cover.png
27 changes: 27 additions & 0 deletions content/papers/mmsg/index.md
@@ -0,0 +1,27 @@
---
title: "Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report"
date: 2024-06-17
tags: ["Machine Learning", "Multimodal Machine Learning", "Structured Generation", "Computer Vision", "Document Information Extraction"]
author: "Franz Louis Cesista"
description: "Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use."
summary: "[Technical Report for CVPR's 2nd MMFM Challenge] This report presents Multimodal Structured Generation, a general framework which constrains the output logits of frozen Multimodal Foundation Models to force them to reason before responding with structured outputs that downstream APIs can parse and use. This approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference."
cover:
image: "cover.png"
alt: "Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report"
---

![cover](cover.png)

Authors: [Franz Louis Cesista](mailto:franzlouiscesista@gmail.com)

Arxiv: [Submitted -- Under Review]

PDF: [Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report](/mmsg.pdf)

Code on GitHub: https://github.com/leloykun/MMFM-Challenge

---

## Abstract

Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use. All of our scripts, deployment steps, and evaluation results can be accessed in [this repository](https://github.com/leloykun/MMFM-Challenge).
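The core mechanism the abstract describes, masking the logits of a frozen model so that only structurally valid next tokens can be emitted, can be sketched in a few lines. This is an illustrative toy, not the report's actual implementation: the `logits_fn` stand-in for a frozen model, the character-level vocabulary, and the `{"total": <digits>}` schema are all assumptions made for the example.

```python
import math
import random

def constrained_greedy_decode(logits_fn, allowed_fn, max_steps=20):
    """Greedy decoding with structure enforced by logit masking:
    at every step, tokens the grammar forbids are treated as -inf."""
    out = []
    for _ in range(max_steps):
        allowed = allowed_fn(out)
        if not allowed:                    # grammar says the output is complete
            break
        logits = logits_fn(out)
        # mask: only structurally valid tokens keep their scores
        out.append(max(allowed, key=lambda t: logits.get(t, -math.inf)))
    return "".join(out)

TEMPLATE = '{"total": '                    # toy schema: {"total": <digits>}

def allowed_fn(out):
    s = "".join(out)
    if len(s) < len(TEMPLATE):             # forced prefix: emit the template verbatim
        return {TEMPLATE[len(s)]}
    body = s[len(TEMPLATE):]
    if body.endswith("}"):
        return set()                       # object closed: stop decoding
    opts = set("0123456789")
    if body:                               # '}' is only legal after at least one digit
        opts.add("}")
    return opts

def logits_fn(out):
    # stand-in for a frozen MMFM: arbitrary but deterministic scores
    rng = random.Random(len(out))
    return {c: rng.random() for c in '{}"total: 0123456789'}

result = constrained_greedy_decode(logits_fn, allowed_fn)
print(result)  # always matches the schema, whatever the raw scores prefer
```

In practice the mask is compiled from a JSON schema or grammar over the model's real token vocabulary rather than hand-written per character, but the decode loop is the same idea.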
3 changes: 2 additions & 1 deletion content/papers/rasg/index.md
@@ -10,13 +10,14 @@ summary: "[Preprint - Accepted @ IEEE 7th International Conference on Multimedia
cover:
image: "cover.png"
alt: "Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]"
weight: 1
---

![cover](cover.png)

Authors: [Franz Louis Cesista](mailto:franzlouiscesista@gmail.com), Rui Aguiar, Jason Kim, Paolo Acilo

Arxiv: [Abstract](https://arxiv.org/abs/2405.20245)

PDF: [Preprint - Accepted @ IEEE MIPR 2024](/rasg.pdf)

Code on GitHub: [Will be available on or before June 17th, 2024]
11 changes: 11 additions & 0 deletions public/archive/index.html
@@ -160,6 +160,17 @@
<div class="archive-year">
<h2 class="archive-year-header">2024
</h2>
<div class="archive-month">
<h3 class="archive-month-header">June
</h3>
<div class="archive-posts">
<div class="archive-entry">
<h3 class="archive-entry-title">Multimodal Structured Generation: CVPR&rsquo;s 2nd MMFM Challenge Technical Report
</h3>
<a class="entry-link" aria-label="post link to Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report" href="https://leloykun.github.io/papers/mmsg/"></a>
</div>
</div>
</div>
<div class="archive-month">
<h3 class="archive-month-header">April
</h3>
2 changes: 1 addition & 1 deletion public/index.html
@@ -94,7 +94,7 @@ <h1><span style="color:#011F5B">Franz Louis Cesista</span></h1>
<span>Building something 👨‍🍳🚀 • Former <span style="color:#011F5B"><strong>Machine Learning (AI) Research Scientist, Full-Stack Software Engineer, &amp; Data Engineer</strong></span> at <a href="https://expedock.com" target="_blank">Expedock Software Inc.</a>
<span style="color:#011F5B"><strong>2x IOI &amp; 2x ICPC World Finalist</strong></span> • Mathematics at the <a href="https://www.ateneo.edu/" target="_blank">Ateneo de Manila University</a>
<br><br>
Multi-Modal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning</span><div class="social-icons">
Multimodal Machine Learning • Document Information Extraction • Structured Generation • Non-Euclidean Geometry • Geometric Deep Learning</span><div class="social-icons">
<a href="mailto:franzlouiscesista@gmail.com" target="_blank" rel="noopener noreferrer me" title="Email">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 21" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<path d="M4 4h16c1.1 0 2 .9 2 2v12c0 1.1-.9 2-2 2H4c-1.1 0-2-.9-2-2V6c0-1.1.9-2 2-2z"></path>
31 changes: 20 additions & 11 deletions public/index.xml
@@ -10,7 +10,7 @@
</image>
<generator>Hugo -- gohugo.io</generator>
<language>en</language>
<lastBuildDate>Mon, 15 Apr 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/index.xml" rel="self" type="application/rss+xml" />
<lastBuildDate>Mon, 17 Jun 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>ChatGPT May Have Developed Seasonal Depression</title>
<link>https://leloykun.github.io/ponder/chatgpt-seasonal-depression/</link>
@@ -65,16 +65,6 @@
<description>Whether you&amp;#39;re only here for the hype or genuinely interested in the field, you’re in for a wild ride.</description>
</item>

<item>
<title>Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]</title>
<link>https://leloykun.github.io/papers/rasg/</link>
<pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/rasg/</guid>
<description>Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks.
The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs &#43; RASG is oftentimes superior given real-world applications and constraints of BDIE.</description>
</item>

<item>
<title>Flash Hyperbolic Attention Minimal [WIP]</title>
<link>https://leloykun.github.io/personal-projects/flash-hyperbolic-attention-minimal/</link>
@@ -147,6 +137,25 @@ The contributions of this paper are threefold: (1) We show, with ablation benchm
<description>Booking demand prediction for Grab&amp;#39;s Southeast Asia operations. The project involves spatio-temporal forecasting, anomaly detection, and econometric modeling.</description>
</item>

<item>
<title>Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report</title>
<link>https://leloykun.github.io/papers/mmsg/</link>
<pubDate>Mon, 17 Jun 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/mmsg/</guid>
<description>Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method&amp;#39;s ability to generalize to unseen tasks, and that simple engineering can beat expensive &amp;amp; complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use.</description>
</item>

<item>
<title>Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]</title>
<link>https://leloykun.github.io/papers/rasg/</link>
<pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/rasg/</guid>
<description>Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks.
The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs &#43; RASG is oftentimes superior given real-world applications and constraints of BDIE.</description>
</item>


</channel>
</rss>
Binary file added public/mmsg.pdf
Binary file not shown.
17 changes: 17 additions & 0 deletions public/papers/index.html
@@ -125,6 +125,23 @@ <h1>
</h1>
</header>

<article class="post-entry">
<figure class="entry-cover">
<img loading="lazy" srcset="https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_360x0_resize_box_3.png 360w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_480x0_resize_box_3.png 480w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_720x0_resize_box_3.png 720w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_1080x0_resize_box_3.png 1080w ,https://leloykun.github.io/papers/mmsg/cover_huf507c5215a238dad9bc8bff25a74b1e6_207727_1500x0_resize_box_3.png 1500w ,https://leloykun.github.io/papers/mmsg/cover.png 2041w"
sizes="(min-width: 768px) 720px, 100vw" src="https://leloykun.github.io/papers/mmsg/cover.png" alt="Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report"
width="2041" height="1119">
</figure>
<header class="entry-header">
<h2>Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report
</h2>
</header>
<div class="entry-content">
<p>[Technical Report for CVPR’s 2nd MMFM Challenge] This report presents Multimodal Structured Generation, a general framework which constrains the output logits of frozen Multimodal Foundation Models to force them to reason before responding with structured outputs that downstream APIs can parse and use. This approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference.</p>
</div>
<footer class="entry-footer"><span title='2024-06-17 00:00:00 +0000 UTC'>June 17, 2024</span>&nbsp;&middot;&nbsp;Franz Louis Cesista</footer>
<a class="entry-link" aria-label="post link to Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report" href="https://leloykun.github.io/papers/mmsg/"></a>
</article>

<article class="post-entry">
<figure class="entry-cover">
<img loading="lazy" srcset="https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_360x0_resize_box_3.png 360w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_480x0_resize_box_3.png 480w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_720x0_resize_box_3.png 720w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_1080x0_resize_box_3.png 1080w ,https://leloykun.github.io/papers/rasg/cover_hua55047b2c8566bf561c75ee96079e3d3_463865_1500x0_resize_box_3.png 1500w ,https://leloykun.github.io/papers/rasg/cover.png 3972w"
11 changes: 10 additions & 1 deletion public/papers/index.xml
@@ -10,7 +10,16 @@
</image>
<generator>Hugo -- gohugo.io</generator>
<language>en</language>
<lastBuildDate>Mon, 15 Apr 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/papers/index.xml" rel="self" type="application/rss+xml" />
<lastBuildDate>Mon, 17 Jun 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://leloykun.github.io/papers/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Multimodal Structured Generation: CVPR&#39;s 2nd MMFM Challenge Technical Report</title>
<link>https://leloykun.github.io/papers/mmsg/</link>
<pubDate>Mon, 17 Jun 2024 00:00:00 +0000</pubDate>

<guid>https://leloykun.github.io/papers/mmsg/</guid>
<description>Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method&amp;#39;s ability to generalize to unseen tasks, and that simple engineering can beat expensive &amp;amp; complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use.</description>
</item>

<item>
<title>Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use [Preprint - Accepted @ IEEE MIPR 2024]</title>
<link>https://leloykun.github.io/papers/rasg/</link>
Binary file added public/papers/mmsg/cover.png
