Running prompts against images and PDFs with Google Gemini

simonw · Oct 23, 2024 · 471bb34
1 parent 53c9abf
commit 471bb34
Showing 1 changed file with 344 additions and 0 deletions.
diff --git a/llms/prompt-gemini.md b/llms/prompt-gemini.md
@@ -0,0 +1,344 @@
+# Running prompts against images and PDFs with Google Gemini
+
+I'm still working towards adding multi-modal support to my [LLM](https://llm.datasette.io/) tool. In the meantime, here are notes on running prompts against images and PDFs from the command line using the [Google Gemini](https://ai.google.dev/gemini-api) family of models.
+
+## Using curl
+
+Here's the initial recipe I figured out using `curl`.
+
+The Gemini models take a JSON document sent via POST that looks like this:
+
+```json
+{
+  "contents": [
+    {
+      "role": "user",
+      "parts": [
+        {
+          "text": "Extract text from this image"
+        },
+        {
+          "inlineData": {
+            "data": "... base 64 encoded image data ...",
+            "mimeType": "image/png"
+          }
+        }
+      ]
+    }
+  ]
+}
+```
+So the first challenge is to construct that document, including the base64 encoded image.
+
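+On macOS you can encode a file using `base64 -i image.png`. On other platforms you may not need the `-i` option. GNU `base64` on Linux also wraps its output at 76 characters by default, so add `-w 0` there to keep the encoded data on a single line.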
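
A minimal sketch of a single-line encoder that should behave the same on macOS and Linux, assuming only that `base64` reads standard input and that `tr` is available. The helper name `b64_oneline` and the sample file are hypothetical, not part of the original recipe.

```bash
# Hypothetical helper: print the base64 encoding of a file on one line.
# Reading from stdin avoids the -i flag, whose meaning differs between
# the macOS and GNU versions of base64; tr removes any line wrapping.
b64_oneline() {
    base64 < "$1" | tr -d '\n'
}

# Demo on a throwaway 100-byte file (long enough that GNU base64 would wrap it)
sample=$(mktemp)
head -c 100 /dev/zero > "$sample"
encoded=$(b64_oneline "$sample")
printf '%s characters, no embedded newlines\n' "${#encoded}"
rm -f "$sample"
```

With a helper like this, `$(b64_oneline image.png)` could stand in for `$(base64 -i image.png)` inside the heredoc.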
+
+So we can create the JSON document like this:
+
+```bash
+cat <<EOF > input.json
+{
+  "contents": [
+    {
+      "role": "user",
+      "parts": [
+        {
+          "text": "Extract text from this image"
+        },
+        {
+          "inlineData": {
+            "data": "$(base64 -i image.png)",
+            "mimeType": "image/png"
+          }
+        }
+      ]
+    }
+  ]
+}
+EOF
+```
+
+This creates an `input.json` file containing the base64 encoded image, ready to be sent to the Gemini API.
+
+Now we can send it using `curl`:
+
+```bash
+export GOOGLE_API_KEY='... your key here ...'
+
+curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
+  -H 'Content-Type: application/json' \
+  -X POST \
+  -d @input.json
+```
+
+The model name goes in the URL - here I'm using `gemini-1.5-flash-8b-latest`, Google's cheapest and fastest model.
+
+Model values you can use are:
+
+- `gemini-1.5-flash-8b-latest` - the cheapest and fastest model, $0.0375/million input tokens, 0.001 cents per image
+- `gemini-1.5-flash-latest` - the one in the middle, $0.075/million input tokens, 0.0019 cents per image
+- `gemini-1.5-pro-latest` - the most powerful model, $1.25/million input tokens, 0.0323 cents per image
+
+It's hard to overstate how _cheap_ these models are. An input image is charged at 258 tokens, so the price per image processed is measured in fractions of a cent. Those numbers above really are correct: an image sent through Gemini Pro costs less than 1/30th of a cent, and the other two models are even cheaper.
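
The per-image arithmetic can be sketched with `awk`, assuming the flat 258 tokens per image mentioned above. The function name `image_cost_cents` is made up for illustration.

```bash
# Hypothetical helper: cost in cents of one input image,
# given a price in dollars per million input tokens.
image_cost_cents() {
    awk -v price="$1" 'BEGIN {
        tokens = 258                        # tokens charged per image
        dollars = tokens * price / 1000000  # price is quoted per million tokens
        printf "%.5f\n", dollars * 100      # convert dollars to cents
    }'
}

image_cost_cents 1.25   # Gemini 1.5 Pro at $1.25/million input tokens
```

This prints `0.03225`, in line with the roughly 0.0323 cents per image quoted for Pro.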
+
+You get charged for output tokens too, and that cost varies with the length of the response. Use [my LLM pricing calculator](https://tools.simonwillison.net/llm-prices) to explore those.
+
+The output of a prompt includes a usage section that shows you exactly how many tokens you spent. Here's example output for the prompt "extract text from this image" against this image:
+
+![Rough handwriting black marker on white card, it reads Example handwriting Let's try this out](https://github.com/user-attachments/assets/b0e18d6e-eca5-4a7a-bed8-7ffb0f0d9c68)
+
+
+```json
+{
+  "candidates": [
+    {
+      "content": {
+        "parts": [
+          {
+            "text": "Example handwriting\nLet's try this out"
+          }
+        ],
+        "role": "model"
+      },
+      "finishReason": "STOP",
+      "safetyRatings": [
+        {
+          "category": "HARM_CATEGORY_HATE_SPEECH",
+          "probability": "NEGLIGIBLE"
+        },
+        {
+          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
+          "probability": "NEGLIGIBLE"
+        },
+        {
+          "category": "HARM_CATEGORY_HARASSMENT",
+          "probability": "NEGLIGIBLE"
+        },
+        {
+          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
+          "probability": "NEGLIGIBLE"
+        }
+      ],
+      "avgLogprobs": -0.000025986179631824296
+    }
+  ],
+  "usageMetadata": {
+    "promptTokenCount": 264,
+    "candidatesTokenCount": 9,
+    "totalTokenCount": 273
+  },
+  "modelVersion": "gemini-1.5-flash-8b-001"
+}
+```
+Total cost: 0.0011 cents.
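
That total can be reproduced from the `usageMetadata` counts. The sketch below assumes $0.0375 per million input tokens and $0.15 per million output tokens for the 8B model; treat both as assumptions and check them against current pricing. `prompt_cost_cents` is a hypothetical name.

```bash
# Hypothetical helper: total cost in cents for one request.
# Arguments: input tokens, output tokens,
#            input price and output price (dollars per million tokens).
prompt_cost_cents() {
    awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" 'BEGIN {
        dollars = (in_tok * in_price + out_tok * out_price) / 1000000
        printf "%.6f\n", dollars * 100
    }'
}

# promptTokenCount=264 and candidatesTokenCount=9 come from the response above;
# the two prices are assumptions, not values taken from the response.
prompt_cost_cents 264 9 0.0375 0.15
```

The result, `0.001125`, rounds to the 0.0011 cents shown above.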
+
+## Using a Bash script
+
+I got Claude to write me a script to automate this process. Here's how you can use it:
+
+```bash
+export GOOGLE_API_KEY='... your key here ...'
+
+prompt-gemini 'extract text from this image' example-handwriting.jpg
+```
+It accepts PNG, JPG, GIF or PDF files, automatically sending the correct `mimeType` to the API. Note that PDFs with multiple pages are charged differently - I tried a 19-page PDF and it cost 12,842 tokens, suggesting around 676 tokens per page.
+
+You can also add a `-m` option to specify a different model:
+
+```bash
+prompt-gemini 'extract text from this image' example-handwriting.jpg -m pro
+```
+Shortcuts `pro`, `flash` and `8b` are supported, and it defaults to the cheapest 8b model. Any other value is appended to `gemini-1.5-` to form the model name.
+
+Here's the script - save it somewhere on your `PATH` and run `chmod 755 prompt-gemini` to make it executable. It relies on `curl` and `jq` being installed:
+
+```bash
+#!/bin/bash
+
+# Check if GOOGLE_API_KEY is set
+if [ -z "$GOOGLE_API_KEY" ]; then
+    echo "Error: GOOGLE_API_KEY environment variable is not set" >&2
+    exit 1
+fi
+
+# Default model
+model="8b"
+prompt=""
+image_file=""
+
+# Parse arguments
+while [[ $# -gt 0 ]]; do
+    case $1 in
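+        -m)
+            # A bare trailing -m would make "shift 2" fail and loop forever
+            if [ $# -lt 2 ]; then
+                echo "Error: -m requires a model name" >&2
+                exit 1
+            fi
+            model="$2"
+            shift 2
+            ;;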
+        *)
+            if [ -z "$prompt" ]; then
+                prompt="$1"
+            elif [ -z "$image_file" ]; then
+                image_file="$1"
+            fi
+            shift
+            ;;
+    esac
+done
+
+# Validate prompt
+if [ -z "$prompt" ]; then
+    echo "Error: No prompt provided" >&2
+    echo "Usage: prompt-gemini \"prompt\" [image_file] [-m model]" >&2
+    exit 1
+fi
+
+# Map model names to full model strings
+case $model in
+    "8b"|"flash-8b")
+        model_string="gemini-1.5-flash-8b-latest"
+        ;;
+    "flash")
+        model_string="gemini-1.5-flash-latest"
+        ;;
+    "pro")
+        model_string="gemini-1.5-pro-latest"
+        ;;
+    *)
+        model_string="gemini-1.5-$model"
+        ;;
+esac
+
+# Create temporary file
+temp_file=$(mktemp)
+trap 'rm -f "$temp_file"' EXIT
+
+# Determine mime type if image file is provided
+if [ -n "$image_file" ]; then
+    if [ ! -f "$image_file" ]; then
+        echo "Error: File '$image_file' not found" >&2
+        exit 1
+    fi
+
+    # Get file extension and convert to lowercase
+    ext=$(echo "${image_file##*.}" | tr '[:upper:]' '[:lower:]')
+
+    case $ext in
+        png)
+            mime_type="image/png"
+            ;;
+        jpg|jpeg)
+            mime_type="image/jpeg"
+            ;;
+        gif)
+            mime_type="image/gif"
+            ;;
+        pdf)
+            mime_type="application/pdf"
+            ;;
+        *)
+            echo "Error: Unsupported file type .$ext" >&2
+            exit 1
+            ;;
+    esac
+
+    # Create JSON with image data
+    cat <<EOF > "$temp_file"
+{
+  "contents": [
+    {
+      "role": "user",
+      "parts": [
+        {
+          "text": "$prompt"
+        },
+        {
+          "inlineData": {
+            "data": "$(base64 -i "$image_file" | tr -d '\n')",
+            "mimeType": "$mime_type"
+          }
+        }
+      ]
+    }
+  ]
+}
+EOF
+else
+    # Create JSON without image data
+    cat <<EOF > "$temp_file"
+{
+  "contents": [
+    {
+      "role": "user",
+      "parts": [
+        {
+          "text": "$prompt"
+        }
+      ]
+    }
+  ]
+}
+EOF
+fi
+
+# Make API request
+curl -s "https://generativelanguage.googleapis.com/v1beta/models/$model_string:generateContent?key=$GOOGLE_API_KEY" \
+    -H 'Content-Type: application/json' \
+    -X POST \
+    -d @"$temp_file" | jq
+```
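
One caveat with the heredoc approach: the prompt is pasted into the JSON as-is, so a prompt containing a double quote or a backslash would produce an invalid document. Here is a minimal sketch of one way to guard against that, using a hypothetical `json_escape` helper built on `sed`. It only handles those two characters, not newlines or other control characters; building the document with `jq -n --arg` would be a more thorough alternative.

```bash
# Hypothetical helper: escape backslashes and double quotes so a string
# can sit inside a JSON string literal. Backslashes are handled first,
# otherwise the backslashes added for quotes would be escaped again.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

prompt='Reply with "yes" or "no"'
escaped=$(json_escape "$prompt")

# The escaped value is safe to interpolate where the script uses $prompt
cat <<EOF
{"contents": [{"role": "user", "parts": [{"text": "$escaped"}]}]}
EOF
```

In the script, this would mean escaping `$prompt` once after argument parsing and using the escaped value in both heredocs.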
+
+## How I got Claude to write the Bash script
+
+Here's the prompt I fed to Claude to create this, starting with the Bash + `curl` recipe I had already figured out:
+
+
+> ```bash
+> cat <<EOF > input.json
+> {
+>   "contents": [
+>     {
+>       "role": "user",
+>       "parts": [
+>         {
+>           "text": "Extract text from this imaage"
+>         },
+>         {
+>           "inlineData": {
+>             "data": "$(base64 -i output_0.png)",
+>             "mimeType": "image/png"
+>           }
+>         }
+>       ]
+>     }
+>   ]
+> }
+> EOF
+> 
+> curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
+>   -H 'Content-Type: application/json' \
+>   -X POST \
+>   -d @input.json | jq
+> ```
+> Turn this into a Bash script that runs like this:
+> ```bash
+> prompt-gemini "this is the prompt"
+> prompt-gemini "This is the prompt" blah.png
+> prompt-gemini "This is the prompt" blah.pdf
+> prompt-gemini "this is the prompt" -m pro
+> ```
+> It should exit with an error if `GOOGLE_API_KEY` is not set
+> 
+> It should use a temporary file for input.json which is deleted on completion
+> 
+> If no file was provided it should skip the inlineData bit
+> 
+> It should use the correct mimeType for PNG or PDF or JPG or JPEG or GIF depending on the file extension
+> 
+> The -m option should follow the following rules: it defaults to 8b, or it can be:
+> 
+> 8b => gemini-1.5-flash-8b-latest (the default)
+> flash-8b => gemini-1.5-flash-8b-latest
+> flash => gemini-1.5-flash-latest
+> pro => gemini-1.5-pro-latest
+> 
+> Any other value should be passed used directly in the `gemini-1.5-flash:generateContent` portion of the URL
+
+Here's [the full Claude transcript](https://gist.github.com/simonw/7cc2a9c3e612a8af502d733ff619e066).