Skip to content

Commit

Permalink
Running prompts against images and PDFs with Google Gemini
Browse files Browse the repository at this point in the history
  • Loading branch information
simonw authored Oct 23, 2024
1 parent 53c9abf commit 471bb34
Showing 1 changed file with 344 additions and 0 deletions.
344 changes: 344 additions & 0 deletions llms/prompt-gemini.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,344 @@
# Running prompts against images and PDFs with Google Gemini

I'm still working towards adding multi-modal support to my [LLM](https://llm.datasette.io/) tool. In the meantime, here are notes on running prompts against images and PDFs from the command-line using the [Google Gemini](https://ai.google.dev/gemini-api) family of models.

## Using curl

Here's the initial recipe I figured out using `curl`.

The Gemini models take a JSON document sent via POST that looks like this:

```json
{
"contents": [
{
"role": "user",
"parts": [
{
"text": "Extract text from this image"
},
{
"inlineData": {
"data": "... base 64 encoded image data ...",
"mimeType": "image/png"
}
}
]
}
]
}
```
So the first challenge is to construct that document, including the base64 encoded image.

On macOS you can encode a file using `base64 -i image.png`. On other platforms you may not need the `-i` option.

So we can create the JSON document like this:

```bash
cat <<EOF > input.json
{
"contents": [
{
"role": "user",
"parts": [
{
"text": "Extract text from this image"
},
{
"inlineData": {
"data": "$(base64 -i image.png)",
"mimeType": "image/png"
}
}
]
}
]
}
EOF
```

This creates a `input.json` file containing the base64 encoded image, ready to be sent to the Gemini API.

Now we can send it using `curl`:

```bash
export GOOGLE_API_KEY='... your key here ...'

curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d @input.json
```

The model name goes in the URL - here I'm using `gemini-1.5-flash-8b-latest`, Google's cheapest and fastest model.

Model values you can use are:

- `gemini-1.5-flash-8b-latest` - the cheapest and fastest model, $0.04/million input tokens, 0.001 cents per image
- `gemini-1.5-flash-latest` - the one in the middle, $0.07/million input tokens, 0.0019 cents per image
- `gemini-1.5-pro-latest` - the most powerful model, $1.25/million input tokens, 0.0323 cents per image

It's hard to overestimate how _cheap_ these models are. An input image is charged at 258 tokens. That means the price per image processed is measured in fraction of a cent - those numbers above really are correct, an image even through Gemini Pro will cost less than 1/30th of a cent, and the other two models are even cheaper.

You get charged for output tokens too, which vary depending on the length of the response. Use [my LLM pricing calculator](https://tools.simonwillison.net/llm-prices) to explore those.

The output of a prompt includes a usage section that shows you exactly how many tokens you spent. Here's example output for the prompt "extract text from this image" against this image:

![Rough handwriting black marker on white card, it reads Example handwriting Let's try this out](https://github.com/user-attachments/assets/b0e18d6e-eca5-4a7a-bed8-7ffb0f0d9c68)


```json
{
"candidates": [
{
"content": {
"parts": [
{
"text": "Example handwriting\nLet's try this out"
}
],
"role": "model"
},
"finishReason": "STOP",
"safetyRatings": [
{
"category": "HARM_CATEGORY_HATE_SPEECH",
"probability": "NEGLIGIBLE"
},
{
"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
"probability": "NEGLIGIBLE"
},
{
"category": "HARM_CATEGORY_HARASSMENT",
"probability": "NEGLIGIBLE"
},
{
"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
"probability": "NEGLIGIBLE"
}
],
"avgLogprobs": -0.000025986179631824296
}
],
"usageMetadata": {
"promptTokenCount": 264,
"candidatesTokenCount": 9,
"totalTokenCount": 273
},
"modelVersion": "gemini-1.5-flash-8b-001"
}
```
Total cost: 0.0011 cents.

## Using a Bash script

I got Claude to write me a script to automate this process. Here's how you can use it:

```bash
export GOOGLE_API_KEY='... your key here ...'

prompt-gemini 'extract text from this image' example-handwriting.jpg
```
It accepts PNG, JPG, GIF or PDF files, automatically sending the correct `mimeType` to the API. Note that PDFs with multiple pages are charged differently - I tried a 19 page PDF and it cost 12842 tokens, suggesting around 675 tokens per page.

You can also add a `-m` option to specify a different model:

```bash
prompt-gemini 'extract text from this image' example-handwriting.jpg -m pro
```
Shortcuts `pro`, `flash` and `8b` are supported - it defaults to the cheapest 8b model.

Here's the script - save it somewhere on your path and run `chmod 755 prompt-gemini` to make it executable:

```bash
#!/bin/bash

# Check if GOOGLE_API_KEY is set
if [ -z "$GOOGLE_API_KEY" ]; then
echo "Error: GOOGLE_API_KEY environment variable is not set" >&2
exit 1
fi

# Default model
model="8b"
prompt=""
image_file=""

# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
-m)
model="$2"
shift 2
;;
*)
if [ -z "$prompt" ]; then
prompt="$1"
elif [ -z "$image_file" ]; then
image_file="$1"
fi
shift
;;
esac
done

# Validate prompt
if [ -z "$prompt" ]; then
echo "Error: No prompt provided" >&2
echo "Usage: prompt-gemini \"prompt\" [image_file] [-m model]" >&2
exit 1
fi

# Map model names to full model strings
case $model in
"8b"|"flash-8b")
model_string="gemini-1.5-flash-8b-latest"
;;
"flash")
model_string="gemini-1.5-flash-latest"
;;
"pro")
model_string="gemini-1.5-pro-latest"
;;
*)
model_string="gemini-1.5-$model"
;;
esac

# Create temporary file
temp_file=$(mktemp)
trap 'rm -f "$temp_file"' EXIT

# Determine mime type if image file is provided
if [ -n "$image_file" ]; then
if [ ! -f "$image_file" ]; then
echo "Error: File '$image_file' not found" >&2
exit 1
fi

# Get file extension and convert to lowercase
ext=$(echo "${image_file##*.}" | tr '[:upper:]' '[:lower:]')

case $ext in
png)
mime_type="image/png"
;;
jpg|jpeg)
mime_type="image/jpeg"
;;
gif)
mime_type="image/gif"
;;
pdf)
mime_type="application/pdf"
;;
*)
echo "Error: Unsupported file type .$ext" >&2
exit 1
;;
esac

# Create JSON with image data
cat <<EOF > "$temp_file"
{
"contents": [
{
"role": "user",
"parts": [
{
"text": "$prompt"
},
{
"inlineData": {
"data": "$(base64 -i "$image_file")",
"mimeType": "$mime_type"
}
}
]
}
]
}
EOF
else
# Create JSON without image data
cat <<EOF > "$temp_file"
{
"contents": [
{
"role": "user",
"parts": [
{
"text": "$prompt"
}
]
}
]
}
EOF
fi

# Make API request
curl -s "https://generativelanguage.googleapis.com/v1beta/models/$model_string:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-X POST \
-d @"$temp_file" | jq
```

## How I got Claude to write the Bash script

Here's the prompt I fed to Claude to create this, starting with the Bash + `curl` recipe I had already figured out:


> ```bash
> cat <<EOF > input.json
> {
> "contents": [
> {
> "role": "user",
> "parts": [
> {
> "text": "Extract text from this imaage"
> },
> {
> "inlineData": {
> "data": "$(base64 -i output_0.png)",
> "mimeType": "image/png"
> }
> }
> ]
> }
> ]
> }
> EOF
>
> curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
> -H 'Content-Type: application/json' \
> -X POST \
> -d @input.json | jq
> ```
> Turn this into a Bash script that runs like this:
> ```bash
> prompt-gemini "this is the prompt"
> prompt-gemini "This is the prompt" blah.png
> prompt-gemini "This is the prompt" blah.pdf
> prompt-gemini "this is the prompt" -m pro
> ```
> It should exit with an error if `GOOGLE_API_KEY` is not set
>
> It should use a temporary file for input.json which is deleted on completion
>
> If no file was provided it should skip the inlineData bit
>
> It should use the correct mimeType for PNG or PDF or JPG or JPEG or GIF depending on the file extension
>
> The -m option should follow the following rules: it defaults to 8b, or it can be:
>
> 8b => gemini-1.5-flash-8b-latest (the default)
> flash-8b => gemini-1.5-flash-8b-latest
> flash => gemini-1.5-flash-latest
> pro => gemini-1.5-pro-latest
>
> Any other value should be passed used directly in the `gemini-1.5-flash:generateContent` portion of the URL
Here's [the full Claude transcript](https://gist.github.com/simonw/7cc2a9c3e612a8af502d733ff619e066).

0 comments on commit 471bb34

Please sign in to comment.