chore: added docs for vision (#227)
* chore: added docs for vision

* Create tender-elephants-whisper.md

* fix: added cache option doc

* Update docs/vision.md

Co-authored-by: Saikat Mitra <saikatmitra91@gmail.com>

* fox

* fix: added h3 for query doc

---------

Co-authored-by: Saikat Mitra <saikatmitra91@gmail.com>
pushpam5 and saikatmitra91 authored Nov 14, 2024
1 parent 6a06f72 commit a3e369d
Showing 3 changed files with 62 additions and 1 deletion.
5 changes: 5 additions & 0 deletions .changeset/tender-elephants-whisper.md
@@ -0,0 +1,5 @@
---
"appwright": patch
---

chore: added docs for vision
56 changes: 56 additions & 0 deletions docs/vision.md
@@ -0,0 +1,56 @@
# Vision methods

Appwright provides a set of built-in methods to tap on the screen or extract information from it. These methods use LLM capabilities to perform actions on the screen.

## Extract information from the screen

The `query` method allows you to extract information from the screen based on a prompt. Ensure the `OPENAI_API_KEY` environment variable is set to authenticate the API request.

```ts
const text = await device.beta.query("Extract the contact details present in the footer");
```

By default, the `query` method returns a string. You can also specify a Zod schema to get the response in a specific format.

```ts
import { z } from "zod";

// Pass a Zod schema to receive a typed value instead of a string.
const isLoginButtonVisible = await device.beta.query(
  `Is the login button visible on the screen?`,
  {
    responseFormat: z.boolean(),
  },
);
```

### Using a custom screenshot

By default, the `query` method retrieves information from the current screen. Alternatively, you can pass a screenshot to run the query against that image instead.

```ts
const text = await device.beta.query(
  "Extract contact details from the footer of this screenshot.",
  {
    // `base64ImageString` is a placeholder for a base64-encoded image string.
    screenshot: base64ImageString,
  },
);
```
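
As a rough sketch of how a base64 screenshot string might be produced, the helper below encodes raw image bytes and forwards them to `query`. The names `QueryDevice`, `encodeScreenshot`, and `queryWithScreenshot` are hypothetical and not part of Appwright; the interface only mirrors the call shape shown above, and it assumes the `screenshot` option accepts a plain base64 string with no data-URL prefix.

```ts
import { Buffer } from "node:buffer";

// Hypothetical structural type: mirrors the `device.beta.query` call shown
// above. It is an assumption for this sketch, not an Appwright export.
interface QueryDevice {
  beta: {
    query(prompt: string, options?: { screenshot?: string }): Promise<string>;
  };
}

// Encode raw image bytes (for example, the contents of a PNG file) as base64.
function encodeScreenshot(imageBytes: Uint8Array): string {
  return Buffer.from(imageBytes).toString("base64");
}

// Hypothetical helper: run a query against the given image bytes
// instead of the current screen.
async function queryWithScreenshot(
  device: QueryDevice,
  imageBytes: Uint8Array,
  prompt: string,
): Promise<string> {
  return device.beta.query(prompt, {
    screenshot: encodeScreenshot(imageBytes),
  });
}
```

The bytes could come from any source, such as `readFileSync` from `node:fs` applied to a saved screenshot file.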

### Using a different model

By default, the `query` method uses the `gpt-4o-mini` model. You can also specify a different model.

```ts
const text = await device.beta.query(
  `Extract contact details present in the footer`,
  {
    model: "gpt-4o",
  },
);
```

## Tap on the screen

The `tap` method allows you to tap on the screen based on a prompt. Ensure the `EMPIRICAL_API_KEY` environment variable is set to authenticate the API request.

```ts
await device.beta.tap("point at the 'Login' button.");
```
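
Because `query` and `tap` rely on different environment variables, it may help to check for both before a test run starts. The sketch below is one possible way to do that; `REQUIRED_ENV_VARS` and `missingEnvVars` are hypothetical names for this example, and only the two variable names come from the documentation above.

```ts
// Environment variables named in this document, keyed by vision method.
const REQUIRED_ENV_VARS = {
  query: "OPENAI_API_KEY",
  tap: "EMPIRICAL_API_KEY",
} as const;

type VisionMethod = keyof typeof REQUIRED_ENV_VARS;

// Hypothetical helper: return the names of the variables that are unset or
// empty for the vision methods a test intends to use.
function missingEnvVars(
  methods: VisionMethod[],
  env: Record<string, string | undefined>,
): string[] {
  return methods
    .map((method) => REQUIRED_ENV_VARS[method])
    .filter((name) => !env[name]);
}
```

In a Node.js setup script, calling `missingEnvVars(["query", "tap"], process.env)` and failing early when the result is non-empty would surface a missing key before any API request is made.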
2 changes: 1 addition & 1 deletion src/vision/index.ts
@@ -38,7 +38,7 @@ export interface AppwrightVision {

/**
* Performs a tap action on the screen based on the provided prompt.
- * Ensure the `VISION_MODEL_ENDPOINT` environment variable is set to authenticate the API request.
+ * Ensure the `EMPIRICAL_API_KEY` environment variable is set to authenticate the API request.
*
* **Usage:**
* ```js