Currently, RAG is done simply by embedding each screenshot and comparing those embeddings to the query embedding. A time decay is applied to the score to favor more recent results.
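A minimal sketch of that scoring step, assuming cosine similarity over precomputed embeddings and an exponential decay; the half-life constant is a hypothetical parameter, not a value from the actual implementation:

```python
import math
import time

import numpy as np

# Hypothetical decay rate: a frame's score halves every hour.
HALF_LIFE_SECONDS = 3600.0


def time_adjusted_score(frame_embedding: np.ndarray,
                        query_embedding: np.ndarray,
                        frame_timestamp: float) -> float:
    """Cosine similarity to the query, discounted by the frame's age."""
    similarity = float(
        np.dot(frame_embedding, query_embedding)
        / (np.linalg.norm(frame_embedding) * np.linalg.norm(query_embedding))
    )
    age_seconds = max(0.0, time.time() - frame_timestamp)
    decay = math.exp(-math.log(2) * age_seconds / HALF_LIFE_SECONDS)
    return similarity * decay
```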
Retrieved frames are then deduplicated: for each group of frames falling within 2 minutes of one another, only the most recent frame is kept. Of those deduplicated frames, the top 3 by time-adjusted embedding score are selected.
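A sketch of the deduplication step under the same assumptions; here frames are chained into a group whenever they fall within 2 minutes of the previous frame, which is one plausible reading of the grouping rule. The `Frame` type is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float  # unix seconds
    score: float      # time-adjusted embedding score


def deduplicate_and_rank(frames: list[Frame],
                         window_seconds: float = 120.0,
                         top_k: int = 3) -> list[Frame]:
    # Walk frames chronologically, starting a new group whenever the gap
    # to the previous frame exceeds the 2-minute window.
    ordered = sorted(frames, key=lambda f: f.timestamp)
    groups: list[list[Frame]] = []
    for frame in ordered:
        if groups and frame.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(frame)
        else:
            groups.append([frame])
    # Keep only the most recent frame of each group, then take the top 3
    # by time-adjusted score.
    latest = [group[-1] for group in groups]
    return sorted(latest, key=lambda f: f.score, reverse=True)[:top_k]
```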
Those three frames are then converted to text by grouping the OCR output by block_num and ordering it by y position. The application package name and the local time of the screenshot are prepended to the text from the screenshot.
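A sketch of the text reconstruction, assuming Tesseract-style OCR rows with a `block_num` and a `y` coordinate per word; the field names and the header format are illustrative, not the project's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class OcrWord:
    block_num: int
    y: int
    text: str


def frame_to_text(app_package: str, local_time: str,
                  words: list[OcrWord]) -> str:
    # Group words by OCR block, then order each block top-to-bottom.
    blocks: dict[int, list[OcrWord]] = defaultdict(list)
    for word in words:
        blocks[word.block_num].append(word)
    lines = [f"App: {app_package}", f"Time: {local_time}"]
    for block_num in sorted(blocks):
        block = sorted(blocks[block_num], key=lambda w: w.y)
        lines.append(" ".join(w.text for w in block))
    return "\n".join(lines)
```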
This context from the 3 screenshots is passed to the LLM along with the original query.
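How the final prompt might be assembled; the wording and the separator are hypothetical:

```python
def build_prompt(query: str, frame_texts: list[str]) -> str:
    # frame_texts would be the frame_to_text output for the 3 frames.
    context = "\n\n---\n\n".join(frame_texts)
    return (
        "Answer the question using the following screenshot text.\n\n"
        f"{context}\n\n"
        f"Question: {query}"
    )
```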
Some ideas to improve:
- Improve OCR parsing to better include spatial information (could try retaining some positional data; see the sketch after this list)
- Create some app-specific parsing (such as for messages, to know which message came from whom)
- Could try integrating some UI object detection
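As a hypothetical sketch of the first idea, coarse positional data could survive the flattening by quantizing each word's x coordinate into indentation, so columns and chat bubbles keep some visual structure; `px_per_indent` is an assumed tuning knob:

```python
def line_with_indent(x: int, text: str, px_per_indent: int = 100) -> str:
    # Two spaces of indent per ~100 px of horizontal offset (assumed scale).
    return "  " * (x // px_per_indent) + text
```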