ae9is/amazon-reviews

amazon-reviews

Spring GraphQL API based on normalising the Amazon Reviews 2023 dataset in Postgres.

Also contains a Python API for running the BLaIR model for item recommendations.

Setup

Requirements

The project has been developed on Linux for Linux-based Docker image deployment. Your mileage may vary with other platforms.

The Python API requires generating some embeddings, which are loaded into the database. This task effectively requires GPU-enabled PyTorch; it is very slow on CPU.

Environment variables

Set up direnv to load .env variables: https://direnv.net/

direnv allow

Python

Uses Python 3.12. To easily switch between versions of Python, consider setting up pyenv.

PDM is used for proper dependency resolution and convenience scripts.

pip install pipx
pipx install pdm
pdm install
pdm install-cpu
# OR
pdm install-cuda

Java

Uses Java 17+ with Gradle. Gradle is included in the project files already.

apt install openjdk-17-jdk
make deps

Data

See: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main

The data needs to be downloaded and prepped for import into the database.

  1. Download the following files to data/import:

  2. Then run the following task:

    make parse
  3. For the item recommendations API, some item metadata needs to be fed through a model to generate embeddings. This takes a while and is interruptible:

    make embeddings

Note: Feel free to download, merge, and parse the data for all of the categories, but be aware that the full dataset is much bigger!

Run

To run the Spring and Python APIs and Postgres database via Docker:

direnv allow
make docker-build
docker compose up

Open http://localhost:4000/graphiql?path=/graphql

(Optional) You can also directly run the Spring API at the same time with:

make run

Open http://localhost:8080/graphiql?path=/graphql

GPU-accelerated containers

To run the Python API inside a Docker container with CUDA enabled in PyTorch, the container host must have Docker set up for CUDA.

See: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

Test

Build the Docker images first using:

make docker-build

Then run:

make test

Database

Migrations

Bring up the database instance with Docker and then:

# Create and load tables
bash docker-db-up.sh

# Dump and drop tables
bash docker-db-down.sh

CLI

To get a shell to the running Postgres instance:

make docker-bash-db
postgres@...:/$ psql reviews
reviews=# \d

Production Demo

Deployment

To spin up a free live demo using Render:

  1. Issue a release commit, which causes the GitHub Actions workflow to tag and publish the API Docker images to GitHub Packages.

  2. Create a Render account.

  3. Create two web services, one for each API (Java GraphQL API, Python model API), and set environment variables as appropriate, referring to:

  4. Create a Postgres database service. Make sure your local Postgres database fits well within 1GB, then load it into the remote Render database using:

    docker compose up
    make docker-bash-db
    pg_dump --format=custom --no-acl --no-owner --quote-all-identifiers --verbose --file /export/backup.dump --exclude-schema=pg_catalog -h localhost -U postgres reviews 
    # Modify this, inserting the external connection string for your database from Render
    pg_restore --verbose --no-acl --no-owner -d postgres://reviews:supersecretgeneratedpassword@instancesubdomain.region-postgres.render.com/reviews_abcd /export/backup.dump
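
Before dumping, it can help to confirm the size claim in step 4. The following is a minimal sketch, not part of the project: it assumes psql is on the PATH wherever you run it and can reach the reviews database as the postgres user, and the helper names and the 80% headroom figure are illustrative choices. pg_database_size reports on-disk size, which may differ somewhat from the size after a restore.

```python
import subprocess

GIB = 1024 ** 3  # the 1GB budget from step 4, taken here as 1 GiB


def database_size_bytes(dbname="reviews", user="postgres"):
    """Ask Postgres for the on-disk size of a database, in bytes (hypothetical helper)."""
    cmd = [
        "psql", "-U", user, "-d", dbname,
        "-At",  # unaligned, tuples-only output: just the number
        "-c", f"SELECT pg_database_size('{dbname}')",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return int(out.strip())


def fits_within(size_bytes, limit_bytes=GIB, headroom=0.8):
    """True if the size uses at most `headroom` of the limit (0.8 is an assumed margin)."""
    return size_bytes <= limit_bytes * headroom


# Example call (needs a reachable database):
# print(fits_within(database_size_bytes()))
```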

Caveats

  • Free tier instances spin down on Render, so after the first request to each API (GraphQL API, model API) it takes about a minute for that API to be live again
  • itemSummariesByQuery uses the model API behind the scenes, and will error out for a bit until the model API goes live again
  • The model is quantized for the demo, so the recommendation results are somewhat lower quality
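
A client can work around the spin-down delay by retrying with a growing wait. This is a minimal sketch of retry with exponential backoff, not code from the project; call_with_retry, the attempt count, and the delays are illustrative assumptions to tune for your own setup.

```python
import time


def call_with_retry(fn, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as err:  # broad on purpose: a cold start can fail in several ways
            last_error = err
            if attempt < attempts - 1:
                # Waits of base_delay, 2x, 4x, ... between attempts
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Wrap whatever function sends the request, for example `call_with_retry(lambda: send_query(...))`, where send_query stands in for your own client call.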

Example queries

Example 1

Query

query reviewsByAsin($asin: String!, $params: ReviewPaginationInput) {
  reviewsByAsin(asin: $asin, params: $params) {
    cursor
    list {
      asin
      helpfulVote
      id
      images {
        attachmentType
        id
        largeImageURL
        mediumImageURL
        smallImageURL
      }
      parentAsin
      rating
      text
      timestamp
      title
      userID
      verifiedPurchase
    }
  }
}

Variables

{
  "asin": "B0BSGM6CQ9",
  "params": {
    "limit": 5,
    "sort": "NEWEST"
  }
}
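
Outside GraphiQL, the same query can be sent as a JSON POST. The sketch below uses only the Python standard library. It assumes the GraphQL endpoint is /graphql on the Docker port, inferred from the GraphiQL URL in the Run section; build_request and run_query are illustrative names, and the field selection is trimmed for brevity.

```python
import json
import urllib.request

# Assumption: endpoint inferred from the GraphiQL URL; adjust for your setup.
GRAPHQL_URL = "http://localhost:4000/graphql"

REVIEWS_QUERY = """
query reviewsByAsin($asin: String!, $params: ReviewPaginationInput) {
  reviewsByAsin(asin: $asin, params: $params) {
    cursor
    list { id rating title text timestamp }
  }
}
"""


def build_request(query, variables, url=GRAPHQL_URL):
    """Build a JSON POST request carrying a GraphQL query and its variables."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, variables, url=GRAPHQL_URL, timeout=30):
    """Send the request and return the decoded JSON response (needs a running API)."""
    request = build_request(query, variables, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Example call, using the variables from Example 1:
# variables = {"asin": "B0BSGM6CQ9", "params": {"limit": 5, "sort": "NEWEST"}}
# print(run_query(REVIEWS_QUERY, variables))
```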

Example 2

Query

query itemSummariesByQuery($queryText: String!, $limit: Int) {
  itemSummariesByQuery(queryText: $queryText, limit: $limit) {
    id
    title
    averageRating
    ratingNumber
    price
    store
    parentAsin
  }
}

Variables

{
  "queryText": "I need a quiet instrument that outputs to a MIDI interface",
  "limit": 5
}
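
When calling this query from code, it is worth checking the response for errors before reading results, since the query can fail while the model API is unavailable. This sketch relies only on the standard GraphQL response shape (top-level data and errors keys); extract_items is an illustrative helper, not part of the project.

```python
def extract_items(response):
    """Return the itemSummariesByQuery list, or raise if the response reports errors."""
    errors = response.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise RuntimeError(f"GraphQL request failed: {messages}")
    # "data" may be null or missing; fall back to an empty result
    data = response.get("data") or {}
    return data.get("itemSummariesByQuery") or []
```

Pass it the decoded JSON body of the response; each returned element is a dict with the fields selected in the query, such as title and parentAsin.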