Recommendation working with expanded dataset (pytorch#251)
* Recommendation working with expanded dataset

This code is the proposal for the new reference implementation of the
recommendation benchmark. It works on the expanded ML-20M dataset
(16x more users, 32x more items) generated with the code
from the data_generation directory.

* switched to pytorch/pytorch container, README updated
lukmaz authored and nvpaulius committed Apr 4, 2019
1 parent c097c79 commit 7c2fb6f
Showing 18 changed files with 594 additions and 489 deletions.
30 changes: 30 additions & 0 deletions data_generation/fractal_graph_expansions/README.md
@@ -86,3 +86,33 @@ Other useful flags:
2) --max_dropout_rate, decreasing/increasing this value will result in
a denser/sparser generated data set. The default value is 0.99.

# Running instructions for the recommendation benchmark

### Steps to download and verify data

You can download and verify the dataset by running the `download_dataset.sh` and `verify_dataset.sh` scripts from the parent `recommendation` directory.
Assuming you want to store the downloaded dataset in the `/my_data_dir` directory:

1. Install `unzip` and `curl`.
2. Download and unzip `ml-20m.zip`:
```bash
mkdir /my_data_dir
cd /my_data_dir
# Creates ml-20m.zip
source <PATH_TO_RECOMMENDATION_DIR>/download_dataset.sh
# Confirms the MD5 checksum of ml-20m.zip
source <PATH_TO_RECOMMENDATION_DIR>/verify_dataset.sh
unzip ml-20m.zip
```

### Step to expand the dataset (x16 users, x32 items)

Assuming that the unzipped ML-20M dataset is stored under `/my_data_dir/ml-20m`,
go to `data_generation/fractal_graph_expansions` directory and run:

```bash
pip install -r requirements.txt
DATA_DIR=/my_data_dir ./data_gen.sh
```

The resulting dataset should be stored under `/my_data_dir/ml-20mx16x32`.
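
As a quick sanity check, the following sketch (not part of the benchmark scripts; it assumes the `trainxAxB_C.npz` / `testxAxB_C.npz` shard naming described by the `--output_prefix` flag of `run_expansion.py`) confirms that all expected shards were produced:

```python
# Sketch: verify that all expected expansion shards exist.
# Assumes 16 train and 16 test shards (one per user multiplier),
# named trainx16x32_<i>.npz / testx16x32_<i>.npz.
import os

data_path = "/my_data_dir/ml-20mx16x32"
missing = [
    f"{split}x16x32_{i}.npz"
    for split in ("train", "test")
    for i in range(16)
    if not os.path.exists(os.path.join(data_path, f"{split}x16x32_{i}.npz"))
]
print("all shards present" if not missing else f"missing shards: {missing}")
```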
9 changes: 9 additions & 0 deletions data_generation/fractal_graph_expansions/data_gen.sh
@@ -0,0 +1,9 @@
DATASET=${DATASET:-ml-20m}
USER_MUL=${USER_MUL:-16}
ITEM_MUL=${ITEM_MUL:-32}
DATA_DIR=${DATA_DIR:-/data/cache}

DATA_PATH=${DATA_DIR}/${DATASET}x${USER_MUL}x${ITEM_MUL}/

mkdir -p ${DATA_PATH}
python run_expansion.py \
    --input_csv_file ${DATA_DIR}/${DATASET}/ratings.csv \
    --num_row_multiplier ${USER_MUL} \
    --num_col_multiplier ${ITEM_MUL} \
    --output_prefix ${DATA_PATH}
17 changes: 12 additions & 5 deletions data_generation/fractal_graph_expansions/graph_expansion.py
@@ -134,6 +134,10 @@ def _compute_and_write_row_block(
    train_items_to_write = train_rows_to_write.getrow(k).indices
    test_items_to_write = test_rows_to_write.getrow(k).indices

    # For users with more than one test item, keep only the first one.
    if len(test_items_to_write) > 1:
      test_items_to_write = test_items_to_write[:1]

    num_train = train_items_to_write.shape[0]
    num_test = test_items_to_write.shape[0]

@@ -156,11 +160,14 @@

logging.info("Done producing data set row by row.")

shard_suffix = "_" + str(i).zfill(3)
util.serialize_to_file(
all_train_items_to_write, file_name=train_indices_out_path + shard_suffix)
util.serialize_to_file(
all_test_items_to_write, file_name=test_indices_out_path + shard_suffix)
util.savez_two_column(
all_train_items_to_write,
row_offset=(i * right_matrix.shape[0]),
file_name=train_indices_out_path + ("_%d" % i))
util.savez_two_column(
all_test_items_to_write,
row_offset=(i * right_matrix.shape[0]),
file_name=test_indices_out_path + ("_%d" % i))

num_cols = rows_to_write.shape[1]
metadata = SparseMatrixMetadata(num_interactions=num_interactions,
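
The `row_offset` argument shifts each shard's local user indices into the global id space of the expanded matrix. A minimal sketch of that mapping (the names below are illustrative; `users_per_shard` corresponds to `right_matrix.shape[0]` in the call above):

```python
# Sketch of the shard-to-global user id mapping implied by
# row_offset = i * right_matrix.shape[0] above.
def global_user_id(shard_index: int, local_user: int, users_per_shard: int) -> int:
    """Return the expanded-dataset user id for a row inside shard `shard_index`."""
    return shard_index * users_per_shard + local_user
```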
7 changes: 7 additions & 0 deletions data_generation/fractal_graph_expansions/requirements.txt
@@ -0,0 +1,7 @@
numpy
absl-py
pandas
scipy
scikit-image
sklearn
tensorflow
18 changes: 11 additions & 7 deletions data_generation/fractal_graph_expansions/run_expansion.py
@@ -51,9 +51,11 @@
"matrix will be multiplied.")
flags.DEFINE_string("output_prefix",
"",
"Prefix to the path of the pickle files that will be "
"produced. output_prefix_train.pkl and "
"output_prefix_test.pkl will be created.")
"Prefix to the path of the files that will be "
"produced. output_prefix/trainxAxB_C.npz and "
"output_prefix/testxAxB_C.npz will be created, "
"where A is num_row_multiplier, B is num_col_multiplier, "
"and C goes from 0 to (num_row_multiplier - 1).")
flags.DEFINE_integer("random_seed",
0,
"Random seed for all random operations.")
@@ -171,10 +173,12 @@ def main(_):
  train_test_ratings_matrix = train_test_ratings_matrix.tocoo()
  logging.info("Done creating signed train/test matrix.")

  output_train_file = FLAGS.output_prefix + "_train.pkl"
  output_test_file = FLAGS.output_prefix + "_test.pkl"
  output_train_file_metadata = FLAGS.output_prefix + "_train_metadata.pkl"
  output_test_file_metadata = FLAGS.output_prefix + "_test_metadata.pkl"
  output_train_file = (FLAGS.output_prefix + "trainx" +
                       str(reduced_num_rows) + "x" + str(reduced_num_cols))
  output_test_file = (FLAGS.output_prefix + "testx" +
                      str(reduced_num_rows) + "x" + str(reduced_num_cols))
  output_train_file_metadata = None
  output_test_file_metadata = None

  logging.info("Creating synthetic train data set and dumping to %s.",
               output_train_file)
11 changes: 11 additions & 0 deletions data_generation/fractal_graph_expansions/util.py
@@ -66,6 +66,17 @@ def serialize_to_file(obj, file_name, append=False):
    pickle.dump(obj, output_file)
  logging.info("Done serializing to file %s.", file_name)

def savez_two_column(matrix, row_offset, file_name, append=False):
  """Save matrix as a compressed two-column (user, item) .npz file."""
  logging.info("Saving obj to file %s in two-column .npz format.", file_name)
  tc = []
  for u, items in enumerate(matrix):
    user = row_offset + u
    for item in items:
      tc.append([user, item])

  np.savez_compressed(file_name, np.asarray(tc))
  logging.info("Done saving to file %s.", file_name)

def sorted_product_set(array_a, array_b):
"""Compute the product set of array_a and array_b and sort it."""
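
For reference, a shard written by `savez_two_column` can be loaded back with NumPy; a minimal sketch (assuming the default `arr_0` key that `np.savez_compressed` assigns to a positional array, and the shard path produced by the expansion step):

```python
# Sketch: read one two-column (user, item) shard back into memory.
import numpy as np

with np.load("/my_data_dir/ml-20mx16x32/trainx16x32_0.npz") as shard:
    pairs = shard["arr_0"]              # shape: (num_interactions, 2)
    users, items = pairs[:, 0], pairs[:, 1]
    print(pairs.shape, users.min(), users.max())
```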
50 changes: 11 additions & 39 deletions recommendation/pytorch/Dockerfile
@@ -1,44 +1,16 @@
FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04
ARG FROM_IMAGE_NAME=pytorch/pytorch:1.0.1-cuda10.0-cudnn7-runtime
FROM ${FROM_IMAGE_NAME}

# Set working directory
WORKDIR /mlperf
# Install Python dependencies
WORKDIR /workspace/recommendation

RUN apt-get update
RUN apt-get install -y git make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev cmake unzip

# pyenv Install
RUN git clone https://github.com/pyenv/pyenv.git .pyenv

ENV HOME /mlperf
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH

# Install Anaconda
RUN PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install anaconda3-5.0.1
RUN pyenv rehash
RUN pyenv global anaconda3-5.0.1

# Install PyTorch Requirements
ENV CMAKE_PREFIX_PATH "$(dirname $(which conda))/../"
RUN conda install -y numpy pyyaml mkl mkl-include setuptools cmake cffi typing
RUN conda install -c pytorch -y magma-cuda90
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install PyTorch
RUN mkdir github
WORKDIR /mlperf/github
RUN git clone --recursive https://github.com/pytorch/pytorch
WORKDIR /mlperf/github/pytorch
RUN git checkout v0.4.0
RUN git submodule update --init
RUN python setup.py clean
COPY negative_sampling_cpp ./negative_sampling_cpp
WORKDIR /workspace/recommendation/negative_sampling_cpp
RUN python setup.py install

# Install ncf-pytorch
WORKDIR /mlperf/ncf
# TODO: Change to clone github repo
ADD . /mlperf/ncf
RUN pip install -r requirements.txt
WORKDIR /mlperf/experiment
ENTRYPOINT ["/mlperf/ncf/run_and_time.sh"]
# Copy NCF code and build
WORKDIR /workspace/recommendation
COPY . .
58 changes: 28 additions & 30 deletions recommendation/pytorch/README.md
@@ -15,13 +15,13 @@ sudo apt-get install unzip curl
```
3. Checkout the MLPerf repo
```bash
git clone https://github.com/mlperf/reference.git
git clone https://github.com/mlperf/training.git
```

4. Install other python packages

```bash
cd reference/recommendation/pytorch
cd training/recommendation/pytorch
pip install -r requirements.txt
```

@@ -30,58 +30,56 @@ pip install -r requirements.txt
1. Checkout the MLPerf repo

```bash
git clone https://github.com/mlperf/reference.git
git clone https://github.com/mlperf/training.git
```
2. Install CUDA and Docker

```bash
source reference/install_cuda_docker.sh
source training/install_cuda_docker.sh
```

3. Get the docker image for the recommendation task
3. Build the docker image for the recommendation task

```bash
# Pull from Docker Hub
docker pull mlperf/recommendation:v0.5
# Build from Dockerfile
cd training/recommendation/pytorch
sudo docker build -t mlperf/recommendation:v0.6 .
```

or
### Steps to run and time

```bash
# Build from Dockerfile
cd reference/recommendation/pytorch
sudo docker build -t mlperf/recommendation:v0.5 .
```
#### Getting the expanded dataset

### Steps to download and verify data
The original ML-20M dataset is expanded to 16x more users and 32x more items using the code from the `data_generation` directory in the `mlperf/training` repo.
To obtain the expanded dataset, follow the instructions in the
`Running instructions for the recommendation benchmark` section of the README in the
`data_generation/fractal_graph_expansions` directory.

You can download and verify the dataset by running the `download_dataset.sh` and `verify_dataset.sh` scripts in the parent directory:
#### Run the Docker container

```bash
# Creates ml-20.zip
source ../download_dataset.sh
# Confirms the MD5 checksum of ml-20.zip
source ../verify_dataset.sh
nvidia-docker run --rm -it --ipc=host --network=host -v /my_data_dir:/data/cache mlperf/recommendation:v0.6 /bin/bash
```

### Steps to run and time

#### From Source
#### Generating the negative test samples

Run the `run_and_time.sh` script with an integer seed value between 1 and 5
Assuming the expanded dataset is visible in the container under the `/data/cache/ml-20mx16x32`
directory, run the following inside the container:

```bash
source run_and_time.sh SEED
```
python convert.py /data/cache/ml-20mx16x32
```

#### Docker Image
#### Running the training

```bash
sudo nvidia-docker run -i -t --rm --ipc=host \
--mount "type=bind,source=$(pwd),destination=/mlperf/experiment" \
mlperf/recommendation:v0.5 SEED
Assuming the expanded dataset, together with the generated test negative sample files, is
visible in the container under the `/data/cache/ml-20mx16x32` directory, run the following inside the container:

```
./run_and_time.sh <SEED>
```


# 3. Dataset/Environment
### Publication/Attribution
Harper, F. M. & Konstan, J. A. (2015), 'The MovieLens Datasets: History and Context', ACM Trans. Interact. Intell. Syst. 5(4), 19:1--19:19.