Recommendation working with expanded dataset (pytorch#251)
* Recommendation working with expanded dataset

This code is the proposal for the new reference implementation of the
recommendation benchmark. It works on the expanded ML-20M dataset
(16x more users, 32x more items) generated with the code
from the data_generation directory.

* switched to pytorch/pytorch container, README updated
lukmaz authored and nvpaulius committed Apr 4, 2019
1 parent c097c79 commit 7c2fb6f
Showing 18 changed files with 594 additions and 489 deletions.
30 changes: 30 additions & 0 deletions data_generation/fractal_graph_expansions/README.md
@@ -86,3 +86,33 @@ Other useful flags:
2) --max_dropout_rate, decreasing/increasing this value will result in
a denser/sparser generated data set. The default value is 0.99.

# Running instructions for the recommendation benchmark

### Steps to download and verify data

You can download and verify the dataset by running the `download_dataset.sh` and `verify_dataset.sh` scripts from the parent `recommendation` directory.
Assuming you want to store the downloaded dataset in the `/my_data_dir` directory:

1. Install `unzip` and `curl`.
2. Download and unzip `ml-20m.zip`:
```bash
mkdir /my_data_dir
cd /my_data_dir
# Creates ml-20m.zip
source <PATH_TO_RECOMMENDATION_DIR>/download_dataset.sh
# Confirms the MD5 checksum of ml-20m.zip
source <PATH_TO_RECOMMENDATION_DIR>/verify_dataset.sh
unzip ml-20m.zip
```

### Step to expand the dataset (x16 users, x32 items)

Assuming that the unzipped ML-20M dataset is stored under `/my_data_dir/ml-20m`,
go to `data_generation/fractal_graph_expansions` directory and run:

```bash
pip install -r requirements.txt
DATA_DIR=/my_data_dir ./data_gen.sh
```

The resulting dataset should be stored under `/my_data_dir/ml-20mx16x32`.
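
As a quick sanity check, the following sketch (not part of the benchmark scripts; it assumes the `trainxAxB_C.npz` / `testxAxB_C.npz` shard naming described by the `--output_prefix` flag of `run_expansion.py`) confirms that all expected shards were produced:

```python
# Sketch: verify that all expected expansion shards exist.
# Assumes 16 train and 16 test shards (one per user multiplier),
# named trainx16x32_<i>.npz / testx16x32_<i>.npz.
import os

data_path = "/my_data_dir/ml-20mx16x32"
missing = [
    f"{split}x16x32_{i}.npz"
    for split in ("train", "test")
    for i in range(16)
    if not os.path.exists(os.path.join(data_path, f"{split}x16x32_{i}.npz"))
]
print("all shards present" if not missing else f"missing shards: {missing}")
```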
9 changes: 9 additions & 0 deletions data_generation/fractal_graph_expansions/data_gen.sh
@@ -0,0 +1,9 @@
DATASET=${DATASET:-ml-20m}
USER_MUL=${USER_MUL:-16}
ITEM_MUL=${ITEM_MUL:-32}
DATA_DIR=${DATA_DIR:-/data/cache}

DATA_PATH=${DATA_DIR}/${DATASET}x${USER_MUL}x${ITEM_MUL}/

mkdir -p ${DATA_PATH}
python run_expansion.py \
    --input_csv_file ${DATA_DIR}/${DATASET}/ratings.csv \
    --num_row_multiplier ${USER_MUL} \
    --num_col_multiplier ${ITEM_MUL} \
    --output_prefix ${DATA_PATH}
17 changes: 12 additions & 5 deletions data_generation/fractal_graph_expansions/graph_expansion.py
@@ -134,6 +134,10 @@ def _compute_and_write_row_block(
    train_items_to_write = train_rows_to_write.getrow(k).indices
    test_items_to_write = test_rows_to_write.getrow(k).indices

    # For users with more than one test item, keep only the first one.
    if len(test_items_to_write) > 1:
      test_items_to_write = test_items_to_write[:1]

    num_train = train_items_to_write.shape[0]
    num_test = test_items_to_write.shape[0]

@@ -156,11 +160,14 @@

logging.info("Done producing data set row by row.")

shard_suffix = "_" + str(i).zfill(3)
util.serialize_to_file(
all_train_items_to_write, file_name=train_indices_out_path + shard_suffix)
util.serialize_to_file(
all_test_items_to_write, file_name=test_indices_out_path + shard_suffix)
util.savez_two_column(
all_train_items_to_write,
row_offset=(i * right_matrix.shape[0]),
file_name=train_indices_out_path + ("_%d" % i))
util.savez_two_column(
all_test_items_to_write,
row_offset=(i * right_matrix.shape[0]),
file_name=test_indices_out_path + ("_%d" % i))

num_cols = rows_to_write.shape[1]
metadata = SparseMatrixMetadata(num_interactions=num_interactions,
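
The `row_offset` argument shifts each shard's local user indices into the global id space of the expanded matrix. A minimal sketch of that mapping (the names below are illustrative; `users_per_shard` corresponds to `right_matrix.shape[0]` in the call above):

```python
# Sketch of the shard-to-global user id mapping implied by
# row_offset = i * right_matrix.shape[0] above.
def global_user_id(shard_index: int, local_user: int, users_per_shard: int) -> int:
    """Return the expanded-dataset user id for a row inside shard `shard_index`."""
    return shard_index * users_per_shard + local_user
```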
7 changes: 7 additions & 0 deletions data_generation/fractal_graph_expansions/requirements.txt
@@ -0,0 +1,7 @@
numpy
absl-py
pandas
scipy
scikit-image
sklearn
tensorflow
18 changes: 11 additions & 7 deletions data_generation/fractal_graph_expansions/run_expansion.py
@@ -51,9 +51,11 @@
"matrix will be multiplied.")
flags.DEFINE_string("output_prefix",
"",
"Prefix to the path of the pickle files that will be "
"produced. output_prefix_train.pkl and "
"output_prefix_test.pkl will be created.")
"Prefix to the path of the files that will be "
"produced. output_prefix/trainxAxB_C.npz and "
"output_prefix/testxAxB_C.npz will be created, "
"where A is num_row_multiplier, B is num_col_multiplier, "
"and C goes from 0 to (num_row_multiplier - 1).")
flags.DEFINE_integer("random_seed",
0,
"Random seed for all random operations.")
@@ -171,10 +173,12 @@ def main(_):
  train_test_ratings_matrix = train_test_ratings_matrix.tocoo()
  logging.info("Done creating signed train/test matrix.")

  output_train_file = FLAGS.output_prefix + "_train.pkl"
  output_test_file = FLAGS.output_prefix + "_test.pkl"
  output_train_file_metadata = FLAGS.output_prefix + "_train_metadata.pkl"
  output_test_file_metadata = FLAGS.output_prefix + "_test_metadata.pkl"
  output_train_file = (FLAGS.output_prefix + "trainx" +
                       str(reduced_num_rows) + "x" + str(reduced_num_cols))
  output_test_file = (FLAGS.output_prefix + "testx" +
                      str(reduced_num_rows) + "x" + str(reduced_num_cols))
  output_train_file_metadata = None
  output_test_file_metadata = None

  logging.info("Creating synthetic train data set and dumping to %s.",
               output_train_file)
11 changes: 11 additions & 0 deletions data_generation/fractal_graph_expansions/util.py
@@ -66,6 +66,17 @@ def serialize_to_file(obj, file_name, append=False):
    pickle.dump(obj, output_file)
  logging.info("Done serializing to file %s.", file_name)

def savez_two_column(matrix, row_offset, file_name, append=False):
  """Save matrix as a compressed two-column (user, item) .npz file."""
  logging.info("Saving obj to file %s in two-column .npz format.", file_name)
  tc = []
  for u, items in enumerate(matrix):
    user = row_offset + u
    for item in items:
      tc.append([user, item])

  np.savez_compressed(file_name, np.asarray(tc))
  logging.info("Done saving to file %s.", file_name)

def sorted_product_set(array_a, array_b):
"""Compute the product set of array_a and array_b and sort it."""
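
For reference, a shard written by `savez_two_column` can be loaded back with NumPy; a minimal sketch (assuming the default `arr_0` key that `np.savez_compressed` assigns to a positional array, and the shard path produced by the expansion step):

```python
# Sketch: read one two-column (user, item) shard back into memory.
import numpy as np

with np.load("/my_data_dir/ml-20mx16x32/trainx16x32_0.npz") as shard:
    pairs = shard["arr_0"]              # shape: (num_interactions, 2)
    users, items = pairs[:, 0], pairs[:, 1]
    print(pairs.shape, users.min(), users.max())
```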
50 changes: 11 additions & 39 deletions recommendation/pytorch/Dockerfile
@@ -1,44 +1,16 @@
FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04
ARG FROM_IMAGE_NAME=pytorch/pytorch:1.0.1-cuda10.0-cudnn7-runtime
FROM ${FROM_IMAGE_NAME}

# Set working directory
WORKDIR /mlperf
# Install Python dependencies
WORKDIR /workspace/recommendation

RUN apt-get update
RUN apt-get install -y git make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev cmake unzip

# pyenv Install
RUN git clone https://github.com/pyenv/pyenv.git .pyenv

ENV HOME /mlperf
ENV PYENV_ROOT $HOME/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH

# Install Anaconda
RUN PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install anaconda3-5.0.1
RUN pyenv rehash
RUN pyenv global anaconda3-5.0.1

# Install PyTorch Requirements
ENV CMAKE_PREFIX_PATH "$(dirname $(which conda))/../"
RUN conda install -y numpy pyyaml mkl mkl-include setuptools cmake cffi typing
RUN conda install -c pytorch -y magma-cuda90
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install PyTorch
RUN mkdir github
WORKDIR /mlperf/github
RUN git clone --recursive https://github.com/pytorch/pytorch
WORKDIR /mlperf/github/pytorch
RUN git checkout v0.4.0
RUN git submodule update --init
RUN python setup.py clean
COPY negative_sampling_cpp ./negative_sampling_cpp
WORKDIR /workspace/recommendation/negative_sampling_cpp
RUN python setup.py install

# Install ncf-pytorch
WORKDIR /mlperf/ncf
# TODO: Change to clone github repo
ADD . /mlperf/ncf
RUN pip install -r requirements.txt
WORKDIR /mlperf/experiment
ENTRYPOINT ["/mlperf/ncf/run_and_time.sh"]
# Copy NCF code and build
WORKDIR /workspace/recommendation
COPY . .
58 changes: 28 additions & 30 deletions recommendation/pytorch/README.md
@@ -15,13 +15,13 @@ sudo apt-get install unzip curl
```
3. Checkout the MLPerf repo
```bash
git clone https://github.com/mlperf/reference.git
git clone https://github.com/mlperf/training.git
```

4. Install other python packages

```bash
cd reference/recommendation/pytorch
cd training/recommendation/pytorch
pip install -r requirements.txt
```

@@ -30,58 +30,56 @@ pip install -r requirements.txt
1. Checkout the MLPerf repo

```bash
git clone https://github.com/mlperf/reference.git
git clone https://github.com/mlperf/training.git
```
2. Install CUDA and Docker

```bash
source reference/install_cuda_docker.sh
source training/install_cuda_docker.sh
```

3. Get the docker image for the recommendation task
3. Build the docker image for the recommendation task

```bash
# Pull from Docker Hub
docker pull mlperf/recommendation:v0.5
# Build from Dockerfile
cd training/recommendation/pytorch
sudo docker build -t mlperf/recommendation:v0.6 .
```

or
### Steps to run and time

```bash
# Build from Dockerfile
cd reference/recommendation/pytorch
sudo docker build -t mlperf/recommendation:v0.5 .
```
#### Getting the expanded dataset

### Steps to download and verify data
The original ML-20M dataset is expanded to 16x more users and 32x more items using the code from the `data_generation` directory in the `mlperf/training` repo.
To obtain the expanded dataset, follow the instructions in the
`Running instructions for the recommendation benchmark` section of the README in the
`data_generation/fractal_graph_expansions` directory.

You can download and verify the dataset by running the `download_dataset.sh` and `verify_dataset.sh` scripts in the parent directory:
#### Run the Docker container

```bash
# Creates ml-20.zip
source ../download_dataset.sh
# Confirms the MD5 checksum of ml-20.zip
source ../verify_dataset.sh
nvidia-docker run --rm -it --ipc=host --network=host -v /my_data_dir:/data/cache mlperf/recommendation:v0.6 /bin/bash
```

### Steps to run and time

#### From Source
#### Generating the negative test samples

Run the `run_and_time.sh` script with an integer seed value between 1 and 5
Assuming the expanded dataset is visible in the container under the `/data/cache/ml-20mx16x32`
directory, run the following inside the container:

```bash
source run_and_time.sh SEED
```
python convert.py /data/cache/ml-20mx16x32
```

#### Docker Image
#### Running the training

```bash
sudo nvidia-docker run -i -t --rm --ipc=host \
--mount "type=bind,source=$(pwd),destination=/mlperf/experiment" \
mlperf/recommendation:v0.5 SEED
Assuming the expanded dataset, together with the generated test negative sample files, is
visible in the container under the `/data/cache/ml-20mx16x32` directory, run the following inside the container:

```
./run_and_time.sh <SEED>
```


# 3. Dataset/Environment
### Publication/Attribution
Harper, F. M. & Konstan, J. A. (2015), 'The MovieLens Datasets: History and Context', ACM Trans. Interact. Intell. Syst. 5(4), 19:1--19:19.