
[QST] thrust::system::system_error #1721

Closed
daniellengyel opened this issue Aug 7, 2023 · 9 comments
Labels
question Further information is requested

Comments

@daniellengyel

Hi, I am running into an issue when running raft::spatial::knn with a lot of data. With n_samples = 10^6, n_features = 100, n_search_items = 1000, and k = 10, I get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorInvalidDeviceFunction: invalid device function
Aborted

I do not run into the same issue when n_samples = 10^5.

I also do not get this issue when I run raft through the cuML library, even when using the larger amount of data. Would you have any suggestions?

@daniellengyel daniellengyel added the question Further information is requested label Aug 7, 2023
@cjnolet
Member

cjnolet commented Aug 7, 2023

@daniellengyel,

Small side note: we have deprecated the APIs in raft::spatial in favor of raft::neighbors. The corresponding API can be found in raft::neighbors::brute_force::knn.

@benfred please correct me if this is wrong, but I believe we would want the matrix_idx template parameter here to be large enough to store the total number of bytes (which would be 4 billion in the case of a 10M x 100 matrix of floats).
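As a rough sketch of why the index width matters (plain Python arithmetic, not RAFT's actual template machinery): a 10M x 100 float32 matrix occupies 4 billion bytes, which no longer fits in a signed 32-bit index:

```python
# Sketch: total byte count of a large row-major float32 matrix vs. the int32 range.
INT32_MAX = 2**31 - 1

n_rows, n_cols, itemsize = 10_000_000, 100, 4  # 10M x 100 float32
total_bytes = n_rows * n_cols * itemsize       # 4,000,000,000

print(total_bytes > INT32_MAX)                 # True: a 32-bit index overflows

# What that product becomes after signed 32-bit wraparound:
wrapped = ((total_bytes + 2**31) % 2**32) - 2**31
print(wrapped)                                 # -294967296
```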

@daniellengyel
Author

Good to know — I changed it to call neighbors::detail::brute_force_knn_impl. The issue persists; since I am calling brute_force_knn_impl directly, there does not seem to be a comparable matrix_idx template parameter.

@cjnolet
Member

cjnolet commented Aug 8, 2023

@daniellengyel can you share a minimal reproducible code example? It will help us pinpoint whether this is a legitimate bug in RAFT or whether the example can be modified to make this work for you.

@daniellengyel
Author

daniellengyel commented Aug 8, 2023

Of course.
I run the following Python code to generate the data.

import numpy as np

num_samples = int(1e6)
num_features = int(1e2)
num_search = int(1e1)
X_r = np.random.normal(size=(num_samples, num_features))
Xeval_r = np.random.normal(size=(num_search, num_features))

with open("trainX.npy", "wb") as f:
    np.save(f, X_r)

with open("evalX.npy", "wb") as f:
    np.save(f, Xeval_r)
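A quick note on the generated files (an assumption worth verifying, since it drives the word_size branch in the C++ loader below): np.random.normal returns float64, so the .npy files store 8-byte elements:

```python
import numpy as np

# np.random.normal yields float64 by default, i.e. 8-byte elements,
# which is why the loader hits its word_size == 8 branch and casts to float.
X = np.random.normal(size=(4, 3))
print(X.dtype)           # float64
print(X.dtype.itemsize)  # 8
```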

To load the data I use the cnpy .cpp and .h files from here.
My CMake file (I run this after running ./build libraft in a copy of the RAFT repository; in cmake/modules/ConfigureCUDA.cmake I had to remove all -Werror flags to get it to build):

cmake_minimum_required(VERSION 3.23.1 FATAL_ERROR)

project(test_raft LANGUAGES CXX CUDA)

enable_language(CUDA)

# import cnpy
find_package(ZLIB REQUIRED)
include_directories(${ZLIB_INCLUDE_DIRS})
add_library(cnpy "loading_lib/cnpy.cpp")
target_link_libraries(cnpy ${ZLIB_LIBRARIES})
include_directories(loading_lib)

# add raft (CUDAToolkit provides the CUDA::cublas target linked below)
find_package(CUDAToolkit REQUIRED)
find_package(raft REQUIRED)

# get exec
add_executable(basic_example  test.cu)
target_link_libraries(basic_example PRIVATE raft::raft CUDA::cublas cnpy)

My main file:

#include <cstdint>
#include <iostream>
#include <vector>
#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdarray.hpp>

#include <raft/util/cudart_utils.hpp>

#include <raft/neighbors/brute_force.cuh>

//loading data
#include <cnpy.h>


void cast2float(double* src_ptr, float* dst_ptr, int n_rows, int n_cols){
    for(int i=0; i < n_rows * n_cols; i++){dst_ptr[i] = (float)(src_ptr[i]);}
}


// seems like we need at least 5 search items. Otherwise cuda error...
int main()
{
  raft::device_resources handle;

  // load train data
  cnpy::NpyArray trainX_npy = cnpy::npy_load("../trainX.npy"); 
  cnpy::NpyArray evalX_npy = cnpy::npy_load("../evalX.npy");

  int n_samples = trainX_npy.shape[0];
  int n_features = trainX_npy.shape[1];
  int n_evals = evalX_npy.shape[0];

  std::cout << "nsamples: " << n_samples << "; n_features: " << n_features << "; n_evals: " << n_evals << std::endl;
  float* trainX_data;
  float* evalX_data;

  if(trainX_npy.word_size == 8){
    trainX_data = new float[n_samples * n_features];
    evalX_data = new float[n_evals * n_features];

    cast2float(trainX_npy.data<double>(), trainX_data, n_samples, n_features);
    cast2float(evalX_npy.data<double>(), evalX_data, n_evals, n_features);
  } else {
    trainX_data = trainX_npy.data<float>();
    evalX_data = evalX_npy.data<float>();
  }


  // upload to gpu
  auto trainX_gpu = raft::make_device_matrix<float, int>(handle, n_samples, n_features);
  auto evalX_gpu = raft::make_device_matrix<float, int>(handle, n_evals, n_features);

  raft::copy(trainX_gpu.data_handle(), trainX_data, (int)(n_samples*n_features), handle.get_stream());
  raft::copy(evalX_gpu.data_handle(), evalX_data, (int)(n_evals*n_features), handle.get_stream());

  if(trainX_npy.word_size == 8){
    delete[] trainX_data;
    delete[] evalX_data;
  }

  // conf knn
  int n_neighbors = 10; 
  raft::distance::DistanceType d_metric = raft::distance::DistanceType::Canberra;
  

  std::vector<float*> inp_vec;
  inp_vec.push_back(trainX_gpu.data_handle());

  std::vector<int> num_inp_vec;
  num_inp_vec.push_back((int)(n_samples));

  std::vector<int>* translations = nullptr;

  auto output_dist = raft::make_device_matrix<float, int>(handle, n_evals, n_neighbors);
  auto output_idx = raft::make_device_matrix<int, int>(handle, n_evals, n_neighbors);

  // run knn
   raft::neighbors::detail::brute_force_knn_impl(handle,
                     inp_vec,
                     num_inp_vec,
                     (int)n_features,
                     evalX_gpu.data_handle(),
                     (int)n_evals,
                     output_idx.data_handle(),
                     output_dist.data_handle(),
                     n_neighbors,
                     true, // rowMajorIndex
                     true,  // rowMajorQuery
                     translations,
                     d_metric); 
}

@daniellengyel
Author

daniellengyel commented Aug 8, 2023

I also commented out the Comparator struct in Comparators.cuh. It seemed like the compiler had difficulties properly loading the functions from cuda_fp16.h.

@daniellengyel
Author

daniellengyel commented Aug 9, 2023

I have not had any luck fixing the issue. I located the error as coming from

thrust::for_each(resource::get_thrust_policy(handle),

Other people ran into a similar issue when using Thrust together with CUB, like here and here. That seems to be mainly a namespace conflict, which does not appear to be the case here.

Another suggestion for solving it is given here: use __host__ __device__ instead of only __device__. However, that did not change anything for me.

I am running with nvcc --version = 11.4. Since it seems to be mainly a Thrust issue I am unsure whether updating will help, but I might give it a try. Clearly everything works when using RAFT through cuML, so I am not sure what is happening here.

@daniellengyel
Author

daniellengyel commented Aug 9, 2023

I ran it in a docker container based on nvidia/cuda:12.2.0-devel-ubuntu20.04 and it seems to have done the trick.

When I replaced the thrust::for_each with my own kernel, passing it the same lambda as in the original thrust::for_each call, I got a Reason=cudaErrorInvalidDeviceFunction: invalid device function error. However, I was not able to reproduce the error with a similar stripped-down set-up in a separate file.

Still, it seems like other people ran into similar issues here, and upgrading resolved their issue. It seems like it did the same for me.

@cjnolet
Member

cjnolet commented Aug 22, 2023

Sorry @daniellengyel, I meant to check up on, and respond to, this issue but I just realized it had been closed.

It sounds like you may have found a fix for this issue? Just want to make sure we are providing a good experience on the RAFT side.

@daniellengyel
Author

daniellengyel commented Aug 26, 2023

No worries! I got everything working in the end. I was able to add my own distance functions, plus a feature importance vector and an indicator passed to the distance_op.core function to let it know which feature it is working on. When I have time I will open a pull request.
