Releases: JuliaGPU/CUDA.jl

v5.3.2

26 Apr 13:59

CUDA v5.3.2

Diff since v5.3.1

Merged pull requests:

Closed issues:

  • CuArrays don't seem to display correctly in VS code (#875)
  • Task scheduling can result in delays when synchronizing (#1525)
  • Docs: add example on task-based parallelism with explicit synchronization (#1566)
  • Exception output from many threads is not helpful (#1780)
  • Autodetect external profiler (#2176)
  • LazyInitialized is not GC-safe (#2216)
  • Track CuArray stream usage (#2236)
  • Improve cross-device usage (#2323)
  • CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337)
  • Improve error message when assigning real-valued array with complex numbers (#2341)
  • @device_code_sass broken (#2343)
  • Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
  • @gcsafe_ccall breaks inlining of ccall wrappers (#2347)

v5.3.1

19 Apr 07:16
9c9a05f

CUDA v5.3.1

Diff since v5.3.0

Merged pull requests:

Closed issues:

  • Missing CUBLASLt wrappers (#2322)
  • error when switching device (#2323)
  • v5.3.0: regression in Zygote performance (#2333)

v5.3.0

12 Apr 14:27
5da4d1d

CUDA v5.3.0

Diff since v5.2.0

Merged pull requests:

Closed issues:

  • Failed to compile PTX code when using NSight on Win11 (#1601)
  • sortperm fails with dims keyword (#2061)
  • NVTX-related segfault on Windows under compute-sanitizer (#2204)
  • Inverse Complex-to-Real FFT allocates GPU memory (#2249)
  • cuDNN not available for your platform (#2252)
  • Cannot reset CuArray to zero (#2257)
  • Cannot take gradient of sort on 2D CuArray (#2259)
  • Multi-threaded code hanging forever with Julia 1.10 (#2261)
  • CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
  • Adjoint not supported on Diagonal arrays (#2275)
  • Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
  • Release v5.3? (#2283)
  • Wrap CUDSS? (#2287)
  • Bug concerning broadcast between device array and unified array (#2289)
  • StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292)
  • BUG: sortperm! seems to perform much slower than it should (#2293)
  • Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296)
  • BFloat16 support broken on Julia 1.11 (#2306)
  • does not emit line info for debugging/profiling (#2312)
  • Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 (#2313)
  • Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)

v4.4.2

04 Apr 09:27

CUDA v4.4.2

Diff since v4.4.1

Merged pull requests:

Closed issues:

  • Element-wise conversion to Duals (#127)
  • IDEA: CuHostArray (#28)
  • Make Ref pass by-reference (#267)
  • Failed to compile PTX code when using NSight on Win11 (#1601)
  • view(data, idx) boundschecking is disproportionately expensive (#1678)
  • [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
  • Trouble using nsight systems for profiling CUDA in Julia (#1779)
  • dlopen("libcudart") results in duplicate libraries (#1814)
  • Support for JLD2 (#1833)
  • Windows Defender mis-labels artifacts as threat (#1836)
  • Support Cholesky factorization of CuSparseMatrixCSR (#1855)
  • Runtime not re-selected after driver upgrade (#1877)
  • Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
  • Cannot precompile GPU code with PrecompileTools (#2006)
  • Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
  • CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
  • StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
  • Support for LinearAlgebra.pinv (#2070)
  • PTX ISA 8.1 support (#2080)
  • Segmentation fault when importing CUDA (#2083)
  • "No system CUDA driver found" on NixOS (#2089)
  • CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
  • Miss...

v5.2.0

18 Jan 10:44
5876e9d

CUDA v5.2.0

Diff since v5.1.2

Merged pull requests:

Closed issues:

  • Trouble using nsight systems for profiling CUDA in Julia (#1779)
  • Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
  • Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
  • First test for Julia/CUDA with 15 failures (#2158)
  • Update to CUTENSOR 2.0 (#2174)
  • Tests fail for CUDA#master (#2223)
  • Test failures on Nvidia GH200 (#2227)
  • mul! should support strided outputs (#2230)
  • Please add support for older cuda versions (cuda 8 and older) (#2231)
  • NSight Compute: prevent API calls during precompilation (#2233)
  • Integrated profiler: detect lack of permissions (#2237)

v5.1.2

07 Jan 10:34
fc99b1d

CUDA v5.1.2

Diff since v5.1.1

Merged pull requests:

Closed issues:

  • More informative errors when parameter size is too big (#2119)
  • Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171)
  • Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
  • Support for combining duplicate elements in sparse matrices (#2185)
  • Interactive sessions: periodically trim the memory pool (#2190)
  • Broadcast does not preserve buffer type (#2191)
  • CUDA doesn't precompile on Julia nightly/1.11 (#2195)
  • Latest julia: UndefVarError: make_seed not defined in Random (#2198)
  • CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
  • Most recent package versions not supported on CUDA.jl (#2212)
  • Testing of CUDA fails (#2222)
  • --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)

v5.1.1

20 Nov 11:38
ffcd7e3

CUDA v5.1.1

Diff since v5.1.0

Merged pull requests:

Closed issues:

  • High CPU load during GPU synchronization (#2161)

v5.1.0

07 Nov 15:10

CUDA v5.1.0

CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.

Diff since v5.0.0

Merged pull requests:

Closed issues:

  • Element-wise conversion to Duals (#127)
  • IDEA: CuHostArray (#28)
  • Make Ref pass by-reference (#267)
  • view(data, idx) boundschecking is disproportionately expensive (#1678)
  • [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
  • dlopen("libcudart") results in duplicate libraries (#1814)
  • Support for JLD2 (#1833)
  • Windows Defender mis-labels artifacts as threat (#1836)
  • Support Cholesky factorization of CuSparseMatrixCSR (#1855)
  • Runtime not re-selected after driver upgrade (#1877)
  • Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
  • Cannot precompile GPU code with PrecompileTools (#2006)
  • CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
  • PTX ISA 8.1 support (#2080)
  • Segmentation fault when importing CUDA (#2083)
  • "No system CUDA driver found" on NixOS (#2089)
  • CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
  • Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
  • Binaries for Jetson (#2105)
  • Minimum/maximum of array of NaNs is infinity (#2111)
  • Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
  • [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
  • Unable to allocate unified memory buffers (#2120)
  • CUDA 12.3 has been released (#2122)
  • atomic min, max for Float32 and Float64 (#2129)
  • Native profiler output is limited to around 100 columns when printing to a file (#2130)
  • LLVM generates max.NaN which only works on sm_80 (#2148)
  • Unified memory-related error on Tegra T194 (#2149)
  • Errors on sm_61 (#2150)

v5.0.0

19 Sep 08:39
2fa6572

CUDA v5.0.0

Blog post: https://info.juliahub.com/cuda-jl-5-0-changes

This is a breaking release, but the breaking changes are minimal (see the blog post for details):

  • Julia 1.8 is now required, and only CUDA 11.4+ is supported
  • selection of local toolkits has changed slightly

Diff since v4.4.1

Merged pull requests:

Closed issues:

  • StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
  • Support for LinearAlgebra.pinv (#2070)

v4.4.1

25 Aug 20:24

CUDA v4.4.1

Diff since v4.4.0

Closed issues:

  • CUDA driver device support does not match toolkit (#70)
  • Launching kernels should not allocate (#66)
  • sync_threads() appears to not be sync'ing threads (#61)
  • Exception when using CuArrays with Flux (#129)
  • Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45)
  • Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538)
  • Improve 'VS C++ redistributable' error message (#764)
  • CUSPARSE does not support reductions (#1406)
  • CUDA test failed (#1690)
  • Type constructor in broadcast doesn't compile (#1761)
  • accumulate(+) gives different results for CuArray compared to Array. (#1810)
  • Compat driver: preload all libraries (#1859)
  • Stream synchronization is slow when waiting on the event from CUDA (#1910)
  • cuDNN: Store convolution algorithm choice to disk. (#1947)
  • Disable 'No CUDA-capable device found' error log (#1955)
  • CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977)
  • Memory allocations during in-place sparse matrix-vector multiplication (#1982)
  • CUSPARSE.sum_dim1 sums the absolute values of elements (#1983)
  • Update to CUDA 12.2 (#1984)
  • unsafe_wrap fails on zero element CuArrays (#1985)
  • rand in kernel works in a deterministic way (#2008)
  • Scalar indexing with CuArray * ReshapedArray{SubArray{CuArray}} (#2009)
  • volumerhs performance regression (#2010)
  • CuSparseMatrix constructors allocate too much memory? (#2015)
  • Native profiler using CUPTI (#2017)
  • libLLVM-15jl.so (#2018)
  • "symbol multiply defined" error (#2021)
  • Confusion on row major vs column major (#2023)
  • Printing of CuArrays gives zeros or random numbers (#2033)
  • sortperm! fails when output is UInt vector (#2046)
  • Re-introduce spinning loop before nonblocking synchronization (#2057)

Merged pull requests: