Releases: JuliaGPU/CUDA.jl
v3.3.5
CUDA v3.3.5
Closed issues:
- Integer division error for the product of sparse times empty matrices (#962)
- Bad conversion from QR to CuArray (#969)
- Errors during installation test (#1004)
- Be explicit about imports (#1028)
- Exponentiation with constants can produce bad GPU code compared to the CPU (#1031)
- `rem` uses wrong intrinsic (#1040)
- test CUDA fails on gpuarrays\reductions/minimum maximum (#1043)
- Broadcasted type conversion on literal value doesn't work (#1044)
- CUDA overrides somehow screwing up customized printing? (#1055)
- Is it possible to copy any data into GPU via recursive `CuDeviceArray` construction? (#1057)
- CUDA doesn't compile after upgrade to Julia 1.6.2 (#1065)
- Timing discrepancy between CUDA.@time and BenchmarkTools for Flux model (#1067)
- cannot convert range to CuArray (#1070)
- Thread safety issue with `gemv!` (#1072)
- CuSparseMatrixCSC conversion errors (#1075)
- cublasHgemmStridedBatched (#1076)
- ERROR: UndefKeywordError: keyword argument elements not assigned (#1077)
- Support for generating Float16 random numbers (#1081)
- Illegal memory access during complex exponential with large imaginary part as exponent (#1085)
- "Error: CUDA.jl does not yet support CUDA with ptxas 11.3.109" when using "JULIA_CUDA_USE_BINARYBUILDER=false" (#1089)
Merged pull requests:
- Add support for unified arrays. (#1023) (@maleadt)
- Look for libcuda in more places. (#1030) (@maleadt)
- Detect common integer exponentiations and handle them directly. (#1033) (@maleadt)
- Allow strided inputs to various library functions. (#1038) (@maleadt)
- Use correct intrinsics for rem (#1041) (@simonbyrne)
- update Package Manager link (#1052) (@ehgus)
- Update manifest (#1054) (@github-actions[bot])
- Add test for math_mode (#1056) (@kshyatt)
- Streamline atomics. (#1059) (@maleadt)
- Add support for device capability-dependent code. (#1060) (@maleadt)
- Adapt to GPUArrays changes. (#1061) (@maleadt)
- Add special constructors to work around Base AbstractQ size weirdness. (#1063) (@maleadt)
- Update manifest (#1064) (@github-actions[bot])
- Small allocator improvements (#1068) (@maleadt)
- Latency improvements (bis) (#1069) (@maleadt)
- lib: cusparse: fix #962 (#1073) (@thazhemadam)
- Make handle cache thread-safe. (#1074) (@maleadt)
- Bump GPUCompiler. (#1079) (@maleadt)
- add support for half-precision gemm (#1080) (@bjarthur)
- Extend and switch to the new CUDA RNG (#1082) (@maleadt)
- cusparse: fix conversion from sparse matrix to dense matrix (#1083) (@maleadt)
- Support/bump for CUDA 11.4.1 and CUDNN 8.2.2 (#1084) (@maleadt)
- Use sincos from libdevice to perform illegal global load. (#1086) (@maleadt)
- Bump GPUCompiler; use our own opt pipeline. (#1087) (@maleadt)
- Update manifest (#1090) (@github-actions[bot])
- Backports for 3.3.5 (#1091) (@maleadt)
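
Several issues fixed in this release (e.g. the CUDA.@time vs. BenchmarkTools discrepancy in #1067) stem from GPU operations executing asynchronously. A minimal sketch of the pitfall, assuming a CUDA-capable device is available:

```julia
using CUDA, BenchmarkTools

x = CUDA.rand(1024, 1024)

# GPU kernels launch asynchronously, so without synchronization @btime
# mostly measures launch overhead rather than actual execution time.
@btime sin.($x)

# Forcing synchronization yields wall-clock timings comparable to
# CUDA.@time, which synchronizes (and reports GPU allocations) itself.
@btime CUDA.@sync sin.($x)
CUDA.@time sin.(x)
```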
v3.3.4
v3.3.3
CUDA v3.3.3
Merged pull requests:
- Adapt to LLVM changes. (#1022) (@maleadt)
- Update manifest (#1029) (@github-actions[bot])
- just some simple printing tests (#1032) (@kshyatt)
- Test for is_capturing (#1034) (@kshyatt)
- Tests for buffer printing (#1035) (@kshyatt)
- Make it possible to change the pool alloc and handle types. (#1036) (@maleadt)
- Backports for 3.3 (#1037) (@maleadt)
v3.3.2
CUDA v3.3.2
Closed issues:
- Missing artifacts errors (#1003)
- Relax restriction on types allowed in kernels? (#1005)
- PPC: Atomic{Float64} is not supported (#1008)
- Unexpected result in combination with Zygote.gradient() (#1019)
- Both ExprTools and LLVM export "parameters"; uses of it in module CUDA must be qualified (#1025)
Merged pull requests:
- Fixes for artifact loading. (#1006) (@maleadt)
- dlopen CUBLAS before CUTENSOR. (#1007) (@maleadt)
- Use a plain integer to keep track of pool last use time. (#1009) (@maleadt)
- More fixes to artifact discovery. (#1010) (@maleadt)
- add custom structs tutorial (#1011) (@jw3126)
- big mapreduce performance (#1012) (@xaellison)
- Fixes for Julia 1.7 (#1013) (@maleadt)
- Update manifest (#1014) (@github-actions[bot])
- Remove memory pools (#1015) (@maleadt)
- Move refcounting to an array storage type (#1016) (@maleadt)
- Remove unneeded disambiguation method. (#1017) (@maleadt)
- Simplify context validity check. (#1018) (@maleadt)
- Improve LazyInitialized (#1020) (@maleadt)
- More allocator clean-ups (#1021) (@maleadt)
- CUDA 11.4 (#1024) (@maleadt)
- Only import from ExprTools what we need. (#1026) (@maleadt)
- Backports release 3.3 (#1027) (@maleadt)
v3.3.1
CUDA v3.3.1
Closed issues:
- Reclaim with stream-ordered allocator (#952)
- possible hanging with `CUDA.@profile`? (#961)
- Upgrading from v3.2.1 to v3.3.0 broke my installation (#970)
- Calls to `has_cudnn` running on wrong `CuDevice`? (#978)
- Test does not run on MIT Supercloud after upgrading to 3.3.0 (#980)
- Performance issue with complicated loops in function (#984)
- Is it possible to set cache config in CUDA.jl? (#988)
- @atomic should perform type conversions (#989)
- Compatible NVIDIA driver but still got compatibility warning (#1001)
Merged pull requests:
- Update manifest (#971) (@github-actions[bot])
- Fix disambiguation of CUDA 11.1 using CUSOLVER. (#972) (@maleadt)
- Simplify initialization helper macro. (#973) (@maleadt)
- Move at-typed_ccall to LLVM.jl. (#976) (@maleadt)
- Replace workspace macro with function (#981) (@maleadt)
- Implement and improve reclaim for the stream-ordered allocator (#983) (@maleadt)
- Bump GPUCompiler to fix WMMA test issue. (#985) (@maleadt)
- Rework memoization (#986) (@maleadt)
- Fixes for CUBLAS/CUDNN logging (#987) (@maleadt)
- Perform type conversions in at-atomic. (#990) (@maleadt)
- Don't initialize the API when setting log callbacks. (#992) (@maleadt)
- Create a helper for lazy, thread-safe initialization. (#993) (@maleadt)
- Optimize library handles (#996) (@maleadt)
- Optimize PerDevice for abstract element types. (#997) (@maleadt)
- Update manifest (#999) (@github-actions[bot])
- Replace PerDevice with context-keyed dictionaries. (#1000) (@maleadt)
- Improve launch latency (#1002) (@maleadt)
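
The `@atomic` conversion change (#989, fixed by #990) lets atomic updates accept values whose type differs from the array's element type. A hedged sketch of the behavior, assuming a CUDA-capable device:

```julia
using CUDA

function increment!(acc)
    # The literal 1 is an Int; with the #990 fix, CUDA.@atomic converts
    # it to the accumulator's element type (Float32) instead of erroring.
    CUDA.@atomic acc[1] += 1
    return
end

acc = CUDA.zeros(Float32, 1)
@cuda threads=32 increment!(acc)
# All 32 threads increment atomically, so acc[1] ends up as 32.0f0.
```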
v3.3.0
CUDA v3.3.0
Closed issues:
- PTX code missing DWARF debug information (#72)
- Suggestion - Disable AbstractArray indexing fallback by default (#178)
- Support isbits Union Arrays (#103)
- Missing norm(x, p) kernel (#84)
- CUDA enhanced compatibility (#832)
- Support for CuSparseMatrixCSC{Float16} x CuVector{Float16} (#849)
- CuArray to zeroth power returns Matrix (#897)
- Fatal errors during sorting tests (#916)
- Error when computing reductions into a view with `reduce_blocks > 1` (#919)
- CUDA FFT plan application runs Out of Memory in Pluto (#926)
- `has_cuda()` errors in CPU-only environments on master (#928)
- Race condition when computing `mean!` of large arrays? (#929)
- Supporting union bits types (#934)
- test failing in device/intrinsics (#942)
- Memory allocation fails for multi-GPU (#943)
- Scalar operations when using output of cu(::OffsetArray) (#954)
- Quicksort kernel does not cope with reduced threads (#955)
- CUDA.jl cannot find installed CUPTI libraries with local installation on linux (#956)
- Error for complex sparse-dense Matrix-vector multiplication (#958)
- "using CUDA" gives error in type inference of Ref{Bool} (#965)
Merged pull requests:
- Override outlined throw functions. (#874) (@maleadt)
- Enable location and debug info. (#891) (@maleadt)
- Compile using the toolkit, not the driver. (#892) (@maleadt)
- Rework timings (#898) (@maleadt)
- Fix #849, allow CUSPARSE to use F16 (#904) (@kshyatt)
- Add Windows CI. (#907) (@maleadt)
- Split test for better parallelization. (#908) (@maleadt)
- Update manifest (#909) (@github-actions[bot])
- Improve package latency. (#910) (@maleadt)
- Just some missing tests for CUBLAS (#911) (@kshyatt)
- Fix bug and add tests for iamax/iamin (#913) (@kshyatt)
- Fix profiler initialization and exception handling. (#914) (@maleadt)
- Add a show method for devices(). (#915) (@maleadt)
- Fix update of CUFFT handle. (#921) (@maleadt)
- Update manifest (#922) (@github-actions[bot])
- Reinstate compatibility with Kepler GPUs. (#923) (@maleadt)
- Use multiple GPUs on CI when available. (#924) (@maleadt)
- Fix two-step mapreduce with wrapped output. (#925) (@maleadt)
- Eagerly free the CUFFT workspace when generating a new one. (#927) (@maleadt)
- Fix CUDA.function without throwing. (#930) (@maleadt)
- Fix the REPL synchronization hook. (#931) (@maleadt)
- Re-initialize the random seed every time. (#932) (@maleadt)
- Protect against race in iterating compute processes. (#933) (@maleadt)
- Helper function to get the device given a cu ptr. (#935) (@akashkgarg)
- Implement CUDA's Enhanced Compatibility when selecting a toolkit. (#936) (@maleadt)
- Update manifest (#939) (@github-actions[bot])
- Re-introduce specialization of cufunction. (#940) (@maleadt)
- Support isbits union element types with CuArray. (#941) (@maleadt)
- Try generating code with unreachable control flow. (#944) (@maleadt)
- Upgrade to CUDA 11.3 Update 1. (#945) (@maleadt)
- Always use exit instead of trap. (#947) (@maleadt)
- Select devices without NVML. (#948) (@maleadt)
- Fixes for Julia 1.7. (#949) (@maleadt)
- Query the CUBLAS version without requiring a handle. (#951) (@maleadt)
- Improve CUBLAS and CUDNN logging. (#953) (@maleadt)
- Update manifest (#957) (@github-actions[bot])
- Enable sorting with reduced block sizes (#959) (@xaellison)
- Adapt to GPUCompiler changes, bump GPUArrays. (#963) (@maleadt)
- Adapt to change in allowscalar. (#964) (@maleadt)
- Don't disable the CUDNN log callback on Windows. (#966) (@maleadt)
- Use released dependencies. (#968) (@maleadt)
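
Among the changes above, #941 adds support for isbits-union element types in `CuArray`. A hedged sketch of what this enables, assuming a CUDA-capable device:

```julia
using CUDA

# CuArray can now store isbits unions such as Union{Nothing, Int32},
# useful for representing missing or sentinel values on the GPU.
xs = CuArray{Union{Nothing, Int32}}([Int32(1), nothing, Int32(3)])

# Broadcast-style operations work over the union-typed elements.
ys = map(x -> x === nothing ? Int32(0) : Int32(2) * x, xs)
```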
v3.2.1
CUDA v3.2.1
Closed issues:
- adding constant to an array: performance regression compared to CUDAdrv (#838)
- CUDA.abs() on vector input: performance regression compared to CUDAdrv (#839)
- CUDA.@sync seems to be using a lot of CPU while waiting (#893)
- Memory leaks with repeated use of `fft` of a CUDA Array (#894)
- CUDA.jl v3.2 seems to download wrong version of CUDNN and CUTENSOR (#899)
Merged pull requests:
v3.2.0
CUDA v3.2.0
Closed issues:
- Explore CUDA graph API (#65)
- Runtime functions are missing debug information (#53)
- Native RNGs do not pass SmallCrush (#803)
- Remaining threads/FFT/mult-gpu error (#876)
Merged pull requests:
- Add wrappers for the CUDA graph API. (#877) (@maleadt)
- Use the profiler API to start capture. (#878) (@maleadt)
- Duplicate RNG state across block to avoid need for synchronization (#879) (@maleadt)
- Support for printing tuples. (#880) (@maleadt)
- Support unsigned inputs to integer intrinsics. (#881) (@maleadt)
- Switch to Philox2x32 for device-side RNG (#882) (@maleadt)
- Update manifest (#884) (@github-actions[bot])
- Treat CartesianIndices in views as scalars. (#886) (@maleadt)
- Robustly get variables from the environment during init. (#887) (@maleadt)
- Move Statistics functionality to GPUArrays. (#888) (@maleadt)
- Update artifacts and use sources from unified JLLs. (#889) (@maleadt)
- Lazy initialization of CUDNN and CUTENSOR (#890) (@maleadt)
- Update manifest (#895) (@github-actions[bot])
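
#879 and #882 rework the device-side RNG, switching to a counter-based Philox2x32 generator that avoids cross-thread synchronization. A hedged sketch of calling `rand` from device code, assuming a CUDA-capable device:

```julia
using CUDA

function fill_random!(out)
    i = threadIdx().x
    # rand() is usable from within kernels; as of #882 it is backed by
    # the Philox2x32 counter-based generator.
    out[i] = rand(Float32)
    return
end

out = CUDA.zeros(Float32, 32)
@cuda threads=32 fill_random!(out)
```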
v3.1.0
CUDA v3.1.0
Closed issues:
- GPU Implementation of partialsort! (#93)
- Document associativity requirements of scan/reduce operators (#819)
- Problem in reduce_block? (#843)
- CUDNN convolution incorrect for small images (#848)
- Newly-spawned tasks should re-set the device (#851)
- sort!(CUDA.zeros(2^25)) throws invalid configuration argument (code 9, cudaErrorInvalidConfiguration) (#852)
- Type-preserving upload about cu in doc may be wrong (#855)
- Memory corruption / segfault with Threads.@async and planned FFTs (#859)
- Don't call nvmlErrorString (during init?) to prevent crashes on WSL (#860)
- unsafe_copy3d! does not work with stream-ordered allocations (#863)
- CUDA3 seems to have memory leak (#866)
Merged pull requests:
- Implement statistics functions: correlation and covariance (#509) (@berquist)
- @atomic support * and / (#842) (@yuehhua)
- CUDNN docstring revisions. (#844) (@GunnarFarneback)
- Sorting perf (again) (#845) (@xaellison)
- Update manifest (#846) (@github-actions[bot])
- Remove extraneous apostrophe (#847) (@kshyatt)
- reduce_block fixes. (#853) (@maleadt)
- Fix sorting large arrays. (#854) (@maleadt)
- Remove unsupported config launch keyword. (#856) (@maleadt)
- Identify the buffer during unsafe_wrap to support unified free. (#857) (@maleadt)
- Add support for CUDA 11.3. (#858) (@maleadt)
- Work around buggy NVML initialization on WSL (#861) (@maleadt)
- ae/partialsort (#864) (@xaellison)
- Update manifest (#865) (@github-actions[bot])
- Improve multitasking with CUFFT. (#867) (@maleadt)
- Introduce a HandleCache type. (#868) (@maleadt)
- Improve multitasking with CURAND (#869) (@maleadt)
- Document associativity requirement of accumulate (#870) (@HenriDeh)
- Half-Precision Intrinsics (#871) (@iyaja)
- Work around offset calculation bug in cuMemcpy3DAsync. (#872) (@maleadt)
- fix #848: CUDNN convolution incorrect for small images (#873) (@denizyuret)
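
PR #864 brings `partialsort!` to the GPU, closing the long-standing #93. A hedged sketch of the Base-compatible interface, assuming a CUDA-capable device:

```julia
using CUDA

xs = CUDA.rand(10_000)

# Mirrors Base.partialsort!: after the call, xs[100] holds the value that
# would sit at index 100 in a fully sorted array, and that value is returned.
v = partialsort!(xs, 100)
```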
v3.0.3
CUDA v3.0.3
Closed issues:
- CUDA.jl init error in the REPL without using a CUDA feature (#841)
Merged pull requests: