=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
NVIDIA CUDA Toolkit v4.0 Release Notes for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
--- Contents:
-------------
--- Release Highlights
--- Documentation
--- List of important files
--- Supported NVIDIA hardware
--- Supported Operating Systems for:
------ Windows
------ Linux
------ Mac OS X
--- Installation Notes
--- Upgrading from CUDA Toolkit Release 3.2
--- Notes on New Features and Performance Improvements
--- Known Issues
--- Resolved Issues
--- Source code for Open64 and cuda-gdb
--- Revision History
--- More information 
--- Acknowledgements

This release contains:
* NVIDIA CUDA Toolkit documentation
* NVIDIA OpenCL documentation
* NVIDIA CUDA compiler (nvcc) and supporting tools
* NVIDIA CUDA runtime libraries
* NVIDIA CUBLAS, CUFFT, CUSPARSE, CURAND, Thrust, and NPP libraries
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Release Highlights
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
NVIDA CUDA Toolkit version 4.0 has the following new features:

* NVIDIA cuda-gdb debugger
* NVIDIA Visual Profiler for CUDA C/C++ and OpenCL applications

Easier application porting
* Share GPUs across multiple threads
* Single thread access to GPUs
* No-copy pinning of system memory
* New CUDA C/C++ language features
* Thrust templated primitives library
* NPP image/video processing library
* Layered Textures

Faster multi-GPU programming
* Unified virtual addressing
* GPUDirect v2.0 with peer-to-peer communication

New and improved developer tools
* Automated performance analysis
* C++ debugging
* Debugger cuda-gdb for Mac OS
* GPU binary disassembler

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Documentation
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
For a list of documents supplied with this release, please refer to the /doc directory 
of your CUDA Toolkit installation.

For issues related to the Visual Profiler, please refer to the Visual Profiler 
release notes for the specific platform. Refer the Visual Profiler change log - 
"Changelog.txt" - for changes in Visual Profiler with respect to the previous 
version.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
List of important files
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  bin/nvcc                     Command line compiler

  include/
    cuda.h                     CUDA driver API header
    cudaGL.h                   CUDA OpenGL interop header for driver API
    cudaVDPAU.h                CUDA VDPAU interop header for driver API (Linux only)
    cuda_gl_interop.h          CUDA OpenGL interop header for toolkit API (Linux only)
    cuda_vdpau_interop.h       CUDA VDPAU interop header for toolkit API (Linux only)
    cudaD3D9.h                 CUDA DirectX 9 interop header (Windows only)
    cudaD3D10.h                CUDA DirectX 10 interop header (Windows only)
    cudaD3D11.h                CUDA Directx 11 interop header (Windows only)
    cufft.h                    CUFFT API header
    cublas.h                   CUBLAS API header
    cusparse.h                 CUSPARSE API header
    curand.h                   CURAND API header
    curand_kernel.h            CURAND device API header
    thrust/*                   Thrust Headers
    npp.h                      NPP API Header
    nvcuvid.h                  CUDA Video Decoder header (Windows and Linux)
    cuviddec.h                 CUDA Video Decoder header (Windows and Linux)
    NVEncodeDataTypes.h        CUDA Video Encoder (C-library or DirectShow) required for
                               projects (Windows only)
    NVEncodeAPI.h              CUDA Video Encoder (C-library) required for projects
                               (Windows only)
    INvTranscodeFilterGUIDs.h  CUDA Video Encoder (DirectShow) required for projects
                               (Windows only)
    INVVESetting.h             CUDA Video Encoder (DirectShow) required for projects
                               (Windows only)
------------------
Windows lib files
------------------
  lib/
    cuda.lib                   CUDA driver library
    cudart.lib                 CUDA runtime library
    cublas.lib                 CUDA BLAS library
    cufft.lib                  CUDA FFT library
    cusparse.lib               CUDA Sparse Matrix library
    curand.lib                 CUDA Random Number Generation library
    npp.lib                    NVIDIA Performance Primitives library
    nvcuvenc.lib               CUDA Video Encoder library
    nvcuvid.lib                CUDA Video Decoder library
------------------
Linux lib files
------------------
 lib/
    libcuda.so                 CUDA driver library
    libcudart.so               CUDA runtime library
    libcublas.so               CUDA BLAS library
    libcufft.so                CUDA FFT library
    libcusparse.so             CUDA Sparse Matrix library
    libcurand.so               CUDA Random Number Generation library
    libnpp.so                  NVIDIA Performance Primitives library
------------------
Mac OS X lib files
------------------
  lib/
    libcuda.dylib              CUDA driver library
    libcudart.dylib            CUDA runtime library
    libcublas.dylib            CUDA BLAS library
    libcufft.dylib             CUDA FFT library
    libcusparse.dylib          CUDA Sparse Matrix library
    libcurand.dylib            CUDA Random Number Generation library
    libnpp.dylib               NVIDIA Performance Primitives library
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported NVIDIA Hardware
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* See http://www.nvidia.com/object/cuda_gpus.html

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported Operating Systems for Windows, Linux, and MAC OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
------------------
Windows
------------------
* Supported Operating Systems (32-bit and 64-bit)
     -  Windows 7
     -  Windows Vista
     -  Windows XP
     -  Windows Server 2008 R2
     -  Windows Server 2008
     -  Windows Server 2003

* Supported Compilers
--------   --------         ---       
Platform   Compiler         IDE       
--------   --------         ---       
Windows    MSVC8(14.00)     VS 2005   
Windows    MSVC9(15.00)     VS 2008   
Windows    MSVC2010(16.00)  VS 2010   

------------------
Linux 
------------------
The CUDA development environment relies on tight integration with the host development environment- including the host compiler and C runtime libraries, and is therefore only supported on distro versions that have been qualified for this CUDA Toolkit release.  For example, since the CUDA Toolkit 4.0 was not tested with any Linux distros that use the GNU C Compiler (GCC) version 4.5, it is not supported on those distros.


* Supported Distros
------          -- --  ------             ---         -----         
Distro          32 64  Kernel             GCC         GLIBC         
------          -- --  ------             ---         -----         
SLES11-SP1       X  X  2.6.32.12-0.7-pae  4.3-62.198  2.11.1-0.17.4 
RHEL-6.0            X  2.6.32-71.el6      4.4.4       2.12          
Ubuntu-10.10     X  X  2.6.35-23-generic  4.4.5       2.12.1        
OpenSUSE-11.2    X  X  2.6.31.5-0.1       4.4.1       2.10.1        
Fedora13         X  X  2.6.33.3-85        4.4.4       2.12          
RHEL-4.8            X  2.6.9-89.ELsmpl    3.4.6       2.3.4         
RHEL-5.5         X  X  2.6.18-194.el5     4.1.2       2.5           

* Distros Not Supported
------          -- --  ------             ---         -----         
Distro          32 64  Kernel             GCC         GLIBC         
------          -- --  ------             ---         -----         
RHEL-4.8         X     2.6.9-89.ELsmpl    3.4.6       2.3.4         
Ubuntu-10.04     X  X  2.6.32-21-generic  4.4.3       2.11.1        
SLED11-SP1       X  X  2.6.32.12.0.7      4.3.4       2.11.1        

Note that 32-bit versions of RHEL 4.8 and RHEL 6.0 have not been tested with this release and are therefore not supported in the CUDA Toolkit 4.0 release.  

------------------
Mac OS X
------------------
--------          -- --  ------     ---                    ------
Platform          32 64  Kernel     GCC                    Status 
--------          -- --  ------     ---                    ------
Mac OS X 10.6      X  X  10.0.0     4.2.1 (build 5646)     Continued 
Mac OS X 10.5.2+   X  X                                    Removed 
Mac OS X 10.5.7    X  X  9.7.0      4.0.1 (build 5490)     Removed 

------------------

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Installation Notes
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Windows:
Silent Installation:
Install using msiexec.exe from the shell and pass the following arguments:
msiexec.exe /i cudatoolkit.msi /qn
To uninstall:
Use /x instead of /i 
--------------------------------------------------------------------------------
Linux: 
On some Linux releases, due to a GRUB bug in the handling of upper memory and a default vmalloc too small on 32-bit systems, it may be necessary to pass this information to the bootloader:

vmalloc=256MB, uppermem=524288

Example of grub conf:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB
pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Upgrading from previous CUDA Toolkit 3.2
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Please refer to the CUDA_4.0_Readiness_Tech_Brief.pdf document. 

--------------------------------------------------------------------------------
Mac-related Note
--------------------------------------------------------------------------------
CUDA 4.0 does not have support for XCODE4.0.
--------------------------------------------------------------------------------

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Notes on New Features and Performance Improvements
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

CUDA Driver Features:
---------------------
* cudaMemcpyAsync works with non pinned heap memory. The asynchronous copy APIs (cudaMemcpyAsync et al in the runtime API and cuMemcpyHtoDAsync et al in the driver API) may take ordinary pageable host memory as its source or destination argument.  
This is in contrast to CUDA 3.2 where host memory could only be used if it was allocated through CUDA (using cudaMallocHost et al through the runtime API or cuMemAllocHost through the driver API).
Note that while using pageable host memory is now permitted for use with the asynchronous copy APIs, using pageable host memory will result in the copies being performed synchronously. 

* cudaMemcpy is supported across contexts. The ability to copy memory between devices in the runtime API (and between context in the driver API) has been added.
When using unified addressing, the function cudaMemcpy (and its variants) with the copy direction cudaMemcpyDefault may be used to copy between devices in the runtime API (the function cuMemcpy may be used in the driver API).
When not using unified addressing, the function cudaMemcpyPeer in the runtime API (and cuMemcpyPeer in the driver API) and its variants may be used to copy between devices.
This functionality is supported on all platforms and all devices.  This functionality will take advantage of direct peer access where it is enabled.  
Note that this functionality may not be optimal on compute level 1.0 devices and across non-SLI-linked devices using the WDDM driver model on Vista and Win7.

* cudaStreamWaitEvent supported across contexts.
The function cudaStreamWaitEvent (or cuStreamWaitEvent in the driver API) may be used to effect cross-device (or cross-context, in the driver API) synchronization.  An event recorded on one device may be waited on by a stream created by another device.
The dependency added will be resolved asynchronously, and this will be very efficient.  Note that this may not be optimally efficient yet for compute 1.0 devices or for devices that are not in SLI on Windows Vista/7, using the WDDM driver model.

* Added flag for property "Concurrent Data Transfer" to indicate two simultaneous DMA transfers.
The ability of the device to concurrently pull data (from host or a peer device) and push data (to host or a peer) may be queried.
In the runtime API, this may be done by examining the device property "asyncEngineCount" will be set to 1 if only one direction of a transfer may be active at a time and 2 if both directions may be active at a time.
The driver API device property query is CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT.

* (Windows and Linux) Added support for unified virtual address space.
Devices supporting 64-bit and compute 2.0 ahd higher capability now share a single unified address space between the host and all devices.  This means that the pointer used to access memory on the host is the same as the pointer to used to access memory on the device. Therefore, the location of memory may be queried directly from its pointer value; the direction of a memory copy need not be specified.
The function cudaPointerGetAttribute in the runtime API (and cuPointerGetAttribute in the driver API) may be used to query attributes about a pointer.  The copy direction cudaMemcpyDefault in the runtime API (and the functions cuMemcpy, its variants, and the memory type CU_MEMORYTYPE_UNIFIED in the driver API) may be used to copy data without specifying the direction.
Note that this functionality is available only on Linux-64, Windows XP-64, and Windows Vista/7 using the TCC driver model.

* The ability of directly accessing memory on peer devices has been added.
If direct access of memory on the peer device is possible (which can be queried by runtime API cudaDeviceCanAccessPeer or driver API cuDeviceCanAccessPeer), this functionality can be enabled by cudaDeviceEnablePeerAccess (or cuCtxEnablePeerAccess).
This functionality is supported on all NVIDIA CUDA devices with compute level 2.0 and up running 64-bit Linux, XP, and TCC drivers. Note that peer access is not supported on WDDM.

* (Linux) DX and OGL textures (shared through interop), mapped as CUDA arrays, can now be bound to surface references in CUDA. In order to be able to do so, the DX/OGL resource should be registered with the appropriate register flag as follows:
For the driver api, it's CU_GRAPHICS_REGISTER_FLAGS_SURFACE_LDST.
For the runtime api, it's cudaGraphicsRegisterFlagsSurfaceLoadStore.
Please note that surface has smaller width/height restrictions than textures. If the texture is registered with the surface load/store flag, and the size is too big, then that will generate an error. 

* Removed alignment requirments from cuMemcpy* functions.
All restrictions on the alignment of the source and destination pointer and pitch for all 2D and 3D copies (using cudaMemcpy3D et al in the runtime API and cuMemcpy3D et al in the driver API) have been removed.
Note that using unaligned operands for a copy may result in poorer performance than using aligned operands.

* Added 64-bit support to WinXP-64 

* (Windows and Linux) CUDA-OpenGL interop currently supports the following set of texture formats:

{GL_R, GL_RG, GL_RGBA, GL_LUMINANCE, GL_LUMINANCE_ALPHA, GL_ALPHA, GL_INTENSITY} X {,8,16,16F,32F,8UI,16UI,32UI,8I,16I,32I} 
These formats are also supported for OpenCL-OpenGL interop.
For further details on these texture formats, please refer to the OpenGL specification. 

* Event and stream creation/destruction improved in this version.
The functions cudaStreamDestroy and cudaEventDestroy (cuStreamDestroy and cuEventDestroy) are now asynchronous and light-weight.  Destroying a stream or event will return immediately, even if there is still pending work in the stream or pending work behind the event.  The stream or event's resources will be released asynchronous once the stream or event has completed its work.

* Added device attributes for memory clock and number of threads per SM. 
The following new device attributes are supported in the CUDA driver API: 
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: gives the peak memory clock frequency in kilohertz. 
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: gives the global memory bus width in bits. 
CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE: gives the size of the L2 cache in bytes. 
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: gives the number of maximum threads that can be resident at one time on a multiprocessor.

* (Windows) This version allows a single CUcontext to be current to multiple threads simultaneously. 

* A kernel that is compiled with a __launch_bounds__ directive will have the max threads/block taken into account when querying the max thread count via cuFuncGetAttribute(&i, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, f). Also cuFuncSetBlockShape(f, x, y, z) will reject block shapes that exceed the max threads/block set via a __launch_bounds__. These changes in behavior will be likewise be visible in the CUDART counterparts to these CUDA APIs. 

* Querying the maximum grid Z dimension on Fermi and later architectures will now return values greater than 1 (on Fermi it is 65535). Methods for querying the max grid Z dimension are as follows:
CUDART:
1) call cudaGetDeviceProperties(&prop, dev) and check prop.maxGridSize[2]
CUDA driver:
1) call cuDeviceGetProperties(&devProps, hDev) and check devProps.maxGridSize[2]
2) call cuDeviceGetAttribute(&i, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z, hDev)
Launching 3D grids is accomplished in CUDART by passing in a 3rd grid dimension in <<< >>> or in cudaConfigureCall(). Launching 3D grids with the CUDA driver requires the use of the new cuLaunchKernel API, which has gridDimX, gridDimY and gridDimZ parameters.
It is important to note that only on Fermi and later architectures will an app be able to actually use 3D grid launches. 

* (Windows) Layered Textures (2D)implemented.

Note: Layered textures are currently not supported on the Tesla architecture (sm_1x). 
Layered textures" are better known as "array textures" in graphics APIs. A layered texture is a collection of either 1D or 2D textures of identical size and format, arranged in layers. Such textures can be created as follows:
- by specifying the flag CUDA_ARRAY3D_LAYERED when creating the CUDA array using the driver API.
- by specifying the flag cudaArrayLayered when creating the CUDA array using the runtime API.
Kernels can access any texel from any particular layer using a new set of intrinsics that have the following format: 
- tex1DLayered(texref, float x, int layer) 
- tex2DLayered(texref, floay x, float y, int layer) 
Note that in a 2D layered texture, no filtering is performed between layers i.e. there is no trilinear filtering done like it is done for 3D textures. Similarly, for 1D layered texture, there is no bilinear filtering done like the way it is for 2D textures. 

The second argument in the template for texture references now means "texture type" instead of "dim". i.e. instead of "texture<returnType, dim, readMode>", it is  "texture<returnType, textureType, readMode>" 
The "textureType" arguments can be one of the following #defines: 
#define cudaTextureType1D 0x01 
#define cudaTextureType2D 0x02 
#define cudaTextureType3D 0x03 
#define cudaTextureType1DLayered 0xF1 
#define cudaTextureType2DLayered 0xF2 
Backward compatibility for the existing 1D, 2D and 3D textures is maintained by aliasing the corresponding #defines to their "dim" value. As a reult, sample texture references would look like:
texture<float4, cudaTextureType3D> texRef3D; 
texture<float4, cudaTextureType1DLayered> texRef1DLayered; 

* This version has a new launching API called cuLaunchKernel. This API offers many improvements over previous launching APIs:
1) All function state associated with a launch is specified via one API call. This makes multithreaded launching of kernels feasible.
2) Support for 3D dimensional grid launches on h/w that supports it (see associated NVbug 599870 - 3D grid launches)
3) Kernel parameter passing can either be done via an easy to use method where addresses of parameters are passed in and the driver worries about packing the parameters together, or an expert mode (much like cuParamSetv) where all parameters are pre-packed by the application in one chunk. 

* Added mechanism for registering system memory for DMA.

CUDA Compiler Features:
------------------------

* Among the new features added in the CUDA 4.0 compiler are:
Support for inline PTX: much like an __asm__ directive, PTX can now be inlined with CUDA C/C++.
Support for driver-loadable fatbins: fatbin files can contain multiple PTX and cubin files targeted at different architectures. In previous releases, only applications that used the runtime API were able to use fatbin files. Now with CUDA 4.0, driver API applications can use them too.
For more details on these features, please consult the nvcc documentation (nvcc.pdf) that comes with the release. 

* Starting with CUDA 4.0 release, the compiler implements enhanced error checks for function calls. The compiler checks that the calling function and the called function have compatible __host__, __device__ and __global__ attributes. The compatibility rules for calls between functions with such attributes are documented in the CUDA Programming Guide.
If the compiler detects an incompatible call, it will generate error or warning messages. Warnings may be turned into errors in a future release. Additional error checks may be implemented in a future release. It is recommended that the user modify the calling function or the called function to ensure compatibility with the function call restrictions documented in the CUDA Programming Guide.

CUDA Libraries Features:
------------------------

* The CUBLAS Library now supports a new API that is thread-safe and allows the application to more easily take advantage of parallelism using streams (especially for functions with scalar return parameters). Because this new API is thread-safe, the CUBLAS library will work cleanly with applications that use the new multi-threading features of the CUDA Runtime Library (CUDART) in the CUDA Toolkit v4.0.
The legacy CUBLAS API is still supported, but it is not thread-safe and does not offer as many opportunities for parallelism with streams as the new API.  Existing applications that use CUBLAS should work without any changes to the existing code, they only need to explicitly link to the CUDART dynamic library during compilation. Note that this link requirement was not necessary with the previous versions of CUBLAS if the application only used CUBLAS entry points (and hence did not use any explicit CUDART entry points).
We recommend that new applications use the new API. In addition, we recommend that you convert to the new API for exisiting applications that need maximum stream parallelism or correct operation in a multi-threaded scenario. 
The documentation in doc/CUBLAS_Library.pdf has been rewritten to focus on the new API; some treatment of the legacy API is still included.

* The TRMM routines in the CUBLAS Library can selectively operate either out-of-place or in-place (the traditional BLAS interface only operates in-place). The out-of-place option, which is new in this release, offers a significant speedup, up to 3x, on the Fermi architecture compared to the previous release, and a modest speedup on the Tesla architecture compared to the previous release. In general, as the input matrix sizes get larger, the performance of the TRMM routine can now approach the performance of the corresponding raw GEMM routines when operating out-of-place. 

* The performance of the ZGEMM routine in the CUBLAS library, specifically for input matrices larger than about 100x100, has been optimized for the Fermi architecture.

* Added the cublasGetVersion() function to the CUBLAS Library. 

* Performance has significantly improved (>1.5x) for double-precision power-of-2 transforms up to size 2048, especially on the Fermi architecture. Certain API features such as non-standard element strides, etc. will not trigger these new kernels, therfore  performance is improved only in some cases.

* In the previous release candidate, the CUFFT Library had a performance regression for some 2D FFT sizes as compared to the 3.2 release.  These regressions have been fixed.

* Added the cufftGetVersion() function to the CUFFT Library.

* In the previous version of the CUFFT Library, the "Bluestein" or "chirp" FFT algorithm was used to accelerate transforms for sizes that cannot be factored into a combination of powers-of-2, -3, -5, or -7 for 1D transforms only.  This release employs the Bluestein algorithm to accelerate 2-D and 3-D transforms as well.

* The CUFFT Library APIs now support multiple batches for all 1D, 2D and 3D transforms. The previous release had limited support for multiple batches for 2D and 3D transforms. 


* In this version of the CUDA Toolkit (v4.0), the CUFFT Library now supports more complex input and output data layouts via the advanced data layout parameters inembed, istride, idist, onembed, ostride and odist, as accepted by the cufftPlanMany() API.  In this release, these parameters are supported only for complex-to-complex (C2C) transforms. This feature allows transforming a subset of an input array, or outputting to only a portion of a larger data structure. If the user sets inembed or onembed to NULL, then the CUFFT Library will function as it did in the previous releases and assume a basic data layout and ignore the other advanced parameters. If the user intends to use the advanced parameters, then all of the advanced interface parameters should be specified correctly. Advanced parameters are defined in units of the relevant data type (cufftReal, cufftDoubleReal, cuComplex, cuDoubleComplex).

* The CUSPARSE library now provides a solver for triangular sparse linear systems, via the cusparse*csrsv_analysis() and cusparse*csrsv_solve() APIs. Refer to the document, CUSPARSE_Library.pdf for detailed usage information.

* The cusparse*csrmv() and cusparse*csrmm() routines in the CUSPARSE library now support symmetric (CUSPARSE_MATRIX_TYPE_SYMMETRIC) and Hermitian (CUSPARSE_MATRIX_TYPE_HERMITIAN) matrix types.

* Renamed cudaDeviceBlockingSync to cudaDeviceScheduleBlockingSync 

* The cospi() routine has been added for single-precision and double-precision floating-point datatypes. The function cospi(x) implements cos(x * PI). No special include file is required to access this routine. Note: the sinpi() routine has already been available in previous releases.

* In previous releases of the CUDA toolkit, the CUBLAS and CUSPARSE libraries included compiled kernel PTX and compiled kernel binaries for compute capability 1.0, 1.3 and 2.0. Starting with this release, the compiled kernel PTX will only be shipped for the highest supported compute capability (i.e., 2.0 for this release). This results in a significant reduction of file size for the CUBLAS and CUSPARSE dynamically linked libraries for all platforms. Note: there is no change to the compiled kernel binaries. 

* The CURAND library now supports generation of double-precision floating point Sobol' quasi-random sequences with 53 bits of randomness, as well as 64 bit integer Sobol' quasi-random sequences. These are accessed via the CURAND_RNG_QUASI_SOBOL64 and CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 generator types in the host API and the curandStateSobol64_t and curandStateScrambledSobol64_t generator structures in the device API. 

* The CURAND library now supports generation of log-normally distributed random numbers, via the curandGenerateLogNormal() and curandGenerateLogNormalDouble() host API functions and the curand_log_normal(), curand_log_normal2(), curand_log_normal_double() and curand_log_normal2_double() device API functions.

* The CURAND library now supports generation of scrambled Sobol' quasi-random numbers, via the CURAND_RNG_QUASI_SCRAMBLED_SOBOL32 and CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 generator types in the host API and the curandStateScrambledSobol32_t and curandStateScrambledSobol64_t generator structures in the device API. 

* The CURAND library documentation (doc/CURAND_Library.pdf) now contains a summary and selected detailed results of the statistical quality tests run against the generators provided by CURAND. 

* Beginning with this release, the NVIDIA Performance Primitives (NPP) library is included directly within the CUDA Toolkit. Currently, the NPP library supports a variety of basic signal and image processing primitives that are optimized across the range of CUDA capable GPUs. Documentation is found at doc/NPP_Library.pdf and the public header file is at include/npp.h. 

* Added a complete set of Arithmetic and Logical Signal Processing Primitives.

* NPP has added Beta support for asynchronous operation using CUDA streams via the nppSetStream() and nppGetStream() functions. This feature is provided in an early form in this release and will be provided in a non-Beta fully tested form in a future release. 

* The Thrust CUDA library is now included with the CUDA Toolkit in the /include/thrust directory. A "Quick Start" document is available at doc/Thrust_Quick_Start_Guide.pdf. Additionally, several code samples in the NVIDIA GPU Computing SDK now employ Thrust. The Thrust library source code, additional detailed documentation, example programs and a discussion group will continue to be available at the project's original home at http://code.google.com/p/thrust/. 

* This version of Thrust introduces discard_iterator, an output iterator which ignores values assigned to it. discard_iterator is useful for discarding unnecessary output from algorithms with multiple output ranges (such as reduce_by_key), and measuring in advance the total size of the result of algorithms which produce variably-sized output (such as set_intersection). 

* The Thrust library now provides set operations for sorted ranges, including union, difference and symmetric difference. These new operations are exposed via thrust/set_operations.h. 

* Added CUDA runtime API functions to control profiling:
cudaProfilerInitialize() - Initialize profiling
cudaProfilerStart() - Start profiling
cudaProfilerStop() - Stop profiling
A new header file "cuda_profiler_api.h" has been added for these runtime API functions. The corresponding driver APIs are cuProfilerInitialize(), cuProfilerStart(), cuProfilerStop() and the header file is "cudaProfiler.h".

CUDA Libraries Performance:
---------------------------

* The performance of transforms in the CUFFT library that are pure powers of 3, 5, and 7 have been optimized significantly in this release, especially for double precision. 

* In version 3.2 of CUSPARSE, the csrmv() and csrmm() functions ran slower when the "beta" parameter was =0 than when it was =1. In this version, the performance variation has been removed, and csrmv() and csrmm() should run slightly faster when "beta" =0. 

* The GEMV routines, for all datatypes, in the CUBLAS library have been significantly optimized for the case in which the input matrix, A, is transposed. Performance has improved up to 2x, especially when the input matrix, A, is large. The performance improvements apply to both the Tesla (GT200) and Fermi (GF100) architectures. 

* The performance of the TRSM routines in the CUBLAS library for large matrices has been significantly improved on Fermi and Tesla architecture platforms.

* The performance of the double-precision hyperbolic sine function, sinh(), has been improved significantly on GF100 (Fermi architecture) and GT200 (Tesla architecture).  The exact improvement achieved for end applications using sinh will vary based on the specific characteristics of each application.

* Improved performance of CUFFT on R2C and C2R transforms whose input data size along the X (or, least significant) dimension is a multiple of 2 but not a multiple of 4.  In the previous release, the performance was much better when this size was a full multiple of 4; now, both cases should run at the same higher performance. 

* The performance of double-precision floating point division on the Fermi architecture has been significantly optimized for the round-to-nearest-even case, which is the default rounding mode employed when using the '/' operator in CUDA-C device code. The round-to-nearest-even mode can be explicitly employed in CUDA using the __ddiv_rn() intrinsic. The exact improvement achieved for end applications that perform double precision divides will vary based on the specific characteristics of each application. 

* CURAND supports a new ordering technique for pseudo-random generators (CURAND_ORDERING_PSEUDO_SEEDED) that significantly reduces the state setup time. However, since this ordering technique uses a different starting seed for each thread on the device, it may result in statistical weaknesses of the pseudorandom output for some user seed values. 

* The performance of the SYR2K and HER2K routines in the CUBLAS library has been optimized for the Fermi architecture. 

* The SYMM and HEMM routines in CUBLAS have been significantly optimized for the Fermi architecture. For instance, in some cases there is a 3x performance improvement over the previous version of these routines, both for single and for double precision. 

* The performance of the double-precision reciprocal square-root function, rsqrt(), has been improved significantly for GT200 (the Tesla architecture) and GF100 (the Fermi architecture). The exact improvement achieved for end applications that use rsqrt will vary based on the specific characteristics of each application. 

* The performance and accuracy of the double-precision erfc() function have been improved. This function is now accurate to 4 ulps, and the performance has significantly improved on both the Tesla and Fermi architectures. The exact improvement achieved for end applications that use erfc will vary based on the specific characteristics of each application. 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Known Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* In the current release, the TCC driver cannot be run under a guest account; admin privileges are needed to run TCC. This requirement will be removed in a future release.

* GPUs without a display attached are not subject to the 2 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter. Thus, for devices like S1070 that do not have an attached display, users may disable the Windows TDR
timeout. Disabling the TDR timeout will allow kernels to run for extended periods of time without triggering an error.
The following is an example .reg script:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000

* The header file search locations and the order that they are visited have been revised.
Until CUDA 3.2, nvcc searched the following locations, in order:
(1)The toolkit include paths,
(2)The current working directory,
(3)The paths specified with -I,
(4)The paths specified with -isystem, and
(5)The system include paths.
The header files in the toolkit include path could not be overridden as the toolkit include paths were always visited first.
From CUDA 4.0, nvcc searches through the include paths in the following order:
(1)The paths specified with -I,
(2)The toolkit include paths,
(3)The paths specified with -isystem, and
(4)The system include paths.
The current working directory is not added to the include paths by default anymore, adhering to the C/C++ compiler convention. That is, to add the current working directory to the include search paths, -I. or -isystem. must be given to nvcc, depending on the desired search order. Alternatively, the #include directives can be used in the quoted form, instead of the angle-bracket form, to include header files in the current working directory.

* A CUDA program may not compile correctly if a type or typedef 'T' is private to a class or a structure, and at least one of the following is satisfied:
- 'T' is a parameter type for a __global__ function.
- 'T' is an argument type for a template instantiation of a __global__ function.
This restriction will be fixed in a future release."

* (Windows) Structure and union types with bit fields may not work correctly in device code on the Windows platform. 
In addition:
- Transferring variables that contain such types, from host to device or from device to host, may not work correctly.
- Use of variables with such types in device code may not work correctly.
This issue will be addressed in a future release.


* When compiling thrust::reduce cudafe generates use of private typedefs 

* (Windows) The CUDA C compiler may produce a different memory layout, compared to the host Microsoft compiler, for a C++ object of class type T that satisfies any of the following conditions:
1. T has virtual functions or derives from a direct or indirect base class that has virtual functions.
2. T has a direct or indirect virtual base. 
3. T has multiple inheritance with more than one direct or indirect empty base class. 
The size for such an object may also be different in host and device code. As long as type T is used exclusively in host or device code, the program should work correctly. 
Do not pass objects of type T between host and device code (e.g. as arguments to __global__ functions or through cudaMemcpy calls).

* For certain configurations, the CUFFT Library will produce slightly different results for the same input when ECC is on versus when ECC is off, even on the same architecture. Note: in both cases the results are mathematically within the expected tolerance. The difference arises from optimizations specific to the ECC on and ECC off cases that result in slightly different factorizations of the overall transform into smaller radixes. 

* The CUFFT library is not thread-safe, and hence cannot be accessed concurrently from multiple threads in the same process. This will be fixed in a future release.

* CUDALibraries has 4 SDK samples that do not build on certain Linux 32-bit Operating Systems. The Makefile links incorrectly to -lUtilNPP_i686; it should be -lUtilNPP_i386. 
To build NPP samples properly on 32-bit Linux replace all instances of 
"-lUtilNPP_$(OS_ARCH)" with "-lUtilNPP_$(LIB_ARCH)" in the following Makefiles:
- CUDALibraries/src/boxFilterNPP/Makefile
- CUDALibraries/src/freeImageInteropNPP/Makefile
- CUDALibraries/src/imageSegmentationNPP/Makefile
- CUDALibraries/src/histEqualizationNPP/Makefile


* When a program is terminated while waiting on a breakpoint, the system needs to be rebooted. This affects the TCC driver for Windows Vista and Windows 7.* There is a known driver bug when debugging CUDA applications which use TCC. If the application terminates while paused at a GPU breakpoint, internal driver state can be corrupted. Until the system is rebooted, further attempts to create CUDA contexts will enter an infinite loop during cuCtxCreate().

* GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use.

Vista, Server 2008 and Windows 7 related:
-----------------------------------------

* In order to run CUDA on a non-TESLA GPU, either the Windows desktop must be extended onto the GPU, or the GPU must be selected as the PhysX GPU.

* Individual kernels are limited to a 2-second runtime by Windows Vista. Kernels that run for longer than 2 seconds will trigger the Timeout Detection and Recovery (TDR) mechanism. For more information, see http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

* The CUDA Profiler does not support performance counter events on Windows Vista.  All profiler configuration regarding performance counter events is ignored.

* The maximum size of a single memory allocation created by cudaMalloc or cuMemAlloc on WDDM devices is limited to:
  MIN ( ( System Memory Size in MB - 512 MB ) / 2, PAGING_BUFFER_SEGMENT_SIZE )
  For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2GB.

* The OS may impose artificial limits on the amount of memory you can allocate using the Cuda APIs for both system and video memory. In many cases, these limits are significantly less than the size of physical system and video memory, but there are exceptions that make it difficult to quantify the expected behavior for a particular application.

XP, Vista, Server 2008 and Windows 7 related:
---------------------------------------------

* Applications that try to use too much memory may cause a CUDA memcopy or kernel to fail with the error CUDA_ERROR_OUT_OF_MEMORY. If this happens, the CUDA context is placed into an error state and must be destroyed and recreated if the application wants to continue using CUDA.

* malloc may fail due to running out of virtual memory space. The address space limitation is fixed by a Microsoft issued hotfix. Please install the patch located at:
   http://support.microsoft.com/kb/940105 if this is an issue. Windows Vista SP1 includes this hotfix.

* When compiling a source file that includes vector_types.h with the Microsoft compiler on a 32-bit Windows system, the 16-byte aligned vector types are not properly aligned at 16 bytes.

XP related:
-----------

* OpenGL interoperability
  - OpenGL can not access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided.
  - Interoperability will use a software path on SLI
  - Interoperability will use a software path if monitors are attached to multiple GPUs and a single desktop spans more than one GPU (i.e. WinXP dualview).

* OpenCL program binary formats may change in this or future releases. Users should create programs from source and should not rely on compatibility of generated binaries between different versions of the driver.

* (Windows and Linux) Individual GPU program launches are limited to a run time  of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure   reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

* (Windows and Linux) It is a known issue that cudaThreadExit() may not be called implicitly on host thread exit. Due to this, developers are recommended to explicitly call cudaThreadExit() while the issue is being resolved.
per email thread started by Cliff Woolley
* For maximum performance when using multiple byte sizes to access the same data, coalesce adjacent loads and stores when possible rather than using a union or individual byte accesses. Accessing the data via a union may result in the compiler reserving extra memory for the object, and accessing the data as individual bytes may result in non-coalesced accesses. This will be improved in a future compiler release.

Linux Only:
-----------       

* (Linux) There is a known bug in ICC with respect to passing 16-byte aligned types by value to GCC-built code such as the CUDA Toolkit libraries e.g. CUBLAS. At this time, passing a double2 or cuDoubleComplex or any other 16-byte aligned type by value to GCC-built code from ICC-built code will pass incorrect data. Intel has been informed of this bug. As a workaround, a GCC-built wrapper function that accepts the data by reference from the ICC-built code can be linked with the ICC-built code; the GCC-built wrapper can then, in turn, pass the data by value to the CUDA Toolkit libraries.

* In order to run CUDA applications, the CUDA module must be
  loaded and the entries in /dev created.  This may be achieved
  by initializing X Windows, or by creating a script to load the
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  /sbin/modprobe nvidia

  if [ "$?" -eq 0 ]; then

  # Count the number of NVIDIA controllers found.
  N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
  NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
  mknod -m 666 /dev/nvidia$i c 195 $i;
  done

  mknod -m 666 /dev/nvidiactl c 195 255

  else
  exit 1
  fi

* The Linux kernel provides a mode where it allows user processes to overcommit system memory. (Refer to kernel documentation for /proc/sys/vm/ for details). If this mode is enabled- the default on many distros- the kernel may have to kill processes in order to free up pages for allocation requests. The CUDA driver process, especially for CUDA applications that allocate lots of zero-copy memory with cuMemHostAlloc or cudaMallocHost, is particularly vulnerable to being killed in this way. Since there is no way for the CUDA SW stack to report an OOM error to the user before the process disappears, users, especially on 32bit Linux, are encouraged to disable memory overcommit in their kernel to avoid this problem. 
Please refer to documentation on vm.overcommit_memory and vm.overcommit_ratio for more information.

Linux and Mac:
--------------
* When compiling with GCC, special care must be taken for structs that contain 64-bit integers.  This is because GCC aligns long longs to a 4 byte boundary by default, while NVCC aligns long longs to an 8 byte boundary by default.  Thus, when using GCC to compile a file that has a struct/union, users must give the -malign-double option to GCC.  When using NVCC, this option is automatically passed to GCC.

* It is a known issue that cudaThreadExit() may not be called implicitly on
  host thread exit. Due to this, developers are recommended to explicitly
  call cudaThreadExit() while the issue is being resolved.

Mac Only:
---------
* OpenGL interop will always use a software path leading to reduced performance when compared to interop on other platforms.

* CUDA kernels which do not terminate or run without interruption for several tens of seconds may trigger the GPU to reset causing a disruption of any attached displays. This may cause display image to become corrupted, which will disappear upon a reboot.

* The kernel driver may leak wired (i.e. unpageable memory) if CUDA applications terminate in unexpected ways. Continued leaks will lead to severely degraded system performance and requires a reboot to fix.

* On systems with multiple GPUs installed or systems with multiple monitors connected to a single GPU, OpenGL interoperability always copies shared buffers through host memory.

* Current hardware limits the number of asynchronous memcopies that can be overlapped with kernel execution. Overlap is also limited to kernels executing for less than 1 second. These limitations are expected to improve on future hardware. 

* The following APIs exhibit high CPU utilization if they wait for the hardware for a significant amount of time.  As a workaround, apps may use cu(da)StreamQuery and/or cu(da)EventQuery to check whether the GPU is busy and yield the thread as desired.
  - cuCtxSynchronize
  - cuEventSynchronize
  - cuStreamSynchronize
  - cudaThreadSynchronize
  - cudaEventSynchronize
  - cudaStreamSynchronize

* The MacBook Pro currently presents both GPUs as available for use in Performance mode. This is incorrect behavior, as only one GPU is available at a time. CUDA applications that try to run on the second GPU (device ID 1) will potentially hang. This hang may be terminated by pressing ctrl-C or closing the offending application.

* There is a potential for a system hang if any running CUDA application terminates abnormally while executing divergent code on the MAC OS. This issue has been fixed in the newer Mac driver version 256.01.00f03 available on: http://www.nvidia.com/


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Resolved Issues 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The following known issues that were published in CUDA Toolkit 3.2 (and 4.0 RC, RC2) release notes and errata documents have been fixed:

* For devices with compute capability 1.x, only the "Occupancy analysis" part of "Kernel analysis" was supported by the Visual Profiler. The information displayed under "Limiting Factor Identification" in the kernel analysis window was not accurate and was not to be used. This issue has been fixed.

* When profiling OpenCL applications on devices with compute capability 1.x. an "Invalid cta_launched column" error was previously reported. This issue has been fixed.

* Visual Profiler was reported to crash when trying to profile a application on Ubuntu 10.10. This issue has been fixed.

* Earlier version reported a known issue that when profiling an application in Visual Profiler on a device with compute capability 1.x with the "Normalized counters" option enabled, incorrect signals are selected resulting in warnings. This issue has been fixed.

* Earlier version reported a known issue that for some SDK applications (e.g. simpleMultiGPU) which run on multiple GPU devices, the Visual Profiler output is generated only for one device. This issue has been fixed.

* NV50_P2P allocations are limited to only allow P2P objects to be allocated between GPUs in the same peer group. 
The details are as follows:
pciDomainID is added to the cudaDeviceProp structure
description: "pciDomainID" is the PCI domain identifier of the device
CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID added as a constant for cuDeviceGetAttribute 

Mac Related:
------------ 
* To save power, some Apple products automatically power-down the CUDA-capable GPU in the system. If the operating system has powered down the CUDA-capable GPU, CUDA fails to run and the system returns an error that no device was found.
In order to ensure that your CUDA-capable GPU is not powered down by the operating system do the following:
1. Go to "System Preferences" 
2. Open the "Energy Saver" section
3. Un-check the "Automatic graphics switching" check box in the upper left

* This issue described in the previous version has been fixed in CUDA Toolkit 4.0.
On Mac OS only, the NVIDIA C Compiler (nvcc) handles size_t incorrectly during 64-bit compilation. The version of nvcc included with CUDA Toolkit 3.2 fails to handle variables of type size_t as an 8-byte entity in PTX when compiling 64-bit device code. To address this issue, NVIDIA has released a patch that updates components of nvcc.
The patch is available as "CUDA Toolkit: GFEC Patch for MacOS" from the following location:
http://developer.nvidia.com/object/cuda_3_2_downloads.html
Please refer to additional information and installation instructions in the README file distributed with the patch.

* The following issue reported in the previous version has been fixed in CUDA Toolkit 4.0.
In CUBLAS 3.2, the GEMM, SYRK, and HERK routines for Fermi GPUs can enter an infinite recursion leading to an application crash for certain input sizes meeting the criteria below.  To work around this problem, the input to CUBLAS must be recursively subdivided until the individual calls to these CUBLAS routines do not match these criteria.
Given threshold size T, where T is equal to 2^27 - 512 (i.e., 134217216), the crash might be seen in any of the following circumstances:
      1) A is not transposed, lda * k >= T, and T is divisible by lda.
      2) B is not transposed, ldb * n >= T, T is divisible by n, and n is divisible by 32
      3) A is transposed, lda * m >= T, T is divisible by m, and m is divisible by 32
      4) B is transposed, ldb * k >= T, and T is divisible by ldb.


* The performance of the TRMM routine in this 4.0 release has regressed compared to the performance in the 3.2 release.  This will be fixed in the final 4.0 production release.  As a work-around, the new out-of-place option provided in the new CUBLAS API for TRMM can be used.  The performance of this out-of-place implementation is much higher than the 3.2 performance.

* In the previous release of the CUBLAS Library, the cublasDgemm() routine produced incorrect results in some cases when k < 32 and matrix A is transposed. This has been fixed in this release. 


* (Windows and Linux) In the previous version, divergent_branch counter in Visual Profiler reported an incorrect value (of zero) for Fermi. This issue has been fixed in CUDA Toolkit 4.0.

* (Windows) cudaMempy3D no longer ignores src and dst position parameters for host memory.

* The cublasCgemm() routine in the CUBLAS library would crash in a few specific cases in the previous release; fixed in this release. 

* The cufftPlanMany() API in the 4.0 RC release had a bug that caused previously working application code to fail. In particular, when inembed was set to NULL, and 'istride' or 'idist' were set to invalid values, the API would return the CUFFT_INVALID_VALUE error code. This has been fixed, and now the error checks are only executed if 'inembed' is not NULL. This applies to the 'onembed', 'ostride' and 'odist' parameters as well. 

* In the previous version of the CUFFT Library, there was a memory leak in some cases when creating and subsequently destroying a plan for a FFT transform whose size had a prime factor larger than 47. This has been fixed in the current release. 

* The cublasFree() interface in the Legacy CUBLAS API has been corrected to remove the 'const' type qualifier from the 'void *devicePtr' argument in order to match the cudaFree() and the standard C free() APIs. Note that this may cause user code that depends on that parameter being const not to compile with the latest version of CUBLAS, though this should be an uncommon scenario. 

* In the previous release, in certain situations, the CUFFT library would print the following error message to stderr: cufft: Failed to find applicable transform. 
In the current release, all errors are reported via API return codes and the library does not print anything directly to stdout or stderr.

* Fixed in this release.
When profiling an application in Visual Profiler on a device with compute capability 1.x with the "Normalized counters" option enabled, incorrect signals are selected resulting in warnings. To avoid the warnings, do not enable the "Normalized counters" option.

* Fixed in this release.
Issue reported in earlier release notes: For some SDK applications (e.g. simpleMultiGPU) which run on multiple GPU devices, the Visual Profiler output is generated only for one device.

* Fixed in this release.
In the earlier release, Visual Profiler sample project Nbody.cvp could not be opened on Linux unless the file was remaned from "Nbody_nbody_Context_0.csv" to "Nbody_Nbody_Context_0.csv".

* Fixed in this release.
Issue reported in earlier release notes: GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use.

* Fixed in this release.
(Vista, Server 2008 and Windows 7 related)
Issue reported in earlier release notes:  The CUDA Profiler does not support performance counter events on Windows Vista.  All profiler configuration regarding performance counter events is ignored.

* In previous releases, the nppiNormDiff_8u_C1R function in the NPP library returned both output values into host pointers. In this release, the semantics of this API function have been changed and now the pointers provided for the two outputs are assumed to be pointing to device memory. There will be no compilation error as the prototype of the function has not changed and the program may fail silently; hence if this function is being used we recommend that the code be updated proactively by users. 

* In previous versions of the NPP Library, the Rotate primitives set pixel values inside the destination ROI to 0 (black) if there is no pixel value from the source image that corresponds to a particular destination pixel. This incorrect behavior has been fixed.  Now, these desitnation pixels are left untouched so that they stay at the original background color.

* In the previous CUDA Toolkit 4.0 release candidates, the NPP Library header file, nppi.h, made use of const references for passing structs to functions. This causes compilation errors when included from within a C file (as opposed to from within a C++ file). Since the NPP API is intended to be a pure C API, the offending C++ constructs have been removed from the header file. 

* In the previous release of the NPP Library, the nppiGraphcut_32s8u API function would return a NPP_TEXTURE_BIND_ERROR in some cases when the API should have executed to completion without error. This has been fixed in the current release. 

* Improved the accuracy of the generation of normally distributed single-precision pseudo-random numbers in the CURAND library. The main observed impacts of this improvement are (1) the maximum difference between the results generated by a GPU generator and a HOST generator are much smaller for single-precision normally distributed random numbers; and (2) the performance of GPU random number generation is now slower than the previous version for single-precision normally distributed random numbers. 

* The Sobol' direction vectors used by the CURAND library have been updated using the latest Joe-Kuo file new-joe-kuo-6.21201. The file was obtained from this website: http://web.maths.unsw.edu.au/~fkuo/sobol/. The smallest dimension with updated values in the new file is the 212th dimension. Therefore, the exact Sobol' sequences generated by CURAND may differ from the previous release even for the same exact input parameters, if more than 211 dimensions are requested. The authors of the direction vectors indicate that the previous set of vectors were corrupted and that their use be discontinued. 

* The previous version of the NPP library had a bug in the nppsDiv_32s_C1R primitive when dividing by 0. This bug has been fixed, and now the primitive will correctly return NPP_MAX_32S or NPP_MIN_32S when dividing by 0. 

* (Windows) In the previous version a setup consisting of GF100 M2070-Q + R260.27 driver resulted in SDK sample DeviceQuery not running when switched from OS to regular user account. This has been fixed in this version.
(Operating Systems: Windows2008 Server64, WinXP-x64 )

* In the previous release of the NPP Library, the nppiMinMax_8u_C1R() function would not work in certain situations; this has been fixed in this release. 

* For an OpenCL C program, the maximum alignment of a function scope local variable and a function parameter variable is limited to 16-byte.  

* In previous releases, the nppiMean_StdDev_8u_C1R function in the NPP library returned both output values into host pointers. In this release, the semantics of this API function have been changed and now the pointers provided for the two outputs are assumed to be pointing to device memory. There will be no compilation error as the prototype of the function has not changed and the program may fail silently; hence if this function is being used we recommend that the code be updated proactively by users. 

* In the previous release, the *Filter_8u_C1R functions in the NPP library produced incorrect results when the nSrcStep input parameter was not a multiple of 4. This has been corrected, and now the functions work for all values of nSrcStep. The exact list of impacted functions is nppiFilterRow_8u_C1R, nppiFilterBox_8u_C1R, nppiFilter_8u_C1R, nppiFilterMax_8u_C1R, and nppiFilterMin_8u_C1R.
* In previous releases, the nppiMinMax_8u_C1R function in the NPP library returned both output values into host pointers. In this release, the semantics of this API function have been changed and now the pointers provided for the two outputs are assumed to be pointing to device memory. There will be no compilation error as the prototype of the function has not changed and the program may fail silently; hence if this function is being used we recommend that the code be updated proactively by users. 

* The accuracy of single-precision transforms in the CUFFT Library has been signifcantly improved, especially for larger transforms and multi-dimensional transforms. 
Note that the accuracy improvements in general did not impact performance compared to the previous version of CUFFT, however some single precision power-of-2 kernels on the Fermi architecture will show a minor performance regression compared to the previous version of the library.

* In previous versions of the CUFFT Library, for some 1D transform sizes larger than 32M elements, the first call to cufftExec*() would fail due to insufficient memory or due to grid size limitations. These resource limitations are now properly checked for and reported by cufftPlan*() such that if sufficient resources are not available to execute an FFT of the requested size, the error will be reported at plan time rather than at execution time. 

* Thrust no longer supports scatter and gather directly between host and device memory; instead the output needs to be staged through a temporary object and copied explicitly with thrust::copy().

* Thrust no longer supports operations on device_vector when the backend is CUDA in the absence of nvcc. Hence, operations which modify device_vector's size or elements are unavailable in a .cpp file.


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Source code for Open64 and cuda-gdb
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
The Open64 and cuda-gdb source files are controlled under terms of the GPL license. 
Current and previously released versions are located at:
are located at  ftp://download.nvidia.com/CUDAOpen64

Linux users:
* Please refer to the Release Notes and Known Issues sections in the cuda-gdb 
User Manual (cuda-gdb.pdf).
* Please refer to cuda-memcheck.pdf for notes on supported error detection and known issues.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Revision History
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  05/2011 - Version 4.0
  04/2011 - Version 4.0 RC2 [Errata]
  02/2011 - Version 4.0 RC
  11/2010 - Version 3.2
  10/2010 - Version 3.2 RC2
  09/2010 - Version 3.2 RC

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
More Information
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

For more information and help with CUDA, please visit:
http://www.nvidia.com/cuda

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Acknowledgements
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

NVIDIA extends thanks to EM Photonics (http://www.emphotonics.com) for their contributions to the matrix-vector multiplication functions in the CUBLAS library incorporated into the v4.0 release.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
