Accelerate OpenCV 4.2.0 – build with CUDA and python bindings

OpenCV 4.5.0 (changelog) which is compatible with CUDA 11.1 and cuDNN 8.0.4 was released on 12/10/2020, see Accelerate OpenCV 4.5.0 on Windows – build with CUDA and python bindings, for the updated guide.

Because the pre-built Windows libraries available for OpenCV 4.2.0 do not include the CUDA modules, or support for the Nvidia Video Codec SDK, Nvidia cuDNN, Intel Media SDK or Intel’s Math Kernel Libraries (MKL) or Intel Threaded Building Blocks (TBB) performance libraries, I have included the build instructions below for anyone who is interested. If you just need the Windows libraries then go to Download OpenCV 4.2.0 with CUDA 10.2. To get an indication of the performance boost from calling the OpenCV CUDA functions with these libraries see the OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).

The guide below details instructions on compiling the 64 bit version of OpenCV 4.2.0 shared libraries with Visual Studio 2019, CUDA 10.2, and optionally the Nvidia Video Codec SDK, Nvidia cuDNN, Intel Media SDK, Intel Math Kernel Libraries (MKL), Intel Threaded Building Blocks (TBB) and Python bindings for accessing OpenCV CUDA modules from within Python.

The main topics covered are given below. Although most of the sections can be read in isolation I recommend reading the pre-build checklist first to check whether you will benefit from and/or need to compile OpenCV with CUDA support.

Pre-build Checklist

Before continuing there are a few things to be aware of:

  1. This guide is for OpenCV 4.2.0. Whilst the instructions should also work on newer versions, this is not guaranteed so please only ask questions related to the stable 4.2.0 release on this page.
  2. You can download all the pre-built binaries described in this guide from the downloads page. Unless you want to:
    • build for another version of Visual Studio; and/or
    • include non-free algorithms like SURF; and/or
    • generate CUDA binaries compatible with devices of compute capability lower than 5.3; and/or
    • build bindings for python versions other than 3.7;

    or just want to build OpenCV from scratch, you may find they are all you need.

  3. If you have already tried to build and are having issues check out the troubleshooting guide.
  4. Thanks to Hamdi Sahloul, the CUDA modules can be called directly from Python (since August 2018); to include this support see the including Python bindings section.
  5. The procedure outlined has been tested on Visual Studio Community 2019 (16.4.2).
  6. The OpenCV DNN modules are now CUDA accelerated. To target the CUDA backend you need to install cuDNN (see below for instructions) before building.

    • If you want to use your application on a different machine you will need to ensure that the cudnn64_7.dll is installed on that machine, either in a location on the system/user path or in the same directory as your application.
    • Installing cuDNN will automatically cause OpenCV to be built with the CUDA DNN backend, therefore until this PR has been merged, including cuDNN in your CUDA directory means you will need to compile for CUDA Compute Capability 5.3 or higher (-DCUDA_ARCH_BIN=5.3,6.0,6.1,7.0,7.5), or disable the module with -DOPENCV_DNN_CUDA=OFF.
  7. If you have built OpenCV with CUDA support then to use those libraries and/or redistribute applications built with them on any machines without the CUDA toolkit installed, you will need to ensure those machines have,
    • an Nvidia capable GPU with driver version of 441.22 or later (see this for a full list of CUDA Toolkit versions and their required drivers), and
    • the CUDA dll’s (cublas64_10.dll, nppc64_10.dll etc.) placed somewhere on the system or user path, or in the same directory as the executable. These can be located in the following directory.
      C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin
  8. The latest version of Intel TBB uses a shared library, therefore if you build with Intel TBB you need to add
    C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\redist\intel64_win\tbb\vc_mt

    to your path variable, and make sure you redistribute tbb.dll with any of your applications.

  9. Depending on the hardware the build time can be over 3 hours. If this is an issue you can speed this up by generating the build files with ninja and/or targeting a specific CUDA compute capability.
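Several of the checklist items above come down to making sure a directory (the CUDA binaries, tbb.dll, etc.) is on your path before you run anything. A quick stdlib Python sketch for checking this up front (the helper name is my own):

```python
import os

def dir_on_path(directory):
    """Return True if `directory` is listed on the PATH environment variable."""
    target = os.path.normcase(os.path.normpath(directory))
    entries = (os.path.normcase(os.path.normpath(p))
               for p in os.environ.get("PATH", "").split(os.pathsep) if p)
    return target in entries
```

For example, `dir_on_path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin")` returning False tells you in advance that applications linked against the CUDA libraries will fail to load.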


Prerequisites

There are a couple of components you need to download and/or install before you can get started. You first need to:

  1. Install Visual Studio 2019, selecting the “Desktop development with C++” workload shown in the image below. If you already have an installation ensure that the correct workload is installed and that you have updated to the latest version.
  2. Download the source files for both OpenCV and OpenCV contrib, available on GitHub. Either clone the git repos OpenCV and OpenCV Contrib making sure to checkout the 4.2.0 tag, or download these archives OpenCV 4.2.0 and OpenCV Contrib 4.2.0 containing all the source files.
    Note: I have seen lots of guides including instructions to download and use git to get the source files, however this is a completely unnecessary step. If you are a developer and you don’t already have git installed, I would assume there is a good reason for this, and I would not advise installing it just to build OpenCV.
  3. Install CMake – Version 3.16.2 is used in the guide.
  4. Install The CUDA 10.2 Toolkit. Note: If your system path is too long, CUDA will not add the path to its binaries C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin during installation. If you receive a warning about this at the end of the installation process do not forget to manually add the path to your system path, otherwise opencv_world420.dll will fail to load.
  5. Optional – To decode video on the GPU with Nvidia Video Codec SDK
    • Register and download the Video Codec SDK.
    • Extract and copy the include and Lib directories to your CUDA installation. For CUDA 10.2 your CUDA installation directory is

      C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2

    Note: Before building you may want to ensure that your GPU has decoding support by referring to the Nvidia Video Decoder Support Matrix.

  6. Optional – To use the DNN CUDA backend
    • Register and download cuDNN.
    • Extract and copy the bin, include and Lib directories to your CUDA installation. For CUDA 10.2 your CUDA installation directory is

      C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2
  7. Optional – To accelerate video decoding on Intel CPUs with Quick Sync, register for, download and install the Intel Media SDK.
  8. Optional – To accelerate specific OpenCV operations install both the Intel MKL and TBB by registering for community licensing, and downloading for free. MKL version 2019.5.281 and TBB version 2019.8.281 are used in this guide; I cannot guarantee that other versions will work correctly.
  9. Optional – To call OpenCV CUDA routines from python, install the x64 bit version of Anaconda, making sure to tick “Register Anaconda as my default Python ..” This guide has been tested against Anaconda with Python 3.7, installed in the default location for a single user.
  10. Optional – To significantly reduce the build time, download the Ninja build system – Version 1.9.0 is used in this guide.
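After installing the toolkit you can sanity-check which CUDA version is actually on your path by running nvcc --version from the command prompt. A small Python sketch for parsing the release string (the helper name is my own, and the sample string shows the typical format of the output):

```python
import re

def parse_cuda_release(nvcc_output):
    """Extract the toolkit release number (e.g. '10.2') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

# Typical tail of `nvcc --version` for the toolkit used in this guide:
sample = "Cuda compilation tools, release 10.2, V10.2.89"
print(parse_cuda_release(sample))
```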

Generating OpenCV build files with CMake

Before you can build OpenCV you have to generate the build files with CMake. There are two ways to do this, from the command prompt or with the CMake GUI, however by far the quickest and easiest way to proceed is to use the command prompt to generate the base configuration. Then if you want to add any additional configuration options, you can open up the build directory in the CMake GUI as described here.

In addition there are several ways to build OpenCV using Visual Studio. For simplicity only two methods are discussed here:

  1. Building OpenCV with Visual Studio solution files.
  2. Building OpenCV with the ninja build system to reduce the build time.

Finally instructions are included for building and using the Python bindings to access the OpenCV CUDA modules.

Building OpenCV 4.2.0 with CUDA and Intel MKL + TBB, with Visual Studio solution files from the command prompt (cmd)

The following steps will build the opencv_world420.dll using NVIDIA’s recommended settings for future hardware compatibility. This does however have two drawbacks: first, the build can take several hours to complete, and second, the shared library will be at least 800MB depending on the configuration that you choose below. To find out how to reduce both the compilation time and size of opencv_world420.dll read choosing the compute-capability first and then continue as below. If you wish to build the Python bindings and/or use the Ninja build system then see section including python bindings and/or decreasing the build time with Ninja respectively before proceeding.

  1. Open up the command prompt (windows key + r, then type cmd and press enter).
  2. Ignore this step if you are not building with Intel MKL + TBB. Otherwise enter the following
    "C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\tbb\bin\tbbvars.bat" intel64

    to temporarily set the environmental variables for locating your TBB installation.

  3. Set the location of the source files and build directory, by entering the text shown below, first setting PATH_TO_OPENCV_SOURCE to the root of the OpenCV files you downloaded or cloned (the directory containing 3rdparty, apps, build, etc.) and PATH_TO_OPENCV_CONTRIB_MODULES to the modules directory inside the contrib repo (the directory containing cudaarithm, cudabgsegm, etc).
    set "openCvSource=PATH_TO_OPENCV_SOURCE"
    set "openCVExtraModules=PATH_TO_OPENCV_CONTRIB_MODULES"
    set "openCvBuild=%openCvSource%\build"
    set "buildType=Release"
    set "generator=Visual Studio 16 2019"
  4. Copy the below to the command prompt. This is the base configuration and will build opencv_world420.dll with CUDA support, including the corresponding tests and examples. Additionally if the Nvidia Video Codec SDK, cuDNN or the Intel Media SDK are installed the corresponding modules will automatically be included.
    "C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" ^
    -DWITH_CUDA=ON -DBUILD_opencv_world=ON ^

    Then append the following commands as required, and press enter to run CMake:

    • Remove all optional CUDA modules. This is useful if you only want to use the CUDA backend for the DNN module and will significantly reduce compilation time and size of the opencv_world420.dll.
      -DBUILD_opencv_cudaarithm=OFF -DBUILD_opencv_cudabgsegm=OFF -DBUILD_opencv_cudafeatures2d=OFF -DBUILD_opencv_cudafilters=OFF -DBUILD_opencv_cudaimgproc=OFF -DBUILD_opencv_cudalegacy=OFF -DBUILD_opencv_cudaobjdetect=OFF -DBUILD_opencv_cudaoptflow=OFF -DBUILD_opencv_cudastereo=OFF -DBUILD_opencv_cudawarping=OFF -DBUILD_opencv_cudacodec=OFF
    • Include Intel MKL without multithreading support and dependence on tbb.dll by adding -DWITH_MKL=ON.
    • Make Intel MKL multi-threaded by adding -DMKL_WITH_TBB=ON in addition to -DWITH_MKL=ON.
      Note: This will make opencv_world420.dll dependent on tbb.dll.

    • Include Intel TBB – recommended for DNN inference on the CPU – by adding -DWITH_TBB=ON.
      Note: As above this will make opencv_world420.dll dependent on tbb.dll.

    • Include non-free modules by adding -DOPENCV_ENABLE_NONFREE=ON.
  5. If you want to make any configuration changes before building, then you can do so now through the CMake GUI.
  6. The OpenCV.sln solution file should now be in your PATH_TO_OPENCV_SOURCE/build directory. To build OpenCV you have two options; depending on your preference you can:
    • Build directly from the command line by simply entering the following (replacing Debug with Release to build a release version)
      "C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target INSTALL --config Debug
    • Build through Visual Studio GUI by opening up the OpenCV.sln in Visual Studio, selecting your Configuration, clicking on Solution Explorer, expanding CMakeTargets, right clicking on INSTALL and clicking Build.

    Either approach will both build the library and copy the necessary redistributable parts to the install directory, PATH_TO_OPENCV_SOURCE/build/install in this example. All that is required now to run any programs compiled against these libs is to add the directory containing opencv_world420.dll (and tbb.dll if you have built with Intel TBB) to your path environmental variable.

If everything was successful, congratulations, you now have OpenCV built with CUDA. To quickly verify that the CUDA modules are working and check if there is any performance benefit on your specific hardware see below.

Decreasing the build time with Ninja

The build time for OpenCV can be reduced by more than 2x (from 2 hours to 30 mins on an i7-8700) by utilizing the Ninja build system instead of directly generating Visual Studio solution files. The only difference you may notice is that Ninja will only produce one configuration at a time, either Debug or Release; therefore the buildType must be set before calling CMake. In the section above the configuration was set to Release; to change it to Debug simply replace Release with Debug as shown below

set "buildType=Debug"

Using Ninja only requires two extra configuration steps:

  1. Setting both the path to the ninja executable and configuring Visual Studio Development tools. Both are achieved by entering the following into the command prompt before entering the CMake command, making sure to first set PATH_TO_NINJA to the directory containing ninja.exe, and changing Community to either Professional or Enterprise if necessary
    "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat"
    set "ninjaPath=PATH_TO_NINJA"
    set path=%ninjaPath%;%path%
  2. Changing the generator from “Visual Studio 16 2019” to ninja
    set "generator=Ninja"

For example entering the following into the command prompt will generate ninja build files to build OpenCV with CUDA 10.2 and Python bindings

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat"
set "ninjaPath=PATH_TO_NINJA"
set path=%ninjaPath%;%path%
set "openCvSource=PATH_TO_OPENCV_SOURCE"
set "openCvBuild=%openCvSource%\build"
set "buildType=Release"
set "generator=Ninja"
set "openCVExtraModules=PATH_TO_OPENCV_CONTRIB_MODULES"
set "pathToAnaconda=PATH_TO_ANACONDA3"
set "pyVer=37"
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" ^
-DBUILD_opencv_world=ON ^
-DBUILD_opencv_python3=ON -DPYTHON3_INCLUDE_DIR=%pathToAnaconda%/include -DPYTHON3_LIBRARY=%pathToAnaconda%/libs/python%pyVer%.lib -DPYTHON3_EXECUTABLE=%pathToAnaconda%/python.exe -DPYTHON3_NUMPY_INCLUDE_DIRS=%pathToAnaconda%/lib/site-packages/numpy/core/include -DPYTHON3_PACKAGES_PATH=%pathToAnaconda%/Lib/site-packages/ -DOPENCV_SKIP_PYTHON_LOADER=ON

The build can then be started in the same way as before, dropping the --config option, as

"C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target install

Adding additional configuration options with the CMake GUI

Once you have generated the base Visual Studio solution file from the command prompt the easiest way to make any additional configuration changes is through the CMake GUI. To do this:

  1. Fire up the CMake GUI.
  2. Making sure that the Grouped checkbox is ticked, click on the browse build button

    and navigate to your PATH_TO_OPENCV_SOURCE/build directory. If you have selected the correct directory the main CMake window should resemble the below.

  3. Now any additional configuration changes can be made by just expanding any of the grouped items and ticking or unticking the values displayed. Once you are happy just press Configure,

    if the bottom window displays configuration successful press Generate, and you should see

    Now you can open up the Visual Studio solution file and proceed as before.

  4. Troubleshooting:
    • Make sure you have the latest version of Visual Studio 2019 (>= 16.4.2)
    • Not all options are compatible with each other and the configuration step may fail as a result. If so examine the error messages given in the bottom window and look for a solution.
    • If the build is failing after making changes to the base configuration, I would advise you to remove the build directory and start again, making sure that you can at least build the base Visual Studio solution files produced from the command line.

Including Python bindings

Building and installing Python support is incredibly simple. The instructions below are for Python 3.7; however, they can easily be adapted for other versions of Python as well:

  1. Open up the windows command prompt and enter
    set "pathToAnaconda=PATH_TO_ANACONDA3"
    set "pyVer=37"

    ensuring the PATH_TO_ANACONDA3 only uses forward slashes (/) as path separators and points to the Anaconda3 directory, e.g. C:/Users/mbironi/Anaconda3/.

  2. Follow the instructions from above to build your desired configuration, appending the below to the CMake configuration before running CMake.
    -DBUILD_opencv_python3=ON -DPYTHON3_INCLUDE_DIR=%pathToAnaconda%/include -DPYTHON3_LIBRARY=%pathToAnaconda%/libs/python%pyVer%.lib -DPYTHON3_EXECUTABLE=%pathToAnaconda%/python.exe -DPYTHON3_NUMPY_INCLUDE_DIRS=%pathToAnaconda%/lib/site-packages/numpy/core/include -DPYTHON3_PACKAGES_PATH=%pathToAnaconda%/Lib/site-packages/ -DOPENCV_SKIP_PYTHON_LOADER=ON
  3. Make sure you build Release; python bindings cannot by default be generated for a Debug configuration. That said you can easily generate a debug build by modifying the contents of pyconfig.h, changing

    pragma comment(lib,"python37_d.lib")

    to

    pragma comment(lib,"python37.lib")

    and

    #       define Py_DEBUG

    to

    //#       define Py_DEBUG

    The default location of pyconfig.h in Anaconda3 is %USERPROFILE%\Anaconda3\include\pyconfig.h. However the version you are compiling against may differ; to check the location simply open up CMake in the build directory as detailed in Adding additional configuration options with the CMake GUI and check the entries under PYTHON2_INCLUDE_DIR and PYTHON3_INCLUDE_DIR shown below

  4. Verify that the cmake output detailing the modules to be built includes python3 and if not look for errors in the output preceding the below.

    --   OpenCV modules:
    --     To be built:                 aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dpm face features2d flann fuzzy hfs highgui img_hash imgcodecs imgproc line_descriptor ml objdetect optflow phase_unwrapping photo plot python2 python3 quality reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab world xfeatures2d ximgproc xobjdetect xphoto
  5. Once generated, the bindings need to be copied to the site-packages directory. This can be accomplished using the following, which assumes you have Python 3.7 installed through Anaconda in the default location for a single user.
    copy "%openCvBuild%\lib\python3\cv2.cp37-win_amd64.pyd" "%USERPROFILE%\Anaconda3\Lib\site-packages\cv2.cp37-win_amd64.pyd"
  6. Include the path to the opencv_world420.dll shared library in your user or system path or temporarily by entering
    set path=%openCvBuild%\install\x64\vc16\bin;%path%
  7. Test the freshly compiled python module can be located and loads correctly by entering
    python -c "import cv2; print(f'OpenCV: {cv2.__version__} for python installed and working')"

    and checking the output for

    OpenCV: 4.2.0 for python installed and working

    If you do not see the above output then see the troubleshooting section below.
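If you find yourself repeating the pyconfig.h edit from step (3), it can be scripted with a few lines of stdlib Python. This is a sketch only: the helper name is my own and the replaced strings assume Python 3.7, as in the guide.

```python
from pathlib import Path

def patch_pyconfig_for_debug(pyconfig):
    """Hypothetical helper: apply the two step (3) edits so a Debug OpenCV build
    links the release python library and leaves Py_DEBUG undefined."""
    p = Path(pyconfig)
    text = p.read_text()
    text = text.replace('pragma comment(lib,"python37_d.lib")',
                        'pragma comment(lib,"python37.lib")')
    text = text.replace('#       define Py_DEBUG',
                        '//#       define Py_DEBUG')
    p.write_text(text)
```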

If there were no errors from the above steps the Python bindings should be installed correctly. To use on a permanent basis don’t forget to permanently add the path to the opencv_world420.dll shared library to your user or system path. To quickly verify that the CUDA modules can be called and check if there is any performance benefit on your system continue below, then to see how to get the most performance from the OpenCV Python CUDA bindings see Accelerating OpenCV with CUDA streams in Python.

Troubleshooting, if the output from step (7) is:

  1. ModuleNotFoundError: No module named 'cv2'

    You have not copied the bindings to your python distribution, see step (5).

  2. ImportError: ERROR: recursion is detected during loading of "cv2" binary extensions. Check OpenCV installation.

    Ensure that you don’t have OpenCV installed through conda and/or pip, and that you don’t have another copy of the python bindings in your site-packages directory.

  3. ImportError: DLL load failed: The specified procedure could not be found.

    One of the required DLLs is not present on your Windows path. From the feedback I have received it is most likely you have not added the location of either opencv_world420.dll, the path to the CUDA binaries, or the path to tbb.dll if built with Intel TBB. This can be quickly checked by entering the following

    where opencv_world420.dll
    where nppc64_10.dll
    where cudnn64_7.dll & :: if you have built the DNN module with the CUDA backend
    where tbb.dll & :: if you have built with Intel TBB

    and checking that you see the path to the dll in each case. If instead you see

    INFO: Could not find files for the given pattern(s).

    add the paths (step (6) above, step (4) from the Prerequisites and steps (7) and (8) from the Pre-build Checklist) and check again. Once you can see the paths to the DLLs check step (7) again.

  4. If you get any other errors, make sure to check OpenCV is installed correctly by running through the steps in Verifying OpenCV is CUDA accelerated.
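The where checks above can also be scripted. A stdlib Python sketch that walks the path entries the same way (the helper name is my own; this only checks for the file's presence, not that it actually loads):

```python
import os

def where_dll(name):
    """Mimic the cmd `where` check: return the first PATH entry containing
    `name`, or None if no directory on the path holds it."""
    for entry in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(entry, name)
        if entry and os.path.isfile(candidate):
            return candidate
    return None
```

For example, `where_dll("opencv_world420.dll")` returning None corresponds to the "Could not find files for the given pattern(s)" message above.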

Troubleshooting common configuration/build errors

  • CUDA : OpenCV requires enabled 'cudev' module from 'opencv_contrib'

    The most common cause of this is that -DOPENCV_EXTRA_MODULES_PATH has been set to the root of the opencv_contrib repo and not the modules directory. Double check that you have passed

    -DOPENCV_EXTRA_MODULES_PATH=OPENCV_CONTRIB/modules

    where OPENCV_CONTRIB is the location of the opencv_contrib repo on your local machine.

  • CUDA backend for DNN module requires CC 5.3 or higher.  Please remove unsupported architectures from CUDA_ARCH_BIN option.

    Because cuDNN has been installed, the DNN CUDA module, which currently requires compute capability 5.3 and above to function, will be built. This configuration error can be solved in two ways, by adding either;

    1. -DOPENCV_DNN_CUDA=OFF

      to disable the CUDA DNN module if it is not required, or;

    2. -DCUDA_ARCH_BIN=5.3,6.0,6.1,7.0,7.5 -DCUDA_ARCH_PTX=7.5

      to remove unsupported architectures;

    to your CMake configuration.

Verifying OpenCV is CUDA accelerated

The easiest way to quickly verify that everything is working is to check that one of the inbuilt CUDA performance tests passes. For this I have chosen the GEMM test which;

  • runs without any external data;
  • should be highly optimized on both the GPU and CPU making it “informative” to compare the performance timings later on, and;
  • has OpenCL versions.

To run the CUDA performance test simply enter the following into the existing command prompt

"%openCvBuild%\install\x64\vc16\bin\opencv_perf_cudaarithm.exe" --gtest_filter=Sz_Type_Flags_GEMM.GEMM/29

the full output is shown below. To verify that everything is working look for the green [ PASSED ] text in the image below.

The above test performed matrix multiplication on a 1024x1024x2 single precision matrix using a midrange GTX 1060 GPU 100 times, with a mean execution time of 3.70 ms, which can be seen in the following output taken from the image above.

[ PERFSTAT ]    (samples=16   mean=3.70   median=3.67   min=3.60   stddev=0.11 (2.9%))

If the test has passed then we can confirm that the above code was successfully run on the GPU using CUDA. Next it would be interesting to compare these results to the same test run on a CPU, to check we are getting a performance boost on the specific hardware setup we have.

CPU (i5-6500) Performance

The standard opencv core GEMM performance test does not use 1024×1024 matrices, therefore for this comparison we can simply change the GEMM tests inside opencv_perf_core.exe to process this size instead of 640×640. This is achieved by simply changing the following line to be

::testing::Values(Size(1024, 1024), Size(1280, 1280)),

Denoting the modified executable as opencv_perf_core_1024.exe, the corresponding CPU test can be run as

"%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3

resulting in the following output on a midrange i5-6500.

[ PERFSTAT ]    (samples=10   mean=1990.56   median=1990.67   min=1962.95   stddev=16.56 (0.8%))

The execution time is three orders of magnitude greater than on the GPU, so what is wrong with our CPU? As it turns out nothing is wrong; to get a baseline result, I purposely ran this without building OpenCV against any optimized BLAS. To demonstrate the performance benefit of building OpenCV with Intel’s MKL (which includes optimized BLAS) and TBB I have run the same test again with two different levels of optimization, OpenCV built against:

  1. Intel MKL without multi-threading
    [ PERFSTAT ]    (samples=10   mean=90.77   median=90.15   min=89.64   stddev=1.98 (2.2%))
  2. Intel MKL multi-threaded with TBB
    [ PERFSTAT ]    (samples=100   mean=28.86   median=28.37   min=27.34   stddev=1.33 (4.6%))

This demonstrates the importance of using multi-threaded MKL and brings the gap between CPU and GPU performance down significantly. Now we are ready to compare with OpenCL.
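For a rough sense of scale, the speedups implied by the mean timings quoted in this section are simple arithmetic:

```python
# Mean GEMM timings (ms) reported above for the 1024x1024x2 test on an i5-6500/GTX 1060
timings_ms = {
    "no optimized BLAS": 1990.56,
    "MKL single-threaded": 90.77,
    "MKL + TBB": 28.86,
    "CUDA (GTX 1060)": 3.70,
}
baseline = timings_ms["no optimized BLAS"]
speedups = {name: baseline / t for name, t in timings_ms.items()}
for name, factor in speedups.items():
    print(f"{name:20s} {factor:6.1f}x faster than the unoptimized baseline")
```

Multi-threaded MKL is roughly a 69x improvement over the unoptimized baseline, with CUDA a further ~8x beyond that.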

OpenCL Performance

In OpenCV 4.0 the CUDA modules were moved from the main to the contrib repository, presumably because OpenCL will be used for GPU acceleration going forward. To examine the implications of this I ran the same performance tests as above again, only this time on each of my three OpenCL devices. The results for each device are given below including the command to run each test.

  • Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:CPU:Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    "%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=205.35   median=205.63   min=200.35   stddev=2.82 (1.4%))
  • Intel(R) HD Graphics 530
    set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:GPU:Intel(R) HD Graphics 530
    "%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=130.88   median=129.82   min=127.46   stddev=2.72 (2.1%))
  • GeForce GTX 1060 3GB
    "%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=8.83   median=8.85   min=8.53   stddev=0.17 (1.9%))

The performance results for all the tests are shown together below.

The results in the figure show that for this specific test and hardware configuration (GTX 1060 vs i5-6500):

  1. If we ignore OpenCL the CUDA implementation on the GTX 1060 is comfortably faster than the MKL + TBB implementation executed on the CPU.
  2. The OpenCL implementation on the GTX 1060 is significantly slower than the CUDA version. This is expected but unfortunate considering the OpenCV CUDA routines have been moved from the main repository and may eventually be deprecated.
  3. OpenCL still has a long way to go, in addition to its poor performance when compared with CUDA on the same device the implementations on both the CPU (i5-6500) and the iGPU (HD Graphics 530) were an order of magnitude slower than the optimized MKL + TBB implementation on the CPU.

The above comparison is just for fun, to give an example of how to quickly check if using OpenCV with CUDA on your specific hardware combination is worthwhile. For a more in-depth comparison on several hardware configurations see OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).

Python CUDA performance

To quickly verify that the CUDA modules are being called from Python you can run the same GEMM test as before, this time from an interactive Python (IPython) session – the %timeit magic below requires IPython. Assuming that all of the steps in Including Python bindings completed successfully, open up the Anaconda3 prompt and issue the following to start the Python session and ensure that the path to OpenCV is set correctly.

set path=%openCvBuild%\install\x64\vc16\bin;%path%

Then run the GEMM test on the GPU with CUDA from within Python

import numpy as np
import cv2 as cv
npTmp = np.random.random((1024, 1024)).astype(np.float32)
npMat1 = np.stack([npTmp,npTmp],axis=2)
npMat2 = npMat1
cuMat1 = cv.cuda_GpuMat()
cuMat2 = cv.cuda_GpuMat()
cuMat1.upload(npMat1)
cuMat2.upload(npMat2)
%timeit cv.cuda.gemm(cuMat1, cuMat2,1,None,0,None,1)

You should see output similar to

4.47 ms ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is very close to the result (3.70 ms on the GTX 1060 from C++) when the same test was called directly from C++. If you receive similar output then this confirms that you are running OpenCV from python on the GPU with CUDA.

For completeness you can run the same test on the CPU as

%timeit cv.gemm(npMat1,npMat2,1,None,0,None,1)

and confirm that the new result

27.9 ms ± 664 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

is comparable with the previous one (28.86 ms on the i5-6500 from C++).

You can also perform a quick sanity check to confirm that you are seeing good performance for the GEMM operation in OpenCV. An easy way to do this is to run the same operation again only this time in NumPy.

npMat3 = npTmp + npTmp*1j
npMat4 = npMat3
%timeit npMat3 @ npMat4

As you can see the data is structured in a slightly different way, however the timings

32 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

should hopefully be comparable to the OpenCV result (27.9 ms on the i5-6500 calling OpenCV from Python).

From the results of these quick tests we can infer that:

  1. The OpenCV CUDA modules are being called from python.
  2. The overhead from using the CPU and/or CUDA python interface instead of directly calling from C++ is small.
  3. The GEMM operation in OpenCV is highly optimized if built against Intel MKL.

Choosing the compute-capability

The default command line options given above implement NVIDIA’s recommended settings for future hardware compatibility. This means that any programs linked against the resulting opencv_world shared library should work on all GPUs currently supported by CUDA 10.2 and all GPUs released in the future. As mentioned above this comes at a cost, both in terms of compilation time and shared library size. Before discussing the CMake settings which can be used to reduce these costs we need to understand the following concepts:

  • Compute-capability – every GPU has a fixed compute-capability which determines its general specifications and features. In general the more recent the GPU the higher the compute-capability and the more features it will support. This is important because:
    • Each version of CUDA supports different compute-capabilities. Usually a new version of CUDA comes out to support a new GPU architecture; in the case of CUDA 10.0, support was added for the Turing (compute 7.5) architecture. On the flip side support for older architectures can be removed, for example CUDA 9.0 removed support for the Fermi (compute 2.0) architecture. Therefore by choosing to build OpenCV with CUDA 10.2 we have limited ourselves to GPUs of compute-capability >=3.0. Notice we have not limited ourselves to GPUs of compute-capability <=7.5; the reason for this is discussed in the next section.
    • You can build opencv_world420.dll to support one or many different compute-capabilities, depending on your specific requirements.
  • Supporting a compute-capability – to support a specific compute-capability you can do either of the following, or a combination of the two:
    • Generate architecture-specific cubin files, which are only forward-compatible with GPU architectures with the same major version number. This can be controlled by passing CUDA_ARCH_BIN to CMake. For example, passing -DCUDA_ARCH_BIN=3.0 to CMake will result in opencv_world420.dll containing binary code which can only run on compute-capability 3.0, 3.5 and 3.7 devices. Furthermore it will not support any specific features of compute-capability 3.5 (e.g. dynamic parallelism) or 3.7 (e.g. 128 K 32 bit registers). In the case of OpenCV 4 this would not restrict any functionality, because it only uses features from compute-capability 3.0 and below. This can be confirmed by a quick search of the main and contrib repositories for the __CUDA_ARCH__ flag.
    • Generate forward-compatible PTX assembly for a virtual architecture, which is forward-compatible with all GPU architectures of greater than or equal compute-capability. This can be controlled by passing CUDA_ARCH_PTX to CMake. For example, by passing -DCUDA_ARCH_PTX=7.5 to CMake, opencv_world420.dll will contain PTX code for compute-capability 7.5 which can be Just In Time (JIT) compiled to architecture-specific binary code by the CUDA driver on any future GPU architecture. Because of the default CMake rules, when CUDA_ARCH_BIN is not explicitly set it will also contain architecture-specific cubin files for GPU architectures 3.0-7.5.
  • PTX considerations – given that PTX code is forward-compatible and cubin binaries are not, it would be tempting to only include the former. To understand why this might not be such a great idea, here are a few things to be aware of when generating PTX code:
    1. As mentioned previously the CUDA driver JIT compiles PTX code at run time and caches the resulting cubin files, so that the compile operation should in theory be a one-time delay, at least until the driver is updated. However, if the cache is not large enough, JIT compilation will happen on every execution of your program. To get an idea of this delay I passed -DCUDA_ARCH_BIN=3.0 and -DCUDA_ARCH_PTX=3.0 to CMake before building OpenCV. I then emptied the cache (default location %appdata%\NVIDIA\ComputeCache\) and ran the GEMM performance example on a GTX 1060 (compute-capability 6.1), to force JIT compilation. I measured an initial delay of over 3 minutes as the PTX code was JIT compiled before the program started to execute. Following that, the delay of subsequent executions was around a minute, because the default cache size (256 MB) was not large enough to store all the compiled PTX code. Given my compile options, the only way to remove this delay is to increase the size of the cache by setting the CUDA_CACHE_MAXSIZE environment variable to a number of bytes greater than required. Unfortunately, because “Older binary codes are evicted from the cache to make room for newer binary codes if needed”, this is more of a band-aid than a solution: the maximum cache size is 4 GB, so your compiled PTX code can be evicted at any point if other programs on your machine are also JIT compiling from PTX, bringing back the “one-time” delay.
    2. For maximum device coverage you should include PTX for the lowest possible GPU architecture you want to support.
    3. For maximum performance NVIDIA recommends including PTX for the highest possible architecture you can.
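Putting the above together, the sketch below shows how these CMake variables and the JIT cache size might be set from a Windows command prompt. The `..` source path, the cache size, and passing an empty CUDA_ARCH_PTX to suppress PTX generation are assumptions for illustration, not values taken from the guide's main build command:

```shell
:: cubins for compute 6.1 only: smallest library and fastest build,
:: but the resulting opencv_world420.dll runs only on 6.x devices
cmake -DCUDA_ARCH_BIN=6.1 -DCUDA_ARCH_PTX= ..

:: cubins for 6.1 plus forward-compatible PTX for 3.0: should run on
:: all GPUs supported by CUDA 10.2 and on future architectures, at the
:: cost of a one-time JIT compilation on non-6.x devices
cmake -DCUDA_ARCH_BIN=6.1 -DCUDA_ARCH_PTX=3.0 ..

:: enlarge the driver's JIT cache (in bytes) to reduce recompilation;
:: as noted above this is a band-aid, since entries can still be evicted
setx CUDA_CACHE_MAXSIZE 1073741824
```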

CMake command line options to control cubin/PTX content of the OpenCV shared library

Given (1)-(3) above, the command line options that you want to pass to CMake when building OpenCV will depend on your specific requirements. I have given some examples below for various scenarios given a main GPU of compute-capability 6.1:

  • Firstly stick with the defaults if compile time and shared library size are not an issue. This offers the greatest amount of flexibility from a development standpoint, avoiding the possibility of needing to recompile OpenCV when you switch GPU.
  • If your programs will always be run on your main GPU, just pass -DCUDA_ARCH_BIN=6.1 to CMake to target your architecture only. It should take around an hour to build, depending on your CPU, and the resulting shared library should not be larger than 200 MB.
  • If you are going to deploy your application, but only to newer GPUs, pass -DCUDA_ARCH_BIN=6.1,7.0,7.5 and -DCUDA_ARCH_PTX=7.5 to CMake for maximum performance and future compatibility. This is advisable because you may not have any control over the size of the JIT cache on the target machine; including cubins for all compute-capabilities you want to support is the only way to be sure of preventing JIT compilation delay on every invocation of your application.
  • If size is really an issue but you don’t know which GPUs you want to run your application on, then to ensure that your program will run on all current and future supported GPUs pass -DCUDA_ARCH_BIN=6.1 and -DCUDA_ARCH_PTX=3.0 to CMake for maximum coverage.
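For the deployment scenario above, these flags would simply be appended to whatever base CMake command you used earlier in the guide; the generator, architecture switch and the OPENCV_SRC source path below are placeholders, not values from this post:

```shell
:: sketch of the "deploy to newer GPUs only" configuration (Windows cmd);
:: OPENCV_SRC stands in for the path to your OpenCV source checkout
cmake -G "Visual Studio 16 2019" -A x64 ^
      -DCUDA_ARCH_BIN=6.1,7.0,7.5 -DCUDA_ARCH_PTX=7.5 ^
      %OPENCV_SRC%
```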
Copyright secured by Digiprove © 2020 James Bowley

7 thoughts on “Accelerate OpenCV 4.2.0 – build with CUDA and python bindings”

  1. Thank you. Do the pre-built binaries on the downloads page include Intel MKL + TBB, Quick Sync and the Nvidia Video Codec SDK?

    1. Hi Rog, I have updated the downloads page to include this information. In short, the first download for OpenCV 4.2.0 includes everything except Intel MKL + TBB, because including them would make the binaries dependent on tbb.dll.

    1. Yes just follow the guide and you will have built the CUDA modules and the inference engine with GPU support.

      1. I am able to build OpenCV with the inference engine and with CUDA, but only separately. When I configure both at the same time it builds with no inference engine (CMake configuration output and Build Information output).
