Accelerating OpenCV 4.1.0 – build with CUDA, Intel MKL + TBB and python bindings

OpenCV 4.5.0 (changelog), which is compatible with CUDA 11.1, has a CUDA DNN backend compatible with cuDNN 8.0.4 and includes improved python CUDA bindings, was released on 12/10/2020. See Accelerate OpenCV 4.5.0 on Windows – build with CUDA and python bindings for the updated guide.

Because the pre-built Windows libraries available for OpenCV 4.1.0 do not include the CUDA modules, or support for Intel’s Math Kernel Libraries (MKL) or Intel Threading Building Blocks (TBB) performance libraries, I have included the build instructions below for anyone who is interested. If you just need the Windows libraries then go to Download OpenCV 4.1.0 with CUDA 10.1. To get an indication of the performance boost from calling the OpenCV CUDA functions with these libraries, see the OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).

The guide below details instructions for compiling the 64 bit version of the OpenCV 4 shared libraries with Visual Studio 2017, CUDA 10.1 and, optionally, the Intel Math Kernel Libraries (MKL), Intel Threading Building Blocks (TBB) and Python bindings for accessing the OpenCV CUDA modules from within Python.

The main topics covered are given below. Although most of the sections can be read in isolation I recommend reading the pre-build checklist first to check whether you will benefit from and/or need to compile OpenCV with CUDA support.

Pre-build Checklist

Before continuing there are a few things to be aware of:

  1. You can download all the pre-built binaries described in this guide from the downloads page. Unless you need an alternative configuration or just want to build OpenCV from scratch they are probably all you need.
  2. Thanks to Hamdi Sahloul, since August 2018 the CUDA modules can be called directly from Python; to include this support see the Including Python bindings section.
  3. The procedure outlined has been tested on Visual Studio Community 2017 (15.9.4) and Visual Studio 2019.
  4. The OpenCV DNN modules are not CUDA accelerated. I have seen other guides which include instructions to download cuDNN. This is completely unnecessary and will have no effect on performance.
  5. If you have built OpenCV with CUDA support then to use those libraries and/or redistribute applications built with them on any machines without the CUDA toolkit installed, you will need to ensure those machines have,
    • an Nvidia capable GPU with driver version of 418.96 or later (see this for a full list of CUDA Toolkit versions and their required drivers), and
    • the CUDA DLLs (cublas64_10.dll, nppc64_10.dll etc.) placed somewhere on the system or user path, or in the same directory as the executable. These can be found in the following directory.
      C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin
  6. The latest version of Intel TBB uses a shared library, therefore if you build with Intel TBB you need to add
    C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\redist\intel64_win\tbb\vc_mt

    to your path variable, and make sure you redistribute that dll with any of your applications.

  7. Depending on the hardware the build time can be over 3 hours. If this is an issue you can speed this up by generating the build files with ninja and/or targeting a specific CUDA compute capability.
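The DLL checks in point 5 can also be scripted. The sketch below lists every directory on the current PATH that contains a given file; the `find_on_path` helper is my own, not part of OpenCV or CUDA:

```python
import os

def find_on_path(filename):
    """Return every directory on PATH that contains the given file.

    Mirrors what the Windows `where` command does for e.g. cublas64_10.dll.
    """
    hits = []
    for d in os.environ.get("PATH", "").split(os.pathsep):
        if d and os.path.isfile(os.path.join(d, filename)):
            hits.append(d)
    return hits

# An empty list means the DLL will not be found at run time.
print(find_on_path("nppc64_10.dll"))
```

If this prints an empty list on the machine you are deploying to, add the CUDA `bin` directory to the path as described above.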


There are a few components you need to download and/or install before you can get started; you first need to:

  1. Install Visual Studio 2017, selecting the “Desktop development with C++” workload shown in the image below. If you already have an installation ensure that the correct workload is installed and that you have updated to the latest version.
  2. Download the source files for both OpenCV and OpenCV contrib, available on GitHub. Either clone the git repos OpenCV and OpenCV Contrib, making sure to check out the 4.1.0 tag, or download the archives OpenCV 4.1.0 and OpenCV Contrib 4.1.0 containing all the source files.
    Note: I have seen lots of guides which include instructions to download and use git to get the source files; this is an unnecessary step. If you are a developer who does not already have git installed, I assume there is a good reason for that, and I would not advise installing it just to build OpenCV.
  3. Install CMake – Version 3.13.2 is used in the guide.
  4. Install The CUDA 10.1 Toolkit. Note: If your system path is too long, CUDA will not add the path to its binaries C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin during installation. If you receive a warning about this at the end of the installation process do not forget to manually add the path to your system path, otherwise opencv_world410.dll will fail to load.
  5. Optional – To decode video on the GPU with Nvidia Video Codec SDK
    • Register and download the Video Codec SDK.
    • Extract the archive and copy its include and Lib directories into your CUDA installation. For CUDA x.x the installation directory is

      C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vx.x

    Note: Before building you may want to ensure that your GPU has decoding support by referring to the Nvidia Video Decoder Support Matrix.

  6. Optional – To accelerate video decoding with Quick Sync on Intel CPUs, register for, download and install the Intel Media SDK.
  7. Optional – To accelerate specific OpenCV operations, install both Intel MKL and TBB by registering for community licensing and downloading them for free. MKL version 2019.1.144 and TBB version 2019.2.144 are used in this guide; I cannot guarantee that other versions will work correctly.
  8. Optional – To call OpenCV CUDA routines from Python, install the 64 bit version of Anaconda, making sure to tick “Register Anaconda as my default Python ..”. This guide has been tested against Anaconda 3.7 installed in the default location for a single user.
  9. Optional – To significantly reduce the build time, download the Ninja build system – version 1.9.0 is used in this guide.

Generating OpenCV build files with CMake

Before you can build OpenCV you have to generate the build files with CMake. There are two ways to do this, from the command prompt or with the CMake GUI; by far the quickest and easiest way to proceed is to use the command prompt to generate the base configuration. Then, if you want to add any additional configuration options, you can open up the build directory in the CMake GUI as described here.

In addition there are several ways to build OpenCV using Visual Studio. For simplicity only two methods are discussed here:

  1. Building OpenCV with Visual Studio solution files.
  2. Building OpenCV with the ninja build system to reduce the build time.

Finally instructions are included for building and using the Python bindings to access the OpenCV CUDA modules.

Building OpenCV 4 with CUDA and Intel MKL + TBB, with Visual Studio solution files from the command prompt (cmd)

The steps below will build the opencv_world shared library using NVIDIA’s recommended settings for future hardware compatibility. This does however have two drawbacks: first, the build can take several hours to complete and second, the shared library will be at least 959 MB depending on the configuration that you choose below. To find out how to reduce both the compilation time and the size of opencv_world, read choosing the compute-capability first and then continue as below. If you wish to build the Python bindings and/or use the Ninja build system then see the sections including python bindings and/or decreasing the build time with Ninja respectively before proceeding.

  1. Open up the command prompt (windows key + r, then type cmd and press enter)
  2. Ignore this step if you are not building with Intel MKL + TBB. Enter the below
    "C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\tbb\bin\tbbvars.bat" intel64

    to temporarily set the environmental variables for locating your TBB installation.

  3. Set the location of the source files and build directory by entering the text shown below, first setting PATH_TO_OPENCV_SOURCE to the root of the OpenCV files you downloaded or cloned (the directory containing 3rdparty, apps, build, etc.) and PATH_TO_OPENCV_CONTRIB_MODULES to the modules directory inside the contrib repo (the directory containing cudaarithm, cudabgsegm, etc.).
    set "openCvSource=PATH_TO_OPENCV_SOURCE"
    set "openCvBuild=%openCvSource%\build"
    set "buildType=Release"
    set "generator=Visual Studio 15 2017 Win64"
  4. Then choose your configuration from below and copy to the command prompt:
    • OpenCV with CUDA
    • OpenCV with CUDA and MKL multi-threaded with TBB
    • OpenCV with CUDA and MKL multi-threaded with TBB, with OpenCV itself also built against TBB
  5. If you want to make any configuration changes before building, then you can do so now through the CMake GUI.
  6. The OpenCV.sln solution file should now be in your PATH_TO_OPENCV_SOURCE/build directory. To build OpenCV you have two options; depending on your preference you can:
    • Build directly from the command line by simply entering the following (swapping Debug for Release to build a release version)
      "C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target INSTALL --config Debug
    • Build through Visual Studio GUI by opening up the OpenCV.sln in Visual Studio, selecting your Configuration, clicking on Solution Explorer, expanding CMakeTargets, right clicking on INSTALL and clicking Build.

    Either approach will both build the library and copy the necessary redistributable parts to the install directory, PATH_TO_OPENCV_SOURCE/build/install in this example. All that is required now to run any programs compiled against these libs is to add the directory containing opencv_world410.dll (and tbb.dll if you have built with Intel TBB) to your path environment variable.
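For reference, the configuration links in step 4 expand to CMake invocations along the lines of the sketch below, which I have trimmed to a plain CUDA build from the full command given later in the Ninja section; treat the exact flag set as a starting point rather than a definitive recipe.

```shell
:: Run after the "set" commands from step 3; generates Visual Studio solution files.
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" ^
  -G"%generator%" -DBUILD_opencv_world=ON ^
  -DWITH_CUDA=ON -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DCUDA_ARCH_PTX=7.5 ^
  -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1" ^
  -DOPENCV_EXTRA_MODULES_PATH="PATH_TO_OPENCV_CONTRIB_MODULES" -DOPENCV_ENABLE_NONFREE=ON
```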

If everything was successful, congratulations, you now have OpenCV built with CUDA. To quickly verify that the CUDA modules are working, and to check whether there is any performance benefit on your specific hardware, see below.

Decreasing the build time with Ninja

The build time for OpenCV can be cut in half by utilizing the Ninja build system instead of directly generating Visual Studio solution files. The only difference you may notice is that Ninja will only produce one configuration at a time, either Debug or Release, therefore the buildType must be set before calling CMake. In the section above the configuration was set to Release; to change it to Debug simply replace Release with Debug as shown below

set "buildType=Debug"

Using ninja only requires two extra configuration steps:

  1. Setting both the path to the ninja executable and configuring Visual Studio Development tools. Both are achieved by entering the following into the command or Anaconda3 prompt before entering the CMake command, making sure to first set PATH_TO_NINJA to the directory containing ninja.exe, and changing Community to either Professional or Enterprise if necessary
    "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvars64.bat"
    set "ninjaPath=PATH_TO_NINJA"
    set path=%ninjaPath%;%path%
  2. Changing the generator from “Visual Studio 15 2017 Win64” to ninja
    set "generator=Ninja"

For example entering the following into the Anaconda3 prompt will generate ninja build files to build OpenCV with CUDA 10.1 and Python bindings

"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvars64.bat"
set "ninjaPath=PATH_TO_NINJA"
set path=%ninjaPath%;%path%
set "openCvSource=PATH_TO_OPENCV_SOURCE"
set "openCvBuild=%openCvSource%\build"
set "buildType=Release"
set "generator=Ninja"
set "pathToAnaconda=PATH_TO_ANACONDA3"
set "openCVExtraModules=PATH_TO_OPENCV_CONTRIB_MODULES"
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DBUILD_opencv_world=ON -DBUILD_opencv_gapi=OFF -DWITH_CUDA=ON -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1" -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DINSTALL_TESTS=ON -DINSTALL_C_EXAMPLES=ON -DBUILD_EXAMPLES=ON -DWITH_OPENGL=ON -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" -DOPENCV_ENABLE_NONFREE=ON -DCUDA_ARCH_PTX=7.5 -DBUILD_opencv_python3=ON -DBUILD_opencv_hdf=OFF -DPYTHON3_INCLUDE_DIR=%pathToAnaconda%/include -DPYTHON3_LIBRARY=%pathToAnaconda%/libs/python37.lib -DPYTHON3_EXECUTABLE=%pathToAnaconda%/python.exe -DPYTHON3_NUMPY_INCLUDE_DIRS=%pathToAnaconda%/lib/site-packages/numpy/core/include -DPYTHON3_PACKAGES_PATH=%pathToAnaconda%/Lib/site-package -DWITH_NVCUVID=ON -DWITH_MFX=ON

The build can then be started in the same way as before, dropping the --config option, as

"C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target install

Adding additional configuration options with the CMake GUI

Once you have generated the base Visual Studio solution file from the command prompt, the easiest way to make any additional configuration changes is through the CMake GUI. To do this:

  1. Fire up the CMake GUI.
  2. Making sure that the Grouped checkbox is ticked, click on the browse build button

    and navigate to your PATH_TO_OPENCV_SOURCE/build directory. If you have selected the correct directory the main CMake window should resemble the below.

  3. Now any additional configuration changes can be made by just expanding any of the grouped items and ticking or unticking the values displayed. Once you are happy just press Configure,

    if the bottom window displays configuration successful press Generate, and you should see

    Now you can open up the Visual Studio solution file and proceed as before.

  4. Troubleshooting:
    • Make sure you have the latest version of Visual Studio 2017 (>= 15.8)
    • Not all options are compatible with each other and the configuration step may fail as a result. If so examine the error messages given in the bottom window and look for a solution.
    • If the build fails after making changes to the base configuration, I would advise you to remove the build directory and start again, making sure that you can at least build the base Visual Studio solution files produced from the command line

Including Python bindings

Building and installing python support is incredibly simple:

  1. Open up the Anaconda3 command prompt and enter
    set "pathToAnaconda=PATH_TO_ANACONDA3"

    ensuring that PATH_TO_ANACONDA3 only uses forward slashes (/) as path separators and points to the Anaconda3 directory, e.g. C:/Users/mbironi/Anaconda3/.

  2. Follow the instructions from above to build your desired configuration, issuing all the commands to the Anaconda prompt instead of the default windows command prompt and appending the below to the CMake configuration before generating the build files.
    -DBUILD_opencv_python3=ON -DBUILD_opencv_hdf=OFF -DPYTHON3_INCLUDE_DIR=%pathToAnaconda%/include -DPYTHON3_LIBRARY=%pathToAnaconda%/libs/python37.lib -DPYTHON3_EXECUTABLE=%pathToAnaconda%/python.exe -DPYTHON3_NUMPY_INCLUDE_DIRS=%pathToAnaconda%/lib/site-packages/numpy/core/include -DPYTHON3_PACKAGES_PATH=%pathToAnaconda%/Lib/site-package
  3. Make sure you build Release; python bindings cannot by default be generated for a Debug configuration. That said, you can easily generate a debug build by modifying the contents of pyconfig.h, changing

    #pragma comment(lib,"python37_d.lib")

    to

    #pragma comment(lib,"python37.lib")

    and

    #       define Py_DEBUG

    to

    //#       define Py_DEBUG

    The default location of pyconfig.h in Anaconda3 is %USERPROFILE%\Anaconda3\include\pyconfig.h. However, the version you are compiling against may differ; to check the location, simply open up CMake in the build directory as detailed in Adding additional configuration options with the CMake GUI and check the entries under PYTHON2_INCLUDE_DIR and PYTHON3_INCLUDE_DIR shown below

  4. Verify that the CMake output detailing the modules to be built includes python3; if not, look for errors in the output preceding the below.

    --   OpenCV modules:
    --     To be built:                 aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dpm face features2d flann fuzzy hfs highgui img_hash imgcodecs imgproc line_descriptor ml objdetect optflow phase_unwrapping photo plot python2 python3 quality reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab world xfeatures2d ximgproc xobjdetect xphoto
  5. Once generated, the bindings should automatically be copied to your python install. If not, you can manually copy them using the following, which assumes you have python 3.7 installed through Anaconda in the default location for a single user.
    copy "%openCvBuild%\lib\python3\cv2.cp37-win_amd64.pyd" "%USERPROFILE%\Anaconda3\Lib\site-packages\cv2.cp37-win_amd64.pyd"
  6. Include the path to the opencv_world shared library in your user or system path or temporarily by entering
    set path=%openCvBuild%\install\x64\vc15\bin;%path%
  7. Test that the freshly compiled python module can be located and loads correctly by entering
    python -c "import cv2; print(f'OpenCV: {cv2.__version__} for python installed and working')"

    and checking the output for

    OpenCV: 4.1.0 for python installed and working

    If you do not see the above output then see the troubleshooting section below.
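The pyconfig.h edit described in step (3) can also be scripted. Below is a sketch; the `patch_pyconfig` helper is my own, and it assumes the python 3.7 library names shown above:

```python
import re

def patch_pyconfig(text):
    """Apply the two pyconfig.h edits for a Debug build of the bindings:
    link against the release python library and comment out Py_DEBUG."""
    # Swap the debug import library for the release one.
    text = text.replace("python37_d.lib", "python37.lib")
    # Comment out the Py_DEBUG define (only lines that define it, not #ifdef uses).
    text = re.sub(r"^(#\s*define\s+Py_DEBUG)", r"//\1", text, flags=re.M)
    return text

# Example on a fragment of the file; for real use, read and rewrite pyconfig.h.
sample = '#pragma comment(lib,"python37_d.lib")\n#       define Py_DEBUG\n'
print(patch_pyconfig(sample))
```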

If there were no errors from the above steps the Python bindings should be installed correctly. To use on a permanent basis don’t forget to permanently add the path to the opencv_world shared library to your user or system path. To quickly verify that the CUDA modules can be called and check if there is any performance benefit on your system continue below, then to see how to get the most performance from the OpenCV Python CUDA bindings see Accelerating OpenCV with CUDA streams in Python.
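If you are unsure which paths to pass for the PYTHON3_* CMake options, the interpreter can report them itself. A quick sketch (run it from the Anaconda prompt so the right python is picked up):

```python
import sys
import sysconfig

import numpy as np

# These correspond to the PYTHON3_* options passed to CMake above.
print("PYTHON3_EXECUTABLE:        ", sys.executable)
print("PYTHON3_INCLUDE_DIR:       ", sysconfig.get_paths()["include"])
print("PYTHON3_NUMPY_INCLUDE_DIRS:", np.get_include())
print("PYTHON3_PACKAGES_PATH:     ", sysconfig.get_paths()["purelib"])
```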

Troubleshooting, if the output from step (7) is:

  1. ModuleNotFoundError: No module named 'cv2'

    You have not copied the bindings to your python distribution, see step (5).

  2. ImportError: ERROR: recursion is detected during loading of "cv2" binary extensions. Check OpenCV installation.

    Ensure that you don’t have OpenCV installed though conda and/or pip, and that you don’t have another copy of the python bindings in your site-packages directory.

  3. ImportError: DLL load failed: The specified procedure could not be found.

    One of the required DLLs is not present on your windows path. From the feedback I have received, it is most likely that you have not added the location of opencv_world410.dll, the path to the CUDA binaries, or the path to tbb.dll if built with Intel TBB. This can be quickly checked by entering the following

    where opencv_world410.dll
    where nppc64_10.dll
    where tbb.dll (if you have built with Intel TBB)

    and checking that you see the path to the dll in each case. If instead you see

    INFO: Could not find files for the given pattern(s).

    add the paths (step (6) above, step (4) from the Prerequisites and step (6) from the Pre-build Checklist) and check again. Once you can see the paths to the DLLs, try step (7) again.

  4. If you get any other errors, make sure to check OpenCV is installed correctly by running through the steps in Verifying OpenCV is CUDA accelerated.

Verifying OpenCV is CUDA accelerated

The easiest way to quickly verify that everything is working is to check that one of the inbuilt CUDA performance tests passes. For this I have chosen the GEMM test, which:

  • runs without any external data;
  • should be highly optimized on both the GPU and CPU making it “informative” to compare the performance timings later on, and;
  • has OpenCL versions.

To run the CUDA performance test simply enter the following into the existing command prompt

"%openCvBuild%\install\x64\vc15\bin\opencv_perf_cudaarithm.exe" --gtest_filter=Sz_Type_Flags_GEMM.GEMM/29

the full output is shown below. To verify that everything is working look for the green [ PASSED ] text in the image below.

The above test performed matrix multiplication on a 1024x1024x2 single precision matrix using a midrange GTX 1060 GPU 100 times, with a mean execution time of 3.70 ms, which can be seen in the following output taken from the image above.

[ PERFSTAT ]    (samples=16   mean=3.70   median=3.67   min=3.60   stddev=0.11 (2.9%))

If the test has passed then we can confirm that the above code was successfully run on the GPU using CUDA. Next it would be interesting to compare these results to the same test run on a CPU to check we are getting a performance boost, on the specific hardware set up we have.

CPU (i5-6500) Performance

The standard opencv core GEMM performance test does not use 1024×1024 matrices, therefore for this comparison we can simply change the GEMM tests inside opencv_perf_core.exe to process this size instead of 640×640. This is achieved by simply changing the following line to be

::testing::Values(Size(1024, 1024), Size(1280, 1280)),

Denoting the modified executable as opencv_perf_core_1024.exe, the corresponding CPU test can be run as

"%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3

resulting in the following output on a midrange i5-6500.

[ PERFSTAT ]    (samples=10   mean=1990.56   median=1990.67   min=1962.95   stddev=16.56 (0.8%))

The execution time is three orders of magnitude greater than on the GPU, so what is wrong with our CPU? As it turns out, nothing; to get a baseline result I purposely ran this without building OpenCV against any optimized BLAS. To demonstrate the performance benefit of building OpenCV with Intel’s MKL (which includes optimized BLAS) and TBB, I have run the same test again with two different levels of optimization, OpenCV built against:

  1. Intel MKL without multi-threading
    [ PERFSTAT ]    (samples=10   mean=90.77   median=90.15   min=89.64   stddev=1.98 (2.2%))
  2. Intel MKL multi-threaded with TBB
    [ PERFSTAT ]    (samples=100   mean=28.86   median=28.37   min=27.34   stddev=1.33 (4.6%))

This demonstrates the importance of using multi-threaded MKL and brings the gap between CPU and GPU performance down significantly. Now we are ready to compare with OpenCL.

OpenCL Performance

In OpenCV 4.0 the CUDA modules were moved from the main to the contrib repository, presumably because OpenCL will be used for GPU acceleration going forward. To examine the implications of this I ran the same performance tests as above again, only this time on each of my three OpenCL devices. The results for each device are given below including the command to run each test.

  • Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:CPU:Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    "%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=205.35   median=205.63   min=200.35   stddev=2.82 (1.4%))
  • Intel(R) HD Graphics 530
    set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:GPU:Intel(R) HD Graphics 530
    "%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=130.88   median=129.82   min=127.46   stddev=2.72 (2.1%))
  • GeForce GTX 1060 3GB
    "%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=8.83   median=8.85   min=8.53   stddev=0.17 (1.9%))

The performance results for all the tests are shown together below.

The results in the figure show that for this specific test and hardware configuration (GTX 1060 vs i5-6500):

  1. If we ignore OpenCL the CUDA implementation on the GTX 1060 is comfortably faster than the MKL + TBB implementation executed on the CPU.
  2. The OpenCL implementation on the GTX 1060 is significantly slower than the CUDA version. This is expected but unfortunate considering the OpenCV CUDA routines have been moved from the main repository and may eventually be deprecated.
  3. OpenCL still has a long way to go, in addition to its poor performance when compared with CUDA on the same device the implementations on both the CPU (i5-6500) and the iGPU (HD Graphics 530) were an order of magnitude slower than the optimized MKL + TBB implementation on the CPU.

The above comparison is just for fun, to give an example of how to quickly check whether using OpenCV with CUDA on your specific hardware combination is worthwhile. For a more in-depth comparison on several hardware configurations see OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).

Python CUDA performance

To quickly verify that the CUDA modules are being called from Python you can run the same GEMM test as before, this time from an interactive IPython session (the %timeit magic used below is IPython-specific). Assuming that all of the steps in Including Python bindings completed successfully, open up the Anaconda3 prompt and issue the following to start the Python session and ensure that the path to OpenCV is set correctly.

set path=%openCvBuild%\install\x64\vc15\bin;%path%

Then run the GEMM test on the GPU with CUDA from within Python

import numpy as np
import cv2 as cv
npTmp = np.random.random((1024, 1024)).astype(np.float32)
npMat1 = np.stack([npTmp,npTmp],axis=2)
npMat2 = npMat1
cuMat1 = cv.cuda_GpuMat()
cuMat2 = cv.cuda_GpuMat()
cuMat1.upload(npMat1)
cuMat2.upload(npMat2)
%timeit cv.cuda.gemm(cuMat1, cuMat2, 1, None, 0, None, 1)

You should see output similar to

4.47 ms ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is very close to the result (3.70 ms on the GTX 1060) when the same test was called directly from C++. If you see similar output then this confirms that you are running OpenCV from python on the GPU with CUDA.

For completeness you can run the same test on the CPU as

%timeit cv.gemm(npMat1,npMat2,1,None,0,None,1)

and confirm that the new result

27.9 ms ± 664 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

is comparable with the previous one (28.86 ms on the i5-6500 from C++).

You can also perform a quick sanity check to confirm that you are seeing good performance for the GEMM operation in OpenCV. An easy way to do this is to run the same operation again only this time in NumPy.

npMat3 = npTmp + npTmp*1j
npMat4 = npMat3
%timeit npMat3 @ npMat4

As you can see the data is structured in a slightly different way, however the timings

32 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

should hopefully be comparable to the OpenCV result (27.9 ms on the i5-6500 calling OpenCV from Python).
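Note that %timeit is an IPython magic; in a plain python session the stdlib timeit module does the same job. A rough sketch (with a smaller matrix so it runs quickly; the timings will not match the numbers above):

```python
import timeit

import numpy as np

npTmp = np.random.random((256, 256)).astype(np.float32)
npMat3 = npTmp + npTmp * 1j

# Best of three repeats, ten matrix multiplications per repeat.
best = min(timeit.repeat(lambda: npMat3 @ npMat3, repeat=3, number=10)) / 10
print(f"{best * 1e3:.3f} ms per loop")
```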

From the results of these quick tests we can infer that:

  1. The OpenCV CUDA modules are being called from python.
  2. The overhead from using the CPU and/or CUDA python interface instead of directly calling from C++ is small.
  3. The GEMM operation in OpenCV is highly optimized when built against Intel MKL.

Choosing the compute-capability

The default command line options given above implement NVIDIA’s recommended settings for future hardware compatibility. This means that any programs linked against the resulting opencv_world shared library should work on all GPU’s currently supported by CUDA 10.1 and all GPU’s released in the future. As mentioned above this comes at a cost, both in terms of compilation time and shared library size. Before discussing the CMake settings which can be used to reduce these costs we need to understand the following concepts:

  • Compute-capability – every GPU has a fixed compute-capability which determines its general specifications and features. In general the more recent the GPU the higher the compute-capability and the more features it will support. This is important because:
    • Each version of CUDA supports different compute-capabilities. Usually a new version of CUDA comes out to support a new GPU architecture; in the case of CUDA 10.0, support was added for the Turing (compute 7.5) architecture. On the flip side, support for older architectures can be removed, for example CUDA 9.0 removed support for the Fermi (compute 2.0) architecture. Therefore by choosing to build OpenCV with CUDA 10.1 we have limited ourselves to GPUs of compute-capability >=3.0. Notice we have not limited ourselves to GPUs of compute-capability <=7.5; the reason for this is discussed in the next section.
    • You can build opencv_world410.dll to support one or many different compute-capabilities, depending on your specific requirements.
  • Supporting a compute-capability – to support a specific compute-capability you can do either of the following, or a combination of the two:
    • Generate architecture-specific cubin files, which are only forward-compatible with GPU architectures with the same major version number. This can be controlled by passing CUDA_ARCH_BIN to CMake. For example, passing -DCUDA_ARCH_BIN=3.0 to CMake will result in an opencv_world shared library containing binary code which can only run on compute-capability 3.0, 3.5 and 3.7 devices. Furthermore it will not support any specific features of compute-capability 3.5 (e.g. dynamic parallelism) or 3.7 (e.g. 128 K 32 bit registers). In the case of OpenCV 4 this would not restrict any functionality, because it only uses features from compute-capability 3.0 and below. This can be confirmed by a quick search of the contrib repository for the __CUDA_ARCH__ flag.
    • Generate forward-compatible PTX assembly for a virtual architecture, which is forward-compatible with all GPU architectures of greater than or equal compute-capability. This can be controlled by passing CUDA_ARCH_PTX to CMake. For example, by passing -DCUDA_ARCH_PTX=7.5 to CMake, the opencv_world shared library will contain PTX code for compute-capability 7.5 which can be Just In Time (JIT) compiled to architecture-specific binary code by the CUDA driver on any future GPU architecture. Because of the default CMake rules, when CUDA_ARCH_BIN is not explicitly set the library will also contain architecture-specific cubin files for GPU architectures 3.0-7.5.
    • PTX considerations – given that PTX code is forward-compatible and cubin binaries are not, it would be tempting to only include the former. To understand why this might not be such a great idea, here are a few things to be aware of when generating PTX code:
      1. As mentioned previously, the CUDA driver JIT compiles PTX code at run time and caches the resulting cubin files, so that the compile operation should in theory be a one-time delay, at least until the driver is updated. However, if the cache is not large enough, JIT compilation will happen every time, causing a delay on every execution of your program. To get an idea of this delay I passed -DCUDA_ARCH_BIN=3.0 and -DCUDA_ARCH_PTX=3.0 to CMake before building OpenCV. I then emptied the cache (default location %appdata%\NVIDIA\ComputeCache\) and ran the GEMM performance example on a GTX 1060 (compute-capability 6.1) to force JIT compilation. I measured an initial delay of over 3 minutes as the PTX code was JIT compiled before the program started to execute. Following that, the delay of subsequent executions was around a minute, because the default cache size (256 MB) was not large enough to store all the compiled PTX code. Given my compile options, the only way to remove this delay is to increase the size of the cache by setting the CUDA_CACHE_MAXSIZE environment variable to a number of bytes greater than required. Unfortunately, because “Older binary codes are evicted from the cache to make room for newer binary codes if needed”, this is more of a band aid than a solution: the maximum cache size is 4 GB, so your compiled PTX code can be evicted at any point if other programs on your machine are also JIT compiling from PTX, bringing back the “one-time” delay.
    2. For maximum device coverage you should include PTX for the lowest possible GPU architecture you want to support.
    3. For maximum performance NVIDIA recommends including PTX for the highest possible architecture you can.
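The cache-size workaround from point (1) can be sketched as the following cmd session. The 1 GB value and the decision to clear the cache are illustrative choices for timing experiments, not NVIDIA recommendations:

```shell
:: Raise the CUDA JIT cache to 1 GB (value in bytes) for this session only.
:: Set it as a permanent user environment variable to persist across reboots.
set CUDA_CACHE_MAXSIZE=1073741824

:: Optionally empty the existing cache (default Windows location) to force
:: a fresh JIT compilation, e.g. when measuring the one-off compile delay.
rd /s /q "%appdata%\NVIDIA\ComputeCache"
```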

CMake command line options to control cubin/PTX content of the OpenCV shared library

Given (1)-(3) above, the command line options that you want to pass to CMake when building OpenCV will depend on your specific requirements. I have given some examples below for various scenarios, assuming a main GPU of compute-capability 6.1:

  • Firstly stick with the defaults if compile time and shared library size are not an issue. This offers the greatest amount of flexibility from a development standpoint, avoiding the possibility of needing to recompile OpenCV when you switch GPU.
  • If your programs will always be run on your main GPU, just pass -DCUDA_ARCH_BIN=6.1 to CMake to target your architecture only. It should take around an hour to build, depending on your CPU and the resulting shared library should not be larger than 200 MB.
  • If you are going to deploy your application, but only to newer GPUs, pass -DCUDA_ARCH_BIN=6.1,7.0,7.5 and -DCUDA_ARCH_PTX=7.5 to CMake for maximum performance and future compatibility. This is advisable because you may not have any control over the size of the JIT cache on the target machine, so including cubins for all compute-capabilities you want to support is the only way to be sure of preventing a JIT compilation delay on every invocation of your application.
  • If size is really an issue but you don’t know which GPUs you want to run your application on, then to ensure that your program will run on all current and future supported GPUs, pass -DCUDA_ARCH_BIN=6.1 and -DCUDA_ARCH_PTX=3.0 to CMake for maximum coverage.
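Putting the deployment scenario together, the configure step might look like the sketch below. The directory paths are placeholders, the -S/-B options require CMake 3.13 or later, and you would combine these flags with the rest of the CMake options from the guide:

```shell
:: Hypothetical directory layout - substitute your own checkout and build dirs.
set "openCvSource=C:\OpenCV\sources\opencv-4.1.0"
set "openCvBuild=C:\OpenCV\build"

:: cubins for 6.1, 7.0 and 7.5, plus PTX for 7.5 to cover future GPUs.
"C:\Program Files\CMake\bin\cmake.exe" -G "Visual Studio 15 2017 Win64" ^
    -DWITH_CUDA=ON ^
    -DCUDA_ARCH_BIN=6.1,7.0,7.5 ^
    -DCUDA_ARCH_PTX=7.5 ^
    -S "%openCvSource%" -B "%openCvBuild%"
```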
Copyright secured by Digiprove © 2020 James Bowley

45 thoughts on “Accelerating OpenCV 4.1.0 – build with CUDA, Intel MKL + TBB and python bindings”

  1. Dear Sir,
    I am trying to build for CUDA only. I am using VS 2017 15.9.12, SDK 17763.
    When building with Visual Studio, I am stuck with lot of warnings:
    “field of class type without a DLL interface used in a class with a DLL interface”

    Please kindly help me to resolve this warning!

    1. Those warnings should not prevent the build from succeeding. Did opencv_world410.dll build successfully?

      1. Thank you for the reply. Yes, I still can successfully build the opencv_world410.dll. But, when I try to call any CUDA function, my program crashes 🙁

        1. Hi, did you manage to get the gemm test described in the guide to work? What error do you see in the cmd window when your program crashes?

          1. Dear Sir,
            I just fixed the issue. The reason was my graphics driver. cuda 10.1 needs minimum 418.96. Mine was older. I updated to the recent driver and everything works now!
            Thank you for your amazing tutorial. Helped me a lot!!

  2. Thanks for your great post.
    I followed your instructions was able to build and install OpenCV with CUDA, MKL multi-threaded with TBB and TBB using Ninja on my Windows 10 machine.
    But get an error when I run the test app:

    ImportError: DLL load failed: The specified module could not be found.
    This is thrown on ‘import cv2’ so I assume the built DLL is not being found.
    And here’s a dumb question, does opencv 4 need to be installed prior to these steps or does this process completely install opencv 4?

    Thanks for your great post !


    1. Hi Randy,
      Did you set the path to the opencv_world410.dll in a similar way to
      set path=%openCvBuild%\install\x64\vc15\bin;%path%
      and successfully run
      python -c "import cv2; print(f'OpenCV: {cv2.__version__} for python installed and working')"
      If so it may be that you opened up another Anaconda prompt to run your python code without setting the path beforehand. I would suggest permanently adding the path to opencv_world410.dll to your user path environment variable.

      1. Wow, thanks for your quick reply !
        Path is set correctly and the .pyd file has been copied to the specified folder as shown in the Anaconda shell output below.
        I did a clean Anaconda3 install for this effort.

        (base) C:\Users\rgrah>dir C:\Users\rgrah\Anaconda3\Lib\site-packages\cv2*
        Volume in drive C has no label.
        Volume Serial Number is 3A3F-C0DB

        Directory of C:\Users\rgrah\Anaconda3\Lib\site-packages

        05/22/2019 09:09 AM cv2
        05/21/2019 09:08 PM 9,580,544 cv2.cp37-win_amd64.pyd
        1 File(s) 9,580,544 bytes
        1 Dir(s) 53,553,811,456 bytes free

        (base) C:\Users\rgrah>where opencv_world410.dll

        (base) C:\Users\rgrah>python -c "import cv2"
        Traceback (most recent call last):
        File "", line 1, in
        File "C:\Users\rgrah\Anaconda3\lib\site-packages\cv2\", line 89, in
        File "C:\Users\rgrah\Anaconda3\lib\site-packages\cv2\", line 79, in bootstrap
        import cv2
        ImportError: DLL load failed: The specified module could not be found.

        (base) C:\Users\rgrah>

        1. It looks like you have the cv2 directory as well, which in OpenCV 4.1.0 gets created automatically when you build. The cv2.cp37-win_amd64.pyd file inside the cv2 directory should be identical to the one you manually copied across, as you have a fresh install of Anaconda without OpenCV. Verify that it is and remove the file you copied across. Then hopefully everything should work.

  3. Thanks again, but still no luck.
    I’ll run through your process for 4.0 and see how that goes since I don’t have a specific requirement for 4.1.

    Thanks again !

    1. Before you do that make sure that you have tbb.dll on your path, if not add

      C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\redist\intel64_win\tbb\vc_mt

      to your path variable.

  4. Hi, I followed this guide to build with CUDA and python bindings without using TBB or MKL and everything seemed to be working (your python CUDA performance test code ran fine).

    I’m trying to use cudacodec_VideoReader but as far as I can tell it’s silently failing.
    Here’s a little test code:

    vidCap = cv.cudacodec_VideoReader(r"C:\Users\PC\Desktop\test.mp4")
    frame = vidCap.nextFrame()
    Output is this:

    Traceback (most recent call last):
    File “C:\Users\PC\Desktop\”, line 3, in
    frame = vidCap.nextFrame()
    TypeError: Incorrect type of self (must be ‘cudacodec_VideoReader’ or its derivative)

    I can’t see any way to test if it has actually loaded the file before checking for frames. I can’t see that there’s anything else I can do with this function.
    I’m flailing in the dark here. Any advice?

    1. Hi, I have updated the guide with instructions on how to include the modules for hardware video decoding on an Nvidia GPU.

      1. Thank you for the great work! I tried to access the HW decoding and I’m still having trouble. I can confirm similar results to Steeve. I extracted the codec SDK according to the instructions to the CUDA toolkit path. I also tried adding the lib folder to the path, to no avail. The error I’m getting is:
        > cv2.cudacodec.createVideoReader(path)
        cv2.error: OpenCV(4.1.0) D:\James\repos\opencv\modules\core\include\opencv2/core/private.cuda.hpp:113: error: (-213:The function/feature is not implemented) The called functionality is disabled for current build or platform in function ‘throw_no_cuda’

        1. Hi, that error implies that you have not built with nvcuvid. That said, even if you have compiled successfully, the python interface to cudacodec does not work properly in OpenCV 4.1.0; cudacodec is completely broken and currently being fixed for 4.1.1, so I would advise avoiding it until this happens. If however you really want to get something working, you can investigate your issue following the steps below.

          1) Make sure you are building against nvcuvid. In the CMake GUI, when you expand the CUDA options is CUDA_nvcuvid_LIBRARY populated with the path to your nvcuvid.lib, e.g. C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/lib/x64/nvcuvid.lib? Alternatively, did you see output similar to the below in the Anaconda prompt?

          — NVIDIA GPU arch: 30 35 37 50 52 60 61 70 75
          — NVIDIA PTX archs: 75

          If so did everything build correctly and are you sure you don’t have opencv installed through anaconda? I would suggest trying to run the install\x64\vc16\bin\opencv_test_cudacodecd.exe to check your build after verifying you have compiled against nvcuvid.lib.

          2) Fix the python bindings. Once you have confirmed you can build against nvcuvid.lib, to use the python interface you need to modify the file to enable nextFrame() to be called from python, as shown here, before you run CMake and build OpenCV. Then you should be able to call the HW decoder from python.

          Even with the above modification (2) you may still have problems with h265 decoding and reading from IP cameras. You can try using the pre-built binaries which I sent to Steeve (!CI4WlKZC!1sxXMQx3_3_jhV6E48c7HJbq5y6CNJl0gVLv3cCpg5Y), which work from python and allow you to stream h264[5] from IP cameras, but they are untested and may be buggy.

          1. Thank you! That helps a lot and I can live with the situation as it is now. I was hoping for performance improvements and I can wait until they really get fixed or I get inspired to look deeper into it. Again, fantastic instructions altogether!

  5. Dear James,
    I came across your post after searching high and low on how to install OpenCV with CUDA support for Python. Your post is fantastic; the only problem, for me, is that I am working on Linux (Ubuntu 18.04) and I haven’t been able to follow along quite well, for I am absolutely incompetent with windows and just a little bit less incompetent with Linux. If it isn’t much to ask, could you please point me to a guide that helps me understand in Linux terms what you’ve done in your post?

    Thank you for your post 🙂

    1. Hi, unfortunately I haven’t touched linux in a long time, so I can’t be of any help to you.

  6. Hi,
    Thanks for the great guide. I’ve managed to follow it up until the python bindings, where I’ve been stuck for hours. I’m a total novice. I know this isn’t supposed to be a support line, but any pointers would be super appreciated!

    I’m following a face clustering guide and it’s working ok on the cpu with opencv-python, but obviously I’m here because I’d like to use the gpu. After following this guide I’m getting “ModuleNotFoundError: No module named ‘cv2′” when running my script, which I’m guessing would be solved by successfully binding my opencv installation to python?

    I’m stuck here-
    “Follow the instructions from above to build your desired configuration, issuing all the commands to the Anaconda prompt instead of the default windows command prompt and appending the below to the CMake configuration before generating the build files.
    -DBUILD_opencv_python3=ON -DBUILD_opencv_hdf=OFF”

    What commands am I supposed to be issuing to anaconda prompt? I’ve tried “cmake -DBUILD_opencv_python3=ON -DBUILD_opencv_hdf=OFF”, and then generated in the gui.

    I didn’t want to ask because I know I’m missing something ‘incredibly simple’, but I just can’t figure it out.


    1. Hi, you need to follow the instructions in this guide, adding the two options you mentioned to the cmake command line input, and instead of issuing the commands to the standard command prompt, issue them to the Anaconda prompt so that your python paths are correctly set. But first I would check to see if there is a CUDA accelerated version of the face clustering example?

  7. I am fairly new to this and I have failed to build this in the past 3 days. I have almost given up and would rather just use the binaries you provided on the downloads page. How exactly do I install from the .7z zip file you provided? It just contains an install folder. Do I have to build the opencv files with cmake myself even with the binaries?

    1. Hi, to build against the binaries in Visual Studio you need to add EXTRACTED_LOCATION\install\include and EXTRACTED_LOCATION\install\x64\vc15\lib to your Additional Include Directories and Additional Library Directories and opencv_world410.lib to Additional Dependencies. Then add EXTRACTED_LOCATION\install\x64\vc15\bin to your user path.

      If I were you I would first open a command prompt, set the path to the .dll
      set path=EXTRACTED_LOCATION\install\x64\vc15\bin;%path%
      and follow the instructions for verifying OpenCv is CUDA accelerated to make sure you have all of the required dependencies.
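That check can be sketched as a single cmd session. EXTRACTED_LOCATION is a placeholder as above, and the one-liner assumes you have also installed the python bindings:

```shell
:: Make the OpenCV (and, if used, TBB) DLLs visible for this session only.
set path=EXTRACTED_LOCATION\install\x64\vc15\bin;%path%

:: Smoke test: a non-zero count means the library was compiled with CUDA
:: support and can see at least one CUDA-capable GPU.
python -c "import cv2; print(cv2.cuda.getCudaEnabledDeviceCount())"
```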

  8. Hi,
    I have been trying the steps in your page but with Visual Studio 2015.(As I need to use VS2015 for some custom dlls)
    The build is successfully generating the opencv_world410.lib files.
    But the function getCudaEnabledDeviceCount() is returning 0 for me, indicating there are no CUDA devices.
    So I would like to know if this is the problem with Visual Studio version.

    1. Hi, looking at the source for getCudaEnabledDeviceCount () I would say that either 1) you haven’t compiled with CUDA or 2) for some reason you can’t access your CUDA device.

      A quick check for
      1) examine the size of the .dll; if it's over say 150MB then it is likely you have compiled with CUDA, and
      2) verify that the deviceQuery.exe can find your cuda device by running the below from the command prompt

      “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\extras\demo_suite\deviceQuery.exe”

      1. Hi,
        Thank you for a quick reply.
        After the quick check,
        1)The size of opencv_world410.dll generated is around 952 MB.
        2) deviceQuery.exe is able to find the cuda device as shown below
        \\CUDA Device Query (Runtime API) version (CUDART static linking)
        \\Detected 1 CUDA Capable device(s)
        \\Device 0: “GeForce GTX 1060 with Max-Q Design”

        It seems like the OpenCV libraries are built with CUDA support.
        I also ran some CUDA sample programs like bandwidthTest. The test passed. So the CUDA part is also working fine.
        But still getCudaEnabledDeviceCount () returns 0. OpenCV is not using CUDA acceleration.

        The GEMM test is also passed with the below result:

        [ INFO ] Implementation variant: cuda.
        [ GPU INFO ] Run test suite on GeForce GTX 1060 with Max-Q Design GPU.
        Time compensation is 0
        [ GPU INFO ] Run on OS Windows x64.
        *** CUDA Device Query (Runtime API) version (CUDART static linking) ***

        Device count: 1

        Device 0: “GeForce GTX 1060 with Max-Q Design”
        CUDA Driver Version / Runtime Version 10.10 / 10.10
        CUDA Capability Major/Minor version number: 6.1
        Total amount of global memory: 6144 MBytes (6442450944 bytes)
        GPU Clock Speed: 1.48 GHz
        Max Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072,65536), 3D=(16384,16384,16384)
        Max Layered Texture Size (dim) x layers 1D=(32768) x 2048, 2D=(32768,32768) x 2048
        Total amount of constant memory: 65536 bytes
        Total amount of shared memory per block: 49152 bytes
        Total number of registers available per block: 65536
        Warp size: 32
        Maximum number of threads per block: 1024
        Maximum sizes of each dimension of a block: 1024 x 1024 x 64
        Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
        Maximum memory pitch: 2147483647 bytes
        Texture alignment: 512 bytes
        Concurrent copy and execution: Yes with 5 copy engine(s)
        Run time limit on kernels: Yes
        Integrated GPU sharing Host Memory: No
        Support host page-locked memory mapping: Yes
        Concurrent kernel execution: Yes
        Alignment requirement for Surfaces: Yes
        Device has ECC support enabled: No
        Device is using TCC driver mode: No
        Device supports Unified Addressing (UVA): Yes
        Device PCI Bus ID / PCI location ID: 1 / 0
        Compute Mode:
        Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

        deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.10, CUDA Runtime Version = 10.10, NumDevs = 1

        OpenCV version: 4.1.0
        OpenCV VCS version: unknown
        Build type: N/A
        WARNING: build value differs from runtime: Release
        Compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe (ver 19.16.27032.1)
        Parallel framework: ms-concurrency
        CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2
        Intel(R) IPP version: ippIP AVX2 (l9) 2019.0.0 Gold (-) Jul 26 2018
        Note: Google Test filter = Sz_Type_Flags_GEMM.GEMM/29
        [==========] Running 1 test from 1 test case.
        [———-] Global test environment set-up.
        [———-] 1 test from Sz_Type_Flags_GEMM
        [ RUN ] Sz_Type_Flags_GEMM.GEMM/29, where GetParam() = (1024×1024, 32FC2, 0|cv::GEMM_1_T)
        [ PERFSTAT ] (samples=13 mean=4.02 median=4.04 min=3.94 stddev=0.05 (1.3%))
        [ OK ] Sz_Type_Flags_GEMM.GEMM/29 (691 ms)
        [———-] 1 test from Sz_Type_Flags_GEMM (704 ms total)

        [———-] Global test environment tear-down
        [==========] 1 test from 1 test case ran. (734 ms total)
        [ PASSED ] 1 test.

        I have set the location of .dll file to the user path. I have added the path of include directories properly in Visual Studio.
        And I repeated the entire procedure with, Visual Studio 2017 now. But the result is the same.

        I am not able to figure out what is going wrong.

        Thank you

  9. Dear Sir,
    Thank you for this great tutorial. I followed your tutorial for cuda support step by step except python binding part, because I don’t need this. I successfully finished every step and gpu test on your tutorial.
    Then, I tried an image stitching code that I found on the web.

    When I execute the code Command Panel :
    [ INFO:0] Initialize OpenCL runtime…
    [ INFO:0] Successfully initialized OpenCL cache directory: C:\Users\burak\AppData\Local\Temp\opencv\4.1\opencl_cache\
    [ INFO:0] Preparing OpenCL cache configuration for context: NVIDIA_Corporation–GeForce_GTX_1050_Ti–426_00

    Stitching happens but slowly. CPU usage increases to 50% but GPU usage increases to only 1%. I can’t understand why, therefore I will be grateful if you help.

    1. Hi, sorry for the late reply, without the code it is difficult for me to comment. What I would say from the output you have is that it looks like you are using OpenCL, which in my experience runs a lot slower than CUDA on NVIDIA GPUs. Unfortunately it also looks like the CUDA stitching code was reliant on the NPP graphcut implementation which was removed after CUDA 8.0, so your only option would be to compile with CUDA 8.0 and an earlier version of Visual Studio.

  10. Hi there and thanks for the great post!

    I’ve successfully followed your description until the 4th step of Python bindings. My problem is, that I don’t have a “lib” folder inside the %openCvBuild% folder, so there is no .pyd file to copy. The Anaconda prompt gives no error after the build.

    What am I doing wrong?

    Thank you

  11. Hi!
    So quite a strange issue here: I’ve built everything with Python binding, but when I try to run the Python code in “Python CUDA performance” chapter, I get the following error:
    OpenCV(4.1.0) C:\projects\opencv-python\opencv\modules\core\include\opencv2/core/private.cuda.hpp:107: error: (-216:No CUDA support) The library is compiled without CUDA support in function ‘throw_no_cuda’
    The strange part is that on my computer there is no path like “C:\projects\opencv-python\opencv\modules\core\include\opencv2”, but I still can see it in the error.
    Also, by looking in Cmake at the build, I see that WITH_CUDA variable is checked.
    Do you have any idea where could it come from?

    1. Hi,
      When you type

      where opencv_world410.dll

      into the anaconda prompt does it show the path to your compiled dll? If so what size is that dll? Do you have OpenCV already installed through pip or conda? Are you using python 3.7 and is cv2.cp37-win_amd64.pyd from lib\python3\ in the Anaconda3\Lib\site-packages\ directory?

  12. Hi, Sir,
    Thank you for your detailed process about this CUDA with OpenCV setup. I try to follow your structure to set up only OpenCV with CUDA. I am stuck in “Building OpenCV 4 with CUDA and Intel MKL + TBB, with Visual Studio solution files from the command prompt (cmd)”. In step 6, I use the command line "C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target INSTALL --config Debug, and I also try to open the file “OpenCV.sln” with VS2017.

    They both show me the warning “C:\OpenCV\sources\opencv-4.1.0\modules\core\include\opencv2/core/types.hpp(530): warning : field of class type without a DLL interface used in a class with a DLL interface”. There are many of these warnings with different locations (such as types.hpp(530), types.hpp(532), types.hpp(771), mat.hpp(257), mat.hpp(2681), mat.hpp(2682), mat.hpp(3547), persistence.hpp(454), etc.).

    These warnings just keep repeating with no end. I cannot build successfully, and Visual Studio is always running, so I cannot stop it. Therefore, I finally force-stopped Visual Studio using Task Manager.

    Do you know what sort of problem it is? How can I identify the problem and solution?

    My hardware:
    GeForce GTX 1070 Ti
    Control panel: version 441.08

    I install softwares:
    CUDA 10.1 update2: CUDA Version 10.1.243
    (by the way, my Windows 10 also has other versions of CUDA installed, 9.2 and 10.0. However, I think the command line just sets the CUDA installation path, so I did not uninstall the other versions of CUDA.)
    Visual Studio 2017: 15.9.17
    OpenCV: 4.1.0 with opencv_contrib 4.1.0
    CMake: 3.13.2
    I downloaded the “Nvidia Video Codec SDK”, but did not add it into the CUDA installation. I think it is not necessary.

    Please help me if you are still available.

    1. Hi, you can ignore the warnings. In Visual Studio it can take up to 3 hours to build if you don’t restrict the GPU architecture you are building for. You can add -DCUDA_ARCH_BIN=6.1 to reduce the build time.

  13. Hi, Sir,

    Do you know any tutorial web page for OpenCV+CUDA+python? I can only find OpenCV+CUDA for C++. In OpenCV-python I try to use the function cv2.cuda.split(). I have an HSV colour image in GpuMat format from the code “hsv_gpu = cv2.cuda.cvtColor(frame_gpu, cv2.COLOR_BGR2HSV)”. This function is successful.

    After that I try to split the 3-channel image into 3 single-channel images. I try
    “r, g, b = cv2.cuda.split(hsv_gpu)”,
    but the error shows
    “TypeError: split() missing required argument ‘dst’ (pos 2)”.
    Then I try to build three objects
    “r = cv2.cuda_GpuMat(), g = cv2.cuda_GpuMat(), b = cv2.cuda_GpuMat()” ,
    I run the function this way.
    Then, I try to download one image with
    “hsvs =”.
    The error is
    ” error: (-215:Assertion failed) !empty() in function ‘cv::cuda::GpuMat::download’ ”
    I have no idea about the cv2.cuda.split() function. Do you know how to use this function? In addition, do you know any python API info about cv2.cuda?

    1. Hi, the python CUDA bindings are still new, so there are a few functions, like split, which are a bit confusing to use because they require the destination argument instead of returning it. In time this should hopefully be fixed so that you can, for example, call split as
      dst = cv2.cuda.split(hsv_gpu)
      and then interrogate the size and type of dst. Then if you wish to be more efficient you can pass the pre-allocated dst array into the function instead of allowing the function to perform the costly allocation of dst on every invocation.

      In the mean time the best advice I can give you is to read the python help for the function, e.g.

      help (cv2.cuda.split)
      Help on built-in function split:
          split(src, dst[, stream]) -> None
          .   @brief Copies each plane of a multi-channel matrix into an array.
          .   @param src Source matrix.
          .   @param dst Destination array/vector of single-channel matrices.
          .   @param stream Stream for the asynchronous version.
          .   @sa split

      Any arguments which are not in [] are required. Furthermore if dst is one of those arguments then you have to pre-allocate the array to the correct size if you are passing it to the function. That is

      r = cv2.cuda_GpuMat(hsv_gpu.size(),cv2.CV_8UC1) 
      g = cv2.cuda_GpuMat(hsv_gpu.size(),cv2.CV_8UC1) 
      b = cv2.cuda_GpuMat(hsv_gpu.size(),cv2.CV_8UC1) 

      should work with cv2.cuda.split(hsv_gpu,dst=[r,g,b])

      Additional resources I could recommend are the OpenCV python CUDA tests, although these are not complete; and a guide I wrote on using CUDA streams from python with OpenCV.

      1. Hi, James,

        Thank you very much for your info. I am using the split function well with your suggestion. I have another problem. I try to segment a person from an image. I find there is no inRange() function to use in cv2.cuda, therefore I split the HSV image and apply the cv2.cuda.threshold() function to each channel. I finally get the mask image of the person. However, when I try to get the person's pixels from the mask and original image, the bitwise_and() function does not work well. In normal OpenCV, people use
        “result = cv2.bitwise_and( img, img, mask)”
        I follow this idea upload the image and mask to GPU and use cv2.cuda.bitwise_and():
        “img_gpu = cv2.cuda_GpuMat(img)”
        “result = cv2.cuda.bitwise_and(img_gpu, img_gpu, mask_gpu)”
        The img_gpu is the 3 channel image (BGR). The mask_gpu is single channel 8bits image. The back ground is 0, person is 255.
        However, the result shows an image which is similar to the original image. It does not take out the person’s pixels. I tried changing the mask’s values so that the person pixels are 1. However, it doesn’t work. The output result is still similar to the original image.
        So far, I just have a solution using other functions. I split the BGR image into 3 single channels, then multiply each channel by the mask image. This sets background pixels to 0 in each channel image. However, this costs 5ms to finish all three channels' calculations, and the problem is merging the 3 channels: the cv2.cuda.merge() function costs up to 30ms to combine the 3 channels together. By the way, the image I am dealing with has 4K resolution. I want to process a video (25 frame rate) in real time in the future.
        Do you know the correct way to use cv2.cuda.bitwise_and() to segment the person with a mask? Do you know any other simple function that can do this job?

        1. Hi, I am not sure I follow exactly. From the help for cv2.cuda.bitwise_and shown below,

          Help on built-in function bitwise_and:
              bitwise_and(src1, src2[, dst[, mask[, stream]]]) -> dst
              .   @brief Performs a per-element bitwise conjunction of two matrices (or of matrix and scalar).
              .   @param src1 First source matrix or scalar.
              .   @param src2 Second source matrix or scalar.
              .   @param dst Destination matrix that has the same size and type as the input array(s).
              .   @param mask Optional operation mask, 8-bit single channel array, that specifies elements of the
              .   destination array to be changed. The mask can be used only with single channel images.
              .   @param stream Stream for the asynchronous version.

          it states that

          The mask can be used only with single channel images.

          so I am surprised you are not seeing errors when you are using a mask where src1 and src2 are 3 channel images?

          Either way I would suggest if possible, you use src2 as the mask with background 255 and person 0 as shown in this notebook.

          Regarding cv2.cuda.merge(). Unfortunately this is one of the functions where the help is wrong, if you don’t pass in the destination array then the function will download the GpuMat to the host causing a significant performance hit. In the notebook mentioned I have timed the execution of cv2.cuda.merge(). If I don’t pass in the destination array it takes ~13ms, however if I do it only takes ~0.2ms.

          1. Hi, James,

            Thank you very much for your help. I found a way to save time using cv2.cuda.merge(). Firstly, merge the three channels:
            “cv2.cuda.merge(src=[b, g, r], dst=result)”
            then I download the result from GPU
            “result_img =”
            These two lines cost 7ms, which is much shorter than 30ms (I had) before.

            However, the cv2.cuda.bitwise_and() function give me error, when I try to use mask in src2:
            “res_gpu = cv2.cuda.bitwise_and(img_gpu, mask_gpu)”
            It shows an error:
            “cv2.error: OpenCV(4.1.0) C:\OpenCV\sources\opencv_contrib-4.1.0\modules\cudaarithm\src\element_operations.cpp:141: error: (-215:Assertion failed) !scalar.empty() || (src2.type() == src1.type() && src2.size() == src1.size()) in function ‘`anonymous-namespace’::arithm_op'”
            It looks like src1 and src2 should have the same size. My “img_gpu” and “mask_gpu” have similar 2D size; however, “img_gpu” has three channels, but “mask_gpu” is single channel. I guess the cv2.cuda.bitwise_…() functions are different from the normal cv2.bitwise_…() functions, which can take a single-channel mask. If my guess is true, I should check which function is faster, cv2.cuda.bitwise_and() or cv2.cuda.multiply().

          2. Yes, both cv2.cuda.multiply() and cv2.cuda.bitwise_and() require the second input src2 to be the same size as the first. In the notebook I shared I set

            mask = np.ones(frame.shape,dtype='uint8')*255

            to have the same number of channels as the frame. Therefore if you use either function you will need to increase the number of channels.

  14. Hello Sir, I followed your build for OpenCV with python bindings. The build was successful for the most part because I have the opencv_world410.dll and the size is around 1.16GB. When the build is complete it throws out the following output:
    ========== Build: 493 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

    Once I traversed through the folders I realized that I didn’t have the cv2.cp37-win_amd64.pyd file in the folder mentioned: “%openCvBuild%\lib\python3\cv2.cp37-win_amd64.pyd”

    It is because of this that I get a ModuleNotFoundError in the Anaconda prompt when verifying that OpenCV has been installed.

    I am a novice to this. Kindly request you to help me in this.

    1. Firstly, I would see which project is failing to build.

      Then I would check whether the python bindings were supposed to be built by confirming that python3 is listed under modules to be built, as described in step (4) under the instructions for Including Python Bindings.

      If not, I would further inspect the output under Python3 in the CMake GUI (step (3)) or the output from opencv_version_win32.exe (near the bottom) to see if all the libraries and includes have been picked up and if not ensure they have been correctly specified in the inputs to CMake (step (1-2)).

      1. Sorry Sir, I found the step I was doing wrong. It was specified to build Release, not Debug. Since I built Debug, the .pyd file was not generated. Now the issue is resolved. Thank you for posting the tutorial.
