# Accelerating OpenCV 4 – build with CUDA 10.0, Intel MKL + TBB and python bindings in Windows

##### OpenCV 4.1.0 which is compatible with CUDA 10.1 was released on 08/04/2019, see Accelerating OpenCV 4 – build with CUDA, Intel MKL + TBB and python bindings, for the updated guide.

Because the pre-built Windows libraries available for OpenCV 4.0.0 do not include the CUDA modules, or support for Intel’s Math Kernel Libraries (MKL) or Intel Threaded Building Blocks (TBB) performance libraries, I have included the build instructions below for anyone who is interested. If you just need the Windows libraries then go to Download OpenCV 4.0.0 with CUDA 10.0. To get an indication of the performance boost from calling the OpenCV CUDA functions with these libraries see the OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).

The guide below details instructions on compiling the 64 bit version of OpenCV 4.0.0 shared libraries with Visual Studio 2017 and CUDA 10.0, and optionally: the Intel Math Kernel Libraries (MKL); Intel Threaded Building Blocks (TBB); and Python bindings for accessing the OpenCV CUDA modules from within Python.

The main topics covered are given below. Although most of the sections can be read in isolation I recommend reading the pre-build checklist first to check whether you will benefit from and/or need to compile OpenCV with CUDA support.

### Pre-build Checklist

Before continuing there are a few things to be aware of:

1. You can download all the pre-built binaries described in this guide from the downloads page. Unless you need an alternative configuration or just want to build OpenCV from scratch they are probably all you need.
2. The CUDA modules can now be called directly from Python. To include this support see the including Python bindings section.
3. The procedure outlined has been tested on Visual Studio Community 2017 (15.9.4).
4. The OpenCV DNN modules are not CUDA accelerated. I have seen other guides which include instructions to download cuDNN. This is completely unnecessary and will have no effect on performance.
5. If you have built OpenCV with CUDA support then to use those libraries and/or redistribute applications built with them on any machines without the CUDA toolkit installed, you will need to ensure those machines have,
• a CUDA-capable NVIDIA GPU with a driver version of 411.31 or later, and
• the CUDA DLLs (cublas64_100.dll, nppc64_100.dll etc.) placed somewhere on the system or user path, or in the same directory as the executable. These can be located in the following directory.
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin
6. The latest version of Intel TBB uses a shared library, therefore if you build with Intel TBB you need to add
C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\redist\intel64_win\tbb\vc_mt

to your path variable, and make sure you redistribute that dll with any of your applications.

7. Depending on the hardware the build time can be over 3 hours. If this is an issue you can speed this up by generating the build files with ninja and/or targeting a specific CUDA compute capability.
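Items 5 and 6 of the checklist both come down to the required DLLs being discoverable at run time. The following is a minimal Python sketch for checking a PATH-style string before redistributing an application. The helper name `find_dlls` is hypothetical (it is not part of OpenCV or CUDA), the DLL names are just the examples mentioned above, and the check ignores the executable's own directory, which Windows also searches.

```python
import os
from pathlib import Path


def find_dlls(dll_names, search_path=None):
    """Map each DLL name to the first directory on the search path containing it.

    search_path is a PATH-style string; it defaults to the current PATH.
    A value of None in the result means the DLL was not found.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    directories = [Path(p) for p in search_path.split(os.pathsep) if p]
    found = {}
    for name in dll_names:
        found[name] = next(
            (str(d) for d in directories if (d / name).is_file()), None
        )
    return found


if __name__ == "__main__":
    # Example names from the checklist; extend the list to match your build.
    wanted = ["cublas64_100.dll", "nppc64_100.dll", "tbb.dll"]
    for dll, location in find_dlls(wanted).items():
        print(f"{dll}: {location or 'NOT FOUND on PATH'}")
```

Running this on the target machine gives a quick indication of which directories still need to be added to the path.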

### Prerequisites

There are a couple of components you need to download and/or install before you can get started, you first need to:

• Install Visual Studio 2017, selecting the “Desktop development with C++” workload shown in the image below. If you already have an installation ensure that the correct workload is installed and that you have updated to the latest version.
• Download the source files for both OpenCV and OpenCV contrib, available on GitHub. Either clone the git repos OpenCV and OpenCV Contrib, making sure to check out the 4.0.0 tag, or download the archives OpenCV 4.0.0 and OpenCV Contrib 4.0.0 containing all the source files.
Note: I have seen lots of guides including instructions to download and use git to get the source files, however this is a completely unnecessary step. If you are a developer and you don’t already have git installed then, I would assume there is a good reason for this and I would not advise installing just to build OpenCV.
• Install CMake – Version 3.13.2 is used in the guide.
• Install The CUDA 10.0 Toolkit
• Optional – Install both the Intel MKL and TBB by registering for community licensing, and downloading for free. MKL version 2019.1.144 and TBB version 2019.2.144 are used in this guide, I cannot guarantee that other versions will work correctly.
• Optional – Install the x64 bit version of Anaconda to call OpenCV CUDA routines from Python, making sure to tick “Register Anaconda as my default Python ..” This guide has been tested against Anaconda 3.7 installed in the default location for a single user.
• Optional – Download the Ninja build system to reduce build times – Version 1.9.0 is used in this guide.

### Generating OpenCV build files with CMake

Before you can build OpenCV you have to generate the build files with CMake. There are two ways to do this, from the command prompt or with the CMake GUI, however by far the quickest and easiest way to proceed is to use the command prompt to generate the base configuration. Then if you want to add any additional configuration options, you can open up the build directory in the CMake GUI as described here.

In addition there are several ways to build OpenCV using Visual Studio. For simplicity only two methods are discussed here:

Finally instructions are included for building and using the Python bindings to access the OpenCV CUDA modules.

#### Building OpenCV 4.0.0 with CUDA 10.0 and Intel MKL + TBB, with Visual Studio solution files from the command prompt (cmd)

The next five steps will build the opencv_world400.dll shared library using NVIDIA’s recommended settings for future hardware compatibility. This does however have two drawbacks, first the build can take several hours to complete and second, the shared library will be at least 929MB depending on the configuration that you choose below. To find out how to reduce both the compilation time and size of opencv_world400.dll read choosing the compute-capability first and then continue as below. If you wish to build the Python bindings and/or use the Ninja build system then see section including python bindings and/or decreasing the build time with Ninja respectively before proceeding.

1. Open up the command prompt (windows key + r, then type cmd and press enter)
2. Ignore this step if you are not building with Intel MKL + TBB. Enter the below
"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\tbb\bin\tbbvars.bat" intel64

to temporarily set the environmental variables for locating your TBB installation.

3. Set the location of the source files and build directory, by entering the text shown below, first setting PATH_TO_OPENCV_SOURCE to the root of the OpenCV files you downloaded or cloned (the directory containing 3rdparty,apps,build,etc.) and PATH_TO_OPENCV_CONTRIB to the modules directory inside the contrib repo (the directory containing cudaarithm, cudabgsegm, etc).
set "openCvSource=PATH_TO_OPENCV_SOURCE"
set "openCVExtraModules=PATH_TO_OPENCV_CONTRIB"
set "openCvBuild=%openCvSource%\build"
set "buildType=Release"
set "generator=Visual Studio 15 2017 Win64"
4. Then choose your configuration from below and copy to the command prompt:
• OpenCV 4.0.0 with CUDA 10.0
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DBUILD_opencv_world=ON -DBUILD_opencv_gapi=OFF -DWITH_NVCUVID=OFF -DWITH_CUDA=ON -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DINSTALL_TESTS=ON -DINSTALL_C_EXAMPLES=ON -DBUILD_EXAMPLES=ON -DWITH_OPENGL=ON -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" -DOPENCV_ENABLE_NONFREE=ON -DCUDA_ARCH_PTX=7.5
• OpenCV 4.0.0 with CUDA 10.0 and MKL multi-threaded with TBB
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DBUILD_opencv_world=ON -DBUILD_opencv_gapi=OFF -DWITH_NVCUVID=OFF -DWITH_CUDA=ON -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DWITH_MKL=ON -DMKL_USE_MULTITHREAD=ON -DMKL_WITH_TBB=ON -DINSTALL_TESTS=ON -DINSTALL_C_EXAMPLES=ON -DBUILD_EXAMPLES=ON -DWITH_OPENGL=ON -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" -DOPENCV_ENABLE_NONFREE=ON -DCUDA_ARCH_PTX=7.5
• OpenCV 4.0.0 with CUDA 10.0, MKL multi-threaded with TBB, and OpenCV itself built with TBB
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DBUILD_opencv_world=ON -DBUILD_opencv_gapi=OFF -DWITH_NVCUVID=OFF -DWITH_CUDA=ON -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DWITH_MKL=ON -DMKL_USE_MULTITHREAD=ON -DMKL_WITH_TBB=ON -DWITH_TBB=ON -DINSTALL_TESTS=ON -DINSTALL_C_EXAMPLES=ON -DBUILD_EXAMPLES=ON -DWITH_OPENGL=ON -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" -DOPENCV_ENABLE_NONFREE=ON -DCUDA_ARCH_PTX=7.5
5. If you want to make any configuration changes before building, then you can do so now through the CMake GUI.
6. The OpenCV.sln solution file should now be in your PATH_TO_OPENCV_SOURCE/build directory. To build OpenCV you have two options depending on your preference. You can:
• Build directly from the command line by simply entering the following (swapping Debug for Release to build a release version)
"C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target INSTALL --config Debug
• Build through Visual Studio GUI by opening up the OpenCV.sln in Visual Studio, selecting your Configuration, clicking on Solution Explorer, expanding CMakeTargets, right clicking on INSTALL and clicking Build.

Either approach will both build the library and copy the necessary redistributable parts to the install directory, PATH_TO_OPENCV_SOURCE/build/install in this example. All that is required now to run any programs compiled against these libs is to add the directory containing opencv_world400.dll (and tbb.dll if you have built with Intel TBB) to your path environmental variable.
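The three configurations in step 4 differ only in the MKL and TBB flags, so the configure step can also be scripted. Below is a minimal Python sketch which assembles the same argument list; the function names `cmake_configure_args` and `run_configure` are hypothetical, and every flag is copied from the commands above rather than being an exhaustive list of OpenCV options.

```python
import subprocess


def cmake_configure_args(source_dir, contrib_modules_dir, build_type="Release",
                         generator="Visual Studio 15 2017 Win64",
                         with_mkl=False, with_tbb=False, extra_flags=()):
    """Return the argument list for the CMake configure step (step 4 above)."""
    args = [
        f"-B{source_dir}/build/",
        f"-H{source_dir}/",
        f"-G{generator}",
        f"-DCMAKE_BUILD_TYPE={build_type}",
        "-DBUILD_opencv_world=ON",
        "-DBUILD_opencv_gapi=OFF",
        "-DWITH_NVCUVID=OFF",
        "-DWITH_CUDA=ON",
        "-DCUDA_FAST_MATH=ON",
        "-DWITH_CUBLAS=ON",
    ]
    if with_mkl:
        # MKL multi-threaded with TBB, as in the second configuration.
        args += ["-DWITH_MKL=ON", "-DMKL_USE_MULTITHREAD=ON", "-DMKL_WITH_TBB=ON"]
    if with_tbb:
        # Additionally build OpenCV itself with TBB (third configuration).
        args.append("-DWITH_TBB=ON")
    args += [
        "-DINSTALL_TESTS=ON",
        "-DINSTALL_C_EXAMPLES=ON",
        "-DBUILD_EXAMPLES=ON",
        "-DWITH_OPENGL=ON",
        f"-DOPENCV_EXTRA_MODULES_PATH={contrib_modules_dir}/",
        "-DOPENCV_ENABLE_NONFREE=ON",
        "-DCUDA_ARCH_PTX=7.5",
    ]
    args += list(extra_flags)
    return args


def run_configure(cmake_exe, args):
    """Invoke CMake; passing a list avoids the quoting needed at the cmd prompt."""
    subprocess.run([cmake_exe, *args], check=True)
```

For example, `run_configure(r"C:\Program Files\CMake\bin\cmake.exe", cmake_configure_args(src, contrib, with_mkl=True, with_tbb=True))` would reproduce the third configuration, where `src` and `contrib` stand for PATH_TO_OPENCV_SOURCE and PATH_TO_OPENCV_CONTRIB. Remember that tbbvars.bat still has to be run in the same environment first if you build with Intel TBB.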

If everything was successful, congratulations, you now have OpenCV 4.0.0 built with CUDA 10.0. To quickly verify that the CUDA modules are working and check if there is any performance benefit on your specific hardware see Verifying OpenCV is CUDA accelerated below.

#### Decreasing the build time with Ninja

The build time for OpenCV can be cut in half by utilizing the ninja build system instead of directly generating Visual Studio solution files. The only difference you may notice is that Ninja will only produce one configuration at a time, either a Debug or Release, therefore the buildType must be set before calling CMake. In the section above the configuration was set to Release, to change it to Debug simply replace Release with Debug as shown below

set "buildType=Debug"

Using ninja only requires two extra configuration steps:

1. Setting both the path to the ninja executable and configuring Visual Studio Development tools. Both are achieved by entering the following into the command or Anaconda3 prompt before entering the CMake command, making sure to first set PATH_TO_NINJA to the directory containing ninja.exe, and changing Community to either Professional or Enterprise if necessary
"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvars64.bat"
set "ninjaPath=PATH_TO_NINJA"
set path=%ninjaPath%;%path%
2. Changing the generator from “Visual Studio 15 2017 Win64” to ninja
set "generator=Ninja"

For example entering the following into the Anaconda3 prompt will generate ninja build files to build OpenCV with CUDA 10.0 and Python bindings

"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvars64.bat"
set "ninjaPath=PATH_TO_NINJA"
set path=%ninjaPath%;%path%
set "openCvSource=PATH_TO_OPENCV_SOURCE"
set "openCVExtraModules=PATH_TO_OPENCV_CONTRIB"
set "openCvBuild=%openCvSource%\build"
set "buildType=Release"
set "generator=Ninja"
"C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"%generator%" -DCMAKE_BUILD_TYPE=%buildType% -DBUILD_opencv_world=ON -DBUILD_opencv_gapi=OFF -DWITH_NVCUVID=OFF -DWITH_CUDA=ON -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DINSTALL_TESTS=ON -DINSTALL_C_EXAMPLES=ON -DBUILD_EXAMPLES=ON -DWITH_OPENGL=ON -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" -DOPENCV_ENABLE_NONFREE=ON -DCUDA_ARCH_PTX=7.5 -DBUILD_opencv_python3=ON

The build can then be started in the same way as before, dropping the --config option:

"C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target install

Once you have generated the base Visual Studio solution file from the command prompt the easiest way to make any additional configuration changes is through the CMake GUI. To do this:

1. Fire up the CMake GUI.
2. Making sure that the Grouped checkbox is ticked, click on the browse build button and navigate to your PATH_TO_OPENCV_SOURCE/build directory. If you have selected the correct directory the main CMake window should resemble the below.
3. Now any additional configuration changes can be made by just expanding any of the grouped items and ticking or unticking the values displayed. Once you are happy just press Configure; if the bottom window displays configuration successful, press Generate. Now you can open up the Visual Studio solution file and proceed as before.
4. Troubleshooting:
• Make sure you have the latest version of Visual Studio 2017 (>= 15.8)
• Not all options are compatible with each other and the configuration step may fail as a result. If so examine the error messages given in the bottom window and look for a solution.
• If the build is failing after making changes to the base configuration, I would advise you to remove the build directory and start again, making sure that you can at least build the base Visual Studio solution files produced from the command line.

#### Including Python bindings

Building and installing python support is incredibly simple:

1. Open up the Anaconda3 command prompt.
2. Follow the instructions from above to build your desired configuration, issuing all the commands to the Anaconda prompt instead of the default windows command prompt and appending the below to the CMake configuration before generating the build files.
-DBUILD_opencv_python3=ON
3. Make sure you build release, python bindings will not be generated for a debug configuration.
4. Once generated copy the bindings to your copy of python. The following assumes you have python 3.7 installed through Anaconda in the default location for a single user.
copy "%openCvBuild%\install\python\cv2\python-3.7\cv2.cp37-win_amd64.pyd" "%USERPROFILE%\Anaconda3\Lib\site-packages\cv2.cp37-win_amd64.pyd"
5. Include the path to the opencv_world400.dll in your system path.
set path=%openCvBuild%\install\x64\vc15\bin;%path%
6. Test the freshly compiled python module can be located and loads correctly by entering
python -c "import cv2; print(f'OpenCV: {cv2.__version__} for python installed and working')"

and checking the output for

OpenCV: 4.0.0 for python installed and working

If there were no errors from the above steps the Python bindings should be installed correctly. To use on a permanent basis don’t forget to permanently add the path to opencv_world400.dll to your user or system path. To quickly verify that the CUDA modules can be called and check if there is any performance benefit on your system continue below.
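Beyond checking the version string, it can be useful to confirm that the module you imported was actually built with CUDA. The following is a minimal sketch: `cv2.getBuildInformation()` and `cv2.cuda.getCudaEnabledDeviceCount()` are existing OpenCV calls, but the helper names are hypothetical and the exact layout of the "NVIDIA CUDA:" line in the build summary is an assumption based on typical OpenCV 4 output.

```python
import re


def cuda_reported_in_build_info(build_info):
    """Return True if the build summary text reports CUDA as enabled.

    Assumes a summary line of the form 'NVIDIA CUDA:   YES (ver ...)'.
    """
    match = re.search(r"^\s*NVIDIA CUDA:\s*(\S+)", build_info, flags=re.MULTILINE)
    return bool(match) and match.group(1).upper() == "YES"


def check_cv2_cuda():
    """Report (built_with_cuda, device_count) for the installed cv2 module."""
    # Imported lazily so the parsing helper can be used without OpenCV installed.
    import cv2
    built_with_cuda = cuda_reported_in_build_info(cv2.getBuildInformation())
    device_count = cv2.cuda.getCudaEnabledDeviceCount()
    return built_with_cuda, device_count
```

Calling `check_cv2_cuda()` from the Anaconda3 prompt should return `True` and a device count of at least 1 if the bindings and the driver are set up as described; a count of 0 would suggest either a non-CUDA build or a driver problem.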

### Verifying OpenCV is CUDA accelerated

The easiest way to quickly verify that everything is working is to check that one of the inbuilt CUDA performance tests passes. For this I have chosen the GEMM test which;

• runs without any external data;
• should be highly optimized on both the GPU and CPU making it “informative” to compare the performance timings later on, and;
• has OpenCL versions.

To run the CUDA performance test simply enter the following into the existing command prompt

"%openCvBuild%\install\x64\vc15\bin\opencv_perf_cudaarithm.exe" --gtest_filter=Sz_Type_Flags_GEMM.GEMM/29

the full output is shown below. To verify that everything is working look for the green [ PASSED ] text in the image below.

The above test performed matrix multiplication on a 1024x1024x2 single precision matrix using a midrange GTX 1060 GPU 100 times, with a mean execution time of 4.01 ms, which can be seen in the following output taken from the image above.

[ PERFSTAT ]    (samples=100   mean=4.01   median=4.03   min=3.47   stddev=0.24 (6.0%))

If the test has passed then we can confirm that the above code was successfully run on the GPU using CUDA. Next it would be interesting to compare these results to the same test run on a CPU to check we are getting a performance boost, on the specific hardware set up we have.

#### CPU (i5-6500) Performance

The standard opencv core GEMM performance test does not use 1024×1024 matrices, therefore for this comparison we can simply change the GEMM tests inside opencv_perf_core.exe to process this size instead of 640×640. This is achieved by simply changing the following line to be

::testing::Values(Size(1024, 1024), Size(1280, 1280)),

Denoting the modified executable as opencv_perf_core_1024.exe, the corresponding CPU test can be run as

set OPENCV_OPENCL_DEVICE=disabled
"%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3

resulting in the following output on a midrange i5-6500.

[ PERFSTAT ]    (samples=10   mean=1990.56   median=1990.67   min=1962.95   stddev=16.56 (0.8%))

The execution time is roughly 500 times greater than on the GPU (almost three orders of magnitude), so what is wrong with our CPU? As it turns out nothing is wrong; to get a baseline result, I purposely ran this without building OpenCV against any optimized BLAS. To demonstrate the performance benefit of building OpenCV with Intel’s MKL (which includes optimized BLAS) and TBB I have run the same test again with two different levels of optimization, OpenCV built against:

1. Intel MKL without multi-threading
[ PERFSTAT ]    (samples=10   mean=90.77   median=90.15   min=89.64   stddev=1.98 (2.2%))
2. Intel MKL multi-threaded with TBB
[ PERFSTAT ]    (samples=100   mean=28.86   median=28.37   min=27.34   stddev=1.33 (4.6%))

This demonstrates the importance of using multi-threaded MKL and brings the gap between CPU and GPU performance down significantly. Now we are ready to compare with OpenCL.

#### OpenCL Performance

In OpenCV 4.0 the CUDA modules were moved from the main to the contrib repository, presumably because OpenCL will be used for GPU acceleration going forward. To examine the implications of this I ran the same performance tests as above again, only this time on each of my three OpenCL devices. The results for each device are given below including the command to run each test.

• Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:CPU:Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
"%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
[ PERFSTAT ]    (samples=13   mean=205.35   median=205.63   min=200.35   stddev=2.82 (1.4%))
• Intel(R) HD Graphics 530
set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:GPU:Intel(R) HD Graphics 530
"%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
[ PERFSTAT ]    (samples=13   mean=130.88   median=129.82   min=127.46   stddev=2.72 (2.1%))
• GeForce GTX 1060 3GB
set OPENCV_OPENCL_DEVICE=NVIDIA CUDA:GPU:GeForce GTX 1060 3GB
"%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
[ PERFSTAT ]    (samples=13   mean=8.83   median=8.85   min=8.53   stddev=0.17 (1.9%))

The performance results for all the tests are shown together below.

The results in the figure show that for this specific test and hardware configuration (GTX 1060 vs i5-6500):

1. If we ignore OpenCL the CUDA implementation on the GTX 1060 is comfortably faster than the MKL + TBB implementation executed on the CPU.
2. The OpenCL implementation on the GTX 1060 is significantly slower than the CUDA version. This is expected but unfortunate considering the OpenCV CUDA routines have been moved from the main repository and may eventually be deprecated.
3. OpenCL still has a long way to go. In addition to its poor performance when compared with CUDA on the same device, the implementations on the CPU (i5-6500) and the iGPU (HD Graphics 530) were roughly 7 and 4.5 times slower, respectively, than the optimized MKL + TBB implementation on the CPU.

The above comparison is just for fun, to give an example of how to quickly check if using OpenCV with CUDA on your specific hardware combination is worthwhile. For a more in-depth comparison on several hardware configurations see OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).
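To repeat this comparison on your own hardware, the mean timings can be pulled straight out of the `[ PERFSTAT ]` lines. The sketch below is a minimal Python example: the helper name `perfstat_mean_ms` is hypothetical, the regular expression assumes the line layout shown in this post, and the sample lines are the ones reported above.

```python
import re

# Matches the mean value in a line such as
# "[ PERFSTAT ]    (samples=100   mean=4.01   median=4.03 ...)"
PERFSTAT_MEAN = re.compile(r"\[\s*PERFSTAT\s*\].*?mean=([0-9.]+)")


def perfstat_mean_ms(line):
    """Extract the mean execution time (ms) from a PERFSTAT output line."""
    match = PERFSTAT_MEAN.search(line)
    if match is None:
        raise ValueError(f"not a PERFSTAT line: {line!r}")
    return float(match.group(1))


# Timings reported in this post (GTX 1060 and i5-6500).
results = {
    "CUDA (GTX 1060)": "[ PERFSTAT ] (samples=100 mean=4.01 median=4.03 min=3.47)",
    "OpenCL (GTX 1060)": "[ PERFSTAT ] (samples=13 mean=8.83 median=8.85 min=8.53)",
    "CPU, MKL + TBB": "[ PERFSTAT ] (samples=100 mean=28.86 median=28.37 min=27.34)",
    "CPU, no optimized BLAS": "[ PERFSTAT ] (samples=10 mean=1990.56 median=1990.67 min=1962.95)",
}

means = {name: perfstat_mean_ms(line) for name, line in results.items()}
baseline = means["CUDA (GTX 1060)"]
# Slowdown of each configuration relative to the CUDA result.
slowdown = {name: mean / baseline for name, mean in means.items()}

if __name__ == "__main__":
    for name, factor in slowdown.items():
        print(f"{name}: {means[name]:.2f} ms ({factor:.1f}x the CUDA time)")
```

Replacing the sample lines with the output from your own runs gives the equivalent ratios for your GPU and CPU.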

#### Python CUDA performance

To quickly verify that the CUDA modules are being called from Python you can run the same GEMM test as before, this time from an Interactive Python session. Assuming that all of the steps in Including Python bindings completed successfully, open up the Anaconda3 prompt and issue the following to start the Python session and ensure that the path to OpenCV is set correctly.

set path=%openCvBuild%\install\x64\vc15\bin;%path%
ipython

Then run the GEMM test on the GPU with CUDA from within Python

import numpy as np
import cv2 as cv
npTmp = np.random.random((1024, 1024)).astype(np.float32)
npMat1 = np.stack([npTmp,npTmp],axis=2)
npMat2 = npMat1
cuMat1 = cv.cuda_GpuMat()
cuMat2 = cv.cuda_GpuMat()
cuMat1.upload(npMat1)
cuMat2.upload(npMat2)
%timeit cv.cuda.gemm(cuMat1, cuMat2,1,None,0,None,1)

You should see output similar to

4.47 ms ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

which is very close to the result (4.01 ms on the GTX 1060) obtained when the same test was called directly from C++. If you receive similar output then this confirms that you are running OpenCV from python on the GPU with CUDA.

For completeness you can run the same test on the CPU as

%timeit cv.gemm(npMat1,npMat2,1,None,0,None,1)

and confirm that the new result

27.9 ms ± 664 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

is comparable with the previous one (28.86 ms on the i5-6500 from C++).

You can also perform a quick sanity check to confirm that you are seeing good performance for the GEMM operation in OpenCV. An easy way to do this is to run the same operation again only this time in NumPy.

npMat3 = npTmp + npTmp*1j
npMat4 = npMat3
%timeit npMat3 @ npMat4

As you can see the data is structured in a slightly different way, however the timings

32 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

should hopefully be comparable to the OpenCV result (27.9 ms on the i5-6500 calling OpenCV from Python).

From the results of these quick tests it can be inferred that:

1. The OpenCV CUDA modules are being called from python.
2. The overhead from using the CPU and/or CUDA python interface instead of directly calling from C++ is small.
3. The GEMM operation in OpenCV is highly optimized if built against Intel MKL.
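The `%timeit` magic used above is only available inside IPython. If you want to run the same checks from a plain Python script, a rough stand-in can be written with the standard library. This is a minimal sketch, not a replacement for `%timeit`; the helper name `time_call_ms` is hypothetical and the default repeat and loop counts are arbitrary choices.

```python
import statistics
import time


def time_call_ms(fn, repeats=7, loops=10):
    """Return (mean, std. dev.) of the per-call time of fn in milliseconds.

    fn is called `loops` times per run and the run is repeated `repeats`
    times (at least 2, so that a standard deviation can be computed).
    """
    per_call_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        elapsed = time.perf_counter() - start
        per_call_ms.append(elapsed * 1000.0 / loops)
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)


# Usage with the matrices defined above (requires the CUDA-enabled cv2 build):
#   mean_ms, std_ms = time_call_ms(
#       lambda: cv.cuda.gemm(cuMat1, cuMat2, 1, None, 0, None, 1))
#   print(f"{mean_ms:.2f} ms +/- {std_ms:.3f} ms per call")
```

The first call to a CUDA routine usually includes one-off initialization, so it may be worth calling the function once before timing it.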

### Choosing the compute-capability

The default command line options given above implement NVIDIA’s CUDA 10.0 recommended settings for future hardware compatibility. This means that any programs linked against the resulting opencv_world400.dll should work on all GPU’s currently supported by CUDA 10.0 and all GPU’s released in the future. As mentioned above this comes at a cost, both in terms of compilation time and shared library size. Before discussing the CMake settings which can be used to reduce these costs we need to understand the following concepts:

• Compute-capability – every GPU has a fixed compute-capability which determines its general specifications and features. In general the more recent the GPU the higher the compute-capability and the more features it will support. This is important because:
• Each version of CUDA supports different compute-capabilities. Usually a new version of CUDA comes out to support a new GPU architecture; in the case of CUDA 10.0, support was added for the Turing (compute 7.5) architecture. On the flip side, support for older architectures can be removed; for example CUDA 9.0 removed support for the Fermi (compute 2.0) architecture. Therefore by choosing to build OpenCV with CUDA 10.0 we have limited ourselves to GPU’s of compute-capability >=3.0. Notice we have not limited ourselves to GPU’s of compute-capability <=7.5; the reason for this is discussed in the next section.
• You can build opencv_world400.dll to support one or many different compute-capabilities, depending on your specific requirements.
• Supporting a compute-capability – to support a specific compute-capability you can do either of the following, or a combination of the two:
• Generate architecture-specific cubin files, which are only forward-compatible with GPU architectures with the same major version number. This can be controlled by passing CUDA_ARCH_BIN to CMake. For example passing -DCUDA_ARCH_BIN=3.0 to CMake will result in opencv_world400.dll containing binary code which can only run on compute-capability 3.0, 3.5 and 3.7 devices. Furthermore it will not support any specific features of compute-capability 3.5 (e.g. dynamic parallelism) or 3.7 (e.g. 128 K 32 bit registers). In the case of OpenCV 4.0.0 this would not restrict any functionality because it only uses features from compute-capability 3.0 and below. This can be confirmed by a quick search of the contrib repository for the __CUDA_ARCH__ flag.
• Generate forward-compatible PTX assembly for a virtual architecture, which is forward-compatible with all GPU architectures of greater than or equal compute-capability. This can be controlled by passing CUDA_ARCH_PTX to CMake. For example by passing -DCUDA_ARCH_PTX=7.5 to CMake, opencv_world400.dll will contain PTX code for compute-capability 7.5 which can be Just In Time (JIT) compiled to architecture-specific binary code by the CUDA driver, on any future GPU architectures. Because of the default CMake rules, when CUDA_ARCH_BIN is not explicitly set it will also contain architecture-specific cubin files for GPU architectures 3.0-7.5.
• PTX considerations – given that PTX code is forward-compatible and cubin binaries are not it would be tempting to only include the former. To understand why this might not be such a great idea, here are a few things to be aware of when generating PTX code:
1. As mentioned previously the CUDA driver JIT compiles PTX code at run time and caches the resulting cubin files so that the compile operation should in theory be a one-time delay, at least until the driver is updated. However if the cache is not large enough JIT compilation will happen every time, causing a delay every time your program executes. To get an idea of this delay I passed -DCUDA_ARCH_BIN=3.0 and -DCUDA_ARCH_PTX=3.0 to CMake before building OpenCV. I then emptied the cache (default location %appdata%\NVIDIA\ComputeCache\) and ran the GEMM performance example on a GTX 1060 (compute-capability 6.1), to force JIT compilation. I measured an initial delay of over 3 minutes as the PTX code was JIT compiled before the program started to execute. Following that, the delay of subsequent executions was around a minute, because the default cache size (256 MB) was not large enough to store all the compiled PTX code. Given my compile options the only solution to remove this delay is to increase the size of the cache by setting the CUDA_CACHE_MAXSIZE environmental variable to a number of bytes greater than required. Unfortunately because, “Older binary codes are evicted from the cache to make room for newer binary codes if needed”, this is more of a band-aid than a solution. This is because the maximum cache size is 4 GB, therefore your PTX compiled code can be evicted at any point in time if other programs on your machine are also JIT compiling from PTX, bringing back the “one-time” only delay.
2. For maximum device coverage you should include PTX for the lowest possible GPU architecture you want to support.
3. For maximum performance NVIDIA recommends including PTX for the highest possible architecture you can.
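The compatibility rules described above (cubin files run only on devices with the same major version and an equal or higher minor version, PTX runs on any device of equal or higher compute-capability via JIT) can be written down as a small decision function. This is a minimal Python sketch of those rules as stated in this post; the names `parse_cc` and `how_device_runs` are hypothetical, and it takes the lists passed to CUDA_ARCH_BIN and CUDA_ARCH_PTX explicitly rather than modelling the CMake defaults.

```python
def parse_cc(text):
    """Turn a compute-capability string such as '6.1' into a (major, minor) tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def how_device_runs(device_cc, arch_bin=(), arch_ptx=()):
    """Classify how a device would execute a library built with the given settings.

    arch_bin / arch_ptx are the values passed to CUDA_ARCH_BIN / CUDA_ARCH_PTX.
    Returns 'native cubin', 'JIT from PTX' or 'unsupported'.
    """
    device = parse_cc(device_cc)
    # cubin: same major version, and built for an equal or lower minor version.
    for cc in arch_bin:
        major, minor = parse_cc(cc)
        if major == device[0] and minor <= device[1]:
            return "native cubin"
    # PTX: any virtual architecture less than or equal to the device.
    for cc in arch_ptx:
        if parse_cc(cc) <= device:
            return "JIT from PTX"
    return "unsupported"


if __name__ == "__main__":
    # The experiment from item 1: BIN=3.0, PTX=3.0 on a GTX 1060 (6.1).
    print(how_device_runs("6.1", arch_bin=["3.0"], arch_ptx=["3.0"]))
    # A compute 3.5 device with the same build runs the 3.0 cubin directly.
    print(how_device_runs("3.5", arch_bin=["3.0"], arch_ptx=["3.0"]))
```

Any combination that returns 'JIT from PTX' for a device you care about is one where the cache behaviour described in item 1 becomes relevant.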

#### CMake command line options to control cubin/PTX content of the OpenCV shared library

Given (1)-(3) above, the command line options that you want to pass to CMake when building OpenCV will depend on your specific requirements. I have given some examples below for various scenarios given a main GPU of compute-capability 6.1:

• Firstly stick with the defaults if compile time and shared library size are not an issue. This offers the greatest amount of flexibility from a development standpoint, avoiding the possibility of needing to recompile OpenCV when you switch GPU.
• If your programs will always be run on your main GPU, just pass -DCUDA_ARCH_BIN=6.1 to CMake to target your architecture only. It should take around an hour to build, depending on your CPU and the resulting shared library should not be larger than 200 MB.
• If you are going to deploy your application, but only to newer GPU’s, pass -DCUDA_ARCH_BIN=6.1,7.0,7.5 and -DCUDA_ARCH_PTX=7.5 to CMake for maximum performance and future compatibility. This is advisable because you may not have any control over the size of the JIT cache on the target machine, therefore including cubin’s for all compute-capabilities you want to support is the only way to be sure to prevent a JIT compilation delay on every invocation of your application.
• If size is really an issue but you don’t know which GPU’s you want to run your application on then to ensure that your program will run on all current and future supported GPU’s pass -DCUDA_ARCH_BIN=6.1 and -DCUDA_ARCH_PTX=3.0 to CMake for maximum coverage.

#### 32 thoughts on “Accelerating OpenCV 4 – build with CUDA 10.0, Intel MKL + TBB and python bindings in Windows”

1. sidd23 says:

Thanks a lot for the build steps! They are really helpful.
Just a suggestion (correction): In the Visual Studio solution files generation from the command line section, your article mentions the following command, set “openCvBuild=%openCvSource\build”, however, that didn’t work for me and instead the following command did, set “openCvBuild=%openCvSource%\build”.

1. ParallelVision says:

Thank you for reporting the typo, I’m glad that the guide was useful to you.

2. bubu says:

Thanks for your kind sharing. In this paragraph “If you are going to deploy your application, but only to newer GPU’s pass -DCUDA_ARCH_PTX=6.1,7.0,7.5 and -DCUDA_ARCH_PTX=7.5 to CMake for maximum performance and future compatibility.” —— did you mean “.. DCUDA_ARCH_BIN=6.1,7.0,7.5 and ..” ?

1. ParallelVision says:

Thanks for the feedback, you are of course correct, i’ve updated the post.

3. Lance says:

Do the pre-built binaries come with the CUDA bindings for python?

4. Wassouf says:

Hello I am using the prebuilt libraries for opencv4.0.0 with cuda 10.1
I got errors while compiling my project related to cvstd_wrapper.hpp
any solution?

1. ParallelVision says:

Which pre-built binaries?
Can you share the code you are trying to compile and the error messages?

5. Wassouf says:

Here is the code, it is template matching with cuda ,:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include
#include
#include "opencv2/opencv.hpp"
#include "opencv2/core/core.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include
#include
#include

using namespace std;
using namespace cv;

/// Global Variables
Mat img; Mat templ; Mat result;
char* image_window = "Source Image";
char* result_window = "Result window";

int match_method;
int max_Trackbar = 5;

void MatchingMethod(int, void*);

cuda::GpuMat templateMatching(cuda::GpuMat d_src, cuda::GpuMat d_templ) {

cuda::GpuMat d_dst;

auto t_start = std::chrono::high_resolution_clock::now();
cv::Ptr<cv::cuda::TemplateMatching> alg;
alg = cv::cuda::createTemplateMatching(d_templ.type(), cv::TM_SQDIFF);
alg->match(d_src, d_templ, d_dst);
cv::cuda::normalize(d_dst, d_dst, 0, 1, cv::NORM_MINMAX, -1);
/// Localizing the best match with minMaxLoc
double minVal; double maxVal; Point minLoc; Point maxLoc;
Point matchLoc;

cuda::minMaxLoc(d_dst, &minVal, &maxVal, &minLoc, &maxLoc);
/// For SQDIFF and SQDIFF_NORMED, the best matches are lower values. For all the other methods, the higher the better
if (match_method == cv::TM_SQDIFF || match_method == cv::TM_SQDIFF_NORMED)
{
matchLoc = minLoc;
}
else
{
matchLoc = maxLoc;
}
/// Show me what you got
Rect rect(matchLoc, Point(matchLoc.x + templ.cols, matchLoc.y + templ.rows));
cuda::GpuMat final = d_src(rect);
auto t_end = std::chrono::high_resolution_clock::now();
double elaspedTimeMs = std::chrono::duration<double, std::milli>(t_end - t_start).count();
cout << "Time difference is " << elaspedTimeMs << " milliSeconds" << endl;

Mat result;
final.download(result); // copy the matched region back to the host before saving it
imwrite("C:/Users/Én/Documents/Visual Studio 2015/Projects/blol/blol/samples/data/final.bmp", result);
rectangle(img, matchLoc, Point(matchLoc.x + templ.cols, matchLoc.y + templ.rows), Scalar::all(0), 2, 8, 0);
/// Create windows
namedWindow(image_window, cv::WINDOW_AUTOSIZE);
namedWindow(result_window, cv::WINDOW_AUTOSIZE);
resize(img, img, cv::Size(600, 600));
resize(result, result, cv::Size(600, 600));
imshow(image_window, img);
imshow(result_window, result);

return final;
}
/** @function main */
int main(int argc, char** argv)
{
img = imread("C:/Users/Én/Documents/Visual Studio 2015/Projects/blol/blol/samples/data/envelope.bmp", 1);
templ = imread("C:/Users/Én/Documents/Visual Studio 2015/Projects/blol/blol/samples/data/w2.bmp", 1);

cuda::GpuMat d_src, d_templ, d_dst;
d_src.upload(img);     // copy the host images to the GPU before matching
d_templ.upload(templ);

cuda::GpuMat final1, final2;

final1 = templateMatching(d_src, d_templ);
final2 = templateMatching(d_src, d_templ);
waitKey(0);
return 0;
}

/**
* @function MatchingMethod
* @brief Trackbar callback
*/
void MatchingMethod(int, void*)
{
/// Source image to display
Mat img_display;
img.copyTo(img_display);

/// Create the result matrix
int result_cols = img.cols - templ.cols + 1;
int result_rows = img.rows - templ.rows + 1;

result.create(result_rows, result_cols, CV_32FC1);

/// Do the Matching and Normalize
matchTemplate(img, templ, result, match_method);
normalize(result, result, 0, 1, NORM_MINMAX, -1, Mat());

/// Localizing the best match with minMaxLoc
double minVal; double maxVal; Point minLoc; Point maxLoc;
Point matchLoc;

minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc, Mat());

/// For SQDIFF and SQDIFF_NORMED, the best matches are lower values. For all the other methods, the higher the better
if (match_method == cv::TM_SQDIFF || match_method == cv::TM_SQDIFF_NORMED)
{
matchLoc = minLoc;
}
else
{
matchLoc = maxLoc;
}

/// Show me what you got
rectangle(img_display, matchLoc, Point(matchLoc.x + templ.cols, matchLoc.y + templ.rows), Scalar::all(0), 2, 8, 0);
rectangle(result, matchLoc, Point(matchLoc.x + templ.cols, matchLoc.y + templ.rows), Scalar::all(0), 2, 8, 0);
resize(img_display, img_display, cv::Size(600, 600));
resize(result, result, cv::Size(600, 600));
imshow(image_window, img_display);
imshow(result_window, result);

return;
}

errors:
Severity Code Description Project File Line Suppression State
Error template instantiation resulted in unexpected function type of "std::true_type (std::integral_constant *)” (the meaning of a name may have changed since the template declaration — the type of the template is “std::true_type (std::is_same<std::decay<decltype(())>::type, void>::type *)”) Cuda1 c:\users\Ún\downloads\opencv_4_0_0_cuda_10_0\install\include\opencv2\core\cvstd_wrapper.hpp 49
Error name followed by “::” must be a class or namespace name Cuda1 c:\users\Ún\downloads\opencv_4_0_0_cuda_10_0\install\include\opencv2\core\cvstd_wrapper.hpp 52

1. ParallelVision says:

Because the version of OpenCV you are using was compiled against CUDA 10.0, and you are including CUDA headers in your program, you need to install CUDA 10.0, not CUDA 10.1, on your system.
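
If the Python bindings are installed, one quick way to see which CUDA version a given OpenCV build was compiled against is to read it from cv2.getBuildInformation(). The sketch below is only illustrative: the helper names are hypothetical, and the regular expression assumes the build summary contains a line of the form "NVIDIA CUDA: YES (ver 10.0, ...)", which may differ between OpenCV versions.

```python
import re


def cuda_version_from_build_info(build_info):
    """Return the CUDA version reported in OpenCV's build summary, or None."""
    # Assumed line format: "NVIDIA CUDA:   YES (ver 10.0, CUFFT CUBLAS)"
    match = re.search(r"NVIDIA CUDA:\s+YES \(ver ([\d.]+)", build_info)
    return match.group(1) if match else None


def check_toolkit(build_info, installed_version):
    """Compare the CUDA version OpenCV was built with to the installed toolkit."""
    built = cuda_version_from_build_info(build_info)
    if built is None:
        return "OpenCV was built without CUDA"
    if built != installed_version:
        return f"mismatch: OpenCV built with CUDA {built}, toolkit is {installed_version}"
    return f"ok: CUDA {built}"


def check_installed_opencv(installed_version):
    """Run the check against the OpenCV build that Python can import."""
    import cv2  # requires an OpenCV build with Python bindings
    return check_toolkit(cv2.getBuildInformation(), installed_version)
```

Calling check_installed_opencv("10.1") with a build compiled against CUDA 10.0 would report a mismatch, which is the situation described above.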

6. Erick says:

Hi,

I’m having this error:

CMake Error at D:/opencv/sources/cmake/OpenCVUtils.cmake:704 (pkg_check_modules):
Unknown CMake command "pkg_check_modules".
Call Stack (most recent call first):
D:/opencv_contrib/modules/freetype/CMakeLists.txt:6 (ocv_check_modules)

1. ParallelVision says:

Are you building from the 4.1.0 tag with the exact same build options as in the guide?

1. Erick says:

Hi,

No. I’m installing OpenCV 4.0.1 with CUDA 10.

This is part of the installation:

-- AVX_512F is not supported by C++ compiler
-- AVX512_SKX is not supported by C++ compiler
-- Dispatch optimization AVX512_SKX is not available, skipped
-- libjpeg-turbo: VERSION = 1.5.3, BUILD = opencv-4.0.1-libjpeg-turbo
-- Looking for Mfapi.h
-- Looking for Mfapi.h - found
-- found Intel IPP (ICV version): 2019.0.0 [2019.0.0 Gold]
-- at: D:/Vision/opencv-master/opencv/build/3rdparty/ippicv/ippicv_win/icv
-- found Intel IPP Integration Wrappers sources: 2019.0.0
-- at: D:/Vision/opencv-master/opencv/build/3rdparty/ippicv/ippicv_win/iw
-- CUDA detected: 10.0
-- CUDA NVCC target flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-D_FORCE_INLINES;-gencode;arch=compute_7,code=compute_7
-- Could not find OpenBLAS include. Turning OpenBLAS_FOUND off
-- Could not find OpenBLAS lib. Turning OpenBLAS_FOUND off
-- Could NOT find BLAS (missing: BLAS_LIBRARIES)
-- LAPACK requires BLAS
-- VTK is not found. Please set -DVTK_DIR in CMake to VTK build directory, or to VTK install subdirectory with VTKConfig.cmake file
-- OpenCV Python: during development append to PYTHONPATH: D:/Vision/opencv-master/opencv/build/python_loader
-- Caffe: NO
-- Protobuf: NO
-- Glog: NO
CMake Error at D:/Vision/opencv-master/opencv/cmake/OpenCVUtils.cmake:704 (pkg_check_modules):
Unknown CMake command "pkg_check_modules".
Call Stack (most recent call first):
D:/Vision/opencv-master/opencv_contrib/modules/freetype/CMakeLists.txt:6 (ocv_check_modules)

-- Configuring incomplete, errors occurred!

1. Erick says:

What should I do with the pre-built binaries in your downloads: OpenCV 4.0 x64, VS2017 with CUDA 10.0 and python bindings? Do I need to replace the install folder with this folder?

1. Erick says:

Never mind, I figured it out. Thanks!

7. josefin heilig says:

Is there any chance that this will work without an Intel CPU? I have an AMD Ryzen 5 2600 and am having a bit of a hard time getting it compiled and running.

1. ParallelVision says:

Hi Josefin, are you referring to Intel MKL or CUDA? Intel MKL should run on AMD CPUs; however, CUDA will only run on NVIDIA GPUs.

1. josefin heilig says:

Hi,
AMD is my CPU, and I do have an NVIDIA GPU. I was just wondering whether your CMake instructions, or even the pre-compiled OpenCV, only work with an Intel CPU.
But by now I have managed to at least get OpenCV 4.1.0 with CUDA for Windows working, so it seems Intel is not needed.

Am I understanding your answer correctly that Intel MKL also improves performance on AMD CPUs? Or is it just installed, doing no harm but also giving no benefit?

1. ParallelVision says:

Intel MKL will improve the speed of a very limited number of functions in OpenCV on both Intel and AMD CPUs, specifically functions like GEMM, which is used in the guide.

8. Thank you for the post, which is very helpful. It is straightforward to integrate your pre-built binaries into my own VS2017 VC++ project. I just changed the OpenCV include directory, OpenCV library file, and OpenCV library directory from the old “opencv340” folder to the “OpenCV_4_1_0_cuda_10_1_python” folder. The code compiled and ran immediately.
However, I am not sure whether the new library and DLL files are actually CUDA accelerated on my computer, since the performance is the same as with the old opencv340.
My system: Windows 10 laptop; i7-7700HQ @ 2.80GHz; Nvidia GeForce 1060.
Is there any way to check whether the GPU is actually being used for the computation?

1. ParallelVision says:

The CUDA functions have a different API (not all standard OpenCV functions are supported). You will need to rewrite your code to take advantage of the speed increase.

You may want to first try using UMat instead of Mat with the transparent API, which uses OpenCL. All the OpenCV functions are supported, but only certain ones will use the OpenCL device. Before you execute the code, you can choose the device (CPU/iGPU/Nvidia GPU etc.) on which to run the OpenCL-implemented functions by setting the environment variable OPENCV_OPENCL_DEVICE. Note: it won’t be as fast as using the CUDA API, but you may find it quicker to implement.
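
As a rough illustration of the UMat route from Python, the sketch below reports what the installed build can see and runs one warpAffine call through the transparent API. The function names are hypothetical, and the device string ":GPU:0" is an assumption (first GPU OpenCL device); adjust it for your machine. The import is guarded only so that the snippet can be loaded on a system without OpenCV.

```python
import os

# Assumption: ":GPU:0" selects the first GPU OpenCL device. The variable has to
# be set before OpenCV initialises OpenCL, so it is set before importing cv2.
os.environ.setdefault("OPENCV_OPENCL_DEVICE", ":GPU:0")

try:
    import cv2
    import numpy as np
except ImportError:  # keeps the sketch loadable without OpenCV/NumPy
    cv2 = None


def acceleration_report():
    """Summarise which acceleration back ends this OpenCV build can see."""
    if cv2 is None:
        return {"opencv": False}
    report = {"opencv": True}
    # Number of CUDA devices visible to this build (0 for builds without CUDA).
    report["cuda_devices"] = (
        cv2.cuda.getCudaEnabledDeviceCount() if hasattr(cv2, "cuda") else 0
    )
    report["opencl_available"] = cv2.ocl.haveOpenCL()
    report["opencl_in_use"] = cv2.ocl.useOpenCL()
    return report


def warp_with_umat(image, matrix):
    """Run warpAffine through the transparent API by wrapping the input in a UMat."""
    height, width = image.shape[:2]
    warped = cv2.warpAffine(cv2.UMat(image), matrix, (width, height))
    return warped.get()  # copy the result back into a NumPy array


print(acceleration_report())
```

Note that a non-zero cuda_devices count only means the CUDA modules can be used; existing Mat-based code still runs on the CPU until it is rewritten against the CUDA API.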

1. Peter Yin says:

Hi James,
Thank you for the comment. It helps me to clarify the scope of your original post. Following your suggestion, I have tested UMat vs. Mat in my VS2017 project using the pre-built binaries (OpenCV 3.4.0) from OpenCV.org. It seems to me that the pre-built binaries are fully parallelized: Windows Task Manager indicates 100% CPU usage when using Mat, and UMat is about twice as fast as Mat.
The test was carried out using a large loop (2000 iterations) on:
cv::warpAffine(src, dst, RMat, dsize, cv::INTER_LINEAR, cv::BORDER_CONSTANT, cv::Scalar());
where src, dst, and RMat are declared as UMat and Mat respectively. I am still working on replacing Mat with UMat in my project, but it is not as easy as it seems. So far, the workload is mostly related to templated Mat types (Mat3b) and pixel access (at), which are not compatible with UMat.

9. John says:

Hi, I have followed the steps religiously. After the step below, it builds but does not create any libraries:
"C:\Program Files\CMake\bin\cmake.exe" --build %openCvBuild% --target INSTALL --config Debug
When I try to build from Visual Studio, I get the error LINK : fatal error LNK1104: cannot open file '..\..\lib\Debug\opencv_world410d.lib'.
I am sure I have followed all the steps correctly. I am using CUDA 10.1 and OpenCV 4.1.
Will this result in any issue?

1. Mohammed Faizanulla says:

Hi John, I am facing a similar issue. Were you able to solve it?

1. ParallelVision says:

Hi, can you add the commands you used to call CMake and the build output?

1. Chuan says:

This is awesome. Thanks a lot!

2. Chuan says:

Hey, James

I wonder which OpenCV commit you used to build this: https://mega.nz/#!CI4WlKZC!1sxXMQx3_3_jhV6E48c7HJbq5y6CNJl0gVLv3cCpg5Y ?

In this build the Python cudacodec API works for both a local mp4 and a remote RTSP stream. However, my build only works for a local mp4. It throws an error when I call "reader = cv.cudacodec.createVideoReader(rtsp_url)":

Traceback (most recent call last):
File "test_camera.py", line 51, in <module>
cv2.error: OpenCV(4.1.1-dev) C:\Users\Windows\lambda-dev\opencv_contrib\modules\cudacodec\src\ffmpeg_video_source.cpp:102: error: (-215:Assertion failed) init_MediaStream_FFMPEG() in function 'cv::cudacodec::detail::FFmpegVideoSource::FFmpegVideoSource'

These are what I used to build:
https://github.com/cudawarped/opencv/tree/19b8ed52a48b3c342e37157f0efe7b2175d497e0opencv_contrib-78518a137372de374501812f3088100d86358960
https://github.com/cudawarped/opencv_contrib/commit/78518a137372de374501812f3088100d86358960
Nvidia Video Codec SDK 9.0.20
CUDA 10.0

BTW, didn’t realize you are cudawarped on Github. Really appreciate the work you did for the community!

Best

1. ParallelVision says:

Hi, if RTSP works on the modified build but not on your build, you will need to apply the change described here. Unfortunately, to do this on Windows you need to build OpenCV against ffmpeg instead of using the provided shared library (opencv_ffmpeg410_64.dll).

This involves:
1) Downloading the ffmpeg Dev archive and altering the CMake script to find the headers and libs contained in it. A quick way to do this would be to add the following to detect_ffmpeg.cmake:


if(FFMPEG_ROOT_DIR AND WIN32 AND NOT ARM)
  find_path(AVCODEC_INCLUDE_DIR libavcodec/avcodec.h PATHS ${FFMPEG_ROOT_DIR}/include/)
  find_library(AVCODEC_LIBRARY lib/avcodec.lib PATHS ${FFMPEG_ROOT_DIR})
  set(FFMPEG_INCLUDE_DIRS ${AVCODEC_INCLUDE_DIR})
  set(FFMPEG_LIBRARIES ${AVCODEC_LIBRARY})

  find_path(AVFORMAT_INCLUDE_DIR libavformat/avformat.h PATHS ${FFMPEG_ROOT_DIR}/include/)
  find_library(AVFORMAT_LIBRARY lib/avformat.lib PATHS ${FFMPEG_ROOT_DIR})
  list(APPEND FFMPEG_INCLUDE_DIRS ${AVFORMAT_INCLUDE_DIR})
  list(APPEND FFMPEG_LIBRARIES ${AVFORMAT_LIBRARY})

  find_path(AVUTIL_INCLUDE_DIR libavutil/avutil.h PATHS ${FFMPEG_ROOT_DIR}/include/)
  find_library(AVUTIL_LIBRARY lib/avutil.lib PATHS ${FFMPEG_ROOT_DIR})
  list(APPEND FFMPEG_INCLUDE_DIRS ${AVUTIL_INCLUDE_DIR})
  list(APPEND FFMPEG_LIBRARIES ${AVUTIL_LIBRARY})

  find_path(AVDEVICE_INCLUDE_DIR libavdevice/avdevice.h PATHS ${FFMPEG_ROOT_DIR}/include/)
  find_library(AVDEVICE_LIBRARY lib/avdevice.lib PATHS ${FFMPEG_ROOT_DIR})
  list(APPEND FFMPEG_INCLUDE_DIRS ${AVDEVICE_INCLUDE_DIR})
  list(APPEND FFMPEG_LIBRARIES ${AVDEVICE_LIBRARY})

  find_path(SWSCALE_INCLUDE_DIR libswscale/swscale.h PATHS ${FFMPEG_ROOT_DIR}/include/)
  find_library(SWSCALE_LIBRARY lib/swscale.lib PATHS ${FFMPEG_ROOT_DIR})
  list(APPEND FFMPEG_INCLUDE_DIRS ${SWSCALE_INCLUDE_DIR})
  list(APPEND FFMPEG_LIBRARIES ${SWSCALE_LIBRARY})

  set(HAVE_FFMPEG TRUE)
endif()


3) Telling cmake where to find the ffmpeg libs

-DFFMPEG_ROOT_DIR="PATH_TO_FFMPEG_DEV"

As this is untested, I would not recommend doing this, or using the binaries I provided, if you need stability.

Best

10. SJ says:

Hi James,
I am trying to use the functions in your pre-built binary as follows:
compute = cv2.cuda_SparsePyrLKOpticalFlow.create()
p1, st, err = compute.calc(temI, serI, p0, None)

But I get the following error:
p1, st, err = compute.calc(temI, serI, p0, None)
cv2.error: OpenCV(4.1.0) D:\James\repos\opencv\modules\core\src\matrix_wrap.cpp:359: error: (-213:The function/feature is not implemented) getGpuMat is available only for cuda::GpuMat and cuda::HostMem in function 'cv::_InputArray::getGpuMat'

I want to use the cuda functionality for the LK optical flow algorithm in OpenCV. Any guidance would be appreciated.

1. ParallelVision says:

Hi, I suspect your problem may be that you are passing standard cv::Mats to compute.calc instead of GpuMats. I tried this myself by running the OpenCV example using CUDA calls and everything seemed to work. If you want to try it yourself, the notebook I was using can be found here. If you are still having issues, please let me know.
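
For readers hitting the same error, here is a minimal sketch of what uploading the inputs to GpuMats before calling calc might look like. The function names are hypothetical, and it assumes an OpenCV build with the CUDA modules and Python bindings, 8-bit grayscale frames, and points shaped (N, 1, 2) as returned by cv2.goodFeaturesToTrack. The CUDA implementation expects the points as a single row, hence the reshape.

```python
import numpy as np


def as_point_row(points):
    """Reshape (N, 1, 2) points to the (1, N, 2) float32 layout used on the GPU."""
    return np.asarray(points, dtype=np.float32).reshape(1, -1, 2)


def track_on_gpu(prev_gray, next_gray, points):
    """Track points with the CUDA sparse pyramidal LK flow; returns NumPy arrays."""
    import cv2  # needs an OpenCV build with the CUDA modules and Python bindings

    # Upload every input to device memory; passing NumPy arrays directly is the
    # likely cause of the getGpuMat error.
    d_prev, d_next, d_pts = cv2.cuda_GpuMat(), cv2.cuda_GpuMat(), cv2.cuda_GpuMat()
    d_prev.upload(prev_gray)
    d_next.upload(next_gray)
    d_pts.upload(as_point_row(points))

    flow = cv2.cuda_SparsePyrLKOpticalFlow.create()
    d_next_pts, d_status, d_err = flow.calc(d_prev, d_next, d_pts, None)

    # Download the results back to host memory for further processing.
    return d_next_pts.download(), d_status.download(), d_err.download()
```

The returned arrays keep the single-row layout, so they may need reshaping back to (N, 1, 2) before being passed to CPU functions.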

11. Thanks, man, for the great tutorial; it took 4 hours to build 🙂