Build OpenCV 4.0.0 with CUDA 10.0 and Intel MKL + TBB on Windows

Because the pre-built Windows libraries available for OpenCV 4.0.0 do not include the CUDA modules, or support for Intel’s Math Kernel Library (MKL) or Intel Threading Building Blocks (TBB) performance libraries, I have included the build instructions below for anyone who is interested. If you just need the Windows libraries then go to Download OpenCV 4.0.0 with CUDA 10.0. To get an indication of the performance boost from calling the OpenCV CUDA functions with these libraries see the OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).

The guide below details how to compile the 64-bit OpenCV 4.0.0 shared libraries with Visual Studio 2017 and CUDA 10.0, with support for both the Intel Math Kernel Library (MKL) and Intel Threading Building Blocks (TBB).

Before continuing there are a few things to be aware of:

  1. CUDA 10.0 is supported by Visual Studio 2017 from version 15.8 onward. To follow the guide it is best to install or upgrade to the latest version.
  2. The procedure outlined has only been tested on Visual Studio Community 2017 (15.9.4).
  3. The OpenCV DNN modules are not CUDA accelerated.  I have seen other guides which include instructions to download cuDNN.  This is completely unnecessary and will have no effect on performance.
  4. I have not included instructions for compiling the Python bindings because you cannot call the CUDA modules from within Python. If you require Python support it is better to install directly through pip or conda.
  5. If you have built OpenCV with CUDA support, then to use those libraries and/or redistribute applications built with them on machines without the CUDA toolkit installed, you will need to ensure those machines have:
    • a CUDA-capable NVIDIA GPU with driver version 411.31 or later, and
    • the CUDA DLLs (cublas64_100.dll, nppc64_100.dll etc.) placed somewhere on the system or user path, or in the same directory as the executable. These can be found in the following directory:
      C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin
  6. The latest version of Intel TBB uses a shared library, therefore if you build with Intel TBB you need to add
    C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\redist\intel64_win\tbb\vc_mt

    to your path variable, and make sure you redistribute that dll with any of your applications.


There are a couple of components you need to download and/or install before you can get started. You first need to:

  • Install Visual Studio 2017, selecting the “Desktop development with C++” workload. If you already have an installation, ensure that the correct workload is installed and that you have updated to the latest version.
  • Download the source files for both OpenCV and OpenCV contrib, available on GitHub. Either clone the git repos OpenCV and OpenCV Contrib, making sure to check out the 4.0.0 tag, or download the archives OpenCV 4.0.0 and OpenCV Contrib 4.0.0 containing all the source files.
    Note: I have seen lots of guides including instructions to download and use git to get the source files, however this is a completely unnecessary step. If you are a developer and you don’t already have git installed then I would assume there is a good reason for this, and I would not advise installing it just to build OpenCV.
  • Install CMake – Version 3.13.2 is used in the guide.
  • Install The CUDA 10.0 Toolkit
  • Optional – Install both the Intel MKL and TBB by registering for community licensing, and downloading for free. MKL version 2019.1.144 and TBB version 2019.2.144 are used in this guide; I cannot guarantee that other versions will work correctly.


Generating OpenCV Visual Studio solution files with CMake

In the next section we are going to generate the Visual Studio solution files with CMake. There are two ways to do this, from the command prompt or with the CMake GUI, however by far the quickest and easiest way to proceed is to use the command prompt to generate the base configuration. Then if you want to add any additional configuration options, you can open up the build directory in the CMake GUI as described in the section on adding additional configuration options with the CMake GUI.

Generating Visual Studio solution files for OpenCV 4.0.0 with CUDA 10.0 and Intel MKL + TBB, from the command prompt (cmd)

The next five steps will build the opencv_world400.dll shared library using NVIDIA’s recommended settings for future hardware compatibility. This does however have two drawbacks, first the build can take several hours to complete and second, the shared library will be at least 929MB depending on the configuration that you choose below. To find out how to reduce both the compilation time and size of opencv_world400.dll read choosing the compute-capability.

  1. Open up the command prompt (Windows key + R, then type cmd and press Enter).
  2. Ignore this step if you are not building with Intel MKL + TBB. Enter the command below
    "C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\tbb\bin\tbbvars.bat" intel64

    to temporarily set the environment variables for locating your TBB installation.

  3. Set the location of the source files, your build directory and your Visual Studio edition, by entering the text shown below, first setting PATH_TO_OPENCV_SOURCE to the root of the OpenCV files you downloaded or cloned (the directory containing 3rdparty, apps, build, etc.) and PATH_TO_OPENCV_CONTRIB to the modules directory inside the contrib repo (the directory containing cudaarithm, cudabgsegm, etc.).
    set "openCvSource=PATH_TO_OPENCV_SOURCE"
    set "openCVExtraModules=PATH_TO_OPENCV_CONTRIB"
    set "openCvBuild=%openCvSource%\build"
  4. Then choose your configuration from below and copy to the command prompt:
    • OpenCV 4.0.0 with CUDA 10.0
      "C:\Program Files\CMake\bin\cmake.exe" -B"%openCvBuild%/" -H"%openCvSource%/" -G"Visual Studio 15 2017 Win64" -DBUILD_opencv_world=ON -DBUILD_opencv_gapi=OFF -DWITH_NVCUVID=OFF -DWITH_CUDA=ON -DCUDA_FAST_MATH=ON -DWITH_CUBLAS=ON -DINSTALL_TESTS=ON -DINSTALL_C_EXAMPLES=ON -DBUILD_EXAMPLES=ON -DOPENCV_EXTRA_MODULES_PATH="%openCVExtraModules%/" -DOPENCV_ENABLE_NONFREE=ON -DCUDA_ARCH_PTX=7.5
    • OpenCV 4.0.0 with CUDA 10.0 and MKL multi-threaded with TBB
    • OpenCV 4.0.0 with CUDA 10.0, MKL multi-threaded with TBB and TBB
  5. The OpenCV.sln solution file should now be in your PATH_TO_OPENCV_SOURCE/build directory. To build OpenCV you have two options, depending on your preference you can:
    • Build directly from the command line by simply entering the following (replacing Debug with Release to build a release version)
      "C:\Program Files\CMake\bin\cmake.exe" --build "%openCvBuild%" --target INSTALL --config Debug
    • Build through Visual Studio GUI by opening up the OpenCV.sln in Visual Studio, selecting your Configuration, clicking on Solution Explorer, expanding CMakeTargets, right clicking on INSTALL and clicking Build.

    Either approach will both build the library and copy the necessary redistributable parts to the install directory, PATH_TO_OPENCV_SOURCE/build/install in this example. All that is required now to run any programs compiled against these libs is to add the directory containing opencv_world400.dll (and tbb.dll if you have built with Intel TBB) to your path environment variable.
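
As a quick sanity check of the path setup described above, the following Python sketch looks for the DLLs named in this guide in each directory of a PATH-style string. The helper names (find_on_path, report) are my own and are not part of OpenCV or CUDA. It only searches the path variable, so a DLL sitting next to your executable will not be reported even though Windows would still find it.

```python
import os

# DLL names taken from this guide; tbb.dll only matters if you built with Intel TBB.
REQUIRED_DLLS = ["opencv_world400.dll", "cublas64_100.dll", "nppc64_100.dll", "tbb.dll"]


def find_on_path(filename, path_value=None):
    """Return every directory in a PATH-style string that contains filename."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        entry = entry.strip('"')  # PATH entries are sometimes quoted on Windows
        if entry and os.path.isfile(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


def report(path_value=None):
    """Map each required DLL to the list of PATH directories that contain it."""
    return {name: find_on_path(name, path_value) for name in REQUIRED_DLLS}


if __name__ == "__main__":
    for name, dirs in report().items():
        print(f"{name}: {dirs[0] if dirs else 'NOT FOUND on PATH'}")
```

If more than one directory is listed for a DLL, the first one is the copy that the path search reaches first, which is worth knowing if you have several OpenCV builds installed.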

If everything was successful, congratulations, you now have OpenCV 4.0.0 built with CUDA 10.0. To quickly verify that the CUDA modules are working and check if there is any performance benefit on your specific hardware, see Verifying OpenCV is CUDA accelerated below.

Adding additional configuration options with the CMake GUI

Once you have generated the base Visual Studio solution file from the command prompt, the easiest way to make any additional configuration changes is through the CMake GUI. To do this:

  1. Fire up the CMake GUI.
  2. Making sure that the Grouped checkbox is ticked, click on the Browse Build button and navigate to your PATH_TO_OPENCV_SOURCE/build directory. If you have selected the correct directory, the main CMake window should be populated with the existing configuration, grouped by prefix.
  3. Now any additional configuration changes can be made by just expanding any of the grouped items and ticking or unticking the values displayed. Once you are happy just press Configure. If the bottom window displays configuration successful, press Generate. Now you can open up the Visual Studio solution file and proceed as before.
  4. Troubleshooting:
    • Make sure you have the latest version of Visual Studio 2017 (>= 15.8)
    • Not all options are compatible with each other and the configuration step may fail as a result. If so examine the error messages given in the bottom window and look for a solution.
    • If the build is failing after making changes to the base configuration, I would advise you to remove the build directory and start again, making sure that you can at least build the base Visual Studio solution files produced from the command line.

Verifying OpenCV is CUDA accelerated

The easiest way to quickly verify that everything is working is to check that one of the inbuilt CUDA performance tests passes. For this I have chosen the GEMM test which:

  • runs without any external data;
  • should be highly optimized on both the GPU and CPU making it “informative” to compare the performance timings later on, and;
  • has OpenCL versions.

To run the CUDA performance test simply enter the following into the existing command prompt

"%openCvBuild%\install\x64\vc15\bin\opencv_perf_cudaarithm.exe" --gtest_filter=Sz_Type_Flags_GEMM.GEMM/29

To verify that everything is working, look for the green [ PASSED ] text in the output.

The above test performed matrix multiplication on a 1024x1024x2 single precision matrix using a midrange GTX 1060 GPU 100 times, with a mean execution time of 4.01 ms, as shown in the following line of the output.

[ PERFSTAT ]    (samples=100   mean=4.01   median=4.03   min=3.47   stddev=0.24 (6.0%))

If the test has passed then we can confirm that the above code was successfully run on the GPU using CUDA. Next it would be interesting to compare these results to the same test run on a CPU, to check we are getting a performance boost on our specific hardware setup.

CPU (i5-6500) Performance
The standard opencv core GEMM performance test does not use 1024×1024 matrices, therefore for this comparison we can simply change the GEMM tests inside opencv_perf_core.exe to process this size instead of 640×640. This is achieved by simply changing the following line to be

::testing::Values(Size(1024, 1024), Size(1280, 1280)),

Denoting the modified executable as opencv_perf_core_1024.exe, the corresponding CPU test can be run as

"%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3

resulting in the following output on a midrange i5-6500.

[ PERFSTAT ]    (samples=10   mean=1990.56   median=1990.67   min=1962.95   stddev=16.56 (0.8%))

The execution time is roughly 500 times greater than on the GPU, so what is wrong with our CPU? As it turns out nothing is wrong: to get a baseline result, I purposely ran this without building OpenCV against any optimized BLAS. To demonstrate the performance benefit of building OpenCV with Intel’s MKL (which includes optimized BLAS) and TBB I have run the same test again with two different levels of optimization, OpenCV built against:

  1. Intel MKL without multi-threading
    [ PERFSTAT ]    (samples=10   mean=90.77   median=90.15   min=89.64   stddev=1.98 (2.2%))
  2. Intel MKL multi-threaded with TBB
    [ PERFSTAT ]    (samples=100   mean=28.86   median=28.37   min=27.34   stddev=1.33 (4.6%))

This demonstrates the importance of using multi-threaded MKL and brings the gap between CPU and GPU performance down significantly. Now we are ready to compare with OpenCL.

OpenCL Performance
In OpenCV 4.0 the CUDA modules were moved from the main to the contrib repository, presumably because OpenCL will be used for GPU acceleration going forward. To examine the implications of this I ran the same performance tests as above again, only this time on each of my three OpenCL devices. The results for each device are given below including the command to run each test.

  • Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:CPU:Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    "%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=205.35   median=205.63   min=200.35   stddev=2.82 (1.4%))
  • Intel(R) HD Graphics 530
    set OPENCV_OPENCL_DEVICE=Intel(R) OpenCL:GPU:Intel(R) HD Graphics 530
    "%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=130.88   median=129.82   min=127.46   stddev=2.72 (2.1%))
  • GeForce GTX 1060 3GB
    "%openCvBuild%\install\x64\vc15\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=8.83   median=8.85   min=8.53   stddev=0.17 (1.9%))

The performance results for all the tests are shown together below.

The results in the figure show that for this specific test and hardware configuration (GTX 1060 vs i5-6500):

  1. If we ignore OpenCL the CUDA implementation on the GTX 1060 is comfortably faster than the MKL + TBB implementation executed on the CPU.
  2. The OpenCL implementation on the GTX 1060 is significantly slower than the CUDA version. This is expected but unfortunate considering the OpenCV CUDA routines have been moved from the main repository and may eventually be deprecated.
  3. OpenCL still has a long way to go: in addition to its poor performance when compared with CUDA on the same device, the implementations on both the CPU (i5-6500) and the iGPU (HD Graphics 530) were roughly 4.5 to 7 times slower than the optimized MKL + TBB implementation on the CPU.
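
To make the comparison reproducible on your own hardware, the ratios can be computed directly from the [ PERFSTAT ] lines. The Python sketch below is only an illustration: the labels and the helper names (mean_ms, relative_to) are my own, and the lines are copied from the runs reported in this post. Swap in the lines printed on your machine.

```python
import re

# Matches the mean value inside a "[ PERFSTAT ]" line.
MEAN_PATTERN = re.compile(r"mean=([0-9.]+)")

# PERFSTAT lines copied from the runs in this post (times in ms).
RESULTS = {
    "CUDA (GTX 1060)": "[ PERFSTAT ]    (samples=100   mean=4.01   median=4.03   min=3.47   stddev=0.24 (6.0%))",
    "OpenCL (GTX 1060)": "[ PERFSTAT ]    (samples=13   mean=8.83   median=8.85   min=8.53   stddev=0.17 (1.9%))",
    "CPU, MKL + TBB": "[ PERFSTAT ]    (samples=100   mean=28.86   median=28.37   min=27.34   stddev=1.33 (4.6%))",
    "CPU, MKL single-threaded": "[ PERFSTAT ]    (samples=10   mean=90.77   median=90.15   min=89.64   stddev=1.98 (2.2%))",
    "OpenCL (HD Graphics 530)": "[ PERFSTAT ]    (samples=13   mean=130.88   median=129.82   min=127.46   stddev=2.72 (2.1%))",
    "OpenCL (i5-6500)": "[ PERFSTAT ]    (samples=13   mean=205.35   median=205.63   min=200.35   stddev=2.82 (1.4%))",
    "CPU, no optimized BLAS": "[ PERFSTAT ]    (samples=10   mean=1990.56   median=1990.67   min=1962.95   stddev=16.56 (0.8%))",
}


def mean_ms(line):
    """Extract the mean execution time from one PERFSTAT line."""
    match = MEAN_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no mean= field in: {line!r}")
    return float(match.group(1))


def relative_to(baseline, results):
    """Return each mean divided by the baseline mean (values above 1 are slower)."""
    base = mean_ms(results[baseline])
    return {name: mean_ms(line) / base for name, line in results.items()}


if __name__ == "__main__":
    ratios = relative_to("CUDA (GTX 1060)", RESULTS)
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{name:28s} {mean_ms(RESULTS[name]):9.2f} ms   {ratio:7.1f}x")
```

Passing "CPU, MKL + TBB" as the baseline instead shows how each OpenCL device compares with the optimized CPU build.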

The above comparison is just for fun, to give an example of how to quickly check if using OpenCV with CUDA on your specific hardware combination is worthwhile. For a more in-depth comparison on several hardware configurations see OpenCV 3.4 GPU CUDA Performance Comparison (nvidia vs intel).


Choosing the compute-capability

The default command line options given above implement NVIDIA’s CUDA 10.0 recommended settings for future hardware compatibility. This means that any programs linked against the resulting opencv_world400.dll should work on all GPUs currently supported by CUDA 10.0 and all GPUs released in the future. As mentioned above this comes at a cost, both in terms of compilation time and shared library size. Before discussing the CMake settings which can be used to reduce these costs we need to understand the following concepts:

  • Compute-capability – every GPU has a fixed compute-capability which determines its general specifications and features. In general the more recent the GPU the higher the compute-capability and the more features it will support. This is important because:
    • Each version of CUDA supports different compute-capabilities. Usually a new version of CUDA comes out to support a new GPU architecture; in the case of CUDA 10.0, support was added for the Turing (compute 7.5) architecture. On the flip side, support for older architectures can be removed, for example CUDA 9.0 removed support for the Fermi (compute 2.0) architecture. Therefore by choosing to build OpenCV with CUDA 10.0 we have limited ourselves to GPUs of compute-capability >=3.0. Notice we have not limited ourselves to GPUs of compute-capability <=7.5; the reason for this is discussed in the next section.
    • You can build opencv_world400.dll to support one or many different compute-capabilities, depending on your specific requirements.
  • Supporting a compute-capability – to support a specific compute-capability you can do either of the following, or a combination of the two:
    • Generate architecture-specific cubin files, which are only forward-compatible with GPU architectures with the same major version number. This can be controlled by passing CUDA_ARCH_BIN to CMake. For example passing -DCUDA_ARCH_BIN=3.0 to CMake will result in opencv_world400.dll containing binary code which can only run on compute-capability 3.0, 3.5 and 3.7 devices. Furthermore it will not support any specific features of compute-capability 3.5 (e.g. dynamic parallelism) or 3.7 (e.g. 128K 32-bit registers). In the case of OpenCV 4.0.0 this would not restrict any functionality because it only uses features from compute-capability 3.0 and below. This can be confirmed by a quick search of the contrib repository for the __CUDA_ARCH__ flag.
    • Generate forward-compatible PTX assembly for a virtual architecture, which is forward-compatible with all GPU architectures of greater than or equal compute-capability. This can be controlled by passing CUDA_ARCH_PTX to CMake. For example by passing -DCUDA_ARCH_PTX=7.5 to CMake, opencv_world400.dll will contain PTX code for compute-capability 7.5 which can be Just In Time (JIT) compiled to architecture-specific binary code by the CUDA driver, on any future GPU architectures. Because of the default CMake rules, when CUDA_ARCH_BIN is not explicitly set it will also contain architecture-specific cubin files for GPU architectures 3.0-7.5.
  • PTX considerations – given that PTX code is forward-compatible and cubin binaries are not, it would be tempting to only include the former. To understand why this might not be such a great idea, there are a few things to be aware of when generating PTX code:
    1. As mentioned previously the CUDA driver JIT compiles PTX code at run time and caches the resulting cubin files so that the compile operation should in theory be a one-time delay, at least until the driver is updated. However if the cache is not large enough JIT compilation will happen every time, causing a delay every time your program executes.

      To get an idea of this delay I passed -DCUDA_ARCH_BIN=3.0 and -DCUDA_ARCH_PTX=3.0 to CMake before building OpenCV. I then emptied the cache (default location %appdata%\NVIDIA\ComputeCache\) and ran the GEMM performance example on a GTX 1060 (compute-capability 6.1), to force JIT compilation. I measured an initial delay of over 3 minutes as the PTX code was JIT compiled before the program started to execute. Following that, the delay of subsequent executions was around a minute, because the default cache size (256 MB) was not large enough to store all the compiled PTX code. Given my compile options the only solution to remove this delay is to increase the size of the cache by setting the CUDA_CACHE_MAXSIZE environment variable to a number of bytes greater than required. Unfortunately because, “Older binary codes are evicted from the cache to make room for newer binary codes if needed”, this is more of a band-aid than a solution. This is because the maximum cache size is 4 GB, therefore your PTX compiled code can be evicted at any point in time if other programs on your machine are also JIT compiling from PTX, bringing back the “one-time” only delay.

    2. For maximum device coverage you should include PTX for the lowest possible GPU architecture you want to support.
    3. For maximum performance NVIDIA recommends including PTX for the highest possible architecture you can.
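
If you do rely on JIT compilation, it can help to see how much the compute cache currently holds before choosing a value for CUDA_CACHE_MAXSIZE. The Python sketch below is a rough aid under stated assumptions: it treats the 256 MB default and 4 GB maximum quoted above as binary units, the helper names and the headroom factor of 2 are my own choices, and a cache that is already full will under-report what your program really needs.

```python
import os

DEFAULT_CACHE_BYTES = 256 * 1024**2  # 256 MB default quoted in this guide
MAX_CACHE_BYTES = 4 * 1024**3        # 4 GB maximum quoted in this guide


def dir_size_bytes(root):
    """Sum the sizes of all files below root (0 if root does not exist)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def suggested_cache_bytes(used_bytes, headroom=2.0):
    """Scale the measured size by headroom, clamped between default and maximum."""
    wanted = int(used_bytes * headroom)
    return max(DEFAULT_CACHE_BYTES, min(wanted, MAX_CACHE_BYTES))


if __name__ == "__main__":
    # Default cache location given in this guide: %appdata%\NVIDIA\ComputeCache
    cache_dir = os.path.join(os.environ.get("APPDATA", ""), "NVIDIA", "ComputeCache")
    used = dir_size_bytes(cache_dir)
    print(f"{cache_dir} currently holds {used} bytes")
    print(f"set CUDA_CACHE_MAXSIZE={suggested_cache_bytes(used)}")
```

The printed set command only affects the current command prompt, in the same way as the other set commands used in this guide.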

Possible cubin/PTX combinations
Given (1)-(3) above, the command line options that you want to pass to CMake when building OpenCV will depend on your specific requirements. I have given some examples below for various scenarios given a main GPU of compute-capability 6.1:

  • Firstly stick with the defaults if compile time and shared library size are not an issue. This offers the greatest amount of flexibility from a development standpoint, avoiding the possibility of needing to recompile OpenCV when you switch GPU.
  • If your programs will always be run on your main GPU, just pass -DCUDA_ARCH_BIN=6.1 to CMake to target your architecture only. It should take around an hour to build, depending on your CPU, and the resulting shared library should not be larger than 200 MB.
  • If you are going to deploy your application, but only to newer GPUs, pass -DCUDA_ARCH_BIN=6.1,7.0,7.5 and -DCUDA_ARCH_PTX=7.5 to CMake for maximum performance and future compatibility. This is advisable because you may not have any control over the size of the JIT cache on the target machine, therefore including cubins for all compute-capabilities you want to support is the only way to be sure to prevent JIT compilation delay on every invocation of your application.
  • If size is really an issue but you don’t know which GPUs you want to run your application on, then to ensure that your program will run on all current and future supported GPUs pass -DCUDA_ARCH_BIN=6.1 and -DCUDA_ARCH_PTX=3.0 to CMake for maximum coverage.
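
The compatibility rules described in this section can be written down as a small Python sketch, which is handy for checking a cubin/PTX combination before committing to a build that takes hours. This is my own simplified model rather than anything shipped with OpenCV or CUDA: it assumes the two lists are exactly what ends up in opencv_world400.dll, that a cubin runs on devices with the same major version and an equal or higher minor version, and that PTX can be JIT compiled on any device of equal or higher compute-capability. The function names are hypothetical, and the 8.0 device stands in for a future GPU.

```python
def parse_cc(text):
    """Turn a string such as '6.1' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def parse_list(value):
    """Parse a CUDA_ARCH_BIN / CUDA_ARCH_PTX style list such as '6.1,7.0,7.5'."""
    return [parse_cc(item) for item in value.replace(";", ",").split(",") if item.strip()]


def how_it_runs(device, arch_bin, arch_ptx):
    """Return 'cubin', 'jit' or 'unsupported' for a device compute-capability."""
    dev = parse_cc(device)
    # A cubin is usable within the same major version, for equal or newer minors.
    if any(b[0] == dev[0] and b <= dev for b in parse_list(arch_bin)):
        return "cubin"
    # PTX can be JIT compiled for any equal or newer compute-capability.
    if any(p <= dev for p in parse_list(arch_ptx)):
        return "jit"
    return "unsupported"


if __name__ == "__main__":
    scenarios = {
        "BIN=3.0 PTX=3.0 (JIT experiment)": ("3.0", "3.0"),
        "BIN=6.1,7.0,7.5 PTX=7.5": ("6.1,7.0,7.5", "7.5"),
        "BIN=6.1 PTX=3.0": ("6.1", "3.0"),
    }
    devices = ["3.0", "5.2", "6.1", "7.5", "8.0"]  # 8.0 is a hypothetical future GPU
    for label, (arch_bin, arch_ptx) in scenarios.items():
        row = ", ".join(f"{d}: {how_it_runs(d, arch_bin, arch_ptx)}" for d in devices)
        print(f"{label:34s} {row}")
```

Any device reported as jit is exposed to the cache size problem discussed under PTX considerations, so for a deployment target you would want it to show cubin instead.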
