
OpenCV MKL/TBB vs cuBLAS

To investigate the impact of building OpenCV with Intel MKL/TBB, I have compared the performance of the BLAS level 3 GEMM routine (cv::gemm), with and without MKL/TBB optimization, against the corresponding cuBLAS implementation (cv::cuda::gemm).

The first comparison is performed using the standard C++ interface and the built-in OpenCV performance tests. The results are then compared with those obtained by accessing OpenCV through the Python interface (cv2.gemm() and cv2.cuda.gemm()).

To achieve the above objectives, the performance of the following three implementations is examined:

  1. cv::cuda::gemm() run on a GTX 1060 GPU.
  2. The C++ implementation of cv::gemm() run on an i5-6500 CPU with and without MKL/TBB.
  3. The OpenCL implementation of cv::gemm() run on the CPU (i5-6500) and on both GPUs (GTX 1060 and HD Graphics 530).

The main areas covered are described in the sections that follow.

OpenCV Performance Tests

OpenCV comes with a set of performance tests, and luckily these include a GEMM test for both the CPU (cv::gemm) and the GPU (cv::cuda::gemm). As a result there is no need to write any test code if you want to perform the same tests on your own system. The steps below do, however, assume that you have built or downloaded OpenCV with CUDA support for Visual Studio 2019 (vc16).
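If you want to see exactly which GEMM test cases are available before picking one, the performance test executables are built on Google Test, so the standard listing flag can be used (shown here for the CUDA arithmetic tests; the path assumes the build described above):

"%openCvBuild%\install\x64\vc16\bin\opencv_perf_cudaarithm.exe" --gtest_list_tests --gtest_filter=*GEMM*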

GPU (GTX 1060) Performance

As we would expect the GPU to outperform the CPU in this test, we first get a GPU baseline result, which we will then try to compete with by progressively increasing the level of CPU optimization.
To run the CUDA performance test, simply enter the following into a command prompt

"%openCvBuild%\install\x64\vc16\bin\opencv_perf_cudaarithm.exe" --gtest_filter=Sz_Type_Flags_GEMM.GEMM/29

(where %openCvBuild% is your build directory, or the directory to which you extracted the downloaded binaries). To verify that everything is working, look for the “[ PASSED ] 1 test” text in the output. Note: if you have set OPENCV_TEST_DATA_PATH then this test will fail the sanity check since CUDA 11.0.
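If you run into that sanity-check failure, one workaround is to clear the variable for the current command prompt session before running the test (in cmd, assigning an empty value removes a variable):

set OPENCV_TEST_DATA_PATH=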

The above test performed matrix multiplication on 1024×1024 two-channel (complex) single-precision matrices using a GTX 1060 GPU, repeated 100 times, with a mean execution time of 3.86 ms, as shown in the following line taken from the test output.

[ PERFSTAT ]   (samples=100   mean=3.86   median=3.85   min=3.13   stddev=0.40 (10.3%)) 

Next we compare these results with the same test run on the CPU, to see if it can compete on the specific hardware set-up we have.

CPU (i5-6500) Performance

The standard OpenCV core GEMM performance test does not use 1024×1024 matrices, so for this comparison we can simply modify the GEMM tests inside opencv_perf_core.exe to process this size instead of 640×640. This is achieved by changing Size(640, 640) to Size(1024, 1024) in the test instantiation, so that the line reads

::testing::Values(Size(1024, 1024), Size(1280, 1280)),

Denoting the modified executable as opencv_perf_core_1024.exe, the corresponding CPU test can be run as shown below. Setting OPENCV_OPENCL_DEVICE=disabled ensures that the test executes the plain CPU code path instead of being dispatched to an OpenCL device through the transparent API.

set OPENCV_OPENCL_DEVICE=disabled
"%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3

resulting in the following output on a midrange i5-6500.

[ PERFSTAT ]    (samples=10   mean=2212.55   median=2210.16   min=2195.34   stddev=9.46 (0.4%))

The execution time is nearly three orders of magnitude greater than on the GPU, so what is wrong with our CPU? As it turns out, nothing: to get a baseline CPU result, I purposely ran this without building OpenCV against any optimized BLAS. To demonstrate the performance benefit of building OpenCV with Intel’s MKL (which includes an optimized BLAS) and TBB, I have run the same test again with two different levels of optimization, OpenCV built against:

  1. Intel MKL without multi-threading
    [ PERFSTAT ]    (samples=100   mean=90.87   median=89.88   min=86.02   stddev=2.81 (3.1%))
  2. Intel MKL multi-threaded with TBB
    [ PERFSTAT ]    (samples=100   mean=32.78   median=31.06   min=28.17   stddev=4.62 (14.1%))

This demonstrates the impact of using multi-threaded MKL, which brings the gap between CPU and GPU performance down significantly (from roughly 570× to under 10×).
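As an aside, if you are unsure which of these optimizations your own build actually includes, the build configuration can be inspected from Python. A minimal sketch, assuming the cv2 bindings are installed:

import cv2 as cv
# Print only the build-configuration lines identifying the parallel
# framework (e.g. TBB) and the BLAS/LAPACK backend (e.g. MKL).
for line in cv.getBuildInformation().splitlines():
    if "Parallel framework" in line or "Lapack" in line:
        print(line.strip())

Now we are ready to compare with OpenCL.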

OpenCL Performance

In OpenCV 4.0 the CUDA modules were moved from the main repository to the contrib repository, presumably because OpenCL will be used for GPU acceleration going forward. To examine the implications of this, I ran the same performance tests as above again, only this time on each of my three OpenCL devices. The results for each device are given below, including the command used to run each test.
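The OPENCV_OPENCL_DEVICE variable used below selects the device by type. To confirm which device OpenCV has actually picked up, you can query it from Python before running the tests (a quick check using the standard bindings; note that the variable must be set before the first OpenCL call in the process):

import cv2 as cv
# Report whether an OpenCL runtime is available and which device is selected.
print(cv.ocl.haveOpenCL())
if cv.ocl.haveOpenCL():
    print(cv.ocl.Device_getDefault().name())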

  • Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    set OPENCV_OPENCL_DEVICE=:CPU:
    "%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=100   mean=233.26   median=230.26   min=224.78   stddev=8.00 (3.4%))
  • Intel(R) HD Graphics 530
    set OPENCV_OPENCL_DEVICE=:iGPU:
    "%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=100   mean=116.97   median=114.63   min=112.52   stddev=5.47 (4.7%))
  • GeForce GTX 1060 3GB
    set OPENCV_OPENCL_DEVICE=:dGPU:
    "%openCvBuild%\install\x64\vc16\bin\opencv_perf_core_1024.exe" --gtest_filter=OCL_GemmFixture_Gemm.Gemm/3
    [ PERFSTAT ]    (samples=13   mean=8.75   median=8.82   min=8.25   stddev=0.22 (2.5%))

Performance Test Results

Bringing the performance results from all of the tests together, we can see that for this specific test and hardware configuration (GTX 1060 vs i5-6500):

  1. If we ignore OpenCL, the CUDA implementation on the GTX 1060 (3.86 ms) is comfortably faster than the MKL + TBB implementation executed on the CPU (32.78 ms).
  2. The OpenCL implementation on the GTX 1060 (8.75 ms) is significantly slower than the CUDA version. This is expected, but unfortunate considering that the OpenCV CUDA routines have been moved from the main repository and may eventually be deprecated.
  3. OpenCL still has a long way to go. In addition to its poor performance when compared with CUDA on the same device, the OpenCL implementations on both the CPU (i5-6500, 233.26 ms) and the iGPU (HD Graphics 530, 116.97 ms) were several times (roughly 4–7×) slower than the optimized MKL + TBB implementation on the CPU.

The above comparison is just for fun, to give an example of how to quickly check whether using OpenCV with CUDA on your specific hardware combination is worthwhile. For a more in-depth comparison on several hardware configurations, see OpenCV 3.4 GPU CUDA Performance Comparison (Nvidia vs Intel).

Python Performance Tests

The Python modules do not have any built-in performance tests; however, the same comparison as above can be quickly performed using an interactive Python session and the %timeit built-in magic command. First make sure that you have built or downloaded OpenCV with CUDA support and Python bindings, and that you have copied the bindings to your Python site-packages directory. Then open up the Anaconda3 or Windows command prompt and issue the following to start the Python session, ensuring that the path to OpenCV is set correctly.

set path=%openCvBuild%\install\x64\vc16\bin;%path%
ipython
import cv2 as cv
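Before timing anything, it is worth confirming that the CUDA bindings are working; the following standard call returns the number of CUDA-capable devices OpenCV can see, and should return at least 1:

cv.cuda.getCudaEnabledDeviceCount()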

Python GPU (GTX 1060) Performance

To run the GEMM test on the GPU with CUDA from within Python, enter the following at the ipython prompt

import numpy as np
# Build a 1024x1024 two-channel (complex-valued) single-precision matrix,
# matching the data used by the C++ performance test.
npTmp = np.random.random((1024, 1024)).astype(np.float32)
npMat1 = npMat2 = npMat3 = npDst = np.stack([npTmp, npTmp], axis=2)
# Upload once so the timing excludes the host-to-device copy.
cuMat1 = cuMat2 = cuMat3 = cuDst = cv.cuda_GpuMat(npMat1)
%timeit cv.cuda.gemm(cuMat1, cuMat2, 1, cuMat3, 1, cuDst, 1)

You should see output similar to

3.97 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

which is very close to the result obtained when the same test was called directly from C++ (3.86 ms on the GTX 1060).
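Note that this timing excludes memory transfers, because the matrices were uploaded to the GPU in advance. If your data originates on the CPU you may also want to include the upload and download costs; a rough sketch of one way to time this:

gpuSrc = cv.cuda_GpuMat()
%timeit gpuSrc.upload(npMat1); res = cv.cuda.gemm(gpuSrc, cuMat2, 1, cuMat3, 1, cuDst, 1); res.download()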

Python CPU (i5-6500) Performance

For completeness, you can run the same test on the CPU as

%timeit cv.gemm(npMat1, npMat2, 1, npMat3, 1, npDst, 1)

and confirm that the new result

29.4 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

is comparable with the previous one (32.78 ms on the i5-6500 from C++).
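The CPU timing will also depend on how many threads OpenCV’s parallel framework is allowed to use; if your numbers differ substantially, these standard functions let you check and adjust it (note that MKL’s internal threading is controlled separately, for example via the MKL_NUM_THREADS environment variable):

cv.getNumThreads()   # threads available to OpenCV's parallel framework
cv.setNumThreads(4)  # optionally pin, e.g. to the number of physical cores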

You can also perform a quick sanity check to confirm that you are seeing good performance from the GEMM operation in OpenCV. An easy way to do this is to run the same operation again, only this time in NumPy.

# Build complex-valued matrices for the equivalent NumPy computation.
npMat3 = npMat4 = npMat5 = npTmp + npTmp * 1j
%timeit npMat3.T @ npMat4 + npMat5

As you can see, the data is structured in a slightly different way (a complex dtype instead of two float channels), and the .T mirrors the flags=1 argument (cv.GEMM_1_T, transpose the first operand) passed to the OpenCV calls; however, the timings

29.3 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

should hopefully be comparable to the OpenCV result (29.4 ms on the i5-6500 calling OpenCV from Python).

From the results of these quick tests we can infer that:

  1. The overhead from using the Python interface (for both the CPU and the CUDA implementations) instead of calling directly from C++ is small.
  2. The GEMM operation in OpenCV is highly optimized when built against Intel MKL/TBB.
