OpenCV 3.4 GPU CUDA Performance Comparison (Nvidia vs Intel)

In this post I am going to use OpenCV’s performance tests to compare the CUDA and CPU implementations. The idea is to get an indication of which OpenCV and/or computer vision algorithms, in general, benefit the most from GPU acceleration, and therefore under what circumstances it might be a good idea to invest in a GPU.

Test Setup
  • Software: OpenCV 3.4 compiled with Visual Studio 2017, CUDA 9.1, Intel MKL with TBB, and TBB. To generate the CPU results I simply ran the CUDA performance tests with CUDA disabled, so that the fallback CPU functions were called, by changing the following

    #define PERF_RUN_CUDA()  false //::perf::GpuPerf::targetDevice()

    on line 228 of


    The performance tests cover 104 OpenCV functions, with each function tested over a number of different configurations (function arguments). In total, 6031 different CUDA performance configurations/tests run successfully, of which only 5300 are supported by both the GPU and CPU.

  • Hardware: Four different hardware configurations were tested, consisting of three laptops and one desktop; the CPU/GPU combinations are listed below:

    1. CPU: i5-4210U, GPU: 730M (laptop)
    2. CPU: i5-5200U, GPU: 840M (laptop)
    3. CPU: i7-6700HQ, GPU: GTX 980M (laptop)
    4. CPU: i5-6500, GPU: GTX 1060 (desktop)

GPU Specifications

The GPUs tested comprise three different micro-architectures, ranging from a low-end laptop GPU (730M) to a mid-range desktop GPU (GTX 1060). The full specifications are shown below, where I have also included the maximum theoretical speedup if the OpenCV function were bandwidth or compute limited. This value is included just to give an indication of what should be possible if architectural improvements, SM count, etc. don’t have any impact on performance. In general, most algorithms will be bandwidth limited, implying that the average speedup of the OpenCV functions could lie somewhere between these two values. If you are not familiar with this concept, I would recommend watching Memory Bandwidth Bootcamp: Best Practices, Memory Bandwidth Bootcamp: Beyond Best Practices and Memory Bandwidth Bootcamp: Collaborative Access Patterns by Tony Scudiero for a good overview.
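To make the bound concrete, here is a minimal sketch of how such a theoretical maximum is calculated (the figures in the usage example are illustrative only, not the exact values from the specification tables):

```cpp
#include <cassert>

// Theoretical maximum speedup when a function is limited purely by one
// resource (memory bandwidth or compute): the ratio of the GPU's figure
// to the CPU's figure for that resource. Architectural differences, SM
// count, etc. are deliberately ignored, as in the post.
double theoreticalSpeedup(double gpuFigure, double cpuFigure)
{
    return gpuFigure / cpuFigure;
}
```

For instance, a GPU with 192 GB/s of memory bandwidth paired with a CPU sustaining 25.6 GB/s would give a bandwidth-limited bound of about 7.5x.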

CPU Specifications

The CPUs tested also comprise three different micro-architectures, ranging from a low-end dual-core laptop CPU (i5-4210U) to a mid-range quad-core desktop CPU (i5-6500). The full specifications are shown below, where I have again included the maximum theoretical speedup, depending on whether the OpenCV functions are limited by the CPU’s bandwidth or clock speed (I could not find any Intel-published GFLOPS figures).

Benchmark Results

The results for all tests are available here, where you can check if a specific configuration benefits from an improvement in performance when moved to the GPU.

To get an overall picture of the performance increase which can be achieved by using the CUDA functions over the standard CPU ones, the speedup of each CPU/GPU over the least powerful CPU (i5-4210U) is compared. The figure below shows the speedup averaged over all 5300 tests (All Configs). Because the average speedup is influenced by the number of different configurations tested per OpenCV function, two additional measures, which only consider one configuration per function, are also shown in the figure:

  • GPU Min – the average speedup, taken over all OpenCV functions for the configuration where the GPU speedup was smallest.
  • GPU Max – the average speedup, taken over all OpenCV functions for the configuration where the GPU speedup was greatest.

The results demonstrate that the configuration (function arguments) makes a massive difference to CPU/GPU performance. That said, even the slowest configurations on the slowest GPUs are in the same ballpark, performance-wise, as the fastest configurations on the most powerful CPUs in the test. This, combined with a higher average performance for all GPUs tested, implies that you should nearly always see an improvement when moving to the GPU if you have several OpenCV functions in your pipeline (as long as you don’t keep moving your data to and from the GPU), even if you are using a low-end, two-generation-old laptop GPU (730M).

Now let’s examine some individual OpenCV functions. Because each function has many configurations, the average execution time over all configurations tested is used to calculate each function’s speedup over the i5-4210U. This provides a guide to the expected performance of a function irrespective of the specific configuration. The next figure shows the top 20 functions where the GPU speedup was largest. It is worth noting that the speedup of the GTX 1060 over all of the CPUs is so large that it has to be shown on a log scale.

Next, the bottom 20 functions where the GPU speedup was smallest.

The above figure demonstrates that, although the CUDA implementations are on average much quicker, some functions are significantly quicker on the CPU. Generally this is because the CPU function uses the Intel Integrated Performance Primitives for Image Processing and Computer Vision (IPP-ICV) and/or SIMD instructions. That said, the above results also show that some of these slower functions do benefit from the parallelism of the GPU, but a more powerful GPU is required to leverage this.

Finally, let’s examine which OpenCV functions took the longest. This is important if you are using one of these functions, as you may consider calling its CUDA counterpart even if it is the only OpenCV function you need. The below figure contains the execution times for the 20 functions which took the longest on the i5-4210U; again this has to be shown on a log scale because the GPU execution time is much smaller than the CPU execution time.

Given the possible performance increases shown in the results, if you were performing mean shift filtering with OpenCV on a laptop with only a low-end i5-4210U, the execution time of nearly 7 seconds may encourage you to upgrade your hardware. From the above it is clear that it is much better to invest in a tiny GPU (730M), which will reduce your processing time by a factor of 10 to a more tolerable 0.6 seconds, or a mid-range GPU (GTX 1060), reducing your processing time by a factor of 100 to 0.07 seconds, rather than a mid-range i7, which will give you less than a 30% reduction.

To conclude, I would just reiterate that the benefit you will get from moving your processing to the GPU with OpenCV will depend on the function you call and the configuration that you use, in addition to your processing pipeline. That said, from what I have observed, on average the CUDA functions are much, much quicker than their CPU counterparts. Please let me know if there are any mistakes in my results and/or analysis.

Copyright secured by Digiprove © 2020 James Bowley

17 thoughts on “OpenCV 3.4 GPU CUDA Performance Comparison (Nvidia vs Intel)”

  1. It would be great if you had prepared a set of binaries and source files to run the tests on other HW, i.e. so I could do the tests on other cards.

    1. Thank you for your comment. I had considered including all the code for generating the results; however, the performance tests require python, and the pre-compiled OpenCV binaries rely on CUDA 9.1 being installed, or on redistribution of some of its DLLs, which I am not sure I can host. This means I cannot package the tests as a simple set of self-contained executables to be run on lots of different machines, as I had hoped and as you suggested.

      That said, it is simple enough to do it yourself. As detailed in the post, the results are generated using the OpenCV performance tests which are included in the pre-compiled binaries for Visual Studio 2017 (choose the ones without MKL and TBB to reduce the dependencies) on my downloads page. To easily generate the results, see this guide, which uses the python scripts included in the OpenCV source code here.

      If you test on other HW, please share the results.

  2. Thank you for sharing this article! Do you have any experience with OpenCV on other platforms? If you had to build the ultimate OpenCV/Cuda rig, would you go with an i9-7980XE and Titan Vs or would you go with dual Xeons and Tesla V100s?

    1. Hi, whilst I don’t have any experience with the most modern Intel architectures or the performance of TBB on a dual-socket setup, from the conclusions I drew I would have to suggest that the ultimate setup would include the best GPU you can afford, with the choice of CPU being secondary. That said, it very much depends on the functions you are going to use and the size of the data you are going to process.

  3. Thank you for doing this work! I am trying to reproduce the results.
    CUDA tests run fine. But when I set PERF_RUN_CUDA() to false and re-compile the performance tests project, all CUDA tests fail with “No regression data for cpu_dst argument”.
    Have you encountered this issue (and know the workaround)?

    1. Hi, the error implies that you don’t have the OpenCV test data or you have not set the environment variable OPENCV_TEST_DATA_PATH to point to it.
      You will need to download the test data by cloning the opencv_extra repository here, and then add the following environment variable, or set it in the command prompt, as below

      set OPENCV_TEST_DATA_PATH=your_opencv_extra_location\testdata

      before running the tests.
      I had to generate separate test data for the CPU tests because the results for some tests are slightly different from the CUDA ones. If you don’t do this, some of the tests will fail and you won’t get a result for the execution time. As a quick workaround I simply copied the testdata to a new location and generated the CPU results following the guide here, under the heading “How to update perf data”. Then I set the OPENCV_TEST_DATA_PATH environment variable in the command prompt to point to this location directly before running the tests.

      1. Thank you for your helpful reply.
        The OPENCV_TEST_DATA_PATH points to the correct test data on my computer.

        The issue I see is that the CUDA-related validation files provided in testdata\perf\ do not include any validation rules for the “cpu_dst” target.
        Do you suggest removing these files from testdata\perf?
        They are distributed together with the testdata – not generated from previous CUDA tests on my computer.
        Another way would be to rename gpu_dst to cpu_dst in all such files.

        @”The OPENCV_TEST_DATA_PATH points to the correct test data on my computer.”
        Meaning that OPENCV_TEST_DATA_PATH points to fresh test data downloaded from GitHub

        1. I only get that error if I do not set the OPENCV_TEST_DATA_PATH correctly, and python cannot find the xml files stored in testdata\perf.
          Can you confirm that you have the following file testdata\perf\cudaarithm.xml?
          Do the CUDA performance tests produce the same errors?
          Have you checked out the 3.4 tag in the opencv_extra repository?
          When you

          echo %OPENCV_TEST_DATA_PATH%

          directly before running the tests, does it output something similar to


          1. @”Can you confirm that you have the following file testdata\perf\cudaarithm.xml?”
            Yes, it is there. The issue is that it has validation data only for GPU, see sections ….
            There is no such data for CPU tests (it would look like: …).
            That is why I see “No regression data for cpu_dst argument”.

            (A) I was able to overcome this by removing all these cudaXXX.xml files from the testdata folder altogether.
            Now the tests run fine without error messages. The timings are not so good compared to the GPU (as expected).
            However, there is no way to validate whether the results are correct.

            (B) If I rename the … to … in cudaXXX.xml then the tests also run fine without error messages. But the differences in the results are larger than epsilon (also expected), and we cannot use the validation data from the cudaXXX.xml files at all.

            I am using OpenCV 3.4.2 with CUDA 9.2, latest ICC, MKL, etc.

            In any case, thank you very much for your quick help.
            I didn’t pay attention to the xml files in the testdata\perf directory before you told me. That was the key to the solution.

          2. That is strange; I am pretty sure I just used the cudaXXX.xml files which come from the repo to run the standard CUDA-compiled perf tests. I then copied the folder and regenerated the cudaXXX.xml files for the CPU run from the CPU-compiled CUDA perf tests (opencv_perf_cudaXXX.exe), and everything worked, as long as I remembered to switch the environment variable to point to the correct test data.

  4. Thanks for posting this!

    You mentioned that the speedup is based on the OpenCV function execution time. I just wanted to know whether you have included the image upload and download times to and from the GPU when calculating those speedups?

    Thanks again

    1. Hi, I’m glad you enjoyed the post.

      All the results are generated using the built-in OpenCV performance tests. Therefore the execution time should always be the OpenCV function execution time, calculated using timers in the CPU thread as an average over several runs. It should not include data transfers to and from the device.

      If you want to verify this for a particular function you are interested in then you can check the source code. For example the performance test for the cv::cuda::pow() can be found here.

    1. Hi, unfortunately I have not run a comparison of the OpenCV routines which utilize MKL + TBB against their CUDA counterparts. The only comparison I have looked at is that of GEMM, when verifying OpenCV is CUDA accelerated. As you can see there, it is very important to build with MKL + TBB if you are using BLAS routines.

  5. Hi, I need to accelerate my OpenCV python application using CUDA. Can you help out with where I could get details on the CUDA-supporting OpenCV python functions and how to use them in my code? The syntax and how to use it. It would be great if I could get these details.

    1. Hi, I would check the python tests for the functions which you want to use, because not all of the CUDA functions have python bindings. This should give you a good guide to what’s available. These are stored in the opencv_contrib/modules//misc/python/test/ directory for each CUDA module. E.g. the tests for the cudaarithm python module are in test_cudaarithm.

      For an example see Accelerating OpenCV with CUDA streams in Python.
