
Accelerating OpenCV with CUDA streams in Python

Overview

Since August 2018 the OpenCV CUDA API has been exposed to Python (for details of the API calls see test_cuda.py). To get the most from this new functionality you need a basic understanding of CUDA (most importantly, that it is data parallel, not task parallel) and of its interaction with OpenCV. Below I introduce these topics with an example of how you could optimize a toy video processing pipeline. The actual functions called in the pipeline are not important; they simply simulate a common processing pipeline consisting of work performed on both the host (CPU) and device (GPU).

This guide is taken from a Jupyter Notebook which can be cloned from here. The procedure is as follows: after some quick initialization, we start with a naive implementation on both the CPU and GPU to get a baseline result. We then incrementally improve the implementation using the information provided by the Nvidia Visual Profiler.

On a laptop GTX 2080 paired with an i7-8700, the final CUDA incarnation resulted in speedups of ~29x and ~9x over the naive CPU and GPU implementations respectively.

Init

In [9]:
#export
import os
import time
import numpy as np
from functools import partial
import matplotlib.pyplot as plt
import cv2 as cv
In [10]:
def CheckFg(fg_gs,fg):
    # Compare the gold-standard foreground masks (fg_gs) against the test results (fg).
    # Returns False if the number of frames differs, the index of the first
    # mismatching frame if one is found, and True otherwise.
    if (len(fg_gs) != len(fg)):
        return False
    for i in range(0,len(fg)):
        if(np.sum(fg_gs[i]!=fg[i]) != 0):
            return i
    return True
In [11]:
#export
# globals
vidPath = os.environ['OPENCV_TEST_DATA_PATH'] + '/cv/video/768x576.avi'
lr = 0.05
rows_big = 1440
cols_big = 2560
check_res = False
frame_device = cv.cuda_GpuMat()

Naive implementations

In [12]:
#export
def ProcVid0(proc_frame_func,lr):
    cap = cv.VideoCapture(vidPath)
    if (cap.isOpened()== False): 
        print("Error opening video stream or file")
        return
    n_frames = 0
    start = time.time()
    while(cap.isOpened()):
        ret, frame = cap.read()
        if ret == True:
            n_frames += 1 
            proc_frame_func(frame,lr)
        else:
            break
    end = time.time()
    cap.release()
    return (end - start)*1000/n_frames, n_frames

CPU

In [13]:
#export
bgmog2 = cv.createBackgroundSubtractorMOG2()
def ProcFrameCPU0(frame,lr,store_res=False):
    frame_big = cv.resize(frame,(cols_big,rows_big))
    fg_big = bgmog2.apply(frame_big,learningRate = lr)
    fg_small = cv.resize(fg_big,(frame.shape[1],frame.shape[0]))
    if(store_res):
        cpu_res.append(np.copy(fg_small))
In [15]:
#export
cpu_res = []
cpu_time_0, n_frames = ProcVid0(partial(ProcFrameCPU0,store_res=check_res),lr)
print(f'CPU 0 (naive): {n_frames} frames, {cpu_time_0:.2f} ms/frame')
CPU 0 (naive): 100 frames, 30.41 ms/frame

GPU

In [16]:
#export
bgmog2_device = cv.cuda.createBackgroundSubtractorMOG2()
def ProcFrameCuda0(frame,lr,store_res=False):
    frame_device.upload(frame)
    frame_device_big = cv.cuda.resize(frame_device,(cols_big,rows_big))
    fg_device_big = bgmog2_device.apply(frame_device_big,lr,cv.cuda.Stream_Null())
    fg_device = cv.cuda.resize(fg_device_big,frame_device.size())
    fg_host = fg_device.download()
    if(store_res):
        gpu_res.append(np.copy(fg_host))
In [22]:
#export
gpu_res = []
gpu_time_0, n_frames = ProcVid0(partial(ProcFrameCuda0,store_res=check_res),lr)
print(f'GPU 0 (naive): {n_frames} frames, {gpu_time_0:.2f} ms/frame')
print(f'Speedup over CPU: {cpu_time_0/gpu_time_0:.2f}')
GPU 0 (naive): 100 frames, 3.75 ms/frame
Speedup over CPU: 8.11

Analysis

(Figure: Nvidia Visual Profiler output for the naive GPU implementation.)

Observations:
The image above shows the Nvidia Visual Profiler output from processing 3 of the 100 frames. The important things to be aware of here are:

  1. The runtime API calls, shown in brown, which in this example represent the time the host (CPU) spends waiting for the device (GPU) calls to return.
  2. The remaining blocks, which show the time spent on the device. This is split according to the operation (kernel, memset, MemCpy(HtoD), MemCpy(DtoH)) as well as by the CUDA stream which the operations are issued to. In this case everything is issued to the default stream.
  3. In this naive implementation all device calls from the host are synchronous, and as a result the difference between (1) and (2) can be interpreted as periods where no useful work is being performed on either the host or the device. The host blocks waiting for the device to return, while the device is idle, allocating or freeing memory.
  4. The gaps between the blocks of runtime API calls, which represent the time spent executing code on the host; here that is the time taken for OpenCV to read and decode each video frame, frame = cap.read().

Taking (1) and (4) into account, from left to right the output from the profiler maps to the Python calls as:

  • (1166ms-1168.5ms) proc_frame_func(frame,lr): calls to the device to process the first frame
  • (1168.5ms-1169.5ms) frame = cap.read(): read and decode the second video frame on the host
  • (1169.5ms) proc_frame_func(frame,lr): calls to the device to process the second frame

Clearly from (3) a lot of time is wasted waiting for the device calls to return, and from (4) there is room for improvement if we can overlap host and device computation.

Hypothesis:
The causes of (3) are the blocking calls to both

  • cudaMallocPitch() – OpenCV in Python automatically allocates any arrays (numpy or GpuMat) which are returned from a function call. That is, on every iteration

    ret, frame = cap.read() causes memory for the numpy array frame to be allocated and destroyed on the host,
    and
    frame_device_big = cv.cuda.resize(frame_device,(cols_big,rows_big))
    fg_device_big = bgmog2_device.apply(frame_device_big,lr,cv.cuda.Stream_Null())
    fg_device = cv.cuda.resize(fg_device_big,frame_device.size())
    cause memory for frame_device_big, fg_device_big and fg_device to be allocated and destroyed on the device.

  • cudaDeviceSynchronize() – if you don’t explicitly pass a CUDA stream to an OpenCV CUDA function, the default stream is used and cudaDeviceSynchronize() is called before the function exits.

Action:
First address the unnecessary calls to cudaMallocPitch() by pre-allocating any output arrays and passing them as input arguments, as sketched below.
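
Most OpenCV CUDA functions accept a pre-allocated destination array as an argument, which suppresses the internal allocation. A minimal sketch of the pattern (the sizes here are illustrative):

src = cv.cuda_GpuMat(576, 768, cv.CV_8UC3)    # pre-allocated source on the device
dst = cv.cuda_GpuMat(1440, 2560, cv.CV_8UC3)  # pre-allocated destination on the device

out = cv.cuda.resize(src, (2560, 1440))       # allocating form: a new GpuMat (cudaMallocPitch) on every call
cv.cuda.resize(src, (2560, 1440), dst)        # pre-allocated form: dst is simply reused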

Pre-allocation of return arrays

In [23]:
#export
def ProcVid1(proc_frame,lr):
    cap = cv.VideoCapture(vidPath)
    if (cap.isOpened()== False): 
        print("Error opening video stream or file")
        return
    n_frames = 0
    start = time.time()
    while(cap.isOpened()):
        ret,_ = cap.read(proc_frame.Frame())
        if ret == True:
            n_frames += 1 
            proc_frame.ProcessFrame(lr)
        else:
            break
    end = time.time()
    cap.release()
    return (end - start)*1000/n_frames, n_frames

CPU

In [24]:
#export
class ProcFrameCpu1:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,store_res=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.store_res = store_res
        self.res = []
        self.bgmog2 = cv.createBackgroundSubtractorMOG2()
        self.frame = np.empty((rows_small,cols_small,3),np.uint8)
        self.frame_big = np.empty((rows_big,cols_big,3),np.uint8)
        self.fg_big = np.empty((rows_big,cols_big),np.uint8)
        self.fg_small = np.empty((rows_small,cols_small),np.uint8)
        
    def ProcessFrame(self,lr):
        cv.resize(self.frame,(self.cols_big,self.rows_big),self.frame_big)
        self.bgmog2.apply(self.frame_big,self.fg_big,learningRate = lr)
        cv.resize(self.fg_big,(self.cols_small,self.rows_small),self.fg_small)
        if(self.store_res):
            self.res.append(np.copy(self.fg_small))
        
    def Frame(self):
        return self.frame
    
cap = cv.VideoCapture(vidPath)
if (cap.isOpened()== False): 
  print("Error opening video stream or file")
ret, frame = cap.read()
cap.release()
rows_small,cols_small = frame.shape[:2]
proc_frame_cpu1 = ProcFrameCpu1(rows_small,cols_small,rows_big,cols_big,check_res)
In [25]:
#export
cpu_time_1, n_frames = ProcVid1(proc_frame_cpu1,lr)
print(f'CPU 1 (pre-allocation): {n_frames} frames, {cpu_time_1:.2f} ms/frame')
print(f'Speedup over CPU baseline: {cpu_time_0/cpu_time_1:.2f}')
CPU 1 (pre-allocation): 100 frames, 27.81 ms/frame
Speedup over CPU baseline: 1.09
In [26]:
if check_res: CheckFg(cpu_res,proc_frame_cpu1.res)

GPU

In [27]:
#export
class ProcFrameCuda1:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,store_res=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.store_res = store_res
        self.res = []
        self.bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()
        self.frame = np.empty((rows_small,cols_small,3),np.uint8)
        self.frame_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3)
        self.frame_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3)        
        self.fg_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1)
        self.fg_device_big.setTo(0)
        self.fg_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1)
        self.fg_host = np.empty((rows_small,cols_small),np.uint8)
        
    def ProcessFrame(self,lr):
        self.frame_device.upload(self.frame)
        cv.cuda.resize(self.frame_device,(cols_big,rows_big),self.frame_device_big)
        self.bgmog2.apply(self.frame_device_big,lr,cv.cuda.Stream_Null(),self.fg_device_big)
        cv.cuda.resize(self.fg_device_big,self.fg_device.size(),self.fg_device)
        self.fg_device.download(self.fg_host)
        if(self.store_res):
            self.res.append(np.copy(self.fg_host))
        
    def Frame(self):
        return self.frame
    
proc_frame_cuda1 = ProcFrameCuda1(rows_small,cols_small,rows_big,cols_big,check_res)
In [28]:
#export
gpu_time_1, n_frames = ProcVid1(proc_frame_cuda1,lr)
print(f'GPU 1 (pre-allocation): {n_frames} frames, {gpu_time_1:.2f} ms/frame')
print(f'Incremental speedup: {gpu_time_0/gpu_time_1:.2f}')
print(f'Speedup over CPU: {cpu_time_1/gpu_time_1:.2f}')
GPU 1 (pre-allocation): 100 frames, 2.03 ms/frame
Incremental speedup: 1.85
Speedup over CPU: 13.69
In [29]:
if check_res: CheckFg(gpu_res,proc_frame_cuda1.res)

Analysis

(Figure: Nvidia Visual Profiler output after pre-allocating the return arrays.)

Observations: Pre-allocating the arrays has successfully removed the calls to cudaMallocPitch() and significantly reduced (3), the time the host spends waiting for the CUDA runtime to return control to it: the same profiler window now shows 3 frames being processed instead of 1.5.

Hypothesis: As mentioned above, by not specifying a stream all calls are placed in the default stream, which can be seen at the bottom of the figure. This means that following each asynchronous kernel launch there is a synchronizing call to cudaDeviceSynchronize(), as shown below:

cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big) async kernel 1
cudaDeviceSynchronize()
bgmog2_device.apply(frame_device_big,lr,cv.cuda.Stream_Null(),fg_device_big) async kernel 2
cudaDeviceSynchronize()
cv.cuda.resize(fg_device_big,fg_device.size(),fg_device) async kernel 3
cudaDeviceSynchronize()
fg_device.download(fg_host) synchronous copy from device to host

Action: Pass a non-default CUDA stream to each OpenCV CUDA function, as sketched below.
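
In the Python bindings this simply means constructing a cv.cuda_Stream and handing it to every call. A minimal sketch of the pattern, reusing the arrays from above:

stream = cv.cuda_Stream()                        # non-default stream
frame_device.upload(frame, stream)               # the copy is enqueued on the stream
cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream)
stream.waitForCompletion()                       # synchronize only when the result is needed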

CUDA Streams

Replacing the default stream

In [31]:
#export
class ProcFrameCuda2:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,store_res=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.store_res = store_res
        self.res = []
        self.bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()
        self.stream = cv.cuda_Stream()
        self.frame = np.empty((rows_small,cols_small,3),np.uint8)
        self.frame_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3)
        self.frame_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3)
        self.fg_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1)
        self.fg_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1)
        self.fg_host = np.empty((rows_small,cols_small),np.uint8)
        
    def ProcessFrame(self,lr):
        self.frame_device.upload(self.frame,self.stream)
        cv.cuda.resize(self.frame_device,(cols_big,rows_big),self.frame_device_big,stream=self.stream)
        self.bgmog2.apply(self.frame_device_big,lr,self.stream,self.fg_device_big)
        cv.cuda.resize(self.fg_device_big,self.fg_device.size(),self.fg_device,stream=self.stream)
        self.fg_device.download(self.stream,self.fg_host)
        self.stream.waitForCompletion()  # immediate wait
        if(self.store_res):
            self.res.append(np.copy(self.fg_host))
        
    def Frame(self):
        return self.frame
    
proc_frame_cuda2 = ProcFrameCuda2(rows_small,cols_small,rows_big,cols_big,check_res)
In [32]:
#export
gpu_time_2, n_frames = ProcVid1(proc_frame_cuda2,lr)
print(f'GPU 2 (replacing the default stream): {n_frames} frames, {gpu_time_2:.2f} ms/frame')
print(f'Incremental speedup: {gpu_time_1/gpu_time_2:.2f}')
print(f'Speedup over GPU baseline: {gpu_time_0/gpu_time_2:.2f}')
print(f'Speedup over CPU: {cpu_time_1/gpu_time_2:.2f}')
GPU 2 (replacing the default stream): 100 frames, 2.19 ms/frame
Incremental speedup: 0.93
Speedup over GPU baseline: 1.71
Speedup over CPU: 12.71
In [33]:
if check_res: CheckFg(gpu_res,proc_frame_cuda2.res)

Analysis

(Figure: Nvidia Visual Profiler output after replacing the default stream.)

Observations: The calls to cudaDeviceSynchronize() have been removed, along with the gaps between the device calls; however, it looks like the calls to cudaDeviceSynchronize() have simply been replaced by calls to cudaMemcpy2DAsync().

Hypothesis: What has actually happened is that we have tried to use asynchronous copies to and from the device without first pinning the host memory. Therefore what is shown are three asynchronous kernel launches and a synchronous copy from the device to the host, which blocks the host thread until all the previous work on the device is complete:

cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream) async kernel 1
bgmog2.apply(frame_device_big,lr,stream,fg_device_big) async kernel 2
cv.cuda.resize(fg_device_big,fg_device.size(),fg_device,stream=stream) async kernel 3
fg_device.download(stream,fg_host) synchronous copy

Action: Pin the host memory to address this issue.

Overlap host and device computation – attempt 1

In [34]:
#export
# HostMem is not implemented in the Python bindings, so manually pin the memory of a numpy array
class PinnedMem(object):
    def __init__(self, size, dtype=np.uint8):
        self.array = np.empty(size,dtype)
        cv.cuda.registerPageLocked(self.array)
        self.pinned = True
    def __del__(self):
        cv.cuda.unregisterPageLocked(self.array)
        self.pinned = False
    def __repr__(self):
        return f'pinned = {self.pinned}'
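
For illustration, a pinned buffer is used like any other numpy array via its .array member; the difference is that copies to and from it can now be genuinely asynchronous. A minimal sketch, reusing the globals defined above:

stream = cv.cuda_Stream()
frame_pinned = PinnedMem((rows_small, cols_small, 3))  # page-locked host memory
frame_device.upload(frame_pinned.array, stream)        # a true asynchronous copy (HtoD)
stream.waitForCompletion()                             # block until the copy has finished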
In [35]:
#export
class ProcFrameCuda3:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,store_res=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.store_res = store_res
        self.res = []
        self.bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()
        self.stream = cv.cuda_Stream()
        self.frame = PinnedMem((rows_small,cols_small,3))
        self.frame_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3)
        self.frame_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3)
        self.fg_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1)
        self.fg_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1)
        self.fg_host = PinnedMem((rows_small,cols_small))
        
    def ProcessFrame(self,lr):
        self.frame_device.upload(self.frame.array,self.stream)
        cv.cuda.resize(self.frame_device,(cols_big,rows_big),self.frame_device_big,stream=self.stream)
        self.bgmog2.apply(self.frame_device_big,lr,self.stream,self.fg_device_big)
        cv.cuda.resize(self.fg_device_big,self.fg_device.size(),self.fg_device,stream=self.stream)
        self.fg_device.download(self.stream,self.fg_host.array)
        self.stream.waitForCompletion() # immediate wait
        if(self.store_res):
            self.res.append(np.copy(self.fg_host.array))
        
    def Frame(self):
        return self.frame.array
    
proc_frame_cuda3 = ProcFrameCuda3(rows_small,cols_small,rows_big,cols_big,check_res)
In [36]:
#export
gpu_time_3, n_frames = ProcVid1(proc_frame_cuda3,lr)
print(f'GPU 3 (overlap host and device - attempt 1): {n_frames} frames, {gpu_time_3:.2f} ms/frame')
print(f'Incremental speedup: {gpu_time_2/gpu_time_3:.2f}')
print(f'Speedup over GPU baseline: {gpu_time_0/gpu_time_3:.2f}')
print(f'Speedup over CPU: {cpu_time_1/gpu_time_3:.2f}')
GPU 3 (overlap host and device - attempt 1): 100 frames, 1.72 ms/frame
Incremental speedup: 1.27
Speedup over GPU baseline: 2.18
Speedup over CPU: 16.18
In [37]:
if check_res: CheckFg(gpu_res,proc_frame_cuda3.res)

Analysis

(Figure: Nvidia Visual Profiler output after pinning the host memory.)

Observations: The output is now more intuitive; that said, all we have done is replace the calls to cudaDeviceSynchronize() with calls to cudaStreamSynchronize().

Hypothesis: We are issuing asynchronous calls to the device and then immediately waiting on the host for them to complete.

cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream) async kernel 1
bgmog2.apply(frame_device_big,lr,stream,fg_device_big) async kernel 2
cv.cuda.resize(fg_device_big,fg_device.size(),fg_device,stream=stream) async kernel 3
fg_device.download(stream,fg_host.array) async copy
stream.waitForCompletion() block until kernels 1-3 and copy have finished

What we really want to do is overlap host and device computation by issuing asynchronous calls to the device and then performing processing on the host, before waiting for the asynchronous device calls to return. For two frames this would be:

frame_device.upload(frame[0].array,stream) async copy HtoD, frame 0
cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream) async kernel 1, frame 0
bgmog2.apply(frame_device_big,lr,stream,fg_device_big) async kernel 2, frame 0
cv.cuda.resize(fg_device_big,fg_device.size(),fg_device,stream=stream) async kernel 3, frame 0
fg_device.download(stream,fg_host.array) async copy DtoH, frame 0
ret,_ = cap.read(frame[1].array) host read frame 1
stream.waitForCompletion() block until kernels 1-3 and copy have finished for frame 0

Next: Move the position of the synchronization point to after a new frame has been read, as described above. To do this we also need to increase the number of host frame containers to two, because moving the sync point means frame 0 may still be in the process of being uploaded when we decode frame 1. That is, when we call

ret,_ = cap.read(frame[1].array) we have not synced, and we have no way of knowing whether the previous call to frame_device.upload(frame[0].array,stream) has finished, hence we need to write to frame[1].array. A sketch of the resulting ping-pong buffering is given below.
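
A minimal standalone sketch of the ping-pong buffering (hypothetical names; ProcFrameCuda4 below folds the same logic into its Frame() method):

frames_in = [PinnedMem((rows_small, cols_small, 3)) for _ in range(2)]
i_writable = 0

def NextFrameBuffer():
    # Cycle to the buffer the device is not potentially still reading from.
    global i_writable
    i_writable = (i_writable + 1) % len(frames_in)
    return frames_in[i_writable].array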

Overlap host and device computation – attempt 2

In [38]:
#export
def ProcVid2(proc_frame,lr,simulate=False):
    cap = cv.VideoCapture(vidPath)
    if (cap.isOpened()== False): 
        print("Error opening video stream or file")
        return
    n_frames = 0
    start = time.time()    
    while(cap.isOpened()):
        ret,_ = cap.read(proc_frame.Frame())
        if ret == True:
            n_frames += 1
            if not simulate:
                proc_frame.ProcessFrame(lr)
        else:
            break
    proc_frame.Sync()
    end = time.time()    
    cap.release()
    return (end - start)*1000/n_frames, n_frames
In [39]:
#export
class ProcFrameCuda4:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,store_res=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.store_res = store_res
        self.res = []
        self.bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()
        self.stream = cv.cuda_Stream()
        self.frame_num = 0
        self.i_writable_mem = 0
        self.frames_in = [PinnedMem((rows_small,cols_small,3)),PinnedMem((rows_small,cols_small,3))]
        self.frame_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3)
        self.frame_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3)
        self.fg_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1)
        self.fg_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1)
        self.fg_host = PinnedMem((rows_small,cols_small))
        
    def ProcessFrame(self,lr):
        self.frame_num += 1
        if(self.frame_num > 1):
            self.stream.waitForCompletion() # wait after we have read the next frame
            if(self.store_res):
                self.res.append(np.copy(self.fg_host.array))
        self.frame_device.upload(self.frames_in[self.i_writable_mem].array, self.stream)
        cv.cuda.resize(self.frame_device, (cols_big,rows_big), self.frame_device_big, stream=self.stream)
        self.bgmog2.apply(self.frame_device_big, lr, self.stream, self.fg_device_big )
        cv.cuda.resize(self.fg_device_big, self.fg_device.size(), self.fg_device, stream=self.stream)
        self.fg_device.download(self.stream,self.fg_host.array)
        
    def Frame(self):
        self.i_writable_mem = (self.i_writable_mem + 1) % len(self.frames_in)
        return self.frames_in[self.i_writable_mem].array
    
    def Sync(self):
        self.stream.waitForCompletion()
        if(self.store_res):
            self.res.append(np.copy(self.fg_host.array))
    
proc_frame_cuda4 = ProcFrameCuda4(rows_small,cols_small,rows_big,cols_big,check_res)
In [42]:
#export
gpu_time_4, n_frames = ProcVid2(proc_frame_cuda4,lr)
print(f'GPU 4 (overlap host and device - attempt 2): {n_frames} frames, {gpu_time_4:.2f} ms/frame')
print(f'Incremental speedup: {gpu_time_3/gpu_time_4:.2f}')
print(f'Speedup over GPU baseline: {gpu_time_0/gpu_time_4:.2f}')
print(f'Speedup over CPU: {cpu_time_1/gpu_time_4:.2f}')
GPU 4 (overlap host and device - attempt 2): 100 frames, 1.72 ms/frame
Incremental speedup: 1.00
Speedup over GPU baseline: 2.18
Speedup over CPU: 16.18
In [43]:
if check_res: CheckFg(gpu_res,proc_frame_cuda4.res)

Analysis

(Figure: Nvidia Visual Profiler output after moving the synchronization point.)

Observations: Changing the synchronization point seems to have done just that: moved the wait to before the frame is processed instead of after, as we had before. If we examine the profiler output, the runtime API calls still line up perfectly with the device calls in Stream 1616, implying that we are not seeing any host/device processing overlap.

Interestingly, cudaStreamSynchronize() (stream.waitForCompletion()) appears to happen directly before each frame is processed on the device, and not some time after, as we would expect from moving it to after the frame has been read on the host. So what's going on?

Hypothesis: This is most likely because we are working on Windows, where the GPU is a Windows Display Driver Model (WDDM) device. See below for more details.

CUDA driver has a software queue for WDDM devices to reduce the average overhead of submitting command buffers to the WDDM KMD driver

This would cause all the device calls from the previous frame to be queued and then issued when we call stream.waitForCompletion(), which could explain the profiler output.

Next: Test the hypothesis by forcing the CUDA driver to dispatch all queued calls by issuing a call to stream.queryIfComplete(), as shown below.

frame_device.upload(frames_in[0].array, stream) async copy HtoD, frame 0
cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream) async kernel 1, frame 0
bgmog2.apply(frame_device_big, lr, stream, fg_device_big) async kernel 2, frame 0
cv.cuda.resize(fg_device_big,fg_device.size(),fg_device,stream=stream) async kernel 3, frame 0
fg_device.download(stream,fg_host.array) async copy DtoH, frame 0
stream.queryIfComplete() force WDDM to dispatch any queued device calls
ret,_ = cap.read(frames_in[1].array) host read frame 1
stream.waitForCompletion() block until kernels 1-3 and copy have finished for frame 0

Overlap host and device computation – attempt 3

In [44]:
#export
class ProcFrameCuda5:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,store_res=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.store_res = store_res
        self.res = []
        self.bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()
        self.stream = cv.cuda_Stream()
        self.frame_num = 0
        self.i_writable_mem = 0
        self.frames_in = [PinnedMem((rows_small,cols_small,3)),PinnedMem((rows_small,cols_small,3))]
        self.frame_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3)
        self.frame_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3)
        self.fg_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1)
        self.fg_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1)
        self.fg_host = PinnedMem((rows_small,cols_small))
        
    def ProcessFrame(self,lr):
        self.frame_num += 1
        if(self.frame_num > 1):
            self.stream.waitForCompletion() # wait after we have read the next frame
            if(self.store_res):
                self.res.append(np.copy(self.fg_host.array))
        self.frame_device.upload(self.frames_in[self.i_writable_mem].array, self.stream)
        cv.cuda.resize(self.frame_device, (cols_big,rows_big), self.frame_device_big, stream=self.stream)
        self.bgmog2.apply(self.frame_device_big, lr, self.stream, self.fg_device_big )
        cv.cuda.resize(self.fg_device_big, self.fg_device.size(), self.fg_device, stream=self.stream)
        self.fg_device.download(self.stream,self.fg_host.array)
        self.stream.queryIfComplete() # kick WDDM
        
    def Frame(self):
        self.i_writable_mem = (self.i_writable_mem + 1) % len(self.frames_in)
        return self.frames_in[self.i_writable_mem].array
    
    def Sync(self):
        self.stream.waitForCompletion()
        if(self.store_res):
            self.res.append(np.copy(self.fg_host.array))
    
proc_frame_cuda5 = ProcFrameCuda5(rows_small,cols_small,rows_big,cols_big,check_res)
In [47]:
#export
gpu_time_5, n_frames = ProcVid2(proc_frame_cuda5,lr)
print(f'GPU 5 (overlap host and device - attempt 3): {n_frames} frames, {gpu_time_5:.2f} ms/frame')
print(f'Incremental speedup: {gpu_time_4/gpu_time_5:.2f}')
print(f'Speedup over GPU baseline: {gpu_time_0/gpu_time_5:.2f}')
print(f'Speedup over CPU: {cpu_time_1/gpu_time_5:.2f}')
GPU 5 (overlap host and device - attempt 3): 100 frames, 1.25 ms/frame
Incremental speedup: 1.37
Speedup over GPU baseline: 3.00
Speedup over CPU: 22.25
In [48]:
if check_res:  CheckFg(gpu_res,proc_frame_cuda5.res)

Analysis

(Figure: Nvidia Visual Profiler output after forcing the WDDM driver to dispatch queued calls with stream.queryIfComplete().)

Observations: It appears the WDDM driver was indeed at fault; by including the extra call to stream.queryIfComplete() we have finally overlapped host and device processing. This can be observed in the profiler output, where the gap between runtime API calls (the host processing described in (4)) overlaps work being performed on the device in Stream 2017. More importantly, the device is almost saturated, with only a small gap between the device calls for each frame in Stream 2017. So what is causing this small gap?

Hypothesis: From the profiler output it appears that we are still waiting on the host for the device processing to finish (stream.waitForCompletion()). Therefore we need a way to issue more work to the device before waiting on the host. This can easily be achieved by using multiple device streams. As shown below, with just two streams we can issue commands to the device for frames 0 and 1 before synchronizing on frame 0, leaving even more time for the processing on the device to complete.

frames_device[0].upload(frames[0].array,streams[0]) async copy HtoD, frame 0
cv.cuda.resize(frames_device[0],(cols_big,rows_big),frames_device_big[0],stream=streams[0]) async kernel 1, frame 0
bgmog2.apply(frames_device_big[0],lr,streams[0],fgs_device_big[0]) async kernel 2, frame 0
cv.cuda.resize(fgs_device_big[0],fgs_device[0].size(),fgs_device[0],stream=streams[0]) async kernel 3, frame 0
fgs_device[0].download(streams[0],fgs_small[0].array) async copy DtoH, frame 0
ret,_ = cap.read(frames[1].array) host read frame 1
frames_device[1].upload(frames[1].array,streams[1]) async copy HtoD, frame 1
... async kernels 1-3 and the DtoH copy for frame 1, all issued to streams[1]
streams[0].waitForCompletion() block until kernels 1-3 and copy have finished for frame 0

Next: Use multiple streams.

Overlap host and device computation – multiple streams

In [49]:
#export
class SyncType():
    none = 1
    soft = 2
    hard = 3
    
class ProcFrameCuda6:
    def __init__(self,rows_small,cols_small,rows_big,cols_big,n_streams,store_res=False,sync=SyncType.soft,device_timer=False):
        self.rows_small, self.cols_small, self.rows_big, self.cols_big = rows_small,cols_small,rows_big,cols_big
        self.n_streams = n_streams
        self.store_res = store_res        
        self.sync = sync
        self.bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()
        self.frames_device = []
        self.frames_device_big = []
        self.fgs_device_big = []
        self.fgs_device = []
        self.fgs_small = []   
        self.streams = []
        self.frames = []
        self.InitMem()
        self.InitStreams()
        self.res = []
        self.i_stream = 0        
        self.n_frames = 0
        self.device_timer = device_timer
        if self.device_timer:
            self.events_start = []
            self.events_stop = []
            self.InitEvents()
            self.device_time = 0
        
    def InitMem(self):
        for i in range(0,self.n_streams):
            self.frames.append(PinnedMem((rows_small,cols_small,3))) 
            self.frames_device.append(cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3))
            self.frames_device_big.append(cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3))
            self.fgs_device_big.append(cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1))
            self.fgs_device.append(cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1))
            self.fgs_small.append(PinnedMem((rows_small,cols_small)))
            
    def InitStreams(self):
        for i in range(0,self.n_streams): 
            if self.sync == SyncType.hard:
                self.streams.append(cv.cuda.Stream_Null())
            elif self.sync == SyncType.soft:
                self.streams.append(cv.cuda_Stream())
                
    def InitEvents(self):
        for i in range(0,self.n_streams):
            self.events_start.append(cv.cuda_Event())
            self.events_stop.append(cv.cuda_Event()) 
            
    def IncStream(self):
        self.i_stream = (self.i_stream+1)%self.n_streams
        
    def ProcessFrame(self,lr):
        self.n_frames += 1
        i = self.i_stream
        self.IncStream()
        stream = self.streams[i]
        if(self.n_frames > self.n_streams and self.sync != SyncType.none):            
            stream.waitForCompletion() # wait once both streams are used               
            #self.events_stop[i].waitForCompletion()
            if self.device_timer:  self.device_time += cv.cuda_Event.elapsedTime(self.events_start[i],self.events_stop[i])
            #print(f'Dev Time: {self.device_time}')
            if(self.store_res):
                self.res.append(np.copy(self.fgs_small[i].array))
        if self.device_timer: self.events_start[i].record(stream)
        self.frames_device[i].upload(self.frames[i].array,stream)
        cv.cuda.resize(self.frames_device[i], (cols_big,rows_big), self.frames_device_big[i], stream=stream)
        self.bgmog2.apply(self.frames_device_big[i], lr, stream, self.fgs_device_big[i])
        cv.cuda.resize(self.fgs_device_big[i], self.fgs_device[i].size(), self.fgs_device[i], stream=stream)
        self.fgs_device[i].download(stream, self.fgs_small[i].array)
        if self.device_timer: self.events_stop[i].record(stream)
        stream.queryIfComplete() # kick WDDM       
        
    def Frame(self):
        return self.frames[self.i_stream].array
    
    def Sync(self):
        # sync on last frames
        if (self.sync == SyncType.none):
            return
        
        for i in range(0,self.n_streams):
            if(not self.streams[self.i_stream].queryIfComplete()):
                self.streams[self.i_stream].waitForCompletion()
            if(self.store_res):
                self.res.append(np.copy(self.fgs_small[self.i_stream].array))
            self.IncStream()        
            
    def FrameTimeMs(self):
        if self.device_timer:
            return self.device_time/self.n_frames
        else:
            return 0
            
proc_frame_cuda6 = ProcFrameCuda6(rows_small,cols_small,rows_big,cols_big,2,check_res,SyncType.soft)
In [50]:
#export
gpu_time_6, n_frames = ProcVid2(proc_frame_cuda6,lr)
print(f'GPU 6 (multiple streams): {n_frames} frames, {gpu_time_6:.2f} ms/frame')
print(f'Incremental speedup: {gpu_time_5/gpu_time_6:.2f}')
print(f'Speedup over GPU baseline: {gpu_time_0/gpu_time_6:.2f}')
print(f'Speedup over CPU: {cpu_time_1/gpu_time_6:.2f}')
GPU 6 (multiple streams): 100 frames, 1.25 ms/frame
Incremental speedup: 1.00
Speedup over GPU baseline: 3.00
Speedup over CPU: 22.25
In [51]:
if check_res: CheckFg(gpu_res,proc_frame_cuda6.res)

Analysis

(Figure: Nvidia Visual Profiler output with two CUDA streams.)

Observations: The device is now completely saturated, with memory operations overlapped with kernel executions. Additionally, the video decoding on the host now runs in parallel with the device code. This is probably the best we can do for this toy execution pipeline.

Hypothesis: Now that the host/device and kernel/memory operations are overlapped, the average time to process each frame should be less than the average time required to process each frame on the device.

Next:

  1. Time the execution on the device using device timers (CUDA events, as sketched after this list) to get the average time required to process each frame on the device. Unfortunately this introduces some overhead, so we will have to compare it against the average frame time calculated without the device timers, which means we may not see the difference that we expect.
  2. Calculate the theoretical average time to process each frame on the host and then the device without overlap, to see what we have gained from host/device and kernel/memory overlap.
  3. Calculate the average wasted time on the host (time where the host could be performing useful operations without increasing the average processing time).
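
A minimal sketch of (1), device-side timing with CUDA events; the timestamps are recorded on the device itself, inside the stream, so the elapsed time excludes host overhead:

start, stop = cv.cuda_Event(), cv.cuda_Event()
stream = cv.cuda_Stream()

start.record(stream)                 # device timestamp before the work
cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream)
stop.record(stream)                  # device timestamp after the work
stop.waitForCompletion()             # make sure the stop event has been recorded
elapsed_ms = cv.cuda_Event.elapsedTime(start, stop)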

Analysis without the profiler

In [52]:
#export
proc_frame_cuda7 = ProcFrameCuda6(rows_small,cols_small,rows_big,cols_big,2,check_res,SyncType.soft,True)
ProcVid2(proc_frame_cuda7,lr)
print(f'Mean times calculated over {n_frames} frames:')
print(f'Time to process each frame on the device: {proc_frame_cuda7.FrameTimeMs():.2f} ms/frame')
print(f'Time to process each frame (host/device): {gpu_time_6:.2f} ms/frame')
print(f'-> Gain from memcpy/kernel overlap if device is saturated: {proc_frame_cuda7.FrameTimeMs()-gpu_time_6:.2f} ms/frame')
hostTime, n_frames = ProcVid2(proc_frame_cuda6, lr, True)
print(f'Time to read and decode each frame on the host: {hostTime:.2f} ms/frame')
print(f'-> Total processing time host + device: {proc_frame_cuda7.FrameTimeMs()+hostTime:.2f} ms/frame')
print(f'-> Gain from host/device overlap: {proc_frame_cuda7.FrameTimeMs()+hostTime - gpu_time_6:.2f} ms/frame')
print(f'-> Currently wasted time on host: {gpu_time_6-hostTime:.2f} ms/frame')
Mean times calculated over 100 frames:
Time to process each frame on the device: 0.97 ms/frame
Time to process each frame (host/device): 1.25 ms/frame
-> Gain from memcpy/kernel overlap if device is saturated: -0.28 ms/frame
Time to read and decode each frame on the host: 0.62 ms/frame
-> Total processing time host + device: 1.59 ms/frame
-> Gain from host/device overlap: 0.34 ms/frame
-> Currently wasted time on host: 0.63 ms/frame

Summary

When calling OpenCV CUDA functions the most effective optimizations (in order of effectiveness/ease of implementation) for this toy problem are given below. Whilst (1) will always be effective, the other optimizations will depend heavily on the CPU/GPU specifications, the data size and the amount of processing which can be performed on the device before returning to the host. Therefore it is always beneficial to use a tool such as the Nvidia Visual Profiler to analyze your pipeline as you make changes. A condensed sketch combining the optimizations follows the list.

  1. Pre-allocate and pass all GpuMat arrays (making sure they are the correct size) as function arguments to avoid them being allocated each time the function is called.
  2. Try to design a processing pipeline which allows memory copies to overlap kernel calls and work to be performed on both the host and the device at the same time.
  3. Use CUDA streams with pinned host memory, and if you are working on Windows consider calling stream.queryIfComplete() to force the WDDM driver to dispatch the CUDA calls.
  4. Use multiple streams.
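
Putting the pieces together, a condensed single-stream sketch of the final pattern (pre-allocation, pinned double-buffered input, a non-default stream, a WDDM kick and a deferred sync point), assuming the globals and the PinnedMem helper defined above:

stream = cv.cuda_Stream()
frames_in = [PinnedMem((rows_small,cols_small,3)) for _ in range(2)]
fg_host = PinnedMem((rows_small,cols_small))
frame_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC3)
frame_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC3)
fg_device_big = cv.cuda_GpuMat(rows_big,cols_big,cv.CV_8UC1)
fg_device = cv.cuda_GpuMat(rows_small,cols_small,cv.CV_8UC1)
bgmog2 = cv.cuda.createBackgroundSubtractorMOG2()

cap = cv.VideoCapture(vidPath)
i, pending = 0, False
while cap.read(frames_in[i].array)[0]:          # decoding overlaps the device work below
    if pending:
        stream.waitForCompletion()              # deferred sync on the previous frame
    frame_device.upload(frames_in[i].array, stream)  # async HtoD
    cv.cuda.resize(frame_device,(cols_big,rows_big),frame_device_big,stream=stream)
    bgmog2.apply(frame_device_big, lr, stream, fg_device_big)
    cv.cuda.resize(fg_device_big, fg_device.size(), fg_device, stream=stream)
    fg_device.download(stream, fg_host.array)   # async DtoH
    stream.queryIfComplete()                    # kick WDDM
    pending, i = True, (i + 1) % 2              # the next read targets the other buffer
if pending:
    stream.waitForCompletion()                  # sync on the final frame
cap.release()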

Run outside the notebook

In [58]:
# taken from https://github.com/fastai/fastai_docs/blob/master/dev_nb/notebook2script.py
!python notebook2script.py "opencv4-cuda-streams.ipynb"
Converted opencv4-cuda-streams.ipynb to exp\nb_opencv4-cuda-streams.py
In [61]:
! python exp/nb_opencv4-cuda-streams.py
CPU 0 (naive): 100 frames, 29.84 ms/frame
GPU 0 (naive): 100 frames, 8.90 ms/frame
Speedup over CPU: 3.35
CPU 1 (pre-allocation): 100 frames, 27.49 ms/frame
Speedup over CPU baseline: 1.09
GPU 1 (pre-allocation): 100 frames, 2.03 ms/frame
Incremental speedup: 4.38
Speedup over CPU: 13.54
GPU 2 (replacing the default stream): 100 frames, 1.87 ms/frame
Incremental speedup: 1.08
Speedup over GPU baseline: 4.75
Speedup over CPU: 14.66
GPU 3 (overlap host and device - attempt 1): 100 frames, 1.72 ms/frame
Incremental speedup: 1.09
Speedup over GPU baseline: 5.18
Speedup over CPU: 16.00
GPU 4 (overlap host and device - attempt 2): 100 frames, 1.72 ms/frame
Incremental speedup: 1.00
Speedup over GPU baseline: 5.18
Speedup over CPU: 16.00
GPU 5 (overlap host and device - attempt 3): 100 frames, 1.09 ms/frame
Incremental speedup: 1.57
Speedup over GPU baseline: 8.14
Speedup over CPU: 25.14
GPU 6 (multiple streams): 100 frames, 0.94 ms/frame
Incremental speedup: 1.17
Speedup over GPU baseline: 9.50
Speedup over CPU: 29.33
Mean times calculated over 100 frames:
Time to process each frame on the device: 0.92 ms/frame
Time to process each frame (host/device): 0.94 ms/frame
-> Gain from memcpy/kernel overlap if device is saturated: -0.02 ms/frame
Time to read and decode each frame on the host: 0.62 ms/frame
-> Total processing time host + device: 1.55 ms/frame
-> Gain from host/device overlap: 0.61 ms/frame
-> Currently wasted time on host: 0.31 ms/frame
[ INFO:0] global E:\Dev\Repos\opencv_fork_1\modules\videoio\src\videoio_registry.cpp (187) cv::`anonymous-namespace'::VideoBackendRegistry::VideoBackendRegistry VIDEOIO: Enabled backends(7, sorted by priority): FFMPEG(1000); GSTREAMER(990); INTEL_MFX(980); MSMF(970); DSHOW(960); CV_IMAGES(950); CV_MJPEG(940)
