Optimization for Video Processing

Paper: https://dl.acm.org/citation.cfm?id=3339326

Optimization is the practice of making the best or most effective use of a situation or resource. Here it addresses the high processing time of the algorithm: software and hardware optimizations were applied to the program until it was able to run in real time on the hardware boards used. This document describes the procedure that was followed.

Software Optimization

A common way to make a process run faster is to divide it into multiple threads that run in parallel. This maximizes CPU utilization and thereby decreases processing time. It is also important to consider how many threads the operating system can handle efficiently. Two threads were found to be the optimum number, producing the transformation with the least processing time; with more than two threads, the communication time between threads slows the process down.

The memory model, developed for the video processing application, is depicted in the figure below. First, the individual frames of the video are pushed into the input buffer one at a time. The first thread takes the first frame, the second thread takes the next frame, and the same pattern continues until the end of the video. In this way the task is divided among threads that execute concurrently. Once a thread finishes a frame, the processed frame is stored in an output queue, from which it is read and shown on the output display.
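The frame hand-off described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's code: the names `process_frame`, `worker` and `run_pipeline`, the use of lists of integers as frames, and the pixel inversion standing in for the real transformation are all hypothetical. In CPython the global interpreter lock keeps pure-Python threads from executing on two cores at once, so the sketch shows the structure of the memory model rather than the speedup.

```python
import queue
import threading

NUM_THREADS = 2   # the thread count reported as optimal in the text
STOP = None       # sentinel telling a worker that no more frames will arrive


def process_frame(frame):
    """Hypothetical stand-in for the real per-frame transformation."""
    return [255 - pixel for pixel in frame]


def worker(input_buffer, output_queue):
    """Take frames from the input buffer until the sentinel is seen."""
    while True:
        item = input_buffer.get()
        if item is STOP:
            break
        index, frame = item
        output_queue.put((index, process_frame(frame)))


def run_pipeline(frames):
    input_buffer = queue.Queue()
    output_queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(input_buffer, output_queue))
        for _ in range(NUM_THREADS)
    ]
    for thread in threads:
        thread.start()

    # Push the frames one at a time, tagged with their position in the video.
    count = 0
    for index, frame in enumerate(frames):
        input_buffer.put((index, frame))
        count += 1
    for _ in threads:
        input_buffer.put(STOP)
    for thread in threads:
        thread.join()

    # Frames may finish out of order, so restore display order by index.
    results = [output_queue.get() for _ in range(count)]
    results.sort(key=lambda pair: pair[0])
    return [frame for _, frame in results]


if __name__ == "__main__":
    print(run_pipeline([[0, 255], [10, 20], [30, 40]]))
```

Each worker touches only the frame it has taken from the buffer, so no locks are needed beyond those inside the two queues.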

Threading

Multithreading is an efficient way of introducing parallelism into code. Threading APIs are generally categorized into:

  • Thread
  • PThread
  • WinThread

Once the work is split into threads, each thread can run on a different core, which lets CPU utilization approach 100%.

My code uses the following memory model, in which each thread processes its own frames independently, reducing problems such as deadlock.

thread memory model

Hardware Optimization

CPU Optimization

The most time-consuming part of the Retinex algorithm is the computation of the log functions, which must be evaluated separately for the R, G and B spectral components of every frame. Even with multiple threads running together, computing the logs for each frame's RGB components remains a heavy load for the processor. To overcome this, the log values for the required range were calculated once at the beginning of the program and stored in memory as a lookup table. Instead of calling the log function, the program reads the precomputed values directly. A memory access is faster than evaluating the function, so the algorithm runs faster. With this adjustment the speed of the algorithm doubled compared with multithreading using regular log computation.
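The lookup-table idea can be sketched as below. The source does not state the table range or how a zero pixel is handled, so this example assumes 8-bit channel values (0 to 255) and stores log(v + 1) to avoid log(0); both choices, and the names `build_log_table`, `LOG_TABLE` and `log_channel`, are assumptions made for illustration.

```python
import math


def build_log_table(size=256):
    """Precompute log(v + 1) for every possible 8-bit value (assumed range).

    The +1 offset is an assumption that keeps log(0) out of the table.
    """
    return [math.log(value + 1) for value in range(size)]


# Built once at program start, then only read from.
LOG_TABLE = build_log_table()


def log_channel(channel):
    """Replace a per-pixel math.log call with a table lookup."""
    return [LOG_TABLE[value] for value in channel]


if __name__ == "__main__":
    red = [0, 1, 128, 255]   # one colour channel of a tiny hypothetical frame
    print(log_channel(red))
```

The same table serves the R, G and B channels, since all three share the same value range under this assumption.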

GPU Optimization

GPUs are generally used to accelerate programs by providing high throughput, so highly parallelizable tasks run efficiently on them. Examples include gaming and artificial neural networks.

Heterogeneous Computing

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding more processors of the same type, but by adding dissimilar coprocessors, usually with specialized processing capabilities for particular tasks. Using GPUs alone to process a complex computational problem is one way to improve performance drastically. However, to get close to the theoretical peak performance of the system, the CPU must also do useful work instead of acting only as a supervisor for the GPU. Combining the different paradigms and programming models of CPUs and GPUs so that they work together on the same problem is the ultimate goal. Whenever a process is deployed in a heterogeneous or parallel manner, one thread, ideally the main program, must act as the master that controls the other slave devices and threads. The master monitors:

  1. Process execution status
  2. Availability of data
  3. Presence of redundant wait time

The diagram below shows one simple heterogeneous computation method.

GPU memory model
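The three things the master watches (execution status, availability of data, and wait time) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two "devices" are ordinary Python threads labelled `cpu` and `gpu`, the doubling operation stands in for real work, and `device` and `run_master` are hypothetical names; no actual GPU API is used.

```python
import queue
import threading
import time
from collections import deque


def device(name, tasks, results):
    """Simulated device: takes work from its own task queue until told to stop."""
    while True:
        item = tasks.get()
        if item is None:
            break
        index, value = item
        results.put((name, index, value * 2))   # placeholder computation


def run_master(data, device_names=("cpu", "gpu")):
    tasks = {name: queue.Queue() for name in device_names}
    results = queue.Queue()
    threads = [
        threading.Thread(target=device, args=(name, tasks[name], results))
        for name in device_names
    ]
    for thread in threads:
        thread.start()

    pending = deque(enumerate(data))              # data not yet dispatched
    busy = {name: False for name in device_names} # execution status per device
    output = [None] * len(data)
    waited = 0.0                                  # time the master spent blocked

    while pending or any(busy.values()):
        # Availability of data: give work to every device that is idle.
        for name in device_names:
            if pending and not busy[name]:
                tasks[name].put(pending.popleft())
                busy[name] = True
        # Execution status and wait time: block until a device reports back.
        start = time.perf_counter()
        name, index, value = results.get()
        waited += time.perf_counter() - start
        busy[name] = False
        output[index] = value

    for name in device_names:
        tasks[name].put(None)
    for thread in threads:
        thread.join()
    return output, waited


if __name__ == "__main__":
    values, waited = run_master([1, 2, 3, 4, 5])
    print(values, f"master waited {waited:.6f}s")
```

A large `waited` value relative to total run time would suggest the master itself is idle and could take on part of the work, which is the point made in the paragraph above about not leaving the CPU as a mere supervisor.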

Implemented Video Processing Algorithms

  1. Single Scale Retinex (SSR)
  2. Multi Scale Retinex (MSR)

Note: please refer to the research papers for an explanation of the algorithms.

Results Obtained

The above algorithms were implemented on embedded hardware boards, and the results obtained are shown below.

1. Single Scale Retinex

- Without optimization on UDOOx86 Ultra

Frames Duration FPS
500 126.10s 4
900 227.32s 4

- With optimization on UDOOx86 Ultra

Frames Duration FPS
500 14.56s 34
900 26.46s 34

- Without optimization on Jetson Tk1

Frames Duration FPS
500 156.10s 3
900 283.32s 3

- With optimization on Jetson Tk1

Frames Duration FPS
500 17.73s 28
900 32.02s 28

2. Multi Scale Retinex

- Without optimization on UDOOx86 Ultra

Frames Duration FPS
500 243.10s 2
900 464.32s 2

- With optimization on UDOOx86 Ultra

Frames Duration FPS
500 18.02s 28
900 31.96s 28

- Without optimization on Jetson Tk1

Frames Duration FPS
500 256.10s 2
900 483.32s 2

- With optimization on Jetson Tk1

Frames Duration FPS
500 20.23s 25
900 35.98s 25
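The FPS column in the tables above is consistent with the frame count divided by the duration, rounded to the nearest whole number. The short sketch below reproduces that arithmetic for one row pair and derives the corresponding speedup; the helper names `fps` and `speedup` are introduced here for illustration only.

```python
def fps(frames, seconds):
    """Average frames per second over a run."""
    return frames / seconds


def speedup(seconds_before, seconds_after):
    """How many times faster the optimized run is for the same frame count."""
    return seconds_before / seconds_after


# Single Scale Retinex on the UDOOx86 Ultra, 500 frames (values from the tables).
before = fps(500, 126.10)
after = fps(500, 14.56)

if __name__ == "__main__":
    print(round(before), round(after))   # prints "4 34"
    print(f"speedup: {speedup(126.10, 14.56):.2f}x")
```

The same two functions can be applied to any other pair of rows, for example the Jetson Tk1 or Multi Scale Retinex results.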

Download

Citation

Majumdar, Jharna & C, Adarsh & Singh, Harshpreet & C, Rahul. (2019). Real-time performance analysis of retinex algorithm on embedded boards for robotics application. ICAICR '19: Proceedings of the Third International Conference on Advanced Informatics for Computing Research. 1-8. 10.1145/3339311.3339326.