There is no doubt that artificial intelligence has created real value in digital industries such as media, advertising, finance and retail, and it holds great promise. While deep learning has been widely applied in all walks of life, the hard truth is that training and inference on large datasets still take a great deal of processing power. Processing an image or other data with a deep neural network involves billions of operations: huge multi-dimensional matrices being multiplied, inverted, reshaped, and Fourier-transformed.
For real-time applications such as facial recognition or the detection of defective products on a production line, it is important that the result is generated as quickly as possible, without the need for an expensive and power-hungry CPU or GPU. So I tried to find the best way to solve this problem on edge devices.
In this paper, I will introduce OpenVINO and TensorRT, deep learning inference engines that run on the CPU or GPU of lower-cost edge devices. First, however, you need to train your model with another deep learning framework such as TensorFlow or PyTorch. In my case, I trained the original model with PyTorch on my local host with an Nvidia 1080Ti and exported it to the ONNX format so that it can be converted easily.
Test Environment:
- Host PC:
  - CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
  - GPU: Nvidia 1080Ti
  - PyTorch 1.5.0 with CUDA 10.1 and cuDNN 7.6.2
  - onnx 1.7.0
- OpenVINO edge device:
  - CPU: Intel Celeron J4105 Processor @ 1.50GHz
  - GPU: Intel® UHD Graphics 600
  - OpenVINO version 2020.3
  - OpenCL 1.2 NEO
- TensorRT edge device:
  - Jetson Nano
  - CPU: Quad-core ARM® Cortex®-A57 MPCore processor
  - GPU: NVIDIA Maxwell™ architecture with 128 NVIDIA CUDA® cores, 0.5 TFLOPS (FP16)
  - JetPack 4.3 with CUDA 10.2 and TensorRT 7.1.0.16
PyTorch to ONNX:
PyTorch has an easy way to export a model to ONNX.
```python
import torch
```
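As a concrete example, a minimal export sketch is shown below; MyModel, the weights file name, and the input shape are placeholders for your own network, not part of the original script.

```python
import torch

# Assumption: MyModel and model.pth are placeholders for your own network and
# trained weights.
model = MyModel()
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

# Dummy input with the shape the network expects at inference time.
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=11,            # try 9, 10 and 11, see the note below
    input_names=["input"],
    output_names=["output"],
)
```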
Note: Some PyTorch operators are not supported in opset versions 9 and 10; the latest version is 11 (for example, bilinear upsampling needs opset version 11). However, OpenVINO and TensorRT do not support every opset perfectly, so you can try exporting the model with opsets 9, 10 and 11 and converting each of them.
OpenVINO:
OpenVINO™ toolkit quickly deploys applications and solutions that emulate human vision. Based on Convolutional Neural Networks (CNNs), the toolkit extends computer vision (CV) workloads across Intel® hardware, maximizing performance. The OpenVINO™ toolkit includes the Deep Learning Deployment Toolkit (DLDT).
OpenVINO™ toolkit:
- Enables CNN-based deep learning inference on the edge
- Supports heterogeneous execution across an Intel® CPU, Intel® Integrated Graphics, Intel® FPGA, Intel® Movidius™ Neural Compute Stick, Intel® Neural Compute Stick 2 and Intel® Vision Accelerator Design with Intel® Movidius™ VPUs
- Speeds time-to-market via an easy-to-use library of computer vision functions and pre-optimized kernels
- Includes optimized calls for computer vision standards, including OpenCV* and OpenCL™
1. Install:
```bash
# 1. Download the openvino toolkit tar file from the following link.
```
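The remaining steps look roughly like the following; the archive name is a placeholder for the 2020.3 Linux package you actually download, and the install path assumes the default location.

```bash
# 2. Extract and install the toolkit (it installs to /opt/intel/openvino by default).
tar -xvzf l_openvino_toolkit_p_2020.3.XXX.tgz
cd l_openvino_toolkit_p_2020.3.XXX
sudo ./install_openvino_dependencies.sh
sudo ./install.sh

# 3. Load the environment variables before using the Model Optimizer or the
#    Inference Engine.
source /opt/intel/openvino/bin/setupvars.sh
```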
2. Enable the Intel GPU device in OpenVINO:
Note: If you want to use the Intel GPU inside Docker, run the image with --device /dev/dri so that the GPU is available in the container. You can verify your Intel GPU and OpenCL version with clinfo, which can be installed with sudo apt install clinfo. There are two ways to install OpenCL for an Intel GPU: one is Beignet, the other is Intel NEO. For some older CPUs, only beignet-dev > 1.3 can enable OpenCL.
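For example, a container started like this can see the integrated GPU; the image name is a placeholder.

```bash
# --device /dev/dri exposes the Intel integrated GPU to the container.
docker run -it --device /dev/dri <openvino_image> /bin/bash

# Inside the container, check that the GPU shows up as an OpenCL device:
clinfo | grep -i "device name"
```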
- NEO:
```bash
# 1. Go to the install_dependencies directory:
```
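Continuing from that step, a sketch of the NEO driver installation using the script that ships with the toolkit; the path assumes the default /opt/intel/openvino install location.

```bash
cd /opt/intel/openvino/install_dependencies

# 2. Install the Intel NEO OpenCL driver for the integrated GPU:
sudo ./install_NEO_OCL_driver.sh

# 3. Re-login (or reboot) and verify the GPU shows up as an OpenCL device:
clinfo | grep -i "device name"
```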
- Beignet:
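On Ubuntu, Beignet can usually be installed from the package repositories; this is only a sketch, and the package version available depends on your distribution.

```bash
# Assumption: the distribution ships a recent enough beignet package
# (beignet-dev > 1.3 is needed for some older CPUs).
sudo apt update
sudo apt install beignet-dev

# Verify the Intel GPU is exposed as an OpenCL device:
clinfo | grep -i "device name"
```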
3. Convert the ONNX model to an OpenVINO model:
In my case, I run an ONNX model in OpenVINO. If you want to run a TensorFlow or other model in OpenVINO, you can find more details in (https://docs.openvinotoolkit.org/latest/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html)
```bash
cd /opt/intel/openvino/deployment_tools/model_optimizer
```
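Run from that directory, a conversion command for the ONNX model might look like the following; the paths, output name, and data type are placeholders to adjust for your own model.

```bash
# Assumption: model.onnx is the file exported earlier; FP16 is also possible
# and halves the size of the generated IR files.
python3 mo.py \
    --input_model /path/to/model.onnx \
    --model_name model \
    --output_dir /path/to/ir/ \
    --data_type FP32
```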
4. Inference Demo:
```python
from openvino.inference_engine import IECore, IENetwork
```
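A minimal inference sketch with the 2020.3 Python API; the IR file names, the target device, and the dummy input are assumptions to adapt to your own model.

```python
import numpy as np
from openvino.inference_engine import IECore, IENetwork

# Assumption: model.xml / model.bin are the IR files produced by the Model Optimizer.
ie = IECore()
net = IENetwork(model="model.xml", weights="model.bin")
input_blob = next(iter(net.inputs))
output_blob = next(iter(net.outputs))

# "GPU" targets the Intel integrated graphics; use "CPU" as a fallback.
exec_net = ie.load_network(network=net, device_name="GPU")

# Dummy NCHW input matching the network's expected shape; replace it with a
# real, preprocessed image.
n, c, h, w = net.inputs[input_blob].shape
image = np.random.rand(n, c, h, w).astype(np.float32)

result = exec_net.infer(inputs={input_blob: image})
print(result[output_blob].shape)
```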
5. Infer Speed test:
- About 13~14 FPS for inference with the FP32 model on the OpenVINO edge device (a rough way to measure this is sketched below).
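The exact benchmarking code is not shown here, but a rough FPS measurement can be done with a simple loop like this, reusing exec_net, input_blob, and image from the demo above.

```python
import time

# Time repeated inference calls and report the average throughput.
n_runs = 100
start = time.time()
for _ in range(n_runs):
    exec_net.infer(inputs={input_blob: image})
elapsed = time.time() - start
print("Inference FPS: %.2f" % (n_runs / elapsed))
```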
TensorRT:
NVIDIA TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
- TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.
- TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks leveraging libraries, development tools and technologies in CUDA-X for artificial intelligence, autonomous machines, high-performance computing, and graphics.
- TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation and natural language processing. Reduced precision inference significantly reduces application latency, which is a requirement for many real-time services, auto and embedded applications.
- You can import trained models from every deep learning framework into TensorRT. After applying optimizations, TensorRT selects platform specific kernels to maximize performance on Tesla GPUs in the data center, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms.
1. Install:
Note: I use a Jetson Nano to run TensorRT; its OS image (JetPack 4.3) already ships with TensorRT 7.1. If you convert the ONNX model on a different device, make sure that the host PC's TensorRT version is the same as the one on the inference device. For more details about installation, please visit (https://github.com/NVIDIA/TensorRT).
2. Convert the ONNX model to a TensorRT model:
Note: There are two ways to convert the ONNX model to a TensorRT model: one is the command-line program trtexec, which can be found in your TensorRT install path; the other is to use the C++ or Python API.
- Command-line program; see more details in (https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec):
```bash
./trtexec --onnx=model.onnx
```
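A slightly fuller command that also serializes the engine for later use; the flags shown are common trtexec options and the file names are placeholders.

```bash
# Build an FP16 engine from the ONNX model and save it for later deserialization.
./trtexec --onnx=model.onnx \
          --saveEngine=model_fp16.trt \
          --fp16 \
          --workspace=1024
```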
- Python API:
Note: the common.py file is shown at the end of this section.
```python
def build_engine(onnx_file_path):
```
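A sketch of such a function with the TensorRT 7 Python API; the workspace size, the explicit-batch flag, and the output file names are reasonable defaults rather than a verbatim copy of my script.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_file_path):
    """Parse an ONNX file and build a TensorRT engine (illustrative sketch)."""
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(explicit_batch) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1 GiB workspace for layer tactics
        with open(onnx_file_path, "rb") as f:
            if not parser.parse(f.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        return builder.build_engine(network, config)

# Serialize the engine so the inference script only needs to deserialize it.
engine = build_engine("model.onnx")
with open("model.trt", "wb") as f:
    f.write(engine.serialize())
```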
3. Inference Demo:
Note: the common.py file is listed below, and pycuda can be installed with pip.
```python
import tensorrt as trt
```
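A minimal inference sketch; it assumes common.py provides allocate_buffers and do_inference_v2 helpers in the style of the NVIDIA TensorRT samples (see the listing below), and the engine path and input shape are placeholders.

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context on import

import common  # assumed helpers: allocate_buffers, do_inference_v2

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_file_path):
    """Deserialize a TensorRT engine that was built and saved earlier."""
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("model.trt")
inputs, outputs, bindings, stream = common.allocate_buffers(engine)

with engine.create_execution_context() as context:
    # Dummy input; replace with a preprocessed image of the expected NCHW shape.
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    np.copyto(inputs[0].host, dummy.ravel())
    results = common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                                     outputs=outputs, stream=stream)
    print(results[0].shape)
```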
- common.py file:
```python
from itertools import chain
```
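A trimmed-down sketch of such a common.py, modeled on the helper file from the NVIDIA TensorRT Python samples; only the two helpers used above are included.

```python
import numpy as np
import pycuda.autoinit  # ensures a CUDA context exists
import pycuda.driver as cuda
import tensorrt as trt


class HostDeviceMem:
    """Pairs a pagelocked host buffer with its device allocation."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


def allocate_buffers(engine):
    """Allocate host/device buffers and a stream for every engine binding."""
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


def do_inference_v2(context, bindings, inputs, outputs, stream):
    """Copy inputs to the device, run the network, and copy outputs back."""
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    stream.synchronize()
    return [out.host for out in outputs]
```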
4. Infer Speed test:
- Jetson Nano: about 11.52 FPS with FP32 and about 14.25 FPS with FP16.