Jeonghun (James) Lee: Understanding the basic structure of TensorRT 5 Python

8/17/2019

Understanding the basic structure of TensorRT 5 Python

1. TensorRT Python analysis

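TensorRT is an inference engine written in C++. Recent versions also provide a Python API, which makes it easy to see how TensorRT works and to modify that behavior.
I ran the tests in this post to learn how to control TensorRT from Python instead of C++.
The C++ API is a headache, so let's skip it for now.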

1.1 TensorRT Python: related notes

TensorRT 5.0 basic structure and operation (C++/Python reference)

An earlier post that explains TensorRT's basic operation using C++ and Python:
https://ahyuo79.blogspot.com/2019/06/tensorrt-50-jetson-agx-xavier.html

Based on a JetPack 4.2.1 installation

  https://ahyuo79.blogspot.com/2019/07/jetpack-421.html

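As explained in earlier posts, TensorRT supports a fixed set of layers, and it can support a model's network only as far as those layers cover it.
So if a model contains a layer that TensorRT does not support, you have to implement it yourself as a custom layer (this was also covered earlier).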
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#extending
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Plugin/pyPlugin.html

See the YOLO custom layer in DeepStream (implemented in C++) and the IPlugin notes
  https://ahyuo79.blogspot.com/2019/08/ds-sdk-40-test4-iplugin-sample.html
  https://ahyuo79.blogspot.com/2019/08/deepstream-sdk-40-plugin-gstreamer.html

TensorRT Support Matrix
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html

TensorRT Layer
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Graph/Layers.html

1.2 TensorRT basic structure

TensorRT is the inference engine provided by NVIDIA. Its core is written in C++, the basic structure is the same for the C++ and Python APIs, and it can be thought of as having two modes of use.

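Using the Caffe/UFF/ONNX parsers

For compatibility with existing frameworks, if you already have a trained model, you can use TensorRT's UFF/Caffe/ONNX parser to construct the model's network easily,
then build a TensorRT engine from it and go straight to inference.
However, layers that TensorRT does not support must be implemented as custom layers and hooked in.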

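Building the network directly in TensorRT

TensorRT also lets you build a network directly by adding layers one at a time. The catch is that TensorRT itself cannot produce the trained weight and bias values, so these still have to be brought in from another framework.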

1.3 TensorRT Python requirements (installing pycuda)

The packages below are required to run the TensorRT Python samples, so install them first.

$ cd /usr/src/tensorrt/samples/python/introductory_parser_samples
$ cat requirements.txt   
numpy
Pillow
pycuda
$ export CUDA_INC_DIR=/usr/local/cuda-10.0/include

$ sudo python2 -m pip install -r requirements.txt   // fails while installing pycuda because of a CUDA problem (the #include of the CUDA .h header)
or  
$ sudo python3 -m pip install -r requirements.txt  // fails while installing pycuda because of a CUDA problem (the #include of the CUDA .h header)

$ pip list or pip3 list  // check the packages from requirements.txt above 
...

If you write TensorRT code in C++ you would not need pycuda, but driving TensorRT from Python runs into this pycuda problem, and pycuda is mandatory, so it has to be installed.

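Installing pycuda manually (required)

pycuda is mandatory, but the pip install above failed every time, so instead of the package I decided to download the source and install it directly as shown below.

// decided to install pycuda manually (it does not show up in pip list); see the site below 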
$ cd ~ 
//https://pypi.org/project/pycuda/#files
$ wget https://files.pythonhosted.org/packages/5e/3f/5658c38579b41866ba21ee1b5020b8225cec86fe717e4b1c5c972de0a33c/pycuda-2019.1.2.tar.gz
$ tar zxvf pycuda-2019.1.2.tar.gz 
$ cd pycuda-2019.1.2/

$ python configure.py --cuda-root=/usr/local/cuda-10.0   // using python2 
or 
$ python3 configure.py --cuda-root=/usr/local/cuda-10.0
// install pycuda 
$ sudo make install
........
Using /usr/local/lib/python2.7/dist-packages
Finished processing dependencies for pycuda==2019.1.2
// installation complete
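
After the install, it can be useful to confirm that the interpreter you plan to use actually sees the packages from requirements.txt. The following is a minimal sketch, assuming Python 3 (it relies on importlib.util.find_spec); the helper name find_missing and the module list are illustrative and not part of the TensorRT samples. Note that Pillow is imported under the name PIL.

```python
import importlib.util

# Import names to look for. These are assumptions for illustration:
# Pillow installs the "PIL" module, and "tensorrt" ships with JetPack.
REQUIRED_MODULES = ["numpy", "PIL", "pycuda", "tensorrt"]


def find_missing(modules):
    """Return the top-level module names this interpreter cannot locate."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules were found")
```

Run it once with python3; a module that was installed only for python2 will be reported as missing here, which is an easy mistake to make when both interpreters are on the system.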

How to install pycuda
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#installing-pycuda
  https://devtalk.nvidia.com/default/topic/1013387/jetson-tx2/is-the-memory-management-method-of-tx1-and-tx2-different-/post/5167500/#5167500
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/gettingStarted.html

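2. Running the TensorRT Caffe/UFF/ONNX parsers

Go to the parser samples as shown below and run them with Python. Each sample uses the model under /usr/src/tensorrt/data/resnet50 in its own format (Caffe/UFF/ONNX) and performs the same inference.
Each one recognizes one of the test sample images under /usr/src/tensorrt/data/resnet50/ (*.jpeg, *.jpg).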

$ cd /usr/src/tensorrt/samples/python/introductory_parser_samples

$ find / -name ResNet50_fp32.caffemodel 2> /dev/null 
/usr/src/tensorrt/data/resnet50/ResNet50_fp32.caffemodel

$ python caffe_resnet50.py    // default settings, /usr/src/tensorrt/data
Correctly recognized /usr/src/tensorrt/data/resnet50/reflex_camera.jpeg as reflex camera

$ python uff_resnet50.py 
Correctly recognized /usr/src/tensorrt/data/resnet50/binoculars.jpeg as binoculars

$ python onnx_resnet50.py 
Correctly recognized /usr/src/tensorrt/data/resnet50/binoculars.jpeg as binoculars
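
The "Correctly recognized" message is decided by a simple string check in the samples: the predicted label, with its spaces replaced by underscores, has to appear in the image file name. A small stand-alone sketch of that check follows; the function name is_correct is made up for illustration, and the "tiger cat" label is only an assumed example of what class_labels.txt might contain.

```python
import os


def is_correct(pred_label, image_path):
    """Mirror the sample's check: the label with spaces turned into
    underscores must be a substring of the file name without extension."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return "_".join(pred_label.split()) in stem


data_dir = "/usr/src/tensorrt/data/resnet50"
print(is_correct("reflex camera", data_dir + "/reflex_camera.jpeg"))  # True
print(is_correct("tiger cat", data_dir + "/tabby_tiger_cat.jpg"))     # True
print(is_correct("binoculars", data_dir + "/reflex_camera.jpeg"))     # False
```

Because this is a substring test on the file name, it only works for test images whose names were chosen to contain the expected label.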

This confirms that ResNet50 works in each of the Caffe/UFF/ONNX formats. Since the basic structure of the three samples is similar, only the UFF (TensorFlow) one is analyzed below.

ResNet50 information
https://datascienceschool.net/view-notebook/958022040c544257aa7ba88643d6c032/

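2.1 uff_resnet50.py source analysis

TensorRT provides three parsers for compatibility with frameworks, and the way each parser is used is almost the same across the three,
so here I pick only UFF, analyze its source in detail, and work out how it operates.

The source below builds an engine from the resnet50-infer-5.uff model using the UFF parser and then uses that engine to run inference.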

$ cat uff_resnet50.py 
# This sample uses a UFF ResNet50 Model to create a TensorRT Inference Engine
import random
from PIL import Image
import numpy as np

import pycuda.driver as cuda
# This import causes pycuda to automatically manage CUDA context creation and cleanup.
import pycuda.autoinit

import tensorrt as trt

import sys, os
sys.path.insert(1, os.path.join(sys.path[0], ".."))

## helpers from common.py (not covered here)
import common

## a class is used to define each setting 
class ModelData(object):
    MODEL_PATH = "resnet50-infer-5.uff"
    INPUT_NAME = "input"
    INPUT_SHAPE = (3, 224, 224)
    OUTPUT_NAME = "GPU_0/tower_0/Softmax"
    # We can convert TensorRT data types to numpy types with trt.nptype()
    DTYPE = trt.float32

# You can set the logger severity higher to suppress messages (or lower to display more messages).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

## Host (CPU) and device (GPU) buffers are allocated separately, and in different ways 
## https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#perform_inference_python
## The link above describes the steps below 
## Set up the host (CPU) and device (GPU) buffers and create a stream 
# Allocate host and device buffers, and create a stream.
def allocate_buffers(engine):
    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(ModelData.DTYPE))
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(ModelData.DTYPE))
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input, d_input, h_output, d_output, stream

## Manages the host (CPU) and device (GPU) buffers and runs inference 
def do_inference(context, h_input, d_input, h_output, d_output, stream):
    ## copy CPU -> GPU
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    ## after the copy to the GPU, run inference 
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    ## copy the result from GPU memory back to CPU memory
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()

## Builds the TensorRT engine, using the UFF parser to define each item 
## Registers the input name/output name and input shape of the UFF file, resnet50-infer-5.uff  
## https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt_500rc/tensorrt-api/python_api/coreConcepts.html
## See the link above for details 
# The UFF path is used for TensorFlow models. You can convert a frozen TensorFlow graph to UFF using the included convert-to-uff utility.
def build_engine_uff(model_file):
    # You can set the logger severity higher to suppress messages (or lower to display more messages).
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        # Workspace size is the maximum amount of memory available to the builder while building an engine.
        # It should generally be set as high as possible.
        builder.max_workspace_size = common.GiB(1)
        # We need to manually register the input and output nodes for UFF.
        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE)
        parser.register_output(ModelData.OUTPUT_NAME)
        # Load the UFF model and parse it in order to populate the TensorRT network.
        parser.parse(model_file, network)
        # Build and return an engine.
        return builder.build_cuda_engine(network)

## Fills pagelocked_buffer (h_input), the host (CPU) buffer, with the image after resize, antialias and transpose, flattened to a 1D array 
## CHW (Channel × Height × Width), matching INPUT_SHAPE (3, 224, 224) above
## The incoming test_image path is returned unchanged
def load_normalized_test_case(test_image, pagelocked_buffer):
    # Converts the input image to a CHW Numpy array
    def normalize_image(image):
        # Resize, antialias and transpose the image to CHW.
        c, h, w = ModelData.INPUT_SHAPE
        return np.asarray(image.resize((w, h), Image.ANTIALIAS)).transpose([2, 0, 1]).astype(trt.nptype(ModelData.DTYPE)).ravel()

    # Normalize the image and copy to pagelocked memory.
    np.copyto(pagelocked_buffer, normalize_image(Image.open(test_image)))
    return test_image

def main():
    
    ## common.py is used to set the following 
    ## data_path  = /usr/src/tensorrt/data/resnet50 
    ## data_files = "binoculars.jpeg", "reflex_camera.jpeg", "tabby_tiger_cat.jpg","resnet50-infer-5.uff" ,"class_labels.txt"    
    # Set the data path to the directory that contains the trained models and test images for inference.
    data_path, data_files = common.find_sample_data(description="Runs a ResNet50 network with a TensorRT inference engine.", subfolder="resnet50", find_files=["binoculars.jpeg", "reflex_camera.jpeg", "tabby_tiger_cat.jpg", ModelData.MODEL_PATH, "class_labels.txt"])

    ## test_images holds data_files[0] to data_files[2], i.e. the 3 images (0 <= x < 3)
    # Get test images, models and labels.
    test_images = data_files[0:3]

    ## from index 3 on: uff_model_file and labels_file  
    uff_model_file, labels_file = data_files[3:]

    ## read the labels file and store the labels in line order 
    labels = open(labels_file, 'r').read().split('\n')

    ## build the TensorRT engine with the function above, via the UFF parser  
    # Build a TensorRT engine.
    with build_engine_uff(uff_model_file) as engine:
        # Inference is the same regardless of which parser is used to build the engine, since the model architecture is the same.

        ## allocate the input/output buffers for host (CPU) and device (GPU) with the function above 
        # Allocate buffers and create a CUDA stream.
        h_input, d_input, h_output, d_output, stream = allocate_buffers(engine)

        ## create an execution context from the built engine to prepare for inference 
        # Contexts are used to perform inference.
        with engine.create_execution_context() as context:

            ## pick one of the 3 test_images at random 
            # Load a normalized test case into the host input page-locked buffer.
            test_image = random.choice(test_images)

            ## copy test_image into the host (CPU) buffer; its path is returned unchanged as test_case 
            test_case = load_normalized_test_case(test_image, h_input)

            ## run inference on the device (GPU) from the host (CPU) input buffer and copy the result back to the host (CPU) output buffer
            # Run the engine. The output will be a 1D tensor of length 1000, where each value represents the
            # probability that the image corresponds to that label
            do_inference(context, h_input, d_input, h_output, d_output, stream)

            ## find the index of the largest value in the host output and look up its label (pred)
            # We use the highest probability as our prediction. Its index corresponds to the predicted label.
            pred = labels[np.argmax(h_output)]

            ## compare the randomly chosen test case with the inferred prediction 
            if "_".join(pred.split()) in os.path.splitext(os.path.basename(test_case))[0]:
                print("Correctly recognized " + test_case + " as " + pred)
            else:
                print("Incorrectly recognized " + test_case + " as " + pred)

if __name__ == '__main__':
    main()
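
In allocate_buffers() above, the host buffer length comes from trt.volume(engine.get_binding_shape(i)), i.e. the product of the binding's dimensions, and the device allocation uses the same number of bytes. The arithmetic can be checked without TensorRT. The sketch below uses a plain-Python stand-in named volume (an assumption, not the TensorRT function itself) and takes the output length of 1000 from the sample's own comment.

```python
from functools import reduce
import operator


def volume(shape):
    """Plain-Python stand-in for trt.volume(): product of all dimensions."""
    return reduce(operator.mul, shape, 1)


FLOAT32_BYTES = 4             # ModelData.DTYPE is trt.float32
input_shape = (3, 224, 224)   # ModelData.INPUT_SHAPE in the sample
output_shape = (1000,)        # assumed from the sample comment: 1D tensor of length 1000

input_elems = volume(input_shape)
output_elems = volume(output_shape)
print(input_elems, input_elems * FLOAT32_BYTES)    # 150528 602112
print(output_elems, output_elems * FLOAT32_BYTES)  # 1000 4000
```

So for this model the input buffer is about 600 KB and the output buffer 4 KB, on both the host and the device side.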

Documents needed to understand the full source above (required reading)

Since the sample uses the UFF parser, the source is easier to follow if you start here
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#import_tf_python

TensorRT Python concepts (building an engine; also explained in the link above)
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt_500rc/tensorrt-api/python_api/coreConcepts.html
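
One step in the source that is easy to misread is normalize_image(): despite its name it only resizes, reorders the axes from HWC (what PIL produces) to CHW (what INPUT_SHAPE expects), casts to float32 and flattens. The sketch below repeats the transpose, cast and ravel steps on a tiny made-up 2x2 RGB array so that the reordering is visible; the array contents are arbitrary and the PIL resize is left out.

```python
import numpy as np

# A tiny fake "image" in HWC order: height 2, width 2, 3 channels.
# The value at [h, w, c] is (h * 2 + w) * 3 + c, so channel 0 holds 0, 3, 6, 9.
hwc = np.arange(2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 3)

# Same operations as normalize_image() in the sample, minus the resize.
chw = hwc.transpose([2, 0, 1]).astype(np.float32)
flat = chw.ravel()

print(chw.shape)   # (3, 2, 2)
print(flat[:4])    # [0. 3. 6. 9.]  -> all pixels of channel 0 come first
```

After ravel() the buffer holds every pixel of channel 0, then channel 1, then channel 2, which is the layout the page-locked input buffer is filled with.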

2.2 TensorRT API documentation related to uff_resnet50.py

The values set when the parser class is first created matter, and they differ for each model format.

Caffemodel : model-file, proto-file, output-blob-names
UFF: uff-file, input-dims, uff-input-blob-name, output-blob-names
ONNX: onnx-file
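
To make the difference in required inputs concrete, here is a rough sketch of what the Caffe and ONNX counterparts of build_engine_uff() could look like. It is written from my reading of the TensorRT 5 Python API as used by caffe_resnet50.py and onnx_resnet50.py in the same sample directory, so treat the exact calls as assumptions to verify against those files; the function names and the logger parameter are my own. tensorrt is imported inside each function so the file can be loaded on a machine without TensorRT.

```python
def build_engine_caffe(deploy_file, model_file, output_name, logger):
    """Caffe needs the prototxt (deploy_file), the caffemodel (model_file)
    and the name of the output blob, which is marked by hand."""
    import tensorrt as trt  # only available where TensorRT is installed
    with trt.Builder(logger) as builder, \
            builder.create_network() as network, \
            trt.CaffeParser() as parser:
        builder.max_workspace_size = 1 << 30  # 1 GiB, as in the UFF sample
        model_tensors = parser.parse(deploy=deploy_file, model=model_file,
                                     network=network, dtype=trt.float32)
        # Caffe outputs are looked up by blob name and marked explicitly.
        network.mark_output(model_tensors.find(output_name))
        return builder.build_cuda_engine(network)


def build_engine_onnx(model_file, logger):
    """ONNX needs only the model file: input and output tensors are
    described inside the file itself."""
    import tensorrt as trt  # only available where TensorRT is installed
    with trt.Builder(logger) as builder, \
            builder.create_network() as network, \
            trt.OnnxParser(network, logger) as parser:
        builder.max_workspace_size = 1 << 30  # 1 GiB, as in the UFF sample
        with open(model_file, "rb") as model:
            parser.parse(model.read())
        return builder.build_cuda_engine(network)
```

Compared with the UFF version, Caffe moves the output registration to after parsing, and ONNX needs no registration at all, which matches the parameter lists above.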

Python parsers (Caffe, UFF, ONNX)

Search the source for trt, then check the page for each model format
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Caffe/pyCaffe.html
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Onnx/pyOnnx.html

Basic TensorRT types / core (for the Logger and DataType used above)

Search the source above for trt; the related items can be looked up below
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/FoundationalTypes/pyFoundationalTypes.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Core/pyCore.html

UFF Converter/Operators

https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/uff/uff.html#
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/uff/Operators.html#

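Performance of the TensorRT models above and the settings they need

1.6 Performance comparison of each model (TensorRT)
https://ahyuo79.blogspot.com/2019/08/deepstream-sdk-40-plugin-gstreamer.html

One shortcoming is that the sample does not serialize and deserialize the engine. The engine is never saved, so it is rebuilt on every run and startup is slow.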

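2.3 Speeding up uff_resnet50.py by saving the model engine

The source above builds a new engine every time it runs, which takes a long time, so let's modify it to save the engine to a file when it is first built
and reuse that engine file on every later run.

This is the same idea as the model-engine file used earlier with DeepStream.
With the change below, the engine build time disappears from every run after the first, so the script starts much faster.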

$ sudo vi uff_resnet50.py 
.............
def main():

# Set the data path to the directory that contains the trained models and test images for inference.
    data_path, data_files = common.find_sample_data(description="Runs a ResNet50 network with a TensorRT inference engine.", subfolder="resnet50", find_files=["binoculars.jpeg", "reflex_camera.jpeg", "tabby_tiger_cat.jpg", ModelData.MODEL_PATH, "class_labels.txt"])
    # Get test images, models and labels.
    test_images = data_files[0:3]
    uff_model_file, labels_file = data_files[3:]
    labels = open(labels_file, 'r').read().split('\n')

   
    ##  From the second run on, open the saved engine file and deserialize it (no engine build time)
    ##  Almost the same as the code below, but build_engine_uff is not needed, so inference can start right away 
    ##  The file check lets the first run, when no engine file exists yet, fall through to the build step 
    ##  sys.exit() at the end keeps the build code below from running 


    if os.path.exists("sample_uff.engine"):
        with open("sample_uff.engine","rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            engine = runtime.deserialize_cuda_engine(f.read())
            # Inference is the same regardless of which parser is used to build the engine, since the model architecture is the same.
            # Allocate buffers and create a CUDA stream.
            h_input, d_input, h_output, d_output, stream = allocate_buffers(engine)
            # Contexts are used to perform inference.
            with engine.create_execution_context() as context:
                # Load a normalized test case into the host input page-locked buffer.
                test_image = random.choice(test_images)
                test_case = load_normalized_test_case(test_image, h_input)
                # Run the engine. The output will be a 1D tensor of length 1000, where each value represents the
                # probability that the image corresponds to that label
                do_inference(context, h_input, d_input, h_output, d_output, stream)
                # We use the highest probability as our prediction. Its index corresponds to the predicted label.
                pred = labels[np.argmax(h_output)]
                if "_".join(pred.split()) in os.path.splitext(os.path.basename(test_case))[0]:
                    print("2nd Correctly recognized " + test_case + " as " + pred)
                else:
                    print("2nd Incorrectly recognized " + test_case + " as " + pred)
                sys.exit()


    ##  Only on the first run: build the engine, serialize it, and save it to a file (see the TensorRT manual) 

    # Build a TensorRT engine.
    with build_engine_uff(uff_model_file) as engine:
        with open("sample_uff.engine","wb") as f:
            f.write(engine.serialize())
        # Inference is the same regardless of which parser is used to build the engine, since the model architecture is the same.
        # Allocate buffers and create a CUDA stream.
        h_input, d_input, h_output, d_output, stream = allocate_buffers(engine)
        # Contexts are used to perform inference.
        with engine.create_execution_context() as context:
            # Load a normalized test case into the host input page-locked buffer.
            test_image = random.choice(test_images)
            test_case = load_normalized_test_case(test_image, h_input)
            # Run the engine. The output will be a 1D tensor of length 1000, where each value represents the
            # probability that the image corresponds to that label
            do_inference(context, h_input, d_input, h_output, d_output, stream)
            # We use the highest probability as our prediction. Its index corresponds to the predicted label.
            pred = labels[np.argmax(h_output)]
            if "_".join(pred.split()) in os.path.splitext(os.path.basename(test_case))[0]:
                print("Correctly recognized " + test_case + " as " + pred)
            else:
                print("Incorrectly recognized " + test_case + " as " + pred)

$ sudo python uff_resnet50.py  // sudo is needed for write permission on the engine file 
2nd Correctly recognized /usr/src/tensorrt/data/resnet50/reflex_camera.jpeg as reflex camera

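With the modified source the run time is clearly different from before.

On the first run, the engine is serialized and saved to the file sample_uff.engine
From the second run on, the saved engine file sample_uff.engine is read, deserialized, and used right away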
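
The same build-once, reuse-afterwards pattern can be written as a small helper instead of two copies of the inference code. This is only a sketch of the idea: load_or_build and its parameters are hypothetical names, and the demo uses plain strings in place of a real engine. With TensorRT, build would call build_engine_uff(uff_model_file), serialize would call engine.serialize(), and deserialize would be runtime.deserialize_cuda_engine, as in the modified source above.

```python
import os
import tempfile


def load_or_build(path, build, serialize, deserialize):
    """Return (obj, from_cache). Reuse the file at `path` when it exists,
    otherwise build the object and save its serialized form for next time."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return deserialize(f.read()), True
    obj = build()
    with open(path, "wb") as f:
        f.write(serialize(obj))
    return obj, False


# Demo with a string standing in for the engine (no TensorRT needed).
cache_path = os.path.join(tempfile.mkdtemp(), "demo.engine")
first, cached_first = load_or_build(cache_path, lambda: "engine", str.encode, bytes.decode)
second, cached_second = load_or_build(cache_path, lambda: "engine", str.encode, bytes.decode)
print(cached_first, cached_second)  # False True
```

One thing to keep in mind with any engine cache: a serialized TensorRT engine is tied to the TensorRT version and GPU it was built on, so the file should be deleted and rebuilt after an upgrade or when moving to another device.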

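3. Building a network directly in TensorRT

What if you do not use a model parser (UFF/Caffe/ONNX) to construct the network and instead build the network directly in TensorRT?
In that case the results of training, i.e. the weights and biases, still have to come from somewhere for the network to work.

Looking at TensorRT's internals, most layers are supported, so if you understand the API and have the time, you can probably construct whatever network you want.
The problem is that you need the values obtained from training. Since TensorRT is an inference engine and does not train, in practice they have to be taken from the framework where the model was already trained.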

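3.1 NVIDIA TensorRT network construction example

I first wrote a script based on the example at the site below and could not see why it would not run. After reading the documentation more carefully, it turned out that the example assumes weights trained in PyTorch are loaded and applied.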

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#create_network_python

$ cd ~ 
$ vi test.py
import tensorrt as trt

INPUT_NAME = "input"
INPUT_SHAPE = (3, 224, 224)
OUTPUT_NAME = "Markoutput"
OUTPUT_SIZE = 400

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Create the builder and network
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
 # Configure the network layers based on the weights provided. In this case, the weights are imported from a pytorch model. 
 # Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
 input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

 # Add a convolution layer
 conv1_w = weights['conv1.weight'].numpy()
 conv1_b = weights['conv1.bias'].numpy()
 conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
 conv1.stride = (1, 1)

 pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
 pool1.stride = (2, 2)
 conv2_w = weights['conv2.weight'].numpy()
 conv2_b = weights['conv2.bias'].numpy()
 conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
 conv2.stride = (1, 1)

 pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
 pool2.stride = (2, 2)

 fc1_w = weights['fc1.weight'].numpy()
 fc1_b = weights['fc1.bias'].numpy()
 fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

 relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

 fc2_w = weights['fc2.weight'].numpy()
 fc2_b = weights['fc2.bias'].numpy()
 fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

 fc2.get_output(0).name =OUTPUT_NAME
 network.mark_output(fc2.get_output(0))

$ python test.py
...
    conv1_w = weights['conv1.weight'].numpy()
NameError: name 'weights' is not defined

TensorRT Python API references for understanding the code above

TensorRT Python Manual-Network (the methods called on network in the source above)
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Graph/Network.html

TensorRT Python Manual-Type (INPUT_SHAPE in the source above)
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/FoundationalTypes/Dims.html

TensorRT Python Manual-Type (weights in the source above)
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/FoundationalTypes/Weights.html?highlight=weights

TensorRT Python Manual-Layer (fc1/fc2 in the source above)
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Graph/LayerBase.html
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Graph/Layers.html?highlight=ifullyconnectedlayer#ifullyconnectedlayer

3.2 Running the TensorRT PyTorch MNIST sample

TensorRT pytorch Sample
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#network_api_pytorch_mnist

$ cd /usr/src/tensorrt/samples/python/network_api_pytorch_mnist
$ cat requirements.txt  // the PyTorch wheels listed here support x86 only 
numpy
https://download.pytorch.org/whl/cpu/torch-1.0.0-cp37-cp37m-linux_x86_64.whl  ; python_version=="3.7"
https://download.pytorch.org/whl/cpu/torch-1.0.0-cp36-cp36m-linux_x86_64.whl  ; python_version=="3.6"
https://download.pytorch.org/whl/cpu/torch-1.0.0-cp35-cp35m-linux_x86_64.whl  ; python_version=="3.5"
https://download.pytorch.org/whl/cpu/torch-1.0.0-cp27-cp27mu-linux_x86_64.whl ; python_version=="2.7"
torchvision==0.2.1
Pillow
pycuda

// downloaded from the site below; even after upgrading pip the version did not match, so it would not install 
$ sudo -H pip install torch-1.0.0a0+8601b33-cp27-cp27mu-linux_aarch64.whl

I will retry the installation later; for now I proceed by comparing the sources directly.

How to install the PyTorch ARM version
https://devtalk.nvidia.com/default/topic/1041716/pytorch-install-problem/
https://developer.ridgerun.com/wiki/index.php?title=Xavier/Deep_Learning/Deep_Learning_Tutorials/Jetson_Reinforcement

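3.3 PyTorch model / TensorRT sample source analysis

Analyzing the PyTorch and TensorRT sources side by side makes the way they work reasonably easy to follow.
I have almost no deep learning or PyTorch background, so I am trying to understand it the easy way, through the Python code.
Python code really is readable, which makes it easy to understand and convenient to use.

The source consists of the MNIST model (PyTorch) and the MNIST sample (TensorRT). Let's compare the two and work out how they operate.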

MNIST Model Source (Pytorch)

$ cat model.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable

import numpy as np
import os

from random import randint


## The network is built with PyTorch; see the PyTorch example below
## https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html

# Network
class Net(nn.Module):

## defines each layer name and the layer itself 
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(800, 500)
        self.fc2 = nn.Linear(500, 10)

## the structure of the network as implemented with PyTorch 
    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), kernel_size=2, stride=2)
        x = F.max_pool2d(self.conv2(x), kernel_size=2, stride=2)
        x = x.view(-1, 800)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


class MnistModel(object):
    ## sets initial values, loads the dataset for training/testing, then creates the network above 
    def __init__(self):
        self.batch_size = 64
        self.test_batch_size = 100
        self.learning_rate = 0.01
        self.sgd_momentum = 0.9
        self.log_interval = 100
        # Fetch MNIST data set.
        self.train_loader = torch.utils.data.DataLoader(
            datasets.MNIST('/tmp/mnist/data', train=True, download=True, transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,))
                ])),
            batch_size=self.batch_size,
            shuffle=True)
        self.test_loader = torch.utils.data.DataLoader(
            datasets.MNIST('/tmp/mnist/data', train=False, transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,))
                ])),
            batch_size=self.test_batch_size,
            shuffle=True)
        self.network = Net()

    ## trains and tests on the dataset loaded above, 5 epochs in total 
    # Train the network for several epochs, validating after each epoch.
    def learn(self, num_epochs=5):
        # Train the network for a single epoch
        def train(epoch):
            self.network.train()
            optimizer = optim.SGD(self.network.parameters(), lr=self.learning_rate, momentum=self.sgd_momentum)
            for batch, (data, target) in enumerate(self.train_loader):
                data, target = Variable(data), Variable(target)
                optimizer.zero_grad()
                output = self.network(data)
                loss = F.nll_loss(output, target)
                loss.backward()
                optimizer.step()
                if batch % self.log_interval == 0:
                    print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(epoch, batch * len(data), len(self.train_loader.dataset), 100. * batch / len(self.train_loader), loss.data.item()))

        # Test the network
        def test(epoch):
            self.network.eval()
            test_loss = 0
            correct = 0
            for data, target in self.test_loader:
                with torch.no_grad():
                    data, target = Variable(data), Variable(target)
                output = self.network(data)
                test_loss += F.nll_loss(output, target).data.item()
                pred = output.data.max(1)[1]
                correct += pred.eq(target.data).cpu().sum()
            test_loss /= len(self.test_loader)
            print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(test_loss, correct, len(self.test_loader.dataset), 100. * correct / len(self.test_loader.dataset)))


        ## the 5 epochs are run here 
        for e in range(num_epochs):
            train(e + 1)
            test(e + 1)

    ## in PyTorch, the state_dict below is said to hold the trained values 
    ## https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict
    def get_weights(self):
        return self.network.state_dict()

    ## appears to pick a test case at random 
    def get_random_testcase(self):
        data, target = next(iter(self.test_loader))
        case_num = randint(0, len(data) - 1)
        test_case = data.numpy()[case_num].ravel().astype(np.float32)
        test_name = target.numpy()[case_num]
        return test_case, test_name

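PyTorch model analysis

The model consists of class Net and class MnistModel. class Net in the source above must be compared with the populate_network function in the source below
to see how the PyTorch and TensorRT APIs differ.

Both sources construct the same network, and TensorRT runs inference using the values obtained from training in PyTorch.

I do not know PyTorch well and have no training experience, so I will settle for this level of understanding for now. The details of how training works will have to wait until I have studied PyTorch or TensorFlow.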
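
One number in class Net that is worth deriving is the 800 in nn.Linear(800, 500) and x.view(-1, 800). It follows from the layer definitions: MNIST images are 28x28, each 5x5 convolution without padding shrinks the side by 4, and each 2x2 max pool with stride 2 halves it. A short check using the standard output-size formulas (the helper names conv_out and pool_out are mine):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window, stride):
    """Output side length of a pooling layer along one dimension."""
    return (size - window) // stride + 1


size = 28                     # MNIST images are 28x28
size = conv_out(size, 5)      # conv1, kernel 5 -> 24
size = pool_out(size, 2, 2)   # max pool 2x2, stride 2 -> 12
size = conv_out(size, 5)      # conv2, kernel 5 -> 8
size = pool_out(size, 2, 2)   # max pool 2x2, stride 2 -> 4
flat = 50 * size * size       # conv2 produces 50 channels
print(size, flat)             # 4 800
```

The TensorRT version of the network has to reproduce exactly these sizes, because the fc1 weights copied from PyTorch have the shape (500, 800).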

MNIST Sample Source (TensorRT)

This source imports the model above and runs it. Like the model source, it also constructs the network, but it does so with the TensorRT Python API.

$ cat sample.py 
import model
from PIL import Image
import numpy as np

import pycuda.driver as cuda
import pycuda.autoinit

import tensorrt as trt

import sys, os
sys.path.insert(1, os.path.join(sys.path[0], ".."))

## common.py from the parent directory (not covered here) 
import common

# You can set the logger severity higher to suppress messages (or lower to display more messages).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class ModelData(object):
    INPUT_NAME = "data"
    INPUT_SHAPE = (1, 28, 28)
    OUTPUT_NAME = "prob"
    OUTPUT_SIZE = 10
    DTYPE = trt.float32

## Network structure built with TensorRT (compare with forward() of the PyTorch class Net above; both are the same network)

def populate_network(network, weights):
    # Configure the network layers based on the weights provided.
    input_tensor = network.add_input(name=ModelData.INPUT_NAME, dtype=ModelData.DTYPE, shape=ModelData.INPUT_SHAPE)

## Comparison of the PyTorch network and the TensorRT network: the ## comments below quote the PyTorch lines 

    ## self.conv1 = nn.Conv2d(1, 20, kernel_size=5)     
    ##  x = F.max_pool2d(self.conv1(x), kernel_size=2, stride=2)  
    conv1_w = weights['conv1.weight'].numpy()
    conv1_b = weights['conv1.bias'].numpy()
    conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
    conv1.stride = (1, 1)
    
    ## kernel_size of the PyTorch pool above becomes window_size here 
    pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    pool1.stride = (2, 2)

    ##  self.conv2 = nn.Conv2d(20, 50, kernel_size=5)
    ## x = F.max_pool2d(self.conv2(x), kernel_size=2, stride=2)
    conv2_w = weights['conv2.weight'].numpy()
    conv2_b = weights['conv2.bias'].numpy()
    conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
    conv2.stride = (1, 1)

    pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
    pool2.stride = (2, 2)
   
    ## x = x.view(-1, 800)
    ## x = F.relu(self.fc1(x))
    fc1_w = weights['fc1.weight'].numpy()
    fc1_b = weights['fc1.bias'].numpy()
    fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

    relu1 = network.add_activation(input=fc1.get_output(0), type=trt.ActivationType.RELU)
   
    ## x = self.fc2(x)
    ## F.log_softmax(x, dim=1)
    fc2_w = weights['fc2.weight'].numpy()
    fc2_b = weights['fc2.bias'].numpy()
    fc2 = network.add_fully_connected(relu1.get_output(0), ModelData.OUTPUT_SIZE, fc2_w, fc2_b)

    fc2.get_output(0).name = ModelData.OUTPUT_NAME
    network.mark_output(tensor=fc2.get_output(0))

def build_engine(weights):
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
        builder.max_workspace_size = common.GiB(1)
        # Populate the network using weights from the PyTorch model.
        populate_network(network, weights)
        # Build and return an engine.
        return builder.build_cuda_engine(network)

# Loads a random test case from pytorch's DataLoader
def load_random_test_case(model, pagelocked_buffer):
    # Select an image at random to be the test case.
    img, expected_output = model.get_random_testcase()
    # Copy to the pagelocked input buffer
    np.copyto(pagelocked_buffer, img)
    return expected_output

def main():
    data_path, _ = common.find_sample_data(description="Runs an MNIST network using a PyTorch model", subfolder="mnist")
    # Train the PyTorch model
    mnist_model = model.MnistModel()
    mnist_model.learn()
    weights = mnist_model.get_weights()
    # Do inference with TensorRT.
    with build_engine(weights) as engine:
        # Build an engine, allocate buffers and create a stream.
        # For more information on buffer allocation, refer to the introductory samples.
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        with engine.create_execution_context() as context:
            case_num = load_random_test_case(mnist_model, pagelocked_buffer=inputs[0].host)
            # For more information on performing inference, refer to the introductory samples.
            # The common.do_inference function will return a list of outputs - we only have one in this case.
            [output] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
            pred = np.argmax(output)
            print("Test Case: " + str(case_num))
            print("Prediction: " + str(pred))

if __name__ == '__main__':
    main()

Brief analysis of the MNIST sample

data_path : set to /usr/src/tensorrt/data/mnist
model.MnistModel() : creates the MNIST model
mnist_model.learn() : trains on MNIST
weights = mnist_model.get_weights() : gets the weights from PyTorch
build_engine(weights) : builds the TensorRT engine from the weights
populate_network(network, weights) : constructs the TensorRT network from the weights
builder.build_cuda_engine(network) : finishes building the TensorRT engine
common.allocate_buffers(engine) : allocates memory (see common.py)
with engine.create_execution_context() as context : prepares the TensorRT engine for execution
load_random_test_case : fills the CPU input buffer and picks the test case (see model.py)
common.do_inference : copies the CPU buffer to the GPU, runs inference, and copies the result back to the CPU
case_num : the label of the randomly chosen test case, pred : the value predicted by inference
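
One difference between the two sources is worth noting: Net.forward() ends with F.log_softmax, while populate_network() marks the fc2 output directly and adds no softmax layer. For this sample that does not change the prediction, because log_softmax only subtracts the same constant from every value, so the position of the maximum stays the same. A small pure-Python check with made-up fc2 outputs (the numbers are arbitrary, not taken from a real run):

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def argmax(values):
    """Index of the largest value, like np.argmax on a 1D array."""
    return max(range(len(values)), key=lambda i: values[i])


# Ten made-up raw outputs, one per digit class.
logits = [1.5, -0.3, 4.2, 0.0, 2.7, -1.1, 0.4, 3.9, -2.0, 1.0]
print(argmax(logits), argmax(log_softmax(logits)))  # 2 2
```

If the actual probabilities were needed rather than just the predicted digit, a softmax would have to be added on the TensorRT side or applied to the output on the host.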

The NVIDIA English manuals are enough to understand TensorRT to some degree, but to really understand deep learning and the structure of each network
there seems to be no way around studying this field separately and in detail.
