An earlier post analyzed some of the TensorRT Python sources, but because many different Python samples exist, this post first walks through the full set of TensorRT Python samples that NVIDIA provides.
TensorRT release information
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/index.html
1.1 TensorRT Python checklist
- Based on a JetPack 4.2.1 installation
- TensorRT version check
$ dpkg -l | grep TensorRT
ii  graphsurgeon-tf         5.1.6-1+cuda10.0    arm64  GraphSurgeon for TensorRT package
ii  libnvinfer-dev          5.1.6-1+cuda10.0    arm64  TensorRT development libraries and headers
ii  libnvinfer-samples      5.1.6-1+cuda10.0    all    TensorRT samples and documentation
ii  libnvinfer5             5.1.6-1+cuda10.0    arm64  TensorRT runtime libraries
ii  python-libnvinfer       5.1.6-1+cuda10.0    arm64  Python bindings for TensorRT
ii  python-libnvinfer-dev   5.1.6-1+cuda10.0    arm64  Python development package for TensorRT
ii  python3-libnvinfer      5.1.6-1+cuda10.0    arm64  Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev  5.1.6-1+cuda10.0    arm64  Python 3 development package for TensorRT
ii  tensorrt                5.1.6.1-1+cuda10.0  arm64  Meta package of TensorRT
ii  uff-converter-tf        5.1.6-1+cuda10.0    arm64  UFF converter for TensorRT package
https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#installing
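- TensorRT Python basic structure and required package installation
- PyCUDA installation (it was originally installed only for python2; install it for python3 as well)
- Analysis of the TensorRT UFF/Caffe/ONNX parser sources
- Analysis of building a network directly with the TensorRT network API
The samples here are structured much like the earlier ones, so understanding the previous post makes them easier to follow.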
https://ahyuo79.blogspot.com/2019/08/tensorrt-5-python.html
- Check the TensorFlow installation
$ pip3 list    // installed for python3 only
..
tensorboard 1.14.0
tensorflow-estimator 1.14.0
tensorflow-gpu 1.14.0+nv19.7
...
$ pip list     // not installed for python2
See the TensorFlow installation part of section 2.2 (IPlugin SSD check) in the post below.
https://ahyuo79.blogspot.com/2019/08/ds-sdk-40-test4-iplugin-sample.html
https://docs.nvidia.com/deeplearning/frameworks/install-tf-jetson-platform/index.html
1.2 TensorRT Python source layout
Let's look through all of the TensorRT Python samples that NVIDIA provides.
- Full listing of the TensorRT Python sources
$ cd /usr/src/tensorrt/samples/python
$ tree -t
.
├── common.py    // used by almost every sample (import common)
├── end_to_end_tensorflow_mnist
│   ├── model.py
│   ├── README.md
│   ├── requirements.txt
│   └── sample.py
├── engine_refit_mnist
│   ├── model.py
│   ├── README.md
│   ├── requirements.txt
│   └── sample.py
├── fc_plugin_caffe_mnist
│   ├── CMakeLists.txt
│   ├── __init__.py
│   ├── README.md
│   ├── requirements.txt
│   ├── sample.py
│   └── plugin
│       ├── FullyConnected.h
│       └── pyFullyConnected.cpp
├── int8_caffe_mnist
│   ├── calibrator.py
│   ├── README.md
│   ├── requirements.txt
│   └── sample.py
├── network_api_pytorch_mnist    // already covered in the previous post
│   ├── model.py
│   ├── README.md
│   ├── requirements.txt
│   └── sample.py
├── uff_custom_plugin    // UFF custom plugin example (important)
│   ├── CMakeLists.txt
│   ├── __init__.py
│   ├── lenet5.py
│   ├── README.md
│   ├── requirements.txt
│   ├── sample.py
│   └── plugin
│       ├── clipKernel.cu
│       ├── clipKernel.h
│       ├── customClipPlugin.cpp
│       └── customClipPlugin.h
├── uff_ssd
│   ├── CMakeLists.txt
│   ├── detect_objects.py
│   ├── README.md
│   ├── requirements.txt
│   ├── voc_evaluation.py
│   ├── images
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── image_details.txt
│   ├── plugin
│   │   └── FlattenConcat.cpp
│   └── utils
│       ├── boxes.py
│       ├── coco.py
│       ├── engine.py
│       ├── inference.py
│       ├── __init__.py
│       ├── mAP.py
│       ├── model.py
│       ├── paths.py
│       └── voc.py
├── yolov3_onnx
│   ├── coco_labels.txt
│   ├── data_processing.py
│   ├── onnx_to_tensorrt.py
│   ├── README.md
│   ├── requirements.txt
│   └── yolov3_to_onnx.py
├── common.pyc
└── introductory_parser_samples    // already covered in the previous post
    ├── caffe_resnet50.py
    ├── onnx_resnet50.py
    ├── README.md
    ├── requirements.txt
    ├── sample_uff.engine
    └── uff_resnet50.py
1.3 common.py source
$ cat common.py
import os
import argparse
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

try:
    # Sometimes python2 does not understand FileNotFoundError
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

## 1 << 30 is 2^30, so this converts a value given in GiB to bytes
def GiB(val):
    return val * 1 << 30

## The default data root is /usr/src/tensorrt/data; subfolder and find_files refine the path
def find_sample_data(description="Runs a TensorRT Python sample", subfolder="", find_files=[]):
    '''
    Parses sample arguments.
    Args:
        description (str): Description of the sample.
        subfolder (str): The subfolder containing data relevant to this sample
        find_files (str): A list of filenames to find. Each filename will be replaced with an absolute path.
    Returns:
        str: Path of data directory.
    Raises:
        FileNotFoundError
    '''
    # Standard command-line arguments for all samples.
    kDEFAULT_DATA_ROOT = os.path.join(os.sep, "usr", "src", "tensorrt", "data")
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("-d", "--datadir", help="Location of the TensorRT sample data directory.", default=kDEFAULT_DATA_ROOT)
    args, unknown_args = parser.parse_known_args()

    # If data directory is not specified, use the default.
    data_root = args.datadir
    # If the subfolder exists, append it to the path, otherwise use the provided path as-is.
    subfolder_path = os.path.join(data_root, subfolder)
    data_path = subfolder_path
    if not os.path.exists(subfolder_path):
        print("WARNING: " + subfolder_path + " does not exist. Trying " + data_root + " instead.")
        data_path = data_root

    # Make sure data directory exists.
    if not (os.path.exists(data_path)):
        raise FileNotFoundError(data_path + " does not exist. Please provide the correct data path with the -d option.")

    # Find all requested files.
    for index, f in enumerate(find_files):
        find_files[index] = os.path.abspath(os.path.join(data_path, f))
        if not os.path.exists(find_files[index]):
            raise FileNotFoundError(find_files[index] + " does not exist. Please provide the correct data path with the -d option.")

    return data_path, find_files

## Used by allocate_buffers below; holds the host (CPU) buffer host_mem and the device (GPU) buffer device_mem
## Later accessed as, for example, inputs[0].host
# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

## For GPU inference, create separate host (CPU) and device (GPU) buffers for every input and output
# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        ## Host (CPU) and device (GPU) memory are allocated differently
        ## Later accessed as, for example, inputs[0].host
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        ## inputs/outputs start out empty; append a HostDeviceMem to the matching list
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

## Inference also distinguishes host (CPU) and device (GPU)
## Input buffers are copied CPU->GPU, inference runs, and the results are copied back GPU->CPU
## Finally the host (CPU) output buffers are returned, so the caller can use them on the Linux side
# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
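As a quick check of the GiB helper in common.py: in Python, `*` binds tighter than `<<`, so `val * 1 << 30` is evaluated as `(val * 1) << 30`, that is, val multiplied by 2^30. The sketch below copies only that helper, so it runs without TensorRT or PyCUDA; the name `workspace_bytes` is purely illustrative.

```python
def GiB(val):
    # Same expression as in common.py: evaluated as (val * 1) << 30
    return val * 1 << 30

workspace_bytes = GiB(1)
print(workspace_bytes)            # 1073741824
print(GiB(2) == 2 * 1024 ** 3)    # True
```

Because `<<` is only defined for integers, a fractional argument such as GiB(0.5) would raise a TypeError rather than return half a gibibyte.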
Python Sample Section
https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#python_samples_section
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#python_topics
2. Running the TensorFlow MNIST sample
This sample builds a model with TensorFlow's Keras API and then tests it.
$ cd /usr/src/tensorrt/samples/python/end_to_end_tensorflow_mnist
$ cat requirements.txt    // already installed for python3
numpy
Pillow
pycuda
tensorflow
$ sudo mkdir models        // sudo because of directory permissions
$ sudo python3 model.py    // sudo because of download permissions
.........
2019-09-05 14:23:43.819912: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 8s 136us/sample - loss: 0.2010 - acc: 0.9414    // loss and acc follow from model.compile in model.py (sparse categorical cross-entropy, accuracy)
Epoch 2/5
60000/60000 [==============================] - 7s 121us/sample - loss: 0.0803 - acc: 0.9754    // acc is the fraction of correct predictions; loss is the training loss being minimized
Epoch 3/5
60000/60000 [==============================] - 7s 119us/sample - loss: 0.0523 - acc: 0.9838
Epoch 4/5
60000/60000 [==============================] - 7s 118us/sample - loss: 0.0361 - acc: 0.9887
Epoch 5/5
60000/60000 [==============================] - 7s 117us/sample - loss: 0.0291 - acc: 0.9907    // over the 5 epochs acc rises and loss falls
10000/10000 [==============================] - 1s 73us/sample - loss: 0.0600 - acc: 0.9812    // result of the final evaluation on the test set
W0905 14:24:21.071296 547892088848 deprecation_wrapper.py:119] From model.py:78: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.
.........
$ find / -name convert_to_uff.py 2> /dev/null
/usr/lib/python3.6/dist-packages/uff/bin/convert_to_uff.py
/usr/lib/python2.7/dist-packages/uff/bin/convert_to_uff.py
$ sudo python3 /usr/lib/python3.6/dist-packages/uff/bin/convert_to_uff.py models/lenet5.pb
.........
UFF Version 0.6.3
=== Automatically deduced input nodes ===
[name: "input_1"
op: "Placeholder"
attr {
  key: "dtype"
  value {
    type: DT_FLOAT
  }
}
attr {
  key: "shape"
  value {
    shape {
      dim {
        size: -1
      }
      dim {
        size: 28
      }
      dim {
        size: 28
      }
      dim {
        size: 1
      }
    }
  }
}
]
=========================================
=== Automatically deduced output nodes ===
[name: "dense_1/Softmax"
op: "Softmax"
input: "dense_1/BiasAdd"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]
==========================================
Using output node dense_1/Softmax
Converting to UFF graph
DEBUG: convert reshape to flatten node
No. nodes: 13
UFF Output written to models/lenet5.uff
$ ls models/    // confirm the UFF file was generated (PB -> UFF)
lenet5.pb lenet5.uff
// Test Case: the randomly chosen case; Prediction: the value obtained by inference (they match)
$ sudo python3 sample.py    // -d /usr/src/tensorrt/data
Test Case: 1
Prediction: 1
$ ls /usr/src/tensorrt/data/mnist/
0.pgm  3.pgm  6.pgm  9.pgm            LegacyCalibrationTable       lenet5_mnist_frozen.pb  mnistapi.wts      mnist_lenet.caffemodel   mnist.prototxt
1.pgm  4.pgm  7.pgm  batches          lenet5_custom_pool.uff       lenet5.uff              mnist.caffemodel  mnist_mean.binaryproto
2.pgm  5.pgm  8.pgm  deploy.prototxt  lenet5_custom_pool.uff.txt   lenet5.uff.txt          mnistgie.wts      mnist.onnx
UFF Utility
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/uff/uff.html
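The loss and acc columns in the training log come from the `loss='sparse_categorical_crossentropy'` and `metrics=['accuracy']` settings passed to model.compile in model.py. As a rough illustration in plain Python (the probabilities and labels below are made up, and this sketches the definitions rather than Keras's actual implementation): the loss is the negative log of the probability assigned to the true label, averaged over the samples, and the accuracy is the fraction of samples whose most probable class equals the label.

```python
import math

def sparse_categorical_crossentropy(probs, labels):
    # Mean of -log(p[true_label]) over all samples.
    return sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    # Fraction of samples whose most probable class is the true label.
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

# Made-up softmax outputs for 2 samples over 3 classes.
probs = [[0.1, 0.7, 0.2],
         [0.5, 0.3, 0.2]]
labels = [1, 2]

loss = sparse_categorical_crossentropy(probs, labels)
acc = accuracy(probs, labels)
print(round(loss, 4), acc)   # 0.9831 0.5
```

A confident correct prediction contributes a loss near 0, which is why loss falls as acc rises across the epochs in the log.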
2.1 TensorFlow MNIST model source analysis
The sample builds a model with TensorFlow's Keras API from the configured network, downloads the TensorFlow MNIST dataset, runs training and testing,
and finally saves the model to a PB file.
$ cat model.py
import tensorflow as tf
import numpy as np

## Download mnist.npz from Google, then add one dimension so the arrays are (sample count, 28, 28, 1)
def process_dataset():
    ## Load the mnist.npz data and scale the pixel values by dividing by 255
    # Import the data
    (x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    ## Reshape using the training/test sample counts, adding one dimension (4-D)
    # Reshape the data
    NUM_TRAIN = 60000
    NUM_TEST = 10000
    x_train = np.reshape(x_train, (NUM_TRAIN, 28, 28, 1))
    x_test = np.reshape(x_test, (NUM_TEST, 28, 28, 1))
    return x_train, y_train, x_test, y_test

## Create the model network and add each layer
def create_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=[28,28, 1]))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(512, activation=tf.nn.relu))
    model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

## Freeze the given model and save it under the given file name (lenet5.pb)
def save(model, filename):
    # First freeze the graph and remove training nodes.
    output_names = model.output.op.name
    sess = tf.keras.backend.get_session()
    frozen_graph = tf.graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), [output_names])
    frozen_graph = tf.graph_util.remove_training_nodes(frozen_graph)
    # Save the model
    with open(filename, "wb") as ofile:
        ofile.write(frozen_graph.SerializeToString())

def main():
    ## Download the dataset and reshape the training/test arrays (function above)
    x_train, y_train, x_test, y_test = process_dataset()
    ## Build the model from the layers defined above (function above)
    model = create_model()
    ## Train for 5 epochs; verbose=1 shows a progress bar
    # Train the model on the data
    model.fit(x_train, y_train, epochs = 5, verbose = 1)
    ## Evaluate the trained model; x_test: inputs, y_test: expected labels
    # Evaluate the model on test data
    model.evaluate(x_test, y_test)
    ## Save the trained and tested model to models/lenet5.pb
    save(model, filename="models/lenet5.pb")

if __name__ == '__main__':
    main()
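To make the shape change in process_dataset concrete, here is a small sketch that uses a tiny dummy array in place of the MNIST images. The array contents and the count of 4 are placeholders chosen for illustration; only numpy is required, not TensorFlow.

```python
import numpy as np

# Stand-in for x_train: 4 grayscale "images" of 28x28 with uint8 pixels.
num_images = 4
x = np.full((num_images, 28, 28), 255, dtype=np.uint8)

# Scale to [0, 1] as model.py does, then add a trailing channel dimension.
x = x / 255.0
x = np.reshape(x, (num_images, 28, 28, 1))

print(x.shape)   # (4, 28, 28, 1)
print(x.dtype)   # float64
print(x.max())   # 1.0
```

The trailing 1 is the channel dimension that matches `input_shape=[28,28, 1]` in create_model.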
- Basic terminology
step: one update of the weights and biases counts as one step
batch size: the number of data samples used in one step
https://m.blog.naver.com/PostView.nhn?blogId=wideeyed&logNo=221333529176&proxyReferer=https%3A%2F%2Fwww.google.com%2F
- TensorFlow MNIST dataset
- tf.keras.datasets.mnist.load_data
- Understanding the TensorFlow Keras model
- tf.keras.models.Sequential
- model.fit (verbose: 0 is silent, 1 shows a progress bar)
- model.evaluate
- Understanding numpy reshape / ravel
https://rfriend.tistory.com/349
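Applying the step and batch size definitions to this sample: model.fit in model.py does not pass batch_size, so Keras falls back to its default of 32. Under that assumption, the number of steps for the 60000 training images can be worked out as follows (a small arithmetic sketch, not part of the NVIDIA sample).

```python
import math

num_train = 60000   # NUM_TRAIN in model.py
batch_size = 32     # assumed: Keras default when fit() gets no batch_size
epochs = 5          # epochs passed to model.fit

# One step consumes one batch, so round up to cover a partial last batch.
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 1875 9375
```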
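2.2 TensorRT sample.py source analysis
The PB file produced by the training and testing above is converted to UFF; sample.py reads that model, builds a TensorRT engine, and runs it,
then compares the randomly chosen test case with the prediction obtained by inference.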
$ cat sample.py
# This sample uses a UFF MNIST model to create a TensorRT Inference Engine
from random import randint
from PIL import Image
import numpy as np

import pycuda.driver as cuda
# This import causes pycuda to automatically manage CUDA context creation and cleanup.
import pycuda.autoinit

import tensorrt as trt

import sys, os
sys.path.insert(1, os.path.join(sys.path[0], ".."))
## common.py from the parent directory (see above)
import common

# You can set the logger severity higher to suppress messages (or lower to display more messages).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

## ModelData is a class that only holds configuration values
class ModelData(object):
    MODEL_FILE = "lenet5.uff"
    INPUT_NAME ="input_1"
    INPUT_SHAPE = (1, 28, 28)
    OUTPUT_NAME = "dense_1/Softmax"

## Builds the engine (the network is defined through the UFF parser)
## UFF parser input: input_1 with shape (1, 28, 28); compare with create_model in model.py above
## UFF parser output: dense_1/Softmax
def build_engine(model_file):
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_workspace_size = common.GiB(1)
        # Parse the Uff Network
        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE)
        parser.register_output(ModelData.OUTPUT_NAME)
        parser.parse(model_file, network)
        # Build and return an engine.
        return builder.build_cuda_engine(network)

## Puts an image into the host (CPU) input buffer for inference; the test case is one of 0.pgm to 9.pgm, chosen at random
## pagelocked_buffer=inputs[0].host, i.e. the host (CPU) input buffer is passed in
## The image of the randomly chosen test case is read and flattened to 1-D,
## then 1.0 - img/255 is computed and copied into the host input buffer
## The randomly chosen case number is returned unchanged
# Loads a test case into the provided pagelocked_buffer.
def load_normalized_test_case(data_path, pagelocked_buffer, case_num=randint(0, 9)):
    test_case_path = os.path.join(data_path, str(case_num) + ".pgm")
    # Flatten the image into a 1D array, normalize, and copy to pagelocked memory.
    img = np.array(Image.open(test_case_path)).ravel()
    np.copyto(pagelocked_buffer, 1.0 - img / 255.0)
    return case_num

## Walk through main step by step
def main():
    ## data_path is set to /usr/src/tensorrt/data/mnist/
    data_path, _ = common.find_sample_data(description="Runs an MNIST network using a UFF model file", subfolder="mnist")
    ## If MODEL_PATH is defined in the environment, use it
    ## Otherwise take the directory of this script via os.path.dirname(__file__) and use its models subdirectory
    model_path = os.environ.get("MODEL_PATH") or os.path.join(os.path.dirname(__file__), "models")
    ## The model file defined above: lenet5.uff
    model_file = os.path.join(model_path, ModelData.MODEL_FILE)

    ## Call the function above to build the TensorRT (CUDA) engine
    with build_engine(model_file) as engine:
        ## Set up the host (CPU) and device (GPU) input/output buffers
        # Build an engine, allocate buffers and create a stream.
        # For more information on buffer allocation, refer to the introductory samples.
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        ## Create the execution context so the engine is ready to run (up to here it was only built)
        with engine.create_execution_context() as context:
            ## Function above: put the image data into the host (CPU) input buffer and pick the test case
            case_num = load_normalized_test_case(data_path, pagelocked_buffer=inputs[0].host)
            ## The engine is built and ready, so run inference (on the GPU)
            # For more information on performing inference, refer to the introductory samples.
            # The common.do_inference function will return a list of outputs - we only have one in this case.
            [output] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
            ## Find the largest value in the output and take its index
            pred = np.argmax(output)
            ## Compare the randomly chosen test case with the inferred value
            print("Test Case: " + str(case_num))
            print("Prediction: " + str(pred))

if __name__ == '__main__':
    main()
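One detail in load_normalized_test_case is worth noting: the default `case_num=randint(0, 9)` is evaluated once, when the function is defined, not on every call. That is harmless in sample.py, which calls the function a single time per run, but repeated calls within one process would reuse the same case. The sketch below uses two hypothetical helpers (`pick_case_fixed` and `pick_case_fresh`, not part of the NVIDIA sample) to show the difference.

```python
from random import randint, seed

seed(0)  # make the illustration repeatable

def pick_case_fixed(case_num=randint(0, 9)):
    # The default was drawn once, at definition time.
    return case_num

def pick_case_fresh(case_num=None):
    # Draw a new value on every call unless one is passed in.
    return randint(0, 9) if case_num is None else case_num

fixed = [pick_case_fixed() for _ in range(20)]
fresh = [pick_case_fresh() for _ in range(20)]
print(len(set(fixed)))       # 1: every call returns the same case
print(len(set(fresh)) > 1)   # the fresh variant varies between calls
```

If the loader were ever called in a loop, passing case_num explicitly or adopting the `None` default pattern would give a new test case each time.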