Jeonghun (James) Lee: NVIDIA-TensorRT


6/04/2019

TensorRT 5.0 Multimedia-Sample

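1. TensorRT 5.0 and Multimedia Control

This post tests TensorRT together with the Multimedia API and walks through the related samples.
The functionality is nearly identical to DeepStream, which is covered later, so it is worth understanding how these samples work.

First, set the Jetson TX2 to maximum performance.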

$ sudo jetson_clocks   // sudo nvpmodel -m 0

$ sudo jetson_clocks --show
[sudo] password for jetsontx2: 
SOC family:tegra186  Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c7=0 
cpu1: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c6=0 c7=0 
cpu2: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c6=0 c7=0 
cpu3: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c7=0 
cpu4: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c7=0 
cpu5: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c7=0 
GPU MinFreq=1300500000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
Fan: speed=255
NV Power Mode: MAXN
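
The output above can also be checked programmatically. The following is a minimal sketch, assuming the line format shown above (`MinFreq=`, `MaxFreq=`, `CurrentFreq=` fields); the function names are hypothetical and not part of any NVIDIA tool. It reports which units are not running at their maximum frequency.

```python
import re

# Three lines copied from the `jetson_clocks --show` output above.
SAMPLE = """\
cpu0: Online=1 Governor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200 IdleStates: C1=0 c7=0
GPU MinFreq=1300500000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
"""


def parse_clock_lines(text):
    """Return {unit: {"MinFreq": int, "MaxFreq": int, "CurrentFreq": int}}."""
    clocks = {}
    for line in text.splitlines():
        fields = dict(re.findall(r"(MinFreq|MaxFreq|CurrentFreq)=(\d+)", line))
        if len(fields) == 3:  # skip lines such as "Fan: speed=255"
            unit = line.split()[0].rstrip(":")
            clocks[unit] = {name: int(value) for name, value in fields.items()}
    return clocks


def units_below_max(clocks):
    """List the units whose current frequency is lower than their maximum."""
    return sorted(u for u, f in clocks.items() if f["CurrentFreq"] < f["MaxFreq"])


clocks = parse_clock_lines(SAMPLE)
print(units_below_max(clocks))  # [] when every clock is pinned at its maximum
```

An empty list means the CPU, GPU, and EMC clocks are all at their maximum, which matches the MAXN output shown above.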

$ cat /usr/bin/jetson_clocks   // the script itself will be analyzed in detail later
....
 do_hotplug
 do_clusterswitch
 do_cpu
 do_gpu
 do_emc
 do_fan
 do_nvpmodel
......

NVIDIA Multimedia API
https://docs.nvidia.com/jetson/archives/l4t-multimedia-archived/l4t-multimedia-281/index.html

1.1 Sample Backend Test

I expected JetPack 4.2 to ship the same sample video as JetPack 3.3, but the sample video has changed.

The run takes a long time, so be patient and wait for it to finish.

 
$ cd  /usr/src/tegra_multimedia_api
$ ls
argus  data  include  LEGAL  LICENSE  Makefile  README  samples  tools

$ cd samples     // almost identical to JetPack 3.3
00_video_decode  02_video_dec_cuda  04_video_dec_trt  06_jpeg_decode    08_video_dec_drm        10_camera_recording  13_multi_camera       backend  frontend  v4l2cuda
01_video_encode  03_video_cuda_enc  05_jpeg_encode    07_video_convert  09_camera_jpeg_capture  12_camera_v4l2_cuda  14_multivideo_decode  common   Rules.mk

$ cd backend 

// As with JetPack 3.3, connect an HDMI display before running the test

$ ./backend 1 ../../data/Video/sample_outdoor_car_1080p_10fps.h264 H264 \
    --trt-deployfile ../../data/Model/GoogleNet_one_class/GoogleNet_modified_oneClass_halfHD.prototxt \
    --trt-modelfile ../../data/Model/GoogleNet_one_class/GoogleNet_modified_oneClass_halfHD.caffemodel \
    --trt-proc-interval 1 -fps 10

// The --trt-forcefp32 0 option has been removed; the sample now runs in FP16

Structure of the Backend sample

GStreamer-style example (without TensorRT)

In this configuration, four H.264 channels are taken as input and decoded, passed through the Video Image Compositor (VIC) and CUDA,
and then rendered with OpenGL on X11. (Think of it as an ordinary GStreamer pipeline.)

GStreamer-style example (with TensorRT)

This is the structure of the sample actually run above (backend); it adds car detection using TensorRT (GIE).
The VIC mainly handles image conversion (RGB-to-YUV conversion or scaling, i.e. changing the video format), converting the data into the input format that TensorRT (GIE, the inference engine) expects.
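
To make the two VIC operations mentioned above concrete, here is a minimal pure-Python sketch of a color-space conversion and a scaling step. The VIC does this in hardware; the code only illustrates the arithmetic. The full-range BT.601 coefficients and the nearest-neighbour scaling are assumptions chosen for illustration, not necessarily what the VIC or the sample uses, and the function names are hypothetical.

```python
def rgb_to_yuv_bt601(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 YUV (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    v = 0.5 * r - 0.418688 * g - 0.081312 * b + 128

    def clamp(x):
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, int(round(x))))

    return clamp(y), clamp(u), clamp(v)


def scale_nearest(image, out_w, out_h):
    """Nearest-neighbour resize of an image stored as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


print(rgb_to_yuv_bt601(255, 255, 255))  # (255, 128, 128)
print(scale_nearest([[1, 2], [3, 4]], 4, 4)[0])  # [1, 1, 2, 2]
```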

Terminology

TensorRT (previously known as GPU Inference Engine (GIE))
Video Image Compositor (VIC)

VIC(Video Image Compositor)

https://developer.ridgerun.com/wiki/index.php?title=Xavier/Processors/HDAV_Subsystem/Compositor

Backend sample (the example above)

https://docs.nvidia.com/jetson/archives/l4t-multimedia-archived/l4t-multimedia-281/nvvid_backend_group.html

YOLO video test (JetsonHacks)

A YOLO video test on JetPack 3.3; it is reported at about 3.3 frames (presumably per second), which is very slow.
https://www.youtube.com/watch?v=p1fJFG1S6Sw

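1.2 Sample FrontEnd

This sample works on the Jetson TX2 EVM because the board ships with a camera; I am not sure whether it works on other Jetson EVMs.
On the Jetson TX2 EVM the camera is connected over MIPI, and the structure closely resembles a GStreamer pipeline, so the two are worth comparing.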

 
$ cd  /usr/src/tegra_multimedia_api
$ ls
argus  data  include  LEGAL  LICENSE  Makefile  README  samples  tools

$ cd samples     // almost identical to JetPack 3.3
00_video_decode  02_video_dec_cuda  04_video_dec_trt  06_jpeg_decode    08_video_dec_drm        10_camera_recording  13_multi_camera       backend  frontend  v4l2cuda
01_video_encode  03_video_cuda_enc  05_jpeg_encode    07_video_convert  09_camera_jpeg_capture  12_camera_v4l2_cuda  14_multivideo_decode  common   Rules.mk

$ cd frontend 

// As with JetPack 3.3, connect an HDMI display before testing; the video is rendered in real time

// The first run takes a while because trtModel.cache is generated.

$ sudo ./frontend --deploy ../../data/Model/GoogleNet_three_class/GoogleNet_modified_threeClass_VGA.prototxt \
       --model ../../data/Model/GoogleNet_three_class/GoogleNet_modified_threeClass_VGA.caffemodel

$ ll  // newly generated files appear as below (trt.h264 / trtModel.cache / output1.h265 ...)
total 1134276
drwxr-xr-x  2 root root      4096  6월  7 13:00 ./
drwxr-xr-x 20 root root      4096  5월 30 15:17 ../
-rwxr-xr-x  1 root root    784936  5월 30 15:19 frontend*
-rw-r--r--  1 root root     12271  5월 30 15:17 main.cpp
-rw-r--r--  1 root root    157944  5월 30 15:19 main.o
-rw-r--r--  1 root root      2821  5월 30 15:17 Makefile
-rw-r--r--  1 root root 287496172  6월  7 12:59 output1.h265
-rw-r--r--  1 root root 287555462  6월  7 12:59 output2.h265
-rw-r--r--  1 root root 287447012  6월  7 12:59 output3.h265
-rw-r--r--  1 root root      2626  5월 30 15:17 Queue.h
-rw-r--r--  1 root root      3134  5월 30 15:17 StreamConsumer.cpp
-rw-r--r--  1 root root      2854  5월 30 15:17 StreamConsumer.h
-rw-r--r--  1 root root     62112  5월 30 15:19 StreamConsumer.o
-rw-r--r--  1 root root 287331038  6월  7 12:59 trt.h264
-rw-r--r--  1 root root  10003768  6월  7 12:49 trtModel.cache
-rw-r--r--  1 root root     13293  5월 30 15:17 TRTStreamConsumer.cpp
-rw-r--r--  1 root root      3856  5월 30 15:17 TRTStreamConsumer.h
-rw-r--r--  1 root root    341584  5월 30 15:19 TRTStreamConsumer.o
-rwxr-xr-x  1 root root       208  6월  7 12:48 tst.sh*
-rw-r--r--  1 root root      9197  5월 30 15:17 VideoEncoder.cpp
-rw-r--r--  1 root root      3462  5월 30 15:17 VideoEncoder.h
-rw-r--r--  1 root root     83496  5월 30 15:19 VideoEncoder.o
-rw-r--r--  1 root root      4177  5월 30 15:17 VideoEncodeStreamConsumer.cpp
-rw-r--r--  1 root root      2480  5월 30 15:17 VideoEncodeStreamConsumer.h
-rw-r--r--  1 root root    104768  5월 30 15:19 VideoEncodeStreamConsumer.o
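
The trtModel.cache file in the listing explains why only the first run is slow: the optimized model is built once and reused afterwards. The sketch below shows that general build-or-load pattern. It is not the sample's actual implementation; `fake_build` is a hypothetical stand-in for the slow optimization step.

```python
import os
import tempfile


def load_or_build(cache_path, build_fn):
    """Return (data, was_cached): read the cache if present, else build and save."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read(), True
    data = build_fn()  # slow path, taken only on the first run
    with open(cache_path, "wb") as f:
        f.write(data)
    return data, False


def fake_build():
    # Hypothetical stand-in for the expensive model optimization.
    return b"serialized-engine"


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "trtModel.cache")
    first = load_or_build(path, fake_build)
    second = load_or_build(path, fake_build)

print(first[1], second[1])  # False True
```

With this pattern, deleting the cache file forces a rebuild on the next run, which is presumably needed after changing the model files.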


// Check playback of each H.264 / H.265 video


$ sudo ../00_video_decode/video_decode H265 output1.h265 //480P
or 
$ sudo ../02_video_dec_cuda/video_dec_cuda output1.h265 H265 //480p
$ sudo ../02_video_dec_cuda/video_dec_cuda output2.h265 H265  //720p
$ sudo ../02_video_dec_cuda/video_dec_cuda output3.h265 H265  //1080p 
$ sudo ../02_video_dec_cuda/video_dec_cuda trt.h264 H264  //1080p

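The videos produced by the test above can be grouped as follows.
The File Sink stages in the pipeline correspond to the individual H.265 outputs.
The remaining path goes through TensorRT and is both rendered directly to the display and saved as H.264 (trt.h264).

Following this flow,
the live video from the Jetson TX camera is analyzed in real time (TensorRT draws boxes around detected objects) and saved to files.
I did not have a proper test environment, and I could not carry the board around to test every case,

but I did confirm that boxes are drawn in real time. Unfortunately, although I expected cars to be detected well, detection seemed poor.
(The test setup may have been at fault: I pointed the camera at ordinary photos of cars.)

Left: Argus Camera API

Right: V4L2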

https://docs.nvidia.com/jetson/archives/l4t-multimedia-archived/l4t-multimedia-281/l4t_mm_camcap_tensorrt_multichannel_group.html

FrontEnd Example

Description of the above

https://docs.nvidia.com/jetson/archives/l4t-multimedia-archived/l4t-multimedia-281/l4t_mm_camcap_tensorrt_multichannel_group.html

Frame buffer information

https://devtalk.nvidia.com/default/topic/1017059/jetson-tx2/onboard-camera-dev-video0/

GStreamer-related features

https://devtalk.nvidia.com/default/topic/1010795/jetson-tx2/v4l2-on-jetson-tx2/
https://devtalk.nvidia.com/default/topic/1030593/how-to-control-on-board-camera-such-as-saving-images-and-videos/

Other camera solutions

https://github.com/Abaco-Systems/jetson-inference-gv

1.3 Sample video_dec_trt

This sample is similar to Backend, but it only performs the analysis and does not display the video.

 
$ cd  /usr/src/tegra_multimedia_api
$ ls
argus  data  include  LEGAL  LICENSE  Makefile  README  samples  tools

$ cd samples     // almost identical to JetPack 3.3
00_video_decode  02_video_dec_cuda  04_video_dec_trt  06_jpeg_decode    08_video_dec_drm        10_camera_recording  13_multi_camera       backend  frontend  v4l2cuda
01_video_encode  03_video_cuda_enc  05_jpeg_encode    07_video_convert  09_camera_jpeg_capture  12_camera_v4l2_cuda  14_multivideo_decode  common   Rules.mk

$ cd 04_video_dec_trt 

// result.txt, result0.txt, and result1.txt are generated; an HDMI display may be connected; I tried a different model but ran into problems

// 2-channel analysis
$ sudo ./video_dec_trt 2 ../../data/Video/sample_outdoor_car_1080p_10fps.h264 \
    ../../data/Video/sample_outdoor_car_1080p_10fps.h264 H264 \
    --trt-deployfile ../../data/Model/resnet10/resnet10.prototxt \
    --trt-modelfile ../../data/Model/resnet10/resnet10.caffemodel \
    --trt-mode 0


or 
// 1-channel analysis
$ sudo ./video_dec_trt 1  ../../data/Video/sample_outdoor_car_1080p_10fps.h264 H264 \
    --trt-deployfile ../../data/Model/resnet10/resnet10.prototxt \
    --trt-modelfile ../../data/Model/resnet10/resnet10.caffemodel \
    --trt-mode 0


$ cat ../../data/Model/resnet10/labels.txt    // check the class labels
Car
RoadSign
TwoWheeler
Person

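// result.txt, result0.txt, and result1.txt are generated (2 channels)

$ cat result.txt | head -n 100        // entries for class num 0, 1, and 2 are all written; classes 1 and 2 contain no rectangles, which looks like a problem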
frame:0 class num:0 has rect:5
 x,y,w,h:0.55625 0.410326 0.040625 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.09375 0.36413 0.223438 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.418478 0.0390625 0.076087

frame:0 class num:1 has rect:0

frame:0 class num:2 has rect:0

frame:1 class num:0 has rect:5
 x,y,w,h:0.55625 0.413043 0.0390625 0.048913
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.09375 0.36413 0.225 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.418478 0.040625 0.076087

frame:1 class num:1 has rect:0

frame:1 class num:2 has rect:0

frame:2 class num:0 has rect:5
 x,y,w,h:0.55625 0.410326 0.0390625 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.0921875 0.36413 0.221875 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.11413
 x,y,w,h:0.403125 0.418478 0.040625 0.076087

frame:2 class num:1 has rect:0

frame:2 class num:2 has rect:0

frame:3 class num:0 has rect:5
 x,y,w,h:0.554688 0.410326 0.0375 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.0921875 0.361413 0.220313 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.421196 0.0390625 0.076087
........

$ cat ./result0.txt | head -n 100    // only class num 0 is written
frame:0 class num:0 has rect:5
 x,y,w,h:0.55625 0.410326 0.040625 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.09375 0.36413 0.223438 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.418478 0.0390625 0.076087

frame:1 class num:0 has rect:5
 x,y,w,h:0.55625 0.413043 0.0390625 0.048913
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.09375 0.36413 0.225 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.418478 0.040625 0.076087

frame:2 class num:0 has rect:5
 x,y,w,h:0.55625 0.410326 0.0390625 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.0921875 0.36413 0.221875 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.11413
 x,y,w,h:0.403125 0.418478 0.040625 0.076087

frame:3 class num:0 has rect:5
 x,y,w,h:0.554688 0.410326 0.0375 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.0921875 0.361413 0.220313 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.421196 0.0390625 0.076087

frame:4 class num:0 has rect:5
 x,y,w,h:0.55625 0.413043 0.0375 0.0461957
 x,y,w,h:0.596875 0.366848 0.0546875 0.0923913
 x,y,w,h:0.0921875 0.36413 0.220313 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.421196 0.0375 0.076087
..........
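
The result files shown above are easy to post-process. The following is a minimal sketch of a parser for that text format, written from the output above alone. The assumption that x, y, w, h are normalized to the frame size (so that multiplying by 1920x1080 gives pixels) is mine and not confirmed by the sample documentation; the function names are hypothetical.

```python
import re

# First records copied from result.txt above.
SAMPLE = """\
frame:0 class num:0 has rect:5
 x,y,w,h:0.55625 0.410326 0.040625 0.0516304
 x,y,w,h:0.595312 0.366848 0.0546875 0.0923913
 x,y,w,h:0.09375 0.36413 0.223438 0.201087
 x,y,w,h:0.323438 0.413043 0.0984375 0.111413
 x,y,w,h:0.403125 0.418478 0.0390625 0.076087

frame:0 class num:1 has rect:0
"""

HEADER = re.compile(r"frame:(\d+) class num:(\d+) has rect:(\d+)")
BOX_PREFIX = "x,y,w,h:"


def parse_results(text):
    """Return {(frame, class_num): [(x, y, w, h), ...]}."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)))
            results[current] = []
        elif line.startswith(BOX_PREFIX) and current is not None:
            values = line[len(BOX_PREFIX):].split()
            results[current].append(tuple(float(v) for v in values))
    return results


def to_pixels(box, width=1920, height=1080):
    """Scale a box to pixels, assuming normalized coordinates."""
    x, y, w, h = box
    return (round(x * width), round(y * height), round(w * width), round(h * height))


results = parse_results(SAMPLE)
print(len(results[(0, 0)]), len(results[(0, 1)]))  # 5 0
```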

// Play the video using the analysis results above (same view as Backend)
$ sudo ../02_video_dec_cuda/video_dec_cuda ../../data/Video/sample_outdoor_car_1080p_10fps.h264 H264  --bbox-file result0.txt

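Each video is decoded, converted to the required size, passed to TensorRT, and the resulting bounding-box (BBOX) information is saved.

Command usage

$ ./video_dec_trt   // usage: takes 1-channel or 2-channel video input and finally produces result.txt (box information)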

video_dec_trt [Channel-num]   ...  [options]

Channel-num:
 1-32, Number of file arguments should exactly match the number of channels specified

Supported formats:
 H264
 H265

OPTIONS:
 -h,--help            Prints this text
 --dbg-level   Sets the debug level [Values 0-3]

 --trt-deployfile     set deploy file name
 --trt-modelfile      set model file name
 --trt-mode           0 fp16 (if supported), 1 fp32, 2 int8
 --trt-enable-perf    1[default] to enable perf measurement, 0 otherwise
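
The usage text above states that the number of input files must match the channel count exactly. A small helper can enforce this when scripting runs. This is a sketch that uses only the options listed above; the helper name is hypothetical, and the argument order follows the invocations shown earlier in this post.

```python
def build_video_dec_trt_cmd(files, fmt, deploy, model, mode=0):
    """Build an argument list for video_dec_trt; channel count = number of files."""
    if not 1 <= len(files) <= 32:
        raise ValueError("channel count must be between 1 and 32")
    if fmt not in ("H264", "H265"):
        raise ValueError("format must be H264 or H265")
    if mode not in (0, 1, 2):
        raise ValueError("mode must be 0 (fp16), 1 (fp32) or 2 (int8)")
    return [
        "./video_dec_trt", str(len(files)), *files, fmt,
        "--trt-deployfile", deploy,
        "--trt-modelfile", model,
        "--trt-mode", str(mode),
    ]


video = "../../data/Video/sample_outdoor_car_1080p_10fps.h264"
cmd = build_video_dec_trt_cmd(
    [video, video], "H264",
    "../../data/Model/resnet10/resnet10.prototxt",
    "../../data/Model/resnet10/resnet10.caffemodel",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on the board; deriving the channel number from the file list removes the most likely source of a mismatch.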




$  ../02_video_dec_cuda/video_dec_cuda 
video_dec_cuda   [options]

Supported formats:
 H264
 H265

OPTIONS:
 -h,--help            Prints this text
 --dbg-level   Sets the debug level [Values 0-3]

 --disable-rendering  Disable rendering
 --fullscreen         Fullscreen playback [Default = disabled]
 -ww           Window width in pixels [Default = video-width]
 -wh          Window height in pixels [Default = video-height]
 -wx        Horizontal window offset [Default = 0]
 -wy        Vertical window offset [Default = 0]

 -fps            Display rate in frames per second [Default = 30]

 -o         Write to output file

 -f       1 NV12, 2 I420 [Default = 1]

 --input-nalu         Input to the decoder will be nal units
 --input-chunks       Input to the decoder will be a chunk of bytes [Default]
 --bbox-file          bbox file path
 --display-text     enable nvosd text overlay with input string

04_video_dec_trt
Similar to the example above; it takes the video input directly and processes it.
  https://docs.nvidia.com/jetson/archives/l4t-multimedia-archived/l4t-multimedia-281/l4t_mm_vid_decode_trt.html

JetsonTX2 Gstreamer
  https://developer.ridgerun.com/wiki/index.php?title=Gstreamer_pipelines_for_Jetson_TX2
  https://elinux.org/Jetson/H264_Codec
  https://developer.ridgerun.com/wiki/index.php?title=NVIDIA_Jetson_TX1_TX2_Video_Latency

Jetson TX2 Gstreamer and OpenCV (python)
  https://jkjung-avt.github.io/tx2-camera-with-python/
  https://devtalk.nvidia.com/default/topic/1025356/how-to-capture-and-display-camera-video-with-python-on-jetson-tx2/

5/24/2019

NVIDIA TensorRT Manual: Collected References and Terminology

1. NVIDIA TensorRT Manual

NVIDIA Deep Learning Manual

The combined manual for deep learning frameworks and NVIDIA's SDKs.
On the site below, click each icon to read more about the topic you are interested in.
https://developer.nvidia.com/deep-learning-software

NVIDIA TensorRT

Written in C++ and used as an inference engine.
https://developer.nvidia.com/tensorrt

TensorRT basics

Introduces ways to improve performance; it is said to be up to 40 times faster than using the CPU alone.
https://devblogs.nvidia.com/speed-up-inference-tensorrt/

TensorRT Cloud (nvidia-docker; x86 only, ARM not yet supported)

https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt

Introduction to the TensorRT Python API

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#python_topics

TensorRT DLA (Deep Learning Accelerator)

Not applicable to the Jetson TX2 at present, since it has no DLA hardware.
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#dla_topic

TensorRT constraints (must check)

The Jetson TX2 supports only FP32/FP16; source code for DLA exists, but the hardware does not support it.
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#layers-matrix
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix

  https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html

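TensorRT performance measurement

Nsight appears to be used to profile CUDA and improve it, but it seems to be aimed mainly at x86.
The Jetson TX2 is also supported by Nsight, but that seems to be for measuring runtime performance.
Later I should learn how to analyze with TensorFlow's TensorBoard.
(TensorFlow itself comes first.)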
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#profiling

TensorRT Sample (C++/python)

C++ Sample / Python Sample
https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html

TensorRT (installation for x86)

https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#overview

TensorRT functions (Graph Surgeon)

Analyzing the TensorFlow graph is said to be important; how to do this will be looked into in detail later.
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/graphsurgeon/graphsurgeon.html

TensorRT UFF/Caffe/ONNX parser functions

Models from other frameworks are imported and converted through a parser, so I need to learn how this works.
A separate conversion command currently exists.
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/parsers/Uff/pyUff.html

TensorRT Python network construction (layers)

This covers the deep learning network; read it together with the constraints above and get a rough idea of the features.
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Graph/pyGraph.html#

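TensorRT Python core features

In Core, CudaEngine looks like the key component; judging from the documentation above, it appears to be related to serialization.
A Profiler is provided separately; which tool is used to analyze its output is something to look up later.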
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Core/pyCore.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Core/Profiler.html

TensorRT Useful Resources

  http://on-demand.gputechconf.com/gtcdc/2017/video/DC7172/
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html
  https://devblogs.nvidia.com/tensorrt-4-accelerates-translation-speech-recommender/
  http://on-demand.gputechconf.com/gtc/2018/video/S8822/

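1.1 Terminology used by NVIDIA

NVIDIA's manuals contain many abbreviations that cause confusion, so the relevant ones are summarized briefly here.
NVIDIA has recently started providing Korean documentation, so read it in Korean where available.

DGX: unclear whether this is an NVIDIA workstation or an appliance; my guess is a workstation (the website gives no detailed explanation, hence the confusion)
NGC: NVIDIA GPU Cloud, supported on x86 (Docker)

See the product section below.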

https://www.nvidia.com/en-us/data-center/dgx-systems/

1.2 NVIDIA DGX System

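It is unclear to me whether this is a workstation NVIDIA built for deep learning training; a later model seems to be planned as HGX.
HGX appears to exist in name only, with no official release yet.
There are several kinds of DGX; they look usable as workstations for fast training, with cloud and multi-GPU support that makes them fast.

NVIDIA DGX Series

Judging from the links below, each can be used as a workstation or as a high-end system.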
https://www.nvidia.com/en-us/data-center/dgx-pod-reference-architecture/
https://www.nvidia.com/en-us/data-center/dgx-systems/
https://www.nvidia.com/ko-kr/data-center/dgx-station/
https://youtu.be/PuZ2F87Lqg4

Tensorflow -> TensorRT

https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html

1.3 Using Docker with NVIDIA

NVIDIA's Docker setup is as follows.

docker : supports x86 and ARM
nvidia-docker : ARM not yet supported

NVIDIA offers additional Docker functionality through its GPU Cloud, and nvidia-docker is what is used for this, but it does not yet support ARM.
At first I mistakenly assumed nvidia-docker supported ARM and wasted some time.

NVIDIA GPU Cloud and nvidia-docker run on the host PC and seem to be used mainly for training and for testing the result.
Personally, I expect ARM support to come later; I will look into it in detail then.

nvidia-docker support (host PC only, ARM not supported)
  https://github.com/NVIDIA/nvidia-docker/issues/214
  https://github.com/NVIDIA/nvidia-docker/wiki
  https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#do-you-support-tegra-platforms-arm64
  https://www.nvidia.co.kr/content/apac/event/kr/deep-learning-day-2017/dli-1/Docker-User-Guide-17-08_v1_NOV01_Joshpark.pdf

NVIDIA GPU Cloud (NGC): sign-up and usage

https://ngc.nvidia.com/
https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt

How to use the TensorRT Docker container

https://docs.nvidia.com/deeplearning/sdk/tensorrt-container-release-notes/pullcontainer.html#pullcontainer

The document above says to refer to the NVIDIA® GPU Cloud™ (NGC) documentation if you are not using a DGX, so I followed that.
https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html

Attempting to install TensorRT on the Jetson TX2 using docker

nvidia@tegra-ubuntu:~/jhlee$ mkdir docker
nvidia@tegra-ubuntu:~/jhlee$ cd docker
nvidia@tegra-ubuntu:~/jhlee/docker$ sudo docker pull nvcr.io/nvidia/tensorrt:19.05-py3   
[sudo] password for nvidia: 
19.05-py3: Pulling from nvidia/tensorrt
7e6591854262: Pulling fs layer 
089d60cb4e0a: Pulling fs layer 
9c461696bc09: Pulling fs layer 
45085432511a: Pull complete 
6ca460804a89: Pull complete 
2631f04ebf64: Pull complete 
86f56e03e071: Pull complete 
234646620160: Downloading [=====================>                             ]  265.5MB/615.2MB
7f717cd17058: Verifying Checksum 
e69a2ba99832: Download complete 
bc9bca17b13c: Download complete 
1870788e477f: Download complete 
603e0d586945: Downloading [=====================>                             ]  214.3MB/492.7MB
717dfedf079c: Download complete 
1035ef613bc7: Download complete 
c5bd7559c3ad: Download complete 
d82c679b8708: Download complete 
....

$ sudo docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
nvcr.io/nvidia/tensorrt   19.05-py3           de065555c278        2 weeks ago         3.83GB
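
The pull above succeeds, but as noted earlier the NGC containers and nvidia-docker target x86 hosts at the time of writing, so the image is of little use on the TX2. A quick architecture check before pulling can save the download. This is a minimal sketch; the assumption that the image is x86_64-only comes from the notes in this post, and the function names are hypothetical.

```python
import platform

X86_NAMES = {"x86_64", "amd64"}
ARM64_NAMES = {"aarch64", "arm64"}


def classify_machine(machine):
    """Map a platform.machine() string to 'x86_64', 'arm64' or 'other'."""
    name = machine.lower()
    if name in X86_NAMES:
        return "x86_64"
    if name in ARM64_NAMES:
        return "arm64"
    return "other"


def can_run_x86_container(machine=None):
    """True if the host (or the given machine string) is an x86_64 system."""
    return classify_machine(machine or platform.machine()) == "x86_64"


# A Jetson TX2 reports "aarch64".
print(can_run_x86_container("aarch64"))  # False
```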

Running TensorRT on nvidia-docker

https://docs.nvidia.com/deeplearning/sdk/tensorrt-container-release-notes/running.html
nvidia-docker (adds CUDA support to Docker)

https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
Jetson TX2 resources (search the page for Docker)
https://elinux.org/Jetson_TX2

Installing nvidia-docker

Docker provided by NVIDIA that can use the GPU as well, rather than being CPU-only; an ARM version is not currently provided.
https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
https://nvidia.github.io/nvidia-docker/

Setting up a Jetson development environment (PC)

https://github.com/teoac/DeepLearningOnJetson/wiki/How-to-Set-Environment-for-Development

Tensorflow for Jetson TX2

TensorFlow can be installed easily on the Jetson.
https://docs.nvidia.com/deeplearning/frameworks/install-tf-jetsontx2/index.html
https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/

JetPack and Isaac SDK resources for the Jetson TX2

https://developer.nvidia.com/embedded/jetpack
https://developer.nvidia.com/isaac-sdk

Nsight

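An Eclipse-based IDE tool provided by NVIDIA (Nsight Eclipse Edition); it is present once JetPack is installed. The links below are for Nsight Systems and Nsight Graphics, which are separate profiling and graphics-debugging tools.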
https://developer.nvidia.com/nsight-systems
https://developer.nvidia.com/nsight-graphics

4/30/2019

NVIDIA Deep Learning: TensorRT

1. NVIDIA Deep Learning SDK: structure and basic description

Viewed simply, NVIDIA's deep learning system splits into TRAINING and INFERENCE: training produces a data model,
and inference applies that model.

This is a simple and obvious structure; what is specific to NVIDIA is that it provides its own components for each of the two stages.

Training and Inference

Let's look at the components mentioned above.
The technologies currently listed on NVIDIA's official site are as follows; the list changes every time I check, so it will likely keep growing.

Deep Learning Primitives (CUDA® Deep Neural Network library™ (cuDNN))
Deep Learning Inference Engine (TensorRT™ )
Deep Learning for Video Analytics (NVIDIA DeepStream™ SDK)
Linear Algebra (CUDA® Basic Linear Algebra Subroutines library™ (cuBLAS))
Sparse Matrix Operations (NVIDIA CUDA® Sparse Matrix library™ (cuSPARSE))
Multi-GPU Communication (NVIDIA® Collective Communications Library ™ (NCCL))

CUDA is NVIDIA's parallel computing platform and programming model for GPUs (not a graphics library); the libraries built on it are described below.

CUDA/cuDNN

cuDNN is the CUDA Deep Neural Network library, a library for DNNs.
It provides primitives such as convolution and pooling and is used by a variety of deep learning frameworks.
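
For readers unfamiliar with the two primitives named above, here is a minimal pure-Python sketch of what they compute. This is not the cuDNN API, only an illustration on nested lists; cuDNN runs the same kind of arithmetic on the GPU over batched, multi-channel tensors. As in most deep learning frameworks, the "convolution" here is computed without flipping the kernel (cross-correlation).

```python
def conv2d_valid(image, kernel):
    """2D convolution (no padding, stride 1) over a list-of-rows image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(image, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(image[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(image[0]) - size + 1, size)
        ]
        for i in range(0, len(image) - size + 1, size)
    ]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(conv2d_valid(image, [[1, 0], [0, 1]]))  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
print(max_pool2d(image))                      # [[6, 8], [14, 16]]
```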

CUDA/cuBLAS

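This is the linear algebra library; it is said to be a GPU library more than 6 times faster than the MKL BLAS library on the CPU, but I will only know once I try it.
I do not know CUDA's internals well, so it is hard to say anything definitive here.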

CUDA/cuSPARSE

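I have not examined the difference between cuBLAS and cuSPARSE in detail; as the names suggest, cuBLAS targets dense linear algebra while cuSPARSE targets sparse matrices, and both play a role similar to MKL on the CPU.
I think it provides fast matrix operations, as MATLAB does.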
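
The dense-versus-sparse distinction can be shown with a small example. The sketch below is plain Python, not the cuBLAS or cuSPARSE API: it computes the same matrix-vector product once from a dense matrix and once from a compressed sparse row (CSR) representation, which stores only the non-zero entries. CSR is one common sparse format and is used here as an illustration.

```python
def dense_matvec(matrix, vec):
    """Dense matrix-vector product: every entry is visited, zeros included."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def to_csr(matrix):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, vec):
    """Sparse matrix-vector product: only non-zero entries are visited."""
    return [
        sum(values[k] * vec[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


matrix = [[4, 0, 0], [0, 0, 5], [0, 2, 0]]
vec = [1, 2, 3]
csr = to_csr(matrix)
print(csr)                     # ([4, 5, 2], [0, 2, 1], [0, 1, 2, 3])
print(dense_matvec(matrix, vec))   # [4, 15, 4]
print(csr_matvec(*csr, vec))       # [4, 15, 4]
```

Both routines return the same result; the sparse version touches three values instead of nine, which is where the saving comes from when most entries are zero.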

These three features are all part of CUDA, so keep them distinct.

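TensorRT

Training is done on the host PC; TensorRT is NVIDIA's inference engine and framework for deploying the resulting model.
It supports both x86 and ARM, so the details for each platform need to be checked separately.
On Jetson, only C++ is currently supported. (It is unclear what will happen with Python later.)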

DeepStream SDK

Like TensorRT, it is a C++ library used for inference, intended for fast video analytics with deep learning; my guess is that it wraps TensorRT into something like a GStreamer pipeline.
If used, it should be similar to running YOLO on video in real time.
(TensorRT currently supports YOLO.)

NCCL

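A multi-GPU communication library, which I expect is used mainly during training.
It is probably used on a host PC to communicate with multiple GPUs for faster processing.

See the link below for the material above.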
  https://docs.nvidia.com/deeplearning/sdk/introduction/index.html

For the Jetson TX2, the plan is to train on a PC with some framework to produce a model, then optimize inference with TensorRT on the Jetson TX2.
(TensorFlow -> TensorRT)

TensorRT installation and structure
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html

Nvidia Jetson Tutorial
  https://developer.nvidia.com/embedded/learn/tutorials

NVIDIA's TensorFlow to TensorRT
  https://developer.nvidia.com/embedded/twodaystoademo

1.1 NVIDIA Deep Learning SDK Manual

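The NVIDIA manual above is organized as follows in its left-hand navigation; read only the parts you need.

Deep Learning SDK: required reading; gives an easy overview of the whole structure
Performance: explains the performance NVIDIA offers, but I cannot follow it yet
Training Library: the cuDNN/NCCL descriptions above, centered on training; to be studied in detail later
Inference Library: TensorRT, the inference engine, is the most important part; understand this section
Inference Server: runs on Docker and provides HTTP/gRPC; seems similar to a TensorFlow server
Data Loading: described as a data loading library; from reading it, it seems to concern big data during training
Archives: guides for each of the items above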

NVIDIA Deep Learning SDK
  https://docs.nvidia.com/deeplearning/sdk/index.html

NVIDIA DGX (to be reviewed later)
https://docs.nvidia.com/deeplearning/dgx/
  https://docs.nvidia.com/deeplearning/dgx/install-tf-jetsontx2/index.html

NVIDIA DEV COMMUNITY
  https://devtalk.nvidia.com/

1.2 Installing JetPack on the Jetson TX2 and checking the environment

Since the work is done on a Jetson TX2, its constraints must always be checked.

Jetson TX2 (Ubuntu 16.04)
JetPack 3.3 installed (TensorRT 4.0.2)
TensorFlow 1.8.0 installed

JetPack 3.3 information
https://developer.nvidia.com/embedded/jetpack-3_3

JetPack 4.1 information
https://developer.nvidia.com/embedded/downloads#?search=L4T%20Jetson%20TX2%20Driver%20Package

TensorRT runs on both x86 and Jetson, but when reading the manual always check whether a given feature applies to Jetson as well.
TensorRT is the deep learning inference engine provided by NVIDIA.

TensorRT is currently available up to 5.x; check your own TensorRT version as shown below.

nvidia@tegra-ubuntu$  dpkg -l | grep TensorRT
ii  libnvinfer-dev                              4.1.3-1+cuda9.0                              arm64        TensorRT development libraries and headers
ii  libnvinfer-samples                          4.1.3-1+cuda9.0                              arm64        TensorRT samples and documentation
ii  libnvinfer4                                 4.1.3-1+cuda9.0                              arm64        TensorRT runtime libraries
ii  tensorrt                                    4.0.2.0-1+cuda9.0                            arm64        Meta package of TensorRT
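
When several boards need to be compared, the dpkg listing above can be parsed instead of read by eye. The following is a minimal sketch, assuming the column layout shown above (status, package, version, architecture, description); the function names are hypothetical.

```python
# Two lines copied from the `dpkg -l | grep TensorRT` output above.
SAMPLE = """\
ii  libnvinfer-dev                              4.1.3-1+cuda9.0                              arm64        TensorRT development libraries and headers
ii  tensorrt                                    4.0.2.0-1+cuda9.0                            arm64        Meta package of TensorRT
"""


def parse_dpkg(text):
    """Return {package: {"version": str, "arch": str}} for installed ("ii") rows."""
    packages = {}
    for line in text.splitlines():
        parts = line.split(None, 4)  # the description keeps its spaces
        if len(parts) >= 4 and parts[0] == "ii":
            packages[parts[1]] = {"version": parts[2], "arch": parts[3]}
    return packages


def upstream_version(debian_version):
    """Strip the Debian revision: '4.0.2.0-1+cuda9.0' -> '4.0.2.0'."""
    return debian_version.split("-", 1)[0]


packages = parse_dpkg(SAMPLE)
print(upstream_version(packages["tensorrt"]["version"]))  # 4.0.2.0
```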

  https://devtalk.nvidia.com/default/topic/1050183/tensorrt/tensorrt-5-1-2-installation-on-jetson-tx2-board/

Download TensorRT
  https://developer.nvidia.com/nvidia-tensorrt-download

How to upgrade TensorRT (x86)
  https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#installing-debian

  https://medium.com/@ardianumam/installing-tensorrt-in-jetson-tx2-8d130c4438f5

2. Training Framework

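In the manual's index, the Training Library section only covers the cuDNN and NCCL features of CUDA that training makes use of.

I have not actually done any training yet, so I do not know how it should be carried out; once I do, this part will be rewritten in detail.

The "Training with Mixed Precision" part of the manual's Performance section suggests that training is done through a variety of frameworks.

NVCaffe, Caffe2, MXNet, Microsoft Cognitive Toolkit, PyTorch, TensorFlow and Theano
It also appears that different numeric formats are used, changing the precision in order to optimize.
This can only be understood by actually training, so for now I will just read the manual.

Frameworks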

https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#framework

3. Inference (TensorRT)

As explained above, TensorRT is NVIDIA's inference engine and is implemented in C++.
It appears to import network models written in other deep learning frameworks and adapt them for its own use.

How to use TensorRT

Used together with a deep learning framework
Used standalone

Both modes seem to be supported, and the direction seems to be toward using TensorRT alone in the end.
TensorRT resembles a deep learning framework, but it provides inference only,
which sets it apart from the other frameworks.

Latest TensorRT features and description (performance)
https://developer.nvidia.com/tensorrt
http://www.nextplatform.com/wp-content/uploads/2018/01/inference-technical-overview-1.pdf

TensorRT features and supported parsers
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html

TensorRT is implemented in C++, and a Python API is said to be available as well.

The network definition lets you define the input/output tensors, add and configure layers, and adjust various settings.

Network Definition: the part concerning the network model (requires studying deep learning)
Builder: seems to be the part concerning optimization of the network
Engine: called the inference engine; I find it hard to tell these three apart precisely

TensorRT parsers

Networks written in other frameworks can reportedly be imported directly, in Caffe, UFF, or ONNX format, and then optimized.

Caffe Parser
UFF Parser
ONNX Parser

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#api

TensorRT connects to a variety of deep learning frameworks so that a trained model can be optimized for TensorRT; the parsers are what make this possible.

Using TensorRT with a framework (TensorFlow), and conversion

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#build_model
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification

TensorFlow to UFF Convert Format

https://devtalk.nvidia.com/default/topic/1028464/jetson-tx2/converting-tf-model-to-tensorrt-uff-format/

Importing and using TensorRT from TensorFlow

How to use TensorFlow and TensorRT together
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#import_tf_python

Applying TensorRT to TensorFlow

tf : Tensorflow
trt: TensorRT

https://jkjung-avt.github.io/tf-trt-models/
https://github.com/NVIDIA-AI-IOT/tf_trt_models

TensorRT Support Matrix

Check which parsers TensorRT supports
Check the constraints on TensorRT layers and features
https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html

Features by TensorRT release version

https://docs.nvidia.com/deeplearning/sdk/tensorrt-container-release-notes/index.html

TensorRT API (C++/Python)

https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/index.html

How to use the TensorRT docker container

https://docs.nvidia.com/deeplearning/sdk/tensorrt-container-release-notes/running.html

How to use the TensorRT samples

https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#mnist_sample

nvidia@tegra-ubuntu$ ls /usr/src/tensorrt/samples/
common     Makefile         sampleCharRNN     sampleGoogleNet  sampleMNIST     sampleMovieLens  sampleOnnxMNIST  sampleUffMNIST  trtexec
getDigits  Makefile.config  sampleFasterRCNN  sampleINT8       sampleMNISTAPI  sampleNMT        samplePlugin     sampleUffSSD

nvidia@tegra-ubuntu$ cd /usr/src/tensorrt/samples
nvidia@tegra-ubuntu$ sudo make

Python for TensorRT is not supported on Jetson yet, but support is planned.
https://devtalk.nvidia.com/default/topic/1036899/tensorrt-python-on-tx2-/

3.1 TensorRT Sample TEST

The test was run on a Jetson TX2 with JetPack installed. The samples are not compiled out of the box, so they must be built locally to produce the executables.
Network models are bundled, but they are provided as Caffe models.

trt : TensorRT

nvidia@tegra-ubuntu$  cd /usr/src/tensorrt/bin/
nvidia@tegra-ubuntu$ ls
chobj                     sample_fasterRCNN        sample_mnist            sample_nmt               sample_uff_mnist
dchobj                    sample_fasterRCNN_debug  sample_mnist_api        sample_nmt_debug         sample_uff_mnist_debug
download-digits-model.py  sample_googlenet         sample_mnist_api_debug  sample_onnx_mnist        sample_uff_ssd
giexec                    sample_googlenet_debug   sample_mnist_debug      sample_onnx_mnist_debug  sample_uff_ssd_debug
sample_char_rnn           sample_int8              sample_movielens        sample_plugin            trtexec
sample_char_rnn_debug     sample_int8_debug        sample_movielens_debug  sample_plugin_debug      trtexec_debug

nvidia@tegra-ubuntu$ ls ../data/                               // sample models and related data
char-rnn  faster-rcnn  googlenet  mnist  movielens  ssd

nvidia@tegra-ubuntu$ ./sample_mnist              
Reading Caffe prototxt: ../data/mnist/mnist.prototxt
Reading Caffe model: ../data/mnist/mnist.caffemodel

Input:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@#-:.-=@@@@@@@@@@@@@@
@@@@@%=     . *@@@@@@@@@@@@@
@@@@%  .:+%%% *@@@@@@@@@@@@@
@@@@+=#@@@@@# @@@@@@@@@@@@@@
@@@@@@@@@@@%  @@@@@@@@@@@@@@
@@@@@@@@@@@: *@@@@@@@@@@@@@@
@@@@@@@@@@- .@@@@@@@@@@@@@@@
@@@@@@@@@:  #@@@@@@@@@@@@@@@
@@@@@@@@:   +*%#@@@@@@@@@@@@
@@@@@@@%         :+*@@@@@@@@
@@@@@@@@#*+--.::     +@@@@@@
@@@@@@@@@@@@@@@@#=:.  +@@@@@
@@@@@@@@@@@@@@@@@@@@  .@@@@@
@@@@@@@@@@@@@@@@@@@@#. #@@@@
@@@@@@@@@@@@@@@@@@@@#  @@@@@
@@@@@@@@@%@@@@@@@@@@- +@@@@@
@@@@@@@@#-@@@@@@@@*. =@@@@@@
@@@@@@@@ .+%%%%+=.  =@@@@@@@
@@@@@@@@           =@@@@@@@@
@@@@@@@@*=:   :--*@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Output:

0: 
1: 
2: 
3: **********
4: 
5: 
6: 
7: 
8: 
9: 

nvidia@tegra-ubuntu$ ./sample_uff_mnist
../data/mnist/lenet5.uff



---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@+  :@@@@@@@@
@@@@@@@@@@@@@@%= :. --%@@@@@
@@@@@@@@@@@@@%. -@= - :@@@@@
@@@@@@@@@@@@@: -@@#%@@ #@@@@
@@@@@@@@@@@@: #@@@@@@@-#@@@@
@@@@@@@@@@@= #@@@@@@@@=%@@@@
@@@@@@@@@@= #@@@@@@@@@:@@@@@
@@@@@@@@@+ -@@@@@@@@@%.@@@@@
@@@@@@@@@::@@@@@@@@@@+-@@@@@
@@@@@@@@-.%@@@@@@@@@@.*@@@@@
@@@@@@@@ *@@@@@@@@@@@ *@@@@@
@@@@@@@% %@@@@@@@@@%.-@@@@@@
@@@@@@@:*@@@@@@@@@+. %@@@@@@
@@@@@@# @@@@@@@@@# .*@@@@@@@
@@@@@@# @@@@@@@@=  +@@@@@@@@
@@@@@@# @@@@@@%. .+@@@@@@@@@
@@@@@@# @@@@@*. -%@@@@@@@@@@
@@@@@@# ---    =@@@@@@@@@@@@
@@@@@@#      *%@@@@@@@@@@@@@
@@@@@@@%: -=%@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => 14.2556  : ***
1 => -4.83078  : 
2 => 1.09185  : 
3 => -6.29008  : 
4 => -0.835606  : 
5 => -6.92059  : 
6 => 2.40399  : 
7 => -6.01171  : 
8 => 0.730784  : 
9 => 1.50033  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@+ @@@@@@@@@@@@@@
@@@@@@@@@@@@. @@@@@@@@@@@@@@
@@@@@@@@@@@@- @@@@@@@@@@@@@@
@@@@@@@@@@@#  @@@@@@@@@@@@@@
@@@@@@@@@@@#  *@@@@@@@@@@@@@
@@@@@@@@@@@@  :@@@@@@@@@@@@@
@@@@@@@@@@@@= .@@@@@@@@@@@@@
@@@@@@@@@@@@#  %@@@@@@@@@@@@
@@@@@@@@@@@@% .@@@@@@@@@@@@@
@@@@@@@@@@@@%  %@@@@@@@@@@@@
@@@@@@@@@@@@%  %@@@@@@@@@@@@
@@@@@@@@@@@@@= +@@@@@@@@@@@@
@@@@@@@@@@@@@* -@@@@@@@@@@@@
@@@@@@@@@@@@@*  @@@@@@@@@@@@
@@@@@@@@@@@@@@  @@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@* @@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -5.21897  : 
1 => 14.7033  : ***
2 => -3.10811  : 
3 => -5.6187  : 
4 => 3.30519  : 
5 => -2.81663  : 
6 => -2.79249  : 
7 => 0.943604  : 
8 => 2.90335  : 
9 => -2.76499  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@*.  .*@@@@@@@@@@@
@@@@@@@@@@*.     +@@@@@@@@@@
@@@@@@@@@@. :#+   %@@@@@@@@@
@@@@@@@@@@.:@@@+  +@@@@@@@@@
@@@@@@@@@@.:@@@@: +@@@@@@@@@
@@@@@@@@@@=%@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@# +@@@@@@@@@
@@@@@@@@@@@@@@@@* +@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@* .@@@@@@@@@@
@@@@@@@@@@%**%@. *@@@@@@@@@@
@@@@@@@@%+.  .: .@@@@@@@@@@@
@@@@@@@@=  ..   :@@@@@@@@@@@
@@@@@@@@: *@@:  :@@@@@@@@@@@
@@@@@@@%  %@*    *@@@@@@@@@@
@@@@@@@%  ++  ++ .%@@@@@@@@@
@@@@@@@@-    +@@- +@@@@@@@@@
@@@@@@@@=  :*@@@# .%@@@@@@@@
@@@@@@@@@+*@@@@@%.  %@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -2.20233  : 
1 => -0.773752  : 
2 => 23.4804  : ***
3 => 3.09638  : 
4 => -4.57744  : 
5 => -5.71223  : 
6 => -5.92572  : 
7 => -0.543553  : 
8 => 4.85982  : 
9 => -9.1751  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@#-:.-=@@@@@@@@@@@@@@
@@@@@%=     . *@@@@@@@@@@@@@
@@@@%  .:+%%% *@@@@@@@@@@@@@
@@@@+=#@@@@@# @@@@@@@@@@@@@@
@@@@@@@@@@@%  @@@@@@@@@@@@@@
@@@@@@@@@@@: *@@@@@@@@@@@@@@
@@@@@@@@@@- .@@@@@@@@@@@@@@@
@@@@@@@@@:  #@@@@@@@@@@@@@@@
@@@@@@@@:   +*%#@@@@@@@@@@@@
@@@@@@@%         :+*@@@@@@@@
@@@@@@@@#*+--.::     +@@@@@@
@@@@@@@@@@@@@@@@#=:.  +@@@@@
@@@@@@@@@@@@@@@@@@@@  .@@@@@
@@@@@@@@@@@@@@@@@@@@#. #@@@@
@@@@@@@@@@@@@@@@@@@@#  @@@@@
@@@@@@@@@%@@@@@@@@@@- +@@@@@
@@@@@@@@#-@@@@@@@@*. =@@@@@@
@@@@@@@@ .+%%%%+=.  =@@@@@@@
@@@@@@@@           =@@@@@@@@
@@@@@@@@*=:   :--*@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -10.1173  : 
1 => -2.8161  : 
2 => -2.5111  : 
3 => 19.4893  : ***
4 => -2.07457  : 
5 => 6.91505  : 
6 => -2.07856  : 
7 => -0.881291  : 
8 => -0.81335  : 
9 => -7.68046  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@.*@@@@@@@@@@
@@@@@@@@@@@@@@@@.=@@@@@@@@@@
@@@@@@@@@@@@+@@@.=@@@@@@@@@@
@@@@@@@@@@@% #@@.=@@@@@@@@@@
@@@@@@@@@@@% #@@.=@@@@@@@@@@
@@@@@@@@@@@+ *@@:-@@@@@@@@@@
@@@@@@@@@@@= *@@= @@@@@@@@@@
@@@@@@@@@@@. #@@= @@@@@@@@@@
@@@@@@@@@@=  =++.-@@@@@@@@@@
@@@@@@@@@@       =@@@@@@@@@@
@@@@@@@@@@  :*## =@@@@@@@@@@
@@@@@@@@@@:*@@@% =@@@@@@@@@@
@@@@@@@@@@@@@@@% =@@@@@@@@@@
@@@@@@@@@@@@@@@# =@@@@@@@@@@
@@@@@@@@@@@@@@@# =@@@@@@@@@@
@@@@@@@@@@@@@@@* *@@@@@@@@@@
@@@@@@@@@@@@@@@= #@@@@@@@@@@
@@@@@@@@@@@@@@@= #@@@@@@@@@@
@@@@@@@@@@@@@@@=.@@@@@@@@@@@
@@@@@@@@@@@@@@@++@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -5.58382  : 
1 => -0.332037  : 
2 => -2.3609  : 
3 => 0.0268471  : 
4 => 9.68715  : ***
5 => 0.345264  : 
6 => -5.68754  : 
7 => 0.252157  : 
8 => 0.0862162  : 
9 => 4.92423  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@=   ++++#++=*@@@@@
@@@@@@@@#.            *@@@@@
@@@@@@@@=             *@@@@@
@@@@@@@@.   .. ...****%@@@@@
@@@@@@@@: .%@@#@@@@@@@@@@@@@
@@@@@@@%  -@@@@@@@@@@@@@@@@@
@@@@@@@%  -@@*@@@*@@@@@@@@@@
@@@@@@@#  :#- ::. ::=@@@@@@@
@@@@@@@-             -@@@@@@
@@@@@@%.              *@@@@@
@@@@@@#     :==*+==   *@@@@@
@@@@@@%---%%@@@@@@@.  *@@@@@
@@@@@@@@@@@@@@@@@@@+  *@@@@@
@@@@@@@@@@@@@@@@@@@=  *@@@@@
@@@@@@@@@@@@@@@@@@*   *@@@@@
@@@@@%+%@@@@@@@@%.   .%@@@@@
@@@@@*  .******=    -@@@@@@@
@@@@@*             .#@@@@@@@
@@@@@*            =%@@@@@@@@
@@@@@@%#+++=     =@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -4.68429  : 
1 => -5.85174  : 
2 => -11.9795  : 
3 => 3.46393  : 
4 => -6.07335  : 
5 => 23.6807  : ***
6 => 1.61781  : 
7 => -2.97774  : 
8 => 1.30685  : 
9 => 4.07391  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@%.:@@@@@@@@@@@@
@@@@@@@@@@@@@: *@@@@@@@@@@@@
@@@@@@@@@@@@* =@@@@@@@@@@@@@
@@@@@@@@@@@% :@@@@@@@@@@@@@@
@@@@@@@@@@@- *@@@@@@@@@@@@@@
@@@@@@@@@@# .@@@@@@@@@@@@@@@
@@@@@@@@@@: #@@@@@@@@@@@@@@@
@@@@@@@@@+ -@@@@@@@@@@@@@@@@
@@@@@@@@@: %@@@@@@@@@@@@@@@@
@@@@@@@@+ +@@@@@@@@@@@@@@@@@
@@@@@@@@:.%@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@#..:@@@@@@@@
@@@@@@@% +@@@@@-    :@@@@@@@
@@@@@@@% =@@@@%.#@@- +@@@@@@
@@@@@@@@..%@@@*+@@@@ :@@@@@@
@@@@@@@@= -%@@@@@@@@ :@@@@@@
@@@@@@@@@- .*@@@@@@+ +@@@@@@
@@@@@@@@@@+  .:-+-: .@@@@@@@
@@@@@@@@@@@@+:    :*@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => 0.409332  : 
1 => -3.60869  : 
2 => -4.52237  : 
3 => -4.49587  : 
4 => -0.557327  : 
5 => 6.62171  : 
6 => 19.9842  : ***
7 => -9.71854  : 
8 => 3.16726  : 
9 => -4.7647  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@%=#@@@@@%=%@@@@@@@@@@
@@@@@@@           %@@@@@@@@@
@@@@@@@           %@@@@@@@@@
@@@@@@@#:-#-.     %@@@@@@@@@
@@@@@@@@@@@@#    #@@@@@@@@@@
@@@@@@@@@@@@@    #@@@@@@@@@@
@@@@@@@@@@@@@:  :@@@@@@@@@@@
@@@@@@@@@%+==   *%%%%%%%%%@@
@@@@@@@@%                 -@
@@@@@@@@@#+.          .:-%@@
@@@@@@@@@@@*     :-###@@@@@@
@@@@@@@@@@@*   -%@@@@@@@@@@@
@@@@@@@@@@@*   *@@@@@@@@@@@@
@@@@@@@@@@@*   @@@@@@@@@@@@@
@@@@@@@@@@@*   #@@@@@@@@@@@@
@@@@@@@@@@@*   *@@@@@@@@@@@@
@@@@@@@@@@@*   *@@@@@@@@@@@@
@@@@@@@@@@@*   @@@@@@@@@@@@@
@@@@@@@@@@@*   @@@@@@@@@@@@@
@@@@@@@@@@@@+=#@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -6.70799  : 
1 => 0.957398  : 
2 => 3.31229  : 
3 => 2.58422  : 
4 => 3.30001  : 
5 => -3.82085  : 
6 => -6.51343  : 
7 => 16.7635  : ***
8 => -2.20583  : 
9 => -5.96497  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@%+-:  =@@@@@@@@@@@@
@@@@@@@%=      -@@@**@@@@@@@
@@@@@@@   :%#@-#@@@. #@@@@@@
@@@@@@*  +@@@@:*@@@  *@@@@@@
@@@@@@#  +@@@@ @@@%  @@@@@@@
@@@@@@@.  :%@@.@@@. *@@@@@@@
@@@@@@@@-   =@@@@. -@@@@@@@@
@@@@@@@@@%:   +@- :@@@@@@@@@
@@@@@@@@@@@%.  : -@@@@@@@@@@
@@@@@@@@@@@@@+   #@@@@@@@@@@
@@@@@@@@@@@@@@+  :@@@@@@@@@@
@@@@@@@@@@@@@@+   *@@@@@@@@@
@@@@@@@@@@@@@@: =  @@@@@@@@@
@@@@@@@@@@@@@@ :@  @@@@@@@@@
@@@@@@@@@@@@@@ -@  @@@@@@@@@
@@@@@@@@@@@@@# +@  @@@@@@@@@
@@@@@@@@@@@@@* ++  @@@@@@@@@
@@@@@@@@@@@@@*    *@@@@@@@@@
@@@@@@@@@@@@@#   =@@@@@@@@@@
@@@@@@@@@@@@@@. +@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -5.12389  : 
1 => -3.94476  : 
2 => -0.990646  : 
3 => 1.20684  : 
4 => 3.48777  : 
5 => -0.614695  : 
6 => -4.78878  : 
7 => -2.69351  : 
8 => 14.321  : ***
9 => 3.12232  : 




---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@%.-@@@@@@@@@@@
@@@@@@@@@@@*-    %@@@@@@@@@@
@@@@@@@@@@= .-.  *@@@@@@@@@@
@@@@@@@@@= +@@@  *@@@@@@@@@@
@@@@@@@@* =@@@@  %@@@@@@@@@@
@@@@@@@@..@@@@%  @@@@@@@@@@@
@@@@@@@# *@@@@-  @@@@@@@@@@@
@@@@@@@: @@@@%   @@@@@@@@@@@
@@@@@@@: @@@@-   @@@@@@@@@@@
@@@@@@@: =+*= +: *@@@@@@@@@@
@@@@@@@*.    +@: *@@@@@@@@@@
@@@@@@@@%#**#@@: *@@@@@@@@@@
@@@@@@@@@@@@@@@: -@@@@@@@@@@
@@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@@*  @@@@@@@@@@
@@@@@@@@@@@@@@@@  %@@@@@@@@@
@@@@@@@@@@@@@@@@  #@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@- +@@@@@@@@@
@@@@@@@@@@@@@@@@*:%@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -2.75228  : 
1 => -1.51535  : 
2 => -4.11729  : 
3 => 0.316925  : 
4 => 3.73423  : 
5 => -3.00593  : 
6 => -6.18866  : 
7 => -1.02671  : 
8 => 1.937  : 
9 => 14.8275  : ***

Average over 10 runs is 1.05167 ms.
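
The values printed under "--- OUTPUT ---" appear to be raw, unnormalized scores, with `***` marking the largest one. As a minimal sketch (not part of the sample itself), the predicted digit can be recovered by taking the index of the largest score, and a softmax turns the scores into probabilities. The function name `predict_digit` is hypothetical; the scores are copied from the first result above.

```python
import math

def predict_digit(scores):
    """Return (index of the largest score, softmax probabilities)."""
    peak = max(scores)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, probs

# Scores copied from the first sample_uff_mnist result above (input digit 0).
scores_digit0 = [14.2556, -4.83078, 1.09185, -6.29008, -0.835606,
                 -6.92059, 2.40399, -6.01171, 0.730784, 1.50033]

best, probs = predict_digit(scores_digit0)
print(best)  # prints 0
```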

nvidia@tegra-ubuntu$ ./sample_googlenet
Building and running a GPU inference engine for GoogleNet, N=4...
Bindings after deserializing:
Binding 0 (data): Input.
Binding 1 (prob): Output.
conv1/7x7_s2 + conv1/relu_7x7 input refo 0.378ms
conv1/7x7_s2 + conv1/relu_7x7            1.465ms
pool1/3x3_s2                             0.488ms
pool1/norm1                              0.137ms
conv2/3x3_reduce + conv2/relu_3x3_reduce 0.178ms
conv2/3x3 + conv2/relu_3x3               2.240ms
conv2/norm2                              0.415ms
pool2/3x3_s2                             0.531ms
inception_3a/1x1 + inception_3a/relu_1x1 0.275ms
inception_3a/3x3 + inception_3a/relu_3x3 0.578ms
inception_3a/5x5 + inception_3a/relu_5x5 0.134ms
inception_3a/pool                        0.245ms
inception_3a/pool_proj + inception_3a/re 0.096ms
inception_3a/1x1 copy                    0.026ms
inception_3b/1x1 + inception_3b/relu_1x1 0.561ms
inception_3b/3x3 + inception_3b/relu_3x3 1.156ms
inception_3b/5x5 + inception_3b/relu_5x5 0.613ms
inception_3b/pool                        0.140ms
inception_3b/pool_proj + inception_3b/re 0.132ms
inception_3b/1x1 copy                    0.048ms
pool3/3x3_s2                             0.247ms
inception_4a/1x1 + inception_4a/relu_1x1 0.286ms
inception_4a/3x3 + inception_4a/relu_3x3 0.279ms
inception_4a/5x5 + inception_4a/relu_5x5 0.068ms
inception_4a/pool                        0.075ms
inception_4a/pool_proj + inception_4a/re 0.076ms
inception_4a/1x1 copy                    0.020ms
inception_4b/1x1 + inception_4b/relu_1x1 0.302ms
inception_4b/3x3 + inception_4b/relu_3x3 0.423ms
inception_4b/5x5 + inception_4b/relu_5x5 0.096ms
inception_4b/pool                        0.076ms
inception_4b/pool_proj + inception_4b/re 0.081ms
inception_4b/1x1 copy                    0.017ms
inception_4c/1x1 + inception_4c/relu_1x1 0.299ms
inception_4c/3x3 + inception_4c/relu_3x3 0.408ms
inception_4c/5x5 + inception_4c/relu_5x5 0.092ms
inception_4c/pool                        0.076ms
inception_4c/pool_proj + inception_4c/re 0.082ms
inception_4c/1x1 copy                    0.014ms
inception_4d/1x1 + inception_4d/relu_1x1 0.300ms
inception_4d/3x3 + inception_4d/relu_3x3 0.042ms
inception_4d/3x3 + inception_4d/relu_3x3 0.892ms
inception_4d/3x3 + inception_4d/relu_3x3 0.080ms
inception_4d/5x5 + inception_4d/relu_5x5 0.115ms
inception_4d/pool                        0.075ms
inception_4d/pool_proj + inception_4d/re 0.081ms
inception_4d/1x1 copy                    0.012ms
inception_4e/1x1 + inception_4e/relu_1x1 0.441ms
inception_4e/3x3 + inception_4e/relu_3x3 0.578ms
inception_4e/5x5 + inception_4e/relu_5x5 0.195ms
inception_4e/pool                        0.078ms
inception_4e/pool_proj + inception_4e/re 0.137ms
inception_4e/1x1 copy                    0.025ms
pool4/3x3_s2                             0.072ms
inception_5a/1x1 + inception_5a/relu_1x1 0.196ms
inception_5a/3x3 + inception_5a/relu_3x3 0.250ms
inception_5a/5x5 + inception_5a/relu_5x5 0.074ms
inception_5a/pool                        0.044ms
inception_5a/pool_proj + inception_5a/re 0.076ms
inception_5a/1x1 copy                    0.009ms
inception_5b/1x1 + inception_5b/relu_1x1 0.279ms
inception_5b/3x3 + inception_5b/relu_3x3 0.016ms
inception_5b/3x3 + inception_5b/relu_3x3 0.749ms
inception_5b/3x3 + inception_5b/relu_3x3 0.030ms
inception_5b/5x5 + inception_5b/relu_5x5 0.104ms
inception_5b/pool                        0.053ms
inception_5b/pool_proj + inception_5b/re 0.080ms
inception_5b/1x1 copy                    0.011ms
pool5/7x7_s1                             0.059ms
loss3/classifier input reformatter 0     0.005ms
loss3/classifier                         0.022ms
prob                                     0.009ms
Time over all layers: 18.039
Done.
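
To find which layers dominate the run time, the per-layer lines above can be parsed with a short script. This is only a sketch: it assumes each profile line ends with a time followed by `ms`, as in the output above, and the names `parse_profile` and `LINE_RE` are hypothetical.

```python
import re

# Matches "<layer name><spaces><time>ms", the shape of the profile lines above.
LINE_RE = re.compile(r"^(.*\S)\s+(\d+\.\d+)ms$")

def parse_profile(text):
    """Return a list of (layer_name, milliseconds) from profiler output."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append((match.group(1), float(match.group(2))))
    return rows

# A few lines copied from the sample_googlenet output above.
sample = """\
conv1/7x7_s2 + conv1/relu_7x7            1.465ms
pool1/3x3_s2                             0.488ms
conv2/3x3 + conv2/relu_3x3               2.240ms
prob                                     0.009ms
Time over all layers: 18.039
"""

rows = parse_profile(sample)
slowest = max(rows, key=lambda row: row[1])
print(slowest[0])  # prints "conv2/3x3 + conv2/relu_3x3"
```

Lines without a trailing `ms`, such as the "Time over all layers" summary, are skipped by the pattern.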

As shown above, these simple tests run easily and work with TensorRT alone.

NVIDIA DLA (Deep Learning Accelerator)

Searching online, it is described as a TensorRT accelerator. giexec appeared first and trtexec came later, and their functionality seems to be nearly identical.

- TensorRT (previously known as GPU Inference Engine (GIE))

nvidia@tegra-ubuntu$ ./trtexec     //tensorRT exec 

Mandatory params:
  --deploy=      Caffe deploy file
  OR --uff=      UFF file
  --output=      Output blob name (can be specified multiple times)

Mandatory params for onnx:
  --onnx=        ONNX Model file

Optional params:
  --uffInput=,C,H,W Input blob names along with their dimensions for UFF parser
  --model=       Caffe model file (default = no model, random weights used)
  --batch=N            Set batch size (default = 1)
  --device=N           Set cuda device to N (default = 0)
  --iterations=N       Run N iterations (default = 10)
  --avgRuns=N          Set avgRuns to N - perf is measured as an average of avgRuns (default=10)
  --percentile=P       For each iteration, report the percentile time at P percentage (0
Generate a serialized TensorRT engine
  --calib=       Read INT8 calibration cache file.  Currently no support for ONNX model.

nvidia@tegra-ubuntu$ ./giexec

Mandatory params:
  --deploy=      Caffe deploy file
  OR --uff=      UFF file
  --output=      Output blob name (can be specified multiple times)

Mandatory params for onnx:
  --onnx=        ONNX Model file

Optional params:
  --uffInput=,C,H,W Input blob names along with their dimensions for UFF parser
  --model=       Caffe model file (default = no model, random weights used)
  --batch=N            Set batch size (default = 1)
  --device=N           Set cuda device to N (default = 0)
  --iterations=N       Run N iterations (default = 10)
  --avgRuns=N          Set avgRuns to N - perf is measured as an average of avgRuns (default=10)
  --percentile=P       For each iteration, report the percentile time at P percentage (0

Generate a serialized TensorRT engine
  --calib=       Read INT8 calibration cache file.  Currently no support for ONNX model.
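
Going by the help text above, both tools take the same flags, so a command line for either one can be assembled the same way. The following is a sketch only: `build_exec_args` is a hypothetical helper, the paths are the MNIST files listed earlier, and the output blob name `prob` is an assumption about that prototxt. The script only prints the command; it does not run it.

```python
import shlex

def build_exec_args(binary, deploy, outputs, model=None, batch=1):
    """Assemble an argument list from flags shown in the help text above."""
    args = [binary, "--deploy=" + deploy]
    if model is not None:
        # Without --model the tool uses random weights, per the help text.
        args.append("--model=" + model)
    for name in outputs:
        # --output can be specified multiple times.
        args.append("--output=" + name)
    args.append("--batch=" + str(batch))
    return args

args = build_exec_args(
    "./trtexec",
    "../data/mnist/mnist.prototxt",
    ["prob"],  # assumed output blob name
    model="../data/mnist/mnist.caffemodel",
)
print(" ".join(shlex.quote(a) for a in args))
```

Passing `"./giexec"` as the first argument yields the equivalent giexec command; the list can also be handed to `subprocess.run` when executed on the device.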

NVIDIA DLA (Deep Learning Accelerator)
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#dla_topic

NVIDIA CUDA Example

nvidia@tegra-ubuntu$ ls /home/nvidia/NVIDIA_CUDA-9.0_Samples/bin/aarch64/linux/release
alignedTypes           conjugateGradientPrecond     fp16ScalarProduct       mergeSort                simpleCubemapTexture       simpleTexture_kernel64.ptx
asyncAPI               conjugateGradientUM          freeImageInteropNPP     MersenneTwisterGP11213   simpleCUBLAS               simpleVoteIntrinsics
bandwidthTest          convolutionFFT2D             FunctionPointers        MonteCarloMultiGPU       simpleCUBLASXT             simpleZeroCopy
batchCUBLAS            convolutionSeparable         histEqualizationNPP     nbody                    simpleCUDA2GL              smokeParticles
BiCGStab               convolutionTexture           histogram               newdelete                simpleCUFFT                SobelFilter
bicubicTexture         cppIntegration               HSOpticalFlow           oceanFFT                 simpleCUFFT_2d_MGPU        SobolQRNG
bilateralFilter        cppOverload                  imageDenoising          p2pBandwidthLatencyTest  simpleCUFFT_MGPU           sortingNetworks
bindlessTexture        cudaOpenMP                   inlinePTX               particles                simpleDevLibCUBLAS         stereoDisparity
binomialOptions        cuSolverDn_LinearSolver      interval                postProcessGL            simpleGL                   template
BlackScholes           cuSolverRf                   jpegNPP                 ptxjit                   simpleHyperQ               threadFenceReduction
boxFilter              cuSolverSp_LinearSolver      lineOfSight             quasirandomGenerator     simpleLayeredTexture       threadMigration
boxFilterNPP           cuSolverSp_LowlevelCholesky  Mandelbrot              radixSortThrust          simpleMultiCopy            threadMigration_kernel64.ptx
c++11_cuda             cuSolverSp_LowlevelQR        marchingCubes           randomFog                simpleMultiGPU             transpose
cannyEdgeDetectorNPP   dct8x8                       matrixMul               recursiveGaussian        simpleOccupancy            UnifiedMemoryStreams
cdpAdvancedQuicksort   deviceQuery                  matrixMulCUBLAS         reduction                simplePitchLinearTexture   vectorAdd
cdpBezierTessellation  deviceQueryDrv               matrixMulDrv            scalarProd               simplePrintf               vectorAddDrv
cdpLUDecomposition     dwtHaar1D                    matrixMulDynlinkJIT     scan                     simpleSeparateCompilation  vectorAdd_kernel64.ptx
cdpQuadtree            dxtc                         matrixMul_kernel64.ptx  segmentationTreeThrust   simpleStreams              volumeFiltering
cdpSimplePrint         eigenvalues                  MC_EstimatePiInlineP    shfl_scan                simpleSurfaceWrite         volumeRender
cdpSimpleQuicksort     fastWalshTransform           MC_EstimatePiInlineQ    simpleAssert             simpleTemplates            warpAggregatedAtomicsCG
clock                  FDTD3d                       MC_EstimatePiP          simpleAtomicIntrinsics   simpleTexture
concurrentKernels      FilterBorderControlNPP       MC_EstimatePiQ          simpleCallback           simpleTexture3D
conjugateGradient      fluidsGL                     MC_SingleAsianOptionP   simpleCooperativeGroups  simpleTextureDrv

https://tm3.ghost.io/2018/07/06/setting-up-the-nvidia-jetson-tx2/

Multimedia and TensorRT

As with YOLO, passing cars can be detected easily.

nvidia@tegra-ubuntu:$ cd ~/tegra_multimedia_api/samples/
nvidia@tegra-ubuntu:$ ls
00_video_decode  02_video_dec_cuda  04_video_dec_trt  06_jpeg_decode    08_video_dec_drm        10_camera_recording  13_multi_camera  common    Rules.mk
01_video_encode  03_video_cuda_enc  05_jpeg_encode    07_video_convert  09_camera_jpeg_capture  12_camera_v4l2_cuda  backend          frontend  v4l2cuda

nvidia@tegra-ubuntu:$ cd backend
nvidia@tegra-ubuntu:$ ./backend 1 ../../data/Video/sample_outdoor_car_1080p_10fps.h264 H264 --trt-deployfile ../../data/Model/GoogleNet_one_class/GoogleNet_modified_oneClass_halfHD.prototxt --trt-modelfile ../../data/Model/GoogleNet_one_class/GoogleNet_modified_oneClass_halfHD.caffemodel --trt-forcefp32 0 --trt-proc-interval 1 -fps 10
// Run under X Window, after connecting HDMI

https://devtalk.nvidia.com/default/topic/1027851/jetson-tx2/jetpack-3-2-tegra_multimedia_api-backend-sample-won-t-run/
https://www.youtube.com/watch?v=D7lkth34rgM

OpenCV4Tegra

This is an OpenCV build that uses CUDA, and it seems it has to be installed separately. It would be worth finding a way to check whether it is installed.
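
One possible check, sketched here under assumptions, is to read OpenCV's build information from Python and look for a CUDA line that reports YES. `cv2.getBuildInformation()` is a standard OpenCV call, but the exact wording of the CUDA line differs between OpenCV versions, so the string matching below is a heuristic; the helper names are hypothetical.

```python
def cuda_lines(build_info):
    """Return the build-information lines that mention CUDA."""
    return [ln.strip() for ln in build_info.splitlines() if "cuda" in ln.lower()]

def built_with_cuda(build_info):
    """Heuristic: True if any CUDA-related line reports YES."""
    return any("YES" in ln.upper() for ln in cuda_lines(build_info))

if __name__ == "__main__":
    try:
        import cv2
    except ImportError:
        print("OpenCV Python bindings (cv2) are not installed")
    else:
        info = cv2.getBuildInformation()
        for line in cuda_lines(info):
            print(line)
        print("CUDA build:", built_with_cuda(info))
```

If the result is False, the installed OpenCV was most likely compiled without CUDA support, which matches the error discussed in the forum threads linked below.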

  https://jkjung-avt.github.io/opencv3-on-tx2/
  https://github.com/jetsonhacks/buildOpenCVTX2
  https://www.youtube.com/watch?v=gvmP0WRVUxI

  https://devtalk.nvidia.com/default/topic/822903/jetson-tk1/opencv4tegra-libraries/
  https://devtalk.nvidia.com/default/topic/1043074/jetson-tx2/how-to-download-and-install-opencv4tegra/
  https://devtalk.nvidia.com/default/topic/1042056/jetson-tx2/jetpack-3-3-opencv-error-no-cuda-support-the-library-is-compiled-without-cuda-support-/post/5285618/#5285618

Running YOLO on Jetson TX2 and its performance

https://jkjung-avt.github.io/yolov3/
