10/25/2019

NVIDIA Docker SSD Training Analysis (2nd pass)

1.  NVIDIA Object Detection SSD Docker    

NVIDIA's TensorFlow deep learning examples are currently provided on GitHub; check each of the sites below.

  • GitHub NVIDIA DeepLearning SSD site 
This walkthrough is based on the README.md of the SSD GitHub repository below; follow along with it.
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

Using the source above, the NVIDIA TensorFlow Object Detection SSD Docker setup can be built and tested easily.

  • Other GitHub DeepLearning Examples 
Other NVIDIA DeepLearning Examples exist at the link below and are worth a look (not yet tested).
  https://github.com/NVIDIA/DeepLearningExamples



  • Other reference sites 

NVIDIA's introduction to mixed-precision training for each framework
  https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

TensorFlow Guide on the TensorFlow site
  https://www.tensorflow.org/tutorials?hl=ko


1.1 NVIDIA SSD Docker Quick Guide 

Following the README.md, the steps below make it easy to train the SSD object detection model with Docker (based on the COCO dataset).

  • Quick Guide 1. Clone the repository
$ git clone https://github.com/NVIDIA/DeepLearningExamples
$ cd DeepLearningExamples/TensorFlow/Detection/SSD


  • Quick Guide 2. Build the SSD320 v1.2 TensorFlow NGC container.
$ docker build . -t nvidia_ssd 

Running the above builds a new Docker image from the Dockerfile.
An image can also be created directly from a running container with docker commit.

  • Quick Guide  3. Download and preprocess the dataset. (COCO 2017)
$ ./download_all.sh nvidia_ssd /home/jhlee/works/ssd/data /home/jhlee/works/ssd/check 

  • Quick Guide 4. Launch the NGC container to run training/inference.
$ nvidia-docker run --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
nvidia_ssd 

  • Quick Guide  5. Start training.
root@c7550d6b2c59:/workdir/models/research#  bash ./examples/SSD320_FP16_1GPU.sh /checkpoints  

  • Quick Guide  6. Start validation/evaluation.
root@c7550d6b2c59:/workdir/models/research#   bash examples/SSD320_evaluate.sh /checkpoints  


2. Running and Analyzing Object Detection SSD 


2.1  Running and Analyzing Quick Guide 1~2 

This step installs the required packages on top of nvcr.io/nvidia/tensorflow:19.05-py3 and creates a new Docker image.

  • Run directly on the HOST as below 
Run the commands exactly as given in the GitHub repository above.

$ cd ~/works
$ mkdir ssd 
$ cd ssd
$ mkdir data
$ mkdir check 
$ git clone https://github.com/NVIDIA/DeepLearningExamples
$ cd DeepLearningExamples/TensorFlow/Detection/SSD
$ ls 
configs  Dockerfile  download_all.sh  examples  img  models  NOTICE  README.md  requirements.txt 



  • Create the Docker image 
The Dockerfile installs the required packages on top of nvcr.io/nvidia/tensorflow:19.05-py3, and the image is then created.

$ docker build . -t nvidia_ssd    // build the image from the Dockerfile 


  • Analyzing and understanding the Dockerfile above 
To understand the Dockerfile below, the current directory location matters.

$ pwd 
/home/jhlee/works/ssd/DeepLearningExamples/TensorFlow/Detection/SSD
$ cat Dockerfile 
FROM nvcr.io/nvidia/tensorflow:19.05-py3 as base

FROM base as sha

RUN mkdir /sha
RUN cat `cat HEAD | cut -d' ' -f2` > /sha/repo_sha

FROM base as final

WORKDIR /workdir

RUN PROTOC_VERSION=3.0.0 && \
    PROTOC_ZIP=protoc-${PROTOC_VERSION}-linux-x86_64.zip && \
    curl -OL https://github.com/google/protobuf/releases/download/v$PROTOC_VERSION/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    rm -f $PROTOC_ZIP

COPY requirements.txt .
RUN pip install Cython
RUN pip install -r requirements.txt

WORKDIR models/research/
COPY models/research/ .
RUN protoc object_detection/protos/*.proto --python_out=.
ENV PYTHONPATH="/workdir/models/research/:/workdir/models/research/slim/:$PYTHONPATH"

COPY examples/ examples
COPY configs/ configs/
COPY download_all.sh download_all.sh

COPY --from=sha /sha .  


  • Google Protocol Buffers
The post below explains this part in detail.
  https://bcho.tistory.com/1182

  • Dockerfile
Also published as a book; an easy way to learn how to write and use a Dockerfile.
  http://pyrasis.com/docker.html

Files at the current location above
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

  • Check the created Docker image 
$ docker images   
REPOSITORY                  TAG                             IMAGE ID            CREATED             SIZE
nvidia_ssd                  latest                          ab529215f717        5 minutes ago       6.97GB
<none>                      <none>                          a6bc644c75ed        6 minutes ago       6.96GB  // intermediate image created while building nvidia_ssd
nvcr.io/nvidia/tensorflow   19.08-py3                       be978d32a5c3        8 weeks ago         7.35GB
nvcr.io/nvidia/cuda         10.1-cudnn7-devel-ubuntu18.04   0ead98c22e04        8 weeks ago         3.67GB
nvidia/cuda                 9.0-devel                       2a64416134d8        8 weeks ago         2.03GB
nvcr.io/nvidia/cuda         10.1-devel-ubuntu18.04          946e78c7b298        8 weeks ago         2.83GB
nvidia/cuda                 10.1-base                       a5f5d3b655ca        8 weeks ago         106MB
nvcr.io/nvidia/tensorflow   19.05-py3                       01c8c4b0d7ff        5 months ago        6.96GB


Building nvidia_ssd requires both the <none> intermediate image and nvcr.io/nvidia/tensorflow:19.05-py3.


2.2  Running and Analyzing Quick Guide 3

Quick Guide 3 downloads the COCO dataset and creates TFRecord files from it.

  • COCO dataset download and TFRecord creation (download_all.sh)
This shell script runs on the host and needs the two directories below on the host.
Its main roles are downloading the COCO dataset and generating TFRecords from it.
  1. /data/coco2017_tfrecords : where the COCO data and the TFRecords are stored 
  2. /checkpoints : TensorFlow checkpoint files; this part is covered separately. 

Run on the HOST
$ ./download_all.sh nvidia_ssd /home/jhlee/works/ssd/data /home/jhlee/works/ssd/check 


$ cat ./download_all.sh // basic analysis: almost the same as before, but the shell below runs inside the container; this replaces the previous dataset download step 

if [ -z $1 ]; then echo "Docker container name is missing" && exit 1; fi
## 1st ARG : CONTAINER NAME
## 2nd ARG : BASE PATH /data/coco2017_tfrecords
## 3rd ARG : BASE PATH /checkpoints 
CONTAINER=$1
COCO_DIR=${2:-"/data/coco2017_tfrecords"}
CHECKPOINT_DIR=${3:-"/checkpoints"}
mkdir -p $COCO_DIR
chmod 777 $COCO_DIR
# Download backbone checkpoint
mkdir -p $CHECKPOINT_DIR
chmod 777 $CHECKPOINT_DIR
cd $CHECKPOINT_DIR
wget http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz
tar -xzf resnet_v1_50_2016_08_28.tar.gz
mkdir -p resnet_v1_50
mv resnet_v1_50.ckpt resnet_v1_50/model.ckpt
## Works with nvidia-docker/docker; the script below starts the Docker container and immediately runs a bash script inside it 
## download_and_preprocess_mscoco.sh inside the container downloads COCO 2017 and then creates the TFRecords as below 
nvidia-docker run --rm -it -u 123 -v $COCO_DIR:/data/coco2017_tfrecords $CONTAINER bash -c '
# Create TFRecords
bash /workdir/models/research/object_detection/dataset_tools/download_and_preprocess_mscoco.sh \
    /data/coco2017_tfrecords'



  • Analysis of download_and_preprocess_mscoco.sh
This is the actual shell script executed inside the container; to analyze it, the Docker container has to be run.
  1. Download COCO 2017 (including the annotations)
  2. Create TFRecords from the dataset 

To analyze the shell script, start the container simply as below.

$ nvidia-docker run --rm -it -u 123 -v $HOME/works/ssd/data:/data/coco2017_tfrecords nvidia_ssd 
================
== TensorFlow ==
================

NVIDIA Release 19.05 (build 6390160)
TensorFlow Version 1.13.1
.....

I have no name!@a4891a3ac177:/workdir/models/research$ cat object_detection/dataset_tools/download_and_preprocess_mscoco.sh 
#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "usage download_and_preprocess_mscoco.sh [data dir]"
  exit
fi

if [ "$(uname)" == "Darwin" ]; then
  UNZIP="tar -xf"
else
  UNZIP="unzip -nq"
fi

# Create the output directories.
OUTPUT_DIR="${1%/}"
SCRATCH_DIR="${OUTPUT_DIR}/raw-data"
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${SCRATCH_DIR}"
CURRENT_DIR=$(pwd)

# Helper function to download and unpack a .zip file.
function download_and_unzip() {
  local BASE_URL=${1}
  local FILENAME=${2}

  if [ ! -f ${FILENAME} ]; then
    echo "Downloading ${FILENAME} to $(pwd)"
    wget -nd -c "${BASE_URL}/${FILENAME}"
  else
    echo "Skipping download of ${FILENAME}"
  fi
  echo "Unzipping ${FILENAME}"
  ${UNZIP} ${FILENAME}
}

cd ${SCRATCH_DIR}

## As the name says, this downloads the COCO set; a lot of images are needed 
## (the dataset itself deserves a closer look)

# Download the images.     
BASE_IMAGE_URL="http://images.cocodataset.org/zips"

TRAIN_IMAGE_FILE="train2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TRAIN_IMAGE_FILE}
TRAIN_IMAGE_DIR="${SCRATCH_DIR}/train2017"

VAL_IMAGE_FILE="val2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${VAL_IMAGE_FILE}
VAL_IMAGE_DIR="${SCRATCH_DIR}/val2017"

TEST_IMAGE_FILE="test2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TEST_IMAGE_FILE}
TEST_IMAGE_DIR="${SCRATCH_DIR}/test2017"

## Downloads the annotations; there are quite a few kinds, so the role of each part of the dataset should also be understood  

# Download the annotations.
BASE_INSTANCES_URL="http://images.cocodataset.org/annotations"
INSTANCES_FILE="annotations_trainval2017.zip"
download_and_unzip ${BASE_INSTANCES_URL} ${INSTANCES_FILE}

#
# Train and validation use only instances_train2017.json / instances_val2017.json from the annotations 
#
TRAIN_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/instances_train2017.json"
VAL_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/instances_val2017.json"

# Download the test image info.
BASE_IMAGE_INFO_URL="http://images.cocodataset.org/annotations"
IMAGE_INFO_FILE="image_info_test2017.zip"
download_and_unzip ${BASE_IMAGE_INFO_URL} ${IMAGE_INFO_FILE}

#
# For TEST, image_info_test-dev2017.json is used from the annotations 
#
TESTDEV_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/image_info_test-dev2017.json"

# Build TFRecords of the image data.
cd "${CURRENT_DIR}"
python object_detection/dataset_tools/create_coco_tf_record.py \
  --logtostderr \
  --include_masks \
  --train_image_dir="${TRAIN_IMAGE_DIR}" \
  --val_image_dir="${VAL_IMAGE_DIR}" \
  --test_image_dir="${TEST_IMAGE_DIR}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --output_dir="${OUTPUT_DIR}"


The above uses dataset_tools/create_coco_tf_record.py to create the TFRecord format.
If the dataset changes, refer to dataset_tools.

Preparing Inputs (shows the settings for other datasets)
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/preparing_inputs.md


root@c7550d6b2c59:/workdir/models/research# ls object_detection/dataset_tools/
__init__.py                    create_kitti_tf_record.py       create_pascal_tf_record.py       create_pycocotools_package.sh         oid_hierarchical_labels_expansion_test.py  tf_record_creation_util.py
create_coco_tf_record.py       create_kitti_tf_record_test.py  create_pascal_tf_record_test.py  download_and_preprocess_mscoco.sh     oid_tfrecord_creation.py                   tf_record_creation_util_test.py
create_coco_tf_record_test.py  create_oid_tf_record.py         create_pet_tf_record.py          oid_hierarchical_labels_expansion.py  oid_tfrecord_creation_test.py

## As seen below, COCO annotations use the JSON format 
root@c7550d6b2c59:/workdir/models/research# cat object_detection/dataset_tools/create_coco_tf_record.py 
r"""Convert raw COCO dataset to TFRecord for object_detection.

Please note that this tool creates sharded output files.

Example usage:
    python create_coco_tf_record.py --logtostderr \
      --train_image_dir="${TRAIN_IMAGE_DIR}" \
      --val_image_dir="${VAL_IMAGE_DIR}" \
      --test_image_dir="${TEST_IMAGE_DIR}" \
      --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
      --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
      --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
      --output_dir="${OUTPUT_DIR}"
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import hashlib
import io
import json
import os
import contextlib2
import numpy as np
import PIL.Image

from pycocotools import mask
import tensorflow as tf

from object_detection.dataset_tools import tf_record_creation_util
from object_detection.utils import dataset_util
from object_detection.utils import label_map_util


flags = tf.app.flags
tf.flags.DEFINE_boolean('include_masks', False,
                        'Whether to include instance segmentations masks '
                        '(PNG encoded) in the result. default: False.')
tf.flags.DEFINE_string('train_image_dir', '',
                       'Training image directory.')
tf.flags.DEFINE_string('val_image_dir', '',
                       'Validation image directory.')
tf.flags.DEFINE_string('test_image_dir', '',
                       'Test image directory.')
tf.flags.DEFINE_string('train_annotations_file', '',
                       'Training annotations JSON file.')
tf.flags.DEFINE_string('val_annotations_file', '',
                       'Validation annotations JSON file.')
tf.flags.DEFINE_string('testdev_annotations_file', '',
                       'Test-dev annotations JSON file.')
tf.flags.DEFINE_string('output_dir', '/tmp/', 'Output data directory.')

FLAGS = flags.FLAGS

tf.logging.set_verbosity(tf.logging.INFO)


def create_tf_example(image,
                      annotations_list,
                      image_dir,
                      category_index,
                      include_masks=False):
  """Converts image and annotations to a tf.Example proto.

  Args:
    image: dict with keys:
      [u'license', u'file_name', u'coco_url', u'height', u'width',
      u'date_captured', u'flickr_url', u'id']
    annotations_list:
      list of dicts with keys:
      [u'segmentation', u'area', u'iscrowd', u'image_id',
      u'bbox', u'category_id', u'id']
      Notice that bounding box coordinates in the official COCO dataset are
      given as [x, y, width, height] tuples using absolute coordinates where
      x, y represent the top-left (0-indexed) corner.  This function converts
      to the format expected by the Tensorflow Object Detection API (which is
      which is [ymin, xmin, ymax, xmax] with coordinates normalized relative
      to image size).
    image_dir: directory containing the image files.
    category_index: a dict containing COCO category information keyed
      by the 'id' field of each category.  See the
      label_map_util.create_category_index function.
    include_masks: Whether to include instance segmentations masks
      (PNG encoded) in the result. default: False.
  Returns:
    example: The converted tf.Example
    num_annotations_skipped: Number of (invalid) annotations that were ignored.

  Raises:
    ValueError: if the image pointed to by data['filename'] is not a valid JPEG
  """
  image_height = image['height']
  image_width = image['width']
  filename = image['file_name']
  image_id = image['id']

  full_path = os.path.join(image_dir, filename)
  with tf.gfile.GFile(full_path, 'rb') as fid:
    encoded_jpg = fid.read()
  encoded_jpg_io = io.BytesIO(encoded_jpg)
  image = PIL.Image.open(encoded_jpg_io)
  key = hashlib.sha256(encoded_jpg).hexdigest()

  xmin = []
  xmax = []
  ymin = []
  ymax = []
  is_crowd = []
  category_names = []
  category_ids = []
  area = []
  encoded_mask_png = []
  num_annotations_skipped = 0
  for object_annotations in annotations_list:
    (x, y, width, height) = tuple(object_annotations['bbox'])
    if width <= 0 or height <= 0:
      num_annotations_skipped += 1
      continue
    if x + width > image_width or y + height > image_height:
      num_annotations_skipped += 1
      continue
    xmin.append(float(x) / image_width)
    xmax.append(float(x + width) / image_width)
    ymin.append(float(y) / image_height)
    ymax.append(float(y + height) / image_height)
    is_crowd.append(object_annotations['iscrowd'])
    category_id = int(object_annotations['category_id'])
    category_ids.append(category_id)
    category_names.append(category_index[category_id]['name'].encode('utf8'))
    area.append(object_annotations['area'])

    if include_masks:
      run_len_encoding = mask.frPyObjects(object_annotations['segmentation'],
                                          image_height, image_width)
      binary_mask = mask.decode(run_len_encoding)
      if not object_annotations['iscrowd']:
        binary_mask = np.amax(binary_mask, axis=2)
      pil_image = PIL.Image.fromarray(binary_mask)
      output_io = io.BytesIO()
      pil_image.save(output_io, format='PNG')
      encoded_mask_png.append(output_io.getvalue())
  feature_dict = {
      'image/height':
          dataset_util.int64_feature(image_height),
      'image/width':
          dataset_util.int64_feature(image_width),
      'image/filename':
          dataset_util.bytes_feature(filename.encode('utf8')),
      'image/source_id':
          dataset_util.bytes_feature(str(image_id).encode('utf8')),
      'image/key/sha256':
          dataset_util.bytes_feature(key.encode('utf8')),
      'image/encoded':
          dataset_util.bytes_feature(encoded_jpg),
      'image/format':
          dataset_util.bytes_feature('jpeg'.encode('utf8')),
      'image/object/bbox/xmin':
          dataset_util.float_list_feature(xmin),
      'image/object/bbox/xmax':
          dataset_util.float_list_feature(xmax),
      'image/object/bbox/ymin':
          dataset_util.float_list_feature(ymin),
      'image/object/bbox/ymax':
          dataset_util.float_list_feature(ymax),
      'image/object/class/text':
          dataset_util.bytes_list_feature(category_names),
      'image/object/is_crowd':
          dataset_util.int64_list_feature(is_crowd),
      'image/object/area':
          dataset_util.float_list_feature(area),
  }
  if include_masks:
    feature_dict['image/object/mask'] = (
        dataset_util.bytes_list_feature(encoded_mask_png))
  example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
  return key, example, num_annotations_skipped


def _create_tf_record_from_coco_annotations(
    annotations_file, image_dir, output_path, include_masks, num_shards):
  """Loads COCO annotation json files and converts to tf.Record format.

  Args:
    annotations_file: JSON file containing bounding box annotations.
    image_dir: Directory containing the image files.
    output_path: Path to output tf.Record file.
    include_masks: Whether to include instance segmentations masks
      (PNG encoded) in the result. default: False.
    num_shards: number of output file shards.
  """
  with contextlib2.ExitStack() as tf_record_close_stack, \
      tf.gfile.GFile(annotations_file, 'r') as fid:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_path, num_shards)
    groundtruth_data = json.load(fid)
    images = groundtruth_data['images']
    category_index = label_map_util.create_category_index(
        groundtruth_data['categories'])

    annotations_index = {}
    if 'annotations' in groundtruth_data:
      tf.logging.info(
          'Found groundtruth annotations. Building annotations index.')
      for annotation in groundtruth_data['annotations']:
        image_id = annotation['image_id']
        if image_id not in annotations_index:
          annotations_index[image_id] = []
        annotations_index[image_id].append(annotation)
    missing_annotation_count = 0
    for image in images:
      image_id = image['id']
      if image_id not in annotations_index:
        missing_annotation_count += 1
        annotations_index[image_id] = []
    tf.logging.info('%d images are missing annotations.',
                    missing_annotation_count)

    total_num_annotations_skipped = 0
    for idx, image in enumerate(images):
      if idx % 100 == 0:
        tf.logging.info('On image %d of %d', idx, len(images))
      annotations_list = annotations_index[image['id']]
      _, tf_example, num_annotations_skipped = create_tf_example(
          image, annotations_list, image_dir, category_index, include_masks)
      total_num_annotations_skipped += num_annotations_skipped
      shard_idx = idx % num_shards
      output_tfrecords[shard_idx].write(tf_example.SerializeToString())
    tf.logging.info('Finished writing, skipped %d annotations.',
                    total_num_annotations_skipped)


def main(_):
  assert FLAGS.train_image_dir, '`train_image_dir` missing.'
  assert FLAGS.val_image_dir, '`val_image_dir` missing.'
  assert FLAGS.test_image_dir, '`test_image_dir` missing.'
  assert FLAGS.train_annotations_file, '`train_annotations_file` missing.'
  assert FLAGS.val_annotations_file, '`val_annotations_file` missing.'
  assert FLAGS.testdev_annotations_file, '`testdev_annotations_file` missing.'

  if not tf.gfile.IsDirectory(FLAGS.output_dir):
    tf.gfile.MakeDirs(FLAGS.output_dir)
  train_output_path = os.path.join(FLAGS.output_dir, 'coco_train.record')
  val_output_path = os.path.join(FLAGS.output_dir, 'coco_val.record')
  testdev_output_path = os.path.join(FLAGS.output_dir, 'coco_testdev.record')

  _create_tf_record_from_coco_annotations(
      FLAGS.train_annotations_file,
      FLAGS.train_image_dir,
      train_output_path,
      FLAGS.include_masks,
      num_shards=100)
  _create_tf_record_from_coco_annotations(
      FLAGS.val_annotations_file,
      FLAGS.val_image_dir,
      val_output_path,
      FLAGS.include_masks,
      num_shards=10)
  _create_tf_record_from_coco_annotations(
      FLAGS.testdev_annotations_file,
      FLAGS.test_image_dir,
      testdev_output_path,
      FLAGS.include_masks,
      num_shards=100)


if __name__ == '__main__':
  tf.app.run()



  • Checking the generated TFRecords 
  1. coco_train.record : configured in the pipeline config
  2. coco_val.record  : configured in the pipeline config
  3. coco_testdev.record  : not yet confirmed whether it is actually used 

root@4b038f3383f2:/workdir/models/research# ls /data/coco2017_tfrecords/
annotation                          coco_testdev.record-00035-of-00100  coco_testdev.record-00071-of-00100  coco_train.record-00007-of-00100  coco_train.record-00043-of-00100  coco_train.record-00079-of-00100
coco_testdev.record-00000-of-00100  coco_testdev.record-00036-of-00100  coco_testdev.record-00072-of-00100  coco_train.record-00008-of-00100  coco_train.record-00044-of-00100  coco_train.record-00080-of-00100
coco_testdev.record-00001-of-00100  coco_testdev.record-00037-of-00100  coco_testdev.record-00073-of-00100  coco_train.record-00009-of-00100  coco_train.record-00045-of-00100  coco_train.record-00081-of-00100
coco_testdev.record-00002-of-00100  coco_testdev.record-00038-of-00100  coco_testdev.record-00074-of-00100  coco_train.record-00010-of-00100  coco_train.record-00046-of-00100  coco_train.record-00082-of-00100
coco_testdev.record-00003-of-00100  coco_testdev.record-00039-of-00100  coco_testdev.record-00075-of-00100  coco_train.record-00011-of-00100  coco_train.record-00047-of-00100  coco_train.record-00083-of-00100
coco_testdev.record-00004-of-00100  coco_testdev.record-00040-of-00100  coco_testdev.record-00076-of-00100  coco_train.record-00012-of-00100  coco_train.record-00048-of-00100  coco_train.record-00084-of-00100
coco_testdev.record-00005-of-00100  coco_testdev.record-00041-of-00100  coco_testdev.record-00077-of-00100  coco_train.record-00013-of-00100  coco_train.record-00049-of-00100  coco_train.record-00085-of-00100
coco_testdev.record-00006-of-00100  coco_testdev.record-00042-of-00100  coco_testdev.record-00078-of-00100  coco_train.record-00014-of-00100  coco_train.record-00050-of-00100  coco_train.record-00086-of-00100
coco_testdev.record-00007-of-00100  coco_testdev.record-00043-of-00100  coco_testdev.record-00079-of-00100  coco_train.record-00015-of-00100  coco_train.record-00051-of-00100  coco_train.record-00087-of-00100
coco_testdev.record-00008-of-00100  coco_testdev.record-00044-of-00100  coco_testdev.record-00080-of-00100  coco_train.record-00016-of-00100  coco_train.record-00052-of-00100  coco_train.record-00088-of-00100
coco_testdev.record-00009-of-00100  coco_testdev.record-00045-of-00100  coco_testdev.record-00081-of-00100  coco_train.record-00017-of-00100  coco_train.record-00053-of-00100  coco_train.record-00089-of-00100
coco_testdev.record-00010-of-00100  coco_testdev.record-00046-of-00100  coco_testdev.record-00082-of-00100  coco_train.record-00018-of-00100  coco_train.record-00054-of-00100  coco_train.record-00090-of-00100
coco_testdev.record-00011-of-00100  coco_testdev.record-00047-of-00100  coco_testdev.record-00083-of-00100  coco_train.record-00019-of-00100  coco_train.record-00055-of-00100  coco_train.record-00091-of-00100
coco_testdev.record-00012-of-00100  coco_testdev.record-00048-of-00100  coco_testdev.record-00084-of-00100  coco_train.record-00020-of-00100  coco_train.record-00056-of-00100  coco_train.record-00092-of-00100
coco_testdev.record-00013-of-00100  coco_testdev.record-00049-of-00100  coco_testdev.record-00085-of-00100  coco_train.record-00021-of-00100  coco_train.record-00057-of-00100  coco_train.record-00093-of-00100
coco_testdev.record-00014-of-00100  coco_testdev.record-00050-of-00100  coco_testdev.record-00086-of-00100  coco_train.record-00022-of-00100  coco_train.record-00058-of-00100  coco_train.record-00094-of-00100
coco_testdev.record-00015-of-00100  coco_testdev.record-00051-of-00100  coco_testdev.record-00087-of-00100  coco_train.record-00023-of-00100  coco_train.record-00059-of-00100  coco_train.record-00095-of-00100
coco_testdev.record-00016-of-00100  coco_testdev.record-00052-of-00100  coco_testdev.record-00088-of-00100  coco_train.record-00024-of-00100  coco_train.record-00060-of-00100  coco_train.record-00096-of-00100
coco_testdev.record-00017-of-00100  coco_testdev.record-00053-of-00100  coco_testdev.record-00089-of-00100  coco_train.record-00025-of-00100  coco_train.record-00061-of-00100  coco_train.record-00097-of-00100
coco_testdev.record-00018-of-00100  coco_testdev.record-00054-of-00100  coco_testdev.record-00090-of-00100  coco_train.record-00026-of-00100  coco_train.record-00062-of-00100  coco_train.record-00098-of-00100
coco_testdev.record-00019-of-00100  coco_testdev.record-00055-of-00100  coco_testdev.record-00091-of-00100  coco_train.record-00027-of-00100  coco_train.record-00063-of-00100  coco_train.record-00099-of-00100
coco_testdev.record-00020-of-00100  coco_testdev.record-00056-of-00100  coco_testdev.record-00092-of-00100  coco_train.record-00028-of-00100  coco_train.record-00064-of-00100  coco_val.record-00000-of-00010
coco_testdev.record-00021-of-00100  coco_testdev.record-00057-of-00100  coco_testdev.record-00093-of-00100  coco_train.record-00029-of-00100  coco_train.record-00065-of-00100  coco_val.record-00001-of-00010
coco_testdev.record-00022-of-00100  coco_testdev.record-00058-of-00100  coco_testdev.record-00094-of-00100  coco_train.record-00030-of-00100  coco_train.record-00066-of-00100  coco_val.record-00002-of-00010
coco_testdev.record-00023-of-00100  coco_testdev.record-00059-of-00100  coco_testdev.record-00095-of-00100  coco_train.record-00031-of-00100  coco_train.record-00067-of-00100  coco_val.record-00003-of-00010
coco_testdev.record-00024-of-00100  coco_testdev.record-00060-of-00100  coco_testdev.record-00096-of-00100  coco_train.record-00032-of-00100  coco_train.record-00068-of-00100  coco_val.record-00004-of-00010
coco_testdev.record-00025-of-00100  coco_testdev.record-00061-of-00100  coco_testdev.record-00097-of-00100  coco_train.record-00033-of-00100  coco_train.record-00069-of-00100  coco_val.record-00005-of-00010
coco_testdev.record-00026-of-00100  coco_testdev.record-00062-of-00100  coco_testdev.record-00098-of-00100  coco_train.record-00034-of-00100  coco_train.record-00070-of-00100  coco_val.record-00006-of-00010
coco_testdev.record-00027-of-00100  coco_testdev.record-00063-of-00100  coco_testdev.record-00099-of-00100  coco_train.record-00035-of-00100  coco_train.record-00071-of-00100  coco_val.record-00007-of-00010
coco_testdev.record-00028-of-00100  coco_testdev.record-00064-of-00100  coco_train.record-00000-of-00100    coco_train.record-00036-of-00100  coco_train.record-00072-of-00100  coco_val.record-00008-of-00010
coco_testdev.record-00029-of-00100  coco_testdev.record-00065-of-00100  coco_train.record-00001-of-00100    coco_train.record-00037-of-00100  coco_train.record-00073-of-00100  coco_val.record-00009-of-00010
coco_testdev.record-00030-of-00100  coco_testdev.record-00066-of-00100  coco_train.record-00002-of-00100    coco_train.record-00038-of-00100  coco_train.record-00074-of-00100  raw-data
coco_testdev.record-00031-of-00100  coco_testdev.record-00067-of-00100  coco_train.record-00003-of-00100    coco_train.record-00039-of-00100  coco_train.record-00075-of-00100
coco_testdev.record-00032-of-00100  coco_testdev.record-00068-of-00100  coco_train.record-00004-of-00100    coco_train.record-00040-of-00100  coco_train.record-00076-of-00100
coco_testdev.record-00033-of-00100  coco_testdev.record-00069-of-00100  coco_train.record-00005-of-00100    coco_train.record-00041-of-00100  coco_train.record-00077-of-00100
coco_testdev.record-00034-of-00100  coco_testdev.record-00070-of-00100  coco_train.record-00006-of-00100    coco_train.record-00042-of-00100  coco_train.record-00078-of-00100



TFRecord and tf.Example
  https://www.tensorflow.org/tutorials/load_data/tfrecord
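
To check that the records were written correctly, one shard can be read back and parsed as a tf.Example. A minimal sketch, assuming TF 1.x inside the container and the shard names listed above:

# Minimal sketch: inspect one generated TFRecord shard (TF 1.x API)
import tensorflow as tf

record = "/data/coco2017_tfrecords/coco_val.record-00000-of-00010"
for raw in tf.python_io.tf_record_iterator(record):
    example = tf.train.Example()
    example.ParseFromString(raw)
    feats = example.features.feature
    print(feats["image/filename"].bytes_list.value[0],
          feats["image/height"].int64_list.value[0],
          feats["image/width"].int64_list.value[0],
          len(feats["image/object/bbox/xmin"].float_list.value), "boxes")
    break  # only look at the first example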

  • COCO DATASET 2017 files used above
  http://images.cocodataset.org/zips/train2017.zip
  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
  http://images.cocodataset.org/zips/val2017.zip
  http://images.cocodataset.org/zips/test2017.zip

  • COCO dataset details revisited
  https://ahyuo79.blogspot.com/2019/10/cocodata-set.html


2.3 Running and Analyzing Quick Guide 4~5 

First start the Docker container as below, then proceed with training using the TensorFlow training shell script.
In my case there is only one GPU, so the training part is very slow.


NVIDIA Docker (nvidia-docker2) seems likely to disappear eventually; it can also be run with plain Docker as below, but the NVIDIA container toolkit must be installed.
See the earlier post for the related part.
  https://ahyuo79.blogspot.com/2019/10/nvidia-docker.html

  • Run with the installed nvidia-docker2 

$ nvidia-docker run --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd 


  • Run with plain docker instead (when nvidia-docker2 is not used)
  1. Add TensorBoard/Jupyter port mappings
  2. Set a name so the container is easy to find

$ docker run --gpus all --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd




  • Running and analyzing the training step 
Analyzing the training shell script shows that it also uses a config file internally; keep this in mind.

root@c7550d6b2c59:/workdir/models/research# bash ./examples/SSD320_FP16_1GPU.sh /checkpoints 


root@c7550d6b2c59:/workdir/models/research# cat examples/SSD320_FP16_1GPU.sh 

CKPT_DIR=${1:-"/results/SSD320_FP16_1GPU"}
PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_1gpus.config"

export TF_ENABLE_AUTO_MIXED_PRECISION=1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

time python -u ./object_detection/model_main.py \
       --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
       --model_dir=${CKPT_DIR} \
       --alsologtostder \
       "${@:3}"



  • Configuring the Object Detection Training Pipeline
The pipeline config holds the settings needed for training, and the settings seem to differ slightly per model.
To look at more config files, see object_detection/samples/configs.
The pre-trained model used here is resnet_v1_50, so it is worth understanding the related parts.
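
If the config needs to be changed programmatically (for example batch_size or num_steps) rather than by hand, it is a TrainEvalPipelineConfig protobuf that can be loaded and rewritten with text_format. A minimal sketch, assuming it is run from /workdir/models/research inside the container (the output file name is made up for illustration):

# Minimal sketch: load, tweak and rewrite the pipeline config (TF 1.x + object_detection protos)
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile("configs/ssd320_full_1gpus.config", "r") as f:
    text_format.Merge(f.read(), pipeline_config)

pipeline_config.train_config.batch_size = 16     # e.g. lower it for a small GPU
pipeline_config.train_config.num_steps = 50000

with tf.gfile.GFile("configs/ssd320_full_1gpus_custom.config", "w") as f:   # hypothetical output name
    f.write(text_format.MessageToString(pipeline_config))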

root@a79a83fc99f6:/workdir/models/research# cat configs/ssd320_full_1gpus.config 
# SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal
# loss (a.k.a Retinanet).
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: true
    num_classes: 90         ## number of labels; equal to the number of classes in object_detection/data/mscoco_label_map.pbtxt
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5      ## only matches above 50% are displayed; can be observed later with object_detection/object_detection_tutorial.ipynb 
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {                ## this appears to be the network input shape
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 256
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.0004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'       # resnet 50 is used as the feature extractor; this part can be changed  
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6                 
        max_detections_per_class: 100     ### max detections per class
        max_total_detections: 100         ### max total detections; same as output_dict['num_detections'] in object_detection/object_detection_tutorial.ipynb 
      }
      score_converter: SIGMOID
    }
  }
}

# 
# The model below is Google's pre-trained model.
# When SSD uses a pre-trained model as its internal feature extractor, fine_tune_checkpoint_type is "classification".
# For Faster R-CNN, fine_tune_checkpoint_type is "detection".
#

train_config: {
  fine_tune_checkpoint: "/checkpoints/resnet_v1_50/model.ckpt"        ## download location of the pre-trained model above 
  fine_tune_checkpoint_type: "classification"                         # the setting differs depending on the model 
  batch_size: 32    ## can cause GPU out-of-memory; set it according to your GPU memory or switch to CPU mode 
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 100000    ## 100,000 steps (can also be set with object_detection/model_main.py --num_train_steps) 
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .02000000000000000000
          total_steps: 100000
          warmup_learning_rate: .00866640000000000000
          warmup_steps: 8000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100         ## should probably match max_total_detections; see output_dict['detection_boxes'] in object_detection_tutorial.ipynb
  unpad_groundtruth_tensors: false
}

#
# Training Setting 
# 
# input_path:  //TF Record 
#      coco_train.record-00000-of-00100 
#      coco_train.record-00001-of-00100 
#      .....
# label_map_path:
#      mscoco_label_map.pbtxt
#
train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/coco2017_tfrecords/*train*"    ## TFRecord location 
  }
  label_map_path: "object_detection/data/mscoco_label_map.pbtxt"  ## label map location 
}


# 
# Eval Setting 
# 
#
#

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 8000   ## number of examples used for eval 
  ##max_evals: 10               ## max number of evals (can be set with object_detection/model_main.py --eval_count); not present in the original config 
}

# 
# Eval Setting 
# 
# input_path:  //TF Record 
#      coco_val.record-00000-of-00010
#      coco_val.record-00001-of-00010 
#  
# label_map_path:
#      mscoco_label_map.pbtxt
#

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/coco2017_tfrecords/*val*"  ## TFRecord location 
  }
  label_map_path: "object_detection/data/mscoco_label_map.pbtxt" ## label info can be checked here 
  shuffle: false
  num_readers: 1
}

  • Pre-trained Models
  https://github.com/tensorflow/models/tree/master/research/slim

  • Configuring the Object Detection Training Pipeline
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md
  https://medium.com/coinmonks/modelling-transfer-learning-using-tensorflows-object-detection-model-on-mac-692c8609be40
  https://devtalk.nvidia.com/default/topic/1049371/tensorrt/how-to-visualize-tf-trt-graphs-with-tensorboard-/


2.4 Running and Analyzing Quick Guide 6 

This step runs validation/evaluation after training; it seems to act as a sanity check on the trained model, but understanding its exact role requires knowing the basics of TensorFlow and of the dataset.


root@c7550d6b2c59:/workdir/models/research#  bash examples/SSD320_evaluate.sh /checkpoints 


root@c7550d6b2c59:/workdir/models/research#  cat examples/SSD320_evaluate.sh
CHECKPINT_DIR=$1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

python object_detection/model_main.py --checkpoint_dir $CHECKPINT_DIR --model_dir /results --run_once --pipeline_config_path configs/ssd320_full_1gpus.config


  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md

2.5 Checking the Final Generated Checkpoint Files

The final files produced with NVIDIA Docker are checkpoint files; the PB file has to be created yourself, and inference should also be done based on it.
The checkpoints currently consist of model.ckpt-0 and model.ckpt-100000 as shown below.

root@c7550d6b2c59:/workdir/models/research# ls  /checkpoints/
checkpoint                                   model.ckpt-0.data-00000-of-00002  model.ckpt-100000.data-00000-of-00002  resnet_v1_50
eval                                         model.ckpt-0.data-00001-of-00002  model.ckpt-100000.data-00001-of-00002  resnet_v1_50_2016_08_28.tar.gz
events.out.tfevents.1572262719.c7550d6b2c59  model.ckpt-0.index                model.ckpt-100000.index                
graph.pbtxt                                  model.ckpt-0.meta                 model.ckpt-100000.meta                 



  • Basic layout 
  1. model.ckpt-0: created as soon as training starts (step 0)
  2. model.ckpt-100000 : matches the number of steps in the pipeline; keeps increasing if training is resumed from here  
  3. graph.pbtxt: shows the network structure 

Understanding checkpoints (see the links below)
  https://eehoeskrap.tistory.com/343
  https://eehoeskrap.tistory.com/370
  https://eehoeskrap.tistory.com/344
  https://gusrb.tistory.com/21
  http://jaynewho.com/post/8
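
To see what a checkpoint actually contains, the stored variable names and shapes can be listed directly. A minimal sketch (TF 1.x, using the checkpoint path from the training above):

# Minimal sketch: list the variables saved in the trained checkpoint (TF 1.x)
import tensorflow as tf

ckpt = "/checkpoints/model.ckpt-100000"
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)

# Individual tensors can be read back as well, e.g. the global step
reader = tf.train.load_checkpoint(ckpt)
print(reader.get_tensor("global_step"))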


2.6  Converting the Checkpoint to PB Format 

Convert to a PB file as below for inference.

  • How to use export_inference_graph.py 
  https://github.com/tensorflow/models/blob/master/research/object_detection/export_inference_graph.py


  • input_type
  1. image_tensor
  2. encoded_image_string_tensor
  3. tf_example

TFRecord and tf.Example
  https://www.tensorflow.org/tutorials/load_data/tfrecord


root@c7550d6b2c59:/workdir/models/research# python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path configs/ssd320_full_1gpus.config \
    --trained_checkpoint_prefix  /checkpoints/model.ckpt-100000 \
    --output_directory /checkpoints/inference_graph_100000

root@c7550d6b2c59:/workdir/models/research# ls /checkpoints/inference_graph_100000/
checkpoint  frozen_inference_graph.pb  model.ckpt.data-00000-of-00001  model.ckpt.index  model.ckpt.meta  pipeline.config  saved_model

// check the newly created PB file 
root@c7550d6b2c59:/workdir/models/research# ls /checkpoints/inference_graph_100000/saved_model/
saved_model.pb  variables




Exporting a trained model for inference
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
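
Because the export above used input_type image_tensor, the frozen graph can be loaded and run on a single image without the notebook. A minimal sketch (the tensor names follow the usual Object Detection API export convention; paths are the ones used above):

# Minimal sketch: run the exported frozen graph on one test image (TF 1.x + PIL)
import numpy as np
import PIL.Image
import tensorflow as tf

PB_PATH = "/checkpoints/inference_graph_100000/frozen_inference_graph.pb"

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PB_PATH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

image = np.array(PIL.Image.open("object_detection/test_images/image1.jpg"))

with tf.Session(graph=graph) as sess:
    boxes, scores, classes, num = sess.run(
        ["detection_boxes:0", "detection_scores:0",
         "detection_classes:0", "num_detections:0"],
        feed_dict={"image_tensor:0": image[None, ...]})
print(int(num[0]), "detections, top score:", scores[0][0])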


2.7  Testing with TensorBoard 


  • Checking the checkpoint structure on the host 

$ cd ~/works/ssd/check  // check the checkpoint structure on the host 
$ tree
.
├── checkpoint  // model.ckpt-100000 and its paths 
├── eval        // TensorBoard log created during training; the part created during validation/evaluation is under /results/eval
│   └── events.out.tfevents.1572359812.c7550d6b2c59   // TensorBoard log file 
├── events.out.tfevents.1572262719.c7550d6b2c59       // TensorBoard log file 
├── graph.pbtxt                      // network structure information 
├── inference_graph_100000           // PB files generated from model.ckpt-100000 
│   ├── checkpoint
│   ├── frozen_inference_graph.pb
│   ├── model.ckpt.data-00000-of-00001
│   ├── model.ckpt.index
│   ├── model.ckpt.meta
│   ├── pipeline.config
│   └── saved_model
│       ├── saved_model.pb
│       └── variables
├── model.ckpt-0.data-00000-of-00002        //checkpoint-0
├── model.ckpt-0.data-00001-of-00002
├── model.ckpt-0.index
├── model.ckpt-0.meta
├── model.ckpt-100000.data-00000-of-00002   //checkpoint-100000
├── model.ckpt-100000.data-00001-of-00002
├── model.ckpt-100000.index
├── model.ckpt-100000.meta
├── resnet_v1_50                        // pre-trained model (used for SSD feature extraction)
│   └── model.ckpt
├── resnet_v1_50_2016_08_28.tar.gz    //Pre-trained Model
├── resnet_v1_50_2016_08_28.tar.gz.1
└── resnet_v1_50_2016_08_28.tar.gz.2



  • Pre-trained Models
SSD uses the ResNet model as the feature_extractor (fine tuning / transfer learning)
  resnet_v1_50_2016_08_28.tar.gz
  https://github.com/tensorflow/models/tree/master/research/slim

  • Run TensorBoard inside the Docker container
The TensorBoard logs generated above exist under the checkpoint directory, so they can be analyzed.

root@c7550d6b2c59:/workdir/models/research# tensorboard --logdir=/checkpoints 


  • Connect to TensorBoard in the browser
  http://localhost:6006/


  • Tensorboard -> Scalars
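
The scalar values shown in this tab (loss, learning rate, COCO metrics) can also be read straight from the event files. A minimal sketch, assuming the event file name generated above (TF 1.x):

# Minimal sketch: read scalar summaries straight from a TensorBoard event file (TF 1.x)
import tensorflow as tf

events = "/checkpoints/events.out.tfevents.1572262719.c7550d6b2c59"
for event in tf.train.summary_iterator(events):
    for value in event.summary.value:
        if value.HasField("simple_value"):      # scalars only, skip images/histograms
            print(event.step, value.tag, value.simple_value)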


Source code related to the Images tab 

root@c7550d6b2c59:/workdir/models/research# vi ./object_detection/utils/visualization_utils.py 
........

def draw_side_by_side_evaluation_image(eval_dict,
                                       category_index,
                                       max_boxes_to_draw=20,
                                       min_score_thresh=0.2,
                                       use_normalized_coordinates=True):
  """Creates a side-by-side image with detections and groundtruth.

  Bounding boxes (and instance masks, if available) are visualized on both
  subimages.

  Args:
    eval_dict: The evaluation dictionary returned by
      eval_util.result_dict_for_batched_example() or
      eval_util.result_dict_for_single_example().
    category_index: A category index (dictionary) produced from a labelmap.
    max_boxes_to_draw: The maximum number of boxes to draw for detections.
    min_score_thresh: The minimum score threshold for showing detections.
    use_normalized_coordinates: Whether to assume boxes and kepoints are in
      normalized coordinates (as opposed to absolute coordiantes).
      Default is True.

  Returns:
    A list of [1, H, 2 * W, C] uint8 tensor. The subimage on the left
      corresponds to detections, while the subimage on the right corresponds to
      groundtruth.
  """
........


class EvalMetricOpsVisualization(object):
....
  def get_estimator_eval_metric_ops(self, eval_dict):  ## called from model_lib.py below 

    if self._max_examples_to_draw == 0:
      return {}
    images = self.images_from_evaluation_dict(eval_dict)

    def get_images():
      """Returns a list of images, padded to self._max_images_to_draw."""
      images = self._images
      while len(images) < self._max_examples_to_draw:
        images.append(np.array(0, dtype=np.uint8))
      self.clear()
      return images

    def image_summary_or_default_string(summary_name, image): ## images are created here
      """Returns image summaries for non-padded elements."""
      return tf.cond(
          tf.equal(tf.size(tf.shape(image)), 4),
          lambda: tf.summary.image(summary_name, image),    ## TensorBoard image 
          lambda: tf.constant(''))

    update_op = tf.py_func(self.add_images, [[images[0]]], [])
    image_tensors = tf.py_func(
        get_images, [], [tf.uint8] * self._max_examples_to_draw)
    eval_metric_ops = {}
    for i, image in enumerate(image_tensors):
      summary_name = self._summary_name_prefix + '/' + str(i)
      value_op = image_summary_or_default_string(summary_name, image)   ## TensorBoard image created here 
      eval_metric_ops[summary_name] = (value_op, update_op)
    return eval_metric_ops

.....

class VisualizeSingleFrameDetections(EvalMetricOpsVisualization): ## VisualizeSingleFrameDetections extends EvalMetricOpsVisualization
  """Class responsible for single-frame object detection visualizations."""

  def __init__(self,
               category_index,
               max_examples_to_draw=5,
               max_boxes_to_draw=20,
               min_score_thresh=0.2,
               use_normalized_coordinates=True,
               summary_name_prefix='Detections_Left_Groundtruth_Right'):
    super(VisualizeSingleFrameDetections, self).__init__(
        category_index=category_index,
        max_examples_to_draw=max_examples_to_draw,
        max_boxes_to_draw=max_boxes_to_draw,
        min_score_thresh=min_score_thresh,
        use_normalized_coordinates=use_normalized_coordinates,
        summary_name_prefix=summary_name_prefix)

  def images_from_evaluation_dict(self, eval_dict):
    return draw_side_by_side_evaluation_image(
        eval_dict, self._category_index, self._max_boxes_to_draw,
        self._min_score_thresh, self._use_normalized_coordinates)

...........

root@c7550d6b2c59:/workdir/models/research# vi ./object_detection/model_lib.py
....
    if mode == tf.estimator.ModeKeys.EVAL:  ## EVAL Mode 
.........
      eval_dict = eval_util.result_dict_for_batched_example(          ## image info 
          eval_images,
          features[inputs.HASH_KEY],
          detections,
          groundtruth,
          class_agnostic=class_agnostic,
          scale_to_absolute=True,
          original_image_spatial_shapes=original_image_spatial_shapes,
          true_image_shapes=true_image_shapes)

      if class_agnostic:
        category_index = label_map_util.create_class_agnostic_category_index()
      else:
        category_index = label_map_util.create_category_index_from_labelmap(
            eval_input_config.label_map_path)
      vis_metric_ops = None
      if not use_tpu and use_original_images:
        eval_metric_op_vis = vis_utils.VisualizeSingleFrameDetections(   
            category_index,
            max_examples_to_draw=eval_config.num_visualizations,
            max_boxes_to_draw=eval_config.max_num_boxes_to_visualize,
            min_score_thresh=eval_config.min_score_threshold,
            use_normalized_coordinates=False)
        vis_metric_ops = eval_metric_op_vis.get_estimator_eval_metric_ops(     ## images are saved here; see above 
            eval_dict)
....



Summary of related points 
tf.estimator.ModeKeys.TRAIN
tf.estimator.ModeKeys.EVAL
tf.estimator.ModeKeys.PREDICT

The site below explains this very well. 
  https://bcho.tistory.com/1196
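
For reference, a model_fn passed to tf.estimator.Estimator branches on these three modes; model_lib.py above follows the same pattern. A minimal sketch that is unrelated to SSD and only shows the structure:

# Minimal sketch: how a model_fn dispatches on tf.estimator.ModeKeys (TF 1.x)
import tensorflow as tf

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features["x"], 10)
    predictions = {"classes": tf.argmax(logits, axis=1)}
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    # EVAL: metrics (and the image summaries seen in model_lib.py above) are attached here
    eval_metric_ops = {"accuracy": tf.metrics.accuracy(labels, predictions["classes"])}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metric_ops)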

  • Tensorboard -> Images


  https://www.tensorflow.org/tensorboard/image_summaries

  • Tensorboard -> Graphs
Made easy to understand at a glance; the visualization is very well done. 
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md


2.8  Preparing Object Detection  

Run Jupyter in the terminal where the Docker container above is running, and proceed with the Jupyter test.

  • Preparing TEST images 

root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/test2017/000000000001.jpg object_detection/test_images/image1.jpg
root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/test2017/000000517810.jpg object_detection/test_images/image2.jpg

or

root@5208474af96a:/workdir/models/research# cat object_detection/test_images/image_info.txt  // download image1.jpg / image2.jpg from the sites below and copy them  

Image provenance:
image1.jpg: https://commons.wikimedia.org/wiki/File:Baegle_dwa.jpg
image2.jpg: Michael Miley,
  https://www.flickr.com/photos/mike_miley/4678754542/in/photolist-88rQHL-88oBVp-88oC2B-88rS6J-88rSqm-88oBLv-88oBC4

root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/image1.jpg object_detection/test_images/image1.jpg   // downloaded from the sites above 
root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/image2.jpg object_detection/test_images/image2.jpg


  • Run object_detection/object_detection_tutorial.ipynb with Jupyter 

root@c7550d6b2c59:/workdir/models/research# jupyter notebook   // raises an error 
root@c7550d6b2c59:/workdir/models/research# jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root

TensorFlow Jupyter Notebook
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_notebook.md

  • Check in the browser after starting the Jupyter notebook 
  http://localhost:8888/

  • Errors that occur when starting the Jupyter notebook
Solved by specifying the options above.
  https://github.com/kaczmarj/neurodocker/issues/82
  http://melonicedlatte.com/web/2018/05/22/134429.html


2.9  Checking Basic Object Detection 


  • Run a separate Docker terminal
Locate each file and identify the files that are needed.

$ docker exec -it nvidia_ssd /bin/bash  // Jupyter is already running in the docker above, so use a separate terminal 

root@5208474af96a:/workdir/models/research# python object_detection/model_main.py --help

root@5208474af96a:/workdir/models/research# ls object_detection/object_detection_tutorial.ipynb  // tested with Jupyter 
object_detection/object_detection_tutorial.ipynb

root@5208474af96a:/workdir/models/research# ls object_detection/ssd_mobilenet_v1_coco_2017_11_17  // model used by the Jupyter notebook above
frozen_inference_graph.pb

root@5208474af96a:/workdir/models/research# ls object_detection/data                              // pbtxt files used by the Jupyter notebook above
ava_label_map_v2.1.pbtxt           mscoco_complete_label_map.pbtxt     oid_object_detection_challenge_500_label_map.pbtxt
face_label_map.pbtxt               mscoco_label_map.pbtxt              pascal_label_map.pbtxt
fgvc_2854_classes_label_map.pbtxt  mscoco_minival_ids.txt              pet_label_map.pbtxt
kitti_label_map.pbtxt              oid_bbox_trainable_label_map.pbtxt



  • object_detection_tutorial.ipynb
Even without separate training, the source shows that it downloads a model by itself; only the test directory needs to be set.
Model used: ssd_mobilenet_v1_coco_2017_11_17.tar.gz
  https://medium.com/@yuu.ishikawa/how-to-show-signatures-of-tensorflow-saved-model-5ac56cf1960f
  http://solarisailab.com/archives/2387

In short, the notebook tests the images in test_images using the model (PB file) and pbtxt above.
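
The same notebook can also be pointed at the model exported in section 2.6 instead of the downloaded ssd_mobilenet. A sketch of the cells to change (the variable names PATH_TO_FROZEN_GRAPH / PATH_TO_LABELS are from the TF1-era notebook and may differ by version):

# Sketch: point object_detection_tutorial.ipynb at the model trained above
# (variable names are from the TF1-era notebook and may differ in other versions)
import os

PATH_TO_FROZEN_GRAPH = "/checkpoints/inference_graph_100000/frozen_inference_graph.pb"
PATH_TO_LABELS = os.path.join("object_detection/data", "mscoco_label_map.pbtxt")
# Skip the model download cell (the ssd_mobilenet tar is no longer needed) and re-run the rest.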

  • object_detection_tutorial.ipynb issue    
Running the test raises a cuDNN error at the end; the cause is GPU memory, so add the code below (TensorFlow 1.14.0 in the Docker image).

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

Error shown in the Jupyter console

E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 

GPU memory issues
  https://github.com/tensorflow/tensorflow/issues/24828
  https://lsjsj92.tistory.com/363
  https://devtalk.nvidia.com/default/topic/1051380/cudnn/could-not-create-cudnn-handle-cudnn_status_internal_error/

TensorFlow 2.0 GPU memory shortage
  https://inpages.tistory.com/155

failed to allocate 2.62G (2811428864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  https://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow


  • Checking NVIDIA GPU memory usage
$ watch -n 0.1 nvidia-smi 


3. Current Status  

On my laptop, even after adding the code above, I cannot see the example inference results, but it works fine on a more powerful server.
It is unfortunate, and I feel the limits of my laptop (especially GPU RAM).


Reference sites; too many were consulted, so only the links are listed.


Object Detection installation and TEST
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md


Training reference material 
  https://www.slideshare.net/fermat39/polyp-detection-withtensorflowobjectdetectionapi
  https://www.kdnuggets.com/2019/03/object-detection-luminoth.html


TensorFlow training and usage  
  https://yongyong-e.tistory.com/24
  http://solarisailab.com/archives/2422
  https://hwauni.tistory.com/entry/API-Object-Detection-API%EB%A5%BC-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%98%A4%EB%B8%8C%EC%A0%9D%ED%8A%B8-%EC%9D%B8%EC%8B%9D%ED%95%98%EA%B8%B0-Part-1-%EC%84%A4%EC%A0%95-%ED%8E%8C
  https://cloud.google.com/solutions/creating-object-detection-application-tensorflow?hl=ko

TensorFlow Object Detection (to be split into a separate post later)
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API/
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API-Installation/
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API-Training/

TensorFlow Object Detection related material
  https://yongyong-e.tistory.com/31?category=836820
  https://yongyong-e.tistory.com/32?category=836820
  https://yongyong-e.tistory.com/35?category=836820    **
  https://towardsdatascience.com/creating-your-own-object-detector-ad69dda69c85  **
  https://gilberttanner.com/blog/live-object-detection

Tensorflow Object Detection API Training
  https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
  https://towardsdatascience.com/custom-object-detection-using-tensorflow-from-scratch-e61da2e10087
  https://becominghuman.ai/tensorflow-object-detection-api-tutorial-training-and-evaluating-custom-object-detector-ed2594afcf73
  https://medium.com/pylessons/tensorflow-step-by-step-custom-object-detection-tutorial-d7ae840a74e2


Tensorflow Object Detection API
  https://github.com/tensorflow/models/tree/master/research/object_detection
  https://github.com/tensorflow/models/tree/master/research/object_detection/g3doc
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md


  • Notes on confusing parts found while analyzing the shell scripts 
As always, open-source shell scripts are well written but change often, which causes a lot of confusion. 

${1:-none}
  https://stackoverflow.com/questions/38260927/what-does-this-line-build-target-1-none-means-in-shell-scripting

${@:2}
  https://unix.stackexchange.com/questions/92978/what-does-this-2-mean-in-shell-scripting

10/24/2019

COCO dataset notes

1. Dataset Used by NVIDIA Docker 

How each dataset is used, with related links 

NVIDIA Transfer Learning (COCO 2017)
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Detection/SSD/download_all.sh

COCO API (COCO images / annotations)
  https://github.com/cocodataset/cocoapi

Google dataset search
   https://toolbox.google.com/datasetsearch

DATASET
Look at the datasets used by the source below.
  https://github.com/tensorflow/models/tree/master/research/slim

COCO DATASET 2017 Download



  • Installing gsutil 
$ sudo apt install curl
$ curl https://sdk.cloud.google.com | bash
google-cloud-sdk  // installed under $HOME 
$ source ~/.bashrc


$ gsutil ls gs://images.cocodataset.org/zips/     // the *.zip files above can only be listed, not downloaded, but other directories work 
gs://images.cocodataset.org/zips/
gs://images.cocodataset.org/zips/test-stuff2017.zip
gs://images.cocodataset.org/zips/test2014.zip
gs://images.cocodataset.org/zips/test2015.zip
gs://images.cocodataset.org/zips/test2017.zip
gs://images.cocodataset.org/zips/train2014.zip
gs://images.cocodataset.org/zips/train2017.zip
gs://images.cocodataset.org/zips/unlabeled2017.zip
gs://images.cocodataset.org/zips/val2014.zip
gs://images.cocodataset.org/zips/val2017.zip

10/23/2019

Installing and Basic Testing of TensorFlow Based on NVIDIA Docker (1st pass)

1. Purpose of NVIDIA Docker 


NVIDIA Docker is basically identical in functionality to Docker-CE, but it became necessary because plain Docker could not use the GPU.
However, GPU support seems to have been added starting with Docker 19.03, so installing the required libraries and using that should be fine.
Using nvidia-docker2 separately should also be fine.


Existing Docker usage 
  https://ahyuo79.blogspot.com/2018/05/docker.html


1.1 Installing and Configuring NVIDIA Docker 

See below for the installation method
  https://github.com/NVIDIA/nvidia-docker

Related references
  https://github.com/JeonghunLee/NVIDIA_SDK

  • Docker permission settings
To avoid having to use sudo every time, add your USER as below.
$ sudo usermod -aG docker $USER 
$ sudo systemctl restart docker

1.2 Checking Common Docker Commands 

지금까지 Docker를 쉽게 설치해보고, Docker 문서에 따라 쉽게 Dockerfile을 만들어 보았지만, 
이 DockerFile를 직접 적극적으로 활용해서 응용하여 사용해 본적이 거의 없는 것 같다.

  • NVIDIA Docker documentation (the Docker document I recommend most)
The PDF below makes the Docker commands easy to understand and learn.
It is the Docker document used with DGX systems.
  https://www.nvidia.co.kr/content/apac/event/kr/deep-learning-day-2017/dli-1/Docker-User-Guide-17-08_v1_NOV01_Joshpark.pdf




  • Overall Docker structure 
After installing Docker on Linux, the individual commands are easy to understand from the diagram referenced below.

[Figure: overall Docker command flow between the host, images, containers, and Docker Hub - image not included in this text]

The diagram makes the overall set of docker commands easy to understand.

  • Check the relevant options for each Docker command 
$ docker version  // check the Docker version (19.03.4 in use) 
$ docker help // list all commands 

Management Commands:
  builder     Manage builds
  config      Manage Docker configs
  container   Manage containers
  context     Manage contexts
  engine      Manage the docker engine
  image       Manage images
  network     Manage networks
  node        Manage Swarm nodes
  plugin      Manage plugins
  secret      Manage Docker secrets
  service     Manage services
  stack       Manage Docker stacks
  swarm       Manage Swarm
  system      Manage Docker
  trust       Manage trust on Docker images
  volume      Manage volumes

Commands:
  attach      Attach local standard input, output, and error streams to a running container
  build       Build an image from a Dockerfile
  commit      Create a new image from a container's changes
  cp          Copy files/folders between a container and the local filesystem
  create      Create a new container
  deploy      Deploy a new stack or update an existing stack
  diff        Inspect changes to files or directories on a container's filesystem
  events      Get real time events from the server
  exec        Run a command in a running container
  export      Export a container's filesystem as a tar archive
  history     Show the history of an image
  images      List images
  import      Import the contents from a tarball to create a filesystem image
  info        Display system-wide information
  inspect     Return low-level information on Docker objects
  kill        Kill one or more running containers
  load        Load an image from a tar archive or STDIN
  login       Log in to a Docker registry
  logout      Log out from a Docker registry
  logs        Fetch the logs of a container
  pause       Pause all processes within one or more containers
  port        List port mappings or a specific mapping for the container
  ps          List containers
  pull        Pull an image or a repository from a registry
  push        Push an image or a repository to a registry
  rename      Rename a container
  restart     Restart one or more containers
  rm          Remove one or more containers
  rmi         Remove one or more images
  run         Run a command in a new container
  save        Save one or more images to a tar archive (streamed to STDOUT by default)
  search      Search the Docker Hub for images
  start       Start one or more stopped containers
  stats       Display a live stream of container(s) resource usage statistics
  stop        Stop one or more running containers
  tag         Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
  top         Display the running processes of a container
  unpause     Unpause all processes within one or more containers
  update      Update configuration of one or more containers
  version     Show the Docker version information
  wait        Block until one or more containers stop, then print their exit codes


$ docker COMMAND --help  // check the options of an individual command 
$ docker run         // create a container from an image and run it (exit terminates the container)
$ docker attach      // attach to a container started with run (careful: exit terminates the container)
$ docker images      // list the docker images currently on the host 
$ docker stats       // live resource usage (CPU/MEM) of running containers
$ docker ps -a       // list containers (-a also shows stopped ones)

Tensorflow's own Docker installation guide
  https://www.tensorflow.org/install/docker?hl=ko


1.3 Terminating a Docker container vs. temporarily detaching from it 

If you create a container and enter it in attach mode, typing exit inside terminates the container and returns you to the host.
This of course only applies when the terminal-related options below were set, so be careful.
To leave a container that is running in terminal mode without terminating it, the docker run options are what matter, so look at them closely.


Inside the container, the exit command terminates it

  • docker run options 
  1. -t : allocate a TTY, so the stdout of the container's default shell is visible 
  2. -i : keep STDIN open, so keyboard input reaches the container 
  3. -d : detached mode, runs the container in the background 




  • docker run -t -i  or -ti or -it 
In short, this runs attached (foreground mode) and the container's default shell can be used directly.
If the -d option is added it runs in detached (background) mode instead, and can be entered later with docker attach.

  1. To terminate: type exit directly in the container shell (if only the default shell exists, the container terminates with it) 
  2. To leave without terminating: detach with [Ctrl + P] + [Ctrl + Q]  

  • docker run -i  or docker run 
Since no terminal (TTY) is allocated, this does not enter the container's terminal mode and the default shell cannot be used interactively.
The [Ctrl + P] + [Ctrl + Q] escape sequence cannot be used either.
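
A minimal sketch of the difference (the image and container names here are placeholders, not from the original notes):

$ docker run -dit --name demo ubuntu /bin/bash   # detached, but with a TTY and STDIN kept open
$ docker attach demo                             # joins the default shell; [Ctrl + P] + [Ctrl + Q] detaches again
$ docker exec -it demo /bin/bash                 # extra shell process; typing exit here does not stop the container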


References
  http://egloos.zum.com/sstories/v/9731853


1.4  Connecting to a Docker container from the host 
    To connect to a container from another terminal, or to one running in the background, use attach.
    Since no name was given above, a random name is generated as shown below; look it up and use it to connect.

    Because the attached session shares the container's single main process, if someone is already attached and you attach again as below, both terminals show the same screen.


    $ docker ps -a
    CONTAINER ID        IMAGE                                 COMMAND                  CREATED             STATUS              PORTS                          NAMES
    e0d05f676eb4        nvcr.io/nvidia/tensorflow:19.08-py3   "/usr/local/bin/nvid…"   8 seconds ago       Up 7 seconds        6006/tcp, 6064/tcp, 8888/tcp   hungry_faraday
    
    //before doing port mapping with docker run, the PORTS column shows which ports the container exposes 
    
    $ docker attach hungry_faraday
    root@650519457d6f:/workspace#  exit // exit terminates the docker container 
    


    • Adding a process by running bash with exec 
    Connecting this way and leaving with exit is not a problem, because the original default shell process is still there (assuming the container was started with docker run -it)

    $ docker ps -a
    [sudo] password for jhlee: 
    CONTAINER ID        IMAGE                                 COMMAND                  CREATED             STATUS              PORTS                                                      NAMES
    7438ab2813a1        nvcr.io/nvidia/tensorflow:19.08-py3   "/usr/local/bin/nvid…"   45 minutes ago      Up 45 minutes       0.0.0.0:6006->6006/tcp, 0.0.0.0:8888->8888/tcp, 6064/tcp   tensorflow
    
    
    $ docker exec -it tensorflow /bin/bash 
    root@650519457d6f:/workspace#  // unlike attach above, exiting this exec'd shell does not stop the container 
    


    Connecting to a Docker container from the host
      https://bluese05.tistory.com/21


    Docker usage references
      https://devyurim.github.io/python/tensorflow/development%20enviroment/docker/2018/05/25/tensorflow-3.html
      https://noanswercode.tistory.com/11
      https://handcoding.tistory.com/203

    2. Installing and testing the Tensorflow Docker 

    NVIDIA used to provide NVIDIA-Docker as the GPU-capable Docker, but since Docker CE 19.03 it seems to be no longer required.

    NVIDIA Docker reference
      https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#enable-gpu-support



    2.1  Installing and running the Tensorflow Docker image 

    • NGC (NVIDIA GPU CLOUD) documentation 
    Refer to the manuals below for pulling the container
      https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#pullcontainer
      https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#runcont

    • Tensorflow container install and run reference 
      https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow


    • Tensorflow Version 
      https://tensorflow.blog/category/tensorflow/


    • How to use docker run
    Basic usage of docker run 
    $ docker run options image:tags  // basic form 
    $ docker run --help  // check the options 
    $ docker run --gpus all  // use the GPU 
                 -it         // interactive mode; also matters for detaching as described above 
                 --rm        // remove the container automatically when it exits 
                 -p Host-port:Container-Port  // map a host port to a container port 
                 -v HOST:Container            // mount a host directory into the container for sharing 
                 --name string     // if no name is given, docker generates a random one              
    


    • Pulling the container (GPU capable)
    Pull the tensorflow image provided on the site above and test it.

    $ docker pull nvcr.io/nvidia/tensorflow:19.08-py3  


    • Basic Tensorflow test based on Docker 
    Use Docker to bring up the Tensorflow container provided by NVIDIA, run a basic test, and get a rough idea of what it contains.

    $ docker run --gpus all -it --rm  \
    nvcr.io/nvidia/tensorflow:19.08-py3
    
    ================
    == TensorFlow ==
    ================
    
    NVIDIA Release 19.08 (build 7791926)
    TensorFlow Version 1.14.0
    
    Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.
    
    Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
    NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
    
    NOTE: MOFED driver for multi-node communication was not detected.
          Multi-node communication performance may be reduced.
    
    NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
       insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
       nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
    
    root@650519457d6f:/workspace# 
    root@650519457d6f:/workspace#  
    root@650519457d6f:/workspace# python      // verify Tensorflow works 
    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello, TensorFlow!')
    >>> sess = tf.Session()
    >>> sess.run(hello)
    Hello, TensorFlow!
    >>> a = tf.constant(10)
    >>> b = tf.constant(32)
    >>> sess.run(a+b)
    42
    
    root@650519457d6f:/workspace#  find / -name cuda*  
    .....
    // check the cuda version and related files 
    // the tf-trt (Tensorflow-TensorRT) examples can also be found here
    root@650519457d6f:/workspace# ll /usr/local/cuda-10.1/bin/    // check for cuda-gdb and nvcc 
    ....
    root@650519457d6f:/workspace# find / -name tensorrt*   // check the tensorrt version and related items  
    .....
    // check tensorrt and related files 
    // also check the UFF converter and related items
    
    root@650519457d6f:/workspace# exit  // exit terminates the container, which is then removed (--rm) and no longer appears in docker ps 
    


    When you exit, the container terminates and everything you created or stored inside it is gone.
    To keep it, change the options above (for example, drop --rm).

    • Packages worth checking inside the Tensorflow Docker above (a quick check is sketched after the list)
    1. cuda
    2. tensorrt 
    3. convert-to-uff
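
    A quick way to confirm these inside the container might look like the following (a sketch; the exact layout and tool names can vary between container versions):

    root@650519457d6f:/workspace# nvcc --version                                            # CUDA compiler version
    root@650519457d6f:/workspace# python -c "import tensorrt; print(tensorrt.__version__)"  # TensorRT python bindings
    root@650519457d6f:/workspace# convert-to-uff --help                                     # UFF converter usage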

    2.2  Testing Tensorboard and Jupyter notebook 

    Test Tensorboard and Jupyter on top of the Tensorflow container 

     $ docker run --gpus all -it --rm  \
    -p 8888:8888 -p 6006:6006  \
    --name tensorflow \
    nvcr.io/nvidia/tensorflow:19.08-py3
    
    ================
    == TensorFlow ==
    ================
    
    NVIDIA Release 19.08 (build 7791926)
    .....
    
    //Tensorboard test, default port 6006
    root@7438ab2813a1:/workspace#  tensorboard --logdir=/tmp
    or 
    root@7438ab2813a1:/workspace#  tensorboard --logdir=/tmp --port=8008  // to change the port (note the new port must also be published with -p) 
    //check it in the host browser below, then stop with Ctrl+C 
    
    root@7438ab2813a1:/workspace#  jupyter notebook
    Log in by entering the token shown in the output
    
    //check it in the host browser below, then stop with Ctrl+C  
    


    • Checking the services above from Docker 
    Open each of the links below in the host browser

    Tensorboard
      http://localhost:6006/

    Jupyter notebook
      http://localhost:8888/


    2.3 Checking the NVIDIA Tensorflow contents

    • Sharing a workspace between the host and the container 
    Before entering Docker, create the host directory in advance as below, then enter the container with an extra mount so the workspace is shared

     $ docker run --gpus all -it --rm  \
    -p 8888:8888 -p 6006:6006  \
    --name tensorflow \
    -v /home/jhlee/works/dockers/tensorflow/Workspace:/Workspace \
    nvcr.io/nvidia/tensorflow:19.08-py3
    ........
    root@55e1bf3c0391:/workspace# ls /Workspace   // the mounted /Workspace directory is empty at first 
    root@55e1bf3c0391:/workspace# cp -a * /Workspace
    root@55e1bf3c0391:/workspace# ls /Workspace    // everything is now copied to the host 
    README.md  docker-examples  nvidia-examples
    


    • Checking the overall structure of the NVIDIA examples on the host 
    The basic structure of the NVIDIA examples can now be checked from the host; given the README.md below they look like they come from some GitHub repository, though I do not yet know which one

    $ cd /home/jhlee/works/dockers/tensorflow/Workspace
    $ tree -L 2 
    .
    ├── docker-examples
    │   ├── Dockerfile.addpackages            // Dockerfile for extending the image with docker build on the host 
    │   └── Dockerfile.customtensorflow       // Dockerfile for extending the image with docker build on the host
    ├── nvidia-examples        // examples for each model 
    │   ├── bert
    │   ├── big_lstm
    │   ├── build_imagenet_data
    │   ├── cnn
    │   ├── gnmt_v2
    │   ├── NCF
    │   ├── OpenSeq2Seq
    │   ├── resnet50v1.5
    │   ├── ssdv1.2             // this one looks familiar; let's test this SSD 
    │   ├── tensorrt
    │   ├── Transformer_TF
    │   ├── UNet_Industrial
    │   └── UNet_Medical
    └── README.md 


    2.4  Modifying the Dockerfile and creating an image 

    • Reviewing and modifying the Dockerfiles  
    If there is anything you want to change in the Dockerfiles above, modify them and use them 

    $ cd docker-examples
    $ cat Dockerfile.addpackages  // adds extra packages on top of nvcr.io/nvidia/tensorflow:19.08-py3  
    
    FROM nvcr.io/nvidia/tensorflow:19.08-py3
    
    # Install my-extra-package-1 and my-extra-package-2
    RUN apt-get update && apt-get install -y --no-install-recommends \
            my-extra-package-1 \
            my-extra-package-2 \
          && \
        rm -rf /var/lib/apt/lists/
    
    
    $ cat Dockerfile.customtensorflow  // similar; you could add whatever you want and build your own image 
    
    FROM nvcr.io/nvidia/tensorflow:19.08-py3
    
    # Bring in changes from outside container to /tmp
    # (assumes my-tensorflow-modifications.patch is in same directory as Dockerfile)
    COPY my-tensorflow-modifications.patch /tmp
    
    # Change working directory to TensorFlow source path
    WORKDIR /opt/tensorflow
    
    # Apply modifications
    RUN patch -p1 < /tmp/my-tensorflow-modifications.patch
    
    # Rebuild TensorFlow for python 2 and 3
    RUN ./nvbuild.sh --python2
    RUN ./nvbuild.sh --python3
    
    # Reset default working directory
    WORKDIR /workspace

    If desired, you can use a Dockerfile based on nvcr.io/nvidia/tensorflow:19.08-py3 to build and configure your own Tensorflow image


    • Creating the Docker image 

    $ docker build . -t my_images  // run where the Dockerfile is located; -t takes a name or 'name:tag' for the new image 


    2.5  Basic analysis of the Tensorflow SSDv1.2 example 

    At first, reading the README.md for training NVIDIA's Tensorflow SSD, I could not make sense of it,
    so I had no choice but to analyze the source.

    • Checking README.md on the host
    The README.md is the one used on GitHub, which is what initially made the required procedure confusing.
    Looking at it, it is identical to the GitHub content.

    $ cd /home/jhlee/works/dockers/tensorflow/Workspace
    $ cd nvidia-examples/ssdv1.2
    $ ls 
    Dockerfile  NOTICE  README.md  configs  download_all.sh  examples  img  models  qa  requirements.txt
    


    The SSD above is identical to the GitHub repository below
      https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

    Other NVIDIA DeepLearning examples can also be found here
      https://github.com/NVIDIA/DeepLearningExamples


    • Checking and comparing the Dockerfile on the HOST as below 
    $ cat Dockerfile  // this Dockerfile starts from the 19.05 image, installs additional pieces, and then sets everything up 
    FROM nvcr.io/nvidia/tensorflow:19.05-py3 as base
    
    FROM base as sha
    
    RUN mkdir /sha
    RUN cat `cat HEAD | cut -d' ' -f2` > /sha/repo_sha
    
    FROM base as final
    
    WORKDIR /workdir
    
    RUN PROTOC_VERSION=3.0.0 && \
        PROTOC_ZIP=protoc-${PROTOC_VERSION}-linux-x86_64.zip && \
        curl -OL https://github.com/google/protobuf/releases/download/v$PROTOC_VERSION/$PROTOC_ZIP && \
        unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
        rm -f $PROTOC_ZIP
    
    COPY requirements.txt .
    RUN pip install Cython
    RUN pip install -r requirements.txt
    
    WORKDIR models/research/
    COPY models/research/ .
    RUN protoc object_detection/protos/*.proto --python_out=.
    ENV PYTHONPATH="/workdir/models/research/:/workdir/models/research/slim/:$PYTHONPATH"
    
    COPY examples/ examples
    COPY configs/ configs/
    COPY download_all.sh download_all.sh
    
    COPY --from=sha /sha .   


    In short, nvcr.io/nvidia/tensorflow:19.05-py3 is used as the base image, a separate sha stage records the repository commit hash into /sha, and the final stage adds the required packages (and copies /sha in) to produce the new image.
      https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Detection/SSD/Dockerfile


    • Comparing download_all.sh 
    Looking at its structure, it downloads each dataset, sets up the directories, and then creates the tf_records. It takes three arguments (an example invocation follows the list).
    1. $1 : 1st arg, the container image name 
    2. $2 : 2nd arg, the root path for /data/coco2017_tfrecords
    3. $3 : 3rd arg, the root path for /checkpoints 
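
    So an invocation looks roughly like this (the paths here are placeholders):

    $ ./download_all.sh nvidia_ssd /path/to/coco2017_tfrecords /path/to/checkpoints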

    $ cat download_all.sh
    if [ -z $1 ]; then echo "Docker container name is missing" && exit 1; fi
    CONTAINER=$1
    COCO_DIR=${2:-"/data/coco2017_tfrecords"}
    CHECKPOINT_DIR=${3:-"/checkpoints"}
    mkdir -p $COCO_DIR
    chmod 777 $COCO_DIR
    cd $COCO_DIR
    curl -O http://images.cocodataset.org/zips/train2017.zip; unzip train2017.zip
    curl -O http://images.cocodataset.org/zips/val2017.zip; unzip val2017.zip
    curl -O http://images.cocodataset.org/annotations/annotations_trainval2017.zip; unzip annotations_trainval2017.zip
    # Download backbone checkpoint
    mkdir -p $CHECKPOINT_DIR
    chmod 777 $CHECKPOINT_DIR
    cd $CHECKPOINT_DIR
    wget http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz
    tar -xzf resnet_v1_50_2016_08_28.tar.gz
    mkdir -p resnet_v1_50
    mv resnet_v1_50.ckpt resnet_v1_50/model.ckpt
    nvidia-docker run --rm -it -u 123 -v $COCO_DIR:/data/coco2017_tfrecords $CONTAINER bash -c '
    cd /data/coco2017_tfrecords
    # Create TFRecords
    python /workdir/models/research/object_detection/dataset_tools/create_coco_tf_record.py \
        --train_image_dir=`pwd`"/train2017" \
        --val_image_dir=`pwd`"/val2017" \
        --val_annotations_file=`pwd`"/annotations/instances_val2017.json" \
        --train_annotations_file=`pwd`"/annotations/instances_train2017.json" \
        --testdev_annotations_file=`pwd`"/annotations/instances_val2017.json" \
        --test_image_dir=`pwd`"/val2017" \
        --output_dir=`pwd`'
    
    


    So it works by downloading each of the COCO datasets above and then creating the TF Records

      https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Detection/SSD/download_all.sh


    Since the COCO dataset is huge, I had already downloaded it and intended to run things as below, but decided to redo it with the latest version instead

     $ docker run --gpus all -it --rm  \
    -p 8888:8888 -p 6006:6006  \
    --name tensorflow \
    -v /home/jhlee/works/dockers/tensorflow/Workspace:/Workspace \
    -v /home/jhlee/works/dockers/tensorflow/data:/data \
    -v /home/jhlee/works/dockers/tensorflow/checkpoints:/checkpoints \
    nvcr.io/nvidia/tensorflow:19.08-py3
    ........
    root@55e1bf3c0391:/workspace# ls /Workspace   // the mounted /Workspace directory is empty at first 
    root@55e1bf3c0391:/workspace# cp -a * /Workspace
    root@55e1bf3c0391:/workspace# ls /Workspace    // everything is now copied to the host 
    README.md  docker-examples  nvidia-examples
    

    • NvLink (NVIDIA GPU Link)
    Check and make use of the NvLink features via nvidia-smi 
    $ nvidia-smi nvlink -h
    
        nvlink -- Display NvLink information.
    
        Usage: nvidia-smi nvlink [options]
    
        Options include:
        [-h | --help]: Display help information
        [-i | --id]: Enumeration index, PCI bus ID or UUID.
    
        [-l | --link]: Limit a command to a specific link.  Without this flag, all link information is displayed.
        [-s | --status]: Display link state (active/inactive).
        [-c | --capabilities]: Display link capabilities.
        [-p | --pcibusid]: Display remote node PCI bus ID for a link.
        [-sc | --setcontrol]: Set the utilization counters to count specific NvLink transactions.
           The argument consists of an N-character string representing what is meant to be counted:
               First character specifies the counter set:
                   0 = counter 0
                   1 = counter 1
               Second character can be:
                   c = count cycles
                   p = count packets
                   b = count bytes
               Next N characters can be any of the following:
                   n = nop
                   r = read
                   w = write
                   x = reduction atomic requests
                   y = non-reduction atomic requests
                   f = flush
                   d = responses with data
                   o = responses with no data
                   z = all traffic
    
        [-gc | --getcontrol]: Get the utilization counter control information showing
    the counting method and packet filter for the specified counter set (0 or 1).
        [-g | --getcounters]: Display link utilization counter for specified counter set (0 or 1).
            Note that this currently requires root by default for security reasons.
            See "GPU performance counters" in the Known Issues section of the GPU driver README
        [-r | --resetcounters]: Reset link utilization counter for specified counter set (0 or 1).
        [-e | --errorcounters]: Display error counters for a link.
        [-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
        [-re | --reseterrorcounters]: Reset all error counters to zero.
    


    • NvLink (NVIDIA GPU Link)
    Even with the same OS and the same GPUs, the behavior differs as shown below.
    I need to understand the NVLINK part in more detail and make active use of it 

    $ nvidia-smi nvlink -c
    GPU 0: GeForce RTX 2080 Ti (UUID: GPU-0c2ff3ff-47a9-2a70-afba-eb1be7d1d5fb)
    GPU 1: GeForce RTX 2080 Ti (UUID: GPU-024cc4d2-008a-de50-b130-e24beae9d650)
    
    $ nvidia-smi nvlink -s
    GPU 0: GeForce RTX 2080 Ti (UUID: GPU-0c2ff3ff-47a9-2a70-afba-eb1be7d1d5fb)
      Link 0:  inactive
      Link 1:  inactive
    GPU 1: GeForce RTX 2080 Ti (UUID: GPU-024cc4d2-008a-de50-b130-e24beae9d650)
      Link 0:  inactive
      Link 1:  inactive
    


    The output below, taken from one of the sites referenced further down, differs from what my current two-GPU setup above reports

    $ nvidia-smi nvlink -c
    GPU 0: GeForce RTX 2080 Ti (UUID: GPU-1ac935c2-557f-282e-14e5-3f749ffd63ac)
             Link 0, P2P is supported: true
             Link 0, Access to system memory supported: true
             Link 0, P2P atomics supported: true
             Link 0, System memory atomics supported: true
             Link 0, SLI is supported: true
             Link 0, Link is supported: false
             Link 1, P2P is supported: true
             Link 1, Access to system memory supported: true
             Link 1, P2P atomics supported: true
             Link 1, System memory atomics supported: true
             Link 1, SLI is supported: true
             Link 1, Link is supported: false
    GPU 1: GeForce RTX 2080 Ti (UUID: GPU-13277ce5-e1e9-0cb1-8cee-6c9e6618e774)
             Link 0, P2P is supported: true
             Link 0, Access to system memory supported: true
             Link 0, P2P atomics supported: true
             Link 0, System memory atomics supported: true
             Link 0, SLI is supported: true
             Link 0, Link is supported: false
             Link 1, P2P is supported: true
             Link 1, Access to system memory supported: true
             Link 1, P2P atomics supported: true
             Link 1, System memory atomics supported: true
             Link 1, SLI is supported: true
             Link 1, Link is supported: false
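
    As a cross-check, the GPU interconnect topology can also be inspected with the matrix view (general nvidia-smi usage, not from the original logs):

    $ nvidia-smi topo -m    # shows how the GPUs are connected (NVLink, PCIe, host bridge, ...)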
    

    NVIDIA NVLINK
      http://blog.naver.com/PostView.nhn?blogId=computer8log&logNo=221378763004
      https://www.nvidia.com/ko-kr/data-center/nvlink/


    References
      http://wp.study3.biz/wp-content/uploads/2019/04/Big-LSTM-Manjaro-Linux-RTX-2080Ti-x2-Nvlink-CUDA-10.1-i7-5960x-19.04-py3-TensorFlow-Training-performance-wordssecond-wps64160-63475-62864.txt
      http://wp.study3.biz/wp-content/uploads/2019/04/Big-LSTM-CentOS7.6-RTX-2080Ti-x2-Nvlink-CUDA-10.1-i7-5960x-19.04-py3-TensorFlow-Training-performance-wordssecond-wps68609-67909-68423.txt