10/25/2019

NVIDIA Docker SSD Training Analysis (2nd Analysis)

1.  NVIDIA Object Detection SSD Docker    

The NVIDIA TensorFlow deep learning examples are currently provided on GitHub; check each of the sites below.

  • GitHub NVIDIA DeepLearning SSD repository 
This write-up follows the README.md of the SSD repository below; you can work through it alongside this document.
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

Using the source above, the NVIDIA TensorFlow Object Detection SSD Docker setup can be built and tested easily.

  • Other GitHub deep learning examples 
Other NVIDIA deep learning examples exist at the repository below and are worth a look (not yet tested).
  https://github.com/NVIDIA/DeepLearningExamples



  • Other reference sites 

NVIDIA's overview of the training features (mixed precision) for each framework
  https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

The TensorFlow guide on the TensorFlow site
  https://www.tensorflow.org/tutorials?hl=ko


1.1 NVIDIA SSD Docker Quick Guide 

Following the README.md, running the steps below makes it easy to train the Object Detection SSD model with Docker (based on the COCO dataset).

  • Quick Guide 1. Clone the repository
$ git clone https://github.com/NVIDIA/DeepLearningExamples
$ cd DeepLearningExamples/TensorFlow/Detection/SSD


  • Quick Guide 2. Build the SSD320 v1.2 TensorFlow NGC container.
$ docker build . -t nvidia_ssd 

Running the above builds a new Docker image from the Dockerfile.
Alternatively, an image can also be created directly from a running container with docker commit.

  • Quick Guide  3. Download and preprocess the dataset. (COCO 2017)
$ ./download_all.sh nvidia_ssd /home/jhlee/works/ssd/data /home/jhlee/works/ssd/check 

  • Quick Guide 4. Launch the NGC container to run training/inference.
$ nvidia-docker run --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
nvidia_ssd 

  • Quick Guide  5. Start training.
root@c7550d6b2c59:/workdir/models/research#  bash ./examples/SSD320_FP16_1GPU.sh /checkpoints  

  • Quick Guide  6. Start validation/evaluation.
root@c7550d6b2c59:/workdir/models/research#   bash examples/SSD320_evaluate.sh /checkpoints  


2. Running and Analyzing the Object Detection SSD 


2.1  Running and Analyzing Quick Guide 1~2 

This step installs the required packages on top of nvcr.io/nvidia/tensorflow:19.05-py3 and builds a new Docker image.

  • Run the following directly on the HOST 
Run the commands exactly as given in the GitHub README above.

$ cd ~/works
$ mkdir ssd 
$ cd ssd
$ mkdir data
$ mkdir check 
$ git clone https://github.com/NVIDIA/DeepLearningExamples
$ cd DeepLearningExamples/TensorFlow/Detection/SSD
$ ls 
configs  Dockerfile  download_all.sh  examples  img  models  NOTICE  README.md  requirements.txt 



  • Build the Docker image 
The Dockerfile installs the required packages on top of nvcr.io/nvidia/tensorflow:19.05-py3 and then produces the image.

$ docker build . -t nvidia_ssd    // build the image from the Dockerfile 


  • Understanding the Dockerfile above 
To understand the Dockerfile below, note that the current directory matters (the COPY paths are relative to it).

$ pwd 
/home/jhlee/works/ssd/DeepLearningExamples/TensorFlow/Detection/SSD
$ cat Dockerfile 
FROM nvcr.io/nvidia/tensorflow:19.05-py3 as base

FROM base as sha

RUN mkdir /sha
RUN cat `cat HEAD | cut -d' ' -f2` > /sha/repo_sha

FROM base as final

WORKDIR /workdir

RUN PROTOC_VERSION=3.0.0 && \
    PROTOC_ZIP=protoc-${PROTOC_VERSION}-linux-x86_64.zip && \
    curl -OL https://github.com/google/protobuf/releases/download/v$PROTOC_VERSION/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    rm -f $PROTOC_ZIP

COPY requirements.txt .
RUN pip install Cython
RUN pip install -r requirements.txt

WORKDIR models/research/
COPY models/research/ .
RUN protoc object_detection/protos/*.proto --python_out=.
ENV PYTHONPATH="/workdir/models/research/:/workdir/models/research/slim/:$PYTHONPATH"

COPY examples/ examples
COPY configs/ configs/
COPY download_all.sh download_all.sh

COPY --from=sha /sha .  


  • Google Protocol Buffer
Thanks to the post below for explaining this topic in detail.
  https://bcho.tistory.com/1182

  • Dockerfile
Also published as a book; an easy reference for writing and using Dockerfiles.
  http://pyrasis.com/docker.html

Files at the current location referenced above
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

  • Check the generated Docker image 
$ docker images   
REPOSITORY                  TAG                             IMAGE ID            CREATED             SIZE
nvidia_ssd                  latest                          ab529215f717        5 minutes ago       6.97GB
<none>                      <none>                          a6bc644c75ed        6 minutes ago       6.96GB  // intermediate image created while building nvidia_ssd
nvcr.io/nvidia/tensorflow   19.08-py3                       be978d32a5c3        8 weeks ago         7.35GB
nvcr.io/nvidia/cuda         10.1-cudnn7-devel-ubuntu18.04   0ead98c22e04        8 weeks ago         3.67GB
nvidia/cuda                 9.0-devel                       2a64416134d8        8 weeks ago         2.03GB
nvcr.io/nvidia/cuda         10.1-devel-ubuntu18.04          946e78c7b298        8 weeks ago         2.83GB
nvidia/cuda                 10.1-base                       a5f5d3b655ca        8 weeks ago         106MB
nvcr.io/nvidia/tensorflow   19.05-py3                       01c8c4b0d7ff        5 months ago        6.96GB


For nvidia_ssd, the <none> intermediate image and nvcr.io/nvidia/tensorflow:19.05-py3 are both needed.


2.2  Running and Analyzing Quick Guide 3

Quick Guide 3 downloads the COCO dataset and builds the TFRecord files from it.

  • COCO dataset download and TFRecord creation (download_all.sh)
This shell script runs on the host and needs the two host directories below.
Its main job is downloading the COCO dataset and generating TFRecords from it.
  1. /data/coco2017_tfrecords : storage for the COCO data and the generated TFRecords 
  2. /checkpoints : TensorFlow checkpoint files; covered separately below. 

Run on the HOST
$ ./download_all.sh nvidia_ssd /home/jhlee/works/ssd/data /home/jhlee/works/ssd/check 


$ cat ./download_all.sh // basic analysis: mostly the usual setup, but at the end it runs a shell inside the container that downloads the dataset and builds the TFRecords 

if [ -z $1 ]; then echo "Docker container name is missing" && exit 1; fi
## 1st ARG : CONTAINER NAME
## 2nd ARG : BASE PATH /data/coco2017_tfrecords
## 3rd ARG : BASE PATH /checkpoints 
CONTAINER=$1
COCO_DIR=${2:-"/data/coco2017_tfrecords"}
CHECKPOINT_DIR=${3:-"/checkpoints"}
mkdir -p $COCO_DIR
chmod 777 $COCO_DIR
# Download backbone checkpoint
mkdir -p $CHECKPOINT_DIR
chmod 777 $CHECKPOINT_DIR
cd $CHECKPOINT_DIR
wget http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz
tar -xzf resnet_v1_50_2016_08_28.tar.gz
mkdir -p resnet_v1_50
mv resnet_v1_50.ckpt resnet_v1_50/model.ckpt
## usable with nvidia-docker or docker; the script below starts the container and immediately runs a bash script inside it 
## download_and_preprocess_mscoco.sh inside the container downloads COCO 2017 and then creates the TFRecords as below 
nvidia-docker run --rm -it -u 123 -v $COCO_DIR:/data/coco2017_tfrecords $CONTAINER bash -c '
# Create TFRecords
bash /workdir/models/research/object_detection/dataset_tools/download_and_preprocess_mscoco.sh \
    /data/coco2017_tfrecords'



  • Analyzing download_and_preprocess_mscoco.sh
This is the shell script that actually runs inside the container; to inspect it, start the container and look at it there.
  1. Download COCO 2017 (including the annotations)
  2. Generate TFRecords from the dataset 

Start a container as follows to inspect the script.

$ nvidia-docker run --rm -it -u 123 -v $HOME/works/ssd/data:/data/coco2017_tfrecords nvidia_ssd 
================
== TensorFlow ==
================

NVIDIA Release 19.05 (build 6390160)
TensorFlow Version 1.13.1
.....

I have no name!@a4891a3ac177:/workdir/models/research$ cat object_detection/dataset_tools/download_and_preprocess_mscoco.sh 
#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "usage download_and_preprocess_mscoco.sh [data dir]"
  exit
fi

if [ "$(uname)" == "Darwin" ]; then
  UNZIP="tar -xf"
else
  UNZIP="unzip -nq"
fi

# Create the output directories.
OUTPUT_DIR="${1%/}"
SCRATCH_DIR="${OUTPUT_DIR}/raw-data"
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${SCRATCH_DIR}"
CURRENT_DIR=$(pwd)

# Helper function to download and unpack a .zip file.
function download_and_unzip() {
  local BASE_URL=${1}
  local FILENAME=${2}

  if [ ! -f ${FILENAME} ]; then
    echo "Downloading ${FILENAME} to $(pwd)"
    wget -nd -c "${BASE_URL}/${FILENAME}"
  else
    echo "Skipping download of ${FILENAME}"
  fi
  echo "Unzipping ${FILENAME}"
  ${UNZIP} ${FILENAME}
}

cd ${SCRATCH_DIR}

## literally downloads the COCO set; there are a lot of images 
## (the dataset itself deserves a closer look)

# Download the images.     
BASE_IMAGE_URL="http://images.cocodataset.org/zips"

TRAIN_IMAGE_FILE="train2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TRAIN_IMAGE_FILE}
TRAIN_IMAGE_DIR="${SCRATCH_DIR}/train2017"

VAL_IMAGE_FILE="val2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${VAL_IMAGE_FILE}
VAL_IMAGE_DIR="${SCRATCH_DIR}/val2017"

TEST_IMAGE_FILE="test2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TEST_IMAGE_FILE}
TEST_IMAGE_DIR="${SCRATCH_DIR}/test2017"

## downloads the annotations; there are quite a few kinds, so the dataset's structure is worth understanding  

# Download the annotations.
BASE_INSTANCES_URL="http://images.cocodataset.org/annotations"
INSTANCES_FILE="annotations_trainval2017.zip"
download_and_unzip ${BASE_INSTANCES_URL} ${INSTANCES_FILE}

#
# Train and validation only use instances_train2017.json / instances_val2017.json from the annotations 
#
TRAIN_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/instances_train2017.json"
VAL_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/instances_val2017.json"

# Download the test image info.
BASE_IMAGE_INFO_URL="http://images.cocodataset.org/annotations"
IMAGE_INFO_FILE="image_info_test2017.zip"
download_and_unzip ${BASE_IMAGE_INFO_URL} ${IMAGE_INFO_FILE}

#
# For test, image_info_test-dev2017.json is used from the annotations 
#
TESTDEV_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/image_info_test-dev2017.json"

# Build TFRecords of the image data.
cd "${CURRENT_DIR}"
python object_detection/dataset_tools/create_coco_tf_record.py \
  --logtostderr \
  --include_masks \
  --train_image_dir="${TRAIN_IMAGE_DIR}" \
  --val_image_dir="${VAL_IMAGE_DIR}" \
  --test_image_dir="${TEST_IMAGE_DIR}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --output_dir="${OUTPUT_DIR}"


Above, dataset_tools/create_coco_tf_record.py is used to generate the TFRecord files.
If the dataset changes, refer to dataset_tools for the matching converter (a minimal custom-dataset sketch follows below).

Preparing Inputs (shows the setup for other datasets)
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/preparing_inputs.md
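For a custom dataset, the converter boils down to writing tf.train.Example protos with the same feature keys. A minimal sketch (hypothetical paths and labels, assuming the container's TF 1.x API and object_detection's dataset_util helpers; box coordinates are normalized to [0, 1], as in create_coco_tf_record.py shown further below):

import tensorflow as tf
from object_detection.utils import dataset_util

def make_example(image_path, xmins, xmaxs, ymins, ymaxs, class_texts, class_ids, height, width):
    # read the JPEG bytes and pack image + boxes into one tf.train.Example
    with tf.gfile.GFile(image_path, 'rb') as fid:
        encoded_jpg = fid.read()
    feature = {
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(b'jpeg'),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(class_texts),
        'image/object/class/label': dataset_util.int64_list_feature(class_ids),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# hypothetical single-image dataset, written to a hypothetical path
writer = tf.python_io.TFRecordWriter('/data/my_dataset/train.record')
example = make_example('/data/my_dataset/img0001.jpg',
                       [0.1], [0.5], [0.2], [0.6], [b'person'], [1], 480, 640)
writer.write(example.SerializeToString())
writer.close()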


root@c7550d6b2c59:/workdir/models/research# ls object_detection/dataset_tools/
__init__.py                    create_kitti_tf_record.py       create_pascal_tf_record.py       create_pycocotools_package.sh         oid_hierarchical_labels_expansion_test.py  tf_record_creation_util.py
create_coco_tf_record.py       create_kitti_tf_record_test.py  create_pascal_tf_record_test.py  download_and_preprocess_mscoco.sh     oid_tfrecord_creation.py                   tf_record_creation_util_test.py
create_coco_tf_record_test.py  create_oid_tf_record.py         create_pet_tf_record.py          oid_hierarchical_labels_expansion.py  oid_tfrecord_creation_test.py

## as seen below, COCO annotations are in JSON form 
root@c7550d6b2c59:/workdir/models/research# cat object_detection/dataset_tools/create_coco_tf_record.py 
r"""Convert raw COCO dataset to TFRecord for object_detection.

Please note that this tool creates sharded output files.

Example usage:
    python create_coco_tf_record.py --logtostderr \
      --train_image_dir="${TRAIN_IMAGE_DIR}" \
      --val_image_dir="${VAL_IMAGE_DIR}" \
      --test_image_dir="${TEST_IMAGE_DIR}" \
      --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
      --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
      --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
      --output_dir="${OUTPUT_DIR}"
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import hashlib
import io
import json
import os
import contextlib2
import numpy as np
import PIL.Image

from pycocotools import mask
import tensorflow as tf

from object_detection.dataset_tools import tf_record_creation_util
from object_detection.utils import dataset_util
from object_detection.utils import label_map_util


flags = tf.app.flags
tf.flags.DEFINE_boolean('include_masks', False,
                        'Whether to include instance segmentations masks '
                        '(PNG encoded) in the result. default: False.')
tf.flags.DEFINE_string('train_image_dir', '',
                       'Training image directory.')
tf.flags.DEFINE_string('val_image_dir', '',
                       'Validation image directory.')
tf.flags.DEFINE_string('test_image_dir', '',
                       'Test image directory.')
tf.flags.DEFINE_string('train_annotations_file', '',
                       'Training annotations JSON file.')
tf.flags.DEFINE_string('val_annotations_file', '',
                       'Validation annotations JSON file.')
tf.flags.DEFINE_string('testdev_annotations_file', '',
                       'Test-dev annotations JSON file.')
tf.flags.DEFINE_string('output_dir', '/tmp/', 'Output data directory.')

FLAGS = flags.FLAGS

tf.logging.set_verbosity(tf.logging.INFO)


def create_tf_example(image,
                      annotations_list,
                      image_dir,
                      category_index,
                      include_masks=False):
  """Converts image and annotations to a tf.Example proto.

  Args:
    image: dict with keys:
      [u'license', u'file_name', u'coco_url', u'height', u'width',
      u'date_captured', u'flickr_url', u'id']
    annotations_list:
      list of dicts with keys:
      [u'segmentation', u'area', u'iscrowd', u'image_id',
      u'bbox', u'category_id', u'id']
      Notice that bounding box coordinates in the official COCO dataset are
      given as [x, y, width, height] tuples using absolute coordinates where
      x, y represent the top-left (0-indexed) corner.  This function converts
      to the format expected by the Tensorflow Object Detection API (which is
      which is [ymin, xmin, ymax, xmax] with coordinates normalized relative
      to image size).
    image_dir: directory containing the image files.
    category_index: a dict containing COCO category information keyed
      by the 'id' field of each category.  See the
      label_map_util.create_category_index function.
    include_masks: Whether to include instance segmentations masks
      (PNG encoded) in the result. default: False.
  Returns:
    example: The converted tf.Example
    num_annotations_skipped: Number of (invalid) annotations that were ignored.

  Raises:
    ValueError: if the image pointed to by data['filename'] is not a valid JPEG
  """
  image_height = image['height']
  image_width = image['width']
  filename = image['file_name']
  image_id = image['id']

  full_path = os.path.join(image_dir, filename)
  with tf.gfile.GFile(full_path, 'rb') as fid:
    encoded_jpg = fid.read()
  encoded_jpg_io = io.BytesIO(encoded_jpg)
  image = PIL.Image.open(encoded_jpg_io)
  key = hashlib.sha256(encoded_jpg).hexdigest()

  xmin = []
  xmax = []
  ymin = []
  ymax = []
  is_crowd = []
  category_names = []
  category_ids = []
  area = []
  encoded_mask_png = []
  num_annotations_skipped = 0
  for object_annotations in annotations_list:
    (x, y, width, height) = tuple(object_annotations['bbox'])
    if width <= 0 or height <= 0:
      num_annotations_skipped += 1
      continue
    if x + width > image_width or y + height > image_height:
      num_annotations_skipped += 1
      continue
    xmin.append(float(x) / image_width)
    xmax.append(float(x + width) / image_width)
    ymin.append(float(y) / image_height)
    ymax.append(float(y + height) / image_height)
    is_crowd.append(object_annotations['iscrowd'])
    category_id = int(object_annotations['category_id'])
    category_ids.append(category_id)
    category_names.append(category_index[category_id]['name'].encode('utf8'))
    area.append(object_annotations['area'])

    if include_masks:
      run_len_encoding = mask.frPyObjects(object_annotations['segmentation'],
                                          image_height, image_width)
      binary_mask = mask.decode(run_len_encoding)
      if not object_annotations['iscrowd']:
        binary_mask = np.amax(binary_mask, axis=2)
      pil_image = PIL.Image.fromarray(binary_mask)
      output_io = io.BytesIO()
      pil_image.save(output_io, format='PNG')
      encoded_mask_png.append(output_io.getvalue())
  feature_dict = {
      'image/height':
          dataset_util.int64_feature(image_height),
      'image/width':
          dataset_util.int64_feature(image_width),
      'image/filename':
          dataset_util.bytes_feature(filename.encode('utf8')),
      'image/source_id':
          dataset_util.bytes_feature(str(image_id).encode('utf8')),
      'image/key/sha256':
          dataset_util.bytes_feature(key.encode('utf8')),
      'image/encoded':
          dataset_util.bytes_feature(encoded_jpg),
      'image/format':
          dataset_util.bytes_feature('jpeg'.encode('utf8')),
      'image/object/bbox/xmin':
          dataset_util.float_list_feature(xmin),
      'image/object/bbox/xmax':
          dataset_util.float_list_feature(xmax),
      'image/object/bbox/ymin':
          dataset_util.float_list_feature(ymin),
      'image/object/bbox/ymax':
          dataset_util.float_list_feature(ymax),
      'image/object/class/text':
          dataset_util.bytes_list_feature(category_names),
      'image/object/is_crowd':
          dataset_util.int64_list_feature(is_crowd),
      'image/object/area':
          dataset_util.float_list_feature(area),
  }
  if include_masks:
    feature_dict['image/object/mask'] = (
        dataset_util.bytes_list_feature(encoded_mask_png))
  example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
  return key, example, num_annotations_skipped


def _create_tf_record_from_coco_annotations(
    annotations_file, image_dir, output_path, include_masks, num_shards):
  """Loads COCO annotation json files and converts to tf.Record format.

  Args:
    annotations_file: JSON file containing bounding box annotations.
    image_dir: Directory containing the image files.
    output_path: Path to output tf.Record file.
    include_masks: Whether to include instance segmentations masks
      (PNG encoded) in the result. default: False.
    num_shards: number of output file shards.
  """
  with contextlib2.ExitStack() as tf_record_close_stack, \
      tf.gfile.GFile(annotations_file, 'r') as fid:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_path, num_shards)
    groundtruth_data = json.load(fid)
    images = groundtruth_data['images']
    category_index = label_map_util.create_category_index(
        groundtruth_data['categories'])

    annotations_index = {}
    if 'annotations' in groundtruth_data:
      tf.logging.info(
          'Found groundtruth annotations. Building annotations index.')
      for annotation in groundtruth_data['annotations']:
        image_id = annotation['image_id']
        if image_id not in annotations_index:
          annotations_index[image_id] = []
        annotations_index[image_id].append(annotation)
    missing_annotation_count = 0
    for image in images:
      image_id = image['id']
      if image_id not in annotations_index:
        missing_annotation_count += 1
        annotations_index[image_id] = []
    tf.logging.info('%d images are missing annotations.',
                    missing_annotation_count)

    total_num_annotations_skipped = 0
    for idx, image in enumerate(images):
      if idx % 100 == 0:
        tf.logging.info('On image %d of %d', idx, len(images))
      annotations_list = annotations_index[image['id']]
      _, tf_example, num_annotations_skipped = create_tf_example(
          image, annotations_list, image_dir, category_index, include_masks)
      total_num_annotations_skipped += num_annotations_skipped
      shard_idx = idx % num_shards
      output_tfrecords[shard_idx].write(tf_example.SerializeToString())
    tf.logging.info('Finished writing, skipped %d annotations.',
                    total_num_annotations_skipped)


def main(_):
  assert FLAGS.train_image_dir, '`train_image_dir` missing.'
  assert FLAGS.val_image_dir, '`val_image_dir` missing.'
  assert FLAGS.test_image_dir, '`test_image_dir` missing.'
  assert FLAGS.train_annotations_file, '`train_annotations_file` missing.'
  assert FLAGS.val_annotations_file, '`val_annotations_file` missing.'
  assert FLAGS.testdev_annotations_file, '`testdev_annotations_file` missing.'

  if not tf.gfile.IsDirectory(FLAGS.output_dir):
    tf.gfile.MakeDirs(FLAGS.output_dir)
  train_output_path = os.path.join(FLAGS.output_dir, 'coco_train.record')
  val_output_path = os.path.join(FLAGS.output_dir, 'coco_val.record')
  testdev_output_path = os.path.join(FLAGS.output_dir, 'coco_testdev.record')

  _create_tf_record_from_coco_annotations(
      FLAGS.train_annotations_file,
      FLAGS.train_image_dir,
      train_output_path,
      FLAGS.include_masks,
      num_shards=100)
  _create_tf_record_from_coco_annotations(
      FLAGS.val_annotations_file,
      FLAGS.val_image_dir,
      val_output_path,
      FLAGS.include_masks,
      num_shards=10)
  _create_tf_record_from_coco_annotations(
      FLAGS.testdev_annotations_file,
      FLAGS.test_image_dir,
      testdev_output_path,
      FLAGS.include_masks,
      num_shards=100)


if __name__ == '__main__':
  tf.app.run()



  • Verify the generated TFRecords 
  1. coco_train.record : set in the pipeline config
  2. coco_val.record  : set in the pipeline config
  3. coco_testdev.record  : not yet confirmed whether it is actually used 

root@4b038f3383f2:/workdir/models/research# ls /data/coco2017_tfrecords/
annotation                          coco_testdev.record-00035-of-00100  coco_testdev.record-00071-of-00100  coco_train.record-00007-of-00100  coco_train.record-00043-of-00100  coco_train.record-00079-of-00100
coco_testdev.record-00000-of-00100  coco_testdev.record-00036-of-00100  coco_testdev.record-00072-of-00100  coco_train.record-00008-of-00100  coco_train.record-00044-of-00100  coco_train.record-00080-of-00100
coco_testdev.record-00001-of-00100  coco_testdev.record-00037-of-00100  coco_testdev.record-00073-of-00100  coco_train.record-00009-of-00100  coco_train.record-00045-of-00100  coco_train.record-00081-of-00100
coco_testdev.record-00002-of-00100  coco_testdev.record-00038-of-00100  coco_testdev.record-00074-of-00100  coco_train.record-00010-of-00100  coco_train.record-00046-of-00100  coco_train.record-00082-of-00100
coco_testdev.record-00003-of-00100  coco_testdev.record-00039-of-00100  coco_testdev.record-00075-of-00100  coco_train.record-00011-of-00100  coco_train.record-00047-of-00100  coco_train.record-00083-of-00100
coco_testdev.record-00004-of-00100  coco_testdev.record-00040-of-00100  coco_testdev.record-00076-of-00100  coco_train.record-00012-of-00100  coco_train.record-00048-of-00100  coco_train.record-00084-of-00100
coco_testdev.record-00005-of-00100  coco_testdev.record-00041-of-00100  coco_testdev.record-00077-of-00100  coco_train.record-00013-of-00100  coco_train.record-00049-of-00100  coco_train.record-00085-of-00100
coco_testdev.record-00006-of-00100  coco_testdev.record-00042-of-00100  coco_testdev.record-00078-of-00100  coco_train.record-00014-of-00100  coco_train.record-00050-of-00100  coco_train.record-00086-of-00100
coco_testdev.record-00007-of-00100  coco_testdev.record-00043-of-00100  coco_testdev.record-00079-of-00100  coco_train.record-00015-of-00100  coco_train.record-00051-of-00100  coco_train.record-00087-of-00100
coco_testdev.record-00008-of-00100  coco_testdev.record-00044-of-00100  coco_testdev.record-00080-of-00100  coco_train.record-00016-of-00100  coco_train.record-00052-of-00100  coco_train.record-00088-of-00100
coco_testdev.record-00009-of-00100  coco_testdev.record-00045-of-00100  coco_testdev.record-00081-of-00100  coco_train.record-00017-of-00100  coco_train.record-00053-of-00100  coco_train.record-00089-of-00100
coco_testdev.record-00010-of-00100  coco_testdev.record-00046-of-00100  coco_testdev.record-00082-of-00100  coco_train.record-00018-of-00100  coco_train.record-00054-of-00100  coco_train.record-00090-of-00100
coco_testdev.record-00011-of-00100  coco_testdev.record-00047-of-00100  coco_testdev.record-00083-of-00100  coco_train.record-00019-of-00100  coco_train.record-00055-of-00100  coco_train.record-00091-of-00100
coco_testdev.record-00012-of-00100  coco_testdev.record-00048-of-00100  coco_testdev.record-00084-of-00100  coco_train.record-00020-of-00100  coco_train.record-00056-of-00100  coco_train.record-00092-of-00100
coco_testdev.record-00013-of-00100  coco_testdev.record-00049-of-00100  coco_testdev.record-00085-of-00100  coco_train.record-00021-of-00100  coco_train.record-00057-of-00100  coco_train.record-00093-of-00100
coco_testdev.record-00014-of-00100  coco_testdev.record-00050-of-00100  coco_testdev.record-00086-of-00100  coco_train.record-00022-of-00100  coco_train.record-00058-of-00100  coco_train.record-00094-of-00100
coco_testdev.record-00015-of-00100  coco_testdev.record-00051-of-00100  coco_testdev.record-00087-of-00100  coco_train.record-00023-of-00100  coco_train.record-00059-of-00100  coco_train.record-00095-of-00100
coco_testdev.record-00016-of-00100  coco_testdev.record-00052-of-00100  coco_testdev.record-00088-of-00100  coco_train.record-00024-of-00100  coco_train.record-00060-of-00100  coco_train.record-00096-of-00100
coco_testdev.record-00017-of-00100  coco_testdev.record-00053-of-00100  coco_testdev.record-00089-of-00100  coco_train.record-00025-of-00100  coco_train.record-00061-of-00100  coco_train.record-00097-of-00100
coco_testdev.record-00018-of-00100  coco_testdev.record-00054-of-00100  coco_testdev.record-00090-of-00100  coco_train.record-00026-of-00100  coco_train.record-00062-of-00100  coco_train.record-00098-of-00100
coco_testdev.record-00019-of-00100  coco_testdev.record-00055-of-00100  coco_testdev.record-00091-of-00100  coco_train.record-00027-of-00100  coco_train.record-00063-of-00100  coco_train.record-00099-of-00100
coco_testdev.record-00020-of-00100  coco_testdev.record-00056-of-00100  coco_testdev.record-00092-of-00100  coco_train.record-00028-of-00100  coco_train.record-00064-of-00100  coco_val.record-00000-of-00010
coco_testdev.record-00021-of-00100  coco_testdev.record-00057-of-00100  coco_testdev.record-00093-of-00100  coco_train.record-00029-of-00100  coco_train.record-00065-of-00100  coco_val.record-00001-of-00010
coco_testdev.record-00022-of-00100  coco_testdev.record-00058-of-00100  coco_testdev.record-00094-of-00100  coco_train.record-00030-of-00100  coco_train.record-00066-of-00100  coco_val.record-00002-of-00010
coco_testdev.record-00023-of-00100  coco_testdev.record-00059-of-00100  coco_testdev.record-00095-of-00100  coco_train.record-00031-of-00100  coco_train.record-00067-of-00100  coco_val.record-00003-of-00010
coco_testdev.record-00024-of-00100  coco_testdev.record-00060-of-00100  coco_testdev.record-00096-of-00100  coco_train.record-00032-of-00100  coco_train.record-00068-of-00100  coco_val.record-00004-of-00010
coco_testdev.record-00025-of-00100  coco_testdev.record-00061-of-00100  coco_testdev.record-00097-of-00100  coco_train.record-00033-of-00100  coco_train.record-00069-of-00100  coco_val.record-00005-of-00010
coco_testdev.record-00026-of-00100  coco_testdev.record-00062-of-00100  coco_testdev.record-00098-of-00100  coco_train.record-00034-of-00100  coco_train.record-00070-of-00100  coco_val.record-00006-of-00010
coco_testdev.record-00027-of-00100  coco_testdev.record-00063-of-00100  coco_testdev.record-00099-of-00100  coco_train.record-00035-of-00100  coco_train.record-00071-of-00100  coco_val.record-00007-of-00010
coco_testdev.record-00028-of-00100  coco_testdev.record-00064-of-00100  coco_train.record-00000-of-00100    coco_train.record-00036-of-00100  coco_train.record-00072-of-00100  coco_val.record-00008-of-00010
coco_testdev.record-00029-of-00100  coco_testdev.record-00065-of-00100  coco_train.record-00001-of-00100    coco_train.record-00037-of-00100  coco_train.record-00073-of-00100  coco_val.record-00009-of-00010
coco_testdev.record-00030-of-00100  coco_testdev.record-00066-of-00100  coco_train.record-00002-of-00100    coco_train.record-00038-of-00100  coco_train.record-00074-of-00100  raw-data
coco_testdev.record-00031-of-00100  coco_testdev.record-00067-of-00100  coco_train.record-00003-of-00100    coco_train.record-00039-of-00100  coco_train.record-00075-of-00100
coco_testdev.record-00032-of-00100  coco_testdev.record-00068-of-00100  coco_train.record-00004-of-00100    coco_train.record-00040-of-00100  coco_train.record-00076-of-00100
coco_testdev.record-00033-of-00100  coco_testdev.record-00069-of-00100  coco_train.record-00005-of-00100    coco_train.record-00041-of-00100  coco_train.record-00077-of-00100
coco_testdev.record-00034-of-00100  coco_testdev.record-00070-of-00100  coco_train.record-00006-of-00100    coco_train.record-00042-of-00100  coco_train.record-00078-of-00100



TFRecord and tf.Example
  https://www.tensorflow.org/tutorials/load_data/tfrecord
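As a quick sanity check, one shard can be read back and parsed into a tf.Example (a minimal sketch using the container's TF 1.x API; the shard name is one of the files listed above):

import tensorflow as tf

path = '/data/coco2017_tfrecords/coco_val.record-00000-of-00010'
for record in tf.python_io.tf_record_iterator(path):
    example = tf.train.Example()
    example.ParseFromString(record)
    # print the feature keys of the first record, e.g. image/encoded, image/object/bbox/xmin, ...
    print(sorted(example.features.feature.keys()))
    break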

  • COCO DATASET 2017 files used above
  http://images.cocodataset.org/zips/train2017.zip
  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
  http://images.cocodataset.org/zips/val2017.zip
  http://images.cocodataset.org/zips/test2017.zip

  • COCO dataset details revisited
  https://ahyuo79.blogspot.com/2019/10/cocodata-set.html


2.3 Running and Analyzing Quick Guide 4~5 

First start the Docker container as shown below, then run the TensorFlow training shell script.
In my case there is only one GPU, so the training part is very slow.


nvidia-docker will likely be phased out eventually; it can also be run with plain docker as below, but the NVIDIA Container Toolkit must be installed first.
See the earlier post for the related setup.
  https://ahyuo79.blogspot.com/2019/10/nvidia-docker.html

  • Run with the installed nvidia-docker2 

$ nvidia-docker run --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd 


  • Run with plain docker instead (when not using nvidia-docker2)
  1. Add port mappings for TensorBoard/Jupyter
  2. Set a name so the container is easy to find

$ docker run --gpus all --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd




  • Run and analyze the training step 
Looking at the training shell script, there is also a config file used internally; it is worth knowing about.

root@c7550d6b2c59:/workdir/models/research# bash ./examples/SSD320_FP16_1GPU.sh /checkpoints 


root@c7550d6b2c59:/workdir/models/research# cat examples/SSD320_FP16_1GPU.sh 

CKPT_DIR=${1:-"/results/SSD320_FP16_1GPU"}
PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_1gpus.config"

export TF_ENABLE_AUTO_MIXED_PRECISION=1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

time python -u ./object_detection/model_main.py \
       --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
       --model_dir=${CKPT_DIR} \
       --alsologtostder \
       "${@:3}"



  • Configuring the Object Detection Training Pipeline
The pipeline config holds the settings required for training, and it appears to differ slightly per model.
For more config examples, look in object_detection/samples/configs.
The pre-trained model here is resnet_v1_50, so it is worth knowing how it is used.

root@a79a83fc99f6:/workdir/models/research# cat configs/ssd320_full_1gpus.config 
# SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal
# loss (a.k.a Retinanet).
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: true
    num_classes: 90         ## number of label classes; matches the class count in object_detection/data/mscoco_label_map.pbtxt
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5      ## only matches above 50% are kept; observable later with object_detection/object_detection_tutorial.ipynb 
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {                ## this appears to be the network input shape
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 256
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.0004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'       # ResNet-50 is used as the feature extractor; this can be changed  
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6                 
        max_detections_per_class: 100     ### max detections per class
        max_total_detections: 100         ### max total detections; same as output_dict['num_detections'] in object_detection/object_detection_tutorial.ipynb 
      }
      score_converter: SIGMOID
    }
  }
}

# 
# The model below is Google's pre-trained model.
# When SSD uses the pre-trained model only as its internal feature extractor, fine_tune_checkpoint_type is "classification";
# for a detection checkpoint (e.g. Faster R-CNN), it is "detection".
#

train_config: {
  fine_tune_checkpoint: "/checkpoints/resnet_v1_50/model.ckpt"        ## location of the pre-trained model downloaded above 
  fine_tune_checkpoint_type: "classification"                         # said to differ depending on the model 
  batch_size: 32    ## can cause GPU out-of-memory; set it to fit your GPU memory or switch to CPU mode 
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 100000    ## 100,000 steps (can also be set with object_detection/model_main.py --num_train_steps) 
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .02000000000000000000
          total_steps: 100000
          warmup_learning_rate: .00866640000000000000
          warmup_steps: 8000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100         ## should probably match max_total_detections; see output_dict['detection_boxes'] in object_detection_tutorial.ipynb
  unpad_groundtruth_tensors: false
}

#
# Training Setting 
# 
# input_path:  //TF Record 
#      coco_train.record-00000-of-00100 
#      coco_train.record-00001-of-00100 
#      .....
# label_map_path:
#      mscoco_label_map.pbtxt
#
train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/coco2017_tfrecords/*train*"    ## tf_record format 위치 
  }
  label_map_path: "object_detection/data/mscoco_label_map.pbtxt"  ## Label 위치 
}


# 
# Eval Setting 
# 
#
#

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 8000   ## number of examples used during eval 
  ##max_evals: 10               ## max number of eval runs (settable with object_detection/model_main.py --eval_count); not present in the original config 
}

# 
# Eval Setting 
# 
# input_path:  //TF Record 
#      coco_val.record-00000-of-00010
#      coco_val.record-00001-of-00010 
#  
# label_map_path:
#      mscoco_label_map.pbtxt
#

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/coco2017_tfrecords/*val*"  ## tf_record format 위치 
  }
  label_map_path: "object_detection/data/mscoco_label_map.pbtxt" ## Label 정보확인가능 
  shuffle: false
  num_readers: 1
}

  • Pre-trained Models
  https://github.com/tensorflow/models/tree/master/research/slim

  • Configuring the Object Detection Training Pipeline
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md
  https://medium.com/coinmonks/modelling-transfer-learning-using-tensorflows-object-detection-model-on-mac-692c8609be40
  https://devtalk.nvidia.com/default/topic/1049371/tensorrt/how-to-visualize-tf-trt-graphs-with-tensorboard-/
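Instead of editing configs/ssd320_full_1gpus.config by hand, the pipeline config can also be loaded and tweaked programmatically. A rough sketch (run inside the container; assumes the pipeline_pb2 proto compiled by protoc in the Dockerfile; the output file name ssd320_small.config is made up):

import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile('configs/ssd320_full_1gpus.config', 'r') as f:
    text_format.Merge(f.read(), config)

config.train_config.batch_size = 16     # shrink for a smaller GPU
config.train_config.num_steps = 20000   # shorter run

with tf.gfile.GFile('configs/ssd320_small.config', 'w') as f:
    f.write(text_format.MessageToString(config))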


2.4 Running and Analyzing Quick Guide 6 

This is the validation/evaluation step after training; it seems to act as a check on the trained model, but to pin down its exact role one needs to understand the basics of TensorFlow and the dataset.


root@c7550d6b2c59:/workdir/models/research#  bash examples/SSD320_evaluate.sh /checkpoints 


root@c7550d6b2c59:/workdir/models/research#  cat examples/SSD320_evaluate.sh
CHECKPINT_DIR=$1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

python object_detection/model_main.py --checkpoint_dir $CHECKPINT_DIR --model_dir /results --run_once --pipeline_config_path configs/ssd320_full_1gpus.config


  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md

2.5 Checking the Final Generated Checkpoint Files

The final artifacts produced with the NVIDIA Docker are checkpoint files; the PB file has to be created by yourself, and inference should also be done on top of it.
The checkpoints currently consist of model.ckpt-0 and model.ckpt-100000 as shown below.

root@c7550d6b2c59:/workdir/models/research# ls  /checkpoints/
checkpoint                                   model.ckpt-0.data-00000-of-00002  model.ckpt-100000.data-00000-of-00002  resnet_v1_50
eval                                         model.ckpt-0.data-00001-of-00002  model.ckpt-100000.data-00001-of-00002  resnet_v1_50_2016_08_28.tar.gz
events.out.tfevents.1572262719.c7550d6b2c59  model.ckpt-0.index                model.ckpt-100000.index                
graph.pbtxt                                  model.ckpt-0.meta                 model.ckpt-100000.meta                 



  • Basic layout 
  1. model.ckpt-0 : created at the start of training (step 0)
  2. model.ckpt-100000 : matches the number of steps in the pipeline config; keeps increasing if training is resumed from here  
  3. graph.pbtxt : shows the network structure 

Understanding checkpoints (a quick inspection sketch follows below)
  https://eehoeskrap.tistory.com/343
  https://eehoeskrap.tistory.com/370
  https://eehoeskrap.tistory.com/344
  https://gusrb.tistory.com/21
  http://jaynewho.com/post/8
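To see what a checkpoint actually contains, the stored variables can be listed directly (a minimal sketch using the TF 1.x checkpoint utilities):

import tensorflow as tf

# newest checkpoint recorded in /checkpoints/checkpoint, e.g. /checkpoints/model.ckpt-100000
print(tf.train.latest_checkpoint('/checkpoints'))

# variable names and shapes stored in the checkpoint
for name, shape in tf.train.list_variables('/checkpoints/model.ckpt-100000'):
    print(name, shape)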


2.6  Converting the Checkpoint to PB Format 

For inference, convert to a PB file as follows.

  • How to use export_inference_graph.py 
  https://github.com/tensorflow/models/blob/master/research/object_detection/export_inference_graph.py


  • input_type
  1. image_tensor
  2. encoded_image_string_tensor
  3. tf_example

TFRecord and tf.Example
  https://www.tensorflow.org/tutorials/load_data/tfrecord


root@c7550d6b2c59:/workdir/models/research# python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path configs/ssd320_full_1gpus.config \
    --trained_checkpoint_prefix  /checkpoints/model.ckpt-100000 \
    --output_directory /checkpoints/inference_graph_100000

root@c7550d6b2c59:/workdir/models/research# ls /checkpoints/inference_graph_100000/
checkpoint  frozen_inference_graph.pb  model.ckpt.data-00000-of-00001  model.ckpt.index  model.ckpt.meta  pipeline.config  saved_model

// check the newly generated PB file 
root@c7550d6b2c59:/workdir/models/research# ls /checkpoints/inference_graph_100000/saved_model/
saved_model.pb  variables




Exporting a trained model for inference
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
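Once the frozen graph exists, a quick inference check can be done without the notebook. A hedged sketch (TF 1.x; the tensor names image_tensor, detection_boxes, detection_scores, detection_classes and num_detections are the ones export_inference_graph.py produces for input_type image_tensor):

import numpy as np
import PIL.Image
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('/checkpoints/inference_graph_100000/frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

image = np.array(PIL.Image.open('object_detection/test_images/image1.jpg'))  # HxWx3 uint8

with tf.Session(graph=graph) as sess:
    boxes, scores, classes, num = sess.run(
        ['detection_boxes:0', 'detection_scores:0', 'detection_classes:0', 'num_detections:0'],
        feed_dict={'image_tensor:0': image[None, ...]})   # add the batch dimension

print(int(num[0]), 'detections, top score:', scores[0][0])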


2.7  Testing with TensorBoard 


  • Check the checkpoint structure on the host 

$ cd ~/works/ssd/check  // inspect the checkpoint structure on the host 
$ tree
.
├── checkpoint  // records model.ckpt-100000 and the checkpoint paths 
├── eval        // TensorBoard log created during training; the part created during validation/evaluation is under /results/eval
│   └── events.out.tfevents.1572359812.c7550d6b2c59   // log file for TensorBoard 
├── events.out.tfevents.1572262719.c7550d6b2c59       // log file for TensorBoard 
├── graph.pbtxt                      // network structure info 
├── inference_graph_100000           // PB files generated from model.ckpt-100000 
│   ├── checkpoint
│   ├── frozen_inference_graph.pb
│   ├── model.ckpt.data-00000-of-00001
│   ├── model.ckpt.index
│   ├── model.ckpt.meta
│   ├── pipeline.config
│   └── saved_model
│       ├── saved_model.pb
│       └── variables
├── model.ckpt-0.data-00000-of-00002        // checkpoint at step 0
├── model.ckpt-0.data-00001-of-00002
├── model.ckpt-0.index
├── model.ckpt-0.meta
├── model.ckpt-100000.data-00000-of-00002   // checkpoint at step 100000
├── model.ckpt-100000.data-00001-of-00002
├── model.ckpt-100000.index
├── model.ckpt-100000.meta
├── resnet_v1_50                        // pre-trained model (used as the SSD feature extractor)
│   └── model.ckpt
├── resnet_v1_50_2016_08_28.tar.gz    // pre-trained model archive
├── resnet_v1_50_2016_08_28.tar.gz.1
└── resnet_v1_50_2016_08_28.tar.gz.2



  • Pre-trained Models
SSD uses the ResNet model as its feature_extractor (fine-tuning / transfer learning).
  resnet_v1_50_2016_08_28.tar.gz
  https://github.com/tensorflow/models/tree/master/research/slim

  • Run TensorBoard in the Docker container
The TensorBoard logs generated under the checkpoint directory above are available, so they can be analyzed.

root@c7550d6b2c59:/workdir/models/research# tensorboard --logdir=/checkpoints 


  • Connect to TensorBoard in a browser
  http://localhost:6006/


  • Tensorboard -> Scalars


Source related to the Images tab 

root@c7550d6b2c59:/workdir/models/research# vi ./object_detection/utils/visualization_utils.py 
........

def draw_side_by_side_evaluation_image(eval_dict,
                                       category_index,
                                       max_boxes_to_draw=20,
                                       min_score_thresh=0.2,
                                       use_normalized_coordinates=True):
  """Creates a side-by-side image with detections and groundtruth.

  Bounding boxes (and instance masks, if available) are visualized on both
  subimages.

  Args:
    eval_dict: The evaluation dictionary returned by
      eval_util.result_dict_for_batched_example() or
      eval_util.result_dict_for_single_example().
    category_index: A category index (dictionary) produced from a labelmap.
    max_boxes_to_draw: The maximum number of boxes to draw for detections.
    min_score_thresh: The minimum score threshold for showing detections.
    use_normalized_coordinates: Whether to assume boxes and kepoints are in
      normalized coordinates (as opposed to absolute coordiantes).
      Default is True.

  Returns:
    A list of [1, H, 2 * W, C] uint8 tensor. The subimage on the left
      corresponds to detections, while the subimage on the right corresponds to
      groundtruth.
  """
........


class EvalMetricOpsVisualization(object):
....
  def get_estimator_eval_metric_ops(self, eval_dict):  ## called from model_lib.py below 

    if self._max_examples_to_draw == 0:
      return {}
    images = self.images_from_evaluation_dict(eval_dict)

    def get_images():
      """Returns a list of images, padded to self._max_images_to_draw."""
      images = self._images
      while len(images) < self._max_examples_to_draw:
        images.append(np.array(0, dtype=np.uint8))
      self.clear()
      return images

    def image_summary_or_default_string(summary_name, image): ## the image summary is created here
      """Returns image summaries for non-padded elements."""
      return tf.cond(
          tf.equal(tf.size(tf.shape(image)), 4),
          lambda: tf.summary.image(summary_name, image),    ## TensorBoard image 
          lambda: tf.constant(''))

    update_op = tf.py_func(self.add_images, [[images[0]]], [])
    image_tensors = tf.py_func(
        get_images, [], [tf.uint8] * self._max_examples_to_draw)
    eval_metric_ops = {}
    for i, image in enumerate(image_tensors):
      summary_name = self._summary_name_prefix + '/' + str(i)
      value_op = image_summary_or_default_string(summary_name, image)   ## TensorBoard image created here 
      eval_metric_ops[summary_name] = (value_op, update_op)
    return eval_metric_ops

.....

class VisualizeSingleFrameDetections(EvalMetricOpsVisualization): ## VisualizeSingleFrameDetections extends EvalMetricOpsVisualization
  """Class responsible for single-frame object detection visualizations."""

  def __init__(self,
               category_index,
               max_examples_to_draw=5,
               max_boxes_to_draw=20,
               min_score_thresh=0.2,
               use_normalized_coordinates=True,
               summary_name_prefix='Detections_Left_Groundtruth_Right'):
    super(VisualizeSingleFrameDetections, self).__init__(
        category_index=category_index,
        max_examples_to_draw=max_examples_to_draw,
        max_boxes_to_draw=max_boxes_to_draw,
        min_score_thresh=min_score_thresh,
        use_normalized_coordinates=use_normalized_coordinates,
        summary_name_prefix=summary_name_prefix)

  def images_from_evaluation_dict(self, eval_dict):
    return draw_side_by_side_evaluation_image(
        eval_dict, self._category_index, self._max_boxes_to_draw,
        self._min_score_thresh, self._use_normalized_coordinates)

...........

root@c7550d6b2c59:/workdir/models/research# vi ./object_detection/model_lib.py
....
    if mode == tf.estimator.ModeKeys.EVAL:  ## EVAL Mode 
.........
      eval_dict = eval_util.result_dict_for_batched_example(          ## image info 
          eval_images,
          features[inputs.HASH_KEY],
          detections,
          groundtruth,
          class_agnostic=class_agnostic,
          scale_to_absolute=True,
          original_image_spatial_shapes=original_image_spatial_shapes,
          true_image_shapes=true_image_shapes)

      if class_agnostic:
        category_index = label_map_util.create_class_agnostic_category_index()
      else:
        category_index = label_map_util.create_category_index_from_labelmap(
            eval_input_config.label_map_path)
      vis_metric_ops = None
      if not use_tpu and use_original_images:
        eval_metric_op_vis = vis_utils.VisualizeSingleFrameDetections(   
            category_index,
            max_examples_to_draw=eval_config.num_visualizations,
            max_boxes_to_draw=eval_config.max_num_boxes_to_visualize,
            min_score_thresh=eval_config.min_score_threshold,
            use_normalized_coordinates=False)
        vis_metric_ops = eval_metric_op_vis.get_estimator_eval_metric_ops(     ## images are saved here; see above 
            eval_dict)
....



Related notes 
tf.estimator.ModeKeys.TRAIN
tf.estimator.ModeKeys.EVAL
tf.estimator.ModeKeys.PREDICT

The site below explains this very well.
  https://bcho.tistory.com/1196
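For reference, a minimal Estimator model_fn shows how these three ModeKeys branch (the same pattern model_lib.py above follows; the model itself is a dummy, not the SSD):

import tensorflow as tf

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features['x'], 2)   # dummy 2-class model
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions={'logits': logits})
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    if mode == tf.estimator.ModeKeys.EVAL:
        # eval_metric_ops (e.g. the image summaries above) would be passed here
        return tf.estimator.EstimatorSpec(mode, loss=loss)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)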

  • Tensorboard -> Images


  https://www.tensorflow.org/tensorboard/image_summaries

  • Tensorboard -> Graphs
Easy to follow at a glance; the visualization there is very well done.
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md


2.8  Preparing for Object Detection  

In the terminal of the Docker container started above, run Jupyter and test from there.

  • Prepare test images 

root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/test2017/000000000001.jpg object_detection/test_images/image1.jpg
root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/test2017/000000517810.jpg object_detection/test_images/image2.jpg

or

root@5208474af96a:/workdir/models/research# cat object_detection/test_images/image_info.txt  // download image1.jpg and image2.jpg from the sites below, then copy them  

Image provenance:
image1.jpg: https://commons.wikimedia.org/wiki/File:Baegle_dwa.jpg
image2.jpg: Michael Miley,
  https://www.flickr.com/photos/mike_miley/4678754542/in/photolist-88rQHL-88oBVp-88oC2B-88rS6J-88rSqm-88oBLv-88oBC4

root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/image1.jpg object_detection/test_images/image1.jpg   // downloaded from the sites in image_info.txt above 
root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/image2.jpg object_detection/test_images/image2.jpg


  • Run object_detection/object_detection_tutorial.ipynb with Jupyter 

root@c7550d6b2c59:/workdir/models/research# jupyter notebook   // raises an error 
root@c7550d6b2c59:/workdir/models/research# jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root

TensorFlow Jupyter Notebook
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_notebook.md

  • After starting the Jupyter notebook, check in a browser 
  http://localhost:8888/

  • Errors that occur when starting the Jupyter notebook
Resolved by passing the options shown above.
  https://github.com/kaczmarj/neurodocker/issues/82
  http://melonicedlatte.com/web/2018/05/22/134429.html


2.9  Checking Basic Object Detection 


  • Start a separate Docker terminal
Locate each file and identify what is needed.

$ docker exec -it nvidia_ssd /bin/bash  // Jupyter is already running in the container above, so use a separate terminal 

root@5208474af96a:/workdir/models/research# python object_detection/model_main.py --help

root@5208474af96a:/workdir/models/research# ls object_detection/object_detection_tutorial.ipynb  // tested with Jupyter 
object_detection/object_detection_tutorial.ipynb

root@5208474af96a:/workdir/models/research# ls object_detection/ssd_mobilenet_v1_coco_2017_11_17  // model used by the Jupyter notebook above
frozen_inference_graph.pb

root@5208474af96a:/workdir/models/research# ls object_detection/data                              // pbtxt files used by the Jupyter notebook above
ava_label_map_v2.1.pbtxt           mscoco_complete_label_map.pbtxt     oid_object_detection_challenge_500_label_map.pbtxt
face_label_map.pbtxt               mscoco_label_map.pbtxt              pascal_label_map.pbtxt
fgvc_2854_classes_label_map.pbtxt  mscoco_minival_ids.txt              pet_label_map.pbtxt
kitti_label_map.pbtxt              oid_bbox_trainable_label_map.pbtxt



  • object_detection_tutorial.ipynb
Looking at the source, no separate training is needed; it downloads the model, and you only need to set the test image directory.
Model used: ssd_mobilenet_v1_coco_2017_11_17.tar.gz
  https://medium.com/@yuu.ishikawa/how-to-show-signatures-of-tensorflow-saved-model-5ac56cf1960f
  http://solarisailab.com/archives/2387

In short, it tests the images under test_images using the model above (the PB file) and the pbtxt.

  • Issues with object_detection_tutorial.ipynb    
Running the test ends with a cuDNN error caused by GPU memory, so add the code below (TensorFlow 1.14.0 in this Docker image).

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

Error in the Jupyter console

E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 

GPU memory issues
  https://github.com/tensorflow/tensorflow/issues/24828
  https://lsjsj92.tistory.com/363
  https://devtalk.nvidia.com/default/topic/1051380/cudnn/could-not-create-cudnn-handle-cudnn_status_internal_error/

TensorFlow 2.0 GPU memory shortage
  https://inpages.tistory.com/155
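For reference, the TF 2.x counterpart of allow_growth would look like this (an assumption for TF 2.x only; the container here runs TF 1.x, where the ConfigProto snippet above is the right fix):

import tensorflow as tf

# enable memory growth on every visible GPU instead of grabbing all memory up front
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)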

failed to allocate 2.62G (2811428864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  https://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow


  • Check NVIDIA GPU memory usage
$ watch -n 0.1 nvidia-smi 


3. Current Status  

On my laptop, even after adding the code above, I cannot see the example inference results, but it works fine on a more powerful server.
It is unfortunate, and I really feel the limits of my laptop (especially GPU RAM).


Reference sites for the related parts; too many were consulted, so only the links are listed.


Object Detection installation and test
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md


Training material collection 
  https://www.slideshare.net/fermat39/polyp-detection-withtensorflowobjectdetectionapi
  https://www.kdnuggets.com/2019/03/object-detection-luminoth.html


TensorFlow training and usage  
  https://yongyong-e.tistory.com/24
  http://solarisailab.com/archives/2422
  https://hwauni.tistory.com/entry/API-Object-Detection-API%EB%A5%BC-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%98%A4%EB%B8%8C%EC%A0%9D%ED%8A%B8-%EC%9D%B8%EC%8B%9D%ED%95%98%EA%B8%B0-Part-1-%EC%84%A4%EC%A0%95-%ED%8E%8C
  https://cloud.google.com/solutions/creating-object-detection-application-tensorflow?hl=ko

TensorFlow Object Detection (to be split into a separate post later)
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API/
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API-Installation/
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API-Training/

TensorFlow Object Detection related items
  https://yongyong-e.tistory.com/31?category=836820
  https://yongyong-e.tistory.com/32?category=836820
  https://yongyong-e.tistory.com/35?category=836820    **
  https://towardsdatascience.com/creating-your-own-object-detector-ad69dda69c85  **
  https://gilberttanner.com/blog/live-object-detection

Tensorflow Object Detection API Training
  https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
  https://towardsdatascience.com/custom-object-detection-using-tensorflow-from-scratch-e61da2e10087
  https://becominghuman.ai/tensorflow-object-detection-api-tutorial-training-and-evaluating-custom-object-detector-ed2594afcf73
  https://medium.com/pylessons/tensorflow-step-by-step-custom-object-detection-tutorial-d7ae840a74e2


Tensorflow Object Detection API
  https://github.com/tensorflow/models/tree/master/research/object_detection
  https://github.com/tensorflow/models/tree/master/research/object_detection/g3doc
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md


  • Confusing points from the shell script analysis 
As always, the shell scripts in open source projects are well crafted but change often, which makes them confusing.

${1:-none}   // expands to $1, or to "none" when $1 is unset or empty
  https://stackoverflow.com/questions/38260927/what-does-this-line-build-target-1-none-means-in-shell-scripting

${@:2}   // expands to all positional parameters starting from the second
  https://unix.stackexchange.com/questions/92978/what-does-this-2-mean-in-shell-scripting