Jeonghun (James) Lee: AI-Transfer Learning


11/14/2019

Running and Analyzing Custom Object Detection with SSD / Faster RCNN (Third Analysis)

1. Preparing Tensorflow and Custom Object Detection

To prepare for Object Detection, install the following.

Install Tensorflow
Install the required Python packages and other required packages
Download the Model

NVIDIA Docker and SSD Training (second analysis)
https://ahyuo79.blogspot.com/2019/10/docker-tensorflow.html

NVIDIA Docker and basic Tensorflow usage
https://ahyuo79.blogspot.com/2019/10/nvidia-docker.html

IOU (Intersection over Union)

https://ahyuo79.blogspot.com/2019/09/iou-intersection-over-union.html
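
Since iou_threshold shows up repeatedly in the pipeline configs later in this post, here is a minimal sketch of how IOU is computed for two boxes. The function name iou and the (ymin, xmin, ymax, xmax) tuple layout are my own assumptions for illustration, not code taken from the Object Detection API.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (ymin, xmin, ymax, xmax)."""
    # Corners of the overlapping rectangle (if any).
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])

    # Clamp at zero so that disjoint boxes give an empty intersection.
    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes that overlap in a 1x1 square: 1 / (4 + 4 - 1)
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

With an iou_threshold of 0.6, two detections of the same class whose IOU exceeds 0.6 would be treated as duplicates by non-max suppression, and only the higher-scoring one is kept.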

Tensorflow Model
https://github.com/tensorflow/models

Checking the Tensorflow Model branch
Download the Model source from the branch that matches your Tensorflow version.
https://github.com/tensorflow/models/branches

1.1 Installing and configuring Tensorflow directly

To use Tensorflow Object Detection, install Tensorflow first, then download the Object Detection Model and install the related packages.

Custom Object Detection installation guide
  https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html

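1.2 Using the general Tensorflow Docker image

Starting from a Tensorflow Docker image, download the matching Model version listed below, build everything into a single image, and work from that.
The things to watch here are the Tag information and the Tensorflow version.

Meaning of the Docker Tags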
  https://www.tensorflow.org/install/docker?hl=ko

Tensorflow Docker
Keep the Tensorflow version and the model version above in sync.
  https://hub.docker.com/r/tensorflow/tensorflow
  https://hub.docker.com/r/tensorflow/tensorflow/tags

1.3 Using the NVIDIA Tensorflow Docker image

The NVIDIA Tensorflow Docker image installed previously can also be used for Object Detection.

NVIDIA Tensorflow Docker
  https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow/tags

See the earlier analysis of the NVIDIA SSD Docker image
  https://ahyuo79.blogspot.com/2019/10/docker-tensorflow.html

2. Building the Custom Data SET

Most people start by building a basic Custom DATA SET from dog and cat photos, so I also started with that easy approach.

Collecting dog and cat photos (DATASET)

$ cd ~/works/custom
$ git clone https://github.com/hardikvasa/google-images-download.git
$ cd google-images-download
$ python google_images_download/google_images_download.py --keywords "dogs" --size medium --output_directory ~/works/custom/data/
$ python google_images_download/google_images_download.py --keywords "cats" --size medium --output_directory ~/works/custom/data/

The number of images that google image download can fetch is currently limited: at most 100 images per run.
Raising the limit option above 100 still does not return more than 100 images at once.

google_image_download
https://google-images-download.readthedocs.io/en/latest/installation.html

google_image_download argument
https://google-images-download.readthedocs.io/en/latest/arguments.html

Organizing the images

$ cd ~/works/custom/data/
$ mkdir images        # collect all images in one place
$ mkdir annotation    # where the LabelImg XML files are saved
$ mv ./dogs/*.jpg images/
$ mv ./cats/*.jpg images/

Annotation (using LabelImg, saved as PascalVOC)

$ cd ~/works/custom/labelImg    # labelImg was already installed earlier
$ cat data/predefined_classes.txt   # check the default classes (dog and cat are present); add any name that is missing
dog
person
cat
tv
car
meatballs
marinara sauce
tomato soup
chicken noodle soup
french onion soup
chicken breast
ribs
pulled pork
hamburger

$ python3 labelImg.py   ~/works/custom/data/images    # by default the XML is saved inside images as well

Cautions
After starting labelImg, always set the XML save location with Change Save Dir to ~/works/custom/data/annotation.
The order of the classes defined above does not matter; only the names must match the names in label_map.pbtxt.

https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#annotating-images

Defining the Label_map

$ cd ~/works/custom/data
$ vi label_map.pbtxt
item {
    id: 1
    name: 'cat'
}

item {
    id: 2
    name: 'dog'
}

https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#creating-label-map
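
Because the class names used in LabelImg must match the names in label_map.pbtxt, a quick consistency check can catch mistakes before the TF Record is built. The sketch below is only an illustration: parse_label_map and missing_classes are hypothetical helper names, and the regex parser handles just the simple item { id / name } layout shown above rather than the full protobuf text format.

```python
import re

LABEL_MAP_TEXT = """
item {
    id: 1
    name: 'cat'
}

item {
    id: 2
    name: 'dog'
}
"""


def parse_label_map(text):
    """Return {name: id} for every item block in label_map.pbtxt-style text."""
    items = {}
    for block in re.findall(r"item\s*\{(.*?)\}", text, flags=re.S):
        id_match = re.search(r"\bid\s*:\s*(\d+)", block)
        name_match = re.search(r"\bname\s*:\s*['\"]([^'\"]+)['\"]", block)
        if id_match and name_match:
            items[name_match.group(1)] = int(id_match.group(1))
    return items


def missing_classes(label_map, annotated_names):
    """Class names used in annotations that the label map does not define."""
    return sorted(set(annotated_names) - set(label_map))


if __name__ == "__main__":
    label_map = parse_label_map(LABEL_MAP_TEXT)
    print(label_map)
    # 'person' is in predefined_classes.txt but not in this label map.
    print(missing_classes(label_map, ["dog", "cat", "person"]))
```

Any name reported by missing_classes would need either a new item in label_map.pbtxt or a corrected annotation.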

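1.1 Creating the TF Record file

Other blogs and the Tensorflow example sites convert XML->CSV and then CSV->TFRecord.
After a quick look at several of those sources, the TF Record step is nearly identical in all of them, and I could not see why two conversions are needed, so I tried converting directly as shown below.

How to create a TF Record

I did not follow this method
https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#creating-tensorflow-records

Creating the TF_RECORD

The tf_record script can only be run where Tensorflow is installed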

 root@3aac229c45c3:/workdir/models/research# pip install lxml
## as before, be careful with the --data_dir path
root@3aac229c45c3:/workdir/models/research# python create_pascal_tf_record.py \
 --data_dir=/data \
 --annotations_dir=/data/annotation \
 --label_map_path=/data/label_map.pbtxt \
 --output_path=/data/pascal.record

How to create a TF Record from LabelImg annotations
https://ahyuo79.blogspot.com/2019/11/coco-set-annotation-tools.html
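
To make the direct XML -> TFRecord idea more concrete, here is a rough sketch of the first half of that conversion: reading one PascalVOC XML file (as saved by LabelImg) and turning its boxes into normalized coordinates. This is not the code of create_pascal_tf_record.py; the function name voc_to_normalized_boxes and the sample XML are assumptions made for illustration, and writing the actual tf.train.Example is left out.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<annotation>
  <filename>dog_001.jpg</filename>
  <size><width>200</width><height>100</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>50</xmin><ymin>25</ymin><xmax>150</xmax><ymax>75</ymax></bndbox>
  </object>
</annotation>
"""


def voc_to_normalized_boxes(xml_text):
    """Return [(name, ymin, xmin, ymax, xmax)] with coordinates scaled to 0..1."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = float(size.find("width").text)
    height = float(size.find("height").text)

    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bndbox = obj.find("bndbox")
        # Pixel coordinates are divided by the image size to normalize them.
        xmin = float(bndbox.find("xmin").text) / width
        ymin = float(bndbox.find("ymin").text) / height
        xmax = float(bndbox.find("xmax").text) / width
        ymax = float(bndbox.find("ymax").text) / height
        boxes.append((name, ymin, xmin, ymax, xmax))
    return boxes


if __name__ == "__main__":
    print(voc_to_normalized_boxes(SAMPLE_XML))
```

Going straight from these values to a TF Record is what makes the intermediate CSV step unnecessary.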

2. Custom Training/Evaluation

Test two Custom Models and compare them

2.1 Pre-trained Model Download

SSD (Single Shot MultiBox Detector) uses a separate Network as its Feature extractor, so download that part first to put the basic setup in place.

Basic layout of check

$ cd ~/works/custom/check
$ mkdir -p models/configs
$ mkdir -p models/resnet_v1_50_2016_08_28
$ mkdir -p train_resnet                  # Checkpoint directory for SSD-Resnet50 (filled in by Training)
$ mkdir -p train_inception               # Checkpoint directory for SSD-Inceptionv2 (filled in by Training)
$ mkdir -p fasterrcnn_train_resnet       # Checkpoint directory for Faster RCNN-Resnet50 (filled in by Training)

Resnet 50 Download

$ cd ~/works/custom/check/models
$ wget http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz
$ tar -xzf resnet_v1_50_2016_08_28.tar.gz
$ mv resnet_v1_50.ckpt resnet_v1_50_2016_08_28/model.ckpt

InceptionV2 Download

$ cd ~/works/custom/check/models
$ wget http://download.tensorflow.org/models/object_detection/ssd_inception_v2_coco_11_06_2017.tar.gz
$ tar -xzf ssd_inception_v2_coco_11_06_2017.tar.gz

Pre-Trained model information
https://github.com/tensorflow/models/tree/master/research/slim

Model layout under check

$ cd ~/works/custom/check/models
$ tree 
.
├── configs                          // Pipeline Config storage (Resnet, Inceptionv2)
├── resnet_v1_50_2016_08_28          // Resnet 50 (Pre-trained Model)
│   └── model.ckpt                   // checkpoint   
├── resnet_v1_50_2016_08_28.tar.gz
├── ssd_inception_v2_coco_11_06_2017    // Inception V2 (Pre-trained Model)
│   ├── frozen_inference_graph.pb          // Inception Pb file 
│   ├── graph.pbtxt                        // Inception Graph definition
│   ├── model.ckpt.data-00000-of-00001     // checkpoint
│   ├── model.ckpt.index
│   └── model.ckpt.meta
└── ssd_inception_v2_coco_11_06_2017.tar.gz

2.2 SSD / Faster RCNN Pipeline configuration

SSD can use Resnet 50 or Inception V2 as its feature extractor, and other Networks can be configured as well.
The Pipeline fields apparently have to be declared in the *.proto files to work.
I should look at this more closely when I have time.

Starting the Docker Container

$ docker run --gpus all --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/custom/data:/data \
-v /home/jhlee/works/custom/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd

Changing the SSD-Resnet 50 Pipeline configuration

root@f46c490016e0:/workdir/models/research# cp configs/ssd320_full_1gpus.config  /checkpoints/models/configs
root@f46c490016e0:/workdir/models/research# vi /checkpoints/models/configs/ssd320_full_1gpus.config 

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: true
    num_classes: 2    # number of labels (Cat/Dog)
    box_coder {
      faster_rcnn_box_coder {   
        y_scale: 10.0              
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5    ## IOU threshold for matching anchors to ground-truth boxes during Training (not the output_dict['detection_scores'] cutoff)
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
...
    image_resizer {            # 
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
....

    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'  # use resnet 50 as the SSD feature extractor
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {                    ## check the post-processing settings
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100    ## at most 100 per Class         output_dict['detection_classes']
        max_total_detections: 100        ## at most 100 detections total  output_dict['num_detections']
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint: "/checkpoints/models/resnet_v1_50_2016_08_28/model.ckpt"
  fine_tune_checkpoint_type: "classification"
  batch_size: 2            # changed 32->2 because of OUT OF MEMORY; keep the original value if the GPU has enough Memory
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 100         # steps 100000 -> 100  (reduced for a quick test; restore the original value for real Training)
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
....


train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/pascal.record"  # train TF Record 
  } 
  label_map_path: "/data/label_map.pbtxt" # label_map.pbtxt
}

eval_config: {
  #metrics_set: "coco_detection_metrics"
  #use_moving_averages: false
  num_examples: 8000   # left unchanged because eval is not run here
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/pascal.record"  # no separate tfrecord for eval yet (set to the same file as Training)
  }
  label_map_path: "/data/label_map.pbtxt" # path changed only; revisit later
  shuffle: false
  num_readers: 1
}

Changing the Faster RCNN-Resnet 50 Pipeline configuration

root@f46c490016e0:/workdir/models/research# cp ./object_detection/samples/configs/faster_rcnn_resnet50_coco.config  /checkpoints/models/configs
root@f46c490016e0:/workdir/models/research# vi /checkpoints/models/configs/faster_rcnn_resnet50_coco.config
model {
  faster_rcnn {
    num_classes: 2     # number of labels 90->2 (Cat/Dog)
    image_resizer {                 #  
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet50'     ## confirms that Resnet 50 is used
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }

....
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6       ## the IOU threshold can also be tuned
        max_detections_per_class: 100      ## same as before: Post Processing keeps at most 100 per Class
        max_total_detections: 300          ## unlike before, the MAX is set to 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}


train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/checkpoints/models/resnet_v1_50_2016_08_28/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 300          ### total Steps 200000->300 (changed for a temporary test)
  data_augmentation_options { 
    random_horizontal_flip {
    }
  }
}

....  
###  same as the SSD config above

train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/pascal.record"  # train TF Record 
  } 
  label_map_path: "/data/label_map.pbtxt" # label_map.pbtxt
}

eval_config: {
  num_examples: 8000                                  ## evaluation
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/pascal.record"  # path changed only; change again later if eval is used
  }
  label_map_path: "/data/label_map.pbtxt" # path changed only; revisit later
  shuffle: false
  num_readers: 1
}

Faster RCNN has to be run with Precision changed to FP32, and there is currently a problem in the optimizer part.
Training does run for now, but that part needs a closer look.

Changing the SSD Inception v2 Pipeline configuration

root@f46c490016e0:/workdir/models/research# cp ./object_detection/samples/configs/ssd_inception_v2_coco.config  /checkpoints/models/configs 
root@f46c490016e0:/workdir/models/research# vi /checkpoints/models/configs/ssd_inception_v2_coco.config 

model {
  ssd {
    num_classes: 2   ## number of Labels, see label_map.pbtxt
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5   ## IOU threshold for matching anchors to ground-truth boxes during Training (not the output_dict['detection_scores'] cutoff)
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }

..........

    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }

..........
    feature_extractor {
      type: 'ssd_inception_v2'    # use Inception_v2 as the SSD feature_extractor
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6                 ## IOU Threshold 
        max_detections_per_class: 100  ## at most 100 per Class         output_dict['detection_classes']
        max_total_detections: 100      ## at most 100 detections total  output_dict['num_detections']
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 6     ## 24 -> 6  changed because of the limits of my GPU
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/checkpoints/models/ssd_inception_v2_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 1000      ## 200000 -> 1000   reduced for a short test on the laptop
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/pascal.record"          ## Train Record
  }
  label_map_path: "/data/label_map.pbtxt"      ## Train Labelmap
}

eval_config: {
  num_examples: 8000                                  ## evaluation
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/pascal.record"                 ## test record for evaluation
  }
  label_map_path: "/data/label_map.pbtxt"             ## label map for evaluation
  shuffle: false
  num_readers: 1
}

For the detailed analysis, see the earlier SSD analysis

num_examples in eval_config
https://github.com/tensorflow/models/issues/5059
https://stackoverflow.com/questions/47086630/what-does-num-examples-2000-mean-in-tensorflow-object-detection-config-file

2.3 SSD / Faster RCNN Training

NVIDIA provides Shell Scripts that make Training easy to run. SSD uses FP16 Precision,
but Faster RCNN fails with an error under FP16, so be careful.
A quick look inside the Shell Script shows that it runs ./object_detection/model_main.py, so running that directly works just as well.

Editing the Training Shell Script and basic analysis

root@1bfb89078878:/workdir/models/research# vi ./examples/SSD320_FP16_1GPU.sh     # check and edit the Pipeline Config part
CKPT_DIR=${1:-"/results/SSD320_FP16_1GPU"}
### Add the Pipelines and pick Resnet50 or InceptionV2
## SSD-Resnet 50 Pipeline  (FP16 supported, default)
PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_1gpus.config"

## SSD-Inception v2 Pipeline  (FP16 supported, added)
#PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd_inception_v2_coco.config"

## Faster RCNN-Resnet 50 Pipeline (FP32 only)
#PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/faster_rcnn_resnet50_coco.config"

# Enable FP16 PRECISION MODE (comment this out for FP32)
export TF_ENABLE_AUTO_MIXED_PRECISION=1


TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

time python -u ./object_detection/model_main.py \
       --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
       --model_dir=${CKPT_DIR} \
       --alsologtostderr \
       "${@:3}"

Pick the Pipeline you want to use above and run it as follows

SSD-Resnet 50 Training

root@f46c490016e0:/workdir/models/research#  bash ./examples/SSD320_FP16_1GPU.sh /checkpoints/train_resnet /checkpoints/models/configs 
// 1st checkpoints path , output
// 2nd pipeline path
..........

The checkpoints produced by Training are saved here: /checkpoints/train_resnet

Faster RCNN-Resnet 50 Training

root@f46c490016e0:/workdir/models/research#  bash ./examples/SSD320_FP16_1GPU.sh /checkpoints/fasterrcnn_train_resnet /checkpoints/models/configs 
// 1st checkpoints path , output
// 2nd pipeline path
..........

The checkpoints produced by Training are saved here: /checkpoints/fasterrcnn_train_resnet

SSD-Inception V2 Training

root@1bfb89078878:/workdir/models/research# bash ./examples/SSD320_FP16_1GPU.sh /checkpoints/train_inception /checkpoints/models/configs 
// 1st checkpoints path , output
// 2nd pipeline path

The checkpoints produced by Training are saved here: /checkpoints/train_inception

Checking the CheckPoint files created after Training (e.g. SSD-Resnet50)

root@f46c490016e0:/workdir/models/research# ls /checkpoints/train_resnet/
checkpoint                                   graph.pbtxt                       model.ckpt-0.index                  model.ckpt-300.data-00001-of-00002
eval                                         model.ckpt-0.data-00000-of-00002  model.ckpt-0.meta                   model.ckpt-300.index
events.out.tfevents.1574317170.7bdf29dc41cb  model.ckpt-0.data-00001-of-00002  model.ckpt-300.data-00000-of-00002  model.ckpt-300.meta

About TF_ENABLE_AUTO_MIXED_PRECISION
https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540
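
Since the only differences between the three Training runs in this section are the pipeline config file and whether TF_ENABLE_AUTO_MIXED_PRECISION is exported, the choice can be sketched as a small Python helper that builds the model_main.py command and its environment. The names PIPELINES and build_training_job are hypothetical, and the FP16/FP32 flags simply mirror my observations in this section (SSD ran with FP16, Faster RCNN only with FP32); this is not part of the NVIDIA scripts.

```python
import os

# model key -> (pipeline config file name, use automatic mixed precision)
PIPELINES = {
    "ssd_resnet50": ("ssd320_full_1gpus.config", True),
    "ssd_inception_v2": ("ssd_inception_v2_coco.config", True),
    "faster_rcnn_resnet50": ("faster_rcnn_resnet50_coco.config", False),
}


def build_training_job(model, ckpt_dir,
                       config_dir="/checkpoints/models/configs",
                       base_env=None):
    """Return (command list, environment dict) for one Training run."""
    config_name, use_fp16 = PIPELINES[model]

    env = dict(os.environ if base_env is None else base_env)
    if use_fp16:
        env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
    else:
        # FP32: make sure the variable is not inherited from the shell.
        env.pop("TF_ENABLE_AUTO_MIXED_PRECISION", None)

    cmd = [
        "python", "-u", "./object_detection/model_main.py",
        "--pipeline_config_path=" + config_dir + "/" + config_name,
        "--model_dir=" + ckpt_dir,
        "--alsologtostderr",
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_training_job("faster_rcnn_resnet50",
                                  "/checkpoints/fasterrcnn_train_resnet")
    print(" ".join(cmd))
    print("mixed precision:", env.get("TF_ENABLE_AUTO_MIXED_PRECISION", "off"))
```

Inside the container, the returned pair could be passed to subprocess.run(cmd, env=env, check=True) from /workdir/models/research instead of editing the Shell Script for each model.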

2.4 SSD validation/evaluation

Evaluation is said to use part of the Training data and to validate the model during Training, but I still need to understand the exact settings and the related parts.

Editing the Shell script

root@f46c490016e0:/workdir/models/research# vi examples/SSD320_evaluate.sh  # set the pipeline as shown below
CHECKPINT_DIR=$1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

## choose Resnet or Inception
python object_detection/model_main.py --checkpoint_dir $CHECKPINT_DIR --model_dir /results --run_once --pipeline_config_path /checkpoints/models/configs/ssd320_full_1gpus.config

# python object_detection/model_main.py --checkpoint_dir $CHECKPINT_DIR --model_dir /results --run_once --pipeline_config_path /checkpoints/models/configs/ssd_inception_v2_coco.config

Running validation

root@f46c490016e0:/workdir/models/research# bash examples/SSD320_evaluate.sh /checkpoints/train_resnet 
or 
root@f46c490016e0:/workdir/models/research# bash examples/SSD320_evaluate.sh /checkpoints/train_inception

To view the results above in Tensorboard, point it at the location below

root@f46c490016e0:/workdir/models/research# ls /results/eval/    # the Tensorboard Log is created under /results/eval
events.out.tfevents.1574322912.74244b7e90c7

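2.5 Basic analysis of Training and Validation

Training and Validation both use the same command shown below, and my current understanding is that running Training alone also runs Validation.
I believe Validation runs because the related options are already set in the pipeline config;

the evidence is that the Validation Log appears in Tensorboard even when only Training is run.

The dedicated Validation command above seems to do nothing more than add --run_once to run eval-only a single time.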

root@f46c490016e0:/workdir/models/research# python object_detection/model_main.py -h

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Binary to run train and evaluation on object detection model.
flags:

object_detection/model_main.py:
  --[no]allow_xla: Enable XLA compilation
    (default: 'false')
  --checkpoint_dir: Path to directory holding a checkpoint.  If `checkpoint_dir` is provided, this binary operates in eval-only mode, writing
    resulting metrics to `model_dir`.
  --eval_count: How many times the evaluation should be run
    (default: '1')
    (an integer)
  --[no]eval_training_data: If training data should be evaluated for this job. Note that one call only use this in eval-only mode, and
    `checkpoint_dir` must be supplied.
    (default: 'false')
  --hparams_overrides: Hyperparameter overrides, represented as a string containing comma-separated hparam_name=value pairs.
  --model_dir: Path to output model directory where event and checkpoint files will be written.
  --num_train_steps: Number of train steps.
    (an integer)
  --pipeline_config_path: Path to pipeline config file.
  --[no]run_once: If running in eval-only mode, whether to run just one round of eval vs running continuously (default).
    (default: 'false')
  --sample_1_of_n_eval_examples: Will sample one of every n eval input examples, where n is provided.
    (default: '1')
    (an integer)
  --sample_1_of_n_eval_on_train_examples: Will sample one of every n train input examples for evaluation, where n is provided. This is only used if
    `eval_training_data` is True.
    (default: '5')
    (an integer)

root@f46c490016e0:/workdir/models/research# vi object_detection/model_main.py
..........
  if FLAGS.checkpoint_dir:
    if FLAGS.eval_training_data:    ## FALSE by default
      name = 'training_data'
      input_fn = eval_on_train_input_fn   
    else:
      name = 'validation_data'     ## name is set to this
      # The first eval input will be evaluated.
      input_fn = eval_input_fns[0]
    if FLAGS.run_once:             ## with --run_once: evaluate the latest checkpoint once and exit
      estimator.evaluate(input_fn,
                         steps=None,
                         checkpoint_path=tf.train.latest_checkpoint(
                             FLAGS.checkpoint_dir))
    else:                          ##  without --run_once: keep evaluating new checkpoints (still eval-only, no Training here)
      model_lib.continuous_eval(estimator, FLAGS.checkpoint_dir, input_fn,
                                train_steps, name)  
.........
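
The branch logic in the excerpt above can be summarized in a few lines of plain Python. This is only a sketch of my reading: select_mode and the returned mode labels are hypothetical names, and the branch without checkpoint_dir (Training together with evaluation) is not shown in the excerpt. I infer it from the help text above, which says the binary operates in eval-only mode when checkpoint_dir is provided, so treat it as an assumption.

```python
def select_mode(checkpoint_dir=None, run_once=False, eval_training_data=False):
    """Return (mode, eval input name) following the flags of model_main.py."""
    if not checkpoint_dir:
        # Assumption: without --checkpoint_dir the binary trains and evaluates.
        return ("train_and_evaluate", None)

    # With --checkpoint_dir the binary is eval-only.
    name = "training_data" if eval_training_data else "validation_data"
    if run_once:
        return ("evaluate_once", name)
    return ("continuous_eval", name)


if __name__ == "__main__":
    # Training command from section 2.3 (no --checkpoint_dir)
    print(select_mode())
    # Validation command from section 2.4 (--checkpoint_dir ... --run_once)
    print(select_mode("/checkpoints/train_resnet", run_once=True))
```

Under this reading, both branches of the if FLAGS.run_once block are evaluation paths, which would explain why the Training command alone already produces a Validation Log.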

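A simpler training command also exists, and using that is fine too.

3. Inference (ckpt -> pb)

When Training finishes, convert the checkpoint to a pb file for the final Inference as shown below.
The checkpoint file name depends on the number of steps in the pipeline, so adjust the commands below to match your own settings.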

SSD-Resnet 50 inference

root@f46c490016e0:/workdir/models/research# python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path /checkpoints/models/configs/ssd320_full_1gpus.config \
    --trained_checkpoint_prefix  /checkpoints/train_resnet/model.ckpt-100 \
    --output_directory /checkpoints/train_resnet/inference_graph_100

SSD-Inception V2 inference

root@f46c490016e0:/workdir/models/research# python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path /checkpoints/models/configs/ssd_inception_v2_coco.config \
    --trained_checkpoint_prefix  /checkpoints/train_inception/model.ckpt-100 \
    --output_directory /checkpoints/train_inception/inference_graph_100

Faster RCNN-Resnet 50 inference

root@f46c490016e0:/workdir/models/research# python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path /checkpoints/models/configs/faster_rcnn_resnet50_coco.config \
    --trained_checkpoint_prefix  /checkpoints/fasterrcnn_train_resnet/model.ckpt-100 \
    --output_directory /checkpoints/fasterrcnn_train_resnet/inference_graph_100
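
Because the number after model.ckpt- changes with num_steps, the prefix for --trained_checkpoint_prefix can be derived from the directory listing instead of being typed by hand. The helper below is a minimal sketch in plain Python; latest_checkpoint_prefix is a hypothetical name, and the file names in the example are the ones from the train_resnet listing in section 2.3. Where Tensorflow is installed, tf.train.latest_checkpoint (used in model_main.py above) does the same job by reading the checkpoint file.

```python
import re


def latest_checkpoint_prefix(filenames, directory):
    """Return '<directory>/model.ckpt-<max step>' or None if nothing matches."""
    steps = []
    for filename in filenames:
        # Each saved step has exactly one .index file, so count those.
        match = re.match(r"model\.ckpt-(\d+)\.index$", filename)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return "{}/model.ckpt-{}".format(directory, max(steps))


if __name__ == "__main__":
    listing = [
        "checkpoint", "graph.pbtxt", "eval",
        "model.ckpt-0.index", "model.ckpt-0.meta",
        "model.ckpt-0.data-00000-of-00002", "model.ckpt-0.data-00001-of-00002",
        "model.ckpt-300.index", "model.ckpt-300.meta",
        "model.ckpt-300.data-00000-of-00002", "model.ckpt-300.data-00001-of-00002",
    ]
    print(latest_checkpoint_prefix(listing, "/checkpoints/train_resnet"))
```

In practice the listing would come from os.listdir on the Training output directory.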

4. Checking with Tensorboard

After Training or Validation finishes, analyze the Tensorflow Log
Only the Training-related part is analyzed here

Tensorboard

root@f46c490016e0:/workdir/models/research# tensorboard --logdir=/checkpoints/train_resnet  # SSD-Resnet50
or 
root@f46c490016e0:/workdir/models/research# tensorboard --logdir=/checkpoints/train_inception     # SSD-Inceptionv2
or
root@f46c490016e0:/workdir/models/research# tensorboard --logdir=/checkpoints/fasterrcnn_train_resnet     # Faster RCNN-Resnet50

Connecting to Tensorboard from a Browser

http://localhost:6006/

5. Object Detection TEST

Using jupyter, prepare Test Images, edit the relevant source to load the pb file created above, and run the final test

root@f46c490016e0:/workdir/models/research# jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root

Connecting to Jupyter

http://localhost:8888/

Run object_detection/object_detection_tutorial.ipynb to verify

5.1 Changes to object_detection_tutorial.ipynb

With the exported pb file, a light test is possible by editing the source in object_detection/object_detection_tutorial.ipynb.

Skip the Download step and edit the Variables

MODEL_NAME = '/checkpoints/train_resnet/inference_graph_100'
PATH_TO_LABELS = os.path.join('/data', 'label_map.pbtxt')
PATH_TO_TEST_IMAGES_DIR = '/data/test_images'

The original source downloads a Pre-trained model and uses it directly; change it to use the model we exported,
then supply TEST Images and run the test

Add this if a GPU Memory problem occurs

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

Allocator (GPU_0_bfc) ran out of memory trying to allocate
https://eehoeskrap.tistory.com/360

5.2 Understanding the base source of object_detection_tutorial.ipynb

The key point of this source is run_inference_for_single_image; its output values are applied to the test images for testing.

The keys in the "for key in" list below are tied to the pipeline config defined above, so keep this in mind. (SSD-based)

def run_inference_for_single_image(image, graph):
  with graph.as_default():
    with tf.Session() as sess:
      # Get handles to input and output tensors
      ops = tf.get_default_graph().get_operations()
      all_tensor_names = {output.name for op in ops for output in op.outputs}
      tensor_dict = {}
      for key in [
          'num_detections', 'detection_boxes', 'detection_scores',
          'detection_classes', 'detection_masks'
      ]:
        tensor_name = key + ':0'
        if tensor_name in all_tensor_names:
          tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(
              tensor_name)
      if 'detection_masks' in tensor_dict:
        # The following processing is only for single image
        detection_boxes = tf.squeeze(tensor_dict['detection_boxes'], [0])
        detection_masks = tf.squeeze(tensor_dict['detection_masks'], [0])
        # Reframe is required to translate mask from box coordinates to image coordinates and fit the image size.
        real_num_detection = tf.cast(tensor_dict['num_detections'][0], tf.int32)
        detection_boxes = tf.slice(detection_boxes, [0, 0], [real_num_detection, -1])
        detection_masks = tf.slice(detection_masks, [0, 0, 0], [real_num_detection, -1, -1])
        detection_masks_reframed = utils_ops.reframe_box_masks_to_image_masks(
            detection_masks, detection_boxes, image.shape[0], image.shape[1])
        detection_masks_reframed = tf.cast(
            tf.greater(detection_masks_reframed, 0.5), tf.uint8)
        # Follow the convention by adding back the batch dimension
        tensor_dict['detection_masks'] = tf.expand_dims(
            detection_masks_reframed, 0)
      image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')

      # Run inference
      output_dict = sess.run(tensor_dict,
                             feed_dict={image_tensor: np.expand_dims(image, 0)})

      # all outputs are float32 numpy arrays, so convert types as appropriate
      output_dict['num_detections'] = int(output_dict['num_detections'][0])       ### always 100 here, because the Pipeline sets max_total_detections to 100
      output_dict['detection_classes'] = output_dict[
          'detection_classes'][0].astype(np.uint8)                                ### 100 class ids, based on my label_map.pbtxt
      output_dict['detection_boxes'] = output_dict['detection_boxes'][0]          ### bbox coordinates
      output_dict['detection_scores'] = output_dict['detection_scores'][0]        ### Confidence of each of the 100 detections; only those above the Threshold are drawn
      if 'detection_masks' in output_dict:
        output_dict['detection_masks'] = output_dict['detection_masks'][0]
  return output_dict

for image_path in TEST_IMAGE_PATHS:
  image = Image.open(image_path)
  # the array based representation of the image will be used later in order to prepare the
  # result image with boxes and labels on it.
  image_np = load_image_into_numpy_array(image)
  # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
  image_np_expanded = np.expand_dims(image_np, axis=0)
  # Actual detection.
  output_dict = run_inference_for_single_image(image_np, detection_graph)
  # Visualization of the results of a detection.
  vis_util.visualize_boxes_and_labels_on_image_array(
      image_np,                                          ## image output
      output_dict['detection_boxes'],      ## array of 100 bboxes, 4 values each: normalized [ymin, xmin, ymax, xmax]; multiply y by height and x by width for pixels
      output_dict['detection_classes'],    ## array of 100 class ids (1 or 2)
      output_dict['detection_scores'],     ## array of 100 confidences (only 0.5 and above are drawn)
      category_index,                                    ## my label_map.pbtxt information defined above
      instance_masks=output_dict.get('detection_masks'),
      use_normalized_coordinates=True,
      line_thickness=8)                                  ## line thickness
  plt.figure(figsize=IMAGE_SIZE)
  plt.imshow(image_np)
  plt.title(image_path)

 # print("num_detections",output_dict['num_detections'])         ### the trained model returns at most 100 detections (see the SSD pipeline section above)
 # print("detection_boxes",output_dict['detection_boxes'])       ### box positions for the 100 detections
 # print("detection_classes",output_dict['detection_classes'])   ### class of each of the 100 detections, 1 or 2 (only 1 and 2 are declared for now)
 # print("detection_scores",output_dict['detection_scores'])     ### confidence of each of the 100 detections; only those above the pipeline threshold are drawn

  for i,v in enumerate (output_dict['detection_scores']):   ### i: index, v: the score at that index
      if v > 0.5:                                           ### of the 100 detections, report only those scoring above 0.5
        print("  - class-name:", category_index.get(output_dict['detection_classes'][i]).get('name') )   ### category_index is built from the label_map.pbtxt defined above and gives the class name
        print("  - confidence: ",v * 100 )                  ### convert to percent
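
The loop above only prints the class name and score. A minimal sketch of the same thresholding that also converts the normalized boxes to pixel coordinates is shown here. The function name filter_detections and the sample values are my own for illustration; only the [ymin, xmin, ymax, xmax] layout and the 0.5 threshold come from the code above.

```python
def filter_detections(boxes, classes, scores, image_height, image_width,
                      min_score=0.5):
    """Keep detections scoring above min_score and scale boxes to pixels.

    boxes holds normalized [ymin, xmin, ymax, xmax] rows, the same layout
    as output_dict['detection_boxes'].
    """
    kept = []
    for box, cls, score in zip(boxes, classes, scores):
        if score <= min_score:
            continue
        ymin, xmin, ymax, xmax = box
        pixel_box = (int(round(ymin * image_height)),
                     int(round(xmin * image_width)),
                     int(round(ymax * image_height)),
                     int(round(xmax * image_width)))
        kept.append((int(cls), pixel_box, float(score)))
    return kept


# Made-up values standing in for one image's output_dict entries.
sample_boxes = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0], [0.3, 0.3, 0.4, 0.4]]
sample_classes = [1, 2, 1]
sample_scores = [0.9, 0.4, 0.75]

for cls, pixel_box, score in filter_detections(
        sample_boxes, sample_classes, sample_scores,
        image_height=200, image_width=400):
    print("class", cls, "box", pixel_box, "confidence %.1f" % (score * 100))
```

With the real output_dict, image_height and image_width would come from image_np.shape[0] and image_np.shape[1].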

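6. Conclusion

SSD and Faster RCNN both work well in general. Simple tests are possible on my laptop,

but raising the number of STEPS for the final test was too much for it, so I ran that on a server.
(Nearly every problem on the laptop was related to GPU memory.)
On a laptop, GPU memory is limited and has to be watched constantly, and the behavior differs from the server, so take care.

As for Transfer Learning and Fine Tuning, I worked on them for a month because of an acquaintance's project; with a little more practice I expect to become comfortable with them quickly.

I will try this again when I get the chance, but there is no need to think of it as too difficult.

Always check GPU memory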

$ watch -n 0.1 nvidia-smi
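
If the numbers are needed inside a script rather than on screen, nvidia-smi can also be queried in CSV form. The following is a rough sketch, assuming nvidia-smi is on the PATH and that its --query-gpu/--format options are available on the installed driver; the helper names parse_gpu_memory and read_gpu_memory are my own.

```python
import subprocess

# Ask for used and total memory only, in MiB, one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines into a list of (used_mib, total_mib)."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        usage.append((used, total))
    return usage


def read_gpu_memory():
    """Run nvidia-smi once; this needs the NVIDIA driver to be installed."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_memory(result.stdout)


# Made-up sample in the shape the query prints, so the parser can be tried
# on a machine without a GPU.
sample_output = "3512, 4096\n"
for index, (used, total) in enumerate(parse_gpu_memory(sample_output)):
    print("GPU %d: %d / %d MiB (%.1f%%)" % (index, used, total,
                                            100.0 * used / total))
```

Calling read_gpu_memory() in a loop before and during training gives the same information as the watch command, in a form that can be logged.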

7. References and sites to revisit

Other references

  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md
  https://github.com/vijendra1125/Tensorflow_Object_detection_API-Custom_Faster_RCNN
  https://github.com/vijendra1125/Tensorflow_Object_detection_API-Custom_Faster_RCNN/issues/1

Various TFRecord format tools

https://github.com/tensorflow/models/tree/master/research/object_detection/dataset_tools
https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pascal_tf_record.py

Sites to go back over and organize later

Object Detection references

https://bcho.tistory.com/1192
https://ukayzm.github.io/python-object-detection-tensorflow/

TF-TRT (TensorFlow and TensorRT)

https://developers-kr.googleblog.com/2018/05/tensorrt-integration-with-tensorflow.html

CHPT and PT

  https://gusrb.tistory.com/21
  http://jaynewho.com/post/8
  https://goodtogreate.tistory.com/entry/Saving-and-Restoring

Custom Object Detection

  https://github.com/5taku/custom_object_detection
  https://github.com/engiego/Custom-Object-Detection-engiegocustom
  https://www.slideshare.net/fermat39/mlnet-automl?next_slideshow=1
  https://towardsdatascience.com/custom-object-detection-using-tensorflow-from-scratch-e61da2e10087
  https://medium.com/coinmonks/tensorflow-object-detection-with-custom-objects-34a2710c6de5
  https://becominghuman.ai/tensorflow-object-detection-api-tutorial-training-and-evaluating-custom-object-detector-ed2594afcf73
  https://hwauni.tistory.com/entry/API-Custom-Object-Detection-API-Tutorial-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%A4%80%EB%B9%84-Part-1
  https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/
  https://jameslittle.me/blog/2019/tensorflow-object-detection

Colab

https://hackernoon.com/object-detection-in-google-colab-with-custom-dataset-5a7bb2b0e97e
https://medium.com/analytics-vidhya/detecting-fires-using-tensorflow-b5b148952495

10/25/2019

NVIDIA Docker SSD Training Analysis (2nd analysis)

1. NVIDIA Object Detection SSD Docker

NVIDIA currently publishes its TensorFlow Deep Learning Examples on GitHub; check each of the sites below.

NVIDIA DeepLearning SSD on GitHub

I follow the README.md of the SSD GitHub repository below; reading it is enough to work through the steps.
https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

With the source above it is easy to set up the NVIDIA TensorFlow Object Detection SSD Docker image and to test it.

Other DeepLearning Examples on GitHub

The other NVIDIA DeepLearning Examples below are also worth looking at (not yet tested).
https://github.com/NVIDIA/DeepLearningExamples

Other references

Introduction to the training features NVIDIA provides for each framework
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

TensorFlow Guide on the TensorFlow site
https://www.tensorflow.org/tutorials?hl=ko

1.1 NVIDIA SSD Docker Quick Guide

Following README.md and running the steps below, the Object Detection SSD model can be trained easily with Docker (based on the COCO dataset).

Quick Guide 1. Clone the repository

$ git clone https://github.com/NVIDIA/DeepLearningExamples
$ cd DeepLearningExamples/TensorFlow/Detection/SSD

Quick Guide 2. Build the SSD320 v1.2 TensorFlow NGC container.

$ docker build . -t nvidia_ssd

Running the command above builds a new Docker image from the Dockerfile.
An image can also be created directly from a container with docker commit.

Quick Guide 3. Download and preprocess the dataset. (COCO 2017)

$ ./download_all.sh nvidia_ssd /home/jhlee/works/ssd/data /home/jhlee/works/ssd/check

Quick Guide 4. Launch the NGC container to run training/inference.

$ nvidia-docker run --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
nvidia_ssd

Quick Guide 5. Start training.

root@c7550d6b2c59:/workdir/models/research#  bash ./examples/SSD320_FP16_1GPU.sh /checkpoints

Quick Guide 6. Start validation/evaluation.

root@c7550d6b2c59:/workdir/models/research#   bash examples/SSD320_evaluate.sh /checkpoints

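2. Running and analyzing Object Detection SSD

2.1 Running and analyzing Quick Guide 1~2

This step installs the required packages on top of nvcr.io/nvidia/tensorflow:19.05-py3 and builds a new Docker image.

Run the following directly on the HOST

Run the commands from the GitHub README as they are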

$ cd ~/works
$ mkdir ssd 
$ cd ssd
$ mkdir data
$ mkdir check 
$ git clone https://github.com/NVIDIA/DeepLearningExamples
$ cd DeepLearningExamples/TensorFlow/Detection/SSD
$ ls 
configs  Dockerfile  download_all.sh  examples  img  models  NOTICE  README.md  requirements.txt

Building the Docker image

The Dockerfile starts from nvcr.io/nvidia/tensorflow:19.05-py3, installs the required packages, and then the image is created

$ docker build . -t nvidia_ssd    # build the image from the Dockerfile

Analyzing and understanding the Dockerfile above

To understand the Dockerfile below, the current directory matters

$ pwd 
/home/jhlee/works/ssd/DeepLearningExamples/TensorFlow/Detection/SSD
$ cat Dockerfile 
FROM nvcr.io/nvidia/tensorflow:19.05-py3 as base

FROM base as sha

RUN mkdir /sha
RUN cat `cat HEAD | cut -d' ' -f2` > /sha/repo_sha

FROM base as final

WORKDIR /workdir

RUN PROTOC_VERSION=3.0.0 && \
    PROTOC_ZIP=protoc-${PROTOC_VERSION}-linux-x86_64.zip && \
    curl -OL https://github.com/google/protobuf/releases/download/v$PROTOC_VERSION/$PROTOC_ZIP && \
    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
    rm -f $PROTOC_ZIP

COPY requirements.txt .
RUN pip install Cython
RUN pip install -r requirements.txt

WORKDIR models/research/
COPY models/research/ .
RUN protoc object_detection/protos/*.proto --python_out=.
ENV PYTHONPATH="/workdir/models/research/:/workdir/models/research/slim/:$PYTHONPATH"

COPY examples/ examples
COPY configs/ configs/
COPY download_all.sh download_all.sh

COPY --from=sha /sha .

Google Protocol Buffer

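Thanks to the author of this post for the detailed explanation
https://bcho.tistory.com/1182

Dockerfile

Also published as a book; an easy way to learn how to write and use a Dockerfile
http://pyrasis.com/docker.html

Files at the current location above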
https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD

Checking the generated Docker image

$ docker images   
REPOSITORY                  TAG                             IMAGE ID            CREATED             SIZE
nvidia_ssd                  latest                          ab529215f717        5 minutes ago       6.97GB
none                        none                            a6bc644c75ed        6 minutes ago       6.96GB  // intermediate image created while building nvidia_ssd
nvcr.io/nvidia/tensorflow   19.08-py3                       be978d32a5c3        8 weeks ago         7.35GB
nvcr.io/nvidia/cuda         10.1-cudnn7-devel-ubuntu18.04   0ead98c22e04        8 weeks ago         3.67GB
nvidia/cuda                 9.0-devel                       2a64416134d8        8 weeks ago         2.03GB
nvcr.io/nvidia/cuda         10.1-devel-ubuntu18.04          946e78c7b298        8 weeks ago         2.83GB
nvidia/cuda                 10.1-base                       a5f5d3b655ca        8 weeks ago         106MB
nvcr.io/nvidia/tensorflow   19.05-py3                       01c8c4b0d7ff        5 months ago        6.96GB

nvidia_ssd needs both the none image and nvcr.io/nvidia/tensorflow:19.05-py3

2.2 Running and analyzing Quick Guide 3

Quick Guide 3 downloads the COCO dataset and builds TFRecord files from it.

COCO dataset download and TFRecord generation (download_all.sh)

This shell script runs on the host, and the host needs the two directories below.
Its main job is to download the COCO dataset and generate TFRecords from it.

/data/coco2017_tfrecords : where the COCO data and the TFRecords are stored
/checkpoints : TensorFlow checkpoint files; I will look into these separately.

Run on the HOST

$ ./download_all.sh nvidia_ssd /home/jhlee/works/ssd/data /home/jhlee/works/ssd/check

$ cat ./download_all.sh   # basic analysis: much the same as before, except that the shell script below runs inside the container; this replaces the earlier dataset download step

if [ -z $1 ]; then echo "Docker container name is missing" && exit 1; fi
## 1st ARG : CONTAINER NAME
## 2nd ARG : BASE PATH /data/coco2017_tfrecords
## 3rd ARG : BASE PATH /checkpoints 
CONTAINER=$1
COCO_DIR=${2:-"/data/coco2017_tfrecords"}
CHECKPOINT_DIR=${3:-"/checkpoints"}
mkdir -p $COCO_DIR
chmod 777 $COCO_DIR
# Download backbone checkpoint
mkdir -p $CHECKPOINT_DIR
chmod 777 $CHECKPOINT_DIR
cd $CHECKPOINT_DIR
wget http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz
tar -xzf resnet_v1_50_2016_08_28.tar.gz
mkdir -p resnet_v1_50
mv resnet_v1_50.ckpt resnet_v1_50/model.ckpt
## Works with either nvidia-docker or docker; the script below must run inside the Docker container, with the bash script started as the container launches
## download_and_preprocess_mscoco.sh inside the container downloads COCO 2017 and then generates the TFRecords as shown below
nvidia-docker run --rm -it -u 123 -v $COCO_DIR:/data/coco2017_tfrecords $CONTAINER bash -c '
# Create TFRecords
bash /workdir/models/research/object_detection/dataset_tools/download_and_preprocess_mscoco.sh \
    /data/coco2017_tfrecords'

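Analyzing download_and_preprocess_mscoco.sh

This is the shell script that actually does the work inside the container, so the container has to be started in order to read it.

Downloads COCO 2017 (including the annotations)
Generates TFRecords from the dataset

To analyze the shell script, simply start a container as follows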

$ nvidia-docker run --rm -it -u 123 -v $HOME/works/ssd/data:/data/coco2017_tfrecords nvidia_ssd 
================
== TensorFlow ==
================

NVIDIA Release 19.05 (build 6390160)
TensorFlow Version 1.13.1
.....

I have no name!@a4891a3ac177:/workdir/models/research$ cat object_detection/dataset_tools/download_and_preprocess_mscoco.sh 
#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "usage download_and_preprocess_mscoco.sh [data dir]"
  exit
fi

if [ "$(uname)" == "Darwin" ]; then
  UNZIP="tar -xf"
else
  UNZIP="unzip -nq"
fi

# Create the output directories.
OUTPUT_DIR="${1%/}"
SCRATCH_DIR="${OUTPUT_DIR}/raw-data"
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${SCRATCH_DIR}"
CURRENT_DIR=$(pwd)

# Helper function to download and unpack a .zip file.
function download_and_unzip() {
  local BASE_URL=${1}
  local FILENAME=${2}

  if [ ! -f ${FILENAME} ]; then
    echo "Downloading ${FILENAME} to $(pwd)"
    wget -nd -c "${BASE_URL}/${FILENAME}"
  else
    echo "Skipping download of ${FILENAME}"
  fi
  echo "Unzipping ${FILENAME}"
  ${UNZIP} ${FILENAME}
}

cd ${SCRATCH_DIR}

## As the name says, this downloads the COCO dataset; a large number of images is required
## (I need to look at the dataset in more detail.)

# Download the images.     
BASE_IMAGE_URL="http://images.cocodataset.org/zips"

TRAIN_IMAGE_FILE="train2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TRAIN_IMAGE_FILE}
TRAIN_IMAGE_DIR="${SCRATCH_DIR}/train2017"

VAL_IMAGE_FILE="val2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${VAL_IMAGE_FILE}
VAL_IMAGE_DIR="${SCRATCH_DIR}/val2017"

TEST_IMAGE_FILE="test2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TEST_IMAGE_FILE}
TEST_IMAGE_DIR="${SCRATCH_DIR}/test2017"

## Downloads the annotations; there are quite a few kinds, so here too I need to understand what each part of the dataset is for.

# Download the annotations.
BASE_INSTANCES_URL="http://images.cocodataset.org/annotations"
INSTANCES_FILE="annotations_trainval2017.zip"
download_and_unzip ${BASE_INSTANCES_URL} ${INSTANCES_FILE}

#
# Of the annotations, train and validation use only instances_train2017.json / instances_val2017.json
#
TRAIN_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/instances_train2017.json"
VAL_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/instances_val2017.json"

# Download the test image info.
BASE_IMAGE_INFO_URL="http://images.cocodataset.org/annotations"
IMAGE_INFO_FILE="image_info_test2017.zip"
download_and_unzip ${BASE_IMAGE_INFO_URL} ${IMAGE_INFO_FILE}

#
# For test, image_info_test-dev2017.json from the annotations is used
#
TESTDEV_ANNOTATIONS_FILE="${SCRATCH_DIR}/annotations/image_info_test-dev2017.json"

# Build TFRecords of the image data.
cd "${CURRENT_DIR}"
python object_detection/dataset_tools/create_coco_tf_record.py \
  --logtostderr \
  --include_masks \
  --train_image_dir="${TRAIN_IMAGE_DIR}" \
  --val_image_dir="${VAL_IMAGE_DIR}" \
  --test_image_dir="${TEST_IMAGE_DIR}" \
  --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
  --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
  --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
  --output_dir="${OUTPUT_DIR}"

Above, dataset_tools/create_coco_tf_record.py is used to generate the TFRecord files.
If the dataset changes, refer to the other scripts in dataset_tools.
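
The download_and_unzip helper in the shell script skips archives that are already present and, through unzip -n, never overwrites files that were already extracted. A minimal Python sketch of that behavior follows. It is not part of the NVIDIA or TensorFlow sources; the fetch parameter is my own addition so the function can be tried without network access, and unlike wget -c it does not resume partial downloads.

```python
import os
import urllib.request
import zipfile


def download_and_unzip(base_url, filename, workdir,
                       fetch=urllib.request.urlretrieve):
    """Download filename into workdir unless it exists, then extract it.

    Returns the archive members that were newly extracted. Members that
    already exist on disk are left untouched, like `unzip -n`.
    """
    os.makedirs(workdir, exist_ok=True)
    archive = os.path.join(workdir, filename)
    if not os.path.isfile(archive):
        print("Downloading %s to %s" % (filename, workdir))
        fetch("%s/%s" % (base_url, filename), archive)
    else:
        print("Skipping download of %s" % filename)

    print("Unzipping %s" % filename)
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if not os.path.exists(os.path.join(workdir, member)):
                zf.extract(member, workdir)
                extracted.append(member)
    return extracted


# Example call (commented out because it downloads a large archive):
# download_and_unzip("http://images.cocodataset.org/zips", "val2017.zip",
#                    "/data/coco2017_tfrecords/raw-data")
```

Running it twice on the same directory makes the second call a no-op, which is what allows an interrupted download_all.sh run to be restarted.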

Preparing Inputs (shows the setup for other datasets)
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/preparing_inputs.md

root@c7550d6b2c59:/workdir/models/research# ls object_detection/dataset_tools/
__init__.py                    create_kitti_tf_record.py       create_pascal_tf_record.py       create_pycocotools_package.sh         oid_hierarchical_labels_expansion_test.py  tf_record_creation_util.py
create_coco_tf_record.py       create_kitti_tf_record_test.py  create_pascal_tf_record_test.py  download_and_preprocess_mscoco.sh     oid_tfrecord_creation.py                   tf_record_creation_util_test.py
create_coco_tf_record_test.py  create_oid_tf_record.py         create_pet_tf_record.py          oid_hierarchical_labels_expansion.py  oid_tfrecord_creation_test.py

## As the script below shows, COCO annotations are in JSON format
root@c7550d6b2c59:/workdir/models/research# cat object_detection/dataset_tools/create_coco_tf_record.py 
r"""Convert raw COCO dataset to TFRecord for object_detection.

Please note that this tool creates sharded output files.

Example usage:
    python create_coco_tf_record.py --logtostderr \
      --train_image_dir="${TRAIN_IMAGE_DIR}" \
      --val_image_dir="${VAL_IMAGE_DIR}" \
      --test_image_dir="${TEST_IMAGE_DIR}" \
      --train_annotations_file="${TRAIN_ANNOTATIONS_FILE}" \
      --val_annotations_file="${VAL_ANNOTATIONS_FILE}" \
      --testdev_annotations_file="${TESTDEV_ANNOTATIONS_FILE}" \
      --output_dir="${OUTPUT_DIR}"
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import hashlib
import io
import json
import os
import contextlib2
import numpy as np
import PIL.Image

from pycocotools import mask
import tensorflow as tf

from object_detection.dataset_tools import tf_record_creation_util
from object_detection.utils import dataset_util
from object_detection.utils import label_map_util


flags = tf.app.flags
tf.flags.DEFINE_boolean('include_masks', False,
                        'Whether to include instance segmentations masks '
                        '(PNG encoded) in the result. default: False.')
tf.flags.DEFINE_string('train_image_dir', '',
                       'Training image directory.')
tf.flags.DEFINE_string('val_image_dir', '',
                       'Validation image directory.')
tf.flags.DEFINE_string('test_image_dir', '',
                       'Test image directory.')
tf.flags.DEFINE_string('train_annotations_file', '',
                       'Training annotations JSON file.')
tf.flags.DEFINE_string('val_annotations_file', '',
                       'Validation annotations JSON file.')
tf.flags.DEFINE_string('testdev_annotations_file', '',
                       'Test-dev annotations JSON file.')
tf.flags.DEFINE_string('output_dir', '/tmp/', 'Output data directory.')

FLAGS = flags.FLAGS

tf.logging.set_verbosity(tf.logging.INFO)


def create_tf_example(image,
                      annotations_list,
                      image_dir,
                      category_index,
                      include_masks=False):
  """Converts image and annotations to a tf.Example proto.

  Args:
    image: dict with keys:
      [u'license', u'file_name', u'coco_url', u'height', u'width',
      u'date_captured', u'flickr_url', u'id']
    annotations_list:
      list of dicts with keys:
      [u'segmentation', u'area', u'iscrowd', u'image_id',
      u'bbox', u'category_id', u'id']
      Notice that bounding box coordinates in the official COCO dataset are
      given as [x, y, width, height] tuples using absolute coordinates where
      x, y represent the top-left (0-indexed) corner.  This function converts
      to the format expected by the Tensorflow Object Detection API (which is
      which is [ymin, xmin, ymax, xmax] with coordinates normalized relative
      to image size).
    image_dir: directory containing the image files.
    category_index: a dict containing COCO category information keyed
      by the 'id' field of each category.  See the
      label_map_util.create_category_index function.
    include_masks: Whether to include instance segmentations masks
      (PNG encoded) in the result. default: False.
  Returns:
    example: The converted tf.Example
    num_annotations_skipped: Number of (invalid) annotations that were ignored.

  Raises:
    ValueError: if the image pointed to by data['filename'] is not a valid JPEG
  """
  image_height = image['height']
  image_width = image['width']
  filename = image['file_name']
  image_id = image['id']

  full_path = os.path.join(image_dir, filename)
  with tf.gfile.GFile(full_path, 'rb') as fid:
    encoded_jpg = fid.read()
  encoded_jpg_io = io.BytesIO(encoded_jpg)
  image = PIL.Image.open(encoded_jpg_io)
  key = hashlib.sha256(encoded_jpg).hexdigest()

  xmin = []
  xmax = []
  ymin = []
  ymax = []
  is_crowd = []
  category_names = []
  category_ids = []
  area = []
  encoded_mask_png = []
  num_annotations_skipped = 0
  for object_annotations in annotations_list:
    (x, y, width, height) = tuple(object_annotations['bbox'])
    if width <= 0 or height <= 0:
      num_annotations_skipped += 1
      continue
    if x + width > image_width or y + height > image_height:
      num_annotations_skipped += 1
      continue
    xmin.append(float(x) / image_width)
    xmax.append(float(x + width) / image_width)
    ymin.append(float(y) / image_height)
    ymax.append(float(y + height) / image_height)
    is_crowd.append(object_annotations['iscrowd'])
    category_id = int(object_annotations['category_id'])
    category_ids.append(category_id)
    category_names.append(category_index[category_id]['name'].encode('utf8'))
    area.append(object_annotations['area'])

    if include_masks:
      run_len_encoding = mask.frPyObjects(object_annotations['segmentation'],
                                          image_height, image_width)
      binary_mask = mask.decode(run_len_encoding)
      if not object_annotations['iscrowd']:
        binary_mask = np.amax(binary_mask, axis=2)
      pil_image = PIL.Image.fromarray(binary_mask)
      output_io = io.BytesIO()
      pil_image.save(output_io, format='PNG')
      encoded_mask_png.append(output_io.getvalue())
  feature_dict = {
      'image/height':
          dataset_util.int64_feature(image_height),
      'image/width':
          dataset_util.int64_feature(image_width),
      'image/filename':
          dataset_util.bytes_feature(filename.encode('utf8')),
      'image/source_id':
          dataset_util.bytes_feature(str(image_id).encode('utf8')),
      'image/key/sha256':
          dataset_util.bytes_feature(key.encode('utf8')),
      'image/encoded':
          dataset_util.bytes_feature(encoded_jpg),
      'image/format':
          dataset_util.bytes_feature('jpeg'.encode('utf8')),
      'image/object/bbox/xmin':
          dataset_util.float_list_feature(xmin),
      'image/object/bbox/xmax':
          dataset_util.float_list_feature(xmax),
      'image/object/bbox/ymin':
          dataset_util.float_list_feature(ymin),
      'image/object/bbox/ymax':
          dataset_util.float_list_feature(ymax),
      'image/object/class/text':
          dataset_util.bytes_list_feature(category_names),
      'image/object/is_crowd':
          dataset_util.int64_list_feature(is_crowd),
      'image/object/area':
          dataset_util.float_list_feature(area),
  }
  if include_masks:
    feature_dict['image/object/mask'] = (
        dataset_util.bytes_list_feature(encoded_mask_png))
  example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
  return key, example, num_annotations_skipped


def _create_tf_record_from_coco_annotations(
    annotations_file, image_dir, output_path, include_masks, num_shards):
  """Loads COCO annotation json files and converts to tf.Record format.

  Args:
    annotations_file: JSON file containing bounding box annotations.
    image_dir: Directory containing the image files.
    output_path: Path to output tf.Record file.
    include_masks: Whether to include instance segmentations masks
      (PNG encoded) in the result. default: False.
    num_shards: number of output file shards.
  """
  with contextlib2.ExitStack() as tf_record_close_stack, \
      tf.gfile.GFile(annotations_file, 'r') as fid:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_path, num_shards)
    groundtruth_data = json.load(fid)
    images = groundtruth_data['images']
    category_index = label_map_util.create_category_index(
        groundtruth_data['categories'])

    annotations_index = {}
    if 'annotations' in groundtruth_data:
      tf.logging.info(
          'Found groundtruth annotations. Building annotations index.')
      for annotation in groundtruth_data['annotations']:
        image_id = annotation['image_id']
        if image_id not in annotations_index:
          annotations_index[image_id] = []
        annotations_index[image_id].append(annotation)
    missing_annotation_count = 0
    for image in images:
      image_id = image['id']
      if image_id not in annotations_index:
        missing_annotation_count += 1
        annotations_index[image_id] = []
    tf.logging.info('%d images are missing annotations.',
                    missing_annotation_count)

    total_num_annotations_skipped = 0
    for idx, image in enumerate(images):
      if idx % 100 == 0:
        tf.logging.info('On image %d of %d', idx, len(images))
      annotations_list = annotations_index[image['id']]
      _, tf_example, num_annotations_skipped = create_tf_example(
          image, annotations_list, image_dir, category_index, include_masks)
      total_num_annotations_skipped += num_annotations_skipped
      shard_idx = idx % num_shards
      output_tfrecords[shard_idx].write(tf_example.SerializeToString())
    tf.logging.info('Finished writing, skipped %d annotations.',
                    total_num_annotations_skipped)


def main(_):
  assert FLAGS.train_image_dir, '`train_image_dir` missing.'
  assert FLAGS.val_image_dir, '`val_image_dir` missing.'
  assert FLAGS.test_image_dir, '`test_image_dir` missing.'
  assert FLAGS.train_annotations_file, '`train_annotations_file` missing.'
  assert FLAGS.val_annotations_file, '`val_annotations_file` missing.'
  assert FLAGS.testdev_annotations_file, '`testdev_annotations_file` missing.'

  if not tf.gfile.IsDirectory(FLAGS.output_dir):
    tf.gfile.MakeDirs(FLAGS.output_dir)
  train_output_path = os.path.join(FLAGS.output_dir, 'coco_train.record')
  val_output_path = os.path.join(FLAGS.output_dir, 'coco_val.record')
  testdev_output_path = os.path.join(FLAGS.output_dir, 'coco_testdev.record')

  _create_tf_record_from_coco_annotations(
      FLAGS.train_annotations_file,
      FLAGS.train_image_dir,
      train_output_path,
      FLAGS.include_masks,
      num_shards=100)
  _create_tf_record_from_coco_annotations(
      FLAGS.val_annotations_file,
      FLAGS.val_image_dir,
      val_output_path,
      FLAGS.include_masks,
      num_shards=10)
  _create_tf_record_from_coco_annotations(
      FLAGS.testdev_annotations_file,
      FLAGS.test_image_dir,
      testdev_output_path,
      FLAGS.include_masks,
      num_shards=100)


if __name__ == '__main__':
  tf.app.run()
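
The core of create_tf_example is the box conversion described in its docstring: COCO stores [x, y, width, height] in absolute pixels, while the Object Detection API expects normalized coordinates. The sketch below isolates that step together with the grouping of annotations by image_id, without TensorFlow. The function names and the sample annotations are my own; the skip rules mirror the two checks in the script.

```python
def coco_bbox_to_normalized(bbox, image_width, image_height):
    """Convert a COCO [x, y, width, height] box to (ymin, xmin, ymax, xmax).

    Returns None for boxes the script would skip: non-positive size, or
    extending past the right or bottom edge of the image.
    """
    x, y, width, height = bbox
    if width <= 0 or height <= 0:
        return None
    if x + width > image_width or y + height > image_height:
        return None
    return (float(y) / image_height,
            float(x) / image_width,
            float(y + height) / image_height,
            float(x + width) / image_width)


def group_by_image(annotations):
    """Build the image_id -> [annotation, ...] index used by the script."""
    index = {}
    for annotation in annotations:
        index.setdefault(annotation["image_id"], []).append(annotation)
    return index


# Made-up annotations in the shape of the COCO 'annotations' list.
sample_annotations = [
    {"image_id": 7, "bbox": [64, 48, 128, 96], "category_id": 1},
    {"image_id": 7, "bbox": [600, 10, 100, 50], "category_id": 2},  # too wide for 640
    {"image_id": 9, "bbox": [0, 0, 320, 240], "category_id": 1},
]

annotations_index = group_by_image(sample_annotations)
for annotation in annotations_index[7]:
    print(annotation["bbox"], "->",
          coco_bbox_to_normalized(annotation["bbox"],
                                  image_width=640, image_height=480))
```

Note that the script itself writes xmin, xmax, ymin and ymax as four separate float lists in the feature dict; the tuple order here follows the docstring only for readability.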

Checking the generated TFRecords

coco_train.record : set in the pipeline config
coco_val.record : set in the pipeline config
coco_testdev.record : not yet confirmed whether it is currently used

root@4b038f3383f2:/workdir/models/research# ls /data/coco2017_tfrecords/
annotation                          coco_testdev.record-00035-of-00100  coco_testdev.record-00071-of-00100  coco_train.record-00007-of-00100  coco_train.record-00043-of-00100  coco_train.record-00079-of-00100
coco_testdev.record-00000-of-00100  coco_testdev.record-00036-of-00100  coco_testdev.record-00072-of-00100  coco_train.record-00008-of-00100  coco_train.record-00044-of-00100  coco_train.record-00080-of-00100
coco_testdev.record-00001-of-00100  coco_testdev.record-00037-of-00100  coco_testdev.record-00073-of-00100  coco_train.record-00009-of-00100  coco_train.record-00045-of-00100  coco_train.record-00081-of-00100
coco_testdev.record-00002-of-00100  coco_testdev.record-00038-of-00100  coco_testdev.record-00074-of-00100  coco_train.record-00010-of-00100  coco_train.record-00046-of-00100  coco_train.record-00082-of-00100
coco_testdev.record-00003-of-00100  coco_testdev.record-00039-of-00100  coco_testdev.record-00075-of-00100  coco_train.record-00011-of-00100  coco_train.record-00047-of-00100  coco_train.record-00083-of-00100
coco_testdev.record-00004-of-00100  coco_testdev.record-00040-of-00100  coco_testdev.record-00076-of-00100  coco_train.record-00012-of-00100  coco_train.record-00048-of-00100  coco_train.record-00084-of-00100
coco_testdev.record-00005-of-00100  coco_testdev.record-00041-of-00100  coco_testdev.record-00077-of-00100  coco_train.record-00013-of-00100  coco_train.record-00049-of-00100  coco_train.record-00085-of-00100
coco_testdev.record-00006-of-00100  coco_testdev.record-00042-of-00100  coco_testdev.record-00078-of-00100  coco_train.record-00014-of-00100  coco_train.record-00050-of-00100  coco_train.record-00086-of-00100
coco_testdev.record-00007-of-00100  coco_testdev.record-00043-of-00100  coco_testdev.record-00079-of-00100  coco_train.record-00015-of-00100  coco_train.record-00051-of-00100  coco_train.record-00087-of-00100
coco_testdev.record-00008-of-00100  coco_testdev.record-00044-of-00100  coco_testdev.record-00080-of-00100  coco_train.record-00016-of-00100  coco_train.record-00052-of-00100  coco_train.record-00088-of-00100
coco_testdev.record-00009-of-00100  coco_testdev.record-00045-of-00100  coco_testdev.record-00081-of-00100  coco_train.record-00017-of-00100  coco_train.record-00053-of-00100  coco_train.record-00089-of-00100
coco_testdev.record-00010-of-00100  coco_testdev.record-00046-of-00100  coco_testdev.record-00082-of-00100  coco_train.record-00018-of-00100  coco_train.record-00054-of-00100  coco_train.record-00090-of-00100
coco_testdev.record-00011-of-00100  coco_testdev.record-00047-of-00100  coco_testdev.record-00083-of-00100  coco_train.record-00019-of-00100  coco_train.record-00055-of-00100  coco_train.record-00091-of-00100
coco_testdev.record-00012-of-00100  coco_testdev.record-00048-of-00100  coco_testdev.record-00084-of-00100  coco_train.record-00020-of-00100  coco_train.record-00056-of-00100  coco_train.record-00092-of-00100
coco_testdev.record-00013-of-00100  coco_testdev.record-00049-of-00100  coco_testdev.record-00085-of-00100  coco_train.record-00021-of-00100  coco_train.record-00057-of-00100  coco_train.record-00093-of-00100
coco_testdev.record-00014-of-00100  coco_testdev.record-00050-of-00100  coco_testdev.record-00086-of-00100  coco_train.record-00022-of-00100  coco_train.record-00058-of-00100  coco_train.record-00094-of-00100
coco_testdev.record-00015-of-00100  coco_testdev.record-00051-of-00100  coco_testdev.record-00087-of-00100  coco_train.record-00023-of-00100  coco_train.record-00059-of-00100  coco_train.record-00095-of-00100
coco_testdev.record-00016-of-00100  coco_testdev.record-00052-of-00100  coco_testdev.record-00088-of-00100  coco_train.record-00024-of-00100  coco_train.record-00060-of-00100  coco_train.record-00096-of-00100
coco_testdev.record-00017-of-00100  coco_testdev.record-00053-of-00100  coco_testdev.record-00089-of-00100  coco_train.record-00025-of-00100  coco_train.record-00061-of-00100  coco_train.record-00097-of-00100
coco_testdev.record-00018-of-00100  coco_testdev.record-00054-of-00100  coco_testdev.record-00090-of-00100  coco_train.record-00026-of-00100  coco_train.record-00062-of-00100  coco_train.record-00098-of-00100
coco_testdev.record-00019-of-00100  coco_testdev.record-00055-of-00100  coco_testdev.record-00091-of-00100  coco_train.record-00027-of-00100  coco_train.record-00063-of-00100  coco_train.record-00099-of-00100
coco_testdev.record-00020-of-00100  coco_testdev.record-00056-of-00100  coco_testdev.record-00092-of-00100  coco_train.record-00028-of-00100  coco_train.record-00064-of-00100  coco_val.record-00000-of-00010
coco_testdev.record-00021-of-00100  coco_testdev.record-00057-of-00100  coco_testdev.record-00093-of-00100  coco_train.record-00029-of-00100  coco_train.record-00065-of-00100  coco_val.record-00001-of-00010
coco_testdev.record-00022-of-00100  coco_testdev.record-00058-of-00100  coco_testdev.record-00094-of-00100  coco_train.record-00030-of-00100  coco_train.record-00066-of-00100  coco_val.record-00002-of-00010
coco_testdev.record-00023-of-00100  coco_testdev.record-00059-of-00100  coco_testdev.record-00095-of-00100  coco_train.record-00031-of-00100  coco_train.record-00067-of-00100  coco_val.record-00003-of-00010
coco_testdev.record-00024-of-00100  coco_testdev.record-00060-of-00100  coco_testdev.record-00096-of-00100  coco_train.record-00032-of-00100  coco_train.record-00068-of-00100  coco_val.record-00004-of-00010
coco_testdev.record-00025-of-00100  coco_testdev.record-00061-of-00100  coco_testdev.record-00097-of-00100  coco_train.record-00033-of-00100  coco_train.record-00069-of-00100  coco_val.record-00005-of-00010
coco_testdev.record-00026-of-00100  coco_testdev.record-00062-of-00100  coco_testdev.record-00098-of-00100  coco_train.record-00034-of-00100  coco_train.record-00070-of-00100  coco_val.record-00006-of-00010
coco_testdev.record-00027-of-00100  coco_testdev.record-00063-of-00100  coco_testdev.record-00099-of-00100  coco_train.record-00035-of-00100  coco_train.record-00071-of-00100  coco_val.record-00007-of-00010
coco_testdev.record-00028-of-00100  coco_testdev.record-00064-of-00100  coco_train.record-00000-of-00100    coco_train.record-00036-of-00100  coco_train.record-00072-of-00100  coco_val.record-00008-of-00010
coco_testdev.record-00029-of-00100  coco_testdev.record-00065-of-00100  coco_train.record-00001-of-00100    coco_train.record-00037-of-00100  coco_train.record-00073-of-00100  coco_val.record-00009-of-00010
coco_testdev.record-00030-of-00100  coco_testdev.record-00066-of-00100  coco_train.record-00002-of-00100    coco_train.record-00038-of-00100  coco_train.record-00074-of-00100  raw-data
coco_testdev.record-00031-of-00100  coco_testdev.record-00067-of-00100  coco_train.record-00003-of-00100    coco_train.record-00039-of-00100  coco_train.record-00075-of-00100
coco_testdev.record-00032-of-00100  coco_testdev.record-00068-of-00100  coco_train.record-00004-of-00100    coco_train.record-00040-of-00100  coco_train.record-00076-of-00100
coco_testdev.record-00033-of-00100  coco_testdev.record-00069-of-00100  coco_train.record-00005-of-00100    coco_train.record-00041-of-00100  coco_train.record-00077-of-00100
coco_testdev.record-00034-of-00100  coco_testdev.record-00070-of-00100  coco_train.record-00006-of-00100    coco_train.record-00042-of-00100  coco_train.record-00078-of-00100
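
The file names in this listing follow from num_shards in main() and from the line shard_idx = idx % num_shards. A small sketch of that naming and round-robin assignment is given here. The name pattern is inferred from the listing above rather than copied from tf_record_creation_util, and both function names are my own.

```python
def shard_name(base_path, shard_idx, num_shards):
    """Name of one shard, e.g. coco_val.record-00000-of-00010."""
    return "%s-%05d-of-%05d" % (base_path, shard_idx, num_shards)


def records_per_shard(num_records, num_shards):
    """Count how many records each shard receives when record idx is
    written to shard idx % num_shards, as in the script."""
    counts = [0] * num_shards
    for idx in range(num_records):
        counts[idx % num_shards] += 1
    return counts


print(shard_name("coco_val.record", 0, 10))
print(shard_name("coco_train.record", 99, 100))
print(records_per_shard(25, 10))
```

Because of the round-robin split, shard sizes differ by at most one record, so every shard is a roughly uniform sample of the dataset.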

TFRecord and TF Example
https://www.tensorflow.org/tutorials/load_data/tfrecord

COCO dataset 2017 files used above

  http://images.cocodataset.org/zips/train2017.zip
  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
  http://images.cocodataset.org/zips/val2017.zip
  http://images.cocodataset.org/zips/test2017.zip

Review of the COCO dataset details

https://ahyuo79.blogspot.com/2019/10/cocodata-set.html

2.3 Running and Analyzing Quick Guide Steps 4-5

First start the Docker container as shown below, then run training with the Tensorflow training shell script.
In my case there is only one GPU, so training is very slow.

nvidia-docker looks likely to be phased out. The container can also be started with plain docker as shown below, but nvidia-container-toolkit must be installed.
See the earlier post for details.
https://ahyuo79.blogspot.com/2019/10/nvidia-docker.html

Running the NGC-based image with the installed nvidia-docker2

$ nvidia-docker run --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd

Running with docker instead (when nvidia-docker2 is not used)

Port mappings for Tensorboard/Jupyter added
A name is set so the container is easy to find

$ docker run --gpus all --rm -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 -p 6006:6006  \
-v /home/jhlee/works/ssd/data:/data/coco2017_tfrecords \
-v /home/jhlee/works/ssd/check:/checkpoints \
--ipc=host \
--name nvidia_ssd \
nvidia_ssd

Running and analyzing the training step

Reading the training shell script shows that it relies on a pipeline config file internally, which is worth knowing about.

root@c7550d6b2c59:/workdir/models/research# bash ./examples/SSD320_FP16_1GPU.sh /checkpoints

root@c7550d6b2c59:/workdir/models/research# cat examples/SSD320_FP16_1GPU.sh 

CKPT_DIR=${1:-"/results/SSD320_FP16_1GPU"}
PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_1gpus.config"

export TF_ENABLE_AUTO_MIXED_PRECISION=1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

time python -u ./object_detection/model_main.py \
       --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
       --model_dir=${CKPT_DIR} \
       --alsologtostder \
       "${@:3}"

Configuring the Object Detection Training Pipeline

The pipeline config holds the settings required for training, and the settings seem to differ slightly from model to model.
To look at other config files, check object_detection/samples/configs.
The pre-trained model used here is resnet_v1_50, so it helps to understand how it is used.

root@a79a83fc99f6:/workdir/models/research# cat configs/ssd320_full_1gpus.config 
# SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal
# loss (a.k.a Retinanet).
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: true
    num_classes: 90         ## number of labels; matches the highest class id in object_detection/data/mscoco_label_map.pbtxt (80 labeled classes, ids up to 90)
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
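        matched_threshold: 0.5      ## IoU threshold: anchors overlapping a groundtruth box by at least 0.5 count as matches during training (the display score threshold in object_detection/object_detection_tutorial.ipynb is a separate setting)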
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: [1.0, 2.0, 0.5]
        scales_per_octave: 2
      }
    }
    image_resizer {                ## this appears to be the network input shape (images are resized to 320x320)
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        depth: 256
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.0004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            scale: true,
            decay: 0.997,
            epsilon: 0.001,
          }
        }
        num_layers_before_predictor: 4
        kernel_size: 3
      }
    }
    feature_extractor {
      type: 'ssd_resnet50_v1_fpn'       # ResNet-50 is used as the feature extractor; this can be changed
      fpn {
        min_level: 3
        max_level: 7
      }
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.0004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          scale: true,
          decay: 0.997,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.25
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6                 
        max_detections_per_class: 100     ### maximum detections kept per class
        max_total_detections: 100         ### maximum detections kept overall; upper bound on output_dict['num_detections'] in object_detection/object_detection_tutorial.ipynb
      }
      score_converter: SIGMOID
    }
  }
}

# 
# The model starts from a Google pre-trained model.
# SSD: when a pre-trained model is used for the internal feature extractor, fine_tune_checkpoint_type is "classification"
# Faster R-CNN: fine_tune_checkpoint_type is "detection"
#

train_config: {
  fine_tune_checkpoint: "/checkpoints/resnet_v1_50/model.ckpt"        ## location of the pre-trained model downloaded above
  fine_tune_checkpoint_type: "classification"                         # reportedly depends on the model
  batch_size: 32    ## may cause GPU out-of-memory errors; set it to fit your GPU memory, or switch to CPU mode
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 100000    ## 100,000 training steps (can also be set with object_detection/model_main.py --num_train_steps)
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 0.0
      min_aspect_ratio: 0.75
      max_aspect_ratio: 3.0
      min_area: 0.75
      max_area: 1.0
      overlap_thresh: 0.0
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .02000000000000000000
          total_steps: 100000
          warmup_learning_rate: .00866640000000000000
          warmup_steps: 8000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100         ## maximum number of groundtruth boxes per image during training; kept equal to max_total_detections here (cf. output_dict['detection_boxes'] in object_detection_tutorial.ipynb)
  unpad_groundtruth_tensors: false
}

#
# Training Setting 
# 
# input_path:  //TF Record 
#      coco_train.record-00000-of-00100 
#      coco_train.record-00001-of-00100 
#      .....
# label_map_path:
#      mscoco_label_map.pbtxt
#
train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/coco2017_tfrecords/*train*"    ## location of the TFRecord files
  }
  label_map_path: "object_detection/data/mscoco_label_map.pbtxt"  ## location of the label map
}


# 
# Eval Setting 
# 
#
#

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 8000   ##  number of examples used for evaluation
  ##max_evals: 10               ##  maximum number of evaluations (can also be set with object_detection/model_main.py --eval_count); not present in the original config
}

# 
# Eval Setting 
# 
# input_path:  //TF Record 
#      coco_val.record-00000-of-00010
#      coco_val.record-00001-of-00010 
#  
# label_map_path:
#      mscoco_label_map.pbtxt
#

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/coco2017_tfrecords/*val*"  ## location of the TFRecord files
  }
  label_map_path: "object_detection/data/mscoco_label_map.pbtxt" ## label information can be checked here
  shuffle: false
  num_readers: 1
}
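
The `cosine_decay_learning_rate` block in the config above combines a warmup phase with a cosine decay. The following is a minimal sketch of that shape, assuming a linear warmup from `warmup_learning_rate` to `learning_rate_base` followed by a half-cosine down to zero at `total_steps`. The function name is hypothetical and the Object Detection API's own implementation may differ in detail; only the four parameter values are taken from the config.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base=0.02, total_steps=100000,
                             warmup_learning_rate=0.0086664, warmup_steps=8000):
    """Linear warmup followed by cosine decay (illustrative sketch)."""
    if step > total_steps:
        return 0.0
    if step < warmup_steps:
        # Warmup: straight line from warmup_learning_rate to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Decay: half a cosine period from learning_rate_base down to zero.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


for step in (0, 4000, 8000, 54000, 100000):
    print(step, round(cosine_decay_with_warmup(step), 6))
```

Under these assumptions the rate peaks at step 8000 and has fallen to half of the base rate midway through the decay phase.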

Pre-trained Models

https://github.com/tensorflow/models/tree/master/research/slim

Configuring the Object Detection Training Pipeline

  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md
  https://medium.com/coinmonks/modelling-transfer-learning-using-tensorflows-object-detection-model-on-mac-692c8609be40
  https://devtalk.nvidia.com/default/topic/1049371/tensorrt/how-to-visualize-tf-trt-graphs-with-tensorboard-/
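
The config above pairs `iou_similarity` with an `argmax_matcher` whose `matched_threshold` is 0.5. As a rough sketch of what that means during training, the code below computes IoU for boxes in `(ymin, xmin, ymax, xmax)` order and assigns an anchor to its best-overlapping groundtruth box only when the overlap reaches the threshold. It is a simplification with hypothetical function names: options such as `force_match_for_each_row` are ignored.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchor(anchor, groundtruth_boxes, matched_threshold=0.5):
    """Return the index of the best-overlapping groundtruth box, or -1 (negative)."""
    best_index, best_iou = -1, 0.0
    for index, gt in enumerate(groundtruth_boxes):
        overlap = iou(anchor, gt)
        if overlap > best_iou:
            best_index, best_iou = index, overlap
    return best_index if best_iou >= matched_threshold else -1


groundtruth = [(0, 0, 10, 10)]                 # made-up box for illustration
print(match_anchor((0, 0, 10, 8), groundtruth))    # large overlap
print(match_anchor((5, 5, 15, 15), groundtruth))   # small overlap
```

Since `matched_threshold` and `unmatched_threshold` are both 0.5 in this config, every anchor ends up either positive or negative and none is ignored.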

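2.4 Running and Analyzing Quick Guide Step 6

This step evaluates the trained model on the validation set. Because the script passes --checkpoint_dir together with --run_once, model_main.py runs a single evaluation pass over the given checkpoint and does not update the weights. Interpreting the results in detail requires a better understanding of Tensorflow and the basic structure of the dataset.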

root@c7550d6b2c59:/workdir/models/research#  bash examples/SSD320_evaluate.sh /checkpoints

root@c7550d6b2c59:/workdir/models/research#  cat examples/SSD320_evaluate.sh
CHECKPINT_DIR=$1

TENSOR_OPS=0
export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
export TF_ENABLE_CUDNN_RNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}

python object_detection/model_main.py --checkpoint_dir $CHECKPINT_DIR --model_dir /results --run_once --pipeline_config_path configs/ssd320_full_1gpus.config

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md

2.5 Checking the Final Checkpoint Files

Training with the NVIDIA Docker image ultimately produces checkpoint files. The PB file has to be created separately, and inference should then be tried on top of it.
Currently the checkpoints consist of model.ckpt-0 and model.ckpt-100000, as shown below.

root@c7550d6b2c59:/workdir/models/research# ls  /checkpoints/
checkpoint                                   model.ckpt-0.data-00000-of-00002  model.ckpt-100000.data-00000-of-00002  resnet_v1_50
eval                                         model.ckpt-0.data-00001-of-00002  model.ckpt-100000.data-00001-of-00002  resnet_v1_50_2016_08_28.tar.gz
events.out.tfevents.1572262719.c7550d6b2c59  model.ckpt-0.index                model.ckpt-100000.index                
graph.pbtxt                                  model.ckpt-0.meta                 model.ckpt-100000.meta

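Basic layout

model.ckpt-0: created as soon as training starts (step 0)
model.ckpt-100000: matches the number of steps in the pipeline config; the number keeps increasing if training is resumed from here
graph.pbtxt: describes the network structure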
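
Tools that take a checkpoint expect the prefix (for example `model.ckpt-100000`) rather than one of the individual files. As a small sketch, the hypothetical helper below derives the prefix with the highest step from a directory listing like the one above, using only the `.index` file names.

```python
import re

# Matches e.g. "model.ckpt-100000.index" and captures prefix and step.
CKPT_INDEX = re.compile(r"^(model\.ckpt-(\d+))\.index$")


def latest_checkpoint_prefix(filenames):
    """Return the prefix with the highest step, or None if there is none."""
    best = None
    for name in filenames:
        match = CKPT_INDEX.match(name)
        if match and (best is None or int(match.group(2)) > best[0]):
            best = (int(match.group(2)), match.group(1))
    return best[1] if best else None


listing = ["checkpoint", "graph.pbtxt", "model.ckpt-0.index", "model.ckpt-0.meta",
           "model.ckpt-100000.index", "model.ckpt-100000.meta", "resnet_v1_50"]
print(latest_checkpoint_prefix(listing))
```

Inside Tensorflow, `tf.train.latest_checkpoint(model_dir)` serves the same purpose by reading the `checkpoint` file.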

Background reading on checkpoints
  https://eehoeskrap.tistory.com/343
  https://eehoeskrap.tistory.com/370
  https://eehoeskrap.tistory.com/344
  https://gusrb.tistory.com/21
  http://jaynewho.com/post/8

2.6 Converting the Checkpoint to PB Format

Convert the checkpoint to a PB file as shown below for inference.

How to use export_inference_graph.py

https://github.com/tensorflow/models/blob/master/research/object_detection/export_inference_graph.py

input_type

image_tensor
encoded_image_string_tensor
tf_example

TFRecord and TF Example
https://www.tensorflow.org/tutorials/load_data/tfrecord

root@c7550d6b2c59:/workdir/models/research# python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path configs/ssd320_full_1gpus.config \
    --trained_checkpoint_prefix  /checkpoints/model.ckpt-100000 \
    --output_directory /checkpoints/inference_graph_100000

root@c7550d6b2c59:/workdir/models/research# ls /checkpoints/inference_graph_100000/
checkpoint  frozen_inference_graph.pb  model.ckpt.data-00000-of-00001  model.ckpt.index  model.ckpt.meta  pipeline.config  saved_model

//Check the newly created PB file 
root@c7550d6b2c59:/workdir/models/research# ls /checkpoints/inference_graph_100000/saved_model/
saved_model.pb  variables

Exporting a trained model for inference
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md

2.7 Testing with Tensorboard

Checking the checkpoint layout on the host

$ cd ~/works/ssd/check  //inspect the checkpoint layout on the host
$ tree
.
├── checkpoint  //records model.ckpt-100000 and the checkpoint paths
├── eval        //Tensorboard logs written during training; logs from the validation/evaluation run are in /results/eval
│   └── events.out.tfevents.1572359812.c7550d6b2c59   // Tensorboard log file
├── events.out.tfevents.1572262719.c7550d6b2c59       // Tensorboard log file
├── graph.pbtxt                      // network structure information
├── inference_graph_100000           // PB files generated from model.ckpt-100000
│   ├── checkpoint
│   ├── frozen_inference_graph.pb
│   ├── model.ckpt.data-00000-of-00001
│   ├── model.ckpt.index
│   ├── model.ckpt.meta
│   ├── pipeline.config
│   └── saved_model
│       ├── saved_model.pb
│       └── variables
├── model.ckpt-0.data-00000-of-00002        //checkpoint-0
├── model.ckpt-0.data-00001-of-00002
├── model.ckpt-0.index
├── model.ckpt-0.meta
├── model.ckpt-100000.data-00000-of-00002   //checkpoint-100000
├── model.ckpt-100000.data-00001-of-00002
├── model.ckpt-100000.index
├── model.ckpt-100000.meta
├── resnet_v1_50                        //Pre-trained model (used as the SSD feature extractor)
│   └── model.ckpt
├── resnet_v1_50_2016_08_28.tar.gz    //Pre-trained Model
├── resnet_v1_50_2016_08_28.tar.gz.1
└── resnet_v1_50_2016_08_28.tar.gz.2

Pre-trained Models

SSD uses a ResNet model as the feature_extractor (fine tuning / transfer learning)
resnet_v1_50_2016_08_28.tar.gz
https://github.com/tensorflow/models/tree/master/research/slim

Running Tensorboard in the Docker container

The checkpoint directory above contains the Tensorboard logs, so they can be analyzed.

root@c7550d6b2c59:/workdir/models/research# tensorboard --logdir=/checkpoints

Connecting to Tensorboard from a browser

http://localhost:6006/
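
Tensorboard reads the `events.out.tfevents.<timestamp>.<hostname>` files under the log directory. As a small sketch, the hypothetical helper below splits such a name into its Unix timestamp and host name, which helps to tell the training log from the later evaluation log.

```python
import re
from datetime import datetime, timezone

EVENT_FILE = re.compile(r"^events\.out\.tfevents\.(\d+)\.(.+)$")


def parse_event_filename(name):
    """Split an event file name into (unix_timestamp, hostname), or None."""
    match = EVENT_FILE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# The two event files that appear in the checkpoint directory of this post.
for name in ("events.out.tfevents.1572262719.c7550d6b2c59",
             "events.out.tfevents.1572359812.c7550d6b2c59"):
    timestamp, host = parse_event_filename(name)
    created = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    print(host, created.isoformat())
```

The host part is the container ID here because the logs were written inside the Docker container.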

Tensorboard -> Scalars

Source code related to the Images tab

root@c7550d6b2c59:/workdir/models/research# vi ./object_detection/utils/visualization_utils.py 
........

def draw_side_by_side_evaluation_image(eval_dict,
                                       category_index,
                                       max_boxes_to_draw=20,
                                       min_score_thresh=0.2,
                                       use_normalized_coordinates=True):
  """Creates a side-by-side image with detections and groundtruth.

  Bounding boxes (and instance masks, if available) are visualized on both
  subimages.

  Args:
    eval_dict: The evaluation dictionary returned by
      eval_util.result_dict_for_batched_example() or
      eval_util.result_dict_for_single_example().
    category_index: A category index (dictionary) produced from a labelmap.
    max_boxes_to_draw: The maximum number of boxes to draw for detections.
    min_score_thresh: The minimum score threshold for showing detections.
    use_normalized_coordinates: Whether to assume boxes and kepoints are in
      normalized coordinates (as opposed to absolute coordiantes).
      Default is True.

  Returns:
    A list of [1, H, 2 * W, C] uint8 tensor. The subimage on the left
      corresponds to detections, while the subimage on the right corresponds to
      groundtruth.
  """
........


class EvalMetricOpsVisualization(object):
....
  def get_estimator_eval_metric_ops(self, eval_dict):  ## called from model_lib.py (shown below)

    if self._max_examples_to_draw == 0:
      return {}
    images = self.images_from_evaluation_dict(eval_dict)

    def get_images():
      """Returns a list of images, padded to self._max_images_to_draw."""
      images = self._images
      while len(images) < self._max_examples_to_draw:
        images.append(np.array(0, dtype=np.uint8))
      self.clear()
      return images

    def image_summary_or_default_string(summary_name, image): ## the image summary is created here
      """Returns image summaries for non-padded elements."""
      return tf.cond(
          tf.equal(tf.size(tf.shape(image)), 4),
          lambda: tf.summary.image(summary_name, image),    ## Tensorboard image
          lambda: tf.constant(''))

    update_op = tf.py_func(self.add_images, [[images[0]]], [])
    image_tensors = tf.py_func(
        get_images, [], [tf.uint8] * self._max_examples_to_draw)
    eval_metric_ops = {}
    for i, image in enumerate(image_tensors):
      summary_name = self._summary_name_prefix + '/' + str(i)
      value_op = image_summary_or_default_string(summary_name, image)   ## creates the Tensorboard image
      eval_metric_ops[summary_name] = (value_op, update_op)
    return eval_metric_ops

.....

class VisualizeSingleFrameDetections(EvalMetricOpsVisualization): ## VisualizeSingleFrameDetections is a subclass of EvalMetricOpsVisualization
  """Class responsible for single-frame object detection visualizations."""

  def __init__(self,
               category_index,
               max_examples_to_draw=5,
               max_boxes_to_draw=20,
               min_score_thresh=0.2,
               use_normalized_coordinates=True,
               summary_name_prefix='Detections_Left_Groundtruth_Right'):
    super(VisualizeSingleFrameDetections, self).__init__(
        category_index=category_index,
        max_examples_to_draw=max_examples_to_draw,
        max_boxes_to_draw=max_boxes_to_draw,
        min_score_thresh=min_score_thresh,
        use_normalized_coordinates=use_normalized_coordinates,
        summary_name_prefix=summary_name_prefix)

  def images_from_evaluation_dict(self, eval_dict):
    return draw_side_by_side_evaluation_image(
        eval_dict, self._category_index, self._max_boxes_to_draw,
        self._min_score_thresh, self._use_normalized_coordinates)

...........

root@c7550d6b2c59:/workdir/models/research# vi ./object_detection/model_lib.py
....
    if mode == tf.estimator.ModeKeys.EVAL:  ## EVAL Mode 
.........
      eval_dict = eval_util.result_dict_for_batched_example(          ## image information
          eval_images,
          features[inputs.HASH_KEY],
          detections,
          groundtruth,
          class_agnostic=class_agnostic,
          scale_to_absolute=True,
          original_image_spatial_shapes=original_image_spatial_shapes,
          true_image_shapes=true_image_shapes)

      if class_agnostic:
        category_index = label_map_util.create_class_agnostic_category_index()
      else:
        category_index = label_map_util.create_category_index_from_labelmap(
            eval_input_config.label_map_path)
      vis_metric_ops = None
      if not use_tpu and use_original_images:
        eval_metric_op_vis = vis_utils.VisualizeSingleFrameDetections(   
            category_index,
            max_examples_to_draw=eval_config.num_visualizations,
            max_boxes_to_draw=eval_config.max_num_boxes_to_visualize,
            min_score_thresh=eval_config.min_score_threshold,
            use_normalized_coordinates=False)
        vis_metric_ops = eval_metric_op_vis.get_estimator_eval_metric_ops(     ## the images are saved here; see above
            eval_dict)
....

Related notes
tf.estimator.ModeKeys.TRAIN
tf.estimator.ModeKeys.EVAL
tf.estimator.ModeKeys.PREDICT

The following site explains this very well
https://bcho.tistory.com/1196
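
The visualization code above is driven by `max_boxes_to_draw` and `min_score_thresh`. The sketch below illustrates how the two parameters interact: only the top-scoring detections are considered, and of those only the ones above the score threshold are drawn. The `(score, label)` tuple layout and the function name are simplifications for illustration, not the API's `eval_dict` format.

```python
def select_boxes_to_draw(detections, max_boxes_to_draw=20, min_score_thresh=0.2):
    """Pick the detections that would be visualized.

    detections: list of (score, label) tuples (illustrative layout only).
    """
    # Consider at most max_boxes_to_draw detections, highest score first.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    # Of those, keep only the ones scoring above the threshold.
    return [d for d in ranked[:max_boxes_to_draw] if d[0] > min_score_thresh]


# Made-up detections for illustration.
detections = [(0.9, "dog"), (0.15, "cat"), (0.5, "person")]
print(select_boxes_to_draw(detections))
print(select_boxes_to_draw(detections, max_boxes_to_draw=1))
```

With the defaults shown in the source (20 boxes, threshold 0.2), low-confidence detections never reach the Images tab even though they are still part of the evaluation.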

Tensorboard -> Images

https://www.tensorflow.org/tensorboard/image_summaries

Tensorboard -> Graphs

The graph view is easy to follow, and the visualization is very well done.

Tensorboard background and practical usage
https://itnext.io/how-to-use-tensorboard-5d82f8654496
https://pythonkim.tistory.com/39

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

2.8 Preparing for Object Detection

Run Jupyter in the terminal of the Docker container started above and proceed with the Jupyter test.

Preparing test images

root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/test2017/000000000001.jpg object_detection/test_images/image1.jpg
root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/test2017/000000517810.jpg object_detection/test_images/image2.jpg

or

root@5208474af96a:/workdir/models/research# cat object_detection/test_images/image_info.txt  //download image1.jpg and image2.jpg from the sites below, then copy them

Image provenance:
image1.jpg: https://commons.wikimedia.org/wiki/File:Baegle_dwa.jpg
image2.jpg: Michael Miley,
  https://www.flickr.com/photos/mike_miley/4678754542/in/photolist-88rQHL-88oBVp-88oC2B-88rS6J-88rSqm-88oBLv-88oBC4

root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/image1.jpg object_detection/test_images/image1.jpg   //downloaded from the sites listed in image_info.txt
root@c7550d6b2c59:/workdir/models/research# cp /data/coco2017_tfrecords/raw-data/image2.jpg object_detection/test_images/image2.jpg

Running object_detection/object_detection_tutorial.ipynb with Jupyter

root@c7550d6b2c59:/workdir/models/research# jupyter notebook   // fails with an error
root@c7550d6b2c59:/workdir/models/research# jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root

Tensorflow Jupyter Notebook
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_notebook.md

Check in the browser after starting Jupyter notebook

http://localhost:8888/

Error when starting Jupyter notebook

Resolved by passing the options shown above
https://github.com/kaczmarj/neurodocker/issues/82
http://melonicedlatte.com/web/2018/05/22/134429.html

2.9 Checking Basic Object Detection

Open a separate Docker terminal

Locate each file and identify which ones are needed

$ docker exec -it nvidia_ssd /bin/bash  // Jupyter is already running in the first terminal, so use a separate one

root@5208474af96a:/workdir/models/research# python object_detection/model_main.py --help

root@5208474af96a:/workdir/models/research# ls object_detection/object_detection_tutorial.ipynb  // tested with Jupyter
object_detection/object_detection_tutorial.ipynb

root@5208474af96a:/workdir/models/research# ls object_detection/ssd_mobilenet_v1_coco_2017_11_17  // model used by the notebook above
frozen_inference_graph.pb

root@5208474af96a:/workdir/models/research# ls object_detection/data                              // pbtxt files used by the notebook above
ava_label_map_v2.1.pbtxt           mscoco_complete_label_map.pbtxt     oid_object_detection_challenge_500_label_map.pbtxt
face_label_map.pbtxt               mscoco_label_map.pbtxt              pascal_label_map.pbtxt
fgvc_2854_classes_label_map.pbtxt  mscoco_minival_ids.txt              pet_label_map.pbtxt
kitti_label_map.pbtxt              oid_bbox_trainable_label_map.pbtxt

object_detection_tutorial.ipynb

No separate training is needed: as the notebook source shows, it downloads the model itself, so only the test directory has to be set.
Model used: ssd_mobilenet_v1_coco_2017_11_17.tar.gz
https://medium.com/@yuu.ishikawa/how-to-show-signatures-of-tensorflow-saved-model-5ac56cf1960f
http://solarisailab.com/archives/2387

In short, the notebook uses the model above (PB file) and the pbtxt to run detection on the images in test_images.

Problems with object_detection_tutorial.ipynb

The test ends with a cuDNN error. The cause is GPU memory, so add the following code (Tensorflow 1.14.0 in the Docker image).

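import tensorflow as tf  # Tensorflow 1.x API
# Allocate GPU memory on demand instead of reserving it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)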

Error shown in the Jupyter console

E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

GPU memory issues
  https://github.com/tensorflow/tensorflow/issues/24828
  https://lsjsj92.tistory.com/363
  https://devtalk.nvidia.com/default/topic/1051380/cudnn/could-not-create-cudnn-handle-cudnn_status_internal_error/

GPU memory shortage in Tensorflow 2.0
  https://inpages.tistory.com/155

failed to allocate 2.62G (2811428864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

https://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow
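
The error message above reports the failed allocation both in bytes and as "2.62G". A quick arithmetic check shows that the two figures agree when G is read as GiB (2**30 bytes):

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / float(2 ** 30)


requested = 2811428864  # byte count from the error message above
print(f"{bytes_to_gib(requested):.2f} GiB")
```

Comparing this figure with the free memory reported by nvidia-smi shows how far the request is from fitting.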

Checking NVIDIA GPU memory usage

$ watch -n 0.1 nvidia-smi
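
For logging memory usage from a script instead of watching the terminal, nvidia-smi can emit CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch with hypothetical function names. The sample string is made up to illustrate the parsing and is not a real measurement.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse lines like '3012, 4096' (MiB) into (used, total) int tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return [(used_mib, total_mib), ...], one per GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_memory_csv(output)


# Parsing demo on a made-up sample; call query_gpu_memory() on a GPU machine.
sample = "3012, 4096\n"
for used, total in parse_memory_csv(sample):
    print(f"{used}/{total} MiB ({100.0 * used / total:.1f}% used)")
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on machines without a GPU.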

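3. Current Status

On my laptop, the inference result of the example still cannot be seen even with the code above added, but it works well on another, more powerful server.
It is a pity, and I keenly feel the limits of my laptop (especially GPU RAM).

These are the sites I referred to; there were too many to describe, so only the links are listed.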

Object Detection installation and test
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md

Collecting training material
  https://www.slideshare.net/fermat39/polyp-detection-withtensorflowobjectdetectionapi
  https://www.kdnuggets.com/2019/03/object-detection-luminoth.html

Tensorflow training and usage
  https://yongyong-e.tistory.com/24
  http://solarisailab.com/archives/2422
  https://hwauni.tistory.com/entry/API-Object-Detection-API%EB%A5%BC-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%98%A4%EB%B8%8C%EC%A0%9D%ED%8A%B8-%EC%9D%B8%EC%8B%9D%ED%95%98%EA%B8%B0-Part-1-%EC%84%A4%EC%A0%95-%ED%8E%8C
  https://cloud.google.com/solutions/creating-object-detection-application-tensorflow?hl=ko

Tensorflow Object Detection (to be split into a separate post later)
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API/
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API-Installation/
  https://you359.github.io/tensorflow%20models/Tensorflow-Object-Detection-API-Training/

Tensorflow Object Detection related material
  https://yongyong-e.tistory.com/31?category=836820
  https://yongyong-e.tistory.com/32?category=836820
  https://yongyong-e.tistory.com/35?category=836820 **
  https://towardsdatascience.com/creating-your-own-object-detector-ad69dda69c85 **
  https://gilberttanner.com/blog/live-object-detection

Tensorflow Object Detection API Training
  https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
  https://towardsdatascience.com/custom-object-detection-using-tensorflow-from-scratch-e61da2e10087
  https://becominghuman.ai/tensorflow-object-detection-api-tutorial-training-and-evaluating-custom-object-detector-ed2594afcf73
  https://medium.com/pylessons/tensorflow-step-by-step-custom-object-detection-tutorial-d7ae840a74e2

Tensorflow Object Detection API
  https://github.com/tensorflow/models/tree/master/research/object_detection
  https://github.com/tensorflow/models/tree/master/research/object_detection/g3doc
  https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md

Notes on confusing parts of the shell scripts

As always, the shell scripts in open-source projects are well written but change often, which causes a lot of confusion.

${1:-none}
https://stackoverflow.com/questions/38260927/what-does-this-line-build-target-1-none-means-in-shell-scripting

${@:2}
https://unix.stackexchange.com/questions/92978/what-does-this-2-mean-in-shell-scripting
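
The two expansions are easier to remember next to an equivalent in Python. The sketch below mimics `${N:-default}` (use the N-th argument unless it is unset or empty) and `"${@:N}"` (all arguments from the N-th onward) on a plain list. The helper names are hypothetical, and the default values are the ones from SSD320_FP16_1GPU.sh.

```python
def default_arg(args, position, default):
    """Mimic ${N:-default}: the N-th argument (1-based) unless unset or empty."""
    if position <= len(args) and args[position - 1] != "":
        return args[position - 1]
    return default


def args_from(args, position):
    """Mimic "${@:N}": all positional arguments starting at the N-th (1-based)."""
    return args[position - 1:]


# Equivalent of: bash ./examples/SSD320_FP16_1GPU.sh /checkpoints
args = ["/checkpoints"]
ckpt_dir = default_arg(args, 1, "/results/SSD320_FP16_1GPU")
config_dir = default_arg(args, 2, "/workdir/models/research/configs")
extra = args_from(args, 3)
print(ckpt_dir, config_dir, extra)
```

With a single argument, the checkpoint directory comes from the command line, the config directory falls back to its default, and nothing extra is forwarded to model_main.py.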
