YOLOv3源码阅读：data_utils.py

2019-05-22 2019-09-03

目标检测

17 minutes read (About 2581 words) 0 visits

一、YOLO简介

YOLO（You Only Look Once）是一个高效的目标检测算法，属于One-Stage大家族，针对于Two-Stage目标检测算法普遍存在的运算速度慢的缺点，YOLO创造性的提出了One-Stage。也就是将物体分类和物体定位在一个步骤中完成。YOLO直接在输出层回归bounding box的位置和bounding box所属类别，从而实现one-stage。

经过两次迭代，YOLO目前的最新版本为YOLOv3，在前两版的基础上，YOLOv3进行了一些比较细节的改动，效果有所提升。

本文正是希望可以将源码加以注释，方便自己学习，同时也愿意分享出来和大家一起学习。由于本人还是一学生，如果有错还请大家不吝指出。

本文参考的源码地址为：https://github.com/wizyoung/YOLOv3_TensorFlow

二、代码和注释

文件目录：YOUR_PATH\YOLOv3_TensorFlow-master_utils.py

这一部分代码主要是准备训练用的数据。算得上是YOLO模型中另一个十分重要的部分。

# coding: utf-8

from __future__ import division, print_function

import numpy as np
import cv2
import sys
from utils.data_aug import *
import random

PY_VERSION = sys.version_info[0]
iter_cnt = 0


def parse_line(line):
    '''
    Given a line from the training/test txt file, return parsed info.
    return:
        line_idx: int64
        pic_path: string.
        boxes: shape [N, 4], N is the ground truth count, elements in the second
            dimension are [x_min, y_min, x_max, y_max]
        labels: shape [N]. class index.
    '''
    """
    这一部分代码的主要功能是对给定的数据字符串进行处理，提取出其中的有效信息，包括如下：
    1. 图片索引
    2. 图片路径
    3. 每一个目标框的坐标，
    4. 每一个目标框的label
    """
    if 'str' not in str(type(line)):
        line = line.decode()
    # 按照空格划分数据
    s = line.strip().split(' ')

    # 第一个数据是图片索引
    line_idx = int(s[0])

    # 第二个数据是图片的路径
    pic_path = s[1]

    # 去除掉前两个数据之后，剩下的就和目标框有关系了
    s = s[2:]

    # 每一个目标框都包含五个数据，4个坐标信息和1个label信息，因此数据总数除以5之后就是目标框的总数目
    box_cnt = len(s) // 5

    # 存储数据的list
    boxes = []
    labels = []

    # 对每一个目标框
    for i in range(box_cnt):
        # 提取出label以及四个坐标数据
        label, x_min, y_min, x_max, y_max = int(s[i * 5]), float(s[i * 5 + 1]), float(s[i * 5 + 2]), float(
            s[i * 5 + 3]), float(s[i * 5 + 4])
        boxes.append([x_min, y_min, x_max, y_max])
        labels.append(label)

    # numpy处理一下
    boxes = np.asarray(boxes, np.float32)
    labels = np.asarray(labels, np.int64)

    # 返回
    return line_idx, pic_path, boxes, labels


def process_box(boxes, labels, img_size, class_num, anchors):
    '''
    Generate the y_true label, i.e. the ground truth feature_maps in 3 different scales.
    params:
        boxes: [N, 5] shape, float32 dtype. `x_min, y_min, x_max, y_mix, mixup_weight`.
        labels: [N] shape, int64 dtype.
        class_num: int64 num.
        anchors: [9, 4] shape, float32 dtype.
    '''
    """
    这一部分是数据预处理中最重要的一部分,因为这里才是生成最后的y true的地方
    """
    # anchor的编号,分别对应于每一个不同尺寸的特征图,
    # 大尺寸的特征图对应的anchor是6,7,8,中尺寸的特征图对应的是3,4,5,小尺寸的对应的是0,1,2
    anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]

    # convert boxes form:
    # shape: [N, 2]
    # (x_center, y_center)
    # 计算目标框的中心坐标
    box_centers = (boxes[:, 0:2] + boxes[:, 2:4]) / 2
    # (width, height)
    # 计算目标框的大小
    box_sizes = boxes[:, 2:4] - boxes[:, 0:2]

    # [13, 13, 3, 5+num_class+1] `5` means coords and labels. `1` means mix up weight.
    # 储存数据的矩阵,初始全部数据都是0,分别对应的是三个不同尺寸的特征图
    # 矩阵的shape: [height, width , 3, 4 + 1 + num_class + 1],
    # 最后一维的第一个1表示的是前景后景的标志位,最后一个1表示的是mix up的权重.
    y_true_13 = np.zeros((img_size[1] // 32, img_size[0] // 32, 3, 6 + class_num), np.float32)
    y_true_26 = np.zeros((img_size[1] // 16, img_size[0] // 16, 3, 6 + class_num), np.float32)
    y_true_52 = np.zeros((img_size[1] // 8, img_size[0] // 8, 3, 6 + class_num), np.float32)

    # mix up weight default to 1.
    # mix up的权重默认值设置为1
    y_true_13[..., -1] = 1.
    y_true_26[..., -1] = 1.
    y_true_52[..., -1] = 1.

    # 将他们放在一起,可以统一操作
    y_true = [y_true_13, y_true_26, y_true_52]

    # [N, 1, 2]
    # 扩展一维,shape: [N, 1, 2]
    # 需要注意的是,这里的N表示的是目标框的数目,而不是样本的数据.
    box_sizes = np.expand_dims(box_sizes, 1)

    # broadcast tricks
    # [N, 1, 2] & [9, 2] ==> [N, 9, 2]
    # 使用numpy的广播机制,很容易计算出目标框和anchor之间的交集部分.
    mins = np.maximum(- box_sizes / 2, - anchors / 2)
    maxs = np.minimum(box_sizes / 2, anchors / 2)

    # [N, 9, 2]
    whs = maxs - mins

    # [N, 9]
    # 计算每一个目标框和每一个anchor的iou值
    iou = (whs[:, :, 0] * whs[:, :, 1]) / (
                box_sizes[:, :, 0] * box_sizes[:, :, 1] + anchors[:, 0] * anchors[:, 1] - whs[:, :, 0] * whs[:, :,
                                                                                                         1] + 1e-10)
    # [N]
    # 计算每一个目标框和某一个anchor之间的最佳iou值,并返回最佳iou值对应的下标索引
    best_match_idx = np.argmax(iou, axis=1)

    # 这个字典是为了后续的计算方便才定义的
    ratio_dict = {1.: 8., 2.: 16., 3.: 32.}

    for i, idx in enumerate(best_match_idx):

        # idx: 0,1,2 ==> 2; 3,4,5 ==> 1; 6,7,8 ==> 0
        # 根据上面的下标索引,下面的代码可以计算出该目标框应该对应与哪一张特征图.
        # 因为不同的anchor对应于不同尺寸的特征图,
        # 所以如果一个目标框和其中一个anchor具有最大的iou,那么我们应该将该目标框和这个anchor对应的特征图联系起来.
        feature_map_group = 2 - idx // 3

        # scale ratio: 0,1,2 ==> 8; 3,4,5 ==> 16; 6,7,8 ==> 32
        # 这里就是利用了前面定义的字典,方便的获取缩放倍数
        ratio = ratio_dict[np.ceil((idx + 1) / 3.)]

        # 计算目标框的中心.这里是指缩放之后的中心
        x = int(np.floor(box_centers[i, 0] / ratio))
        y = int(np.floor(box_centers[i, 1] / ratio))

        # 根据特征图的编号,获取anchor的下标索引
        k = anchors_mask[feature_map_group].index(idx)

        # 类别标记
        c = labels[i]
        # print(feature_map_group, '|', y,x,k,c)

        # 分别将数据添加到合适的位置,其中需要注意的是k的使用,它表明的是目标框对应的是哪个anchor
        # 目标框的中心
        y_true[feature_map_group][y, x, k, :2] = box_centers[i]
        # 目标框的尺寸
        y_true[feature_map_group][y, x, k, 2:4] = box_sizes[i]
        # 前景标记
        y_true[feature_map_group][y, x, k, 4] = 1.
        # 类别标记
        y_true[feature_map_group][y, x, k, 5 + c] = 1.
        # mix up权重
        y_true[feature_map_group][y, x, k, -1] = boxes[i, -1]

    # 当我们处理好所有的目标框之后就返回
    return y_true_13, y_true_26, y_true_52


def parse_data(line, class_num, img_size, anchors, mode):
    '''
    param:
        line: a line from the training/test txt file
        class_num: totol class nums.
        img_size: the size of image to be resized to. [width, height] format.
        anchors: anchors.
        mode: 'train' or 'val'. When set to 'train', data_augmentation will be applied.
    '''

    # 如果line不是一个list，说明这里的line是一个str
    if not isinstance(line, list):
        # 直接处理即可，返回图片索引，图片路径，以及gt目标框的坐标和对应的labels
        img_idx, pic_path, boxes, labels = parse_line(line)

        # 根据图片路径读取图片
        img = cv2.imread(pic_path)

        # expand the 2nd dimension, mix up weight default to 1.
        # 扩展矩阵的维度，这里主要是在每一行的末尾添加一个表示mix up权重的信息，此处默认设置为1
        boxes = np.concatenate((boxes, np.full(shape=(boxes.shape[0], 1), fill_value=1., dtype=np.float32)), axis=-1)

    else:
        # the mix up case
        # 如果line表示的是一个list，说明需要使用mix up策略

        # 处理第一张图片
        _, pic_path1, boxes1, labels1 = parse_line(line[0])
        # 读取第一张图片
        img1 = cv2.imread(pic_path1)
        # 处理第二张图片
        img_idx, pic_path2, boxes2, labels2 = parse_line(line[1])
        # 读取第二张图片
        img2 = cv2.imread(pic_path2)

        # 将他们混合在一起
        img, boxes = mix_up(img1, img2, boxes1, boxes2)

        labels = np.concatenate((labels1, labels2))

    # 如果是训练阶段，则会做一些数据增强的操作，如随机颜色抖动，随机裁剪，随机翻转等操作
    if mode == 'train':
        # random color jittering
        # NOTE: applying color distort may lead to bad performance sometimes
        # img = random_color_distort(img)

        # random expansion with prob 0.5
        if np.random.uniform(0, 1) > 0.5:
            img, boxes = random_expand(img, boxes, 2)

        # random cropping
        h, w, _ = img.shape
        boxes, crop = random_crop_with_constraints(boxes, (w, h))
        x0, y0, w, h = crop
        img = img[y0: y0+h, x0: x0+w]

        # resize with random interpolation
        h, w, _ = img.shape
        interp = np.random.randint(0, 5)
        img, boxes = resize_with_bbox(img, boxes, img_size[0], img_size[1], interp)

        # random horizontal flip
        h, w, _ = img.shape
        img, boxes = random_flip(img, boxes, px=0.5)
    else:
        img, boxes = resize_with_bbox(img, boxes, img_size[0], img_size[1], interp=1)

    # 将颜色的通道顺序进行更改
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)

    # 规范化数据至0~1
    # the input of yolo_v3 should be in range 0~1
    img = img / 255.

    # 将给出的gt 目标框进行处理，返回对应的gt矩阵，用以后面的损失计算。
    y_true_13, y_true_26, y_true_52 = process_box(boxes, labels, img_size, class_num, anchors)

    # 返回
    return img_idx, img, y_true_13, y_true_26, y_true_52


def get_batch_data(batch_line, class_num, img_size, anchors, mode, multi_scale=False, mix_up=False, interval=10):
    '''
    generate a batch of imgs and labels
    param:
        batch_line: a batch of lines from train/val.txt files
        class_num: num of total classes.
        img_size: the image size to be resized to. format: [width, height].
        anchors: anchors. shape: [9, 2].
        mode: 'train' or 'val'. if set to 'train', data augmentation will be applied.
        multi_scale: whether to use multi_scale training, img_size varies from [320, 320] to [640, 640] by default. Note that it will take effect only when mode is set to 'train'.
        interval: change the scale of image every interval batches. Note that it's indeterministic because of the multi threading.
    '''


    # 全局的计数器
    global iter_cnt

    # multi_scale training
    # 是否使用多种尺寸进行训练， 默认是False
    if multi_scale and mode == 'train':
        # 设置随机数种子
        random.seed(iter_cnt // interval)
        # 设定选择范围，并随机采样
        random_img_size = [[x * 32, x * 32] for x in range(10, 20)]
        img_size = random.sample(random_img_size, 1)[0]

    # 计数器加1
    iter_cnt += 1

    # 用以保存数据的list
    img_idx_batch, img_batch, y_true_13_batch, y_true_26_batch, y_true_52_batch = [], [], [], [], []

    # mix up strategy
    # 是否使用mix up策略，默认是False
    if mix_up and mode == 'train':
        mix_lines = []
        batch_line = batch_line.tolist()
        for idx, line in enumerate(batch_line):
            if np.random.uniform(0, 1) < 0.5:
                mix_lines.append([line, random.sample(batch_line[:idx] + batch_line[idx+1:], 1)[0]])
            else:
                mix_lines.append(line)
        batch_line = mix_lines

    # 对一个batch中的数据，这里的line一般指的是一行文本数据
    for line in batch_line:
        # 处里数据中的信息，主要是数据索引（一般用不上）图片的像素矩阵，不同特征图所对应的gt信息。
        img_idx, img, y_true_13, y_true_26, y_true_52 = parse_data(line, class_num, img_size, anchors, mode)

        # 附加到这些list的末尾
        img_idx_batch.append(img_idx)
        img_batch.append(img)
        y_true_13_batch.append(y_true_13)
        y_true_26_batch.append(y_true_26)
        y_true_52_batch.append(y_true_52)

    # 使用numpy处理一下
    img_idx_batch, img_batch, y_true_13_batch, y_true_26_batch, y_true_52_batch = np.asarray(img_idx_batch, np.int64), np.asarray(img_batch), np.asarray(y_true_13_batch), np.asarray(y_true_26_batch), np.asarray(y_true_52_batch)

    # 返回
    return img_idx_batch, img_batch, y_true_13_batch, y_true_26_batch, y_true_52_batch