@zhenni94
project homepage: http://www.stat.ucla.edu/~xianjie.chen/projects/pose_estimation/pose_estimation.html
Full score function:
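As in the Chen & Yuille (NIPS 2014) paper that this code implements (restated here; the notation may differ slightly from the paper's exact equations), the full score of a pose l with pairwise types t is

$$F(l, t \mid I) = \sum_{i \in \mathcal{V}} U(l_i \mid I) + \sum_{(i,j) \in \mathcal{E}} R(l_i, l_j, t_{ij}, t_{ji} \mid I)$$

with the image-dependent pairwise relation (IDPR) term

$$R(l_i, l_j, t_{ij}, t_{ji} \mid I) = \langle \mathbf{w}_{ij}^{t_{ij}}, \psi(l_j - l_i - r_{ij}^{t_{ij}}) \rangle + w_{ij}\, \varphi(t_{ij} \mid I(l_i); \theta) + \langle \mathbf{w}_{ji}^{t_{ji}}, \psi(l_i - l_j - r_{ji}^{t_{ji}}) \rangle + w_{ji}\, \varphi(t_{ji} \mid I(l_j); \theta)$$

where $\psi(\Delta) = [\Delta x \;\, \Delta x^2 \;\, \Delta y \;\, \Delta y^2]^{\top}$ is the quadratic deformation feature, $r_{ij}^{t_{ij}}$ is the mean relative position of cluster $t_{ij}$, and $t_{ij} \in \{1, \dots, T_{ij}\}$ is the pairwise type (the cluster index produced by learn_clusters below).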
Unary Term
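The unary term scores each part location with the DCNN appearance evidence:

$$U(l_i \mid I) = w_i\, \varphi(i \mid I(l_i); \theta)$$

where $\varphi(i \mid I(l_i); \theta)$ is the appearance score (log-probability) that the DCNN with parameters $\theta$ assigns to part $i$ on the image patch $I(l_i)$, and $w_i$ is its scalar weight; the per-part maps unary_map computed by imCNNdet below carry this appearance evidence.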
conf is a structure holding the given global configuration; conf.pa is the index of the parent of each joint, and p_no is the number of parts (joints).
The main part of the demo script is shown below:
% read data
[pos_train, pos_val, pos_test, neg_train, neg_val, tsize] = LSP_data();
% train dcnn
train_dcnn(pos_train, pos_val, neg_train, tsize, caffe_solver_file);
% train graphical model
model = train_model(note, pos_val, neg_val, tsize);
% testing
boxes = test_model([note,'_LSP'], model, pos_test);
% ...
% evaluation
show_eval(pos_test, ests, conf, eval_method);
LSP_data.m
Some variables and constants:
trainval_frs_pos = 1:1000;    % training frames for positives
test_frs_pos = 1001:2000;     % testing frames for positives
trainval_frs_neg = 615:1832;  % training frames for negatives (1218 frames)
frs_pos = cat(2, trainval_frs_pos, test_frs_pos);  % all frames for positives
all_pos  % num(pos)-by-1 struct array for positives
         % struct fields: im, joints, r_degree, isflip
neg      % num(neg)-by-1 struct array for negatives
pos_trainval = all_pos(1 : numel(trainval_frs_pos));  % training and validation positives
pos_test = all_pos(numel(trainval_frs_pos)+1 : end);  % testing positives
Data preparation:
- lsp_pc2oc (function joints = lsp_pc2oc(joints)) : convert the joint annotations from person-centric to observer-centric.
- pos_trainval(ii).joints = Trans * pos_trainval(ii).joints; : create the ground-truth joints for model training by augmenting the original 14 joint positions with midpoints between joints, giving 26 joints in total.
- add_flip : flip the trainval images horizontally (#pos_trainval *= 2).
- init_scale : initialize dataset-specific parameters.
- add_rotate : rotate the trainval images by a set of fixed angles.
- val_id = randperm(numel(pos_trainval), 2000); : split the positive data into training and validation sets (randomly choose 2000 images from pos_trainval as the validation set, so #pos_train = #pos_trainval - 2000 = 78000); see the sketch after this list.
- val_id = randperm(numel(neg), 500); : split the negative data into training and validation sets (randomly choose 500 images from neg as the validation set, so #neg_train = #neg - #neg_val = 1218 - 500 = 718).
- add_flip : flip the negative data (#neg_val *= 2; #neg_train *= 2).
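A minimal sketch of the positive train/validation split described above (the variable names follow this walkthrough; this is not the exact repository code):

% randomly pick 2000 of the augmented positives as the validation set;
% the remaining 80000 - 2000 = 78000 images become the training set
val_id = randperm(numel(pos_trainval), 2000);
is_val = false(numel(pos_trainval), 1);
is_val(val_id) = true;
pos_val   = pos_trainval(is_val);
pos_train = pos_trainval(~is_val);
% the negative split works the same way, with 500 validation images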
train_dcnn.m
Some variables and constants:
mean_pixel = [128, 128, 128];  % the mean pixel value (per channel)
K = conf.K;                    % K = T_{ij}, the number of mixture types for each pairwise relation
prepare_patches.m
Prepare the patches and derive their labels for training the DCNN.
% generate the labels
clusters = learn_clusters(pos_train, pos_val, tsize);
label_train = derive_labels('train', clusters, pos_train, tsize);
label_val = derive_labels('val', clusters, pos_val, tsize);
% dummy labels for the negatives
dummy_label = struct('mix_id', cell(numel(neg_train), 1), ...
    'global_id', cell(numel(neg_train), 1));
% all the training data
train_imdata = cat(1, num2cell(pos_train), num2cell(neg_train));
train_labels = cat(1, num2cell(label_train), num2cell(dummy_label));
% randomly permute the data and store it in LMDB format
perm_idx = randperm(numel(train_imdata));
train_imdata = train_imdata(perm_idx);
train_labels = train_labels(perm_idx);
if ~exist([cachedir, 'LMDB_train'], 'dir')
  store_patch(train_imdata, train_labels, psize, [cachedir, 'LMDB_train']);
end
% validation data for the positives
val_imdata = num2cell(pos_val);
val_labels = num2cell(label_val);
if ~exist([cachedir, 'LMDB_val'], 'dir')
  store_patch(val_imdata, val_labels, psize, [cachedir, 'LMDB_val']);
end
learn_clusters (calls cluster_rp to cluster the relative positions):
- nbh_IDs = get_IDs(pa, K); : get the neighbors of each part (joint).
- clusters{ii} : cell array holding the mean relative positions for the ii-th part.
- X(ii,:) = norm_rp(imdata(ii), cur, nbh, tsize); : relative position for the ii-th data item.
- mean_X = mean(X(valid_idx,:),1); normX = bsxfun(@minus, X(valid_idx,:), mean_X); : centralize (normalize) the relative positions.
- [gInd{trial}, cen{trial}, sumdist(trial)] = k_means(normX, K); : run R trials of the k-means algorithm and keep the one with the smallest total distance (see the sketch after the derive_labels list below).
- imgid of clusters{cur}{n}(k) : all the images that belong to cluster k, where clusters{cur}{n}(k) is the k-th cluster of the n-th neighbor of the cur-th joint.

derive_labels (calls assign_label):
- labels : an array of structs with fields mix_id, global_id, near, invalid.
- K : the number of clusters (mixture types).
- get_id : nbh_IDs{ii} : the neighbors of the ii-th part; target_IDs{ii} : the indexes of the ii-th part's neighbors; global_IDs{ii} : the global labels.
- For the n-th neighbor of the p-th part (joint) in the ii-th data image, nbh_idx = nbh_IDs{p}(n);
- labels(ii).mix_id{p}(n) : the index of the nearest cluster.
- labels(ii).near{p}{n} : the indexes of the near clusters (dist < 3*dist(nearest)).
- labels(ii).invalid(p) : for checking and debugging.
- labels(ii).global_id : translates mix_id{p} to global_id(p).
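For intuition, here is a minimal sketch of the relative-position clustering and the nearest-cluster labeling described above. It uses MATLAB's kmeans (Statistics Toolbox) as a stand-in for the repository's k_means helper, and the variable names are assumptions rather than the exact repository code:

% X: N-by-2 matrix of relative positions for one (part, neighbor) pair,
% as computed by norm_rp for the N training samples
mean_X = mean(X, 1);
normX  = bsxfun(@minus, X, mean_X);               % centralize the offsets
[gInd, cen] = kmeans(normX, K, 'Replicates', 5);  % K mixture types, best of 5 trials
% labeling a single centralized offset x (1-by-2):
d = sqrt(sum(bsxfun(@minus, cen, x).^2, 2));      % distance to each cluster center
[dmin, mix_id] = min(d);                          % nearest cluster gives mix_id
near = find(d < 3 * dmin);                        % "near" clusters (dist < 3x nearest)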
Call caffe through a system command to train the DCNN:
system([caffe_root, '/build/tools/caffe train ', sprintf('-gpu %d -solver %s', ...
conf.device_id, caffe_solver_file)]);
net_surgery.m
Change the fully-connected layers to convolutional layers.
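Why this works (a generic illustration; the exact layer shapes are handled by the reshape calls in the code below): a fully-connected layer with d outputs applied to an h x w x c input stores the same numbers as an h x w x c x d convolution kernel evaluated at a single location, so converting it is purely a reshape of the weights:

$$W_{\mathrm{fc}} \in \mathbb{R}^{d \times (h \cdot w \cdot c)} \;\longmapsto\; \mathrm{reshape}(W_{\mathrm{fc}}) \in \mathbb{R}^{h \times w \times c \times d}$$

Running the resulting fully-convolutional network on a larger image then produces the part/type probability maps for every location in a single forward pass.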
caffe matlab interface code: https://github.com/xianjiec/caffe/blob/dev/matlab/caffe/matcaffe.cpp
- caffe('reset'); caffe('init', deploy_file, model_file);
- fc_weights = caffe('get_weights'); : all the weights of the original network.
- fc_layer_ids(ii) : the index of the ii-th fully-connected layer in the original network.
- caffe('reset'); caffe('init', deploy_conv_file, model_file);
- conv_weights = caffe('get_weights'); : all the weights of the fully-convolutional network (FCN).
- conv_layer_ids(ii) : the index of the convolutional layer in the FCN that replaces the ii-th fully-connected layer.
- weights{1} : weights; weights{2} : bias.
trans_params = struct('weights', cell(numel(conv_names), 1), ...
'layer_names', cell(numel(conv_names), 1));
for ii = 1:numel(conv_names)
trans_params(ii).layer_names = conv_names{ii};
weights = cell(2, 1);
weights{1} = reshape(fc_weights(fc_layer_ids(ii)).weights{1}, size(conv_weights(conv_layer_ids(ii)).weights{1}));
weights{2} = reshape(fc_weights(fc_layer_ids(ii)).weights{2}, size(conv_weights(conv_layer_ids(ii)).weights{2}));
trans_params(ii).weights = weights;
end
caffe('set_weights', trans_params); caffe('save', fully_conv_model_file);
train_model
- label_val : the labels of the positive validation data (struct fields: mix_id, global_id, near, invalid).
- build_model : prepare the weight parameters of the full score formula for the SVM.
- train : model = train(cls, model, pos_val, neg_val, 1); : use the validation set to train the SVM.
- The parameters are grouped into bias, apps, pdef (0.01), and gaus.
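Since the full score is linear in all of these parameters, it can be written compactly as

$$F(l, t \mid I) = \langle \mathbf{w},\, \Phi(I, l, t) \rangle,$$

where w stacks the bias, appearance, prior-of-deformation, and Gaussian deformation weights, and Phi stacks the corresponding DCNN scores and deformation features. This linearity is what allows the linear-SVM / QP machinery below to learn the weights.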
Structure of the model parts:
model.len = 0;  % number of parameters in the model
% 'i' is the index of the parameters within the whole model
model.bias = struct('w',{},'i',{});  % bias
model.apps = struct('w',{},'i',{});  % appearance of each part
model.pdefs = struct('w',{},'i',{}); % prior of deformation (regressed)
model.gaus = struct('w',{},'i',{},'mean',{}, 'var', {}); % deformation Gaussian
% '***id' is the index of '***' in 'model.***'
model.components{1} = struct('parent',{}, 'pid', {}, 'nbh_IDs', {}, ...
    'biasid',{},'appid',{},'app_global_ids',{},'pdefid',{},'gauid',{},'idpr_global_ids',{});
train
- mining_onneg : mine hard negative examples; calls detect : [box,model] = detect(neg(i), model, -1, [], 0, i, -1);
- poslatent : relabel the latent positives; calls detect : box = detect(pos(ii), model, 0, bbox, overlap, ii, 1);
- sparselen : compute the maximum length of a sparse feature vector (for sizing the QP cache).
- The cached QP structure:
% qp.x(:,i) = examples
% qp.i(:,i) = id
% qp.b(:,i) = bias of the linear constraint
% qp.d(i)   = ||qp.x(:,i)||^2
% qp.a(i)   = i-th dual variable
qp_prune();
qp_opt();
% ...
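The cached constraints roughly correspond to the usual DPM-style cutting-plane QP (a sketch, not taken verbatim from this code):

$$\min_{\mathbf{w},\, \xi \ge 0} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \mathbf{w}^{\top}\mathbf{x}_i \ge b_i - \xi_i \;\; \forall i$$

Each column qp.x(:,i) is a cached (signed) feature vector, qp.b(:,i) is its margin b_i, and qp.a(i) is the dual variable for that constraint; qp_opt optimizes the dual coordinate-wise, and qp_prune discards cached examples whose dual variables are zero (non support vectors).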
detect
function [boxes,model,ex] = detect(iminfo, model, thresh, bbox, overlap, id, label)
The description of this function given by the author:
Detect objects in an image using a model and a score threshold.
A higher threshold leads to fewer detections.
The function returns a matrix with one row per detected object. The last column of each row gives the score of the detection. The column before last specifies the component used for the detection. Each set of the first 4 columns specifies the bounding box for a part. If bbox is not empty, we pick the best detection with significant overlap with it.
If label is included, we write feature vectors to a global QP structure.
This function updates the model (by running the QP solver) if the upper and lower bounds differ.
im = imreadx(iminfo);
[im, bbox] = cropscale_pos(im, bbox, model.cnn.psize);
imCNNdet : [pyra, unary_map, idpr_map] = imCNNdet(im,model,useGpu);
- pyra = impyra_fun(im, model, upS);
- For the i-th pyramid level, the p-th part, and the n-th neighbor, with joint_prob coming from the DCNN:
unary_map{i}{p} = sum(joint_prob(:,:,app_global_ids), 3);
idpr_map{i}{p}{n}(:,:,m) = sum(joint_prob(:,:,idpr_global_ids{n}{m}),3);
levels = levels(randperm(length(levels)));
% Walk from the leaves to the root of the tree, passing messages to parents
for p = p_no:-1:2
  child = parts(p);
  par = parts(p).parent;
  parent = parts(par);
  cbid = find(child.nbh_IDs == parent.pid);
  pbid = find(parent.nbh_IDs == child.pid);
  [msg,parts(p).Ix,parts(p).Iy,parts(p).Im{cbid},parts(par).Im{pbid}] ...
    = passmsg(child, parent, cbid, pbid);
  parts(par).score = parts(par).score + msg;
end
% Add the bias to the root score
parts(1).score = parts(1).score + parts(1).b;
rscore = parts(1).score;
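In equation form, the loop above performs max-sum message passing on the tree (the max over l_c is computed efficiently inside passmsg, typically with a generalized distance transform):

$$m_{c \to p}(l_p) = \max_{l_c,\, t_{cp},\, t_{pc}} \Big[ \mathrm{score}_c(l_c) + R(l_c, l_p, t_{cp}, t_{pc} \mid I) \Big]$$
$$\mathrm{score}_p(l_p) = U(l_p \mid I) + \sum_{c \in \mathrm{children}(p)} m_{c \to p}(l_p)$$

rscore is the resulting score map at the root; backtrack then recovers the arg-max part locations using the stored Ix, Iy, and Im indices.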
function [box,ex] = backtrack(x,y,parts,pyra,ex,write)
model = optimize(model);
test_model
function boxes = test_model(note,model,test)
Returns candidate bounding boxes after non-maximum suppression.
- detect_fast : box = detect_fast(test(i), model, model.thresh, par); : similar to detect.
- nms_pose : boxes{i} = nms_pose(box, overlap); with overlap = 0.3, followed by boxes{i} = boxes{i}(1,:); to keep only the first (top-scoring) box.
% estimated joints and scores generated from the detected boxes
ests = conf.box2det(boxes, p_no);
% generate part sticks from the joint locations
for ii = 1:numel(ests)
  ests(ii).sticks = conf.joint2stick(ests(ii).joints);
  pos_test(ii).sticks = conf.joint2stick(pos_test(ii).joints);
end
% evaluation and plotting of the results
eval_method = {'strict_pcp', 'pdj'};
show_eval(pos_test, ests, conf, eval_method);
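For reference, the two metrics as they are commonly defined (general background, not taken from the repository code): strict PCP counts a predicted limb (stick) as correct only if both of its endpoints are close to the ground truth relative to the limb length, and PDJ counts a joint as detected if its error is within a fraction of the torso diameter:

$$\text{strict PCP:}\quad \max\big(\lVert \hat{e}_1 - e_1 \rVert,\, \lVert \hat{e}_2 - e_2 \rVert\big) \le 0.5\,\lVert e_1 - e_2 \rVert$$
$$\text{PDJ:}\quad \lVert \hat{j} - j \rVert \le \alpha \cdot \text{(torso diameter)}, \;\text{reported as a curve over } \alpha$$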