Goal: predict a dense flow/correspondence field between a pair of images
flow: the relative offset of each source pixel's correspondence in the target image
matchability: 1 if a correspondence exists, 0 if not
Cycle-consistency
training data: a pair of real images, plus a matched 3D CAD model rendered into 2 synthetic views (one per real image's viewpoint), forming a quartet <s1, r1, r2, s2>
learn to predict the flows f_{s1->r1}, f_{r1->r2}, f_{r2->s2}
ground truth: the synthetic-to-synthetic flow f_{s1->s2}, provided by the rendering engine
Learning Dense Correspondence
minimize the objective function over the composed flow
Transitive flow composition: given f_{a->b} and f_{b->c}, f_{a->c}(p) = f_{a->b}(p) + f_{b->c}(p + f_{a->b}(p)); applying this twice composes the three predicted flows into f_{s1->s2}
Truncated Euclidean loss: L = sum_p min(||f_pred(p) - f_gt(p)||_2, T)
In experiments, T is a fixed threshold in pixels;
Why truncated loss: to be more robust to spurious outliers during training, especially in the early stage when the network output tends to be highly noisy.
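The transitive flow composition and the truncated loss can be sketched in NumPy as follows. This is a minimal illustration, assuming flow fields stored as (H, W, 2) offset maps with nearest-neighbor sampling for the lookup, and an illustrative truncation value T = 15 (not a value taken from these notes):

```python
import numpy as np

def compose_flow(f_ab, f_bc):
    """Transitively compose two dense flow fields of shape (H, W, 2):
    f_ac(p) = f_ab(p) + f_bc(p + f_ab(p)).
    The lookup into f_bc uses nearest-neighbor sampling for simplicity."""
    H, W, _ = f_ab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # target coordinates after applying f_ab, clipped to the image bounds
    xt = np.clip(np.round(xs + f_ab[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + f_ab[..., 1]).astype(int), 0, H - 1)
    return f_ab + f_bc[yt, xt]

def truncated_euclidean_loss(f_pred, f_gt, T=15.0):
    """Per-pixel Euclidean error, truncated at T pixels, summed over the map."""
    err = np.linalg.norm(f_pred - f_gt, axis=-1)
    return np.minimum(err, T).sum()
```

Composing the three predicted flows of the cycle is two calls to `compose_flow`; the truncation caps the penalty any single outlier pixel can contribute.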
Learning Dense Matchability
objective function: per-pixel cross-entropy loss
m_gt: ground-truth matchability map for the synthetic pair, known from rendering
Matchability map composition: m_{s1->s2}(p) = m_{s1->r1}(p) * m_{r1->r2}(p') * m_{r2->s2}(p''), where p' and p'' follow the composed flows
fix m_{s1->r1} and m_{r2->s2} to all ones, and only train the CNN to infer m_{r1->r2} (due to the multiplicative nature of the matchability composition, the individual factors would otherwise be under-constrained)
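The per-pixel cross-entropy objective and the multiplicative composition can be sketched as below. This is an assumption-laden simplification: the warping of each map by the composed flows is omitted, so the composition shown is pure element-wise multiplication:

```python
import numpy as np

def matchability_cross_entropy(m_pred, m_gt, eps=1e-7):
    """Per-pixel binary cross-entropy between a predicted matchability map
    (probabilities in (0, 1)) and a {0, 1} ground-truth map."""
    m_pred = np.clip(m_pred, eps, 1.0 - eps)
    return -(m_gt * np.log(m_pred) + (1 - m_gt) * np.log(1 - m_pred)).mean()

def compose_matchability(maps):
    """Multiplicative composition along the cycle (warping omitted here).
    With the two synthetic<->real maps fixed to all ones, the product
    reduces to the real->real map alone."""
    out = np.ones_like(maps[0])
    for m in maps:
        out = out * m
    return out
```

The reduction is the point: fixing the outer two maps to ones makes the composed map equal the real-to-real map, so the cross-entropy gradient trains only that prediction.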
Network
feature encoder of 8 convolution layers that
extracts relevant features from both input images with
shared network weights;
flow decoder of 9 fractionally-strided / up-sampling
convolution (uconv) layers that assembles
features from both input images, and outputs a dense
flow field;
matchability decoder of 9 uconv layers that
assembles features from both input images, and outputs a
probability map indicating whether each pixel in the source
image has a correspondence in the target.
all layers are conv + ReLU (except the last uconv of each decoder)
kernel size 3×3
no pooling; stride 2 is used whenever the spatial dimension is decreased (encoder) or increased (decoders)
the output of the matchability decoder is passed through a sigmoid for normalization to [0, 1]
training: same network for
Experiments
Training set
real images: PASCAL3D+ dataset
cropped to the object bounding box;
rescaled to 128×128
3D CAD models: ShapeNet database
render the CAD models from the same viewpoints as the real image pair
choose the K = 20 nearest models by Euclidean distance between HOG features
valid training quartets per category: 80,000
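The nearest-model selection step is a plain K-nearest-neighbor lookup in HOG feature space. A minimal sketch, assuming HOG descriptors have already been extracted into flat vectors (descriptor extraction itself is out of scope here):

```python
import numpy as np

def k_nearest_models(query_hog, model_hogs, k=20):
    """Return indices of the k CAD models whose HOG descriptors are
    closest to the query image's descriptor in Euclidean distance.

    query_hog:  (D,) descriptor of the real image
    model_hogs: (N, D) descriptors of the rendered model views
    """
    dists = np.linalg.norm(model_hogs - query_hog[None, :], axis=1)
    return np.argsort(dists)[:k]
```

Each real image is thus paired with its K = 20 best-matching CAD models to form candidate quartets.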
Network training
Initialization:
feature encoder + flow decoder pathway: mimic SIFT flow by randomly sampling image pairs from the training quartets and training the network to minimize the Euclidean loss between the network prediction and the SIFT flow output on the sampled pair
also tried other initialization strategies (e.g. predicting ground-truth flows between synthetic images); initializing with the SIFT flow output works best.
Parameters:
ADAM solver with β1 = 0.9, β2 = 0.999, lr = 0.001,
step size of 50k iterations, step multiplier of 0.5, for 200k iterations total.
batch size: 40 pairs during initialization and 10 quartets during fine-tuning.
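The learning-rate schedule described above (base rate 0.001, halved every 50k iterations) is a standard step schedule and can be written directly:

```python
def step_lr(base_lr=0.001, step_size=50_000, gamma=0.5, iteration=0):
    """Step learning-rate schedule: start at base_lr and multiply by
    gamma once per completed step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)
```

Over the 200k-iteration run this yields four plateaus: 0.001, 0.0005, 0.00025, and 0.000125.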
Feature
embedding layout appears to be viewpoint-sensitive
(the network might implicitly learn that viewpoint is an important cue for the correspondence/matchability tasks through the consistency training.)
Keypoint transfer task
Evaluate the quality of the correspondence output
For each category, we exhaustively sample image pairs from the val split (not seen during training), and determine whether each keypoint in the source image is transferred correctly, by measuring the Euclidean distance between our correspondence prediction and the annotated ground-truth (if it exists) in the target image.
A correct transfer: the prediction falls within α · max(H, W) pixels of the ground-truth, with H and W being the image height and width, respectively (both are 128 pixels in our case) and α a fixed fraction
Metric: the percentage of correct keypoint transfers (PCK)
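The PCK metric itself is a short computation. A sketch over (N, 2) keypoint arrays, using α = 0.1 as an assumed, commonly used threshold fraction (the notes do not specify the value):

```python
import numpy as np

def pck(pred_kps, gt_kps, H=128, W=128, alpha=0.1):
    """Percentage of correct keypoint transfers: a transfer counts as
    correct when the predicted location lies within alpha * max(H, W)
    pixels of the ground truth.

    pred_kps, gt_kps: (N, 2) arrays of (x, y) keypoint locations.
    """
    thresh = alpha * max(H, W)
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= thresh).mean())
```

With 128×128 images and α = 0.1, the tolerance radius is 12.8 pixels.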
Matchability prediction
PASCAL-Part dataset (provides human-annotated part segment labels)