CVPR 2020: Image-to-Image Translation (1)

 

SEAN: Image Synthesis with Semantic Region-Adaptive Normalization

  • Authors: Peihao Zhu, Rameen Abdal, Yipeng Qin, Peter Wonka
  • Arxiv: 1911.12861
  • GitHub

Problem

Semantic image synthesis: generating photorealistic images conditioned on segmentation masks.

Assumption in prior work

Prior work starting from SPADE 1) uses only one style code for the whole image, and 2) inserts the style code only at the beginning of the network.

None of the previous networks uses style information to generate spatially varying normalization parameters.

Insight

Control the style of each semantic region individually; e.g., one style reference image can be specified per region.

Use style input images to create spatially varying normalization parameters per semantic region. An important aspect of this work is that these spatially varying normalization parameters depend on the style input images as well as on the segmentation mask.

Technical overview

SEAN normalization

The inputs are a style matrix ST and a segmentation mask M. In the upper part, the style codes in ST undergo a per-style convolution and are then broadcast to their corresponding regions according to M to yield a style map. The style map is processed by conv layers to produce per-pixel normalization values γˢ and βˢ. The lower part (light blue layers) creates per-pixel normalization values using only the region information, similar to SPADE.
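A minimal PyTorch sketch of the data flow described above; SEANNorm, the nn.Linear stand-in for the per-style convolution, the 3×3 conv widths, and the single blending weight alpha are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEANNorm(nn.Module):
    """Sketch of Semantic Region-Adaptive Normalization (SEAN)."""

    def __init__(self, num_channels, num_regions, style_dim=512):
        super().__init__()
        self.param_free_norm = nn.BatchNorm2d(num_channels, affine=False)
        # Stand-in for the per-style convolution applied to each style code.
        self.per_style_fc = nn.Linear(style_dim, style_dim)
        # Style branch: per-pixel gamma/beta from the broadcast style map.
        self.style_gamma = nn.Conv2d(style_dim, num_channels, 3, padding=1)
        self.style_beta = nn.Conv2d(style_dim, num_channels, 3, padding=1)
        # Mask-only branch (SPADE-like): gamma/beta from the segmentation mask.
        self.mask_gamma = nn.Conv2d(num_regions, num_channels, 3, padding=1)
        self.mask_beta = nn.Conv2d(num_regions, num_channels, 3, padding=1)
        # Single learned blending weight between the two branches (illustrative).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, style_matrix, mask):
        # x: (B, C, H, W); style_matrix ST: (B, num_regions, style_dim);
        # mask M: (B, num_regions, H, W), one-hot per semantic region.
        mask = F.interpolate(mask, size=x.shape[2:], mode='nearest')
        codes = self.per_style_fc(style_matrix)                  # per-style transform
        style_map = torch.einsum('brs,brhw->bshw', codes, mask)  # broadcast to regions
        gamma = self.alpha * self.style_gamma(style_map) \
            + (1 - self.alpha) * self.mask_gamma(mask)
        beta = self.alpha * self.style_beta(style_map) \
            + (1 - self.alpha) * self.mask_beta(mask)
        return self.param_free_norm(x) * (1 + gamma) + beta
```

This only illustrates how the mask and the style codes jointly produce spatially varying γ and β; the paper's block differs in its details.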

The Generator

(A) On the left, the style encoder takes an input image and outputs a style matrix ST. The generator on the right consists of interleaved SEAN ResBlocks and Upsampling layers. (B) A detailed view of a SEAN ResBlock used in (A).
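A rough sketch of how such a residual block could be wired, reusing the imports and the SEANNorm sketch above (the layer widths and the normalized shortcut are assumptions):

```python
class SEANResBlock(nn.Module):
    """Sketch of a SEAN residual block: two SEAN-norm -> ReLU -> conv stages
    plus a SEAN-normalized shortcut."""

    def __init__(self, in_ch, out_ch, num_regions, style_dim=512):
        super().__init__()
        mid_ch = min(in_ch, out_ch)
        self.norm0 = SEANNorm(in_ch, num_regions, style_dim)
        self.conv0 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.norm1 = SEANNorm(mid_ch, num_regions, style_dim)
        self.conv1 = nn.Conv2d(mid_ch, out_ch, 3, padding=1)
        self.norm_s = SEANNorm(in_ch, num_regions, style_dim)
        self.conv_s = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x, style_matrix, mask):
        shortcut = self.conv_s(self.norm_s(x, style_matrix, mask))
        h = self.conv0(F.relu(self.norm0(x, style_matrix, mask)))
        h = self.conv1(F.relu(self.norm1(h, style_matrix, mask)))
        return shortcut + h
```

The generator would then stack such blocks, interleaved with upsampling layers, as in the figure description above.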

Proof

  • Datasets: ADE20K, Cityscapes, CelebA-HQ, Facades
  • Baselines: pix2pixHD, SPADE
  • Metrics: mIoU, pixel accuracy, FID; SSIM, RMSE, PSNR (for reconstruction)

Impact

Application: style interpolation; a per-region extension to SPADE.

Attentive Normalization for Conditional Image Generation

  • Authors: Yi Wang, Yubei Chen, Xiangyu Zhang, Jian-Tao Sun, Jiaya Jia
  • Arxiv: 2004.03828

Conditional image generation in a GAN framework using the proposed attentive normalization module. (a) Class-conditional image generation. (b) Image inpainting.

Problem

Conditional Image Synthesis

Assumption in prior work

Traditional convolution-based generative adversarial networks synthesize images through hierarchical local operations, so long-range dependencies are only modeled implicitly, much like in a Markov chain. This is still not sufficient for categories with complicated structures.

Self-Attention GAN: the self-attention module requires computing the correlation between every pair of points in the feature map, so the computational cost grows rapidly as the feature map becomes large.

Instance Normalization (IN): IN normalizes the mean and variance of a feature map along its spatial dimensions. This strategy ignores the fact that different locations may correspond to semantics with different means and variances.

Insight

Attentive Normalization (AN) predicts a semantic layout from the input feature map and then conducts regional instance normalization on the feature map based on this layout.

Technical overview

AN is formed by the proposed semantic layout learning (SLL) module and a regional normalization. The SLL module has a semantics-learning branch and a self-sampling branch. The semantics-learning branch employs a number of convolutional filters to capture regions with different semantics (each region being the area activated by a specific filter), under the assumption that each filter in this branch corresponds to some semantic entity.
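A hedged sketch of regional instance normalization driven by a learned soft layout; the filter count num_semantics, the softmax-based soft masks, and the omission of the self-sampling branch are simplifications of this sketch, not the paper's exact formulation:

```python
import torch
import torch.nn as nn


class RegionalInstanceNorm(nn.Module):
    """Sketch: predict a soft semantic layout from the features,
    then instance-normalize each region separately."""

    def __init__(self, num_channels, num_semantics=8):
        super().__init__()
        # Semantics-learning branch: each filter is assumed to respond
        # to one semantic entity; its softmax weight is a soft region mask.
        self.layout = nn.Conv2d(num_channels, num_semantics, kernel_size=1)

    def forward(self, x, eps=1e-5):
        # x: (B, C, H, W)
        w = torch.softmax(self.layout(x), dim=1)            # (B, K, H, W) soft layout
        w = w.unsqueeze(2)                                   # (B, K, 1, H, W)
        feat = x.unsqueeze(1)                                # (B, 1, C, H, W)
        area = w.sum(dim=(3, 4), keepdim=True) + eps         # per-region pixel mass
        mean = (w * feat).sum(dim=(3, 4), keepdim=True) / area
        var = (w * (feat - mean) ** 2).sum(dim=(3, 4), keepdim=True) / area
        normalized = (feat - mean) / torch.sqrt(var + eps)   # regional IN
        return (w * normalized).sum(dim=1)                   # recombine -> (B, C, H, W)
```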

Proof

  • Datasets: ImageNet; Paris Streetview
  • Baselines: SN-GAN, SA-GAN, BigGAN (Conditional Synthesis); CA (inpainting)
  • Metrics: FID, IS (Conditional Synthesis); PSNR, SSIM (inpainting)

Impact

semantics-aware attention + regional normalization

High-Resolution Daytime Translation Without Domain Labels

  • Authors: Ivan Anokhin, Pavel Solovev, Denis Korzhenkov, Alexey Kharlamov, Taras Khakhulin, Alexey Silvestrov, Sergey I. Nikolenko, Victor S. Lempitsky, Gleb Sterkin
  • Arxiv: 2003.08791
  • GitHub
  • Project Site

Problem

An image-to-image translation problem for the setting where domain labels are unavailable.

Assumption in prior work

Image-to-image translation approaches require domain labels at training as well as at inference time. The recent FUNIT model partially relaxes this constraint: to extract the style at inference time, it uses several images from the target domain as guidance for translation (known as the few-shot setting). However, domain annotations are still needed during training.

Insight

The only external (weak) supervision used by this approach consists of coarse segmentation maps estimated with an off-the-shelf semantic segmentation network.

Technical overview

HiDT learning data flow. We show half of the (symmetric) architecture; s′ = Es(x′) is the style extracted from the other image x′, and ŝ′ is obtained similarly to ŝ with x and x′ swapped. Light blue nodes denote data elements; light green, loss functions; others, functions (subnetworks). Functions with identical labels have shared weights. Adversarial losses are omitted for clarity.
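Very roughly, that swapped data flow might be written as below; Ec, Es, and G are hypothetical content encoder, style encoder, and generator, and only a subset of the non-adversarial losses is shown:

```python
import torch.nn.functional as F


def hidt_swap_losses(Ec, Es, G, x, x_prime):
    """Half of the symmetric training flow: translate x into the style of x_prime,
    then check that content and style are recoverable from the result."""
    c, s = Ec(x), Es(x)                    # content and style of x
    s_prime = Es(x_prime)                  # style s' extracted from the other image
    x_rec = G(c, s)                        # within-image reconstruction
    x_swap = G(c, s_prime)                 # x rendered with the swapped style
    c_hat, s_hat_prime = Ec(x_swap), Es(x_swap)
    return {
        'rec': F.l1_loss(x_rec, x),                # reconstruction loss
        'content': F.l1_loss(c_hat, c),            # content preserved after the swap
        'style': F.l1_loss(s_hat_prime, s_prime),  # style recovered after the swap
    }
```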

Enhancement scheme: the input is split into subimages (color-coded) that are translated individually by HiDT at medium resolution. The outputs are then merged using the merging network Genh. For illustration purposes, we show upsampling by a factor of two, but in the experiments we use a factor of four. We also apply bilinear downsampling (with shifts; see the paper for details) rather than strided subsampling when decomposing the input into medium-resolution images.
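A sketch of that decompose-translate-merge step for an upsampling factor of two; hidt and g_enh are placeholders for the translation and merging networks, and the shift-and-downsample details are an approximation:

```python
import torch
import torch.nn.functional as F


def enhance(x, style, hidt, g_enh, factor=2):
    """Split a high-res image into shifted, bilinearly downsampled subimages,
    translate each at medium resolution with HiDT, then merge them back."""
    b, c, h, w = x.shape
    translated = []
    for dy in range(factor):
        for dx in range(factor):
            # Shift, then bilinearly downsample to medium resolution.
            shifted = torch.roll(x, shifts=(-dy, -dx), dims=(2, 3))
            sub = F.interpolate(shifted, size=(h // factor, w // factor),
                                mode='bilinear', align_corners=False)
            translated.append(hidt(sub, style))   # translate each subimage
    # The merging network G_enh fuses the translated subimages to full resolution.
    return g_enh(torch.cat(translated, dim=1))
```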

Proof

  • Datasets: 20,000 landscape photos labeled by a pre-trained classifier
  • Baselines: FUNIT, DRIT++
  • Metrics: domain-invariant perceptual distance (DIPD), adapted IS

Impact

High-resolution translation

Swapping styles between two images. Original images are shown on the main diagonal. The examples show that HiDT is capable of swapping the styles of two real images while preserving details.

Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image Translation

Problem

Unsupervised image-to-image translation

Assumption in prior work

Current translation frameworks abandon the discriminator once training is completed. This paper argues for a novel role of the discriminator: reusing it to encode the images of the target domain.

Insight

We reuse a certain number of early layers of the discriminator as the encoder of the target domain.

We develop a decoupled training strategy in which the encoder is trained only when maximizing the adversarial loss and is kept frozen otherwise.

Technical overview
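
A minimal sketch of the reuse-and-decouple idea from the insight above; the module names, layer counts, and freezing helper are my own, not the paper's code:

```python
import torch.nn as nn


class ReusedDiscriminator(nn.Module):
    """Sketch: the early layers double as the encoder of the target domain;
    the remaining layers form the discriminator-only tail."""

    def __init__(self, in_ch=3, base_ch=64, num_enc_layers=3):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(num_enc_layers):
            out = base_ch * 2 ** i
            layers += [nn.Conv2d(ch, out, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch = out
        self.encoder = nn.Sequential(*layers)               # reused as the encoder
        self.classifier = nn.Conv2d(ch, 1, 4, padding=1)    # real/fake head

    def encode(self, x):
        return self.encoder(x)          # features fed to the generator/decoder

    def forward(self, x):
        return self.classifier(self.encoder(x))


def set_encoder_trainable(disc, flag):
    # Decoupled training: enable gradients for the shared encoder only on the
    # discriminator (adversarial) step; keep it frozen on the generator step.
    for p in disc.encoder.parameters():
        p.requires_grad = flag
```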

Proof

  • Datasets: horse↔zebra, summer↔winter, vangogh↔photo, and cat↔dog
  • Baselines: CycleGAN, UNIT, MUNIT, DRIT, U-GAT-IT
  • Metrics: FID, KID

Impact

Sounds like a plug-in strategy applicable to all I2I frameworks.

Semi-supervised Learning for Few-shot Image-to-Image Translation

  • Authors: Yaxing Wang, Salman Khan, Abel Gonzalez-Garcia, Joost van de Weijer, Fahad Shahbaz Khan
  • Arxiv: 2003.13853
  • GitHub

Problem

Few-shot (in both source and target) unpaired image-to-image translation

(c) Few-shot semi-supervised (Ours): same as few-shot, but the source domain has only a limited amount of labeled data at train time.

Assumption in prior work

First, the target domain is required to contain the same categories or attributes as the source domain at test time, thus failing to scale to unseen categories (see Fig. 1(a)). Second, they rely heavily on having access to vast quantities of labeled data (Fig. 1(a, b)) at train time. Such labels provide useful information during training and play a key role in some settings (e.g., scalable I2I translation).

Insight

We propose using semi-supervised learning to reduce the requirement for labeled source images and to make effective use of unlabeled data. More concretely, we assign pseudo-labels to the unlabeled images based on an initial small set of labeled images. These pseudo-labels provide soft supervision for training an image translation model from source images to unseen target domains. Since this mechanism can introduce noisy labels, we employ a pseudo-labeling technique that is highly robust to label noise. To further leverage the unlabeled images from the dataset (or even external images), we use a cycle consistency constraint [48].
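A hedged sketch of the confidence-based pseudo-labeling step described above; the classifier, the threshold, and the simple filtering are placeholders, and the paper's noise-robust pseudo-labeling technique differs in its details:

```python
import torch


@torch.no_grad()
def assign_pseudo_labels(classifier, unlabeled_loader, threshold=0.9):
    """Label unlabeled source images with a classifier trained on the small
    labeled set; keep only confident predictions to limit label noise."""
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:
        probs = torch.softmax(classifier(images), dim=1)
        conf, labels = probs.max(dim=1)
        keep = conf > threshold
        kept_images.append(images[keep])
        kept_labels.append(labels[keep])
    return torch.cat(kept_images), torch.cat(kept_labels)
```

The resulting pseudo-labeled pairs would then train the translation model together with the cycle consistency constraint.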

Technical overview

Proof

  • Metrics: FID, IS
  • Baselines: CycleGAN, StarGAN, MUNIT, FUNIT