Comparing different similarity functions for image reconstruction in CycleGAN. (https://tandon-a.github.io/CycleGAN_ssim/) Training CycleGAN with different loss functions to improve the visual quality of the produced images.
Generative deep learning models such as Generative Adversarial Networks (GANs) are used for various image manipulation problems, such as scene editing (removing an object from an image), image generation, and style transfer, each producing an image as the end result.
To improve the quality of these generated images, it is important to use an objective function (loss function) that is better suited to human perceptual judgement. In this post, I present a brief overview of the different loss functions used for this task.
According to error sensitivity theory, a distorted image is seen as a sum of the undistorted image and an error signal.
*Figure: an undistorted image, its error signal, and the resulting distorted image.*
Loss in quality of an image is thus assumed to be related to the visibility of this error signal. L2 loss quantifies the error signal by taking the mean of the squared differences between the intensities (pixel values) of the distorted and undistorted images.
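As a concrete reference point, here is a minimal NumPy sketch of the L2 (mean squared error) measure; the function name and value range are my own choices, not the project's code.

```python
import numpy as np

def l2_loss(reference, distorted):
    """Mean squared error between two images of identical shape.

    The arrays are treated as plain signals: every pixel contributes
    independently, and spatial structure is ignored.
    """
    reference = np.asarray(reference, dtype=np.float64)
    distorted = np.asarray(distorted, dtype=np.float64)
    return float(np.mean((reference - distorted) ** 2))
```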
L2 loss has been the de facto standard in the industry for quite a long time now, mainly because of its simple structure: researchers have used L2 loss in all kinds of problems, from regression to image processing.
However, L2 loss assumes that pixels (signal points) are independent of one another, whereas images are highly structured: the ordering of the pixels carries important information about the content of an image. Moreover, for a given error signal, L2 loss remains the same irrespective of the correlation between the original signal and the error signal, even though this correlation can have a strong impact on perceptual similarity.
The failure of these assumptions makes L2 loss an unsuitable candidate for improving the quality of generated images.
The loss in quality of an image is thus not related only to the visibility of the error signal. In contrast to L2 loss, the structural similarity (SSIM) index measures the similarity of two images by comparing their luminance, contrast, and structure.
Luminance of an image signal is estimated by its mean intensity,

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The luminance of two image signals $x$ and $y$ is then compared by

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$$
Contrast is determined by the difference in luminance between an object and the other objects in the field of view. It is estimated by the standard deviation of the image signal,

$$\sigma_x = \left(\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)^2\right)^{1/2}$$

Contrast similarity is then given by

$$c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$$
Structural information is represented by the strong inter-dependencies between spatially close pixels. Normalizing the incoming signals, by first subtracting the mean intensity and then dividing by the respective standard deviation, projects each image signal as a unit vector $(x - \mu_x)/\sigma_x$ onto the hyperplane defined by $\sum_{i=1}^{N} x_i = 0$. These unit vectors carry the structural information, and comparing them gives the structure similarity

$$s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$

where the covariance $\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$ captures the correlation between the two windows $x$ and $y$.
The SSIM index is computed by combining the luminance, contrast and structure terms; with $C_3 = C_2/2$ it reduces to

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

The constants $C_1$, $C_2$ and $C_3$ are used to stabilize the ratios when the denominators tend to zero.
The SSIM index is calculated as a local measure rather than a global measure, to incorporate the fact that the human visual system (HVS) can perceive only a local area at high resolution at a time. In the above formulas, $x$ and $y$ are local windows on the full images $X$ and $Y$ (the test/predicted image and the reference image), and the per-window values are averaged to give the image-level score.
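To make the pieces above concrete, here is a simplified NumPy sketch of the SSIM index, written by me and not taken from the project: it evaluates SSIM on small non-overlapping windows and averages the results, whereas the original paper uses a sliding Gaussian-weighted 11x11 window. The constants follow the paper's choice $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ with $C_3 = C_2/2$ for an 8-bit dynamic range $L = 255$.

```python
import numpy as np

def ssim_window(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM index for a pair of local windows x and y of the same shape."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y)[0, 1]  # sample covariance sigma_xy
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # Contrast and structure terms combined, using C3 = C2 / 2.
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure

def ssim(img_x, img_y, win=8):
    """Mean SSIM over non-overlapping win x win windows of two grayscale images."""
    h, w = img_x.shape
    scores = [ssim_window(img_x[i:i + win, j:j + win], img_y[i:i + win, j:j + win])
              for i in range(0, h - win + 1, win)
              for j in range(0, w - win + 1, win)]
    return float(np.mean(scores))
```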
In comparison to L2 loss, the SSIM index is a better image quality measure because it is better aligned with the HVS. The following figure shows that the SSIM index varies across different distortions while the L2 loss remains constant, illustrating the superiority of the SSIM index over L2 loss.
To take into account the scale at which the local structure of an image is analyzed, researchers came up with a multi-scale version of the SSIM index, MS-SSIM. It is calculated by computing the SSIM index at several successively downsampled scales and combining the per-scale values with fixed weights (the original formulation combines the per-scale terms as a weighted product).
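A rough sketch of the multi-scale computation, reusing the `ssim` helper above; the five weights come from the MS-SSIM paper, but the weighted average here is a readability simplification of the paper's weighted product, and the inputs are assumed to be large enough (at least about 128 pixels per side) for five dyadic scales.

```python
def downsample(img):
    """2x2 average pooling: the low-pass-and-subsample step between MS-SSIM scales."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Per-scale weights reported in the MS-SSIM paper (five scales).
MS_SSIM_WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def ms_ssim(img_x, img_y, weights=MS_SSIM_WEIGHTS):
    """Weighted combination of SSIM values computed at successively coarser scales."""
    scores = []
    for _ in weights:
        scores.append(ssim(img_x, img_y))        # `ssim` from the sketch above
        img_x, img_y = downsample(img_x), downsample(img_y)
    return float(np.dot(weights, scores))
```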
The SSIM loss, given by 1 - SSIM index, is used as the objective function for DL models.
While SSIM loss may seem more suitable than L2 loss, it was designed for grayscale images and sometimes fails to estimate the quality of color images. Training DL models with SSIM loss alone can lead to a shift of colors.
To overcome this issue of SSIM loss, neural nets can be trained with a combination of different losses, for example

$$L^{\mathrm{Mix}} = \alpha \cdot L^{\mathrm{MS\text{-}SSIM}} + (1 - \alpha) \cdot G_{\sigma} \cdot L^{1}$$

Here $G_{\sigma}$ is a Gaussian window and $\alpha$ (set to 0.84) weights the different loss functions involved. The Gaussian window is used to make the L1 loss consistent with the MS-SSIM loss, in which the error at pixel $q$ is propagated based on its contribution to the MS-SSIM of the central pixel.
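A rough NumPy sketch of such a mix, following the MS-SSIM + L1 formulation of Zhao et al. and reusing the `ms_ssim` helper above; the 0.84 weight comes from that paper, while applying the Gaussian weighting patch-wise is my simplification.

```python
def gaussian_window(size=11, sigma=1.5):
    """Normalized 2-D Gaussian kernel used to weight the per-pixel L1 error."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def mix_loss(img_x, img_y, alpha=0.84, win=11, sigma=1.5):
    """alpha * (1 - MS-SSIM) + (1 - alpha) * Gaussian-weighted mean absolute error."""
    g = gaussian_window(win, sigma)
    abs_err = np.abs(img_x.astype(np.float64) - img_y.astype(np.float64))
    h, w = abs_err.shape
    # Gaussian-weighted L1 error, evaluated patch by patch for simplicity.
    weighted_l1 = np.mean([np.sum(g * abs_err[i:i + win, j:j + win])
                           for i in range(0, h - win + 1, win)
                           for j in range(0, w - win + 1, win)])
    return alpha * (1.0 - ms_ssim(img_x, img_y)) + (1.0 - alpha) * weighted_l1
```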
*Figure: input image alongside reconstructions from models trained with SSIM, SSIM + L1, and SSIM + L2 losses.*
SSIM loss has been well adopted in the industry, but it has its own limitations: it cannot handle large geometric distortions between the images being compared.
Finding a linear function that fits human similarity judgements is an onerous task, owing to the complex, context-dependent nature of human judgement.
In light of this problem, researchers in the community have used the distance between features of images passed through pretrained DL models (VGG, AlexNet) as a similarity measure between images. The distance between the features is generally calculated as an L2 distance or a cosine distance, for example

$$d(X, Y) = \sum_{l} \frac{1}{N_l} \left\lVert \phi_l(X) - \phi_l(Y) \right\rVert_2^2$$

where $\phi$ is the pretrained DL model (VGG or AlexNet), $l$ indexes the network layers, and $N_l$ is the number of elements in the layer-$l$ feature map.
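As an illustration, here is a minimal PyTorch sketch of such a feature-space distance, assuming a recent torchvision for the pretrained VGG-16 weights; the choice of layers is mine, the inputs are expected to be ImageNet-normalized tensors of shape (N, 3, H, W), and this is a generic sketch rather than the project's code.

```python
import torch
import torchvision.models as models

# Illustrative layer choice: ReLU outputs at a few depths of vgg16.features.
FEATURE_LAYERS = {3, 8, 15, 22}

class FeatureDistance(torch.nn.Module):
    """L2 distance between deep features extracted by a frozen, pretrained VGG-16."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg

    def forward(self, x, y):
        dist = 0.0
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in FEATURE_LAYERS:
                # Mean squared difference of the feature maps at this layer.
                dist = dist + torch.mean((x - y) ** 2)
            if idx == max(FEATURE_LAYERS):
                break
        return dist
```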
Another loss metric is the recently proposed contextual loss, which also measures distances between features computed with DL models. When calculating this metric, the features of the reference image and of the predicted/test image are treated as two separate collections. For every feature in the reference collection, the most similar feature is found among the test image features: the cosine distances between feature vectors are computed and converted to similarity values by exponentiation. These per-feature maximum similarities are then averaged to give the similarity value of the two images (and the loss is taken as the negative log of this value). The authors of this research paper show promising results using the contextual loss.
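A simplified NumPy sketch of this computation, as I understand it from the contextual loss paper; `feats_test` and `feats_ref` are assumed to be (number of features, feature dimension) arrays already extracted by a pretrained network, and the bandwidth `h` is illustrative.

```python
import numpy as np

def contextual_similarity(feats_test, feats_ref, h=0.5, eps=1e-5):
    """Contextual similarity between two collections of feature vectors."""
    # Normalize features so that dot products become cosine similarities.
    ft = feats_test / (np.linalg.norm(feats_test, axis=1, keepdims=True) + eps)
    fr = feats_ref / (np.linalg.norm(feats_ref, axis=1, keepdims=True) + eps)
    d = 1.0 - ft @ fr.T                                   # cosine distances, (N_test, N_ref)
    d_tilde = d / (d.min(axis=1, keepdims=True) + eps)    # normalize by the closest match
    w = np.exp((1.0 - d_tilde) / h)                       # turn distances into similarities
    cx = w / w.sum(axis=1, keepdims=True)
    # For every reference feature, keep its best match among the test features,
    # then average these maxima to get the image-level similarity.
    return float(cx.max(axis=0).mean())

def contextual_loss(feats_test, feats_ref):
    return -np.log(contextual_similarity(feats_test, feats_ref) + 1e-5)
```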
I trained the CycleGAN [6] model on the Monet-Photo database with different loss functions for the cycle consistency term; the sketch below shows where the similarity function plugs in, and some sample comparisons follow. The project is available here.
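As a rough illustration of where these measures enter training, here is a sketch of the cycle-consistency term with a pluggable similarity loss; the generator names `G_AB` and `G_BA` in the usage comment are hypothetical, and this is not the project's actual training code.

```python
def cycle_consistency_loss(real_a, rec_a, real_b, rec_b, similarity_loss):
    """Cycle-consistency term with a pluggable image similarity measure.

    `similarity_loss(image, reconstruction)` can be any of the measures sketched
    above: l2_loss, lambda x, y: 1 - ssim(x, y), mix_loss, a feature distance, ...
    """
    return similarity_loss(real_a, rec_a) + similarity_loss(real_b, rec_b)

# Hypothetical usage with generators G_AB (A -> B) and G_BA (B -> A):
#   rec_a = G_BA(G_AB(real_a)); rec_b = G_AB(G_BA(real_b))
#   loss_cyc = cycle_consistency_loss(real_a, rec_a, real_b, rec_b,
#                                     lambda x, y: 1 - ssim(x, y))
```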
*Figures: sample comparisons showing the input image alongside results produced with L1, SSIM, SSIM + L1, SSIM + L2(a), SSIM + L2(b), and SSIM + L1 + L2(b) losses.*
P.S. - I am implementing perceptual loss using deep features metric in this project. Stay tuned for that.