Abstract:
In this paper, we focus on human pose transfer between videos, i.e., transferring the dance pose of a person in a given video to a target person in another video. Our method tackles this challenging scenario in three stages: 1) human pose extraction and normalization, in which we extract frames and pose masks from the source and target videos; 2) a GAN based on a cross-domain correspondence mechanism that synthesizes pose-guided images of the target person from consecutive frames and pose stick images; 3) a coarse-to-fine generation strategy with two refinement GANs, one of which reconstructs the face of the target person while the other smooths the generated frame sequence. Finally, we assemble the generated frames into a video. Compared with previous works, our model achieves better appearance consistency and temporal coherence in video-to-video synthesis for human motion transfer, which makes the generated videos look more realistic. Qualitative and quantitative comparisons show that our approach yields significant improvements over state-of-the-art methods, and experiments comparing synthesized frames against ground truth validate the effectiveness of the proposed method.
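To make the three-stage pipeline concrete, the following is a minimal Python sketch of the inference loop under stated assumptions: the model classes (PoseGAN, FaceGAN, TemporalGAN) and the extract_pose function are hypothetical stubs standing in for the paper's networks, and frame I/O uses OpenCV. It is an illustration of the stage ordering, not the authors' released implementation.

```python
# Minimal sketch of the three-stage pipeline. PoseGAN, FaceGAN, TemporalGAN,
# and extract_pose are hypothetical placeholders, not the paper's actual code.
import cv2
import numpy as np

def extract_pose(frame: np.ndarray) -> np.ndarray:
    """Stage 1 (stub): return a pose stick image for one frame.
    A real system would run a pose estimator here and normalize the skeleton."""
    return np.zeros_like(frame)

class PoseGAN:
    """Stage 2 (stub): synthesize the target person under the given pose,
    conditioned on the pose stick image and recent generated frames."""
    def generate(self, pose_img, prev_frames):
        return pose_img  # placeholder output

class FaceGAN:
    """Stage 3a (stub): refine the face region of a generated frame."""
    def refine(self, frame):
        return frame

class TemporalGAN:
    """Stage 3b (stub): smooth consecutive frames for temporal coherence."""
    def smooth(self, frames):
        return frames

def transfer(src_path: str, out_path: str, fps: float = 25.0) -> None:
    cap = cv2.VideoCapture(src_path)
    pose_gan, face_gan, temporal_gan = PoseGAN(), FaceGAN(), TemporalGAN()
    generated = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pose = extract_pose(frame)                    # stage 1: pose extraction
        out = pose_gan.generate(pose, generated[-2:]) # stage 2: synthesis
        out = face_gan.refine(out)                    # stage 3: face refinement
        generated.append(out)
    cap.release()
    generated = temporal_gan.smooth(generated)        # stage 3: temporal smoothing
    if generated:                                     # assemble frames into a video
        h, w = generated[0].shape[:2]
        writer = cv2.VideoWriter(out_path,
                                 cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for f in generated:
            writer.write(f)
        writer.release()
```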