Abstract:
Deep learning is effective at automating representation learning, eliminating the need for handcrafted features. For personalized recommendation, deep learning-based methods have achieved great success by learning effective representations of multimedia items, especially images and videos. However, previous works usually adopt a simple, single representation of user interest, such as a user embedding, which cannot fully characterize the diversity and volatility of user interest. To address this problem, in this paper we focus on learning and fusing multiple kinds of user interest representations by leveraging deep networks. Specifically, we consider representations of four aspects of user interest: first, we use a latent representation, i.e., a user embedding, to profile a user's overall interest; second, we propose an item-level representation, which is learned from and integrates the features of a user's historical items; third, we investigate a neighbor-assisted representation, which characterizes user interest collaboratively using neighboring users' information; fourth, we propose a category-level representation, which is learned from the categorical attributes of a user's historical items. To integrate these multiple user interest representations, we study both early fusion and late fusion; for early fusion, we further compare different fusion functions. We validate the proposed method on two real-world datasets for micro-video and movie recommendation, respectively. Experimental results demonstrate that our method outperforms state-of-the-art methods by a significant margin. Our code is publicly available.
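To make the core idea concrete, the following minimal PyTorch-style sketch shows how four user-interest representations could be computed and combined via early fusion (concatenation followed by a projection). This is not the authors' released code: the module names, mean-pooling choices, and dimensions are illustrative assumptions.

```python
# Minimal sketch of fusing four user-interest representations (early fusion).
# All names and pooling choices below are illustrative assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn

class MultiInterestUser(nn.Module):
    def __init__(self, n_users, n_items, n_categories, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)       # latent (overall) interest
        self.item_emb = nn.Embedding(n_items, dim)       # shared item table
        self.cat_emb = nn.Embedding(n_categories, dim)   # category attributes
        # Early fusion: concatenate the four views and project back to `dim`.
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, user_ids, hist_items, neighbor_ids, hist_cats):
        u = self.user_emb(user_ids)                          # latent representation
        item_level = self.item_emb(hist_items).mean(dim=1)   # historical items
        neighbor = self.user_emb(neighbor_ids).mean(dim=1)   # neighboring users
        cat_level = self.cat_emb(hist_cats).mean(dim=1)      # item categories
        return self.fuse(torch.cat([u, item_level, neighbor, cat_level], dim=-1))

def score(model, user_repr, candidate_items):
    # Dot-product relevance between the fused user interest and candidate items.
    return (user_repr.unsqueeze(1) * model.item_emb(candidate_items)).sum(-1)
```

A late-fusion variant would instead score each of the four representations against the candidate item separately and then combine the resulting scores, rather than fusing the representations before scoring.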