Abstract:
With the development of Internet technology, online platforms have gradually become a primary means for people to obtain trending news. Filtering the current trending news out of large news collections and pushing it to users therefore has significant practical value. In supervised learning scenarios, each piece of news must be labeled manually, which takes considerable time and effort. From a semi-supervised learning perspective, this paper builds on non-negative Positive-Unlabeled (nnPU) learning and proposes a novel algorithm for news headline classification, called ‘Enhanced nnPU with Focal Loss’ (FLPU), which replaces the loss used by classical nnPU to estimate the empirical risk of positive and negative samples with the Focal Loss. Then, by incorporating into FLPU the Virtual Adversarial Training (VAT) used in Adversarial training for large neural LangUage Models (ALUM), a further improved algorithm, called ‘FLPU+ALUM’, is proposed for the same task, requiring only a small number of labeled positive samples. The superiority of both algorithms over the state-of-the-art PU algorithms considered is demonstrated through comparative experiments on two datasets. Moreover, another set of experiments shows that, when enriched by the proposed algorithms, the RoBERTa-wwm-ext model achieves better classification performance than the state-of-the-art binary classification models included in the comparison. In addition, a ‘Ratio Batch’ method is proposed and experimentally shown to be more stable in scenarios involving only a small number of labeled positive samples.
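To make the core idea of FLPU concrete, the following is a minimal PyTorch sketch of a non-negative PU risk in which the usual surrogate loss is replaced by the Focal Loss, as the abstract describes. It assumes a binary classifier with sigmoid outputs; the function names, the hyperparameter `gamma`, and the clipping detail are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def focal_loss(logits, target, gamma=2.0):
    """Per-sample binary focal loss; target is 1 (positive) or 0 (negative).
    The (1 - p_t)**gamma factor down-weights easy, well-classified samples."""
    p = torch.sigmoid(logits)
    p_t = torch.where(target == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))

def flpu_risk(logits_p, logits_u, prior, gamma=2.0):
    """Non-negative PU risk estimate with the focal loss as surrogate.
    logits_p: model outputs on labeled positive samples.
    logits_u: model outputs on unlabeled samples.
    prior:    assumed class prior pi = P(y = 1)."""
    risk_p_pos = focal_loss(logits_p, torch.ones_like(logits_p), gamma).mean()   # positives labeled positive
    risk_p_neg = focal_loss(logits_p, torch.zeros_like(logits_p), gamma).mean()  # positives treated as negative
    risk_u_neg = focal_loss(logits_u, torch.zeros_like(logits_u), gamma).mean()  # unlabeled treated as negative

    # nnPU-style correction: the negative-risk estimate is clipped at zero
    # so the overall empirical risk cannot go negative (a sign of overfitting).
    neg_risk = risk_u_neg - prior * risk_p_neg
    return prior * risk_p_pos + torch.clamp(neg_risk, min=0.0)
```

In this sketch, the focal loss simply takes the place of the base loss inside the standard nnPU risk decomposition; the VAT regularizer of FLPU+ALUM would be added to this objective as an extra smoothness term on the unlabeled data.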