Micro-expressions (MEs) are rapid and subtle facial movements that are difficult to detect and recognize. Most recent works have attempted to recognize MEs with spatial and temporal information from video clips. According to psychological studies, the apex frame conveys the most emotional information expressed in facial expressions. However, it is not clear how the single apex frame contributes to micro-expression recognition. To alleviate that problem, this paper firstly proposes a new method to detect the apex frame by estimating pixel-level change rates in the frequency domain. With frequency information, it performs more effectively on apex frame spotting than the currently existing apex frame spotting methods based on the spatio-temporal change information. Secondly, with the apex frame, this paper proposes a joint feature learning architecture coupling local and global information to recognize MEs, because not all regions make the same contribution to ME recognition and some regions do not even contain any emotional information. More specifically, the proposed model involves the local information learned from the facial regions contributing major emotion information, and the global information learned from the whole face. Leveraging the local and global information enables our model to learn discriminative ME representations and suppress the negative influence of unrelated regions to MEs. The proposed method is extensively evaluated using CASME, CASME II, SAMM, SMIC, and composite databases. Experimental results demonstrate that our method with the detected apex frame achieves considerably promising ME recognition performance, compared with the state-of-the-art methods employing the whole ME sequence. Moreover, the results indicate that the apex frame can significantly contribute to micro-expression recognition.