Abstract: To address inadequate lesion feature extraction and limited grading performance in diabetic retinopathy, this paper proposes MFViT, an enhanced algorithm built on the FastViT network, to improve the accuracy of retinal lesion grading. First, a multi-scale feature extraction token mixer with improved positional information is designed to extract, layer by layer, multi-scale features that carry spatial location information. Second, a feature detail enhancement module is constructed to capture the relationships among cross-scale features within the image, enhance high-frequency details, and strengthen the representational ability of low-resolution features. Finally, a cross-layer feature fusion module is proposed to adaptively fuse features from different levels, further improving the network's classification performance. On the retinal dataset, MFViT achieves an accuracy, precision, recall, specificity, and F1-score of 94.5%, 94.5%, 94.7%, 98.6%, and 94.6%, respectively. Compared with currently popular algorithms, the proposed method improves on every evaluation metric for diabetic retinopathy grading and shows strong potential for computer-aided clinical diagnosis.
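For readers who want to reproduce the five reported metrics, the following is a minimal sketch of how they can be computed from a multi-class confusion matrix. It assumes macro averaging over grades with a one-vs-rest definition of specificity; the abstract does not state which averaging scheme the authors used. The function name `grading_metrics` and the toy matrix are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def grading_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute accuracy and macro-averaged precision, recall,
    specificity, and F1 from a square confusion matrix.

    cm[i][j] is the number of samples whose true grade is i and
    whose predicted grade is j. Macro averaging is an assumption.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, specificities, f1s = [], [], [], []

    for k in range(n):
        # One-vs-rest counts for grade k.
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp

        # Guard against empty denominators (e.g. a grade never predicted).
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        s = tn / (tn + fp) if (tn + fp) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0

        precisions.append(p)
        recalls.append(r)
        specificities.append(s)
        f1s.append(f)

    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "specificity": sum(specificities) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy two-grade matrix for illustration only; not data from the paper.
    toy_cm = [[8, 2],
              [1, 9]]
    for name, value in grading_metrics(toy_cm).items():
        print(f"{name}: {value:.3f}")
```

With more than two grades, one-vs-rest specificity tends to be higher than recall because each grade's true-negative count includes the correctly handled samples of all other grades, which is consistent with specificity being the largest of the reported values.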