Mechanics & Industry
Volume 26, 2025
Special Issue: Robotic Process Automation for Smarter Devices in Manufacturing

| Article Number | 27 |
| Number of page(s) | 14 |
| DOI | https://doi.org/10.1051/meca/2025019 |
| Published online | 17 September 2025 |
Original Article
The study on surface defect detection of stamped parts based on improved deep learning
1 School of Intelligent Manufacturing, Tianjin Electronic Information College, Tianjin 300350, China
2 Tianjin Bonus Robotics Technology Co., Ltd., Tianjin 300350, China
* e-mail: tian_xia329@126.com; tian.xia@tjdz.edu.cn
Received: 19 April 2025
Accepted: 10 July 2025
Stamped parts are widely used in industries such as automotive and aerospace, where surface defects like scratches and cracks can affect appearance and function. Traditional manual inspection methods are inefficient and prone to errors. This paper proposes an improved deep learning-based approach for detecting surface defects in stamped parts, combining image preprocessing, feature enhancement, and deep neural networks. The YOLOv8 model, enhanced with the Feature Fusion Attention Network (FFA-Net) and Gold-YOLO, was tested on an augmented image dataset. Experimental results show that the improved model achieves higher precision, recall, and detection accuracy compared to baseline methods. The model demonstrates robustness under various environmental challenges, making it suitable for industrial defect detection applications.
Key words: Surface defect detection / deep learning / Gold-YOLO / mechanical / industrial products
© T. Xia et al., Published by EDP Sciences 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Stamped parts are widely used in various mechanical and industrial products [1], including automobiles, aerospace, electronics, and home appliances. The stamping process involves the deformation of metal sheets under high pressure, and the surface of the molds may be affected by factors such as dust, leading to various surface defects on the stamped parts during manufacturing, such as scratches, dents, and cracks. These defects not only affect the aesthetic quality of the parts but may also impact the accuracy of subsequent processing. Therefore, the detection of surface defects in stamped parts is of significant importance in the production process. Surface imperfections such as scratches, dents, and cracks can degrade product performance, particularly in aerospace, by compromising structural precision and integrity and accelerating wear and corrosion. Such imperfections may lead to component failures, decreased aerodynamic performance, and disrupted manufacturing processes. This study proposes a more effective deep learning-based detection system built on YOLOv8 augmented with FFA-Net and Gold-YOLO. This model provides high accuracy and stability in defect detection under harsh conditions, facilitating safety, reliability, and quality in industrial use. Traditional surface defect detection methods for stamped parts rely mainly on manual visual inspection, which is inefficient and easily influenced by environmental factors, operator fatigue, and subjective biases, leading to misjudgments and omissions. Moreover, manual inspection has limited capability in identifying subtle defects, especially fine defects such as surface scratches and microcracks, which are challenging to detect. Additionally, traditional manual inspection methods cannot meet the speed and precision requirements of large-scale production [2].
Deep learning-based techniques have greatly enhanced the identification of slight surface imperfections like scratches and microcracks, which are frequently difficult for conventional inspection methods. In contrast to manual or classical image processing techniques based on hand-crafted features and pre-defined thresholds, deep learning models, especially those based on convolutional neural networks (CNNs), automatically extract hierarchical features that represent fine and intricate patterns in defect images. Advanced models such as YOLOv8 augmented with the Feature Fusion Attention Network (FFA-Net) and Gold-YOLO use attention mechanisms and multi-scale feature fusion to focus on significant details and adaptively identify defects of different sizes and shapes. Such approaches exhibit higher robustness under adverse industrial conditions, such as non-uniform illumination, noise, image blur, and partial occlusions, where conventional approaches usually fail. In addition, deep learning models benefit from extensive data augmentation and pretraining methods like Masked Autoencoders (MAE), which enable them to generalize even with minimal labeled defect data. As a result, defects that are subtle, low-contrast, or appear under changing environmental conditions, such as microcracks, fine scratches, pits, and irregular dents, can be accurately identified by these deep learning methods. These models achieve good detection accuracy with few false positives and missed defects, surpassing manual inspection and classical algorithms that are sensitive to noise and lighting. Altogether, the utilization of enhanced deep learning techniques provides more accurate, real-time automated defect inspection in industrial manufacturing, relieving key challenges in quality assurance where imperfections can have severe effects on functionality as well as aesthetics.
With the development of machine vision and deep learning technologies, automated defect detection methods based on computer vision have gradually become an important means of detecting surface defects in stamped parts. Deep learning, particularly Convolutional Neural Networks (CNNs), has made significant progress in image processing and pattern recognition. By training deep neural networks, computers can learn the characteristics of surface defects from a large dataset of images and perform automated defect detection. Compared to traditional methods, deep learning-based detection methods offer higher accuracy and robustness, effectively identifying different types of defects, especially improving the ability to recognize small and complex defects [3–8].
In recent years, deep learning-based methods for detecting surface defects in stamped parts have been widely applied. Many researchers have adopted technologies such as CNNs, Generative Adversarial Networks (GANs), and transfer learning, combined with image processing and data augmentation techniques, to enhance the accuracy and efficiency of defect detection [9–12]. These methods not only automate the detection of surface defects in stamped parts but also provide real-time results, significantly improving production efficiency and effectively reducing labor costs. Applying a deep learning-driven surface defect detection model that combines YOLOv8 with FFA-Net and Gold-YOLO in manufacturing environments can help lower labor expenses, enhance production efficiency, and significantly reduce defects. Conventional manual inspection is slow and inefficient and causes errors due to operator fatigue and subjective decision-making. Automated defect detection can cut inspection labor by as much as 70–90%, translating into significant savings in labor expenses. The model delivers high precision and recall, with strong performance even under difficult conditions. This supports real-time defect detection, a 10–30% improvement in throughput, and reduced downtime. The improved detection accuracy also minimizes missed defects and false alarms, resulting in a 20–40% decrease in defective output. This saves costs on scrap, warranty claims, and rework, and enhances customer satisfaction.
Although deep learning methods have made significant progress in defect detection for stamped parts, challenges remain. Particularly, under different stamping processes and surface conditions, defects exhibit a wide variety of types and manifestations, making it a pressing issue to design deep learning models with high generalization capabilities. Moreover, defect detection is not just a classification problem but also involves precise localization and feature extraction, which places higher demands on the accuracy and computational efficiency of algorithms. Explainability tools and methods such as Grad-CAM (Gradient-weighted Class Activation Mapping) help build confidence in defect detection models by visually highlighting the regions of an image that contribute most to the model's decision. By generating heatmaps that show which parts of a stamped part's surface the deep learning model focuses on to identify defects, these tools make the model's inner workings more transparent. This interpretability allows engineers to verify that the model is attending to actual defect areas rather than irrelevant background features, thereby increasing trust in its reliability and robustness in industrial defect detection tasks. The research on identifying surface defects in stamped components through enhanced deep learning adds to the Special Issue of Robotic Process Automation for Smarter Devices in Manufacturing by promoting automated quality assessment. Utilizing deep learning methods, the study improves the precision and effectiveness of defect detection in manufacturing processes. This allows robotic systems and intelligent devices to conduct real-time monitoring with little human involvement. As a result, it fosters the creation of smarter, automated manufacturing systems that enhance product quality and lower production expenses. 
The incorporation of AI-powered defect identification corresponds with the aim of more intelligent and autonomous manufacturing automation [13].
To further enhance the performance of surface defect detection for stamped parts, this paper proposes an improved deep learning-based approach. The method combines the advantages of image preprocessing, feature enhancement, and deep neural networks to improve detection accuracy and robustness, addressing the issues of insufficient data and difficulty in feature extraction that existing deep learning methods face in practical applications. Through experimental validation, this study demonstrates the effectiveness of this approach in detecting surface defects in stamped parts, aiming to provide support for subsequent industrial applications. The use of deep learning combined with metaheuristic optimization, as verified by Rajya Lakshmi Gudivaka (2019), played a significant role in shaping our approach. The development of a YOLOv8-based model improves accuracy, reduces false detections, and supports real-time, automated defect detection in complex industrial manufacturing settings [14]. The work of Dinesh Kumar Reddy Basani (2024) supports our proposed approach by demonstrating the effectiveness of multi-scale feature learning and data fusion in improving fault detection accuracy, highlighting the benefits of enhanced feature representation in deep learning models. Our study adopts their strategy for detecting surface defects in stamped parts. By leveraging improved deep learning techniques, our research aims to achieve robust and accurate visual inspection [15].
2 Dataset creation
2.1 Image acquisition hardware
In this study, the high-performance CMOS machine vision camera, Hikvision MV-CS200-10UC, with a resolution of 5472 × 3648 pixels, is selected as the image acquisition hardware. The lens is equipped with the Hikvision MVL-KF1224M-25MP industrial camera FA lens, offering high-precision image-capturing capabilities. To ensure image quality, the lighting source chosen is the Hikvision MV-LRSS-H-300-W 300 mm shadowless ring light, which provides uniform illumination to prevent image quality degradation due to uneven lighting. The entire acquisition platform is shown in Figure 1, and the experimental stamped part style is shown in Figure 2.
Fig. 1 Experimental equipment diagram.
Fig. 2 Stamped parts used in the experiment.
2.2 Original image preprocessing
To enhance the generalization ability of the dataset and the robustness of the model [9–16], this study applied various data augmentation techniques to the 103 original images. These techniques include, but are not limited to, grayscale conversion, edge enhancement, random rotation, random flipping, random brightness adjustment, random contrast adjustment, addition of Gaussian noise, random blurring, and random scaling. By applying these methods in isolation or in random combinations, the original dataset was expanded to 510 images. As shown in Figure 3, the preprocessed images exhibit a diverse range of samples. Representativeness of the training data is another critical factor in the success and generalization ability of deep learning models, especially in industrial applications like defect inspection of stamped parts. The training data must capture the wide variability and diversity of defects such as scuffs, scratches, and pits, which differ in shape, size, texture, and severity. Further, industrial settings bring varied challenges such as changing lighting, noise interference, image blur, and occlusion, which all need to be incorporated into the training data to guarantee robust model performance. Although data augmentation methods like grayscale conversion, edge enhancement, random rotation, flipping, brightness and contrast adjustment, noise injection, and blurring artificially enhance the diversity of the dataset, they cannot completely replace a large, naturally diverse dataset covering all pertinent defect types and environmental conditions. Without a representative training set, models risk overfitting to particular attributes that appear in training images and may fail to generalize to actual production environments.
This work tackles these problems through the integration of sophisticated preprocessing and feature enhancement modules such as FFA-Net, which improve image clarity and mitigate environmental influences, thus increasing the effective representativeness of the training data. However, limitations remain because of the relatively small size of the initial dataset, possible imbalance in defect type representation, and possible domain shifts during model deployment in different production environments. Therefore, for robust industrial defect detection, a balanced dataset representing multiple real-world conditions is necessary to complement augmentation techniques and strong model architectures.
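As a rough illustration of the augmentation pipeline described above, the following minimal NumPy sketch applies random combinations of rotation, flipping, brightness and contrast adjustment, and Gaussian noise to expand a small set of images several-fold. The function name `augment` and all parameter ranges are illustrative assumptions, not the exact settings used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Apply a random combination of the augmentations described above
    to one H x W grayscale image with values in [0, 255]."""
    out = img.astype(np.float64)
    if rng.random() < 0.5:                         # random 90-degree rotation
        out = np.rot90(out, k=int(rng.integers(1, 4)))
    if rng.random() < 0.5:                         # random horizontal flip
        out = out[:, ::-1]
    if rng.random() < 0.5:                         # random brightness shift
        out = out + rng.uniform(-30, 30)
    if rng.random() < 0.5:                         # random contrast scaling
        out = (out - out.mean()) * rng.uniform(0.7, 1.3) + out.mean()
    if rng.random() < 0.5:                         # additive Gaussian noise
        out = out + rng.normal(0.0, 5.0, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)

# Expand a small set of originals several-fold, as in the 103 -> 510 expansion:
originals = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(3)]
augmented = [augment(img) for img in originals for _ in range(4)]
```

Because each call samples its transformations independently, repeated passes over the same originals produce distinct samples, which is what allows a 103-image set to grow into a more diverse 510-image set.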
Fig. 3 Generalized dataset.
2.3 Dataset labeling process
This dataset covers three common types of defects in stamped parts: scratches, scuffs, and pits. Representative image examples of each defect type are provided, with scratches, scuffs, and pits shown in Figures 4a, b, and c, respectively. The labeling work was performed using the ALPHA MAKE SENSE software, and Figure 5 presents the labeled samples of the dataset.
Fig. 4 Three types of defects.
Fig. 5 Defect annotation using MAKE SENSE.
3 Approach to the improved model
3.1 YOLOv8 model baseline and improved model [17]
The YOLOv8 baseline training model is shown in Figure 6.
Fig. 6 YOLOv8 baseline model.
3.2 Key improvements
3.2.1 FFA-Net [18]
In actual production line environments, stamped parts are conveyed through a conveyor belt after stamping, and during this process, the products may be exposed to different lighting angles. Due to variations in lighting conditions, environmental factors, and artificial illumination, the surface defects of stamped parts may present different characteristics. These factors can pose significant challenges for traditional defect detection methods, affecting both the accuracy and reliability of detection. To address these issues, this paper introduces FFA-Net (Feature Fusion Attention Network) to improve image quality and defect detection accuracy. The stamping process in manufacturing faces challenges in surface defect detection due to varying environmental conditions. These include lighting variation, noise interference, and image blurring, which can obscure defects or lead to false detections. Traditional detection models, like the baseline YOLOv8, struggle to perform under these inconsistencies. To address these issues, the study incorporates the Feature Fusion Attention Network (FFA-Net) to enhance image clarity and eliminate “haze” caused by inconsistent lighting conditions. The improved model, which integrates FFA-Net and Gold-YOLO, maintains high detection accuracy even when subjected to high noise levels. The enhanced model also retains effective defect recognition under blurred conditions using advanced attention mechanisms. Additionally, the improved YOLOv8 model, augmented with attention-based and multi-layer feature integration techniques, exhibits better ability to detect defects even when they are partially occluded. The integration of FFA-Net and Gold-YOLO into the YOLOv8 framework significantly enhances its performance in industrial surface defect detection. 
FFA-Net improves image clarity and robustness by addressing lighting variations and environmental interference through attention mechanisms at the channel and pixel levels, local residual learning, and multi-level feature fusion. This ensures clearer input images with emphasized defect features. Gold-YOLO strengthens internal feature learning using a novel Aggregation-Distribution mechanism that aligns, fuses, and redistributes multi-scale features, alongside MAE-style pretraining that improves generalization in data-scarce scenarios. Together, these enhancements enable YOLOv8 to achieve higher accuracy and reliability, making it more effective for defect detection in complex, real-world production environments. FFA-Net is a feature fusion attention network that improves image clarity under varying lighting conditions by addressing degradation caused by uneven illumination and environmental factors like haze. It uses channel attention to focus on informative feature channels, filtering out irrelevant information, and pixel attention to refine local features to handle uneven haze distribution and lighting variations. This allows the network to restore intricate details, especially in poorly lit or shadowed areas. Local residual learning enables the model to bypass less affected regions and concentrate computational resources on areas with more significant degradation, such as shadows or highlights. By selectively enhancing global and local features, FFA-Net significantly improves image clarity, resulting in more reliable and accurate surface defect detection in complex and fluctuating lighting conditions. The hybrid LSTM-GA and HS-CS framework proposed by Durai Rajesh Natarajan and his team (2025) significantly advances our industrial surface defect detection approach by enhancing feature optimization, ensuring real-time performance, improving reliability and operational efficiency [19].
The fundamental idea of FFA-Net is to directly restore images affected by lighting and environmental factors through a feature fusion attention mechanism, removing the “haze” effect from the images to enhance their clarity. The network architecture achieves efficient image dehazing through three key components:
The channel attention weight can be calculated using the following formula:

A_c = σ(MLP(GlobalAvgPool(F)))

where A_c represents the channel attention weight, σ is the activation function (e.g., ReLU), MLP represents the multilayer perceptron, GlobalAvgPool(F) is the global average pooling operation, and F is the input feature map. The channel attention mechanism helps the network focus on the important channel features in the image, thereby improving the image dehazing effect.
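A minimal NumPy sketch of the channel attention computation A_c = σ(MLP(GlobalAvgPool(F))) may look as follows; the squeeze-and-excite-style two-layer MLP with a channel bottleneck and the random weights are illustrative assumptions, not the exact FFA-Net configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """A_c = sigmoid(MLP(GlobalAvgPool(F))) for a feature map F of shape (C, H, W).
    The MLP is a two-layer bottleneck: W2 @ relu(W1 @ pooled)."""
    pooled = F.mean(axis=(1, 2))            # GlobalAvgPool -> vector of length C
    hidden = np.maximum(W1 @ pooled, 0.0)   # ReLU inside the MLP
    A_c = sigmoid(W2 @ hidden)              # per-channel weights in (0, 1)
    return F * A_c[:, None, None]           # reweight each channel of F

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
F = rng.normal(size=(C, H, W))
W1 = rng.normal(size=(C // 2, C))           # squeeze: C -> C/2
W2 = rng.normal(size=(C, C // 2))           # excite: C/2 -> C
F_att = channel_attention(F, W1, W2)
```

Since each attention weight lies in (0, 1), the reweighted map never amplifies a channel; it only suppresses less informative ones.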
Due to the uneven distribution of haze across different pixel locations, pixel-level attention helps the network more effectively remove the haze from the image and restore details. The calculation formula for pixel attention is as follows:

A_p = σ(W · F)

where A_p represents the pixel attention weight, W is the learned weight matrix, and F is the input feature map. The pixel attention mechanism helps to delicately process the local features within the image, further improving the image quality.
Local residual learning is introduced primarily to handle the information from different regions of the image, especially in low-light or light haze conditions. Local residual learning bypasses unimportant areas through multiple local residual connections, enabling the network to focus more on parts of the image with higher information content. The calculation formula for local residual learning is:

F_out = F_in + L(F_in)

where F_out is the output feature map, F_in is the input feature map, and L represents the residual learning process.
The attention-based multi-level feature fusion is typically formulated as follows:

F_fused = Σ_{i=1}^{N} A_i ⊙ F_i

where A_i represents the attention weight for the feature map at the i-th layer, F_i is the feature map at the i-th layer, F_fused is the fused feature map after attention-based multi-level fusion, and N is the total number of layers in the network.
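The attention-weighted fusion F_fused = Σ A_i ⊙ F_i can be sketched in NumPy as follows; forming each attention map from a learned element-wise weight map passed through a sigmoid is a simplifying assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(features, weight_maps):
    """F_fused = sum_i A_i ⊙ F_i with per-pixel attention A_i = sigmoid(W_i ⊙ F_i).
    All N feature maps are assumed to share the same spatial size here."""
    fused = np.zeros_like(features[0])
    for F_i, W_i in zip(features, weight_maps):
        A_i = sigmoid(W_i * F_i)   # attention map in (0, 1), same shape as F_i
        fused += A_i * F_i         # element-wise product, then accumulate
    return fused

rng = np.random.default_rng(1)
features = [rng.normal(size=(4, 4)) for _ in range(3)]      # N = 3 layers
weight_maps = [rng.normal(size=(4, 4)) for _ in range(3)]   # learned weights (random here)
F_fused = fuse(features, weight_maps)
```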
FFA-Net can effectively address image quality issues caused by uneven lighting or environmental changes, enhancing the robustness and accuracy of surface defect detection for stamped parts. By leveraging feature fusion attention mechanisms, FFA-Net improves image clarity and reduces the impact of lighting and environmental variations, leading to more reliable and precise defect detection in real-world production environments. The pixel-level attention feature in FFA-Net provides various distinct benefits for analyzing local features during defect identification. It allows the network to carefully handle local features by giving attention weights to specific pixels, aiding in the recovery of details that could be hidden by inconsistent lighting or haze. Since haze and lighting distortions vary in distribution throughout an image, pixel-level attention enables the network to selectively concentrate on local regions with greater distortions, thereby effectively eliminating these artifacts. This results in improved image quality and sharper visuals, essential for effectively identifying minor and subtle surface flaws. Furthermore, pixel-level attention enhances the model's resilience by allowing it to more effectively address environmental changes and local anomalies in the image, thereby making the defect detection system more dependable in actual industrial settings with intricate lighting and interference situations.
3.2.2 Gold-YOLO [20]
Gold-YOLO significantly enhances object detection performance by integrating convolution operations with self-attention mechanisms to extract and efficiently fuse information across different network layers. This hybrid approach enables the model to capture both local and global features more effectively.
Moreover, Gold-YOLO leverages multi-scale features for detection, ensuring better adaptability to objects of various sizes. Notably, it is the first model in the YOLO series to adopt MAE (Masked Autoencoders)-style pretraining. This pretraining technique greatly improves the model's learning efficiency and accuracy by enabling it to reconstruct missing parts of input images, thereby encouraging a deeper understanding of image structure during training. As a result, Gold-YOLO exhibits stronger generalization capabilities and higher detection accuracy, especially in complex or variable visual environments. The MAE-style pretraining in the Gold-YOLO model enhances its ability to detect surface defects by enabling unsupervised learning through image reconstruction, which improves feature representation and understanding of image structure. This approach boosts generalization, especially under varying industrial conditions, and is highly effective even with limited labeled data. Combined with Gold-YOLO's multi-scale feature fusion mechanism, MAE-style pretraining significantly improves detection accuracy, robustness, and efficiency, making the model well-suited for complex industrial defect detection tasks. Gold-YOLO improves detection efficiency in environments with limited labeled data by combining convolution with self-attention to better capture both local and global features. It uses an Aggregation-Distribution mechanism that aligns and fuses multi-layer features, enhancing the model's ability to integrate important information from different levels. Additionally, Gold-YOLO employs MAE-style (Masked Autoencoders) self-supervised pretraining, allowing the model to learn image structure by reconstructing missing parts, which boosts learning efficiency and generalization without needing large labeled datasets. This makes Gold-YOLO more accurate and robust, especially in complex or data-scarce industrial defect detection scenarios.
The Aggregation-Distribution Mechanism (GD) is one of the core innovations of Gold-YOLO, aimed at addressing the challenge of feature fusion across different levels. Traditional object detection models often struggle with effectively integrating multi-level information. The GD mechanism overcomes this challenge by introducing two key modules: the Feature Alignment Module (FAM) and the Information Fusion Module (IFM).
Assuming the input feature map is F_i ∈ R^(H_i×W_i×C_i), where H_i, W_i, and C_i represent the height, width, and number of channels of the feature map at the i-th layer, the goal of feature alignment is to align the feature map to a unified scale space using a transformation matrix T_i. The feature alignment process can be expressed as:

F_aligned,i = T_i(F_i)

where T_i is the transformation matrix and F_aligned,i is the aligned feature map at the i-th layer.
The Information Fusion Module (IFM) is responsible for performing a weighted fusion of the aligned features from different layers. Through the weighted mechanism, IFM assigns higher weights to more important features, thereby enhancing the model's detection accuracy. The information fusion process is carried out through a weighted summation, expressed as:

F_fused = Σ_{i=1}^{N} α_i · F_aligned,i

where α_i is the weight of the feature map at the i-th layer, F_aligned,i is the aligned feature map at the i-th layer, F_fused is the final fused feature map, and N is the total number of layers in the network.
The Information Injection Module is responsible for distributing the fused feature information back to the various layers of the network. This allows the model to continuously benefit from the fusion results across different layers, thereby enhancing the robustness and accuracy of the detection. The information injection process can be expressed as:

F_out,i = F_in,i + Inject(F_fused)

where F_out,i is the output feature map after information injection at the i-th layer, F_in,i is the input feature map at the i-th layer, and Inject represents the Information Injection Module, which distributes the fused feature information back to the different layers of the network.
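The align-fuse-inject pipeline of the GD mechanism can be sketched end to end in NumPy as follows; nearest-neighbour resizing stands in for the learned transformation T_i, and the scalar weights α_i are fixed here purely for illustration.

```python
import numpy as np

def align(F, target_hw):
    """FAM stand-in: nearest-neighbour resize of F (C, H, W) to a unified scale."""
    C, H, W = F.shape
    th, tw = target_hw
    rows = np.arange(th) * H // th          # nearest source row per target row
    cols = np.arange(tw) * W // tw          # nearest source column per target column
    return F[:, rows][:, :, cols]

def gather_distribute(features, alphas):
    """IFM + injection: F_fused = sum_i alpha_i * F_aligned_i, then
    F_out_i = F_in_i + Inject(F_fused) resized back to each layer's scale."""
    target = features[-1].shape[1:]                      # unify at the coarsest scale
    aligned = [align(F, target) for F in features]       # FAM: align all layers
    fused = sum(a * F for a, F in zip(alphas, aligned))  # IFM: weighted fusion
    return [F + align(fused, F.shape[1:]) for F in features]  # inject back per layer

rng = np.random.default_rng(2)
features = [rng.normal(size=(4, 16, 16)),   # fine-scale layer
            rng.normal(size=(4, 8, 8)),     # mid-scale layer
            rng.normal(size=(4, 4, 4))]     # coarse-scale layer
outs = gather_distribute(features, [0.5, 0.3, 0.2])
```

Each output retains its layer's original resolution while carrying the globally fused information, which is the point of the aggregation-distribution design.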
To further improve the learning efficiency and accuracy of the model, Gold-YOLO adopts the MAE-style pretraining method during its training process. MAE (Masked Autoencoders) is an unsupervised learning approach that involves masking a part of the input image and then reconstructing the missing part using the network. In object detection, the use of MAE-style pretraining enables the network to have stronger feature representation capabilities and higher learning efficiency when dealing with complex visual tasks.
The pretraining step of MAE can be expressed as:

F_pred = f(F_mask),  L_MAE = E[ ||F_pred − F||² ]

where F_pred is the predicted feature (the reconstructed part of the image), F_mask is the masked feature (the image with a part of it hidden), f denotes the reconstruction network, F is the original input, L_MAE is the loss function used in MAE pretraining (typically a reconstruction loss), and E denotes the expectation operator, indicating that the network aims to minimize the difference between the original input and the reconstructed input.
Through this pretraining strategy, Gold-YOLO effectively improves object detection accuracy, especially in situations where large amounts of labeled data are scarce. The model's generalization ability is significantly enhanced, enabling it to perform well in challenging environments with limited supervision.
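A toy NumPy sketch of one MAE-style pretraining step is given below: mask a fraction of the patches, reconstruct them, and compute the reconstruction loss on the masked positions only. The "decoder" here is a trivial stand-in (predicting every masked patch as the mean of the visible ones) rather than a learned network, so only the masking-and-loss structure, not the model, mirrors MAE.

```python
import numpy as np

rng = np.random.default_rng(3)

def mae_step(patches, mask_ratio=0.75):
    """One MAE-style step on an (n_patches, dim) array: hide mask_ratio of the
    patches, 'reconstruct' them, and return the masked reconstruction loss
    L_MAE = mean || F_pred - F ||^2 over the masked patches."""
    n = len(patches)
    n_mask = int(n * mask_ratio)
    masked_idx = rng.choice(n, size=n_mask, replace=False)
    visible_idx = np.setdiff1d(np.arange(n), masked_idx)
    # Stand-in decoder: predict each masked patch as the mean of visible patches.
    prediction = patches[visible_idx].mean(axis=0)
    errors = patches[masked_idx] - prediction
    return np.mean(errors ** 2)

patches = rng.normal(size=(16, 8))   # 16 image patches, 8-dimensional each
loss = mae_step(patches)
```

In a real pretraining loop this loss would be backpropagated through an encoder-decoder; the key property shown here is that supervision comes from the image itself, so no defect labels are needed.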
The innovation of Gold-YOLO lies in its use of the Aggregation-Distribution Mechanism (GD) for efficient information fusion. This mechanism fully utilizes multi-scale features from different network layers, allowing for better feature integration. Combined with the MAE-style pretraining method, Gold-YOLO achieves further improvements in object detection accuracy and efficiency. The model is capable of providing high-precision detection results while maintaining low latency, making it particularly well-suited for automated defect detection tasks in industrial production environments. The integration of YOLOv8 with FFA-Net and Gold-YOLO forms a powerful deep-learning framework for identifying surface flaws in stamped components. YOLOv8 offers a quick and precise detection framework ideal for industrial uses. FFA-Net improves image sharpness by tackling lighting and environmental issues using attention mechanisms and feature integration. Gold-YOLO enhances feature representation and generalization by employing multi-scale fusion and pretraining similar to MAE. Collectively, they greatly enhance precision, recall, and total detection accuracy. The model operates consistently across different conditions such as noise, blur, occlusion, and changes in lighting. This combined method is extremely efficient for immediate, precise flaw identification in production settings.
3.2.3 Focal Modulation [21]
The Focal Modulation mechanism introduces the Focal Attention Module (FAM) to adjust and weigh contextual information at different scales. This allows the network to remain sensitive to local details while enhancing the understanding of global structure when handling multi-scale information. The process of focal modulation can be broken down into the following steps:
Feature Extraction: First, deep convolution layers are used to extract image features at different scales. These features capture both local detail information and global structural information.
Focal Enhancement: Next, the extracted features are weighted using the focal attention mechanism, emphasizing the key parts of the image, such as defect areas, edges, and other important details.
Feature Fusion: Finally, the weighted feature maps are fused to generate the final output feature map.
The weighted operation in focal modulation can be expressed as:

F_mod = Σ_{i=1}^{N} A_i ⊙ F_raw,i

where F_mod is the modulated feature map after focal modulation, F_raw,i is the original image feature map at the i-th scale, A_i is the focal attention weight for the i-th scale, and ⊙ represents element-wise multiplication. The focal attention weight A_i is learned through the network and can be expressed as:

A_i = σ(W_i · F_raw,i)

where σ is the activation function (such as ReLU), W_i is the learned weight matrix, and F_raw,i is the input feature map at the i-th scale.
Focal modulation, through the combination of focal contextualization and focal modulation mechanisms, enhances the network's ability to understand multi-scale information in images. This makes the model more flexible and efficient when handling different types of visual information. The mechanism is particularly suitable for defect detection tasks, as it retains critical local details while also maintaining a strong awareness of global structural features. This method helps improve detection accuracy and robustness, especially for surface defects on stamped parts in complex industrial environments. The Focal Modulation module is a feature in defect detection models that adjusts and weighs contextual information at multiple spatial scales. It aims to improve detection accuracy and robustness in complex industrial environments where defects vary in size and appearance. However, experimental results show that the module has several limitations, including lower overall performance, significant fluctuations in precision and recall, and potential overfitting due to its strong emphasis on local features. This suggests that while the module is promising in theory, it underperformed in multi-scale defect detection tasks and requires further optimization for effective industrial defect detection. Industrial defect detection systems can be improved by incorporating post-detection visualization and feedback mechanisms. This involves overlaying bounding boxes on images or video frames and marking detected defects with distinct colors, labels, and confidence scores. This allows operators to quickly identify and classify defects like scratches, scuffs, and pits. Attention heatmaps derived from feature fusion attention mechanisms can highlight image regions that influenced detection decisions, providing transparency into the model's reasoning. Severity or defect size indicators can help operators prioritize issues by varying bounding box thickness or color intensity. 
An interactive feedback interface allows operators to confirm detections, correct false positives, or annotate missed defects, enabling data collection for iterative model improvement and retraining. Features like adjustable confidence thresholds and contextual explanations of defect categories support better operator comprehension. A comprehensive analytics dashboard presenting historical defect trends, operator feedback statistics, and correlations with environmental factors can provide deeper insights into production quality and model performance. Implementing these mechanisms through tools like OpenCV and web-based visualization frameworks ensures real-time feedback in production environments or batch reporting for offline analysis. Integrating visualization and feedback loops into the defect detection workflow improves defect classification accuracy and operational efficiency.
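As a sketch of such an overlay (the detection tuples, palette, and `draw_box` helper are hypothetical; a production system would more likely use OpenCV's `cv2.rectangle` and `cv2.putText`), bounding boxes with confidence-scaled thickness might be drawn as follows:

```python
import numpy as np

def draw_box(image, box, color, thickness=2):
    """Draw a rectangular outline on an RGB image array (H, W, 3).

    box is (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    image[y1:y1 + thickness, x1:x2] = color   # top edge
    image[y2 - thickness:y2, x1:x2] = color   # bottom edge
    image[y1:y2, x1:x1 + thickness] = color   # left edge
    image[y1:y2, x2 - thickness:x2] = color   # right edge
    return image

# Hypothetical detections: (box, label, confidence).
detections = [((10, 10, 60, 40), "scratch", 0.91),
              ((70, 50, 110, 90), "pit", 0.78)]
palette = {"scratch": (255, 0, 0), "pit": (0, 255, 0)}

frame = np.zeros((128, 128, 3), dtype=np.uint8)
for box, label, conf in detections:
    # Thicker boxes flag higher-confidence detections for the operator.
    t = 3 if conf > 0.85 else 1
    draw_box(frame, box, palette[label], thickness=t)
```

Rendering severity as box thickness keeps the visual channel simple while still letting operators triage at a glance.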
4 Experimental comparison and analysis
4.1 Experimental setup and evaluation criteria
4.1.1 Experimental setup
The hardware setup for the project includes the following components:
Image Acquisition Hardware:
Camera: Hikvision MV-CS200-10UC high-performance CMOS machine vision camera with a resolution of 5472 × 3648 pixels.
Lens: Hikvision MVL-KF1224M-25MP industrial camera FA lens, offering high precision for image capture.
Lighting: Hikvision MV-LRSS-H-300-W 300mm shadowless ring light, ensuring even illumination to prevent image quality degradation due to uneven lighting.
Computing Hardware:
GPU: NVIDIA RTX 4080 Super with 16GB memory, designed for high-performance image processing.
CPU: AMD Ryzen 9 7950X, a high-performance processor.
Memory: 32GB DDR5 RAM for efficient data processing.
Storage: 1TB NVMe SSD for fast data access and storage.
This setup provides a robust environment for high-resolution image capture and efficient processing of complex defect detection tasks. The camera, lens, and shadowless ring light listed above were chosen to deliver high-precision, uniformly lit images even under challenging visual conditions such as low light and glare. On the software side, the FFA-Net module mitigated the effects of inconsistent lighting and glare by enhancing image clarity, and image preprocessing techniques were applied to simulate real-world variations. The combination of high-resolution imaging hardware, consistent lighting, and these software-based enhancements allowed the improved model (YOLOv8 + FFA + Gold-YOLO) to achieve significantly higher precision, recall, and mean average precision than the baseline YOLOv8 model across all lighting conditions.
Software Setup:
Operating System: Windows 11 Pro.
Drivers and Libraries: CUDA 12.3 or higher, supporting GPU acceleration; cuDNN, which optimizes neural network training performance.
Development Tools: Anaconda, PyCharm.
This software environment ensures efficient training and development for machine learning and deep learning tasks, especially in defect detection and image processing.
The results of the model operation are shown in Figure 7.
Fig. 7 Output results of the model after running.
4.1.2 Single-module experimental results
A comparative analysis of the performance of three individual improvement modules based on YOLOv8 was conducted, namely the introduction of the FFA-Net, Gold-YOLO, and FocalModulation modules. The evaluation metrics include Precision, Recall, mAP50, and mAP50-95. The experimental results are shown in Figure 8.
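For reference, the core of these metrics can be sketched in a few lines of Python (a simplified illustration, not the evaluation code used in the study): precision and recall are ratios over true positives, false positives, and false negatives, while mAP50 counts a detection as correct when its IoU with the ground-truth box is at least 0.5, and mAP50-95 averages over IoU thresholds from 0.5 to 0.95.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical counts for one defect class.
p, r = precision_recall(tp=45, fp=5, fn=10)
print(round(p, 2), round(r, 2))  # 0.9 0.82
```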
The experimental metrics of the three models under different indicators are shown in Table 1.
From the experimental results, YOLOv8 + FFA-Net delivers the best overall performance. Its Precision rises quickly and stabilizes around 0.9, while Recall eventually reaches 0.8, showing good detection accuracy and recall ability. In addition, mAP50 and mAP50-95 reach approximately 0.85 and 0.45, respectively, indicating strong detection accuracy and multi-scale object recognition. This performance is attributed to FFA-Net's enhancement of image detail and quality, particularly its improved robustness under complex lighting and interference conditions. For context, the base YOLOv8 model was adapted to surface defect inspection of stamped parts by embedding the Feature Fusion Attention Network (FFA-Net) and Gold-YOLO modules, which address the main challenges of the manufacturing setting: defects such as scratches, scuffs, and pits are hard to detect precisely under changing lighting, environmental noise, and the thin geometry of the defects themselves. The initial dataset of 103 images was augmented with grayscale conversion, edge extraction, rotation, flipping, brightness and contrast adjustment, noise, blur, and scaling. FFA-Net improves image quality through a feature fusion attention mechanism that highlights key regions and adapts to local illumination changes, while Gold-YOLO improves multi-layer feature fusion through its Aggregation-Distribution mechanism and MAE-style pretraining. The integrated YOLOv8 + FFA-Net + Gold-YOLO model outperformed each module individually in precision, recall, and mean average precision, and rigorous robustness testing confirmed its stability and detection accuracy in challenging real-world environments.
Real-time defect detection on factory production lines is supported by capable GPU hardware with CUDA and cuDNN acceleration.
In contrast, YOLOv8 + Gold-YOLO performs slightly less well. The Precision is slightly lower than FFA-Net but remains stable, with the final Recall around 0.75, mAP50 around 0.8, and mAP50-95 reaching approximately 0.4. Gold-YOLO enhances the model's multi-layer information fusion capabilities through its aggregation-distribution mechanism and MAE-style pretraining, making it suitable for scenarios with limited data, which aligns with the defect detection focus of this study.
The performance of YOLOv8 + FocalModulation is the weakest. Although its Precision shows a peak improvement locally, it fluctuates significantly and has a lower average value. The highest Recall only reaches 0.5, and mAP50 and mAP50-95 are approximately 0.55 and 0.3, respectively, clearly lagging behind the other two. This might be due to the overemphasis on local details by FocalModulation, leading to a decline in the model's overall generalization ability.
In summary, among the three improvement modules, FFA-Net is the most effective at enhancing surface defect detection performance for stamped parts: it improves image clarity and reduces the effects of uneven lighting and environmental interference, leading to higher precision and recall. Gold-YOLO excels in multi-scale feature extraction and fusion through its self-attention mechanisms and MAE-style pretraining, enhancing generalization and detection accuracy, which suits the small-sample data studied in this paper. The FocalModulation module, although it improves sensitivity to local detail, tends to overfit on local features, reducing recall and overall stability; it would need further optimization before being applied to industrial-level detection tasks. The final detection model therefore combines the complementary strengths of FFA-Net and Gold-YOLO, yielding a robust and accurate detector under challenging industrial conditions such as lighting variation, noise, blur, and occlusion.
For this reason, the schematic diagram of the improved model in this study is shown in Figure 9.
Fig. 8 Comparison of experiments with three individual improvement modules.
Table 1 Experimental metrics for three models under different indicators.
Fig. 9 Improved model in this study.
4.1.3 Combined improved model design and ablation experiment
To further enhance the performance of YOLOv8 in surface defect detection of stamped parts, various model combinations were designed and ablation experiments were conducted. By analyzing the role of each module individually, we can identify the contribution of different modules to the overall detection performance. The YOLOv8 model is ideal for identifying surface flaws in stamped components because of its rapid, precise real-time detection and multi-scale feature extraction that captures defects of different sizes. Augmented with the Feature Fusion Attention Network (FFA-Net), it enhances image sharpness and resilience to inconsistent lighting and disturbances. The incorporation of Gold-YOLO enhances advanced multi-level feature fusion and MAE-style pretraining, improving generalization and precision, particularly when data is scarce. Collectively, these improvements allow the model to manage practical industrial issues like varying lighting, noise, blur, and occlusions, while proficiently identifying minor defects like fine scratches and microcracks, thereby making it extremely efficient for automated defect detection in intricate production settings.
The experiments compared the following model combinations. Since the individual module analysis has already been performed, the YOLOv8 + FocalModulation setup was removed during the ablation experiment comparison phase, as shown in Table 2.
The experimental results are shown in Figure 10.
From the results of the above ablation experiments, it can be seen that the YOLOv8 + FFA + Gold-YOLO combination performs the best, especially with significant improvements in Precision, Recall, mAP50, and mAP50-95. This indicates that the advantages of FFA in image detail enhancement and Gold-YOLO in multi-layer feature fusion effectively enhance the object detection performance of YOLOv8.
However, after introducing the FocalModulation module, although this module leads to a slight improvement in accuracy, its overall contribution to performance is not as expected. Instead, it causes performance fluctuations at certain stages, with some overfitting issues observed. Therefore, the FocalModulation module did not demonstrate significant optimization effects in this study and may require further optimization or adjustment.
Table 2 Experimental setup.
Fig. 10 Experimental results.
4.1.4 Real-world detection comparison
In the actual production line, the YOLOv8 + FFA + Gold-YOLO model (the final selected improved model) and the YOLOv8 baseline model are used for defect detection of stamped parts, with both trained models applied to 100 images. Several design choices keep the computational load manageable at deployment scale. Gold-YOLO integrates convolution and self-attention mechanisms with multi-scale feature fusion for efficient feature capture, and its Aggregation-Distribution (GD) mechanism aligns and fuses multi-layer features while minimizing redundant computation. FFA-Net's feature fusion attention mechanism concentrates computation on important features and filters out irrelevant information. MAE-style pretraining improves learning efficiency and generalization, reducing the demand for extensive labeled data. High-performance hardware with GPU acceleration, together with robustness against noise, lighting variation, blur, and occlusion, reduces the need for repeated image capture or additional processing. Finally, the FocalModulation module was excluded because of its potential overfitting and performance fluctuations. Together these choices support efficient, scalable deployment of the model in large-scale industrial settings.
To evaluate the robustness of the YOLOv8 + FFA + Gold-YOLO combination model in practical applications, the model was tested under various environmental conditions. The tests aimed to simulate potential challenges in stamped part defect detection, such as lighting variations, noise interference, image blurring, and occlusion, to ensure the stability and reliability of the model on the production line.
(1) Illumination variation test
In the actual production process, the imaging of stamped parts may be affected by changes in ambient lighting. To simulate this situation, images were processed under three different lighting conditions: normal lighting, low lighting, and strong lighting. Under each lighting condition, 100 images of stamped parts were tested, and the accuracy, Precision, Recall, and mAP values of each model were calculated.
The results show that the YOLOv8 + FFA + Gold-YOLO model achieves significantly higher accuracy and mAP values under both low and strong lighting than the YOLOv8 baseline model, demonstrating that the improved model is more robust to varying lighting conditions (Table 3).
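The paper does not publish its preprocessing code; a plausible NumPy sketch of how the low- and strong-lighting conditions could be simulated (the gain and gamma values are illustrative assumptions) is:

```python
import numpy as np

def adjust_lighting(image, gain, gamma=1.0):
    """Simulate lighting variation on a uint8 image.

    gain < 1 darkens (low light), gain > 1 brightens (strong light);
    gamma models the camera's nonlinear intensity response.
    """
    img = image.astype(np.float32) / 255.0
    img = np.clip(gain * (img ** gamma), 0.0, 1.0)
    return (img * 255).astype(np.uint8)

image = np.full((64, 64, 3), 128, dtype=np.uint8)  # mid-gray test image
low_light = adjust_lighting(image, gain=0.4)
strong_light = adjust_lighting(image, gain=1.8)
```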
(2) Noise interference test
Gaussian noise of different intensities was added to the original images, with three levels tested: low, medium, and high. The experimental results show that the YOLOv8 + FFA + Gold-YOLO model maintains relatively high detection accuracy even under high noise, outperforming the YOLOv8 baseline model and indicating stronger robustness against noise interference (Table 4).
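A minimal sketch of this noise injection, assuming zero-mean additive Gaussian noise measured in 8-bit intensity units (the sigma values are illustrative, not the paper's settings):

```python
import numpy as np

def add_gaussian_noise(image, sigma, seed=None):
    """Add zero-mean Gaussian noise with standard deviation sigma
    (in 8-bit intensity units) to a uint8 image."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

image = np.full((64, 64), 128, dtype=np.uint8)
# Three intensity levels, mirroring the low/medium/high test conditions.
low = add_gaussian_noise(image, sigma=5, seed=0)
medium = add_gaussian_noise(image, sigma=15, seed=0)
high = add_gaussian_noise(image, sigma=30, seed=0)
```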
(3) Image blur test
In the image blurring test, we simulated out-of-focus and motion blur during the imaging process. We applied two types of blur: Gaussian blur and motion blur, and processed the images with different levels of blurring.
The experimental results show that the YOLOv8 + FFA + Gold-YOLO model maintains relatively high detection accuracy even on blurred images. In contrast, the performance of the YOLOv8 baseline model decreased significantly under strong blurring, demonstrating its vulnerability in such scenarios (Table 5).
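Both blur types can be sketched with separable NumPy convolutions (the kernel sizes and sigma are illustrative assumptions, not the settings used in the study):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_1d(image, kernel, axis):
    """Convolve a 2-D grayscale image with a 1-D kernel along one axis."""
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis, image)

def gaussian_blur(image, size=9, sigma=2.0):
    """Separable Gaussian blur: filter rows, then columns."""
    k = gaussian_kernel(size, sigma)
    return blur_1d(blur_1d(image.astype(np.float32), k, 1), k, 0)

def motion_blur(image, length=9):
    """Horizontal motion blur: a uniform 1-D kernel along the rows."""
    k = np.ones(length) / length
    return blur_1d(image.astype(np.float32), k, 1)

img = np.zeros((32, 32), dtype=np.float32)
img[16, 16] = 255.0  # single bright pixel, an idealized fine-scratch point
soft = gaussian_blur(img)
smeared = motion_blur(img)
```

Gaussian blur spreads the point isotropically, while motion blur smears it along one direction, which is why the two degrade thin defects differently.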
(4) Light occlusion test
To simulate potential occlusion situations that may occur on the production line, we applied lighting occlusion to certain defect areas in the images and evaluated the performance of both models under these conditions.
The results show that the YOLOv8 + FFA + Gold-YOLO model is able to detect defects even under a certain degree of occlusion, while the YOLOv8 baseline model is more prone to false positives and missed detections when the occlusion is severe (Table 6).
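A simple sketch of such an occlusion, masking a rectangular region and measuring how much of the frame is covered (the box coordinates and severity measure are illustrative):

```python
import numpy as np

def occlude(image, box, value=0):
    """Mask a rectangular region (x1, y1, x2, y2) of the image,
    simulating an occlusion over a defect area."""
    occluded = image.copy()
    x1, y1, x2, y2 = box
    occluded[y1:y2, x1:x2] = value
    return occluded

image = np.full((64, 64), 200, dtype=np.uint8)
partial = occlude(image, (10, 10, 30, 30))  # mild occlusion
severe = occlude(image, (0, 0, 64, 48))     # most of the frame covered

# Fraction of pixels occluded, a simple severity measure.
severity = (severe == 0).mean()
```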
Table 3 Illumination variation test.
Table 4 Noise interference test.
Table 5 Image blur test.
Table 6 Light occlusion test data.
5 Conclusion
Through these robustness tests, we can conclude that the YOLOv8 + FFA + Gold-YOLO model demonstrates strong robustness under conditions such as lighting changes, noise interference, image blurring, and occlusion. In contrast, the YOLOv8 baseline model exhibits lower detection accuracy under these extreme conditions and is more prone to false positives and missed detections. Therefore, the improved model is better suited for application in complex industrial production environments, as it provides higher defect detection accuracy under various unstable factors.
Funding
This research was funded by the Natural Science Research Project for higher education institutions in Tianjin Municipality (Grant No. 2019KJ162).
Conflicts of interest
The authors declare no conflicts of interest.
Data availability statement
No datasets were generated or analyzed during the current study.
Author contribution Statement
Tian Xia is responsible for designing the framework, analyzing the performance, validating the results, and writing the article. Lanju Zhou and Lihong Quan are responsible for collecting the information required for the framework, providing software, critical review, and administering the process.
References
- T. Taylor, A. Clough, Critical review of automotive hot-stamped sheet steel from an industrial perspective, Mater. Sci. Technol. 34, 809 (2018) [CrossRef] [Google Scholar]
- Q. Luo, X. Zhang, C. Zhang, L. Wang, Y. Chen, Automated visual defect detection for flat steel surface: a survey, IEEE Trans. Instrum. Meas. 69, 626 (2020) [CrossRef] [Google Scholar]
- X. Fang, H. Wu, J. Shen, Z. Zhou, Research progress of automated visual surface defect detection for industrial metal planar materials, Sensors 20, 5136 (2020) [Google Scholar]
- S.A. Singh, K.A. Desai, Automated surface defect detection framework using machine vision and convolutional neural networks, J. Intell. Manuf. 34, 1995 (2023) [Google Scholar]
- L. Geng, J. Yu, Y. Li, Machine vision detection method for surface defects of automobile stamping parts, Am. Sci. Res. J. Eng. Technol. Sci. 53, 128 (2019) [Google Scholar]
- X. Zheng, R. Liu, H. Yang, Recent advances in surface defect inspection of industrial products using deep learning techniques, Int. J. Adv. Manuf. Technol. 113, 35 (2021) [Google Scholar]
- H. Tian, X. Liu, Y. Wu, Surface defects detection of stamping and grinding flat parts based on machine vision, Sensors 20, 4531 (2020) [Google Scholar]
- M. Prunella, G. Bontempi, N. Di Mauro, Deep learning for automatic vision-based recognition of industrial surface defects: A survey, IEEE Access 11, 43370 (2023) [Google Scholar]
- S.B. Block, B.R. Costa, J.R. da Silva, Inspection of imprint defects in stamped metal surfaces using deep learning and tracking, IEEE Trans. Ind. Electron. 68, 4498 (2020) [Google Scholar]
- J. Bai, L. Chen, F. Liu, A comprehensive survey on machine learning-driven material defect detection: Challenges, solutions, and prospects. arXiv:2406.07880 (2024) [Google Scholar]
- X. Ma, M. Zhang, Y. Wang, Stamping part surface crack detection based on machine vision, Measurement 251, (2025) [Google Scholar]
- L. Wei, D. Huang, Y. Sun, Surface defects detection of cylindrical high-precision industrial parts based on deep learning algorithms: a review, Oper. Res. Forum 5, 58 (2024) [Google Scholar]
- J. Ribeiro, R. Lima, T. Eckhardt, S. Paiva, Robotic process automation and artificial intelligence in industry 4.0-a literature review, Proc. Comput. Sci. 181, 51–58 (2021) [Google Scholar]
- R.L. Gudivaka, R.K. Gudivaka, M. Karthick, Deep learning-based defect detection and optimization in IoRT using metaheuristic techniques and the flower pollination algorithm, Int. J. Eng. Res. Sci. Technol. 15, (2019) [Google Scholar]
- K. Dinesh, Enhanced fault diagnosis in IoT: Uniting data fusion with deep multi-scale fusion neural network, Internet of Things (2024) [Google Scholar]
- T. Kumar, S. Singh, N. Chauhan, Image data augmentation approaches: a comprehensive survey and future directions, IEEE Access (2024) [Google Scholar]
- M. Sohan, S.R. Thotakura, C.V.R. Reddy, A review on yolov8 and its advancements, Int. Conf. Data Intell. Cogn. Inform., Springer, Singapore (2024) [Google Scholar]
- X. Qin, Z. Wang, Y. Bai, FFA-Net: Feature fusion attention network for single image dehazing, Proc. AAAI Conf. Artif. Intell. 34, 11908 (2020) [Google Scholar]
- D.R. Natarajan, S. Peddi, D.T. Valivarthi, S. Narla, S.S. Kethu, D. Kurniadi, Hybrid LSTM-GA and HS-CS algorithms for automation-assisted defect identification in CNC milling robotics: enhancing global search and optimization, in Proc. Int. Conf. Comput. Sustain. Intell. Future (COMP-SIF) (2025) [Google Scholar]
- P. Li, Z. Chen, J. Zhang, Abnormal driving detection algorithm based on improved YOLO-v8 with self-attention and GOLD-YOLO mechanism, Int. Conf. Auton. Unmanned Syst., Springer Nature, Singapore (2024) [Google Scholar]
- A. Khan, M.A. Rahman, D. Park, CamoFocus: enhancing camouflage object detection with split-feature focal modulation and context refinement, Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (2024) [Google Scholar]
Cite this article as: T. Xia, L. Zhou, L. Quan, The study on surface defect detection of stamped parts based on improved deep learning, Mechanics & Industry 26, 27 (2025), https://doi.org/10.1051/meca/2025019
All Figures
Fig. 1 Experimental equipment diagram.
Fig. 2 Stamped parts used in the experiment.
Fig. 3 Generalized dataset.
Fig. 4 Three types of defects.
Fig. 5 Defect annotation using MAKE SENSE.
Fig. 6 YOLOv8 baseline model.
Fig. 7 Output results of the model after running.
Fig. 8 Comparison of experiments with three individual improvement modules.
Fig. 9 Improved model in this study.
Fig. 10 Experimental results.