2025 Spring Meeting and 21st Global Congress on Process Safety

(134c) Integration of Unsupervised and Supervised Machine Learning Models for Single/Multiple Leak Detection and Localization Using Minimal Sensor Data

Authors

Ahmad K. Sleiti - Presenter, Qatar University
Wahib Al-Ammari, Qatar University
M. Hamilton, Memorial University
Hicham Ferroudji, Texas A&M University at Qatar
Sina Rezaei Gomari, Teesside University
I. Hassan, Texas A&M University at Qatar
Rashid Hassan, Department of Petroleum Engineering, Texas A&M University, College Station, United States
Mohammad Azizur Rahman, College of Science and Engineering (CSE), Hamad Bin Khalifa University (HBKU), Qatar

The integrity of gas pipelines, both offshore and onshore, is critical for ensuring safe and efficient energy transportation. Leaks in these pipelines can lead to catastrophic environmental damage, economic losses, and safety hazards. Traditional leak detection methods often rely on extensive sensor networks and complex algorithms, which can be costly and challenging to implement, especially in remote or inaccessible locations. There is a pressing need for reliable, cost-effective leak detection systems that utilize minimal sensor data without compromising accuracy. Recent advancements in machine learning offer promising solutions to address these challenges by leveraging data-driven models for anomaly detection and localization.

This research introduces an innovative approach that integrates unsupervised and supervised machine learning models to detect and localize single and multiple leaks in gas pipelines using minimal sensor data—specifically, inlet and outlet mass flow rates and pressures. The novelty of this work lies in its hybrid methodology that combines the strengths of unsupervised learning for anomaly classification with supervised learning for precise leak localization and quantification.

An unsupervised Gaussian Mixture Model (GMM) is employed for classification. The GMM is chosen for its capability to model complex data distributions and identify underlying patterns without prior labeling, which suits it to distinguishing between normal operation, single leaks, and multiple leaks. When an anomaly is detected, a supervised Random Forest Regressor is activated to estimate the sizes and locations of the leaks. This two-tiered approach ensures that leaks are not only detected but also located and quantified, facilitating prompt and effective response measures.
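The two-tiered flow described above can be sketched in a few lines. This is a minimal illustration of the control flow only, not the authors' implementation: the label strings, the function name `detect_and_localize`, and the `classify` and `regress` callables are hypothetical stand-ins for the trained GMM and Random Forest Regressor.

```python
from typing import Any, Callable, Dict, Sequence

# Hypothetical label encoding; the abstract names the three operating
# states but does not say how they are represented.
NORMAL = "normal"
SINGLE_LEAK = "single_leak"
MULTIPLE_LEAKS = "multiple_leaks"


def detect_and_localize(
    features: Sequence[float],
    classify: Callable[[Sequence[float]], str],
    regress: Callable[[Sequence[float]], Any],
) -> Dict[str, Any]:
    """Tier 1: classify the operating state. Tier 2: estimate leak
    sizes and locations, but only when tier 1 flags a leak."""
    state = classify(features)
    if state == NORMAL:
        # No anomaly, so the regressor is never invoked.
        return {"state": state, "leaks": None}
    return {"state": state, "leaks": regress(features)}
```

Keeping the regressor behind the classifier means the second model only ever sees inputs that were flagged as leak scenarios, which matches how it is described as being trained.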

Methodology

The models are trained on a comprehensive dataset comprising both experimental and synthetic data. The synthetic data are generated using pipeline simulation software (OLGA) to mimic various leak scenarios under different operational conditions, including steady-state and transient flows. Noise is introduced into the data to simulate real-world sensor inaccuracies and environmental interference. The methodology is summarized in the following steps:

  1. Data Collection and Preprocessing

Experimental data are collected from controlled leak experiments on a multi-phase flow setup designed to simulate single and multiple leak scenarios. The experimental pipeline is 6 m long and constructed from stainless steel, with an outer diameter (OD) of 60 mm and an inner diameter (ID) of 52.5 mm. Three pre-designed leak points with diameters of 3 mm, 2.5 mm, and 1.8 mm are included to introduce controlled leakages. The first leak is located 3.457 m from the inlet, and subsequent leaks are spaced 90 mm apart, enabling the study of multiple leak events under different flow conditions. To enhance the performance of the machine learning models, synthetic data are generated with OLGA pipeline simulations in which leak sizes are varied from 0.5 cm to 5.0 cm in diameter and leak locations from 1200 m to 12000 m along the simulated pipeline. The OLGA models are validated against the experimental data under normal, single-leak, and multiple-leak conditions. Finally, feature engineering is performed to calculate the differential pressure and the mass flow discrepancy between inlet and outlet, which serve as input features.
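As a rough illustration of the feature engineering step, the sketch below derives the two quantities named above from one set of inlet and outlet readings. The `Reading` container, the field names, and the output keys are hypothetical, and the units are whatever the sensors report, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Reading:
    """One simultaneous sample from the four sensors (hypothetical layout)."""
    p_in: float   # inlet pressure
    p_out: float  # outlet pressure
    m_in: float   # inlet mass flow rate
    m_out: float  # outlet mass flow rate


def engineer_features(r: Reading) -> Dict[str, float]:
    """Compute the differential pressure and the mass flow discrepancy
    between inlet and outlet."""
    return {
        "delta_p": r.p_in - r.p_out,
        "delta_m": r.m_in - r.m_out,
    }
```

In a leak-free steady state the mass flow discrepancy should sit near zero apart from sensor noise, so a persistent positive value is the most direct leak indicator available from these four measurements.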

  2. Unsupervised Learning for Classification

For the unsupervised learning component, a Gaussian Mixture Model (GMM) was selected due to its proficiency in handling multi-modal distributions. The GMM was trained to classify data into three categories: normal (no leak), single leak, and multiple leaks. To enhance the model's performance, cross-validation techniques were employed to optimize the number of components and covariance types within the GMM, ensuring accurate classification across diverse scenarios.
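For reference, the standard Gaussian mixture formulation that underlies this step is given below. The symbols are the usual textbook ones and are not taken from the abstract: x is the feature vector, K the number of components, and π_k, μ_k, Σ_k the weight, mean, and covariance of component k.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0

\gamma_k(\mathbf{x}) =
  \frac{\pi_k \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\right)},
\qquad \hat{k} = \arg\max_{k} \gamma_k(\mathbf{x})
```

Each sample is assigned to the component with the highest posterior responsibility. How components map onto the three operating classes is not stated in the abstract; a one-to-one mapping with K = 3 is the simplest assumption, but the optimized number of components may differ.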

  3. Supervised Learning for Localization and Sizing

In the supervised learning phase, a Random Forest Regressor was chosen for its robustness to overfitting and its ability to capture non-linear relationships inherent in the data. This model was trained on the subset of data identified as leak scenarios by the unsupervised GMM. Additionally, feature importance analysis was conducted to identify the most significant features influencing leak size and location predictions, allowing the model to focus on the most impactful variables and improve prediction accuracy.
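A minimal sketch of this regression step with scikit-learn's `RandomForestRegressor` follows. The training data here are a toy stand-in generated on the spot, not the study's dataset: only the output ranges (1200 m to 12000 m, 0.5 cm to 5.0 cm) come from the text, while the linear relationships, the helper names, and the hyperparameters are assumptions chosen so the example runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def make_toy_leak_data(n_samples=200, seed=0):
    """Toy stand-in data, NOT the study's dataset.

    Columns of X: [differential pressure, mass flow discrepancy],
    scaled to [0, 1] in arbitrary units. The linear target relationships
    are made up purely so the model has something to learn.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    location_m = 1200.0 + 10800.0 * X[:, 0]
    diameter_cm = 0.5 + 4.5 * X[:, 1]
    return X, np.column_stack([location_m, diameter_cm])


def fit_leak_regressor(X, y, seed=0):
    """Fit one multi-output forest predicting [location, diameter]."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y)
    return model


X, y = make_toy_leak_data()
model = fit_leak_regressor(X, y)
predictions = model.predict(X[:5])         # one row per sample: [location, diameter]
importances = model.feature_importances_   # impurity-based, one value per feature
```

In the workflow described above, `X` would contain only the samples the GMM flagged as leak scenarios, and `feature_importances_` is one way to carry out the feature importance analysis.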

  4. Model Testing and Evaluation

The integrated models were tested under various operational conditions, including different pressures, flow rates, and pipeline diameters, to evaluate their generalizability and robustness. To simulate real-world sensor inaccuracies, random noise was added to the input features so that the models' resilience to data imperfections could be assessed. The performance metrics were classification accuracy for the unsupervised model and mean absolute percentage error (MAPE) for the supervised model. The results demonstrated high levels of accuracy, confirming the effectiveness of the integrated approach in detecting and localizing leaks with minimal sensor data.
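The noise injection and the MAPE metric mentioned above can be sketched with the standard library alone. The abstract does not state the noise distribution or level, so the zero-mean Gaussian noise proportional to each reading is an assumption, as are the function names and the `rel_sigma` parameter.

```python
import random
from typing import List, Optional, Sequence


def add_gaussian_noise(
    values: Sequence[float], rel_sigma: float, seed: Optional[int] = None
) -> List[float]:
    """Perturb each value with zero-mean Gaussian noise whose standard
    deviation is rel_sigma times the value's magnitude (assumed model)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, rel_sigma * abs(v)) for v in values]


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)
```

For example, true values of 100 and 200 with predictions of 110 and 180 are each off by 10% of the true value, giving a MAPE of 10%.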

Key Results

The integrated models demonstrated exceptional performance throughout the testing phase. The unsupervised Gaussian Mixture Model achieved over 99% accuracy in correctly classifying normal operations, single leaks, and multiple leaks. In terms of localization and sizing, the supervised Random Forest Regressor attained an absolute percentage error of less than 3.20% in predicting leak sizes and locations. Furthermore, the models maintained high accuracy even when significant noise was introduced into the input features, indicating strong resilience to real-world sensor inaccuracies. The consistent performance across various operational conditions shows that the models adapt to different pipeline environments.

Discussion

The high classification accuracy underscores the effectiveness of the GMM in identifying leak-related anomalies using minimal sensor data. The Random Forest Regressor's low error margin in localization and sizing validates its suitability for regression tasks in complex systems like gas pipelines. The integration of unsupervised and supervised learning models leverages the strengths of both approaches—unsupervised learning for detecting unforeseen patterns and supervised learning for making precise predictions based on labeled data.

The reliance on minimal sensor data significantly reduces implementation costs and complexity, making the proposed method highly practical for widespread adoption. Additionally, the use of both experimental and synthetic data for training enhances the models' generalizability and reliability.

Conclusion and Future Work

This research presents a novel and efficient approach for leak detection and localization in gas pipelines, combining unsupervised and supervised machine learning models. The results demonstrate that high accuracy can be achieved using minimal sensor inputs, offering a cost-effective and reliable solution for pipeline monitoring.

Future work will focus on expanding the models to accommodate different types of pipelines, including those transporting liquids or multiphase flows. Moreover, integrating real-time data analytics and incorporating additional sensor inputs, such as temperature and vibration data, could further enhance model accuracy and reliability.