2023 AIChE Annual Meeting
(147v) Enhancing Feature Engineering and Machine Learning through Systems Engineering for Improved Diagnosis: Case Studies in Speech Disorder and Autism Spectrum Disorder
In machine learning (ML), incorporating domain knowledge is vital to ensuring comprehensive and reliable outcomes. The premise of my research is that ML results can be incomplete or even misleading when a deep understanding of the underlying domain is missing. Hence, my doctoral work aims to bridge the gap between domain knowledge and ML techniques, enabling complete and interpretable ML models that lead to consistent and reliable results.
To contextualize this integration, process systems engineering (PSE) serves as a valuable framework. By adopting a systems engineering perspective, we move beyond purely data-driven models and combine machine learning with human learning by building domain knowledge directly into the models. Domain knowledge also plays a pivotal role in feature engineering, a critical component in constructing machine learning models. By leveraging domain knowledge, we develop novel, often physically or biologically meaningful, features that significantly enhance the performance of machine learning models, in line with Andrew Ng's assertion that "applied machine learning is basically feature engineering."
Specifically, my research interest is the development of innovative features that harness domain knowledge to enhance the performance, interpretability, and reliability of ML models. In addition to the aforementioned benefits of systems engineering in integrating domain knowledge with machine learning, we leverage the parallels between disorder/disease detection in medical/clinical research and fault detection in PSE. By drawing inspiration from PSE principles, my research seeks to enhance the detection of speech disorders and autism in children.
Detecting childhood speech disorders is important because approximately 1 in 12 children between the ages of three and five are affected by speech-language deficits, making them one of the most prevalent disabilities in children. Clinical assessment of these deficits is commonly performed using auditory perceptual analysis (APA), which has many drawbacks, including its vulnerability to inconsistencies among evaluators and the difficulty of applying it to children. My study focuses on developing an automatic speech disorder detection algorithm via feature engineering and ML, using acoustic landmarks (LMs) derived from the LM theory of speech perception.
We introduce novel knowledge-based features derived from domain expertise, drawing an analogy to the body-mass index (BMI). BMI combines weight and height to determine whether an individual is overweight, which neither measurement can indicate on its own. Similarly, we propose to use ratios of landmark counts, which are more informative than the individual counts because they help mitigate the impact of individual variations within the same class of samples.
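The ratio idea can be illustrated with a minimal sketch. It is not the study's implementation: the function name `ratio_features`, the landmark labels (`lm_a`, `lm_b`, `lm_c`), and the counts are hypothetical, chosen only to show how pairwise ratios cancel a speaker-specific scale factor such as the total amount of speech produced.

```python
from itertools import combinations


def ratio_features(counts):
    """Return all pairwise ratios of the given counts, keyed as 'a/b'.

    `counts` maps a (hypothetical) landmark label to its count for one speaker.
    Pairs whose denominator is zero are skipped because the ratio is undefined.
    """
    features = {}
    for a, b in combinations(sorted(counts), 2):
        if counts[b] != 0:
            features[f"{a}/{b}"] = counts[a] / counts[b]
    return features


# Hypothetical counts for two speakers of the same class.
# speaker_2 produced twice as much speech, so every raw count doubles.
speaker_1 = {"lm_a": 12, "lm_b": 30, "lm_c": 6}
speaker_2 = {"lm_a": 24, "lm_b": 60, "lm_c": 12}

print(ratio_features(speaker_1))
# The raw counts differ, but the ratio features are identical.
print(ratio_features(speaker_1) == ratio_features(speaker_2))
```

In this toy case the raw counts of the two speakers differ by a factor of two while their ratio features coincide, which is the sense in which ratios reduce within-class variation.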
We developed and validated an ML algorithm based on the novel features using speech data from 39 typically developing children and 12 children with speech disorder from the Speech Evaluation and Exemplars Database (SEED). Results show that the raw LM features are not informative in detecting speech disorder, with only 64% sensitivity and 54% specificity. In contrast, our algorithm achieved 96% sensitivity, 92% specificity, and an overall accuracy of 94% with 10 features. Nine of the 10 selected features are novel ratio-based features. This study shows that combining LM-based features with domain knowledge in the form of ratio-based features significantly improves the accuracy of distinguishing children with speech disorder from typically developing children, and that the resulting algorithm can serve as a more reliable and objective diagnostic tool.
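For readers less familiar with the reported metrics, the following sketch shows how sensitivity, specificity, and accuracy are computed from binary predictions. The function name `diagnostic_metrics` and the labels are hypothetical illustrations, not the study's data; label 1 is assumed to mean "speech disorder" and 0 "typically developing".

```python
def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for binary labels.

    Assumes 1 marks the positive class (disorder) and 0 the negative class,
    and that both classes occur at least once in `y_true`.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return {
        "sensitivity": tp / (tp + fn),  # fraction of disorder cases detected
        "specificity": tn / (tn + fp),  # fraction of controls correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity separately matters when the classes are imbalanced, as they are here (39 versus 12 children), because overall accuracy alone is dominated by the larger class.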
In another study, I investigated the detection of autism spectrum disorder (ASD), a complex neurodevelopmental condition with an estimated prevalence of 1 in 44 children in the United States. ASD poses significant personal, familial, and societal challenges. Diagnostic methods rely primarily on behavioral criteria, such as difficulties in communication and social interaction, which can be subjective and challenging to apply to younger children. Biomarkers detected in bodily fluids, such as blood and urine, offer a minimally invasive and cost-effective way to improve diagnostic accuracy and facilitate early intervention for better outcomes in individuals with ASD. However, finding reliable protein biomarkers for ASD has been challenging. We therefore propose a novel approach that systematically generates physically meaningful features that are resilient to confounding factors such as age, gender, diet, and comorbid diseases.
We introduce a novel set of engineered features, including protein ratios, which reduce within-class variations. Our automated computer-assisted biomarker detection framework integrates protein biomarkers and ratio-based features with a hybrid feature selection technique and a linear machine learning model. The effectiveness of the framework was demonstrated using a dataset of serum samples from 76 typically developing (TD) boys and 78 boys with ASD. The proposed algorithm achieved an area under the curve (AUC) of 0.95 with 8 features, outperforming previous studies (AUC of 0.86 with 9 proteins). Seven of the 8 selected features are ratios.
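As a brief aside on the metric, AUC can be read as the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `auc`, the labels, and the scores are hypothetical and are not taken from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    sample has the higher score; ties count as half. Assumes label 1 is the
    positive class (ASD) and 0 the negative class (TD), and that both occur.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores, for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.85, 0.1]
print(auc(labels, scores))
```

The quadratic pairwise loop is chosen for readability; for large datasets a rank-based computation gives the same value more efficiently.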
This project shows that the strong performance of the proposed method is achieved mainly by introducing biomarkers that go beyond traditional individual physical traits to include biological information that can only be extracted by considering their interactions and correlations. The proposed robust and interpretable algorithm has great potential for early ASD detection.
As a final remark, through the integration of domain knowledge and machine learning techniques, our research delivers comprehensive and interpretable results in speech-language deficits and ASD biomarker detection. By leveraging domain expertise, we enhance the accuracy and reliability of speech disorder diagnosis and contribute to the identification of reliable biomarkers for ASD. Our methodology aligns with the principles of systems engineering, providing insights into complex systems and paving the way for future advancements in these domains.