2021 Annual Meeting
(241e) Interactive Software for Teaching Multivariable Data Analytics
The mathematical details of latent variable methods can be challenging to understand, especially for students who have had no training in linear algebra or applied statistics. Students therefore tend to treat each method as a black box. After all, many software packages are available for applying latent variable methods, and it might seem that a deeper understanding is not needed for problem-solving. Moreover, several methods are often applied to the problem at hand, and the algorithm is chosen solely on the basis of preliminary results. This approach makes it difficult to explain and interpret the models that these methods create, in particular to relate the models to the chemistry or biology occurring in the process and so reconcile the data analytics results with domain knowledge. It is the synthesis of domain knowledge with data analytics that is the added value of a well-trained chemical engineer, and it is likely to result in the best chemical engineering solution for the particular problem at hand. Furthermore, preliminary black-box results can lead to choosing overly complicated models that overfit the data. A better understanding of, and intuition for, latent variable methods is needed to avoid overfitting for some types of biased data and ultimately to assure model interpretability, leading to higher value, acceptance, and applicability.
This presentation describes software and examples that were developed to train students to achieve a deep understanding of latent variable methods [1],[2] and the related machine learning methods of lasso [3] and elastic net [4]. (For the remainder of this abstract, these methods are collectively referred to as latent variable methods, although some of them are more closely associated with the machine learning community.) The graphical user interface was designed explicitly for teaching undergraduate and graduate students, which distinguishes it from the graphical user interfaces of existing chemometrics software packages, which focus on just applying a method to a dataset. The software takes the perspective of the optimization problem being solved, so that students can understand the relationship between the latent variable method that is selected and the results that are produced.
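As an illustration of the optimization perspective, the problems behind the two regularized regression methods named above can be written in one common parametrization. The symbols here are notation assumed for this sketch and are not taken from the software itself: X is the n-by-p data matrix, y the response vector, β the coefficient vector, λ ≥ 0 the penalty strength, and α in [0, 1] the mixing weight.

```latex
\begin{aligned}
\text{lasso:}\quad
  & \hat{\beta} = \arg\min_{\beta}\;
    \frac{1}{2n}\lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1 \\
\text{elastic net:}\quad
  & \hat{\beta} = \arg\min_{\beta}\;
    \frac{1}{2n}\lVert y - X\beta \rVert_2^2
    + \lambda \left( \alpha \lVert \beta \rVert_1
    + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
\end{aligned}
```

In this form, α = 1 reduces the elastic net to the lasso, α = 0 gives ridge regression, and λ = 0 gives ordinary least squares, so the choice of method amounts to a choice of penalty term.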
This tool, referred to as Latent Variable Demonstrator (LAVADE), compares a wide range of latent variable regression techniques with traditional regression techniques on carefully designed examples. The examples are designed to be easy to understand, and various options for customizing the problem are available so that students can learn exactly how the different algorithms approach model construction. Perturbing the signal step by step with more noise fosters an understanding of how the different methods deal with noise. Furthermore, the tool allows the student to play and compete with the algorithms, making it exciting to gather knowledge and intuition to explain the algorithms' behavior on real-world problems.
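The kind of noise experiment described above can be sketched in a few lines of standard-library Python. This is not LAVADE's implementation; it is a minimal illustration that assumes a synthetic linear dataset and fits ordinary least squares, lasso, and elastic net with a simple coordinate descent solver. The function names, the true coefficients, the penalty `lam`, and the noise levels are all hypothetical choices made for this example.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, lam, l1_ratio, n_sweeps=200):
    """Coordinate descent for
    (1/(2n))*||y - X b||^2 + lam*(l1_ratio*||b||_1 + 0.5*(1 - l1_ratio)*||b||_2^2).
    lam=0 gives least squares; l1_ratio=1 gives the lasso.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # y - X b, with b = 0 initially
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_scale[j] * coef[j]
            new = soft_threshold(rho, lam * l1_ratio)
            new /= col_scale[j] + lam * (1.0 - l1_ratio)
            delta = new - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new
    return coef


def make_data(n_samples, true_coef, noise_std, rng):
    """Gaussian predictors and a linear response with additive Gaussian noise."""
    X = [[rng.gauss(0.0, 1.0) for _ in true_coef] for _ in range(n_samples)]
    y = [sum(x * b for x, b in zip(row, true_coef)) + rng.gauss(0.0, noise_std)
         for row in X]
    return X, y


def coef_error(coef, true_coef):
    """Euclidean distance between estimated and true coefficients."""
    return math.sqrt(sum((c - t) ** 2 for c, t in zip(coef, true_coef)))


def run_demo():
    rng = random.Random(0)
    true_coef = [2.0, 0.0, 0.0, -1.0, 0.0, 0.0]  # assumed sparse ground truth
    for noise_std in (0.1, 1.0, 3.0):  # perturb the signal step by step
        X, y = make_data(40, true_coef, noise_std, rng)
        fits = {
            "least squares": elastic_net_cd(X, y, lam=0.0, l1_ratio=1.0),
            "lasso": elastic_net_cd(X, y, lam=0.1, l1_ratio=1.0),
            "elastic net": elastic_net_cd(X, y, lam=0.1, l1_ratio=0.5),
        }
        for name, coef in fits.items():
            zeros = sum(1 for c in coef if c == 0.0)
            print(f"noise={noise_std:<4} {name:<14} "
                  f"error={coef_error(coef, true_coef):.3f} zero coefs={zeros}")


if __name__ == "__main__":
    run_demo()
```

Running the script prints, for each noise level, the coefficient error and the number of exactly zero coefficients per method, which makes it easy to see how the penalties change the fitted model as the noise grows. Using one solver for all three methods keeps the only difference between them in the penalty term.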
References:
[1] Leo H. Chiang, Evan L. Russell, and Richard D. Braatz. Fault Detection and Diagnosis in Industrial Systems. London, UK: Springer Verlag, 2000.
[2] K.V. Mardia. Multivariate Analysis. London, UK: Academic Press, 2003.
[3] Robert Tibshirani. "Regression Shrinkage and Selection Via the Lasso." Journal of the Royal Statistical Society: Series B (Methodological), 58(1), pp. 267–288, 1996.
[4] Hui Zou and Trevor Hastie. "Regularization and Variable Selection Via the Elastic Net." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), pp. 301–320, 2005.