The primary objective of this research is to develop a robust and accurate model for real-time monitoring of chemical reactions in a continuous flow synthesis system using Raman spectroscopy. This research also aims to facilitate process intensification and the transfer of materials from a batch setting to a continuous flow system. Continuous flow synthesis offers several advantages over traditional batch processes, including enhanced reaction control, improved reproducibility, and increased safety. However, effective implementation of continuous flow reactors requires real-time analytical tools to monitor and control critical reaction parameters efficiently. Raman spectroscopy, along with other non-destructive analytical techniques such as Near-Infrared (NIR) and Fourier Transform Infrared (FTIR) spectroscopy, provides valuable insights into reaction progression, compound identification, and structural analysis.
In this study, we aim to construct a predictive model that correlates Raman spectral data with key polymerization parameters, including molecular weight (Mw) and residual monomer concentrations. Spectral data are inherently complex: each spectrum contains thousands of wavenumber intensities, and the relationships between spectral features and chemical properties are often nonlinear. Traditional analytical methods struggle to extract meaningful insights from such high-dimensional data. Using machine learning techniques such as Partial Least Squares Regression (PLSR), Support Vector Regression (SVR), and Random Forest regression, we aim to enhance the interpretability of Raman spectral data and enable real-time prediction of the chemical properties of interest, which in turn supports quality assurance and process optimization of materials. This research focuses on refining spectral preprocessing techniques, optimizing model parameters, and validating predictive accuracy against experimental data. Ultimately, our work aims to bridge the gap between inline spectroscopic analysis and practical process control, contributing to the broader adoption of continuous flow chemistry in industrial applications.
Our continuous reactor setup consists of four pumps. Pumps 1 and 2 deliver a concentrated monomer solution and solvent for in-line dilution to the target molarity, while pumps 3 and 4 deliver a concentrated initiator solution and solvent for in-line dilution to the target molarity. These streams then mix and are preheated in a 1 mL chip reactor before entering a 16 mL tube reactor, where the main reaction takes place. From there, the material flows through an inline Raman flow cell, where a spectrum is collected. The material is then cooled and automatically collected. We conducted a 40-run design of experiments (DOE) in which we varied polymer solids percentage, initiator loading, system pressure, reaction temperature, and residence time. The resulting spectra were used to build and train our Raman-based model, which correlates the spectral data with the DOE response variables: molecular weight, residual monomer A, and residual monomer B.
The code analyzes Raman spectral data to predict molecular weight and residual monomer content, and was tested with several regression models: Support Vector Regression, Ridge Regression, Random Forest Regressor, and Partial Least Squares Regression. It begins by loading two datasets: one containing the raw Raman counts with wavenumbers as column headers, and another with the molecular weight values. The data are then split into training and testing sets using an 80/20 ratio, so that the model is trained on a diverse set of data points while enough unseen data is retained for validation.
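The loading and splitting step can be illustrated with a minimal sketch. The data here are synthetic stand-ins generated in memory; the wavenumber axis, the one-spectrum-per-run layout (40 rows), the Mw value range, and the fixed random seed are all assumptions made for illustration, not details of the actual datasets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): in practice these frames would be
# read from the spectra and molecular weight files, e.g. with pd.read_csv.
rng = np.random.default_rng(0)
wavenumbers = np.arange(200, 3200, 2)       # hypothetical axis in cm^-1
spectra = pd.DataFrame(
    rng.random((40, wavenumbers.size)),     # assumed: one spectrum per run
    columns=wavenumbers,                    # wavenumbers as column headers
)
mw = pd.Series(rng.uniform(1e4, 1e5, size=40), name="Mw")

# 80/20 split; random_state is fixed only to make the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    spectra, mw, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Keeping the spectra in a DataFrame preserves the wavenumber labels through the split, which makes it easier to map a selected spectral range back to column positions later.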
Using the chemotools and scikit-learn Python packages, a preprocessing pipeline refines the spectral data before regression. The best spectral range for each response variable was identified by running a moving window partial least squares (MW-PLS) program on the training set. The RangeCut function was then used to isolate the spectral regions most relevant to the analysis, excluding noise from irrelevant wavenumbers. LinearCorrection is then applied to correct baseline shifts, followed by SavitzkyGolay filtering to smooth the spectra and improve signal clarity. The data are then standardized with StandardScaler so that all features are on the same scale before regression. GridSearchCV is used to tune the PLSR model, selecting the number of components that minimizes mean absolute error during cross-validation.
With the most relevant spectral regions identified for Mw and each residual monomer, we plotted measured versus predicted values and calculated the R-squared value for the training and test sets to check the overall fit of the model and how it generalizes to unseen data. For molecular weight, the training set had an R-squared value of 0.92 and the test set had an R-squared value of 0.76. Similarly, residual monomer A had a training R-squared value of 0.92 and a test R-squared value of 0.75, while residual monomer B had a training R-squared value of 0.95 and a test R-squared value of 0.78. These values indicate that the models account for most of the variance in the training data and explain roughly 75 to 78 percent of the variance in data they have not seen, so they generalize reasonably well, although the gap between training and test scores suggests some overfitting remains. Model performance is expected to improve as the training parameters are further optimized and as more data become available.