2022 Spring Meeting and 18th Global Congress on Process Safety Proceedings
(121c) Accelerating Product Development with Diverse Training Data
Citrine Informatics takes a divide-and-conquer approach to storing connected materials data and then leveraging it for AI. First, we use the Graphical Expression of Materials Data (GEMD [1]) model to give data sources structure and detailed process-history information using partner-defined terminology. This allows comparison and synthesis across data sources without forcing complex records into a rigid schema. Second, we use graph queries defined in our citrine-python library [2] to normalize the data into a tabular format with consistent units. The queries are expressed using the same organization-defined terms from GEMD. Third, because the heterogeneity of the data means that not every value is defined for every row, we use networks of models to fill empty cells through transfer learning [3]. Finally, in model validation we use leave-one-cluster-out cross-validation [4] to develop reasonable uncertainty expectations for the system of models in light of the population imbalance common to industrial data. Combining these methods into a unified data and modeling stack has produced a tool that can engage with data sources as they exist today, remain forward compatible as data continues to accumulate and evolve, and permit reuse and retraining of historical models with minimal human intervention.
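As an illustration of the validation step, the following is a minimal sketch of leave-one-cluster-out cross-validation in plain Python. It is not Citrine's implementation. The function names (`loco_splits`, `loco_rmse`), the 1-nearest-neighbor stand-in model, and the toy data are all assumptions made for this example, and the cluster labels are assumed to come from a separate clustering step (for example, k-means on the input features).

```python
from collections import defaultdict
from math import sqrt


def loco_splits(cluster_labels):
    """Yield (held_out_label, train_indices, test_indices), one split per cluster."""
    by_cluster = defaultdict(list)
    for i, label in enumerate(cluster_labels):
        by_cluster[label].append(i)
    if len(by_cluster) < 2:
        raise ValueError("leave-one-cluster-out needs at least two clusters")
    for label in sorted(by_cluster):
        test_idx = by_cluster[label]
        train_idx = [i for i, other in enumerate(cluster_labels) if other != label]
        yield label, train_idx, test_idx


def nearest_neighbor_predict(train_x, train_y, x):
    """Hypothetical stand-in model: return the target of the closest training point."""
    j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[j]


def loco_rmse(xs, ys, cluster_labels):
    """Return {cluster_label: RMSE} with each cluster held out in turn."""
    scores = {}
    for label, train_idx, test_idx in loco_splits(cluster_labels):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        squared_errors = [
            (nearest_neighbor_predict(train_x, train_y, xs[i]) - ys[i]) ** 2
            for i in test_idx
        ]
        scores[label] = sqrt(sum(squared_errors) / len(squared_errors))
    return scores


# Made-up toy data: two denser clusters and one single-point cluster,
# mimicking the population imbalance described above.
xs = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 9.0]
ys = [2.0, 2.2, 2.4, 2.6, 10.0, 10.2, 10.4, 18.0]
labels = ["a", "a", "a", "a", "b", "b", "b", "c"]

scores = loco_rmse(xs, ys, labels)
for label, rmse in scores.items():
    print(f"held-out cluster {label}: RMSE = {rmse:.2f}")
```

Because every test point belongs to a cluster the model never saw during training, the per-cluster errors reflect extrapolation rather than interpolation. A random k-fold split on the same data would typically place near-duplicates of each test point in the training set and report a more optimistic error.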
[1] https://citrineinformatics.github.io/gemd-docs/
[2] https://citrineinformatics.github.io/citrine-python/
[3] M Hutchinson, E Antono, B Gibbons, S Paradiso, J Ling, and B Meredig. "Overcoming data scarcity with transfer learning." arXiv preprint arXiv:1711.05099 (2017).
[4] B Meredig, E Antono, C Church, M Hutchinson, J Ling, S Paradiso, B Blaiszik et al. "Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery." Molecular Systems Design & Engineering (2018).