2022 Spring Meeting and 18th Global Congress on Process Safety Proceedings

(121c) Accelerating Product Development with Diverse Training Data

Authors

Kenneth Kroenlein - Presenter, National Institute of Standards and Technology

Sebastian M. Bernasek, Citrine Informatics

Lenore Kubie, Citrine Informatics

When designing a product, experts meld a variety of data including historical experiments, manufacturability limitations, and fundamental physical and chemical understanding. The breadth of these data resources has made awareness of all relevant information for a given design difficult, and the growth of data volumes following substantial digitalization efforts has exacerbated this challenge. Combining these disparate data streams is labor intensive, as differences in schema, assumptions about format, and variations in taxonomy make merging without human intervention often impracticable, even without considering lab notebooks or other non-digital assets. These data are heterogeneous in structure, sparsely populated, and often statistically small.

The approach Citrine Informatics has taken, both for storing all of this connected data and for leveraging it for AI, is one of divide and conquer. First, we use the Graphical Expression of Materials Data (GEMD [1]) model to provide structure and detailed information about process history to data sources using partner-defined terminology. This allows comparison and synthesis across data sources without forcing complex records into a rigid schema. Second, we use graph queries defined in our citrine-python library [2] to normalize data into a tabular format with consistent units. The queries are expressed using the same organization-defined terms from GEMD. Third, because the heterogeneity of the data means that not all values will be defined for all rows, we use networks of models to fill empty cells through transfer learning [3]. Finally, in model validation we use leave-one-cluster-out cross-validation [4] to develop reasonable uncertainty expectations for the system of models in light of the population imbalance common to industrial data. Combining these methods into a unified data and modeling stack has resulted in a tool that can engage with data sources where they are today, allow for forward compatibility as data continue to accumulate and evolve, and permit reuse and retraining of historical models with minimal human intervention.
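The leave-one-cluster-out validation step can be illustrated with a minimal sketch. It is not the Citrine implementation: the function names `fit_line` and `loco_cv`, the toy data, and the hand-assigned cluster labels are all assumptions made for illustration, and a one-feature least-squares line stands in for the networks of models described above. In practice the cluster labels would presumably come from a clustering step over the inputs rather than being assigned by hand.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # all x identical: fall back to predicting the mean
        return 0.0, y_bar
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return a, y_bar - a * x_bar


def loco_cv(rows):
    """Leave-one-cluster-out cross-validation.

    rows: iterable of (x, y, cluster_label) tuples.
    Returns {cluster_label: RMSE on that cluster when it is held out}.
    """
    by_cluster = defaultdict(list)
    for x, y, label in rows:
        by_cluster[label].append((x, y))

    scores = {}
    for held_out, test in by_cluster.items():
        # Train on every cluster except the held-out one.
        train = [pt for label, pts in by_cluster.items()
                 if label != held_out for pt in pts]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        # Score only on the held-out cluster.
        scores[held_out] = sqrt(mean((a * x + b - y) ** 2 for x, y in test))
    return scores


# Hypothetical toy data: y = x**2 sampled in three separated regions.
rows = (
    [(x, x ** 2, "low") for x in (0, 1, 2)]
    + [(x, x ** 2, "mid") for x in (4, 5, 6)]
    + [(x, x ** 2, "high") for x in (8, 9, 10)]
)

scores = loco_cv(rows)
for label, rmse in scores.items():
    print(f"held-out cluster {label}: RMSE = {rmse:.2f}")
```

In this toy case the error is larger when an edge cluster ("low" or "high") is held out than when the middle cluster is, because the model must extrapolate rather than interpolate. Randomly shuffled folds would mix all three regions into every training set and hide that difference, which is the motivation for holding out whole clusters when data are imbalanced.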

[1] https://citrineinformatics.github.io/gemd-docs/

[2] https://citrineinformatics.github.io/citrine-python/

[3] M Hutchinson, E Antono, B Gibbons, S Paradiso, J Ling, and B Meredig. Overcoming data scarcity with transfer learning. arXiv preprint arXiv:1711.05099, 2017.

[4] B Meredig, E Antono, C Church, M Hutchinson, J Ling, S Paradiso, B Blaiszik et al. Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery. Molecular Systems Design & Engineering, 2018.
