PI1M: A Benchmark Database for Polymer Informatics
Tech ID: 20-065
Inventors: Dr. Tengfei Luo, Ruimin Ma
Date Added: May 11, 2021
An innovative open-source benchmark database (data size: ~1 million polymers) for machine learning research in polymer informatics
Large-scale open-source data is the cornerstone of data-driven research, but such data is not readily available for polymers. A number of polymer databases and platforms, such as PolyInfo, Polymer Genome, and CHEMnetBASE-Polymers, have been developed, but most of them are embedded in web applications where the raw data (especially chemical structure data) is not accessible on a large scale. As a result, machine learning research on polymers is largely limited to those who hold such raw data, which creates a barrier to developing and testing machine learning algorithms for polymer informatics.
Researchers at the University of Notre Dame have built a benchmark database called PI1M (the name refers to ~1 million monomers of polymers for polymer informatics) to provide a data resource for machine learning research in polymer informatics. A generative model is trained on ~12,000 polymers manually collected from PolyInfo, the largest existing polymer database, and the trained model is then used to generate ~1 million polymers. A new representation for polymers, the polymer embedding (PE), is introduced and used in several polymer informatics regression tasks: density, glass transition temperature, melting temperature, and dielectric constant. Comparing the PE trained on the PolyInfo data with the PE trained on the PI1M data shows that PI1M covers a chemical space similar to that of PolyInfo while significantly populating regions where PolyInfo data is sparse. The researchers believe PI1M will serve as a good benchmark database for future research in polymer informatics.
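As a rough illustration of what it means to compare polymers in a shared representation space, the sketch below tokenizes polymer SMILES strings (with `*` marking the repeat-unit endpoints), turns each into a token-count vector, and finds the closest reference polymer by cosine similarity. This is only a stand-in under stated assumptions: the count vector is not the learned polymer embedding (PE) described above, and the function names (`tokenize`, `embed`, `cosine`, `nearest`) and the simplified tokenizer are hypothetical, not part of PI1M.

```python
import math
import re
from collections import Counter

# Simplified SMILES tokenizer (an assumption, not the PI1M method):
# bracket atoms, two-letter halogens, '*' endpoints, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|\*|.")


def tokenize(psmiles):
    """Split a polymer SMILES string into a list of tokens."""
    return TOKEN_RE.findall(psmiles)


def embed(psmiles):
    """Toy representation: token counts (stand-in for a learned embedding)."""
    return Counter(tokenize(psmiles))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query, reference):
    """Return the reference polymer most similar to the query."""
    q = embed(query)
    return max(reference, key=lambda s: cosine(q, embed(s)))


if __name__ == "__main__":
    # Polyethylene and polystyrene repeat units as a tiny reference set.
    reference = ["*CC*", "*CC(*)c1ccccc1"]
    # A methyl-substituted styrene-like repeat unit as the query.
    print(nearest("*CC(*)c1ccccc1C", reference))
```

In a real workflow the count vectors would be replaced by learned embeddings and the reference set by the full database; the nearest-neighbor comparison itself would stay the same in spirit.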
- PI1M contains around 1 million polymers, more than any existing polymer database
- Open-source database that is accessible on a large scale
- Polymer informatics research industry
- Polymer (Plastic & Resin) Manufacturing in the US: $73B
Technology Readiness Level
- TRL 4 – Lab Validation
Intellectual Property Status
Ma, R., & Luo, T. (2020). PI1M: A benchmark database for Polymer informatics. Journal of Chemical Information and Modeling, 60(10), 4684-4690. doi:10.1021/acs.jcim.0c00726