A collaboration between the University of Cambridge and Argonne National Laboratory has developed a technique that uses artificial intelligence and high-performance computing to automatically generate databases that support specific fields of science.
Even after the dawn of data-driven discovery, combing through the vast scientific literature for bits and pieces of information to support an idea or find the key to a specific problem has remained a tedious task for researchers.
Jacqueline Cole knows the drill all too well. Head of Molecular Engineering at the University of Cambridge, UK, she has spent most of her career looking for materials with optical properties that lend themselves to more efficient light harvesting, such as dye molecules that might one day power solar windows.
“I know that a lot of information is scattered in very fragmented form throughout the literature,” she recalled. “But if you gather thousands of documents, you can form your own database.”
So Cole and colleagues at the University of Cambridge and the U.S. Department of Energy's (DOE) Argonne National Laboratory did just that, and described the process in the journal Scientific Data.
Cole said the paper describes how natural language processing (NLP) and high-performance computing were used to build the database, with much of the computing carried out at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science User Facility.
The factors that make the database unique include the scale of the project and the fact that it contains both experimental and calculated data on two kinds of material attributes: structures, which describe the atomic or chemical basis of a material, and properties, the functions that those different structures provide.
“This may be the first compilation of a database on this scale, with 5,380 pairs of like-for-like experimental and calculated data,” Cole said. “Moreover, because it is so large, it can serve as a repository in its own right, and it really opens the door to predicting new materials.”
Many new large databases are built purely from calculations, and their inherent disadvantage is that they have not been validated against experimental data. The experimental data is perhaps the most important ingredient: it helps ensure an accurate picture of a material's excited states, which define the dynamic state of its electrons and are used to calculate the material's functional properties (optical properties, in this case).
This growing catalog of excited states can then help calculate the properties of materials that have not yet been conceived, further expanding the database.
“Imagine that someone wants to discover a new type of optical material for a bespoke functional application, and our database does not contain that particular optical property,” Cole explained. “We can calculate the optical property of interest from the excited states, which are available for each entry in our database, and create materials with customized functions.”
The team used ALCF's Theta supercomputer to perform quantum chemical calculations on each structure for which optical data had been extracted from the literature, creating a database of paired experimental and calculated structures and their optical properties.
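The pairing described above can be pictured with a small sketch. It is not the team's pipeline: the record layout, the `Pair` class, the `pair_records` helper, and the numeric values are all hypothetical placeholders, chosen only to show how experimental and calculated values for the same compound might be matched and compared.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One like-for-like record: the same compound, measured and computed."""
    compound: str
    experimental_nm: float  # absorption peak from the literature (placeholder)
    calculated_nm: float    # absorption peak from a calculation (placeholder)

    @property
    def deviation_nm(self) -> float:
        # Positive means the calculation overestimates the wavelength.
        return self.calculated_nm - self.experimental_nm


def pair_records(experimental: dict[str, float],
                 calculated: dict[str, float]) -> list[Pair]:
    """Keep only compounds that have both an experimental and a calculated value."""
    shared = sorted(experimental.keys() & calculated.keys())
    return [Pair(name, experimental[name], calculated[name]) for name in shared]


# Hypothetical values, invented purely for illustration.
experimental = {"dye_A": 450.0, "dye_B": 520.0, "dye_C": 610.0}
calculated = {"dye_A": 438.0, "dye_B": 531.0}

pairs = pair_records(experimental, calculated)
for p in pairs:
    print(f"{p.compound}: exp {p.experimental_nm} nm, "
          f"calc {p.calculated_nm} nm, deviation {p.deviation_nm:+.1f} nm")
```

In this sketch, a compound without a calculated counterpart (here `dye_C`) is simply left out, which mirrors the idea that only matched pairs enter the comparison.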
“One of the biggest challenges is to extract chemical candidates from 400,000 scientific articles that can be used as dyes for solar cells,” said Álvaro Vázquez-Mayagoitia, a computational scientist in Argonne's Computational Science division. “We have developed a distributed framework to apply artificial intelligence methods, such as those used in natural language processing, on ALCF's world-class supercomputers.”
To automatically extract this information and store it in a database, the team turned to a new text-mining application called ChemDataExtractor, an NLP tool designed to mine text from the chemistry and materials literature. “The information is scattered across thousands of papers and presented in a highly fragmented and unstructured form,” Cole said.
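To give a feel for what mining a property from text involves, here is a deliberately naive sketch. It does not use ChemDataExtractor and does not reflect how that tool works internally; it is a plain regular expression, and the pattern, the `extract_peaks` helper, and the sample sentences are assumptions made up for illustration.

```python
import re

# Matches phrases such as "absorption maximum at 452 nm" or "λmax = 518.5 nm".
# Real papers phrase this in far more ways than one pattern can cover.
PEAK_PATTERN = re.compile(
    r"(?:absorption\s+maximum|λmax|lambda_max)"  # how the peak is named
    r"\s*(?:at|of|=|:)?\s*"                      # optional connector
    r"(\d{3,4}(?:\.\d+)?)\s*nm",                 # the value, in nanometres
    re.IGNORECASE,
)


def extract_peaks(text: str) -> list[float]:
    """Return every absorption-peak wavelength (in nm) the pattern finds."""
    return [float(match.group(1)) for match in PEAK_PATTERN.finditer(text)]


# Invented sample sentences, not taken from any paper.
sample = (
    "The dye shows an absorption maximum at 452 nm in ethanol. "
    "For the second compound, λmax = 518.5 nm was recorded."
)
peaks = extract_peaks(sample)
print(peaks)
```

A pattern like this breaks as soon as an author words the sentence differently or reports the value in a table or figure, which is one way to see why a domain-specific tool was needed.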
Not one to search through articles by hand, Cole describes the motivation for developing the application as born of frustration. Initially, she tried more general-purpose NLP packages, but she noted that they did not merely fail, they failed badly.
The problem is one of translation: not between human languages, although there are similarities, but from the language of science.
For example, a writer might use a speech recognition program (a form of NLP) to transcribe notes or interviews. Trained mainly on the writer's voice, the program picks up its patterns and nuances and begins to transcribe fairly accurately. Introduce an interview subject with a foreign accent, however, and things start to go wrong.
In Cole's world, the foreign language is science, and each field is a different country. For now, a program can be trained in only one “language” (such as chemistry), and even then it must learn the specific dialects of that science.
Inorganic chemists may build formulas from unfamiliar symbolic representations of the chemical elements, while organic chemists prefer numbered chemical sketches set in figures. For most mining programs, information presented in either form is difficult to extract.
“And that is just a small slice of chemistry,” Cole said. “Because people describe things in such a variety of ways, domain specificity is absolutely crucial.”
To that end, the team's database focuses on properties of ultraviolet-visible (UV/vis) absorption spectra, providing a publicly available resource for users seeking materials with preferred spectral colors.
Although the team is using the new database to screen for organic dyes that may replace traditional metal-organic dyes in solar cells, they have identified applications for it in a much wider range of areas.
It can be used as a source of training data for machine learning methods that predict new optical materials. It can also provide a simple data retrieval option for users of UV/vis absorption spectroscopy, a core technique used in research laboratories around the world to characterize new materials.
“The protocol used in this project has been deployed to similar types of projects,” Vázquez-Mayagoitia added. “For example, the team recently used ChemDataExtractor and ALCF computing resources to generate databases of potential battery chemicals and of magnetic and superconducting compounds.”
The optical materials database research is published in the article “Comparative dataset of experimental and computational attributes of UV/vis absorption spectra” in Scientific Data. The other authors include Edward J. Beard of the University of Cambridge, and Ganesh Sivaraman and Venkatram Vishwanath of Argonne National Laboratory.
A paper detailing their work on magnetic and superconducting materials was published in npj Computational Materials. A battery materials database containing more than 290,000 data records has been published in Scientific Data.
Callum J. Court et al. Magnetic and superconducting phase diagrams and transition temperatures predicted using text mining and machine learning, npj Computational Materials (2020). DOI: 10.1038/s41524-020-0287-8
Shu Huang et al. A database of battery materials auto-generated using ChemDataExtractor, Scientific Data (2020). DOI: 10.1038/s41597-020-00602-2
Provided by Argonne National Laboratory
Citation: Automatic database creation for materials discovery: Innovation from frustration (September 23, 2020), retrieved September 24, 2020 from https://phys.org/news/2020-09-automatic-database-creation-materials-discovery.html