Managing, profiling and analyzing a library of 2.6 million compounds gathered from 32 chemical providers
Monge A, Arrault A, Marot C, Morin-Allory L
Molecular Diversity, vol. 10, pp. 389-403 (2006)
Data for 3.8 million compounds from the structural databases of 32 providers were gathered and stored in a single chemical database. Duplicates were removed using the IUPAC International Chemical Identifier (InChI), leaving 2.6 million compounds. Each provider database and the merged database were studied in terms of uniqueness, diversity, frameworks, and ‘drug-like’ and ‘lead-like’ properties. The study shows that the database contains more than 87 000 frameworks and 2.1 million ‘drug-like’ molecules, of which more than one million are ‘lead-like’. The study was carried out using ‘ScreeningAssistant’, a software package dedicated to chemical database management and screening set generation. Compounds are stored in a MySQL database, and all operations on this database are carried out by Java code. Drug-likeness and lead-likeness are estimated with ‘in-house’ scores based on functions that estimate how well a compound fits the desired properties; uniqueness is assessed using the InChI code, and diversity using molecular frameworks and fingerprints. The software was designed to make updating the database easy. ‘ScreeningAssistant’ is freely available under the GPL.
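The duplicate-removal and uniqueness steps described in the abstract can be illustrated with a minimal sketch. It is not the ScreeningAssistant implementation (which the abstract says uses Java and MySQL). It assumes the InChI strings have already been computed by some cheminformatics toolkit, and the names `merge_by_inchi`, `exclusive_counts`, and the sample provider records are hypothetical.

```python
from collections import defaultdict


def merge_by_inchi(records):
    """Merge provider records, keeping one entry per InChI string.

    records: iterable of (provider, compound_id, inchi) tuples.
    Returns (unique, providers): `unique` maps each InChI to the first
    record seen for it; `providers` maps each InChI to the set of
    providers that list it.
    """
    unique = {}
    providers = defaultdict(set)
    for provider, compound_id, inchi in records:
        # setdefault keeps the first record and ignores later duplicates.
        unique.setdefault(inchi, (provider, compound_id))
        providers[inchi].add(provider)
    return unique, providers


def exclusive_counts(providers):
    """Count, per provider, the compounds that no other provider lists."""
    counts = defaultdict(int)
    for provider_set in providers.values():
        if len(provider_set) == 1:
            counts[next(iter(provider_set))] += 1
    return dict(counts)


# Hypothetical sample data: methane is listed by both providers.
records = [
    ("A", "A-1", "InChI=1S/CH4/h1H4"),
    ("B", "B-7", "InChI=1S/CH4/h1H4"),
    ("B", "B-8", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("A", "A-2", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
]
unique, providers = merge_by_inchi(records)
print(len(unique), exclusive_counts(providers))
```

Keying on the InChI string rather than on a provider identifier or drawn structure means that two providers listing the same compound under different catalogue numbers collapse to a single entry, which is the property the abstract relies on for duplicate removal.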