{"id":"CONICETDig_f2f5c6ff2c480f52321c7cc62b02d0fe","dc:title":"Datasets used in the benchmarking exercise by SOMOC and iRAPCA","dc:creator":"Talevi, Alan","dc:date":"2024","dc:description":["Two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances."],"dc:format":["application\/zip"],"dc:language":["eng"],"dc:type":"dataset","dc:rights":["info:eu-repo\/semantics\/openAccess","https:\/\/creativecommons.org\/licenses\/by-nc-sa\/2.5\/ar\/"],"dc:relation":["info:eu-repo\/grantAgreement\/Ministerio de Ciencia, Tecnolog\u00eda e Innovaci\u00f3n Productiva. Agencia Nacional de Promoci\u00f3n Cient\u00edfica y Tecnol\u00f3gica. Fondo para la Investigaci\u00f3n Cient\u00edfica y Tecnol\u00f3gica\/PICT-CATI-2021-00073"],"dc:identifier":"https:\/\/repositoriosdigitales.mincyt.gob.ar\/vufind\/Record\/CONICETDig_f2f5c6ff2c480f52321c7cc62b02d0fe"}