OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS

Danilo Maslov; Oleksandr Golub

doi:10.33609/2708-129X.92.3.2026.27-32

Vol. 92 No. 3 (2026), Physical chemistry

Published 2026-04-30

Danilo Maslov

National University of "Kyiv-Mohyla Academy", 2 Skovorody St., 04070 Kyiv

Oleksandr Golub

National University of "Kyiv-Mohyla Academy", 2 Skovorody St., 04070 Kyiv

Keywords

QSAR modeling; machine learning; TRPV1; molecular descriptors.

How to Cite

Maslov, D., & Golub, O. (2026). OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS. Ukrainian Chemistry Journal, 92(3), 27–32. https://doi.org/10.33609/2708-129X.92.3.2026.27-32

Abstract

Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of candidate molecules. Quantitative Structure–Activity Relationship (QSAR) models are widely used for this purpose; however, their true predictive performance is often overestimated because of improper data-splitting strategies. A key challenge arises when test sets contain molecular scaffolds absent from the training data: models that appear accurate under random splits then fail to generalize to unseen chemical space.

This study investigates optimization strategies for QSAR modeling while explicitly accounting for molecular diversity. A dataset of 3,782 molecules with 3,291 computed descriptors and pChEMBL anesthetic activity values (5.01–8.52) for the TRPV1 receptor was analyzed. The dataset contained 733 unique scaffolds, 72 of which occurred exclusively in the test set under a random 80/20 split, revealing substantial information leakage. Three splitting strategies were compared: standard K-Fold (R² = 0.54), scaffold-based Group K-Fold (R² = 0.31), and stratified scaffold-aware splitting (R² = 0.646–0.7201), the last of which demonstrated the most realistic and stable performance.
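The scaffold bookkeeping behind these splits can be illustrated with a minimal sketch. It is not the authors' implementation: the scaffold labels are assumed to be precomputed strings (for example, Murcko scaffold SMILES), the toy data are invented, and the helper names `scaffolds_only_in_test` and `scaffold_group_folds` are hypothetical. The first helper counts scaffolds that a given split leaves only in the test set; the second assigns whole scaffold groups to folds, which is the idea behind a scaffold-based Group K-Fold (scikit-learn's `GroupKFold` offers a library counterpart). The stratified scaffold-aware variant is not specified in the abstract and is not reproduced here.

```python
import random
from collections import defaultdict


def scaffolds_only_in_test(scaffolds, test_idx):
    """Return scaffolds seen among the test indices but never in training."""
    test_idx = set(test_idx)
    in_test = {s for i, s in enumerate(scaffolds) if i in test_idx}
    in_train = {s for i, s in enumerate(scaffolds) if i not in test_idx}
    return in_test - in_train


def scaffold_group_folds(scaffolds, n_splits=5):
    """Assign molecules to folds so that no scaffold spans two folds.

    Greedy balancing (an assumption of this sketch): the largest scaffold
    groups are placed first, each into the currently smallest fold.
    """
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    folds = [[] for _ in range(n_splits)]
    for s in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[s])
    return folds


if __name__ == "__main__":
    # Toy data: one hypothetical scaffold label per molecule.
    scaffolds = ["A"] * 6 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E", "F", "G"]

    # Random 80/20 split, ignoring scaffolds.
    rng = random.Random(0)
    indices = list(range(len(scaffolds)))
    rng.shuffle(indices)
    random_test = indices[: len(indices) // 5]
    print("test-only scaffolds:", sorted(scaffolds_only_in_test(scaffolds, random_test)))

    # Scaffold-grouped folds: each scaffold lives in exactly one fold.
    for k, fold in enumerate(scaffold_group_folds(scaffolds, n_splits=3)):
        print("fold", k, sorted({scaffolds[i] for i in fold}), len(fold))
```

With grouped folds every validation scaffold is unseen during training, which is consistent with the grouped R² reported above being lower than the random K-Fold value.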

Multiple machine-learning approaches were evaluated, with Gradient Boosting achieving the best baseline accuracy. Optimization techniques included descriptor-level data augmentation (σ = 0.02), descriptor weighting by duplicating the most important features, and combinations of the two. The best model (R² = 0.7201, MAE = 0.41) was obtained by combining augmentation with triple duplication of the top-ranking descriptors. Several commonly used approaches (Morgan fingerprints, deep neural networks, and PCA) yielded significantly weaker performance, highlighting the superior informativeness of physicochemical descriptors for this dataset.
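The two optimization steps can be sketched as follows, under assumptions the abstract does not settle: the noise is taken to be additive Gaussian noise with σ = 0.02 applied to already-standardized descriptors, and "triple duplication" is read as each top-ranking descriptor appearing three times in total. The function names, the toy matrix, and the importance values are hypothetical; in practice the importances might come from a fitted Gradient Boosting model.

```python
import random


def augment_with_noise(X, y, sigma=0.02, copies=1, seed=0):
    """Append noisy copies of every row; targets are copied unchanged.

    Assumption: additive Gaussian noise on standardized descriptors.
    """
    rng = random.Random(seed)
    X_aug = [row[:] for row in X]
    y_aug = list(y)
    for _ in range(copies):
        for row, target in zip(X, y):
            X_aug.append([v + rng.gauss(0.0, sigma) for v in row])
            y_aug.append(target)
    return X_aug, y_aug


def duplicate_top_features(X, importances, top_k=2, times=3):
    """Repeat the top_k most important columns so each occurs `times` times."""
    ranked = sorted(range(len(importances)), key=lambda j: -importances[j])
    extra = [j for j in ranked[:top_k] for _ in range(times - 1)]
    return [row + [row[j] for j in extra] for row in X]


if __name__ == "__main__":
    # Toy standardized descriptor matrix (3 molecules x 4 descriptors).
    X = [[0.1, -1.2, 0.5, 2.0], [1.3, 0.4, -0.7, 0.0], [-0.9, 0.8, 0.2, -1.1]]
    y = [5.4, 7.1, 6.3]                  # hypothetical pChEMBL values
    importances = [0.05, 0.60, 0.30, 0.05]  # hypothetical feature importances

    X_w = duplicate_top_features(X, importances, top_k=2, times=3)
    X_aug, y_aug = augment_with_noise(X_w, y, sigma=0.02, copies=1)
    print(len(X_aug), "rows,", len(X_aug[0]), "columns")
```

Noisy copies would normally be generated from the training fold only, so that no perturbed version of a validation molecule reaches the model. Duplicated columns carry no new information; one plausible reading is that they raise the chance that an important descriptor is offered as a split candidate when the learner subsamples features.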

The resulting model demonstrates practical utility for early-stage virtual screening and prioritization of candidate molecules, providing a reliable tool for guiding medicinal chemistry decisions.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
