Ankit Halder (Mumbai / IN), Sabyasachi Samantaray (Mumbai / IN), Sahil Barbade (Mumbai / IN), Sanjeeva Srivastava (Mumbai / IN)
The development of the integrated omics-based knowledgebase BrainProt™ (https://www.brainprot.org/) has advanced our understanding of markers associated with brain diseases. Further, the development of the BDDF (Brain Disease Drug Finder) domain has clarified which current drugs target these diseases and their associated markers. However, researchers still face considerable uncertainty when choosing a drug target, because comprehensive models for identifying the best available targets are lacking. We emphasise that, beyond screening potential inhibitors, understanding which drug targets are viable is also crucial. The unavailability of reliable structures for a large portion of the druggable proteome leads to inaccurate results in in silico screening. Previous studies that attempted to characterise the druggable proteome using simple physicochemical properties have not achieved optimum accuracy with reliable methods. In this study, we dig deeper into protein characteristics, going beyond basic physicochemical properties to incorporate a wealth of SwissProt-verified information extracted from UniProt, including subcellular locations, protein-protein interactions and peptide domains. We employ several feature selection strategies to identify the major criteria for protein druggability and evaluate these features comprehensively. For non-interpretable models such as neural networks, we use Shapley values to promote explainability and to interpret feature importance. The inherent imbalance between the numbers of druggable and non-druggable proteins (according to DrugBank, https://go.drugbank.com/) poses a challenge for model development. We address this imbalance using oversampling strategies such as SMOTE. Our classifier achieved approximately 80% accuracy, an F1-score of 0.8 and an AUC of 0.9 in predicting druggable and non-druggable proteins on the held-out test set.
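The oversampling step mentioned above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. This is not the pipeline used in the study. The function name `smote`, the two-dimensional toy features and the class sizes are hypothetical, chosen only for illustration. In practice a maintained implementation (for example the one in imbalanced-learn) would normally be preferred, and oversampling would be applied to the training split only, so that the held-out test set keeps its original class distribution.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (the core SMOTE step)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 4 "druggable" (minority) and 10 "non-druggable" proteins,
# each described by two made-up numeric features.
druggable = [[0.9, 1.2], [1.1, 0.8], [1.0, 1.0], [1.3, 1.1]]
non_druggable = [[float(i), float(-i)] for i in range(10)]

new_points = smote(druggable, len(non_druggable) - len(druggable), k=3)
balanced_minority = druggable + new_points
print(len(balanced_minority), len(non_druggable))
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows.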
Looking ahead, we hypothesise that employing Recurrent Neural Networks (RNNs) to encode sequential properties such as flexibility, hydrophobicity, relative surface accessibility (RSA), accessible surface area (ASA) and PDB secondary structure information can further improve accuracy. We anticipate that these models can be deployed at scale to make the drug discovery process more effective.
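As a rough illustration of what a sequential property looks like before it reaches a sequence model, the sketch below converts an amino-acid sequence into a per-residue hydrophobicity profile. The choice of the Kyte-Doolittle hydropathy scale, the sliding-window smoothing and the function name `hydrophobicity_profile` are assumptions made for this example; the abstract does not specify how such properties would be encoded.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, window=3):
    """Return one smoothed hydropathy value per residue.

    Each value is the mean over a window centred on the residue; the window
    is truncated at the sequence ends. Non-standard residue codes raise
    KeyError in this sketch.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    half = window // 2
    profile = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


# Hypothetical short peptide, used only to show the output shape.
profile = hydrophobicity_profile("MKTAYIAKQR", window=3)
print(len(profile))
```

The result has one value per residue, so several such profiles (for example flexibility or RSA, if available) could be stacked into a feature vector per position and passed to a recurrent model as a variable-length sequence.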