
AmpClass: an Antimicrobial Peptide Predictor Based on Supervised Machine Learning

Abstract

In the last decades, antibiotic resistance has been considered a severe problem worldwide. Antimicrobial peptides (AMPs) are molecules that have shown potential for the development of new drugs against antibiotic-resistant bacteria. Nowadays, medicinal drug researchers use supervised learning methods to screen new peptides with antimicrobial potency to save time and resources. In this work, we consolidate a database with 15945 AMPs and 12535 non-AMPs taken as the base to train a pool of supervised learning models to recognize peptides with antimicrobial activity. Results show that the proposed tool (AmpClass) outperforms classical state-of-the-art prediction models and achieves similar results compared with deep learning models.

Key words
Antimicrobial Peptides; Cheminformatics; Computational Biology; Machine Learning

INTRODUCTION

Infectious diseases are among the ten leading causes of death worldwide (World Health Organization 2020). During the last decades, antibiotic resistance has been considered a severe problem, affecting patients’ quality of life and imposing an economic burden on health systems (World Health Organization 2021). The WHO prioritizes some bacterial species according to their need for new antibiotics. These include carbapenem-resistant strains of Acinetobacter baumannii and Pseudomonas sp. (Tacconelli & Magrini 2017). Meanwhile, Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, A. baumannii, P. aeruginosa and Enterobacter spp., collectively known as ESKAPE, are recognized as multidrug-resistant (Mulani et al. 2019) and are of worldwide concern. The situation is worsened by the use of antibiotics in the dairy industry and on agricultural land, which leads to the dissemination of antibiotic resistance genes in the environment (Oliver et al. 2020, Lu et al. 2020).

Furthermore, recent studies showed the presence of antibiotic resistance genes in soils fertilized with swine manure (Van den Meersche et al. 2020) and in hospital settings such as kitchen surfaces and on food handlers (Benjelloun et al. 2020). Although the need for new antibiotics is clear, the pharmaceutical industry’s interest in developing new ones has decreased. The development of a new molecule by the pharmaceutical industry can take up to 12 years with an investment of around 1300 million dollars (Walsh & Wencewicz 2014). In 2013, only 1.6% of the potential drugs under research and development at 15 of the pharmaceutical industry’s leading companies were antibiotics (Fair & Tor 2014). There are several reasons for the pharmaceutical industry’s low interest in developing new antibiotics. Among them are the lower return on investment compared to drugs developed for chronic diseases such as diabetes or hypertension, and the appearance of resistant bacterial strains soon after new antibiotics are introduced (Walsh & Wencewicz 2014). Since most current antibiotics have been developed by searching for metabolites from environmental microbes, it is possible that resistant strains and genes already exist in nature, as D’Costa et al. (2011) have demonstrated.

Therefore, the search for new antibiotics must include novel classes, such as antimicrobial peptides (AMPs). AMPs are naturally produced by all life forms (Lazzaro et al. 2020, Torres et al. 2019). However, the search for AMPs by classical bio-prospecting strategies has the same economic cost and takes a similar amount of time as the search for secondary microbial metabolites. Consequently, alternative strategies have recently been employed, including bioinformatic analysis of proteomes (Hincapié et al. 2018) and the development of artificial intelligence algorithms to identify the antimicrobial potential of peptides (Wu et al. 2019), most of them following a supervised learning approach. In the supervised learning approach, an algorithm learns the relationship between input variables (x) and an output variable (y). This process is known as training, and it generates a model that predicts outcomes for new input data.

In AMP recognition problems, a supervised learning algorithm must be trained with a set of descriptors (or features) extracted from peptide samples. These descriptors are the input variables (x), which are usually physicochemical, compositional, or structural properties calculated from the peptide sequence. The output variable (y), in turn, may be a class label (AMP or non-AMP), a value representing the peptide’s biological activity, or a sequence representing a synthetic AMP. Recognition methods can be categorized into three types:

  • Classification approaches, which learn the differences between AMPs and non-AMPs or the differences among peptides with specific activities (antiviral, anticancer, antifungal, etc.) to assign a class label to an unknown peptide.

  • Regression approaches, which generate a model to predict a value that represents the biological activity of a peptide, such as the minimum inhibitory concentration (MIC), according to its physicochemical, compositional and structural properties.

  • Generation approaches, which look to understand how peptides are composed to generate new synthetic AMPs.
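The classification setting described above can be illustrated with a deliberately small sketch. It is not the method used by AmpClass or by any tool cited here: the two descriptors, the residue groupings, the nearest-centroid rule and the toy sequences are all assumptions chosen only to show how descriptors (x) and class labels (y) feed a training step that yields a model for new peptides.

```python
from statistics import mean

# Assumed residue groupings, for illustration only.
HYDROPHOBIC = set("AILMFWVY")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def describe(seq):
    """Map a peptide sequence to a two-value descriptor vector x."""
    n = len(seq)
    net_charge = sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)
    hydrophobic_fraction = sum(aa in HYDROPHOBIC for aa in seq) / n
    return [net_charge / n, hydrophobic_fraction]


def train(sequences, labels):
    """Training: store one centroid (mean descriptor vector) per class label y."""
    model = {}
    for label in set(labels):
        rows = [describe(s) for s, y in zip(sequences, labels) if y == label]
        model[label] = [mean(column) for column in zip(*rows)]
    return model


def predict(model, seq):
    """Prediction: return the label whose centroid is closest to the descriptors of seq."""
    x = describe(seq)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Made-up toy sequences, not taken from any database.
toy_sequences = ["KKLLKKLLKK", "RRWWRRWW", "DDEESSGGTT", "EEDDSSNNQQ"]
toy_labels = ["AMP", "AMP", "non-AMP", "non-AMP"]
toy_model = train(toy_sequences, toy_labels)
print(predict(toy_model, "KRLLKRLL"))
```

A regression approach would keep the same descriptor step but replace the class label with a numeric target such as the MIC.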

Classification and regression approaches are based either on traditional supervised learning methods, such as discriminant analysis (DA), support vector machines (SVM), artificial neural networks (ANN), decision trees (DT) and random forests (RF), or on deep learning approaches such as recurrent neural networks (Veltri et al. 2018). On the other hand, generation methods usually rely on genetic algorithms or deep learning approaches (Müller et al. 2018). We have compiled a list of some of the most representative state-of-the-art classification methods for AMP recognition, which are described as follows.

  • AntiBP (Lata et al. 2007) and AntiBP2 (Lata et al. 2010) are two predictors for AMP classification. AntiBP trained three classification models based on the SVM, ANN, and quantitative matrices (QM) algorithms, while AntiBP2 only uses SVM. Both methods use residues from the C- and N-terminal ends to describe a peptide.

  • CAMP, the Collection of Antimicrobial Peptides (Thomas et al. 2009), is a database of experimentally validated AMPs. CAMP has a well-known predictor that uses the SVM, RF and DA learning methods trained with the amino acid composition (AAC), physicochemical properties, and structural characteristics of the peptides. CAMP has had two updates, the first in 2014 (Waghu et al. 2014) and the second in 2016 (Waghu et al. 2016). The latest version was trained with 10247 sequences and can also classify AMPs as antibacterial, antifungal, or antiviral (Waghu & Idicula-Thomas 2020).

  • ClassAMP (Joseph et al. 2012) is another tool for AMP prediction, which implements SVM and RF algorithms. This tool aims to predict whether a peptide has antibacterial, antiviral or antifungal activity based on three models trained following the one-against-all approach. Accordingly, the class of a peptide is assigned based on the majority vote among the three models. As features, ClassAMP uses compositional and physicochemical properties, structural characteristics, the BLOSUM-50 matrix and dipeptide and tripeptide frequencies, among others.

  • iAMP-2L (Xiao et al. 2013) is a two-level classifier that first recognizes whether a sequence is an AMP. If the peptide is classified as an AMP, the second level predicts the peptide’s type of activity among the antibacterial, anticancer, antifungal, anti-HIV, and antiviral classes. The algorithm used for iAMP-2L is the fuzzy K-nearest neighbor (FKNN), and it was trained using Chou’s pseudo amino acid composition (PAAC) (Chou 2001) of the sequences as a descriptor.

  • iDPF-PseRAAAC (Zuo et al. 2015) is a tool to identify peptides from the defensin family and their subfamilies. iDPF-PseRAAAC trained a multi-class SVM with an RBF kernel using 333 defensin proteins from five families: insects, vertebrates, invertebrates, plants, and the unclassified family. As a descriptor, a reduced amino acid alphabet composition (RAAAC) is used. Because the authors employed a one-against-one strategy for the multi-class SVM, the sequence’s class is assigned according to the majority vote scheme.

  • Ng et al. (2015) developed an integrated algorithm using sequence alignment and an SVM classifier. First, a sequence alignment strategy based on the BLASTP method is used to determine if a peptide is an AMP. Then, if the result of the alignment is undefined, a prediction based on an SVM is made. The SVM was trained with the pairwise scores obtained with the LZ complexity algorithm over a set of fixed sequences.

  • iAMPpred (Meher et al. 2017) is a predictor composed of four SVMs. The first one predicts whether a given peptide is an AMP; the others predict three specific activities: antibacterial, antiviral and antifungal. In this case, the authors used as descriptors the AAC, PAAC, normalized amino acid composition (NAAC), three physicochemical properties and three structural properties.

  • Veltri et al. (2017) proposed an approach that employs logistic regression (LR) and decision tree models for AMP recognition. This classifier differs from other approaches because the authors used genetic programming to combine sequence-based features, such as motif and positional sequence features, into new complex features that discriminate between AMPs and non-AMPs.

  • Schneider et al. (2017) presented a two-step strategy that first uses a self-organizing map (SOM) to perform a dimensionality reduction over 147 molecular descriptors. The transformed descriptors are used to train an ANN, which classifies a sequence as AMP or non-AMP.

  • Lee et al. (2018) used an SVM with a linear kernel trained with 12 descriptors automatically selected from 1588 calculated with Propy (Cao et al. 2013). According to the authors, their classifiers learn to discriminate AMPs based on membrane activity rather than on antimicrobial activity. The authors tested their hypothesis experimentally by synthesizing 16 helical peptides.

  • CellPPDMod (Kumar et al. 2018) is a tool for predicting cell-penetrating peptides with non-natural and modified residues. In this approach, the authors worked with an RF classifier trained with 48 chemical descriptors (2D and 3D descriptors, and fingerprints) extracted with PaDEL (Yap 2011) and selected from a set of 15537 descriptors. A disadvantage of this approach is that the input must be the tertiary structure of a peptide with a maximum length of 25 residues.

  • DBAASP (Vishnepolsky et al. 2018) was recently presented as a prediction tool for small linear AMPs with specific activity against Gram-negative species. This approach uses a semisupervised learning strategy to create a pool of clusters trained with different subsets of physicochemical descriptors.

  • AMP Scanner DNN (Veltri et al. 2018) explored a new strategy based on deep learning for AMP classification, where the model is trained using the sequences rather than their descriptors, as is done with classical supervised learning methods. This neural model is built on convolutional and recurrent layers that learn the sequence composition patterns of AMPs.

  • Antifp (Agrawal et al. 2018) is a tool designed to predict antifungal peptides. In this study, a binary pattern model is used for fixed lengths of N-terminus and C-terminus peptide regions. The best model based on an SVM achieved an accuracy of 84.64% with an MCC of 0.69 and a ROC of 0.92 on the validation dataset.

  • A-CaMP (Kaushik et al. 2021) is an anticancer peptide (ACP) prediction tool that identifies peptides with high affinity to the target protein sequences provided by the user. A-CaMP takes advantage of medical data to identify potential ACPs from wild and mutated cancerous protein sequences.

  • Deep-AmPEP30 (Yan et al. 2020) is a prediction tool for short AMPs (≤ 30 amino acids) based on deep learning. In a nutshell, Deep-AmPEP30 trained a convolutional neural network on a subset of the PseKRAAC reduced amino acid composition. This tool has two models: Deep-AmPEP30 and RF-AmPEP30. In previous work, the authors presented AmPEP, which uses the distribution patterns of amino acids along the sequence to train an RF classifier (Bhadra et al. 2018).

As a contribution, this work proposes AmpClass, a new tool for short AMP recognition trained with a wide range of peptides obtained from 28 peptide and protein databases. The recovered peptides were used to train four machine learning algorithms (XGBoost, Random Forest, Neural Network and Decision Tree) to predict whether a peptide is an AMP. The resulting predictors were compared with some recent approaches, including some based on deep learning. Tests show that the proposed method is among the best-performing ones currently available.

MATERIALS AND METHODS

The key idea of this work is to use a machine learning methodology, as shown in Figure 1, to automatically recognize short peptides with antimicrobial activity. The steps involved in this methodology are AMP database compilation, feature extraction, feature selection, classification, and evaluation.

Figure 1
AMP recognition methodology using supervised machine learning. The top shows the model training process, and the bottom illustrates how to use the trained model.

Database compilation

First, we collected positive (AMPs) and negative (non-AMPs) peptides from different sources. A filtering process was then applied to leave only short peptides composed of canonical amino acids.

Positive Database

Twenty-eight databases comprised of bioactive peptides were analyzed. Their listed AMPs were extracted to generate a single (positive) database to train the AMPs prediction tool. The databases used in this study are listed in the Supplementary Material - Tables SI-SIII.

From the considered databases, only antimicrobial peptides were selected. Based on the literature (Waghu & Idicula-Thomas 2020, Pirtskhalava et al. 2016), we defined as antimicrobial any peptide reported to have at least one of the following properties, or a non-listed synonym of them: antibacterial, antifungal, antiviral, antiparasitic, anticancer, antinematodal, antiprotist, antiprotozoal, antibiofilm and insecticidal. Sub-classes of the above-mentioned activities were included as part of their parent class (e.g. anti-Gram-positive is taken as antibacterial).

Peptide duplicates were removed, as were peptides with non-canonical amino acids or non-FASTA characters (including all N- and C-terminal modifications), peptides represented in three-letter amino acid codes, peptides composed of only two different amino acids, and peptides with fewer than 7 or more than 300 amino acids. After that, the resulting positive database consisted of 23135 unique sequences.
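The filtering rules above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, sequences are assumed to arrive as upper-case one-letter strings, and "combinations of two amino acids" is interpreted as "at most two distinct residues". Three-letter codes and terminal modifications are rejected implicitly because they introduce characters outside the 20 canonical letters.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical one-letter codes


def keep_peptide(seq, min_len=7, max_len=300):
    """Return True if seq passes the length, alphabet and composition filters."""
    if not (min_len <= len(seq) <= max_len):
        return False
    residues = set(seq)
    # Rejects non-canonical residues, lower-case three-letter codes, dashes, etc.
    if not residues <= CANONICAL:
        return False
    # Assumption: peptides built from only one or two distinct residues are dropped.
    return len(residues) > 2


def build_database(raw_sequences):
    """Remove duplicates and filtered-out peptides, preserving input order."""
    seen = set()
    kept = []
    for raw in raw_sequences:
        seq = raw.strip()
        if seq in seen or not keep_peptide(seq):
            continue
        seen.add(seq)
        kept.append(seq)
    return kept
```

Lower-casing or upper-casing the input before the check is deliberately avoided here, since it would let three-letter codes such as "GlyLeu" pass as valid one-letter sequences.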

Several authors (Gautam et al. 2016, Gabere & Noble 2017, Yi et al. 2019) have proposed to remove bias in antimicrobial peptide datasets by using CD-HIT (Fu et al. 2012), which implements a clustering methodology to reduce the number of redundant sequences based on a similarity threshold t. Other studies have suggested using short peptides when possible, since the cost of synthesizing very long peptides is high, at least assuming chemical synthesis (Marx 2005, Hancock & Sahl 2006). In response to both concerns, two positive datasets were created from the 23135 antimicrobial peptides collected: the first, named POS A, contains 15945 unique peptides with lengths between 7 and 35 amino acids; the second, POS B, was built by reducing the redundancy of POS A with CD-HIT (using a threshold of 0.9) and contains 9274 unique sequences.
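The construction of the two datasets can be sketched in two steps: a length filter (POS A) followed by a greedy redundancy reduction (POS B). The second function is not CD-HIT. It only mimics the general greedy idea (longest sequences first, each sequence kept only if it is not too similar to an already kept representative) and uses `difflib.SequenceMatcher.ratio()` as a stand-in similarity, which is an assumption and does not reproduce CD-HIT's identity measure or its results.

```python
from difflib import SequenceMatcher


def select_short(peptides, min_len=7, max_len=35):
    """Keep peptides whose length falls in the short-peptide range."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


def reduce_redundancy(peptides, threshold=0.9):
    """Greedy redundancy reduction with a stand-in similarity (not CD-HIT)."""
    representatives = []
    # Process longest sequences first; ties are broken alphabetically.
    for seq in sorted(set(peptides), key=lambda s: (-len(s), s)):
        is_novel = all(
            SequenceMatcher(None, seq, rep, autojunk=False).ratio() < threshold
            for rep in representatives
        )
        if is_novel:
            representatives.append(seq)
    return representatives


# Made-up toy input, not taken from the actual positive database.
toy = ["GLFDIVKKVVGALGSL", "GLFDIVKKVVGALGSA", "KWKLFKKIEKVGQNIRDG", "A" * 40, "GLF"]
pos_a = select_short(toy)
pos_b = reduce_redundancy(pos_a, threshold=0.9)
```

In this toy input the two peptides that differ by a single residue collapse into one representative, while the unrelated peptide is kept.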

To analyze how each database contributes to the POS A dataset, we built a heat-map of the ten most contributing sources, according to the total number of peptides that made it into the POS A dataset and the databases in which they were indexed (Figure 2). The number of peptides that each source added to the dataset is on the diagonal, at the intersection of each database with itself. The other cells show the number of peptides found to be duplicated between each corresponding pair of sources. Accordingly, DBAASP (Pirtskhalava et al. 2016) is the database with the greatest contribution, followed by SATPdb (Singh et al. 2015) and LAMP (Zhao et al. 2013). The pair of databases with the most peptides in common is also DBAASP and SATPdb.

Figure 2
Heat-map of the ten biggest databases used in this study showing the number of shared peptides between them.
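The counts behind such a heat-map can be computed from one set of peptides per source. The sketch below is an assumption about how this could be done, not the authors' code; the function name and the toy source names are hypothetical. Diagonal cells hold the number of peptides contributed by each source, and off-diagonal cells hold the number shared by a pair of sources.

```python
def overlap_matrix(sources, top=10):
    """Return source names (largest first) and the pairwise shared-peptide counts.

    sources maps a database name to the set of its peptides that reached the dataset.
    """
    names = sorted(sources, key=lambda name: len(sources[name]), reverse=True)[:top]
    matrix = [[len(sources[a] & sources[b]) for b in names] for a in names]
    return names, matrix


# Toy input with hypothetical database names and peptide identifiers.
toy_sources = {
    "db_a": {"P1", "P2", "P3"},
    "db_b": {"P2", "P3"},
    "db_c": {"P3"},
}
names, matrix = overlap_matrix(toy_sources)
for name, row in zip(names, matrix):
    print(name, row)
```

The resulting matrix is symmetric, so a plotting library only needs the names for both axes and the matrix for the cell values.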

Negative Database

We collected peptides from UniProtKB/Swiss-Prot (The Uniprot Consortium 2019) and the NCBI Reference Sequence Database (RefSeq) (O’Leary et al. 2016) using Entrez Direct (Kans 2013), excluding sequences annotated with antimicrobial activity, to create a collection of non-AMP peptides (the negative database). All peptide duplicates were eliminated, as well as peptides with non-canonical amino acids and peptides formed by combining only two amino acids. We also tried to ensure that the length distribution of the selected non-AMPs approximates that of the positive database.
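The text does not specify how the length distributions were matched. One simple possibility, shown here purely as an assumption, is to draw for each peptide length as many non-AMP candidates as there are positives of that length (or all available candidates if there are fewer). The function name and the fixed random seed are hypothetical choices for the illustration.

```python
import random
from collections import Counter, defaultdict


def match_length_distribution(positives, candidates, seed=0):
    """Sample negatives so that their length counts follow those of the positives."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_length = defaultdict(list)
    for seq in candidates:
        by_length[len(seq)].append(seq)
    selected = []
    for length, count in sorted(Counter(len(p) for p in positives).items()):
        pool = by_length.get(length, [])
        # Take as many candidates of this length as are needed, up to what is available.
        selected.extend(rng.sample(pool, min(count, len(pool))))
    return selected
```

When a length is under-represented among the candidates, the match is only approximate, which is consistent with the hedged wording of the text.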

Using the resulting peptide pool, we built six different negative datasets, following a procedure similar to that of the positive database, but deciding whether to take peptides first from Swiss-Prot or RefSeq. Table I describes these datasets. The final selected datasets are available at https://bit.ly/37VJe2c.

Table I
Negative datasets created based on the use of CD-HIT and peptide lengths.

Feature extraction

To classify the peptide sequences, it was necessary to define the methods for extracting relevant descriptors. Among these descriptors, we included global physicochemical descriptors, structural descriptors, and deep learning-based embeddings, which were calculated with the following Python packages:

  • modlAMP (Müller et al. 2017) was used to calculate nine global physicochemical descriptors: molecular weight, sequence charge, charge density, instability index, aromaticity, aliphatic index, Boman index, and hydrophobicity; and 24 molecular descriptors, among them: the amino acid selectivity index scale for helical AMPs, Argos hydrophobicity amino acid scale and the Eisenberg hydrophobicity consensus amino acid scale.

  • Biopython (Cock et al. 2009) was used to calculate two global physicochemical descriptors: the isoelectric point and Kyte and Doolittle’s GRAVY score.

  • BioVec/ProtVec (Asgari & Mofrad 2015) includes functionality to generate synthetic descriptors through a deep learning-based embedding. BioVec uses a mechanism similar to that used in natural language processing to determine the relationship between the n-grams of a sequence. This study used the package’s pre-trained 3-gram model, which generates 300 synthetic descriptors for each sequence.

  • PyDPI (Cao et al. 2013) allows the calculation of structural and physicochemical descriptors for proteins and peptides. In this work, we used PyDPI to calculate the 20 amino acid composition descriptors (AAC), 400 dipeptide composition descriptors (DPC), 147 composition, transition and distribution descriptors (CTD), 240 Moreau-Broto normalized autocorrelation descriptors, 160 quasi-sequence order descriptors (QSO), 50 pseudo-amino acid composition descriptors (PSAAC) and 430 additional autocorrelation descriptors.

  • iLearn (Chen et al. 2020) is a compilation of machine learning tools for DNA, RNA, and protein sequences. It was used to calculate a Grouped Amino Acid Composition (GAAC) descriptor, which has a length of 155 (5 + 25 + 125). It considers the prevalence of individual residues, dipeptides (2-mers) and tripeptides (3-mers) formed from five classes of amino acids: aliphatic group (g1: GAVLMI), aromatic group (g2: FYW), positively charged group (g3: KRH), negatively charged group (g4: DE), and uncharged group (g5: STCPNQ). iLearn was also used to calculate the Conjoint Triad descriptor (CTriad), which considers the properties of one amino acid and its neighbors by regarding any three contiguous amino acids as a single unit; and the k-Spaced Conjoint Triad (KSCTriad) descriptor, which considers the contiguous amino acid units that are separated by any k residues. Both generated a descriptor with a length of 343.

The final dataset comprised 2255 descriptors, each of them calculated for every peptide.
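As a rough illustration of the composition-type descriptors listed above, the following sketch computes the amino acid composition (20 values), the dipeptide composition (400 values) and the five single-group frequencies of the grouped composition directly from a sequence. It is a plain-Python approximation, not the PyDPI or iLearn implementation; the function names and the example peptide are hypothetical, and it assumes sequences of at least two canonical residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

# Five residue classes as listed in the text for the grouped composition.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dpc(seq):
    """Dipeptide composition: fraction of each of the 400 ordered residue pairs."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return {pair: n / total for pair, n in counts.items()}


def group_composition(seq):
    """Fraction of residues falling in each of the five classes."""
    return {name: sum(seq.count(aa) for aa in members) / len(seq)
            for name, members in GROUPS.items()}


peptide = "GLFDIVKKVVGALGSL"  # hypothetical example sequence
features = (list(aac(peptide).values())
            + list(dpc(peptide).values())
            + list(group_composition(peptide).values()))
print(len(features))  # 20 + 400 + 5 values
```

The full GAAC descriptor described above also counts group-level 2-mers and 3-mers (25 + 125 values); these are omitted here to keep the sketch short.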

Feature selection

Feature selection is used in machine learning to reduce the dimensionality of data. In classification problems, feature selection aims to choose a subset of highly discriminative features, eliminating those irrelevant and those that decrease the model accuracy and quality. In this way, feature selection improves learning performance, lowers computational complexity, builds better-generalized models, and decreases data storage requirements (Li et al. 2017, Tang et al. 2020).

In the AMP prediction problem, feature selection techniques help to identify the properties relevant to distinguishing between AMPs and non-AMPs. Due to the significant number of features extracted for each peptide, two pipelines of different feature selection techniques were applied to select the relevant features using the Scikit-Learn toolkit. The first pipeline was adjusted for non-tree-based supervised learning algorithms and the second one for tree-based supervised learning algorithms. Both pipelines are described below.

Feature selection for non-tree-based learning algorithms

  • Data scaling: since linear classification methods tend to be skewed when they are trained with features at different scales, these features should be standardized (Raju et al. 2020). In this work, we evaluated several methodologies, including Min-Max Scaling, Standardization, and Robust Scaling, as described and implemented in the Scikit-Learn toolkit. For non-tree-based supervised learning, the Robust Scaling methodology was used since it gave the best results according to the performed experiments (results not shown).

  • Correlation analysis: it is used to identify linear relationships among the peptides’ descriptors (Franzese & Iuliano 2019). A high correlation means that two or more features have a strong linear dependence. When two features have a high correlation, one can be eliminated from the dataset because they have the same effect on the classification process. In this work, Pearson’s correlation coefficient was used to find and eliminate features with a correlation greater than 0.90.

  • Elimination of irrelevant features: we proceeded to identify and to eliminate features whose variance did not reach a specific threshold value. We utilized the Variance Threshold method from Scikit-Learn with a threshold value of 0.16.

  • Univariate selection: it selects the n best features based on univariate statistical tests. In this work, we selected the 500 best features according to an F-Test (ANOVA) (Yu & Liu 2003) and the 500 best features according to the mutual information criterion (Li et al. 2017). Then the features that were selected by both methods remained as the result of the univariate selection.

  • Recursive Feature Elimination: it selects the features by evaluating the classification error while recursively eliminating one feature each time (Chandrashekar & Sahin 2014). According to the error obtained using a Linear Discriminant Analysis classifier, we selected the 100 best features over the set of features obtained by the univariate selection.

Accordingly, this pipeline obtained a set of 100 relevant features selected from the 2255 calculated descriptors.
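The correlation and variance filters of this pipeline can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Scikit-Learn code used in the study: the helper names and the toy descriptor columns are hypothetical, the absolute value of Pearson’s coefficient is compared with the 0.90 cutoff, the first feature of a correlated pair is the one kept, and the scaling, univariate and recursive elimination steps are left out.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.90):
    """Keep a feature only if |r| with every already kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def drop_low_variance(columns, names, threshold=0.16):
    """Keep features whose population variance exceeds the threshold."""
    return [n for n in names if pvariance(columns[n]) > threshold]


# Hypothetical descriptor values for five peptides (one list per feature).
columns = {
    "charge": [1.0, 2.0, 3.0, 4.0, 5.0],
    "charge_density": [2.1, 4.0, 6.2, 8.1, 9.9],  # almost a multiple of "charge"
    "gravy": [0.5, -1.2, 0.9, -0.4, 1.6],
    "rare_motif": [0.0, 0.0, 0.0, 0.0, 0.5],      # nearly constant
}
uncorrelated = drop_correlated(columns)
selected = drop_low_variance(columns, uncorrelated)
print(selected)
```

Because the variance cutoff depends on the scale of each feature, applying it before or after scaling gives different results; the order used here follows the list above.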

Feature selection for tree-based learning algorithms

This pipeline was similar to the previous one. Regarding the scaling methods, the tests performed (results not shown) indicated that Min-Max Scaling granted a slight improvement compared with the Standardization and Robust Scaling methods. Additionally, 500 features were selected based on a univariate selection with an F-Test and 500 features with the mutual information criterion. Also, 500 features were selected with a backward selection method (Chandrashekar & Sahin 2014). Then we intersected the selected features and obtained a set of 254 relevant descriptors.
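The intersection step can be pictured with a small sketch. The feature names, scores and the backward-selection result below are hypothetical and serve only to show how three independently selected subsets are combined; the study used 500 features per criterion rather than 3.

```python
def top_k(scores, k):
    """Names of the k features with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Hypothetical per-feature scores from the two univariate criteria.
f_test = {"charge": 9.1, "gravy": 7.4, "pI": 6.8, "boman": 1.2, "mw": 0.3}
mutual_info = {"charge": 0.30, "gravy": 0.05, "pI": 0.22, "boman": 0.18, "mw": 0.01}
# Hypothetical subset returned by a backward selection run.
backward_selected = {"charge", "pI", "boman"}

k = 3  # the text uses 500; 3 keeps the toy example readable
selected = top_k(f_test, k) & top_k(mutual_info, k) & backward_selected
print(sorted(selected))
```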

These feature selection pipelines were obtained based on the peptide database acquired in Section Dataset Compilation, which is a subset of the training dataset and is independent of the validation dataset.

Classification

To predict whether a peptide is potentially antimicrobial, we tested several algorithms, either statistical or neural network-based. The best performance results were achieved using logistic regression (LR), a decision tree (DT), random forest (RF), artificial neural network (ANN), and the boosting algorithm XGBoost (Brownlee 2020). Each algorithm has a set of hyperparameters that should be adjusted to get the best performance in the AMP recognition task. Here, we performed the hyperparameter tuning through a grid search, which evaluates all possible combinations of values in a given set. As a result, the grid search returns the hyperparameter values that maximize a classifier’s performance. In this step, the performance was established as the average of the F1 score obtained by k-fold cross-validation, a technique widely used in machine learning problems (Duda et al. 2000). In cross-validation, peptides are randomly divided into k folds. k-1 folds are used as training data, and the remaining fold is used as testing data to evaluate the classifier’s performance. The experiment is repeated k times, interchanging train and test data. The final performance of the classifier is the average F1 score among the k experiments. The hyperparameter values that achieve the best performance using 5-fold cross-validation are reported in Table SI.
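The tuning loop described above can be sketched without any external library. The example below is only an illustration: it swaps in a toy one-dimensional nearest-neighbour classifier for the models of the study, and the data, the `n_neighbors` grid and all function names are hypothetical. What it does show is the mechanism: every combination in the grid is scored by the mean F1 over k folds, and the best combination is returned.

```python
import random
from itertools import product
from statistics import mean


def f1_score(y_true, y_pred):
    """F1 computed from true positives, false positives and false negatives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cross_val_f1(data, n_neighbors, k=5, seed=0):
    """Mean F1 over k folds; each fold serves once as the test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        preds = [knn_predict(train, x, n_neighbors) for x, _ in test]
        scores.append(f1_score([y for _, y in test], preds))
    return mean(scores)


def grid_search(data, grid, k=5):
    """Evaluate every combination in the grid; return the best one and its score."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_f1(data, k=k, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical data: (single descriptor value, label), label 1 meaning AMP.
data = [(x / 10, 0) for x in range(20)] + [(3 + x / 10, 1) for x in range(20)]
best_params, best_score = grid_search(data, {"n_neighbors": [1, 3, 5]})
print(best_params, best_score)
```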

Evaluation

To evaluate the overall performance of a classifier, several measures are commonly used in machine learning, such as accuracy (ACC), sensitivity (SEN), specificity (SPC) and F1-score (F1). These measures are defined by the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) achieved by the classifier. We also used the AUC, which is the area under the ROC curve that plots the sensitivity as a function of 1-specificity (He et al. 2013), and the MCC, which is the Matthews correlation coefficient.
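For reference, the threshold-based measures named above can be computed from the four confusion-matrix counts as in the following sketch, which uses their standard definitions. The function name and the counts in the example are hypothetical and are not results from this study; AUC is omitted because it requires the ranked prediction scores rather than the counts.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """ACC, SEN, SPC, F1 and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)  # fraction of positives recovered
    spc = tn / (tn + fp)  # fraction of negatives recovered
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SEN": sen, "SPC": spc, "F1": f1, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, tn=160, fp=40, fn=10)
print(metrics)
```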

RESULTS AND DISCUSSION

The supervised machine learning algorithms presented here are implemented in AmpClass as part of PepMultiTools. This web application allows users to search proteomes or proteins for peptides with user-defined physicochemical properties using PepMultifinder, and to predict whether each peptide in a given list is antimicrobial using the AmpClass tool. These tools are hosted at https://bit.ly/3oJwJtU.

Dataset selection

We combined the POS A dataset with NEG G, H and I, and the POS B with NEG J, K, and L datasets to determine which combination was the best to train the final predictor. The selection was made based on a Random Forest classifier’s performance over 5-fold cross-validation.

In this test, we also evaluated the convenience of adding to our dataset the second dataset described by Gabere & Noble (2017), which has 1713 positive peptides from APD3 and 8565 negative peptides extracted from UniProt. We named this dataset Gabere2. Table SII summarizes the training performance of the classifier trained with each combination of datasets, with and without the Gabere2 dataset (Gabere & Noble 2017).

The best classification result was achieved using POS A + Gabere2 and NEG G, based on the average F1 measure from the cross-validation (Table SIII). An important observation is that, in all cases, including peptides from the Gabere2 dataset helped to improve the classification performance in training. We hypothesize that this happens because the Gabere2 dataset provides a slightly different distribution of negative peptides not considered in our datasets. Concerning the positive peptides, we found that after filtering the Gabere2 dataset, few peptides remained that were not already included in our positive dataset. Therefore, the dataset chosen to train the predictive classifier models was the combination of POS A + Gabere2 and NEG G.

Model training

After selecting the peptide database, the models described in Subsection Classification were trained. The performance results of the models’ training, obtained using 5-fold cross-validation over the assembled dataset described in Subsection Dataset, are reported in Table II. Accordingly, the Random Forest algorithm achieves the best performance, followed by the Neural Network and XGBoost algorithms; however, the results of the three models are very close. Accuracy and F1-score are in the 81–88% range, and AUCs are mostly over 90%, except for the decision tree model. These results estimate how the prediction models would perform when analyzing new, unseen peptides.

Table II
Average classifiers’ performances over the training set using 5-fold cross-validation.

Comparison with state-of-the-art models

We compare our AmpClass prediction models (trained and tuned with the final dataset) with eight state-of-the-art machine learning methods for AMP recognition (Table III). These results correspond to the measures described in the Evaluation section, calculated over an independent validation dataset. The validation dataset, termed Gabere1, is described by Gabere & Noble (2017) and consists of 547 positive sequences extracted from the DAMPD database and 2735 negative sequences extracted from UniProt/Swiss-Prot (The UniProt Consortium 2019). Peptides in the validation dataset were filtered according to the process described in the Evaluation section; that is, peptides in the validation dataset have lengths between 7 and 35 amino acids. It is important to highlight that we did not use any of the peptides of the validation dataset during our models’ training, to avoid biased performance measures in this comparison.

Table III
Model comparison by using an independent dataset. The best results for each performance measure are highlighted in bold and underlined font, while the second-best results are highlighted in bold.

For predictors that returned a numerical value indicating the pseudo-probability of being antimicrobial, a threshold value of 0.5 was used to determine whether a peptide is an AMP. Also, special treatment was given to iAMPpred (Meher et al. 2017) because its prediction results indicate the probability of being antibacterial, antifungal and antiviral, but not the probability of being an AMP. Therefore, a peptide with a probability greater than 0.5 of being antibacterial, antifungal or antiviral is considered an AMP. Results for Deep-AmPEP30 and RF-AmPEP30 in Table III are calculated over the peptides with lengths ≤ 30 because this is the maximum peptide length accepted by those predictors.
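The two decision rules described above are simple enough to state as code. In this sketch the function names and example probabilities are hypothetical, and a strict "greater than" comparison is assumed for the generic predictors as well, which the text states explicitly only for iAMPpred.

```python
def is_amp(probability, threshold=0.5):
    """Binary call for predictors that return a single pseudo-probability."""
    return probability > threshold


def is_amp_from_activities(antibacterial, antifungal, antiviral, threshold=0.5):
    """Call a peptide an AMP if any of the three activity probabilities passes."""
    return max(antibacterial, antifungal, antiviral) > threshold


print(is_amp(0.73))
print(is_amp_from_activities(0.12, 0.64, 0.08))
```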

According to Table III, most models show good recognition in the validation dataset, with ACC, F1, and AUC ranging from 80 to 94% and MCC scores ranging from 0.59 to 0.80. Also, most models have a sensitivity over 90%, which means good performance in recognizing positive peptides, and a specificity over 80%, which indicates good performance in detecting negative peptides. AmPEP is the only model that deviates from these measures (see row 6 in Table III). The low values reported for AmPEP show a weak performance on about 70% of the validation dataset. The sensitivity of 99.26% indicates that this model correctly classifies most positive peptides. However, the specificity of 18.17% shows that the model fails to recognize the negative peptides, which suggests that the model is biased toward classifying most short peptides as positive.

Results in Table III also show that AmpClass XGBoost (one of our models) obtained the best values for ACC, F1, SPC and MCC, followed by AMP Scanner DNN, which has better sensitivity. RF-AmPEP30, which is not an AmpClass model, is the third-best model. The relatively high values for the metrics suggest that these models have strong recognition performance on around 90% of the peptides in the validation dataset. Based on the sensitivity, AmpClass XGBoost and AMP Scanner DNN make a wrong prediction for less than 5% of the positive peptides. On the other hand, the specificity shows that both models can recognize more than 92% of the negative peptides, which means that both models mistakenly label less than 8% of negative peptides as positives.

We also used McNemar’s test to compare the classification performance of AmpClass XGBoost and AMP Scanner DNN, because their performance measures have similar values. McNemar’s test is a well-known statistical test for analyzing the statistical significance of differences in classifier performance based on the observed proportions in the models’ contingency table (Dietterich 1998). Equation 1 shows the formula used to estimate the test value with a continuity correction in McNemar’s test.

\[ \chi^2 = \frac{\left( |n_{10} - n_{01}| - 1 \right)^2}{n_{10} + n_{01}} \tag{1} \]

where $n_{10}$ is the count of test instances that Classifier 1 got correct and Classifier 2 got incorrect, and $n_{01}$ is the count of test instances that Classifier 1 got incorrect and Classifier 2 got correct.

If the test value ($\chi^2$) is greater than 3.84, the critical value of the chi-square distribution with one degree of freedom at the 95% confidence level (α = 0.05), the null hypothesis is rejected, which means that the two models’ performance is statistically different. Otherwise, there is no evidence to reject the null hypothesis, that is, no evidence of a difference in the models’ performance. Applying McNemar’s test to AmpClass XGBoost and AMP Scanner DNN, we obtained $\chi^2$ = 0.0 and a p-value of 1.0. With these results, we cannot reject the null hypothesis, which indicates no statistically significant difference between the two models on the validation dataset. McNemar’s test was also applied to AmpClass XGBoost and RF-AmPEP30. In this case, we obtained $\chi^2$ = 0.952 and a p-value of 0.329, which also shows no statistical difference between AmpClass XGBoost and RF-AmPEP30. Accordingly, the performance of AmpClass XGBoost, AMP Scanner DNN, and RF-AmPEP30 is statistically indistinguishable. These results show that two different approaches reach statistically the same performance, even though AmpClass XGBoost is built with simple supervised machine learning models and the others are built over deep learning architectures.
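The test in Equation 1 is straightforward to compute. The sketch below is an illustration rather than the code used in the study: the function names and the disagreement counts are hypothetical, and the p-value relies on the identity that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt


def chi2_sf_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))


def mcnemar(n10, n01):
    """Continuity-corrected McNemar statistic (Equation 1) and its p-value."""
    if n10 + n01 == 0:
        return 0.0, 1.0  # the classifiers never disagree
    chi2 = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    return chi2, chi2_sf_1df(chi2)


# Hypothetical disagreement counts between two classifiers.
chi2, p_value = mcnemar(n10=15, n01=9)
reject_null = chi2 > 3.84
print(chi2, p_value, reject_null)
```

With the same helper, a statistic of 0.952 gives a p-value of about 0.329 and a statistic of 0.0 gives 1.0, which agrees with the values reported above.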

In vitro validation

The experimental validation of AmpClass is presented in Monsalve et al. (2024). The results indicate that, of the nine peptides best classified by AmpClass, six were active against three bacterial species with minimum inhibitory concentrations (MIC) under 10 µM, and seven were active against at least one bacterial species with MIC under 10 µM.

CONCLUSIONS

Although several AMP prediction models are available, to the best of our knowledge, this study and the one carried out by Yan et al. (2020) have developed tools for the successful and consistent recognition of short AMPs. As discussed previously, short AMPs have advantages over long ones, so an activity predictor specifically trained to recognize them gives a higher chance of identifying peptides with more potential in the journey from lab to shelf. Sourcing non-AMPs from either the RefSeq or the Swiss-Prot database did not noticeably affect model performance after training. This leads us to believe that both databases have the same quality of annotations for non-AMPs; in other words, the chance of finding a false negative seems to be the same in both.

In this work we presented AmpClass, a tool that uses five supervised machine learning algorithms to recognize short AMPs. After parameter tuning, AmpClass XGBoost achieved the highest performance among these models, also when compared with existing state-of-the-art methods. In addition, all models showed better or similar accuracy compared with traditional models such as CAMPR3, iAMPpred, iAMPL2 or DBAASP. McNemar’s test also showed that one of our approaches (AmpClass XGBoost), based on traditional supervised machine learning models, achieves the same performance as AMP Scanner DNN and RF-AmPEP30, which are based on deep learning. A key observation is that including the Gabere2 dataset in training improves the overall classification performance because this dataset lets the classification models increase the number of non-AMPs correctly recognized. The experimental validation of AmpClass is presented in the sister paper (Monsalve et al. 2024).

SUPPLEMENTARY MATERIAL

Tables SI-SIII.

ACKNOWLEDGMENTS

The authors thank the members of the Prospecting and Biomolecules Design Laboratory of the Universidad Nacional de Colombia, as well as Geraldine Duque for her help in the compilation of an initial AMP dataset and Alberto Ceballos for his help with some of the experiments and discussions done in this work. This work was supported by Universidad Nacional de Colombia, sede Medellín, Hermes research grant numbers 61164, 49011 and 55740, and by the Instituto Tecnológico Metropolitano (ITM), research grant number PE20202. The authors have no competing interests to declare that are relevant to the content of this article.

REFERENCES

  • AGRAWAL P, BHALLA S, CHAUDHARY K, KUMAR R, SHARMA M & RAGHAVA GPS. 2018. In silico approach for prediction of antifungal peptides. Front Microbiol 9: 323. https://doi.org/10.3389/fmicb.2018.00323.
  • ASGARI E & MOFRAD MR. 2015. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10: 1-15. https://doi.org/10.1371/journal.pone.0141287.
  • BENJELLOUN G, BENNANI L, BERRADA S, MOUSSA B & BENNANI B. 2020. Prevalence and antibiotic resistance profiles of Staphylococcus sp. isolated from food, food contact surfaces and food handlers in a Moroccan hospital kitchen. Let Appl Microbiol 70: 241-251. https://doi.org/10.1111/lam.13278.
  • BHADRA P, YAN J, LI J, FONG S & SIU SW. 2018. AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 8: 1-10. https://doi.org/10.1038/s41598-018-19752-w.
  • BROWNLEE J. 2020. A gentle introduction to XGBoost for applied machine learning. https://bit.ly/3zuFhNm
  • CAO DS, XU QS & LIANG YZ. 2013. Propy: A tool to generate various modes of Chou’s PseAAC. Bioinformatics 29: 960-962. https://doi.org/10.1093/bioinformatics/btt072.
  • CHANDRASHEKAR G & SAHIN F. 2014. A survey on feature selection methods. Comput Electrl Eng 40: 16-28. https://doi.org/10.1016/j.compeleceng.2013.11.024.
  • CHEN Z ET AL. 2020. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 21: 1047-1057. https://doi.org/10.1093/bib/bbz041.
  • CHOU KC. 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 43: 246-255. https://doi.org/10.1002/prot.1035.
  • COCK P ET AL. 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25: 1422-1423. https://doi.org/10.1093/bioinformatics/btp163.
  • D’COSTA VM ET AL. 2011. Antibiotic resistance is ancient. Nature 477: 457-461. https://doi.org/10.1038/nature10388.
  • DIETTERICH TG. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10: 1895-1923. https://doi.org/10.1162/089976698300017197.
  • DI LUCA M, MACCARI G, MAISETTA G & BATONI G. 2015. BaAMPs: The database of biofilm-active antimicrobial peptides. Biofouling 31: 193-199. https://doi.org/10.1080/08927014.2015.1021340.
  • DUDA RO, HART PE & STORK DG. 2000. Pattern Classification. Wiley-Interscience. Canada, 2nd edition.
  • FAIR R & TOR Y. 2014. Bacterial resistance in the 21st century. Perspect Medicinal Chem 6: 25-64. https://doi.org/10.4137/PMC.S14459.
  • FRANZESE M & IULIANO A. 2019. Correlation analysis. In: Encyclopedia of Bioinformatics and Computational Biology, p. 706-721. https://doi.org/10.1016/B978-0-12-809633-8.20358-0
  • FU L, NIU B, ZHU Z, WU S & LI W. 2012. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150-3152. https://doi.org/10.1093/bioinformatics/bts565.
  • GABERE MN & NOBLE WS. 2017. Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics 33: 1921-1929. https://doi.org/10.1093/bioinformatics/btx081.
  • GAUTAM A ET AL. 2016. Development of antimicrobial peptide prediction tool for aquaculture industries. Probiotics Antimicrob Proteins 8: 141-149. https://doi.org/10.1007/s12602-016-9215-0.
  • GÓMEZ EA, GIRALDO P & ORDUZ S. 2017. InverPep: A database of invertebrate antimicrobial peptides. J Glob Antimicrob Resist 8: 13-17. https://doi.org/10.1016/j.jgar.2016.10.003.
  • HAMMAMI R, BEN HAMIDA J, VERGOTEN G & FLISS I. 2009. PhytAMP: A database dedicated to antimicrobial plant peptides. Nucleic Acids Res 37: 963-968. https://doi.org/10.1093/nar/gkn655.
  • HAMMAMI R, ZOUHIR A, LE LAY C, BEN HAMIDA J & FLISS I. 2010. BACTIBASE second release: A database and tool platform for bacteriocin characterization. BMC Microbiol 10: https://doi.org/10.1186/1471-2180-10-22.
  • HANCOCK RE & SAHL HG. 2006. Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat Biotechnol 24: 1551-1557. https://doi.org/10.1038/nbt1267.
  • HE B, ZHANG B & FU Y. 2013. Discovery of proteomics based on machine learning. Preprint at https://arxiv.org/ftp/arxiv/papers/1312/1312.1025.pdf
  • HINCAPIÉ O, GIRALDO P & ORDUZ S. 2018. In silico design of polycationic antimicrobial peptides active against Pseudomonas aeruginosa and Staphylococcus aureus. Antonie van Leeuwenhoek J Microbiol 111: 1871-1882. https://doi.org/10.1007/s10482-018-1080-2.
  • JOSEPH S, KARNIK S, NILAWE P, JAYARAMAN VK & IDICULA-THOMAS S. 2012. ClassAMP: A prediction tool for classification of antimicrobial peptides. IEEE/ACM Trans Comput Biol Bioinform 9: 1535-1538. https://doi.org/10.1109/tcbb.2012.89.
  • KANG X, DONG F, SHI C, LIU S, SUN J, CHEN J, LI H, XU H, LAO X & ZHENG H. 2019. DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci Data 6: 148. https://doi.org/10.1038/s41597-019-0154-y.
  • KANS J. 2013. Entrez Direct: E-utilities on the UNIX Command Line. Technical report. National Center for Biotechnology Information, National Library of Medicine, Bethesda (MD). http://www.ncbi.nlm.nih.gov/books/NBK25501/
  • KAUSHIK A, MEHMOOD A, PENG S, ZHANG YJ, DAI X & WEI DQ. 2021. A-CaMP: a tool for anti-cancer and antimicrobial peptide generation. J Biomol Struct Dyn 39: 285-293. https://doi.org/10.1080/07391102.2019.1708796.
  • KUMAR V, AGRAWAL P, KUMAR R, BHALLA S, USMANI SS, VARSHNEY GC & RAGHAVA GP. 2018. Prediction of cell-penetrating potential of modified peptides containing natural and chemically modified residues. Front Microbiol 9: 1-10. https://doi.org/10.3389/fmicb.2018.00725.
  • LAZZARO BP, ZASLOFF M & ROLFF J. 2020. Antimicrobial peptides: Application informed by evolution. Science 368: 1-7. https://doi.org/10.1126/science.aau5480.
  • LATA S, MISHRA NK & RAGHAVA GP. 2010. AntiBP2: Improved version of antibacterial peptide prediction. BMC Bioinformatics 11: 1-7. https://doi.org/10.1186/1471-2105-11-S1-S19.
  • LATA S, SHARMA BK & RAGHAVA GP. 2007. Analysis and prediction of antibacterial peptides. BMC Bioinformatics 8: 263. https://doi.org/10.1186/1471-2105-8-263.
  • LEE EY, WONG GC & FERGUSON AL. 2018. Machine learning-enabled discovery and design of membrane active peptides. Bioorg Med Chem 26: 2708-2718. https://doi.org/10.1016/j.bmc.2017.07.012.
  • LEE HT, LEE CC, YANG JR, LAI JZ, CHANG KY & RAY O. 2015. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015: 475062. https://doi.org/10.1155/2015/475062.
  • LI J, QU X, HE X, DUAN L, WU G, BI D, DENG Z, LIU W & OU HY. 2012. ThioFinder: A web-based tool for the identification of thiopeptide gene clusters in DNA sequences. PLoS ONE 7: 1-9. https://doi.org/10.1371/journal.pone.0045878.
  • LI J, CHENG K, WANG S, MORSTATTER F, TREVINO RP, TANG J & LIU H. 2017. Feature selection: a data perspective. ACM Comput Surv 50: 94:1-94:45. https://doi.org/10.1145/3136625.
  • LI Q, ZHANG C, CHEN H, XUE J, GUO X, LIANG M & CHEN M. 2018. BioPepDB: an integrated data platform for food-derived bioactive peptides. Int J Food Sci Nutr 69: 963-968. https://doi.org/10.1080/09637486.2018.1446916.
  • LU XM, LU PZ & LIU XP. 2020. Fate and abundance of antibiotic resistance genes on microplastics in facility vegetable soil. Sci Total Environ 709: 1-9. https://doi.org/10.1016/j.scitotenv.2019.136276.
  • MANAVALAN B, BASITH S, SHIN TH, CHOI S, KIM MO & LEE G. 2017. MLACP: Machine-learning-based prediction of anticancer peptides. Oncotarget 8: 77121-77136. https://doi.org/10.18632/oncotarget.20365.
  • MARX V. 2005. Watching peptide drugs grow up. Chem Eng News 83: 17-24. https://doi.org/10.1021/cen-v083n011.p017.
  • MEHER PK, SAHU TK, SAINI V & RAO AR. 2017. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 7: 1-12. https://doi.org/10.1038/srep42362.
  • MEHTA D ET AL. 2014. ParaPep: A Web resource for experimentally validated antiparasitic peptide sequences and their structures. Database 2014: 1-7. https://doi.org/10.1093/database/bau051.
  • MINKIEWICZ P, DZIUBA J, IWANIAK A, DZIUBA M & DAREWICZ M. 2008. BIOPEP database and other programs for processing bioactive peptide sequences. J AOAC Int 91: 965-980.
  • MONSALVE D, MESA A, MIRA LM, MERA C, ORDUZ S & BRANCH-BEDOYA JW. 2024. Antimicrobial peptides designed by computational analysis of proteomes. Antonie van Leeuwenhoek 117: 1-14. https://doi.org/10.1007/s10482-024-01946-0.
  • MULANI MS, KAMBLE EE, KUMKAR SN, TAWRE MS & PARDESI KR. 2019. Emerging strategies to combat ESKAPE pathogens in the era of antimicrobial resistance: a review. Front Microbiol 10: 1-24. https://doi.org/10.3389/fmicb.2019.00539.
  • MÜLLER AT, GABERNET G, HISS JA & SCHNEIDER G. 2017. modlAMP: Python for antimicrobial peptides. Bioinformatics 33: 2753-2755. https://doi.org/10.1093/bioinformatics/btx285.
  • MÜLLER AT, HISS JA & SCHNEIDER G. 2018. Recurrent neural network model for constructive peptide design. J Chem Inf Model 58: 472-479. https://doi.org/10.1021/acs.jcim.7b00414.
  • NG XY, ROSDI BA & SHAHRUDIN S. 2015. Prediction of antimicrobial peptides based on sequence alignment and support vector machine-pairwise algorithm utilizing LZ-complexity. Biomed Res Int 2015: 212715. https://doi.org/10.1155/2015/212715.
  • NOVKOVIĆ M, SIMUNIĆ J, BOJOVIĆ V, TOSSI A & JURETIĆ D. 2012. DADP: The database of anuran defense peptides. Bioinformatics 28: 1406-1407. https://doi.org/10.1093/bioinformatics/bts141.
  • O’LEARY NA ET AL. 2016. Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44: D733-D745. https://doi.org/10.1093/nar/gkv1189.
  • OLIVER JP, GOOCH CA, LANSING S, SCHUELER J, HURST JJ, SASSOUBRE L, CROSSETTE EM & AGA DS. 2020. Fate of antibiotic residues, antibiotic-resistant bacteria, and antibiotic resistance genes in US dairy manure management systems. J Dairy Sci 103: 1051-1071. https://doi.org/10.3168/jds.2019-16778.
  • PIOTTO SP, SESSA L, CONCILIO S & IANNELLI P. 2012. YADAMP: Yet another database of antimicrobial peptides. Int J Antimicrob Agents 39: 346-351. https://doi.org/10.1016/j.ijantimicag.2011.12.003.
  • PIRTSKHALAVA M ET AL. 2016. DBAASP v.2: An enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides. Nucleic Acids Res 44: D1104-D1112. https://doi.org/10.1093/nar/gkv1174.
  • QURESHI A, THAKUR N & KUMAR M. 2013. HIPdb: A database of experimentally validated HIV inhibiting peptides. PLoS ONE 8. https://doi.org/10.1371/journal.pone.0054908.
  • QURESHI A, THAKUR N, TANDON H & KUMAR M. 2014. AVPdb: A database of experimentally validated antiviral peptides targeting medically important viruses. Nucleic Acids Res 42: 1147-1153. https://doi.org/10.1093/nar/gkt1191.
  • RAJU VNG, LAKSHMI KP, JAIN VM, KALIDINDI A & PADMA V. 2020. Study the influence of normalization/transformation process on the accuracy of supervised classification. In: Third International Conference on Smart Systems and Inventive Technology, p. 729-735. https://doi.org/10.1109/ICSSIT48917.2020.9214160.
  • SCHNEIDER P, MÜLLER AT, GABERNET G, BUTTON AL, POSSELT G, WESSLER S, HISS JA & SCHNEIDER G. 2017. Hybrid network model for “Deep Learning” of chemical data: Application to antimicrobial peptides. Mol Inform 36: 1-8. https://doi.org/10.1002/minf.201600011.
  • SINGH S, CHAUDHARY K, DHANDA SK, BHALLA S, USMANI SS, GAUTAM A, TUKNAIT A, AGRAWAL P, MATHUR D & RAGHAVA GP. 2015. SATPdb: A database of structurally annotated therapeutic peptides. Nucleic Acids Res 44: D1119-D1126. https://doi.org/10.1093/nar/gkv1114.
  • TACCONELLI E & MAGRINI N. 2017. Global priority list of antibiotic-resistant bacteria to guide research, discovery, and development of new antibiotics. World Health Organization, Essential medicines and health products, p. 1-7.
  • TANG J, ALELYANI S & LIU H. 2020. Feature selection for classification: a review. Technical report. College of Computing, Georgia Tech.
  • THE UNIPROT CONSORTIUM. 2019. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res 47: D506-D515. https://doi.org/10.1093/nar/gky1049.
  • THÉOLIER J, FLISS I, JEAN J & HAMMAMI R. 2014. MilkAMP: A comprehensive database of antimicrobial peptides of dairy origin. Dairy Sci Technol 94: 181-193. https://doi.org/10.1007/s13594-013-0153-2.
  • THOMAS S, KARNIK S, BARAI RS, JAYARAMAN VK & IDICULA-THOMAS S. 2009. CAMP: A useful resource for research on antimicrobial peptides. Nucleic Acids Res 38: 774-780. https://doi.org/10.1093/nar/gkp1021.
  • TORRES MD, SOTHISELVAM S, LU TK & DE LA FUENTE-NUNEZ C. 2019. Peptide design principles for antimicrobial applications. J Mol Biol 431: 3547-3567. https://doi.org/10.1016/j.jmb.2018.12.015.
  • TYAGI A, KAPOOR P, KUMAR R, CHAUDHARY K, GAUTAM A & RAGHAVA GP. 2013. In silico models for designing and discovering novel anticancer peptides. Sci Rep 3: 1-8. https://doi.org/10.1038/srep02984.
  • TYAGI A, TUKNAIT A, ANAND P, GUPTA S, SHARMA M, MATHUR D, JOSHI A, SINGH S, GAUTAM A & RAGHAVA GP. 2015. CancerPPD: A database of anticancer peptides and proteins. Nucleic Acids Res 43: D837-D843. https://doi.org/10.1093/nar/gku892.
  • VAN DEN MEERSCHE T, RASSCHAERT G, VANDEN NEST T, HAESEBROUCK F, HERMAN L, VAN COILLIE E, VAN WEYENBERG S, DAESELEIRE E & HEYNDRICKX M. 2020. Longitudinal screening of antibiotic residues, antibiotic resistance genes and zoonotic bacteria in soils fertilized with pig manure. Environ Sci Pollut Res 27: 28016-28029. https://doi.org/10.1007/s11356-020-09119-y.
  • VELTRI D, KAMATH U & SHEHU A. 2017. Improving recognition of antimicrobial peptides and target selectivity through machine learning and genetic programming. IEEE/ACM Trans Comput Biol Bioinform 14: 300-313. https://doi.org/10.1109/tcbb.2015.2462364.
  • VELTRI D, KAMATH U & SHEHU A. 2018. Deep learning improves antimicrobial peptide recognition. Bioinformatics 34: 2740-2747. https://doi.org/10.1093/bioinformatics/bty179.
  • VISHNEPOLSKY B, GABRIELIAN A, ROSENTHAL A, HURT DE, TARTAKOVSKY M, MANAGADZE G, GRIGOLAVA M, MAKHATADZE GI & PIRTSKHALAVA M. 2018. Predictive model of linear antimicrobial peptides active against Gram-negative bacteria. J Chem Inf Model 58: 1141-1151. https://doi.org/10.1021/acs.jcim.8b00118.
  • WAGHU FH, GOPI L, BARAI RS, RAMTEKE P, NIZAMI B & IDICULA-THOMAS S. 2014. CAMP: Collection of sequences and structures of antimicrobial peptides. Nucleic Acids Res 42: 1154-1158. https://doi.org/10.1093/nar/gkt1157.
  • WAGHU FH, BARAI RS, GURUNG P & IDICULA-THOMAS S. 2016. CAMPR3: A database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 44: D1094-D1097. https://doi.org/10.1093/nar/gkv1051.
  • WAGHU FH & IDICULA-THOMAS S. 2020. Collection of antimicrobial peptides database and its derivatives: Applications and beyond. Protein Sci 29: 36-42. https://doi.org/10.1002/pro.3714.
  • WANG G, LI X & WANG Z. 2016. APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 44: D1087-D1093. https://doi.org/10.1093/nar/gkv1278.
  • WANG J, YIN T, XIAO X, HE D, XUE Z, JIANG X & WANG Y. 2018. StraPep: A structure database of bioactive peptides. Database 2018: 1-7. https://doi.org/10.1093/database/bay038.
  • WALSH CT & WENCEWICZ TA. 2014. Prospects for new antibiotics: A molecule-centered perspective. J Antibiot 67: 7-22. https://doi.org/10.1038/ja.2013.49.
  • WORLD HEALTH ORGANIZATION. 2020. The top 10 causes of death. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
  • WORLD HEALTH ORGANIZATION. 2021. Antimicrobial resistance. https://www.who.int/en/news-room/fact-sheets/detail/antimicrobial-resistance
  • WU H, LU H, HUANG J, LI G & HUANG Q. 2012. EnzyBase: A novel database for enzybiotic studies. BMC Microbiol 12. https://doi.org/10.1186/1471-2180-12-54.
  • WU Q, KE H, LI D, WANG Q, ZHOU J & FANG J. 2019. Recent progress in machine learning-based prediction of peptide activity for drug discovery. Curr Top Med Chem 19: 4-16. https://doi.org/10.2174/1568026619666190122151634.
  • XIAO X, WANG P, LIN WZ, JIA JH & CHOU KC. 2013. IAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 436: 168-177. https://doi.org/10.1016/j.ab.2013.01.019.
  • YAN J, BHADRA P, LI A, SETHIYA P, QIN L, TAI H, WONG K & SIU S. 2020. Deep-AmPEP30: Improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucl Acids 20: 882-894. https://doi.org/10.1016/j.omtn.2020.05.006.
  • YAP CW. 2011. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32: 1466-1474. https://doi.org/10.1002/jcc.21707.
  • YI HC, YOU ZH, ZHOU X, CHENG L, LI X, JIANG TH & CHEN ZH. 2019. ACP-DL: A deep learning long short-term memory model to predict anticancer peptides using high-efficiency feature representation. Mol Ther Nucl Acids 17: 1-9. https://doi.org/10.1016/j.omtn.2019.04.025.
  • YU L & LIU H. 2003. Feature selection for high-dimensional data: A fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning, p. 856-863.
  • ZAMYATNIN AA. 2006. The EROP-Moscow oligopeptide database. Nucleic Acids Res 34: D261-D266. https://doi.org/10.1093/nar/gkj008.
  • ZHAO X, WU H, LU H, LI G & HUANG Q. 2013. LAMP: A database linking antimicrobial peptides. PLoS ONE 8: 6-11. https://doi.org/10.1371/journal.pone.0066557.
  • ZUO Y, LV Y, WEI Z, YANG L, LI G & FAN G. 2015. IDPF-PseRAAAC: A web-server for identifying the defensin peptide family and subfamily using pseudo reduced amino acid alphabet composition. PLoS ONE 10: 1-13. https://doi.org/10.1371/journal.pone.0145541.

Publication Dates

  • Publication in this collection
    04 Oct 2024
  • Date of issue
    2024

History

  • Received
    04 Jul 2023
  • Accepted
    07 Apr 2024
Academia Brasileira de Ciências, Rua Anfilófio de Carvalho, 29, 3º andar, 20030-060 Rio de Janeiro, RJ, Brazil. Tel: +55 21 3907-8100
E-mail: aabc@abc.org.br