Abstract
We describe a new neural-network method for clustering proteins into families. The network is trained with the Kohonen unsupervised learning algorithm, using matrix pattern representations of the protein sequences as inputs. The components (x, y) of these 20×20 matrix patterns are the normalized frequencies of all pairs xy of amino acids in each sequence. We investigate the influence of different learning parameters on the final topological maps obtained with a learning set of ten proteins belonging to three established families. In all cases, except those where the synaptic vectors remain nearly unchanged during learning, the ten proteins are correctly classified into the expected families. We also analyse how the trained network classifies mutated or incomplete sequences of the learned proteins. The network gives a correct classification for a sequence mutated in 21.5%±7% of its amino acids and for fragments representing 7.5%±3% of the original sequence. Similar results were obtained with a learning set of 32 proteins belonging to 15 families. These results show that a neural network can be trained with the Kohonen algorithm to obtain topological maps of protein sequences, in which related proteins end up associated with the same winner neuron or with neighboring ones, and that the trained network can be applied to rapidly classify new sequences. This approach opens new possibilities for finding rapid and efficient algorithms to organize the whole protein database and search it for homologies.
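The pipeline summarized above (pair-frequency pattern, Kohonen training, lookup of the winner neuron) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: pairs are taken to be adjacent residues, patterns are scaled to unit Euclidean length, and the grid size, linear decay schedule, and square neighborhood are arbitrary choices. The names `bipeptide_pattern`, `winner`, and `train_som` are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def bipeptide_pattern(sequence):
    """Return the 20x20 pair-frequency matrix, flattened to 400 values.

    Component (x, y) counts how often residue x is immediately followed by
    residue y (adjacency is an assumption). Counts are scaled to unit
    Euclidean length (also an assumed normalization).
    """
    counts = [0.0] * 400
    for x, y in zip(sequence, sequence[1:]):
        if x in INDEX and y in INDEX:  # skip non-standard symbols
            counts[INDEX[x] * 20 + INDEX[y]] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


def winner(weights, pattern):
    """Grid position of the neuron whose synaptic vector is closest."""
    best, best_dist = None, float("inf")
    for pos, w in weights.items():
        dist = sum((wi - pi) ** 2 for wi, pi in zip(w, pattern))
        if dist < best_dist:
            best, best_dist = pos, dist
    return best


def train_som(patterns, rows=3, cols=3, epochs=50, alpha0=0.5, radius0=1, seed=0):
    """Train a small Kohonen map; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs      # linear decay (assumed schedule)
        alpha = alpha0 * frac            # learning rate shrinks over time
        radius = round(radius0 * frac)   # neighborhood shrinks over time
        for p in patterns:
            wr, wc = winner(weights, p)
            for (r, c), w in weights.items():
                # Move the winner and its square neighborhood toward the input.
                if max(abs(r - wr), abs(c - wc)) <= radius:
                    for i in range(dim):
                        w[i] += alpha * (p[i] - w[i])
    return weights


# Toy usage with made-up sequences (not real proteins).
toy = ["ACACACACAC", "ACACACAAAC", "WYWYWYWYWY"]
som = train_som([bipeptide_pattern(s) for s in toy])
positions = [winner(som, bipeptide_pattern(s)) for s in toy]
```

Classifying a new sequence then amounts to computing its pattern and calling `winner` on the trained map. Because the pattern depends only on pair frequencies, no sequence alignment is needed at lookup time.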
Cite this article
Ferrán, E.A., Ferrara, P. Topological maps of protein sequences. Biol. Cybern. 65, 451–458 (1991). https://doi.org/10.1007/BF00204658
DOI: https://doi.org/10.1007/BF00204658