\documentclass[12pt]{report}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{geometry}
\usepackage{amsmath}
\usepackage{amssymb}
\geometry{margin=1in}
\title{Integration and Hybridization in Neural Network Modelling}
\author{Wesley Royce Elsberry}
\date{August 1989}
\begin{document}
\begin{titlepage}
\centering
{\Large Integration and Hybridization in Neural Network Modelling\par}
\vspace{1.5cm}
{\large Wesley Royce Elsberry\par}
\vspace{1cm}
Presented to the Faculty of the Graduate School of\par
The University of Texas at Arlington in Partial Fulfillment\par
of the Requirements for the Degree of\par
\vspace{0.5cm}
{\large Master of Science in Computer Science\par}
\vfill
The University of Texas at Arlington\par
August 1989\par
\end{titlepage}

\chapter*{Acknowledgements}
I wish to thank the many people who have made my graduate program a rewarding, enlightening, and interesting experience. My especial thanks to Bob Weems, Ken Youngers, Vijay Raj, Steve Hufnagel, and Farhad Kamangar for their exemplary instruction. The advice and encouragement of Bill Buckles was critically important to this refugee from the life sciences. The members of my graduate committee, Karan Briggs, Lynn Peterson, and Daniel Levine, have provided me with technical resources, instruction, referrals, and general good advice in plenty. Sam Leven provided the example problem description, the classical sequence data, and much of his own time and expertise to aid me in developing both the simulation program and my understanding of the field. I am indebted to Dr. Levine for his great interest in teaching the principles of cognitive modelling to as wide an audience as possible, for without his express encouragement I would not have become acquainted with the field, and also for his considerable personal assistance in developing this thesis. The enthusiasm of Harold Szu helped to motivate me to undertake a deeper inquiry into neural network modelling. Finally, without the continual support of my spouse, Diane Blackwood, this thesis and the classroom work which formed the basis for it would not have been possible.

July 28, 1989

\chapter*{Abstract}
Artificial neural network models derived from different biological behaviors or functions can be used in an integrative fashion to create an extensive problem-solving environment. An example problem of limited melodic composition is approached by the use of Hopfield-Tank, back-propagation, and Adaptive Resonance Theory networks serving as plausible next note generator, musical sequence critic, and novelty detector, respectively. Biological bases for integrative function are discussed, and the experimental role of synthetic systems such as the example integrated network is explored.

\tableofcontents

\chapter{Defining The Role And Nature Of Artificial Neural Network Modelling}

\section{Cultural Bias and Its Application to Cognitive Inquiry}

Plato believed that there existed a pure world of ideas, whose perfect forms were poorly mimicked by copies in the reality apparent to our senses. This concept of the dominant, lofty nature of ideas and the thoughts that manipulated those ideas would do more to inhibit inquiry into material processes than virtually any other single cause in the succeeding two millennia. The legacy of this philosophical outlook still permeates our culture, coloring the basic assumptions and viewpoints of researchers even after extended scientific training.
The firmness of general belief in the separate existence of ideas contributed to the Lamarckian hypothesis of the inheritance of acquired characters. While there were critics of Jean Baptiste Lamarck's work from its introduction, they were by no means an overwhelming group in numbers. The rediscovery of Gregor Mendel's work in 1900 provided clear and convincing evidence against Lamarck's hypothesis, yet even Watson and Crick's elucidation and characterization of DNA failed to dispel the last vestiges of belief in the inheritance of acquired characters. The ready and continued acceptance of the Lamarckian hypothesis into the mid-twentieth century gives an indication of the continued influence of Plato's world of ideas, even when confronted with contradicting evidence and convincing counterarguments. Biology, however, was by no means the only scientific discipline to be touched by the cultural bias of Platonic ideals. Physics experienced great upheaval and internal dissent as the deterministic view of Newton and Laplace gave way to the quirkiness and Heisenbergian uncertainty of quantum mechanics (Hawking 1988).

In thinking about thinking, researchers have tended to denigrate approaches dependent upon investigation of physical processes. The high regard given ideas and thought processes has nearly always provided Western observers with a sense of revulsion at even considering that such sacred items could be solely the product of soft, squishy brain tissues and their component parts, neurons. This could be considered somewhat akin to Roquentin's nausea upon consideration of radical contingency (Ferguson 1987). Our culture does not encourage modes of inquiry that tend to displace the traditional view of the mind as the only known organ with which the perfect world of ideas can be perceived. It especially does not encourage the denial of the Platonic ideal world, yet it has been quite some time since any reputable scientist has explicitly advanced support for that concept. Cognitive science, however, must deal explicitly and forthrightly with the subject of ideas and thoughts. In a field where cultural bias has, can, and will directly affect the discussion of the subject at hand, it pays to recognize the existence and probable magnitude of that bias. I have briefly delineated the persistence and magnitude of our Platonic bias in biology and physics; I will assert that the bias is greater in cognitive science, where it may be more directly challenged.

\section{Artificial Intelligence: Definitions}

Artificial intelligence describes at once a sub-discipline of cognitive science and the goal of that sub-discipline. The goal is the description and production of an artificial system or systems that can be described as intelligent in operation, function, or effect, in both global and local contexts. As with most complex fields, there are several ways in which to approach direct inquiry into the topic. In what has been termed the ``top-down modelling school'' of artificial intelligence, the emphasis has been upon systems of formal logic and other explicit symbol manipulation techniques. As may be apparent from the previous description, there is also a ``bottom-up modelling school'' of artificial intelligence. This school of inquiry seeks to examine the basic processes that are known to produce intelligent action in humans, the basic processes of neural function and coordination. By extension, the bottom-up modelling school is concerned with neural function in general.
This is considered necessary because of the great complexity of biological neural systems. While much research has been done and continues to be done, there exists no basic understanding of the detailed structure and operation of biological neural systems. There is a wealth of data, but a paucity of organizing principles. The bottom-up modelling school attempts to provide possible organizing principles, testing these through a process of modelling and incremental design improvements. This area of research has become known as artificial neural network modelling.

The definitions given above differ somewhat from what may be considered standard in the artificial intelligence community. In their Turing Award lecture, Newell and Simon state:

\begin{quote}
The notion of physical symbol system had taken essentially its present form by the middle of the 1950's, and one can date from that time the growth of artificial intelligence as a coherent subfield of computer science. The twenty years of work since then has seen a continuous accumulation of empirical evidence of two main varieties. The first addresses itself to the sufficiency of physical symbol systems for producing intelligence, attempting to construct and test specific systems that have such a capability. The second kind of evidence addresses itself to the necessity of having a physical symbol system wherever intelligence is exhibited. It starts with Man, the intelligent system best known to us, and attempts to discover whether his cognitive activity can be explained as the working of a physical symbol system. \ldots\ The first is generally called artificial intelligence, the second, research in cognitive psychology. (Newell and Simon 1976)
\end{quote}

This formulation of definitions seems at once too narrow and not properly descriptive. It is too narrow in that it limits severely the range of activities which may be considered artificial intelligence research. While Newell and Simon talk of physical symbol systems, there is the strong implication that they refer only to computational methods of explicit symbol manipulation, as in the use of languages such as LISP and Prolog. The definition of the second part is too narrow in that it postulates no overlap between the fields of artificial intelligence and cognitive psychology, and also provides no linkage by which concepts of operation derived from empirical studies of biological intelligent systems may be incorporated into an artificial intelligence framework. As stated later by Newell and Simon,

\begin{quote}
The symbol system hypothesis implies that the symbolic behavior of man arises because he has the characteristics of a physical symbol system. Hence, the results of efforts to model human behavior with symbol systems become an important part of the evidence for the hypothesis, and research in artificial intelligence goes on in close collaboration with research in information processing psychology, as it is usually called.
\end{quote}

This seems to indicate that the previously given definitions were not truly descriptive of the relationship between artificial intelligence and other components of cognitive science. By including artificial intelligence as a sub-discipline of cognitive science, it becomes clear that artificial intelligence is not disjoint from research into considerations of what constitutes intelligence.
An alternative formulation for defining artificial intelligence in more general terms is given by Charniak and McDermott (1985), where they state that artificial intelligence ``is the study of mental faculties through the use of computational models.'' This form of definition allows the top-down and bottom-up approaches to artificial intelligence to be given co-equal rank as complementary research disciplines.

\section{Artificial Neural Network Modelling}

Artificial neural network modelling, the bottom-up school of artificial intelligence, derives from the use of biological nervous systems as suitable exemplars for an approach to coordination, cognition, and control in artificial systems. This approach, in one form or another, has been with us for a long time. In the 1940's, McCulloch and Pitts demonstrated the possibility of casting boolean logic systems into networks of thresholded logic units. This certainly fits the Newell-Simon definition of a physical symbol system. McCulloch and Pitts' seminal paper and subsequent work introduced a new concept for consideration: rather than driving the understanding of biological cognitive processes through increasing sophistication of technology, perhaps technology can be made more capable and sophisticated by elucidating mechanisms of function from biological processes involved in cognition (Levine 1983, 1990).

This differs qualitatively from the viewpoint common to many observers of human intelligence. There has been a tendency for people to compare the operation of thought processes with the current leading edge of technology. Freud compared the mind to a steam engine, Twain would speak of ``mill-works'' in relation to speech production, and various persons in this century have blithely and confidently settled upon the computer as a fully analogous system. A circularity can be noted here, though, as the image of the mind as being like some mechanism helped contribute to such endeavors as Pascal's calculator, Babbage's Analytical Engine, and the Hollerith punched-card census tabulator. These efforts, in turn, inspired the modern computer. (The influence of the two World Wars upon the motivation for the development of the computer must also be acknowledged.) However, as the Metroplex Study Group on Computational Neuroscience (1988) points out,

\begin{quote}
Computational study of brain structures began in the 1940's when digital computers were emerging. The first computer architectures were developed in a collaboration between the mathematicians John von Neumann and Norbert Wiener working closely with physiologists like Warren McCulloch. Early computer scientists were captivated by the analogy between the all-or-none action potentials of the neuron and binary switches representing bits in computers. Models of biological computation thus had a seminal role in the design of modern computer architecture.
\end{quote}

Thus, while comparisons of the workings of the mind to technology helped to motivate research into further technological advances, the insight into the successful design for a significant new technology, computers, came instead from paying attention to the actual mechanisms of brain function. The computer has come to be pressed into service as a ``new'' metaphor for thought processes in biological systems. While the use of technological metaphors for cognition is persistent, having produced quaint turns of speech and aphorisms, it has not been significantly conducive to a direct understanding of the actual cognitive processes being described.
The turnabout analogy produced by McCulloch and Pitts has provided far more benefits to cognitive science and computing through its establishment of neural modelling. Binary switching technology failed to provide significant insight into biological neural activity and function. Leon Harmon (1970) noted that ``[a]nother kind of difficulty in coming to an understanding of nervous systems is that we may be conditioned into thinking about them in ways that are more constrained than we like to admit or are sufficiently aware of.'' The cultural identity which we share through environment can be a handicap in research into the nature and mechanisms of our intelligence.

The model networks of McCulloch and Pitts, an example of which is shown in Figure 1, were seen to provide useful mechanisms and insights for computing machinery. These networks consist of threshold logic units receiving excitatory and inhibitory inputs of integer value. A simple sum of these values is compared with a threshold level for the unit, and the unit is considered to ``fire'' if its inputs equal or exceed the threshold value. A unit's activity is then carried forward to further connected units in the system. McCulloch and Pitts (1943) proved that any boolean function could be created by the appropriate combination of such units. The McCulloch-Pitts formalism is quite similar to principles of design using TTL logic, as diagrams can attest. It should be noted that McCulloch and Pitts described their formalism in 1943, some sixteen years before the invention of the integrated circuit. While significant problems reduce the overall utility of McCulloch-Pitts networks, modified forms continue to be advanced (Szu 1989).
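The threshold-unit formalism is compact enough to state directly in code. The following minimal sketch (in Python) is an illustration of the idea rather than anything from the original paper; the two-input AND and OR parameter choices are assumptions made for the example, with weights of $+1$ standing for excitatory and $-1$ for inhibitory connections.

\begin{verbatim}
def mp_unit(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) if the summed input
    meets or exceeds the threshold, else stay quiet (0)."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Boolean functions realized as single units (illustrative settings):
def AND(a, b): return mp_unit([a, b], [1, 1], threshold=2)
def OR(a, b):  return mp_unit([a, b], [1, 1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
\end{verbatim}

Units like these can be wired in stages, the outputs of one stage serving as inputs to the next, which is how arbitrary boolean functions are assembled from the basic elements.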
Given the basis of a neural computational framework, additional features extracted from biological research were gradually added to proposed models. The idea that inputs to neurons had graded values modified by the efficiency of synaptic connections led to several important advances. Hebb proposed a synaptic modification rule to allow a form of learning in randomly connected networks (Hebb 1949, Levine 1983). His formulation was that the efficiency of a synapse which connects two neurons increases if both of the neurons have high activities at the same time. This relatively simple rule has a number of drawbacks, which have since been extensively cataloged and discussed. However, it cannot be denied that Hebb's Law, as it has come to be called, was an important advance for modelling.

The idea of stating an organizational principle for network design also represented an important advance for artificial neural network modelling. Rosenblatt's Perceptron architectures demonstrate this well. Rosenblatt postulated several network models; among these was a three-layer network. The layer which received external inputs was the sensory layer, and the layer which gave an output (either as a raw signal or interpreted as a motor output) was termed the response layer. An associative layer of neurons provided the articulation between sensory and response layers (Levine 1983). This general method of organization helped move artificial neural network modelling away from specifically designed networks and randomly organized network models.

The drawbacks associated with the special design of network models remain a concern for ANN modellers today. Such systems are fragile or brittle, meaning that a fault in any part of the network could cause a general failure of function. Also, if an error in design occurred, the resulting network would have no method of overcoming the design fault. This particular pitfall is known by the assumption necessary for correct function to be achieved: ``programmer omniscience.'' Since programmer omniscience cannot be guaranteed, systems predicated on this principle are inherently unreliable (Hecht-Nielsen 1986). Designing with generality as an important principle of function results in artificial neural networks that are described as self-organizing. A self-organizing network will conform its function to the particular problem or context through a process of adaptation or learning.

\section{Current Models Narrowly Focussed}

The tendency in developing an artificial neural network model is to constrain its function to a well-defined domain. This design principle enables better control over evaluation of the functionality of the design. Data from neurological or behavioral studies dealing with the problem domain can then often be directly applied to either training or evaluating the model. There exist many general functions commonly encountered in biological neural networks which can be considered to have important implications for computational study. Among these are associative memory, classification, pattern recognition, and function mapping. ANN models proposed to implement these functions include Bidirectional Associative Memory, Adaptive Resonance Theory (classification), Brain-State-in-a-Box (classification), Neocognitron (invariant pattern recognition), and Back-Propagation (function mapping) (Simpson 1988). Unfortunately, the underlying reason for each model's existence is often overlooked when different models are compared. The human tendency to attribute a single scalar quantity denoting how ``good'' an ANN model is will often cause people to overlook the fact that certain networks should not be compared as equivalents. Lippmann provides a good basic overview of six different ANN models, yet falls prey to this pitfall: each of the models is interpreted as a classification network, yet only the Hamming and ART networks are designed to function as classifiers (Lippmann 1987).

\section{Multi-architecture Integrative Systems}

The use of multiple complementary architectures in designing application systems has not often been explored. While this approach to system design is commonly touted in ANN simulation aid advertisements, it is less frequently featured in the literature. In an advertisement for the ANZA Neurocomputer, HNC Corporation states,

\begin{quote}
In these networks the interconnect geometry is already determined and the form of the transfer equations is fixed. However, the number of processing elements (neurons), their initial state and weight values, learning rates, and time constants are all user selectable, thus allowing one to customize a particular network paradigm (or combination of paradigms) to suit a particular application.
\end{quote}

There are some reasons for the relative lack of multi-paradigm (multi-architecture) research results in publication. The design principles stated earlier still hold: it is easiest to design an architecture with a narrow definition of function. There is the tendency for such architectures to incorporate simplifying assumptions which aid in simulation or real-world implementation (``casting in silicon,'' as the phrase goes).
The ease of verifying correct operation is increased for architectures which have a narrowly defined application, as relatively clean data or simple theoretical findings are likely to be available against which the architecture may be evaluated.

The relative complexity of modelling for even constrained contexts provides another reason for concentration upon single-paradigm systems. The number of parameters which can affect a model's performance ranges from zero for linear systems, such as Widrow's ADALINE (Widrow 1987), to tens of scalar parameters, as may be encountered in Carpenter and Grossberg's Adaptive Resonance Theory architecture (Carpenter and Grossberg 1987a, 1987b; Simpson 1988). When dealing with nonlinear dynamical systems for which no closed-form solution exists, slight changes of system parameters can result in large-scale changes in behavior. Carpenter and Grossberg point out relative constraints upon the parameters used in the ART 1 architecture, giving guidelines for values which should yield stable operation of the network. Finding suitable parameters for operation of a particular architecture can be a frustrating and time-intensive experience.

As an example, the ``on-center, off-surround'' network (OCOS) is based upon a relatively simple equation that appears in slightly modified form in many competitive networks. One form of this equation (Grossberg 1973) is
\begin{equation*}
\frac{dx_i}{dt} = -A x_i + (B - x_i)\bigl(I_i + f(x_i)\bigr) - (x_i + C)\sum_{\substack{k=1 \\ k \neq i}}^{n} \bigl(J_k + f(x_k)\bigr) \qquad \text{(Eq. 1)}
\end{equation*}
The first term represents decay of activity over time; the second term represents increase in activity due to excitatory input values, $I_i$, and recurrent self-excitation; the third term represents decrease in activity due to inhibitory input values, $J_k$, and competition from other nodes in the network. For each term, there is an associated parameter.

A hidden parameter of the architecture is a factor by which to multiply the result of the difference equation. This is usually set to be much less than one, especially when using the simple forward Euler method of system updating. However, there is a performance trade-off involved in setting this factor too small: the system will converge to a stable result, but will take a long time to do so. If the factor is set to a large value the network will ``blow up,'' a state of wild fluctuation in activation values for the nodes in the network. Such a system cannot converge. The solution is to find a value for this factor which will ensure convergence without an inordinate amount of time taken in reaching it. The explicit parameters given in the equation must be matched to the size of the other parameters to maintain set relationships. In order to find parameter values, I have used a network simulation which runs through permutations of sets of discrete parameter values. By examining the resulting output, which indicated whether the network run achieved convergence and the number of time steps necessary to achieve convergence, I was able to settle upon a suitable choice of parameters for regular use and was able to decipher a coarse set of relationships between the parameters.
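The following sketch makes the procedure concrete: it integrates Equation 1 with a forward Euler step and scans a small grid of discrete parameter values, reporting which combinations converge and how quickly. It is a minimal illustration only; the parameter grids, the convergence tolerance, the three-node input values, and the faster-than-linear signal function $f(x) = x^2$ are assumptions made for the example, not the values used in the thesis simulations.

\begin{verbatim}
def ocos_step(x, I, J, A, B, C, f, dt):
    """One forward-Euler update of the on-center off-surround
    network of Eq. 1; dt is the hidden step-size factor."""
    inhib = [J[k] + f(x[k]) for k in range(len(x))]
    new_x = []
    for i in range(len(x)):
        others = sum(inhib[k] for k in range(len(x)) if k != i)
        dx = (-A * x[i] + (B - x[i]) * (I[i] + f(x[i]))
              - (x[i] + C) * others)
        new_x.append(x[i] + dt * dx)
    return new_x

def converges(A, B, C, dt, steps=20000, tol=1e-8):
    """Return the step at which activity settles, or None on
    blow-up / failure to settle (the two failure modes above)."""
    f = lambda v: v * v                    # assumed signal function
    I, J = [1.0, 0.5, 0.2], [0.1, 0.1, 0.1]
    x = [0.0, 0.0, 0.0]
    for t in range(steps):
        nx = ocos_step(x, I, J, A, B, C, f, dt)
        if any(abs(v) > 1e6 for v in nx):
            return None                    # "blow-up"
        if max(abs(a - b) for a, b in zip(nx, x)) < tol:
            return t                       # converged
        x = nx
    return None

for dt in (0.001, 0.01, 0.1, 0.5):         # coarse grid search
    for A in (0.5, 1.0, 2.0):
        print("dt=%-5s A=%-3s ->" % (dt, A),
              converges(A, B=1.0, C=0.5, dt=dt))
\end{verbatim}

Output of this kind of sweep (convergence step as a function of the step-size factor) is what makes the trade-off between sluggish convergence and blow-up visible.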
In consideration of these complexities in modelling and simulation, the synthesis of complex systems from simpler subunits has some significant obstacles to overcome. There are more degrees of freedom for system operation, leading to complexities in the articulation and coordination of subunits. This may increase the necessary number of system parameters, increasing complexity in a combinatorial fashion. Usually, there will either be less data available for evaluation of the complete system behavior, or the data that is available will be uncertain. The general principles to apply to a synthetic endeavor in artificial neural network modelling remain to be elucidated.

In addition, synthesis has traditionally been given short shrift in Western culture (Paris 1989). The application of synthesis in philosophy has largely been confined to the works of Wittgenstein and Marx. Marx, of course, developed a philosophy whose application was inimical to most Western economic structures; this has caused both his work and his approach to be considered with contempt. On the positive side, however, synthesis can lead to significant insights as different assumptions and features are applied to functional design. The process of integration and hybridization in artificial neural network modelling may be expected to lead to new insights into structure and function of benefit to artificially intelligent systems, given the past history of benefits derived from consideration of concepts borrowed from disparate disciplines. In this case, the complexity of knowledge or topic of a contributing discipline can provide a measure of the possible beneficial overlap: as complexity increases, the set of concepts which may profitably be applied to the new context also increases. As this rule would predict, biological nervous systems provide a very large pool of concepts ready for reconstruction in the artificial intelligence framework.

\section{Biological Neural System Complexity}

The complexity which drives that expectation can be readily appreciated: in the human nervous system the number of neurons is in the billions, and each neuron will have connections to between tens and tens of thousands of other neurons. This physio-spatial complexity pales when compared to the complexity that can result when one attempts to define a state for the neuron. The state can be considered to be a combination of electrical activity, ionic balance, hormonal levels, input activities, and many other factors. Since for several of these attributes there exist separate contributing factors, the number of components contributing to neuron state can easily exceed twenty, and there may be hundreds of such components. Add to this that these components often have analog values, separate time scales of action, and capacity for permanent change in the neuron's response, and the possible state function for a single neuron displays a suitably bewildering complexity.

Since this level of complexity is not conducive to direct examination, the process of problem decomposition and solution is applied. The implication for artificial neural network modelling is that certain gross features are modelled. Typically, a model assumes that the major interesting component of neural function is the electrical activity of the contributing neurons or nodes. A further assumption in general currency is that the electrical activity of an ANN may be modeled with the elements or nodes responding in the manner expected of neural populations rather than of individually thresholded neurons (Levine 1983). Some models take into account generalized neurotransmitter effects, with Grossberg's gated dipole model providing a good example (Grossberg 1972). These simplifying assumptions still leave a high level of complexity to be dealt with in developing ANN designs and applications.
\section{Relevance of Biological Models}

Biology provides the exemplars for the self-organizing adaptive systems which have proven useful in artificial neural network modelling thus far. While strict adherence to biological accuracy in modelling may well be counter-productive to advances in modelling techniques, rejecting biological principles may lead to difficulties in later integrative work. The biological framework of neurophysiological function provides a basic structure which makes for a common ground of interaction in models. For example, the range of activation and output of neurons is rather strictly circumscribed, while the range of activation of an analogous unit in neural network models is limited only by the capability of the processor in specifying a floating point exponent. This can make interfacing two models that use different ranges of activation less straightforward.

The inherent complexity of biological systems can support two different arguments concerning the future of modelling. As Hawking (1988) observes, describing a complex system all at once is terrifically difficult; it is much simpler to model several different components of the overall system. Bringing these partial models together to form a complete model can become problematic, as in Hawking's example of the search for the Grand Unified Theory in physics. In one sense, then, the effort to integrate models of limited function may be premature. However, it may be that along with the incremental advances made in modelling small subunits of biological neural function, we should also attempt incremental integration of well-understood low-level models. This would help prevent the kind of situation which exists in physics, with two or more highly disjoint models prevailing and no unification or integration yet in sight. Since biological systems are inherently more complex than the physical systems upon which they are based, it becomes important to keep an eye on the eventual need for integration and resolution of subsidiary models.

Models which utilize features from two or more extant ANN architectures, such as Hecht-Nielsen's Counterpropagation Network (Hecht-Nielsen 1987), demonstrate the useful qualities which synthesis of models can bring to functionality. The counterpropagation network is derived from Kohonen's self-organizing map model and Grossberg's competitive learning networks. Hecht-Nielsen notes that this network is designed for function mapping and analyzes its performance as compared to the back-propagation architecture. While the general functionality of counterpropagation networks (CPN) remains lower than that of back-propagation networks, there exists a subclass of mapping functions for which the CPN will train faster, and there also exists a closed-form solution for the error of the CPN. As Hecht-Nielsen notes, ``Finally, CPN illustrates a key point about neurocomputing system design. Namely, that many of the existing network paradigms can be viewed as building block components that can be assembled into new configurations that offer different information processing capabilities.''

\section{Artificial Neural Network Models: Three Architectures}

For the purposes of exploring multi-architecture system design, I selected three artificial neural network architectures for incorporation into the overall system: the Hopfield-Tank network, back-propagation, and Adaptive Resonance Theory 1 models.
These networks perform three different functions: Hopfield-Tank is used for optimization or constraint satisfaction, back-propagation is a general-purpose mapping network, and Adaptive Resonance Theory 1 is a classifier network.

\subsection{Hopfield-Tank Networks}

In 1982, John Hopfield wrote an article, ``Neural networks and physical systems with emergent collective computational abilities,'' published in the \textit{Proceedings of the National Academy of Sciences}, describing a model of neural computation which was readily implementable in current solid-state technology. This article and the model it described have been widely credited with a resurgence of interest in ANN modelling. The network architecture presented by Hopfield and Tank (1985), hereafter known as HTN for ``Hopfield-Tank Network[s],'' is a single-layer, fully interconnected network model (Figure 2). There is no learning rule for this network, although various researchers have proposed modified HTN architectures that do incorporate adaptive learning. Weights between nodes in an HTN are fixed and symmetric, and connections between a node and itself are zero. The advantage of these criteria for network design is that the system dynamics can be shown to perform an ``energy minimization'' in reaching a stable state. When the weights are determined according to system constraints, the system can be characterized by a Liapunov function, providing a measure for system energy.

HTNs have been applied to various constraint satisfaction and optimization problems. Hopfield and Tank attracted much attention by demonstrating the utility of an HTN in generating good solutions to the ``Traveling Salesman Problem'' (TSP), an NP-complete problem. The TSP can be described as choosing a minimum-length path among a set of cities such that each city is visited once, and the salesman returns to the city of origin. The closed path length constitutes the measure for any particular solution. As the number of cities increases, the number of possible valid tours increases combinatorially. However, it can be shown that an HTN will produce good tours in constant time. An HTN used for TSP computation does not necessarily converge upon the optimum solution, but will reject poor solutions and give relatively ``good'' solutions. Unfortunately, it can also be shown that the HTN trades off constant-time operation for $O(n^2)$ space considerations.

Figure 2 shows the HTN architecture for solution of the TSP. Although the nodes in the network are treated as elements in a vector, the visual representation which makes the most sense to the designer and other interested humans is a matrix of nodes. By imposing this structure upon our view of the network, we can associate each row with a city in the tour, and each column with the position of a city in the final tour (city $n$ is in position $m$ in the tour). The constraints of importance are that for a valid tour, each city appears once, and each position in the tour has only one city. Thus, for the state of the network at equilibrium, we should find high activities in $n$ nodes, where $n$ is equal to the number of cities or positions. The interconnections between nodes in a row or column are inhibitory, causing highly active nodes to reduce the activity of other nodes in the same row or column. The interconnections from a node to its neighbors in adjacent columns are proportionally more inhibitory as the distance between the cities represented by those nodes increases.

\subsubsection{Hopfield-Tank Equation}

The equation defining the network's activity over time (Hopfield and Tank 1985) is
\begin{equation*}
C_i \frac{du_i}{dt} = \sum_{j=1}^{N} T_{ij} V_j - \frac{u_i}{R_i} + I_i \qquad \text{(Eq. 2)}
\end{equation*}
where $i$ and $j$ designate neurons in the network, $N$ is the total number of neurons in the network, $T_{ij}$ is a connection weight between neurons, $V_j$ is the output value function for a neuron, $u_i$ is the activity of a neuron, $I_i$ is the external input for a neuron, $C_i$ is the capacitance of a neuron, and $R_i$ is the resistance of a neuron.
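A small simulation can make the relaxation process of Equation 2 concrete. The sketch below performs a forward-Euler integration of a two-node network in which the nodes inhibit one another, so the node with the larger external input should win the competition. It is a hedged illustration, not the TSP network itself: the weight matrix, gain, time constants, and step count are assumed values, and the sigmoid used for $V_i = g(u_i)$ is one common choice of output function.

\begin{verbatim}
import math

def htn_relax(T, I, steps=2000, dt=0.01, R=1.0, C=1.0, gain=5.0):
    """Relax a Hopfield-Tank network (Eq. 2) by Euler integration.
    T: symmetric weight matrix, zero diagonal.  I: external inputs."""
    n = len(T)
    u = [0.0] * n
    for _ in range(steps):
        # Output function V = g(u): sigmoid squashed into (0, 1).
        V = [0.5 * (1.0 + math.tanh(gain * ui)) for ui in u]
        for i in range(n):
            du = sum(T[i][j] * V[j] for j in range(n)) - u[i] / R + I[i]
            u[i] += dt * du / C
    return [0.5 * (1.0 + math.tanh(gain * ui)) for ui in u]

# Two mutually inhibitory nodes; inputs favor node 0.
T = [[0.0, -2.0],
     [-2.0, 0.0]]
print(htn_relax(T, I=[1.0, 0.4]))   # node 0 active, node 1 suppressed
\end{verbatim}

For the TSP network proper, the same update would run over the $n \times n$ matrix of city/position nodes described above, with the row, column, and distance constraints encoded in $T$.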
\subsection{Back-propagation Networks}

Back-propagation is a short version of ``correction by the backward propagation of errors.'' The learning rule used in back-propagation networks (BPNs) is termed the generalized delta rule (Simpson 1988; Rumelhart, Hinton, and Williams 1986). Basically, a BPN is a multi-layer (at least three layers) network whose nodes use a sigmoid output function. The BPN will map a set of input activities to another set of output activities, given training upon a set of example input/expected output vector pairs. The BPN will generally discriminate and adapt to non-linear relationships in the training data. For example, a BPN can learn the exclusive-or relation, which a single-layer perceptron cannot. Nodes in the BPN are often called ``units.''

The basic premise of the back-propagation algorithm has been independently derived several times. Werbos (1974) is generally credited with the first publication of the learning rule, which he called dynamic feedback; Parker (1985) gave the rule the name ``learning logic''; and the popularity of the BPN architecture is primarily attributable to Rumelhart, Hinton, and Williams (1986), as related by Simpson (1988).

All units in the BPN operate in basically the same manner, with some slight differences dependent upon whether a unit resides in the input, hidden, or output layer. Generally, a unit generates an output signal as follows:
\begin{equation*}
o_j = f(\mathrm{net}_j) \qquad \text{(Eq. 3)}
\end{equation*}
where $j$ is a unit in the BPN, $\mathrm{net}_j$ represents the input to the unit, $f$ is a sigmoid function, and $o_j$ is the output of the unit.
\begin{equation*}
\mathrm{net}_j = \sum_i o_i w_{ij} \qquad \text{(Eq. 4)}
\end{equation*}
where $i$ is a unit in a preceding layer of the BPN, and $w_{ij}$ is a connection weight linking the two units.
\begin{equation*}
f(\mathrm{net}_j) = \frac{1}{1 + e^{-(\mathrm{net}_j + \theta_j)}} \qquad \text{(Eq. 5)}
\end{equation*}
where $\theta_j$ is the bias weight for the unit. An input unit has a net input which is simply the provided external input, as there are no preceding layers.

Figure 3 shows a three-layer BP network. At the bottom are five input units, which receive their activation from an external source. In the middle are sixteen hidden units, which receive their input according to Equation 4. The hidden units' output is sent on over another set of weights to the single output node. The error between the external training signal and the output node's activity provides the basis for correcting the behavior of the network. This raw measure of error is used to find the delta for the output node:
\begin{equation*}
\delta_k = (t_k - o_k) f'(\mathrm{net}_k) \qquad \text{(Eq. 6)}
\end{equation*}
where $k$ is an output unit, $t_k$ is the expected output, and $f'$ is the derivative of the output function. Fortunately, the derivative of the sigmoid function is symbolically easy to specify:
\begin{equation*}
f' = f(1 - f) \qquad \text{(Eq. 7)}
\end{equation*}
The network then must distribute this error measure backward through the network. For each of the hidden units, deltas are found as follows:
\begin{equation*}
\delta_j = \Bigl(\sum_k \delta_k w_{jk}\Bigr) f'(\mathrm{net}_j) \qquad \text{(Eq. 8)}
\end{equation*}
where $k$ is a unit in the succeeding layer of the BPN. Each of the weights in the network is changed according to
\begin{equation*}
\Delta w_{ij} = L\, o_i\, \delta_j \qquad \text{(Eq. 9)}
\end{equation*}
where $L$ is a constant representing the learning rate for the BPN. Theta values for nodes are treated in the same manner, thus
\begin{equation*}
\Delta \theta_j = L\, f(\theta_j)\, \delta_j \qquad \text{(Eq. 10)}
\end{equation*}

So, for the example BPN in Figure 3, an input vector causes the output of the input units to be distributed to the hidden units, modified by the intervening weights. Similarly, the hidden units send on their output through weights to the output unit, which provides the response of the network to the particular input vector. This is called ``feed-forward'' processing. Once the network output is known, it can be compared to the expected output, and the delta value for the output unit is determined. This begins the ``back-propagation'' process. The delta values for the hidden units can now be derived as in Equation 8. The amount of change for each of the weights between the hidden and output layers can now be found according to Equation 9. Weight changes between input and hidden layers proceed similarly. For each node in the hidden and output layers, the value of theta is also changed. This completes the normal back-propagation phase. In the example problem, we have made a slight change to the normal back-propagation process, and have allowed theta for the input units to be adaptively changed as well.

In practice, one would normally construct a network with the correct numbers of input and output units, make some guess as to the number of hidden units needed, and assign random values to the weights. The network would then be trained upon the available data vector pairs until the error becomes suitably low, or the implementor decides to make a change in network design.

Possible applications for BPNs include encrypting, data compression, non-linear pattern matching, and feature detection. Existing BP applications include translation of text inputs into phoneme outputs, acoustic signal classification, character recognition, speech analysis, motor learning, image processing, knowledge representation, combinatorial optimization, natural language, forecasting and prediction, and multi-target tracking. BP has been implemented or theorized in electronic, VLSI, and optical formats (Simpson 1988).
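Equations 3 through 10 translate directly into a short training loop. The sketch below trains a small network on the exclusive-or relation mentioned earlier. The sizes (two inputs, three hidden units, one output), the learning rate, and the epoch count are assumptions made for the example; for simplicity the bias update uses the common form $\Delta\theta_j = L\,\delta_j$ rather than Equation 10 exactly, and input-unit thetas are not adapted.

\begin{verbatim}
import math, random
random.seed(1)

f = lambda net: 1.0 / (1.0 + math.exp(-net))        # Eq. 5
n_in, n_hid, n_out = 2, 3, 1                        # assumed sizes
w1 = [[random.uniform(-1, 1) for _ in range(n_hid)]
      for _ in range(n_in)]
w2 = [[random.uniform(-1, 1) for _ in range(n_out)]
      for _ in range(n_hid)]
th1 = [random.uniform(-1, 1) for _ in range(n_hid)]  # hidden thetas
th2 = [random.uniform(-1, 1) for _ in range(n_out)]  # output thetas
L = 0.5                                              # learning rate

data = [([0,0],[0]), ([0,1],[1]), ([1,0],[1]), ([1,1],[0])]  # XOR

def forward(x):
    hid = [f(sum(x[i]*w1[i][j] for i in range(n_in)) + th1[j])
           for j in range(n_hid)]
    out = [f(sum(hid[j]*w2[j][k] for j in range(n_hid)) + th2[k])
           for k in range(n_out)]
    return hid, out

for epoch in range(20000):
    for x, t in data:
        hid, out = forward(x)                        # Eqs. 3-5
        # Output deltas, using f' = f(1 - f) (Eqs. 6-7):
        d_out = [(t[k] - out[k]) * out[k] * (1 - out[k])
                 for k in range(n_out)]
        # Hidden deltas (Eq. 8):
        d_hid = [sum(d_out[k] * w2[j][k] for k in range(n_out))
                 * hid[j] * (1 - hid[j]) for j in range(n_hid)]
        # Weight changes (Eq. 9) and simplified theta changes:
        for j in range(n_hid):
            for k in range(n_out):
                w2[j][k] += L * hid[j] * d_out[k]
            th1[j] += L * d_hid[j]
            w1[0][j] += L * x[0] * d_hid[j]
            w1[1][j] += L * x[1] * d_hid[j]
        for k in range(n_out):
            th2[k] += L * d_out[k]

for x, t in data:
    print(x, "->", round(forward(x)[1][0], 3), "expected", t[0])
\end{verbatim}

Training runs of this sort illustrate the practice described above: pick sizes, initialize the weights randomly, and train until the error is suitably low or the design is revised.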
\subsection{Adaptive Resonance Theory 1}

Adaptive Resonance Theory 1 (ART 1) is a model introduced by Gail Carpenter and Stephen Grossberg (Carpenter and Grossberg 1987a). There are two ART models of note, ART 1 and ART 2, and many modified architectures which are premised upon one or the other of the ART models. Basically, an ART architecture is a two-layer network which provides unsupervised learning of categories of inputs (Figure 4). The F1 layer is composed of ``feature nodes,'' which accept external inputs. In ART 1, these inputs are binary patterns, while ART 2 incorporates preprocessing to accept analog inputs to the F1 layer. The F2 layer is composed of ``category nodes,'' which compete to respond to valid F1 activations. There are control structures built into the architecture to prevent F2 activation without input being received at the F1 layer. There are other control structures to prevent ``resonance'' from occurring when the prototype pattern determined by the most active F2 node does not correspond to the pattern of F1 activation. Short term and long term memory are represented in ART architectures by node activations and inter-node weights, respectively. A ``bottom-up activation'' refers to the pattern of activation received by F2 nodes through the weighted links from F1 nodes. Similarly, a ``top-down activation'' is the pattern of activation received by F1 nodes through weighted links from F2 category nodes. Long term memory is changed only through resonance between the F1 nodes and a selected F2 node. A more detailed diagram of the ART 1 interconnections is shown in Figure 5.

Resonance is a state in which the long term memory traces between the F1 and F2 layers are modified so that the category node's top-down weights more closely represent the input activation. An F2 node which wins the competition among F2 nodes has its top-down activation tested against the bottom-up activation of the F1 feature nodes. If these match within a level of tolerance, called the vigilance level, a resonant state is entered and long term memory is changed. If not, the current winning F2 category node is made ineligible for further consideration against the input, and F2 competition is restarted among the remaining eligible F2 nodes.

For an example pattern presented to the ART network given in Figure 4, the presence of input turns on the gain control nodes and activates the F1 layer. The activation of the F1 layer, or bottom-up activation, is fed across a set of long term memory weights to generate a new pattern of activity at the F2 layer. The F2 layer responds with a top-down activation which is filtered by the top-down weights linking each of the F2 nodes to the F1 layer. Competition among the F2 nodes results in a single winner for application to the current feature input. A comparison of the bottom-up activation with the top-down activation yields a set-theoretic measure of the match between the presented pattern and the category represented by the F2 node. If $I$ represents the input pattern, $V(J)$ represents F2 category node $J$'s archetypal pattern, and $p$ represents the vigilance parameter, then the match must satisfy $|I \cap V(J)| \geq p\,|I|$, or reset occurs. If reset occurs, the F2 node $J$ becomes ineligible for matching to the current input pattern, and the process of bottom-up activation, top-down activation, competition, and match testing continues until some category node is found to match the input sufficiently well, or until all category nodes have been matched against the input and found to be too different. Figure 5 displays a more detailed ART 1 network, with the various weights, nodes, and connections visible.

Obviously, it is possible that none of the eligible F2 nodes will match within the vigilance level's acceptable tolerance. When an ART network is first trained, there are no eligible F2 nodes; rather, there are a number of uncommitted F2 nodes. An F2 node will be selected and enter a resonant state, providing the first category. If a subsequent input does not match this category within the vigilance level, the single F2 category is rendered ineligible and the F2 layer is reset. This brings the network to a state analogous to that of having no eligible category nodes available, and a second category node is selected and resonated with the F1 layer. At any point where no further eligible F2 nodes exist, but an uncommitted F2 node remains, a new category node is formed from the formerly uncommitted node. If no uncommitted category nodes remain, then the input has been found not to match the available categories.
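The search cycle just described (bottom-up choice, top-down match test, reset, and commitment of uncommitted nodes) reduces to a compact procedure for binary patterns. The sketch below follows the common fast-learning simplification of ART 1, representing each category's top-down template as a set of active feature indices; the choice function and the parameter values are illustrative assumptions, not the full Carpenter-Grossberg differential equations.

\begin{verbatim}
def art1_present(pattern, prototypes, vigilance=0.75, beta=1.0):
    """Present one binary pattern; return the winning category index.
    prototypes: list of sets (top-down templates); grows as needed."""
    I = {i for i, v in enumerate(pattern) if v}
    ineligible = set()                    # reset (disqualified) nodes
    while True:
        # Bottom-up competition: simplified choice function.
        best, best_score = None, -1.0
        for J, V in enumerate(prototypes):
            if J in ineligible:
                continue
            score = len(I & V) / (beta + len(V))
            if score > best_score:
                best, best_score = J, score
        if best is None:                  # no eligible committed node:
            prototypes.append(set(I))     # commit an uncommitted node
            return len(prototypes) - 1
        # Vigilance test: resonance iff |I & V(J)| >= p * |I|.
        if len(I & prototypes[best]) >= vigilance * len(I):
            prototypes[best] &= I         # fast learning: V <- I & V
            return best
        ineligible.add(best)              # reset; search continues

protos = []
for p in ([1,1,1,0,0], [1,1,0,0,0], [0,0,0,1,1]):
    print(p, "-> category", art1_present(p, protos))
\end{verbatim}

The by-product noted later in this thesis is visible here: commitment of a fresh node (the ``no eligible committed node'' branch) is exactly the point at which the network has detected a novel input.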
\chapter{Integration In Neural Network Modelling}

Integration in neural network modelling is taken here to mean the combination of different neural network architectures in a coordinated system. This differs from the casual usage sometimes found in the literature, where the term has been applied either to systems composed of multiple units of the same base architecture, or else to trivial modifications of a known architecture. Integration applies properly to cases of multiple-architecture systems, and there are some instances of systems for which the term genuinely applies but has not been used.

As an example, Matsuoka, Hamada, and Nakatsu (1989) have proposed an architecture for phoneme recognition that subdivides the hidden and output layers of a back-propagation network in order to enhance the network's ability to recognize phonemes and also to substantially reduce the training time necessary for the network. However, Matsuoka terms this architecture the Integrated Neural Network (INN). While there is a substantial improvement in training time, there is no fundamental difference between an INN and a back-propagation network: they differ only in the connectivity of the weights between the hidden and output layers. The reduction in the internal complexity of the INN can explain the decreased training time observed.

A slightly different form of integration is pursued by Foo and Szu (1989). Their ``divide and conquer'' approach to problem solving employs the same architecture, a modified Hopfield-Tank network, to handle smaller subproblems, and then brings together the resulting subproblem solutions into an overall solution. This requires some coordination to effect the overall solution, bringing into play elements of the broader integrative issues I have noted, but it is not properly an integrated system as I have defined it.

A better example of an integrated artificial neural network system appears in Cruz et al. (1989). Cruz uses a MADALINE architecture for image preprocessing and a back-propagation network for removing image distortion. The MADALINE architecture is a Multiple ADALINE system, developed by Widrow (Widrow, Pierce, and Angell 1961). The ADALINE, short for Adaptive Linear Neuron, is a neuronal model using the Least Mean Square learning rule developed by Widrow and Hoff. Widrow and Winter (1988) have updated the learning rule used for multiple-layer, multiple-ADALINE networks; the specific example given by Widrow and Winter presents a MADALINE network for invariant pattern recognition. The MADALINE architecture most closely resembles the Perceptron architecture given by Rosenblatt. In creating his integrated system, Cruz applied the ``divide and conquer'' concept not for the purpose of reducing simulation time, but in consideration of the space requirements needed for a system which could handle the $256 \times 256$ pixel images used.

\section{Integration and Convergence}

A perennial problem for the artificial neural network modeller is the issue of convergence in finite time. It is nice to know that the architecture selected for a function will converge to a solution before the heat death of the universe. It is similarly a concern that a system composed of subsidiary architectures will converge. This can be problematic, since general convergence theorems have not been found for several of the most popular architectures (Widrow 1987).
Back-propagation provides a particularly pointed example, since it is by far the most popularly used network, if the number of papers concerning applications is taken as the criterion. The generalized delta learning rule of back-propagation has long been appreciated as generally useful, yet significant progress in firmly establishing this usefulness in the form of theorems concerning convergence has been lacking. Sontag and Sussman (1989) provide a theorem demonstrating for back-propagation a result analogous to the perceptron learning theorem: if a separating solution exists, the generalized delta rule's gradient descent will find the solution in finite time. While much has been made of a theorem by Kolmogorov (Farhat 1986), it must be conceded that Kolmogorov's theorem is an existence theorem: there is some network based upon the back-propagation architecture which will perform a given mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$, but the theorem gives no clues as to what that specific network looks like.

There is some promising news concerning convergence which is of interest for building integrative ANN systems. Hirsch (1989) gives several theorems which hold that if the component subnets of a neural network converge, then the network will converge. While Hirsch's theorems assume that the convergence properties of the subnetworks can be described by Liapunov energy functions, he notes, ``It is more difficult to obtain convergence for cascades of systems that are merely assumed to be convergent, but without benefit of Liapunov functions or global asymptotic stability. One way of doing this is to place strong restrictions on the rates of convergence.'' Hirsch defines a cascade as a layered network where the output of one layer serves as the input of the next. Many integrative designs can be cast into this framework.

\section{Incremental Synthesis}

The process of synthesis leading to integrative artificial neural network modelling is important to the development of insights into topics of critical application, such as sensor fusion. By confronting directly the need for coherent internal use of available resources and capabilities, we are more likely to generate an understanding of fusion principles. The synthetic approach to modelling provides a supportive environment for creating extensive systems. Just as the topic of artificial neural network modelling benefits from the interdisciplinary nature of its supporting sciences, so a synthetically derived artificial neural network system benefits from the range of problem solution approaches and features inherent in the underlying network architectures. By creating and maintaining a system of network architectures applicable to subfunctions in the problem solution, subfunction solution by a particular system component can benefit from the co-option of features normally found in different components of the artificial neural network system.

\section{Networks Under Consideration: System Properties}

Each of the three network architectures used in the example problem of melodic composition has its own set of features and drawbacks which play an important role in system design.

\subsection{Hopfield-Tank Networks}

As mentioned earlier, the Hopfield-Tank architecture is generally used in achieving a specific function. By this I mean that each Hopfield-Tank network is designed for a particular purpose, and can provide no functionality for other, unrelated purposes. So the use of a Hopfield-Tank network should generally be reserved for functions which do not change over time.
However, I will immediately cite a counter-example, due to its elegance of integrative design. An ingenious mechanism for extending the utility of the Hopfield-Tank architecture is pursued by Tsutsumi (1989), where one back-propagation net remaps Hopfield-Tank network inputs and another back-propagation network remaps Hopfield-Tank network outputs. The problem given is one of avoiding robotic arm deadlocking. The movement space is constrained, and therefore amenable to Hopfield-Tank network solution. However, the mapping between the internal space representation and real-world arm movement must be adaptive. The Hopfield-Tank network does not provide learning rules, so the back-propagation networks provide the adaptation to real-world feedback. In this manner, some significant benefits accrue to the use of the integrated system: since the Hopfield-Tank network is static with respect to the encoded weights, it provides a good repository for the robot's joint-arm space; since the back-propagation networks are adaptive, the system can configure itself to respond to a changing environment.

Since the Hopfield-Tank network was conceived in the context of implementation in silicon, the possibility of reducing a Hopfield-Tank network instance to a hardware component can be important for real-world applications. This step would bring the benefit often touted for Hopfield-Tank networks, namely speed; speed is rarely noted in practice, since most practice involves simulation. The simplicity of the Hopfield-Tank architecture can be a strong point for system design and integration, however, even in simulated systems.

A drawback which may eliminate the Hopfield-Tank network from consideration for a particular function is that the weights for the network must be derived from the constraints to be implemented and from any data functions necessary for solution but not available in the input to the network. This requires an understanding of the system to be solved, which may not be available. Some architecture with a learning rule would then be more suitable for the application. The output of a Hopfield-Tank network must be deciphered from the final pattern of activation of the net at equilibrium (cf. the discussion in Chapter 1). In the case of the Traveling Salesman Problem, the position of active nodes provides an encoding of the placement of cities in the tour. This information might be rendered more compact in another format, depending upon the input type expected by further networks in the system.

\subsection{Back-propagation Network System Considerations}

The back-propagation architecture provides a mapping from the input vector to the output vector, and can be trained by example. The system properties of back-propagation networks include stability of learning given a fixed universe. By implication, the learning is not stable if some perturbation in the problem set changes the mapping function. The back-propagation network would then learn the new mapping over time, and the old mapping would be lost. Back-propagation networks have a moderately complex structure. The properties gained from this increase in complexity over the Hopfield-Tank architecture include the capacity for learning, the ability to extract features from input data and generate internal representations for those features, and the possibility of complex input-to-output transforms in accordance with learned associations.
Back-propagation networks can accept binary or analog inputs, so the inputs can represent conditional probabilities as well as more strictly constrained values. The outputs are analog values which can be interpreted as binary through the use of thresholding functions. This allows a wide variety of input and output possibilities in achieving overall system function. However, the choice of representation of inputs can be critical if speedy and reliable training is to occur. For example, the use of analog values should be avoided when there exists a natural partition of the range of the input into distinct states. Our example of the melodic note generator will illustrate this concern.

\subsection{ART 1 System Properties}

Adaptive Resonance Theory 1 architectures provide unsupervised learning of ``clusters'' or classifications of input vectors. An internal representation of classification archetypes is generated. This architecture ensures that new inputs will tend not to perturb the classifications of previous inputs. This compromise in the stability-plasticity tradeoff (Carpenter and Grossberg 1987a, b) can be modified for special purposes, as the melodic note generator program will demonstrate. The internal operation of the ART 1 network can provide certain features of especial interest to system design. Specifically, one by-product of the classification algorithm is the detection of novelty, which will be shown to have functional significance beyond that of the original design of the ART 1 network.

Some properties of ART 1 require particular attention from the designer. For example, there is a matching parameter (vigilance) which controls how much deviation from a category prototype is acceptable. There are no guidelines for the selection of the vigilance parameter, and it is left to the designer to select and assign a ``proper'' value. Some guidelines do exist for certain of the other learning parameters in the ART 1 architecture, such as the learning coefficients for each of the top-down and bottom-up memory equations (Carpenter and Grossberg 1987a). The ART 1 architecture provides no standard output. The designer must access internal values of the ART 1 network to provide useful information to the remainder of the system which includes that network. ART 1 is a highly complex architecture with many parameters to be selected and set by the designer. Useful modifications to the architecture would include creating adaptive functions to replace some of the static and arbitrary parameters of the network, such as the gain and reset parameters (cf. Figure 5). The system properties of the three architectures are summarized in Table 1.
Table 1. System Properties Summary

Hopfield-Tank network properties:
\begin{itemize}
\item Data initialization process is the only place for adaptive change (not "learning" at all)
\item Fast convergence (on the system, not necessarily in simulation)
\item Inflexible structure (individual design necessary)
\item Simple structure
\end{itemize}

Back-propagation network properties:
\begin{itemize}
\item Stable learning given a fixed universe
\item Change to design implies relearning is necessary (costs of self-adaptation include forgetting what has been learned)
\item Adaptive weights changed as a consequence of training (supervised learning)
\item Medium complexity of structure
\item Input-to-output transform can be computationally complex
\item Output nodes may deliver digital, bipolar, or analog values
\end{itemize}

ART network properties:
\begin{itemize}
\item Stable self-organizing learning
\item Change in design produces unknown effects on existing "knowledge"
\item Non-adaptive parameters for gain and reset are possible drawbacks, and should be replaced by adaptive functions
\end{itemize}

System considerations

Considerations of interfacing the outputs of one model to the inputs of another:

\begin{center}
\begin{tabular}{lll}
Network & Inputs & Outputs \\
\hline
HTN & analog or discrete value, one per node & discrete value, one per node \\
BPN & analog or discrete value, one per input node & analog or discrete value, one per output node \\
ART 1 & discrete value, one per input node & none specified \\
ART 2 & analog value, one per input node & none specified \\
\end{tabular}
\end{center}

\chapter{Example Problem}

In developing a simulation to test processes of integration, several factors must be considered. The simulation should have enough scope to provide good subsidiary roles for the component functions, it should be small enough to be implementable in a reasonable period of time, and it should produce output of a form which is readily apprehensible and analyzable. The first criterion is easily fulfilled; the second and third are rather more difficult. As a starting point for exploring the possibilities of an integrated approach to artificial neural network modelling, the problem of producing a melodic line in music composition was selected. The complexity of music composition in general provides ample considerations for the application of component networks. Unfortunately, the complexity of musical composition offers no hope for the relatively simple design and implementation of an artificial neural network system which addresses all the salient points. Therefore, simplifying assumptions are made to ease the requirements for the ANN system. The output, to be interpreted as a sequence of musical notes, can present some problems of evaluation, due to the qualitative context of musical evaluation in general. However, it is possible to treat the note sequence as a set of concatenated symbols, and to apply some of the information theory concepts of Shannon (1948) to conduct an analysis.

Simplifying Assumptions

The great complexity of musical composition in general is constrained to yield a problem of suitable scope for the example integrated ANN system. A limited scale covering one octave is assumed, in the key of C. There are no accidentals, and there is no explicit timing of notes. A single voice is assumed, and no harmonics are generated. A limited and fixed set of classical composition rules forms the basis for the constraint and comparison of the system.

Problem Approach

The use of several ANN architectures in creating an integrated system whose function fulfills the requirements of the musical composition problem is assumed.
A preliminary set of hypotheses as to how a composer develops a line of melody was advanced for defining the subfunctions of the composition system, or note generator. As Teuvo Kohonen (1989) notes in discussing his own ANN system for musical composition,

\begin{quote}
It is not possible to survey here the development of ideas in computer music. One of the traditional approaches, however, may be mentioned. It is based on Markov processes. Each note (pitch, duration) is thereby regarded as a stochastic state in a succession of states. The probability $\Pr = \Pr(S_i \mid S_{i-1}, S_{i-2}, \ldots)$ for state $S_i$, on the condition that the previous states are $S_{i-1}, S_{i-2}, \ldots$, is recorded from given examples. Usually three predecessor states are enough. New music is generated by starting with a key sequence to which, on the basis of $\Pr$ and, say, the three last notes, the most likely successor state is appended. The augmented sequence is used as a new key sequence, and so the process generates melodic successions ad infinitum. Auxiliary operations or rules are necessary to make typical forms (structures) of music out of pieces of melodic passages.
\end{quote}

Leaving the problem of constructing musical forms for possible later consideration, the approach given by Kohonen matches well the procedure of composition undertaken here. A candidate note and note sequence are proposed, and a critique according to classical rules is made of the last note in the candidate sequence. The approval of the critic should usually lead to the acceptance of the candidate note, but the rules should be broken often enough that there is no absolute conformity to the postulated rules. Now we must confront the problem of how best to combine models in a unified structure that accomplishes the example function of melodic composition. The solution involves identifying the strengths and weaknesses of the component models, and the functions each may accomplish. Once this subproblem mapping has been accomplished, it must be determined how to connect the subfunction module outputs to other modules to accomplish the overall task.

Example Problem Network System

The identifiable problem subfunctions are candidate note generation, sequence critique, and novelty detection. The candidate note generator subfunction should propose notes in general, but not complete, conformance with the probability that the next note continues a classical sequence. The sequence critique subfunction should evaluate the proposed next note in strict conformance with the set of classical sequences provided. Since the evaluation given by the sequence critique subfunction is an incomplete criterion for composition given expectations of novelty, some means of detecting novel sequences must exist for the use of the coordinating system. The coordinating system then has the information necessary to "break the rules" when needed to avoid long, boring sequences of strictly mechanical melody. This can also be seen as a requirement for stable and continued operation, since there exist sequences which have no classical next-note possibilities.

Candidate Note Generation

The candidate note generator should produce plausible next notes given a historical partial sequence. Since the rules for identification of plausible next notes are fixed and known, there is no need for learning in this stage of the network. The candidate note generator should also occasionally provide notes which do not necessarily conform to the expectation of a classical next note.
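For comparison, the Markov procedure Kohonen describes above can be summarized in a brief sketch. The sketch is in Python and purely illustrative; it is not a component of the system developed here.

\begin{verbatim}
# Sketch of Kohonen's Markov-process description: an order-3 model
# over note symbols. Illustrative only; not part of the thesis system.
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Count successors for each tuple of `order` predecessor notes."""
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            model[tuple(seq[i - order:i])][seq[i]] += 1
    return model

def extend(model, key, length, order=3):
    """Repeatedly append the most likely successor to a key sequence
    (assumed at least `order` notes long), per Kohonen's procedure."""
    melody = list(key)
    for _ in range(length):
        counts = model.get(tuple(melody[-order:]))
        if not counts:
            break  # no recorded successor for this context
        melody.append(counts.most_common(1)[0][0])
    return melody
\end{verbatim}

With this statistical baseline in mind, we turn to the network realization of the candidate note generator.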
The Hopfield-Tank network (HTN) was selected as the candidate note generator, since it provides the above features. HTNs are noted for their utility in constraint satisfaction and other optimization problems, which fits the requirement that only a single next note is to be proposed at a given time and that it should basically follow the probabilities of a classical next note. A well-known attribute of the HTN architecture is the inclusion of spurious local minima which do not represent "valid" solution states. By purposefully utilizing this feature, we can convert what normally constitutes a drawback into an asset for the subfunction: the spurious local minima will give the occasional proposal of non-classical next notes. The constant and known nature of the expected sequences determines the formation of the HTN weights, in conjunction with the known constraints for proper HTN function. The use of "noisy" input values can produce the semi-random distribution of possible notes that is needed for variability. Figure 6 shows the modified HTN architecture, called Bach, used in our simulation. The rows represent note values and the columns represent sequence placement. The constraints imposed on this network include the need to present a single winning note in each place in the sequence pattern, and the need to prevent endless repetition of the same candidate note. By introducing relatively strong inhibitory links within rows and columns, we satisfy the constraint requirements. We achieve preferential selection of classical next notes by somewhat reducing those inhibitory links for connections which follow classical sequence patterns; a sketch of this weight construction appears below.

Sequence Critique

The sequence critic should provide an evaluation of the conformance of the proposed next note to classical melodic sequence rules. If the sequence rules are assumed to be provided by example, then some learning function is required to allow the critic to become adept and reliable. The subfunction receives a note sequence as input, and produces an output which may be interpreted as a boolean statement of the classical nature of the candidate note. The back-propagation network (BPN) was selected as the sequence critic. The BPN, as discussed in Chapter 2, can learn any input/output transformation given to it (within some constraints upon the availability of sufficient hidden nodes to form a stable internal representation). The form of the BPN used, called Salieri, was ultimately a network accepting a binary representation of the input sequence, using twenty hidden units for internal representation, and producing an output value which was interpreted as a yes-or-no judgment. The net structure thus used forty input units, twenty hidden units, and one output unit, a medium-sized BPN.

Novelty Detection

The novelty detection function must have components enabling the recollection of past sequences for the discrimination of novel sequences. However, the space requirements should not be overwhelming. A classification system would provide good data compression while retaining nearly all of the context information needed for novelty detection. The ART 1 architecture was selected for the role of novelty detector, for it provides novelty detection as part of a classification framework. The ART 1 architecture also has other features of interest to further research on integrated and extensive systems. The ART 1 network used here, called Beethoven, is modified from the Carpenter-Grossberg architecture.
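Before detailing the modifications made to Beethoven, the Bach weight construction mentioned above can be made concrete. The sketch below is a Python illustration only: the number of sequence slots and the inhibition strengths are hypothetical, and the actual weight derivation is given by the data file listed in the appendices.

\begin{verbatim}
# Illustrative sketch of Bach-style weight construction. Units are
# indexed by (note value, sequence position); strengths and the
# sequence length are hypothetical, not the thesis's actual values.
import numpy as np

NOTES, SLOTS = 8, 16       # one octave in C; slot count illustrative
STRONG, WEAK = -2.0, -0.5  # strong vs. reduced (classical) inhibition

def build_weights(classical_pairs):
    """classical_pairs: set of (note, next_note) rule transitions."""
    n = NOTES * SLOTS
    W = np.zeros((n, n))
    def idx(note, slot):
        return note * SLOTS + slot
    for a in range(NOTES):
        for s in range(SLOTS):
            for b in range(NOTES):
                for t in range(SLOTS):
                    i, j = idx(a, s), idx(b, t)
                    if i == j:
                        continue
                    if s == t or a == b:
                        # one winner per column; no endless repetition
                        W[i, j] = STRONG
                    elif t == s + 1:
                        # reduced inhibition along classical transitions
                        # makes classical successors preferable
                        w = WEAK if (a, b) in classical_pairs else STRONG
                        W[i, j] = w
                        W[j, i] = w  # keep the weight matrix symmetric
    return W
\end{verbatim}

We return now to the modifications made to Beethoven.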
Some of the explicit features of the Carpenter-Grossberg architecture are handled as implicit assumptions in the procedural simulation. The "2/3 Rule," for example, is not invoked, since whenever Beethoven is active, the network meets the constraints of the rule. The separate rules for top-down and bottom-up weight modifications are replaced by a single rule for both, as is done in the ART 2 architecture (Carpenter and Grossberg 1987b).

Coordination

In any integrative system, some means of coordinating subfunctions becomes necessary. Some interpretation and processing of input or output terms may be accomplished by the coordinating system, and the ultimate decision as to whether a candidate note is accepted falls to it. The coordinating system is called Lobes, since the features and activities of the frontal lobes (Levine 1986, Levine and Prueitt 1989) provided the inspiration for its operation. Lobes generates the context management and state-dependent actions which drive the integrated system to completion of the intended function, melodic composition. Lobes also contains an internal boredom function, which tends to increase over time; if a boredom threshold is exceeded, Lobes changes its behavior.

Operation

In operation, the integrated ANN note generator uses its components in a sequential manner. Lobes generates a call to Bach for a candidate note, which when returned is sent to Salieri for critique. As a mechanism for preventing wastage of Beethoven category nodes, the entire sequence which Bach settles upon is used for further processing by Salieri and Beethoven. Since at the beginning of composition there is no sequence history, Bach's inputs are determined randomly over the entire sequence. As notes are added to the sequence history, Bach's inputs are determined with the addition of some noise, except for the candidate note column, which receives only random noise as input. Salieri receives the Bach sequence values and makes a determination of classical conformance, which it passes back to Lobes. Lobes then sends the entire context, sequence plus critique, on to Beethoven. At first, Beethoven is virtually certain to encode new categories for each input it receives. As time goes by, the likelihood that Beethoven will encode a new category decreases, until all category nodes are utilized. It is thus more likely that the first few notes generated will conform to classical sequence examples. Sometimes, however, a note which has no possible classical successors will be generated early in the simulation. In this case, it is likely that Beethoven's category encoding will proceed at a much faster than average pace. The indication of novelty generated by Beethoven is used by Lobes to modify the system response to the internal state, which is determined by Salieri's critique of the next note, Beethoven's detection of novelty, and the boredom threshold in Lobes. If Lobes is not bored and Salieri approves of the candidate note, the note is accepted. If, however, Lobes has reached its boredom threshold and Salieri approves of a note, Lobes will reject the candidate note and request another one from Bach. Likewise, if Salieri disapproves of the note but Lobes is bored, Salieri may be overridden and Lobes may accept the note. Indications of novelty from Beethoven can satisfy Lobes' drive to be "excited," or not-bored; this will tend to make Lobes more conservatively classical as Beethoven continues to detect novelty.
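The arbitration just described reduces to a small decision table, sketched below in Python. The threshold, the boredom increments, and the reset-on-novelty rule are hypothetical details; the text states only that boredom rises over time and that a bored Lobes may override Salieri.

\begin{verbatim}
# Sketch of the Lobes accept/reject arbitration. Numeric details are
# hypothetical; the decision structure follows the text above.
def lobes_decide(approved, novel, boredom, threshold=10):
    """Return (accept, new_boredom) for one candidate note.

    approved : Salieri's yes/no critique of the candidate note
    novel    : True if Beethoven encoded a new category
    boredom  : Lobes' internal boredom level
    """
    if novel:
        boredom = 0              # novelty satisfies the "excited" drive
    bored = boredom > threshold
    if approved and not bored:
        return True, boredom + 1   # classical, not yet boring: accept
    if approved and bored:
        return False, boredom      # too predictable: ask Bach again
    if not approved and bored:
        return True, 0             # a broken rule relieves boredom
    return False, boredom + 1      # non-classical, not bored: reject
\end{verbatim}

Note how the reset on novelty captures the tendency described above: continued novelty keeps Lobes unbored, and therefore more conservatively classical.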
Figure 7 shows the operation of the coordinated system.

\chapter{Results}

Performance

The integrated ANN note generator system produces 152 notes in about three hours when run on an 80386 PC compatible at a 16 MHz clock speed. It takes approximately fifteen hours to produce the same number of notes on an 8088-based machine with an 8087 numeric coprocessor.

Example and Analysis of Output

Appendix A contains sample output from a run of the note generator. With a problem such as musical composition, assigning an objective measure denoting the "worth" of the output is not possible. However, it is possible to compare the output of our note generator system with random sequences of note values. By use of a binomial performance measure, it is possible to quantify how much the sample output differs from a random sequence. Random sequences have their own mystique and interest, but subjective evaluations of random melodic forms by the untrained ear tend toward the negative. The output of our note generator network was intended basically to follow the guidelines of an example set of classical sequences. The system included mechanisms for breaking out of strict adherence to the guide set of sequences. Thus, the output would be expected to bear somewhat more resemblance to random sequences than the "rules" alone would dictate. Since the classical rules of composition, while not our sole criterion of fitness, are the only quantifiable part of our criteria, we compare our output with random note sequences on the basis of these rules. Table 2 summarizes the characteristics of output sequences generated under several different conditions. A random note generator, a mostly classical note generator, and our integrated ANN note generator each produced outputs, which were evaluated using the critic developed for training Salieri. The "Successes" column indicates the number of times the next note in the sequence could be considered classical. The random note generator simply output a sequence of random numbers for a set sequence length. The Classical Instructor sequence generator operated by determining the available pool of possible classical next notes, then randomly selecting one of those notes; in some cases, no next note fit the criterion of being "classical," and a random note was generated. The rest of the sequence generators were variations upon our ANN note generator. The use of different back-propagation nets in the critic role gave different results in the output. Trained and untrained back-propagation nets with analog inputs were used. No significant difference in output could be distinguished between the trained and untrained versions, but the results were far closer to random performance than to classical. The inability of the Salieri net with analog inputs to converge to a reliable and accurate performance measure explains the similarity of its results to those of the untrained version of the same type. On the other hand, a Salieri composed of a trained back-propagation network using a binary representation for inputs was able to converge to a stable and fairly accurate state. Hence the performance of the trained Salieri with binary inputs was considerably closer to classical than was the performance of the random sequence generator. In two cases, a rule-based critique system was substituted for the back-propagation network.
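The $Z$ columns of Table 2 below appear consistent with the standard pooled two-proportion $z$-statistic; since the text identifies the measure only as binomial, the following reconstruction is offered as a reading aid rather than as the thesis's own formula:
\[
Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}},
\qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2},
\]
where $x_k$ successes are observed in $n_k$ trials and $\hat{p}_k = x_k/n_k$. As a check, for the trained binary-input Salieri against the random generator, $\hat{p}_1 = 36/152 \approx 0.237$, $\hat{p}_2 = 890/10000 = 0.089$, and $\hat{p} = 926/10152 \approx 0.091$, giving $Z \approx 6.28$, in agreement with the tabled value.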
Table 2. Classical components of output sequences.

\begin{center}
\begin{tabular}{lrrrrr}
Sequence generator & Length & Successes & Proportion & $Z$ vs.\ random & $Z$ vs.\ classical \\
\hline
Random & 10,000 & 890 & 0.089 & 0.00 & $-102.54$ \\
System w/ untrained Salieri (analog inputs) & 152 & 18 & 0.118 & 1.26 & $-23.84$ \\
System w/ trained Salieri (analog inputs) & 152 & 19 & 0.125 & 1.54 & $-23.64$ \\
System w/ trained Salieri (binary inputs) & 152 & 36 & 0.237 & 6.28 & $-20.07$ \\
System w/ rule-based critique (Salieri supervisor), run 1 & 150 & 35 & 0.233 & 6.10 & $-20.06$ \\
System w/ rule-based critique (Salieri supervisor), run 2 & 150 & 47 & 0.313 & 9.42 & $-17.50$ \\
Classical Instructor (rule-based critique w/out ANN system) & 8,150 & 6,898 & 0.846 & 102.54 & 0.00 \\
\end{tabular}
\end{center}

\chapter{Discussion}

The integrated note generation system's performance suggests that it met operational expectations. It produced notes according to a mixture of somewhat conflicting criteria. This kind of operation in the midst of uncertainty characterizes many human decision-making processes, and may be assumed to play a role in human music composition as well. Our framework allows for further experimentation with hypotheses concerning the fundamental processes involved in higher-order constraint satisfaction systems in an extensive environment (cf. Pao 1989). The integrated approach has demonstrated several advantages and disadvantages in development and operation. The disadvantages include the complexity of handling several different network architectures at once, which can contribute to programmer confusion (the downfall of programmer omniscience). The necessity of dealing directly with "model mismatches," where one subnetwork may produce a different representation as output than the next subnetwork requires as input, can protract system design time. Failure to recognize that a problem exists in the articulation of networks can result in behavior that diverges wildly from expected norms. On the other hand, any complex problem may present similar difficulties regardless of the choice of solution approach. By using subnetworks of known characteristics, one may be able to achieve a solution with fewer uncertainties than a totally top-down approach would yield. With a range of different capabilities available from subsidiary networks, the likelihood of encountering an insoluble subproblem is reduced. Synthetic integrated systems also lead to a combinatorial explosion in the richness of possible system behaviors, which again is reminiscent of the increasingly interesting behaviors noted as more complex biological organisms are considered.

Simulation Concerns

The integrated ANN system for the example problem relied on several procedural programming shortcuts. For example, the implementation of "boredom" in Lobes was simply a counter compared against a threshold value. This has no basis in biological neural systems, yet it rather neatly simulates the behavior of a simple neuronal model for the same task. As another example, the decision in the ART 1 network as to which category an input belongs to is currently made on the basis of an arbitrary, winner-take-all rule. The same effect could probably be achieved by means of on-center off-surround neural interactions that are more biologically realistic. Casting the system functions into an entirely biological framework would yield a better, more capable system for future work. However, the system stands as a first effort toward this goal.

Integration and Artificial Intelligence

There have been several efforts toward integrative system design in the top-down modelling school. In the HEARSAY system (Reddy et al. 1973), the blackboard model was developed.
The blackboard model presupposes the combination of a set of possible problem-solving systems which have available the current system state. The system state is said to reside upon "the blackboard" as a visualization of the process of problem-solving. Each of the several applicable subsidiary problem-solving systems attempts to derive an incremental step toward the global solution, and competes with other such problem-solvers to control the blackboard, and thus to change the system state. This is a significant development, and one which is paralleled by concepts in several artificial neural network models. In ART models, for example, the concept of competition among various classification prototypes bears a strong resemblance to the blackboard model: if the F1 layer's activation is considered analogous to the system state, then each F2 node is analogous to a problem-solver in the blackboard model. There have been some efforts toward explicitly bringing together top-down and bottom-up models in hybrid systems. Amano et al. (1989) present an example of a phoneme recognition system combining an expert system with a perceptron network. The expert system provides feature extraction from speech data, which is input to the perceptron. The perceptron permits decision-making under uncertainty, and its output is interpreted using fuzzy logic rules. This avoids drawbacks associated with "template matching" phoneme recognition schemes. Rabelo and Alptekin (1989) have integrated a neural network with an expert system into an intelligent scheduling system for manufacturing applications. Their system has the ability to learn from experience and to generate schedules within real-time constraints.

Neurobiological Evidence of Multifunctionality

Multi-state functionality of memory is supported by the work of Nottebohm (1989) on the development and remembrance of complex songs in songbirds. Through a series of studies, Nottebohm demonstrates that hormonal changes can cause the forgetting of songs in a bird's repertoire, or allow the formation and remembrance of new songs. The hormonal changes in question center around testosterone, and typically the mating season is the time when levels of testosterone allow the formation of songs, presumably conferring a reproductive advantage on male songbirds. Nottebohm demonstrates that the changes are dependent on hormones alone by the simple strategy of artificially inducing song creation through the application of testosterone to songbirds of both sexes at various times of the year. The withholding of regular doses of testosterone was also shown to cause the forgetting of known songs in the same birds. The implications of Nottebohm's work include support for state-dependent memory. Since a specific memory function can be modulated by a specific hormone, other memory systems may also have recall dependent upon some hormone or other chemically mediated process. Given that recall and learning can be so modulated, the necessity of taking state dependencies (levels of hormones and other neuroactive substances) into account for system function can be appreciated. State-dependent memory is also supported by the work of many other investigators, such as Bower (1981). In Bower's experiments, happy or sad moods were induced in subjects by hypnotic suggestion, in order to investigate the influence of emotions on memory and thinking.
This influence was profound; for example, people recalled a greater percentage of those experiences which had taken place when they were in the same mood as during recall. Also, when the feeling tone of a story agreed with a reader's hypnotically generated emotion, the reader found the events and characters in the story more memorable and easier to identify with. State dependence of memory or other neural function can give rise to quite useful modelling constructs. The ability to recast problem solutions given functional states becomes biologically justifiable. The logical power of conditional activation of entire subnetworks becomes available through the modelling, however coarse, of these state dependencies. The almost direct implementation of expert system analogues which can be analyzed in a completely biological and ANN context is made possible. Implications include the higher-order integration of functionally changed sub-units over time. In humans, well-known state dependencies include the fight-or-flight response to norepinephrine production and the diving response typical of mammals entering cold water. Without the appropriate integrative control, neither the fight-or-flight response nor the diving response would produce the desired, or selected, effect. The coordination of separate functional neural "circuits" is clearly present; the exact mechanism remains to be elucidated, but there have been some promising beginnings in neural network models. For example, in the neural model of attention described by Grossberg (1975), there is competition between nodes representing activations of different drives (hunger, thirst, sex, etc.). The winner of this competition is not determined solely by which drive is highest, but also by the availability of compatible cues in the environment. State-dependent memory implies the existence of functional changes over time in cortical structures. Since we now have evidence of multi-modal neural circuitry, at least some consideration should be given to the implied necessary integration. The problem of understanding a system which is dynamic not only in its processing of input but also in its functional neural subsystems is both daunting and exhilarating. It is daunting because the complexities of modelling such systems exceed our current capacity for ready assimilation and understanding of the underlying concepts and mechanisms (which have not yet been elucidated), and exhilarating because there appears to be no end to the variety of expression of these systems in the natural world, and thus no apparent end to the problem-solving challenge awaiting the researcher. The function of speech processing in humans, for example, requires the acquisition of external signals, the separation of those signals into semantic and affective content, the recognition of mode in affective content, the parsing of semantic content, and the integration of semantic and affective content to determine meaning. This list of subfunctions is not complete, which gives an indication of the extent to which integration remains a regular and important activity in biological systems.

The Triune Brain Theory

Integrative theories of neural/cognitive function have a long history. One of the best known is the triune brain theory of Paul MacLean (1970). MacLean's research into behavioral studies of different brain areas led him to propose that the human brain is divided into three developmentally derived regions of separate function (see Levine 1990 for discussion).
The earliest, and presumably the most primitive, region is termed the reptilian brain, and is composed of the brainstem and basal ganglia. The reptilian brain is responsible, in this theory, for preprogrammed, innate behaviors. The paleomammalian brain, composed of the limbic system, modifies the expressed pattern of reptilian brain responses and is the source, in this theory, of the basic emotions (love, hate, fear, arousal, etc.). The neomammalian brain, composed of the newer parts of the cerebral cortex, provides further modification of the expression of the two older brain areas, and gives us our rational capacity, seen in the abilities for planning and verbal expression. While MacLean's theory is oversimplified, it does provide a useful set of distinctions between various cognitive subfunctions, all of which are involved in complex behaviors. In fact, if one stretches the imagination, one can draw analogies between the reptilian brain and our Hopfield-Tank network; the paleomammalian brain and our ART 1 network; and the neomammalian brain and our back-propagation network. The integrated ANN note generator had its origins in a collaborative effort to develop an extensive ANN system suitable for exploring multi-modal cognitive hypotheses (Blackwood, Elsberry, and Leven 1988). That project, in turn, was derived from insights provided by Leven (1987b). Leven's SAM model was depicted in a manner which led to a discussion of the possibility of replacing the components of MacLean's triune brain model with current ANN architectures. The difficulty of describing a suitably restricted problem for adequate application of the limited current architectures was resolved with the simple melodic composition problem outlined previously. Points of difference from the original MacLean theory can be attributed to certain changes in model context (as modified by Leven's separation of memory into three components: motoric/instinctive [reptilian], sensory/affective [paleomammalian], and associative/semantic [neomammalian] (Leven 1987a)) and to the mismatch of architectures derived not for their similarity to these basic cognitive forms, but to satisfy more immediate criteria such as implementability in current electronic devices. The desired system based on our loose analogy to MacLean's theory has been demonstrated to be operational and ready for incremental refinement. We hope that in future work such analogies can be made more precise. The development of both neural network theory and neuroscientific data should allow critical research to continue into these theories of integrative cognitive function.

\appendix

\chapter{Sample Melody Output Of The Various Note Generator Programs}

Integrated ANN Note Generator Sample Output: b61t output, pages 1--3.

Random Note Generator Sample Output: random output, pages 1--3.

Classical Note Generator Sample Output: classical output, pages 1--3.

\chapter{Program Source Listing: Integrated ANN Note Generator}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Back-Propagation Unit}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}.
The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: List Structures Unit}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Miscellaneous Procedures Unit}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Global Types And Variables Unit}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Classical Instructor Unit}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: ANSI Screen Control Unit (Unit Ansi\_Z)}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Musical Sequence Evaluator Program}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Random Note Generation Program}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Note Sequence Playing Program}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Rule-Based Note Sequence Generation Program}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Offline Back-Propagation Network Training Program}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Data File Listing: Hopfield-Tank Network Weight Data File}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Data File Listing: Classical Sequences Data File}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Data File Listing: Back-Propagation Network Data File}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}.
The automated thesis conversion suppresses the full listing here to keep the document manageable.

\chapter{Program Source Listing: Translator From Program Note Files To Music Transcription System Song Format}

This appendix is represented in the repository by the legacy source and data files in \texttt{THES/}. The automated thesis conversion suppresses the full listing here to keep the document manageable.

\nocite{*}
\bibliographystyle{plain}
\bibliography{integration_and_hybridization_in_neural_network_modelling}
\end{document}