Representational Trajectories in Connectionist Learning

Andy Clark
School of Cognitive and Computing Sciences, University of Sussex, Brighton, BN1 9QH
CSRP 292

Abstract

The paper considers the problems involved in getting neural networks to learn about highly structured task domains. A central problem concerns the tendency of networks to learn only a set of shallow (non-generalizable) representations for the task, i.e. to 'miss' the deep organizing features of the domain. Various solutions are examined, including task-specific network configuration and incremental learning. The latter strategy is the more attractive since it holds out the promise of a task-independent solution to the problem. Once we see exactly how the solution works, however, it becomes clear that it is limited to a special class of cases in which (1) statistically driven undersampling is (luckily) equivalent to task decomposition, and (2) the dangers of unlearning are somehow being minimized. The technique is suggestive nonetheless, for a variety of developmental factors may yield the functional equivalent of both statistical AND 'informed' undersampling in early learning.

Key words: Connectionism, learning, development, recurrent networks, unlearning, catastrophic forgetting.

0. An Impossible Question.

Some questions are just impossible. They are impossible because the question is asked in a kind of vacuum which leaves crucial parameters unspecified. The customer who goes into a shop and demands "How much?" without specifying the item concerned poses just such an impossible question. So too, I shall argue, does the philosopher or cognitive scientist who asks, baldly, "What kind of problem solving comes naturally to a connectionist system?" Yet just such a question seems close to the surface of many recent debates. (For especially clear examples, see the opening paragraphs of most papers offering hybrid (Connectionist/classical) cognitive models; e.g. Cooper and Franks (1991), Finch and Chater (1991).) It is not my purpose, in what follows, to argue that there are no generic differences between connectionist and classical approaches, nor that such differences, carefully understood, might not indeed motivate a hybrid approach. The goal is rather to demonstrate, using concrete examples, the wide range of parameters which need to be taken into account before deciding, of some particular problem, whether it is or is not a suitable case for connectionist treatment. The main claim of the paper is that the question of what comes 'naturally' to connectionist systems cannot be resolved by reflection on the nature of connectionist architecture and processing alone. Instead, a variety of (at least superficially) disparate parameters all turn out to be fully interchangeable determinants of whether or not a network approach will succeed at a given task. Such parameters include (1) the overall configuration of the network (e.g. number of units, layers, modules etc.), (2) the nature of the input primitives, (3) the temporal sequence of training and (4) the temporal development of the network configuration (if any). Some of these parameters are obvious enough, though insufficient attention is paid to them in actual debates (e.g. (1) and (2)). Others are less obvious but, as we shall see, no less important. The strategy of the paper is as follows. I begin (section 1) by detailing two cases in which networks fail to solve a target problem. In section 2 I rehearse a variety of different ways in which such shortcomings may be remedied.
These remedies involve tweaking the various neglected parameters mentioned above. I go on (section 3) to show how these rather specific instances bear on a much bigger issue, viz. how cognitive development depends not just on the learning device and data, but on the further 'scaffolding' provided by course of training, selective attention, sensorimotor development, linguistic surround and other factors.

1. Net Failures.

Sometimes the failure of a system is more instructive than (would have been) its success. A case in point is Norris' (1990), (1991) attempt to use a multi-layer, feedforward connectionist network to model date calculation as performed by 'idiot savants'. These are people who, despite a low general intelligence, are able to perform quite remarkable feats of specific problem solving. In the date calculation case, this involves telling you, for almost any date you care to name (e.g. November 20 in the year 123,000!), what day of the week (Monday, Tuesday etc.) it falls on. The best such date calculators can successfully perform this task for dates in years which I can hardly pronounce, the top limit being about the year 123470. Norris conjectured that since idiot savant date calculators are solving the problem despite low general intelligence, it might be that they are using just what Norris describes as a 'low level learning algorithm' like backpropagation of error in a connectionist net (see Norris (1991) p.294). The task, however, turned out to be surprisingly resistant to connectionist learning. Norris initially took a 3-layer network and trained it on 20% of all the day/date combinations in a fifty-year period. The result was uninspiring. The network learned the training cases by rote, but failed to generalise to any other dates. Perhaps the fault lay with some simple aspect of the configuration? Norris tried permutations of numbers of units, numbers of layers of hidden units etc. To no avail. Is the date calculation problem therefore one which connectionist models are not 'naturally equipped' to solve? The issue is much more complex, as we shall see.

Here is a second example. Elman (1991-a) describes his initial attempts to persuade a 3-layer network to learn about the grammatical structure of a simple artificial language. The language exhibited grammatical features including verb/subject number agreement, multiple clause embedding and long distance dependencies (e.g. of number agreement across embedded clauses). The network belonged to the class of so-called 'recurrent' networks and hence exploited an additional group of input units whose task is to feed the network a copy of the hidden unit activation pattern from the previous processing cycle alongside the new input. In effect, such nets have a functional analogue of short-term memory: they are reminded, when fed input 2, of the state evoked in them by input 1, and so on. The task of the net was to take as input a sentence fragment and to produce as output an acceptable successor item, i.e. one which satisfies any grammatical constraints (e.g. of verb/number agreement) set up by the input.
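For readers unfamiliar with this architecture, here is a minimal sketch of the forward pass of such a simple recurrent network, written in Python. The layer sizes, random weights and sigmoid activations are illustrative assumptions rather than Elman's actual configuration, and the training procedure (backpropagation on the successor-prediction task) is omitted.

```python
import numpy as np

# A minimal forward pass for a simple recurrent (Elman-style) network.
# The 'context units' hold a copy of the previous hidden activation and are
# fed back alongside each new input. Sizes and random weights are
# illustrative assumptions only.

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 26, 70, 26            # e.g. localist word codes in and out
W_ih = rng.normal(0, 0.1, (n_hid, n_in))   # input   -> hidden
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(0, 0.1, (n_out, n_hid))  # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_sentence(words):
    """Process a sequence of input vectors, returning the output at each step."""
    context = np.full(n_hid, 0.5)          # 'neutral' starting context
    outputs = []
    for x in words:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden.copy()            # context units copy the hidden layer
    return outputs

# Toy usage: a 'sentence' of three one-hot word codes.
sentence = [np.eye(n_in)[i] for i in (3, 17, 5)]
predictions = run_sentence(sentence)
```

In a trained net of this kind, each output would be read as a prediction of the acceptable successors to the fragment seen so far.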
Alas, the Elman net, too, failed at its task. It failed completely to generalise to new cases (i.e. to deal with inputs not explicitly given during training) and got only a badly incomplete grip on the training cases themselves. The network had failed to learn to use its resources (of units and weights) to encode knowledge of the deep organising features of the domain - features such as whether or not an input was singular or plural. (Such a failure can be demonstrated by the use of techniques such as principal components analysis - see Gonzalez and Wintz (1977), Elman (1991-b).) The result was a network whose internal representations failed to fix on the grammatical features essential to solving the problem. Does this kind of grammar therefore represent another problem space which connectionist learning (even using the relatively advanced resources of a recurrent network) is fundamentally ill-suited to penetrate? Once again, the issue is considerably more complicated, as we shall now see.

2. How to Learn the Right Thing.

One way of solving a learning problem is, in effect, to give up on it. Thus it could be argued that certain features are simply unlearnable, by connectionist means, on the basis of certain bodies of training data, and hence that the 'answer' is either to give up on connectionist learning (for that task) or to build more of the target knowledge into the training data in net-accessible ways. Thornton (1991) suggests that networks can only learn features which are present in the first-order statistics of a training set. First-order statistics cover e.g. the number of occurrences of an item in a set, while second-order ones cover e.g. the number of items which have some first-order statistical frequency in the set, i.e. they are statistics about statistics. The input description language, if this is right, must always be such that the target knowledge is statistically first order with respect to it. Hence very often the solution to a learning failure will be to alter the input description language. This is an interesting claim, and one pursued further in Clark (forthcoming). Nonetheless, the input description language, although it no doubt could be manipulated to solve many instances of network failure, need not always be tampered with. In the present section I examine a variety of ways of dealing with the failures described in section 1 which keep the training corpus (and hence the input description language) fixed, but instead manipulate one or other of a variety of different, often neglected, parameters. The goal will be to show first, that such corpus-preserving fixes are possible; second, that the various fixes are pretty much interchangeable (and hence, I believe, that the roles of architecture, input code and training sequence are not fundamentally distinct in connectionist learning); and third, that the problems (from section 1) and the fixes described have more in common than meets the eye. By the end of the section it should be increasingly clear just how complex the question of what comes naturally to a connectionist system really is.

The first fix I want to consider is by far the most brutal. It involves pre-configuring the network in a highly problem-specific way. Recall Norris' unsuccessful attempt to model date calculation. To generate a successful model, Norris reflected on the logical form of a particular date calculation algorithm. The algorithm involved three steps. First, day/date pairings are specified (by rote) for a base month (say, November 1957). Second, offsets are learnt to allow the generalisation of the base month knowledge to all the other months in that year (1957). Finally, offsets between years are learnt (i.e. a one-day offset between consecutive years, modulo leap years). With this algorithm in mind, Norris chose a global configuration comprising 3 distinct sub-nets, one of which would be trained on each distinct sub-task (i.e.
base month modelling, base year transformations, and cross-year transformations). Each sub-net was trained to perform its own specific part of the task (in logical sequence) and learning in it was stopped before training the next sub-net. Thus learning was stopped in sub-net 1 before training sub-net 2 and so on. Sub-net 2 would take output from sub-net 1 and transform it as needed, and sub-net 3 would take output from 2 and do the same. The upshot of this pre-configuration and training management was, perhaps unsurprisingly, a system capable of solving the problem in a fully generalizable manner. The final system was about 90% accurate, failing mainly on leap-year cases of the same kind as cause problems for human date calculators (see Norris (1991) p.295). This result is, however, at best only mildly encouraging. It shows that the problem space can be negotiated by connectionist learning. And true, the solution does not require amending the input description language itself. But the solution depends on a task-specific configuration (and training regime) which is bought only by drastic human intervention. Such intervention (assuming the goal is to develop good psychological models of human problem solving) is, as far as I can see, legitimate only if either (a) we can reasonably suppose that the long-term processes of biological evolution in the species have pre-configured our own neural resources in analogous ways, or (b) we are assuming that the configuring etc. can be automatically achieved, in individual cognitive development, by processes as yet unmodelled. In the case at hand, (a) seems somewhat implausible. The real question that faces us is thus whether there exist fixes which do not depend on unacceptable kinds of human intervention. Recent work by Jeffrey Elman suggests that the answer is a tentative 'yes' and that the key lies in (what I shall label) the scaffolding of a representational trajectory. Hence we move to our second fix, viz. manipulating the training.

Recall Elman's failed attempt to get a recurrent network to learn the key features of a simple grammar. One way of solving the problem is, it turns out, to divide the training corpus into graded batches and to train the network by exposure to a sequence of such batches, beginning with one containing only the simplest sentence structures and culminating with one containing the most complex ones. Thus the net is first trained on 10,000 sentences exhibiting e.g. verb/subject number agreement but without any relative clauses, long distance embeddings etc., and then it is gradually introduced to more and more complex cases. The introduction of the progressively more complex cases was gradual in two senses. First, insofar as it was accomplished by grading the sentences into five levels of complexity and exposing the net to example batches at each level in turn. And second (this will be important later on), because the network was 'reminded', at each subsequent stage of training, of the kinds of sentence structure it had seen in the earlier stages. Thus, for example, stage 1 consisted of exposure to 10,000 very simple sentences, and stage 2 consisted of exposure to 2,500 sentences of a more complex kind plus 7,500 (new) very simple cases. This 'phased training' regime enables the network to solve the problem, i.e. to learn the key features of the artificial language. And it does so without amending the basic architecture of the system and without changing the content of the corpus or the form of the input code. What makes the difference, it seems, is solely the sequential order of the training cases.
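To make the shape of such a regime concrete, the following Python sketch assembles a phased training schedule of this general kind. Only the stage 1 and stage 2 proportions are those reported above; the later proportions, the toy sentence generators and the train_step hook are illustrative assumptions.

```python
import random

# A sketch of a phased-training schedule of the kind described above: five
# graded batches of 10,000 sentences, with progressively more complex cases
# mixed in alongside fresh simple ones as 'reminders'. Only the stage-1 and
# stage-2 proportions come from the text; the later proportions and the toy
# sentence generators are assumptions for illustration.

def simple_sentence():
    return random.choice(["boy chases dog", "boys chase dogs", "girl sees cats"])

def complex_sentence():
    return random.choice(["boy who dogs chase sees girl",
                          "girls who the cat sees walk"])

BATCH_SIZE = 10_000
COMPLEX_FRACTION = [0.00, 0.25, 0.50, 0.75, 0.75]   # proportions beyond stage 2 assumed

def phased_corpus():
    """Yield (stage, sentence) pairs: simplest structures first, complexity phased in."""
    for stage, frac in enumerate(COMPLEX_FRACTION, start=1):
        n_complex = int(frac * BATCH_SIZE)
        batch = ([complex_sentence() for _ in range(n_complex)] +
                 [simple_sentence() for _ in range(BATCH_SIZE - n_complex)])
        random.shuffle(batch)            # reminders interleaved with new material
        yield from ((stage, s) for s in batch)

# Usage (train_step is a hypothetical hook into whatever learning rule the net uses):
# for stage, sentence in phased_corpus():
#     train_step(net, sentence)
```

The point of the sketch is simply that nothing about the corpus itself changes; only the order in which its members are encountered is managed.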
Why should this be so effective? The answer, according to Elman, is that phasing the training allows the network to spot, in the early stages, the most basic domain rules and features (e.g. the idea of singular and plural and of verb/subject number agreement). Knowing these basic rules and features, the net has a much smaller logical space to search when faced with the more complex cases. It is thus able to 'constrain the solution space to just that region which contains the true solution' (Elman (1991-a) p.8). By contrast, the original net (which did not have the benefit of phased training) saw some very complex cases right at the start. These forced it to search wildly for solutions to problems which in fact depended on the solutions to other simpler problems. As a result it generated lots of 'ad hoc' small hypotheses which then (ironically) would serve to obscure even the grammatical structure of the simple cases. Such a net is, in effect, trying to run before it can walk, and with the usual disastrous consequences.

At this point we begin to see a common thread uniting the grammar case and the date calculation case. Both domains require, in a very broad sense, hierarchical problem solving. In each case there is a problem domain which requires, for its successful negotiation, that a system de-compose the overall problem into an ordered series of sub-problems. In the grammar case, this involves first solving e.g. the verb/subject number agreement problem and only later attacking the problem of relative clauses. In the date calculation case it involves e.g. first solving the problem for the base year, and only later attacking the problem of other years. We can express the general moral, made explicit in Elman (1991-a), like this:

Elman's Representational Trajectory Hypothesis: There is a class of domains in which certain problem solutions act as the 'building blocks' for the solutions to other, more complex problems. In such domains connectionist learning is efficient only if the overall problem domain can somehow be de-composed and presented to the net in an ordered sequence. In the absence of such de-composition the basic regularities (the 'building blocks') are obscured by the net's wild attempts to solve the more complex problems, and the more complex problems are, practically speaking, insoluble.

The solution, as we have seen, is to somehow sculpt the network's representational trajectory, i.e. to force it to solve the 'building block' problems first. This can be achieved either by direct manipulation of the architecture and training (Norris) or by 'scaffolding' a network by the careful manipulation of the training data alone (Elman). The Norris solution, however, was seen to involve undesirable amounts of problem-specific human intervention. The phased training solution is a little better insofar as it does not require problem-specific pre-configuration of the architecture. The third and final fix I want to consider is one which involves neither phasing the training nor pre-configuring the architecture to suit the problem. It is what Elman calls 'phasing the memory' and it represents the closest approximation to the ideal of an automatic means of sculpting the representational trajectory of a network.
Recall that short-term memory, in the Elman network, is given by a set of so-called context units whose task is to copy back, alongside the next input to the net, a replica of the hidden unit activation pattern from the previous cycle. The 'phased memory' fix involves beginning by depriving the network of much of this feedback, and then slowly (as training continues) allowing it more and more until finally the net has the full feedback resources of the original. The feedback deprivation worked by setting the context units to 0.5 (i.e. eliminating informative feedback) after a set number of words had been given as input. Once again, there were five phases involved. But this time the training data was not sorted into simple and complex batches. Instead, a fully mixed batch was presented every time. The phases were as follows.

Phase 1: feedback eliminated after every 3rd/4th word (randomly).
Phase 2: feedback eliminated after every 4th/5th word (randomly).
Phase 3: feedback eliminated after every 5th/6th word (randomly).
Phase 4: feedback eliminated after every 6th/7th word (randomly).
Phase 5: full feedback allowed (i.e. the same as the original net used in the earlier studies).

In short, we have a net which, as Elman puts it, 'starts small' and develops, over time, until it reaches the effective configuration of the original recurrent net. This 'growing' network, although exposed to fully mixed sentence-types at all stages, is nonetheless able to learn the artificial grammar just as well as did the 'phased training' net. Why should this be so? The reason seems to be that the early memory limitations block the net's initial access to the full complexities of the input data; hence it is not tempted to thrash around seeking the principles which explain the complex sentences. Instead, the early learning can only target those sentences and sentence fragments whose grammatical structure is visible in a 4/5 word window. Unsurprisingly, these are mostly the simple sentences, i.e. ones which exhibit such properties as verb/subject number agreement but do not display e.g. long distance dependencies, embeddings, etc. The 'phased memory' solution thus has the same functional effect as the phased training, viz. it automatically decomposes the net's task into a well ordered series of sub-tasks (first agreement, then embeddings etc.). The key to success, we saw, is to somehow or other achieve task-decomposition. The great attraction of the 'phased memory' strategy is that the decomposition is automatic - it does not require task-specific human intervention (unlike e.g. the Norris solution or the phased training solution). It is always reassuring to learn that the use of limited resources (as in the net's early memory limitations) can bring positive benefits.

In suggesting a precise way in which early cognitive limitations may play a crucial role in enabling a system to learn about certain kinds of domain, Elman's work locates itself alongside E. Newport's (1988), (1990) studies concerning the explanation of young children's powerful language acquisition skills. It is worth closing the present section by summarizing this work as it helps reveal important features of the general approach. Newport was concerned to develop an alternative explanation of young children's facility at language acquisition. Instead of positing an innate endowment which simply decays over time, Newport proposed what she labelled a "Less is More" hypothesis, viz. that:
The very limitations of the young child's information processing abilities provide the basis on which successful language acquisition occurs. (Newport (1990) p.23)

(Note: the "less is more" hypothesis remains neutral on the question of whether a significant innate endowment operates. What it insists on is that the age-related decline in language acquisition skills is not caused by the decay of such an endowment.) The key intellectual limitation which (paradoxically) helps the young child learn is identified by Newport as the reduced ability to "accurately perceive and remember complex stimuli" (op. cit. p.24). Adults bring a larger short term memory to bear on the task, and this inhibits learning since it results in the storage of whole linguistic structures which then need to be analysed into significant components. By contrast, Newport suggests, the child's perceptual limitations automatically highlight the relevant components. The child's reduced perceptual window thus picks out precisely the "particular units which a less limited learner could only find by computational means" (op. cit. p.25). Some evidence for such a view comes from studies of the different error patterns exhibited by late and early learners. Late learners (generally second language learners) tend to produce inappropriate "frozen wholes". Early learners tend to produce parts of complex structures with whole components missing. The idea, then, is that children's perceptual limitations obviate the need for a computational step in which the components of a complex linguistic structure are isolated. As a result:

Young children and adults exposed to similar linguistic environments may nevertheless have very different internal databases on which to perform a linguistic analysis. (Newport (1990) p.26)

The gross data to which children and adults are exposed is thus the same, but the effective data, for the child, is more closely tailored to the learning task (for more on this theme, see Clark (forthcoming, Ch.9)). Newport's conjectures and Elman's computational demonstration thus present an interesting new perspective on the explanation of successful learning. This perspective should be of special interest to developmental cognitive psychology. In the next section I try to expand and clarify the developmental dimensions while at the same time raising some problems concerning the generality of the "phased memory" and "less is more" solutions.

3. The Bigger Picture: Scaffolding and Development.

Faced with a hierarchically structured problem domain, connectionist networks have, we saw, a distressing tendency to get 'lost in space(s)'. They try to solve for all the observed regularities at once, and hence solve for none of them. The remedy is to sculpt the network's representational trajectory; to force it to focus on the 'building block' regularities first. The ways of achieving this are quite remarkably various, as demonstrated in section 2. It can be achieved by direct configuration of the architecture into task-specific sub-nets, or by re-designing the input code, or by fixing the training sequence, or by phasing the memory. In fact, the variety of parameters whose setting could make all the difference is, I believe, even larger than it already appears. To see this, notice first that the mechanism by which both the Elman solutions work is undersampling. The network begins by looking at only a subset of the training corpus.
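That undersampling mechanism is simple enough to set out concretely. The Python sketch below follows the 'phased memory' schedule described in section 2: the context vector is wiped to a neutral 0.5 after a small, randomly chosen number of words, and the window widens phase by phase. The hidden-layer size and the step function (standing in for one pass of the recurrent net) are hypothetical placeholders.

```python
import random
import numpy as np

# A sketch of the 'phased memory' regime: informative feedback is eliminated
# (context units reset to 0.5) after every nth/(n+1)th word, with the window
# widening across five phases. The corpus stays fully mixed throughout; only
# the memory is phased. `step` is a hypothetical hook standing in for one
# forward/backward pass of the recurrent net, returning the new hidden state.

PHASE_WINDOWS = {1: (3, 4), 2: (4, 5), 3: (5, 6), 4: (6, 7), 5: None}  # None = full feedback
N_HIDDEN = 70                                                          # illustrative size

def train_phase(word_stream, phase, step):
    window = PHASE_WINDOWS[phase]
    context = np.full(N_HIDDEN, 0.5)
    limit = random.choice(window) if window else None
    since_reset = 0
    for word in word_stream:               # fully mixed sentences at every phase
        context = step(word, context)
        since_reset += 1
        if limit is not None and since_reset >= limit:
            context = np.full(N_HIDDEN, 0.5)   # wipe informative feedback
            since_reset = 0
            limit = random.choice(window)
    return context
```

The filter is purely statistical: whatever grammatical structure happens to fit inside the current window is all the net can profit from at that phase.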
But actual physical growth (as in the incremental expansion of the memory) is not necessary in order to achieve such initial undersampling, even supposing that no interference with the training corpus (no sorting into batches) is allowed. The heart of the 'phased memory' solution is not physical growth so much as progressive resource allocation. And this could be achieved even in a system which had already developed its full, mature resources. All that is required is that, when the system first attends to the problem, it should not allocate all these resources to its solution. In the Elman net, the memory feedback was initially reduced by setting the context units to 0.5 after every 4/5 words. A similar effect would be obtained by adding noise after every 4/5 words. Even switching attention to a different problem would do this, since relative to the grammar problem, the new inputs (and hence the subsequent state of the context units) would be mere noise. Rapidly switching attention between two problems might then have the same beneficial effect as phasing the overall memory. Limited attention span in early infancy might thus be a positive factor in learning, as might the deliberate curtailing of early efforts at problem solution in adult cognition. In general, it seems possible that one functional role of salience and selective attention may be to provide precisely the kind of input filter on which the phased memory result rests. (In the case of learning a grammar it is worth wondering whether the fact that a young child cares most about the kinds of content in fact carried by the simple sentences may play just such a functional role, i.e. the child's interests yield a selective filter which results in a beneficial undersampling of the data. There is a kind of 'virtuous circle' here, since what the child can care about will, to an extent, be determined by what she can already understand.) Less speculatively, Elman himself notes (personal communication) that there are mechanisms besides actual synaptic growth which might provide a physical basis for early undersampling, e.g. delays in cortical myelinization resulting in high noise levels along the poorly myelinated pathways. A further developmental factor capable of yielding early undersampling is the gradual development of physical motor skills. This provides us with a staged series of experiences of manipulating our environment, with complex manipulations coming after simple ones. Once again, it may be that this automatic phasing of our learning is crucial to our eventual appreciation of the nature of the behaviour of objects (i.e. to the development of a 'naive physics' - see e.g. Hayes (1985)). Going deeper still, it is worth recalling the functional role of undersampling. The role is to enable the system to fix on an initial set of weights (i.e. some initial domain knowledge) which serve to constrain the search space explored later when it is faced with more complex regularities. As Elman puts it, "the effect of early learning ... is to constrain the solution space to a much smaller region" (Elman (1991-a) p.7) - i.e. to one containing fewer local minima. Given this role, however, we can see that at least three other factors could play the same role. They are (1) the presence in the initial system of any kinds of useful innate knowledge, i.e. (in connectionist terms) any pre-setting of weights and/or pre-configuration of networks which partially solves the problems in a given domain.
(Obviously this solution will be plausible only in central and basic problem domains, e.g. vision and language acquisition); (2) the set of basic domain divisions already embodied in our public language. The idea here is that in seeking to solve a given problem we will often (always?) deploy knowledge of types and categories - knowledge which is in part determined by the types and categories dignified by labels in whatever public language we have learnt (Plunkett and Sinha's (1991) picture of language as a 'semantic scaffold' captures much of what I have in mind); and finally, (3), the whole thrust of culture and teaching which is, in a sense, to enable the communal sharing of 'building block' knowledge and hence to reduce the search space of the novice. The catalogue of speculations could be continued, but the effective moral is already clear. It is that attention to the basic mechanisms highlighted by the Elman experiments reveals a unifying thread for a superficially disparate bag of factors which have occupied cognitive developmental psychology since time immemorial (well, 1934 at least). What we need to understand, before we venture to pronounce on what connectionist networks will or won't be able to learn, is, it seems, nothing less than how cognitive development is 'scaffolded' by innate knowledge, culture and public language, and how broadly maturational processes and processes of individual learning (including selective attention) interrelate. Connectionism and developmental psychology are thus headed for a forced union, to the benefit of both parties.

That, however, is the good news. I want to close this section by looking at the downside and highlighting two limitations on Elman-style solutions. The first is a limitation which afflicts any 'phased memory' approach. It is that phasing the memory can only be effective in cases where, as a matter of fact (i.e. 'as luck would have it'!), merely statistically-driven undersampling of a training corpus is equivalent to task de-composition. Thus it so happens, in the artificial grammar domain, that an initial 4/5 word window isolates the set of training data necessary to induce the basic 'building block' rules of the domain. But as the old refrain goes, it ain't necessarily so. There are many domains in which such unintelligent undersampling of the data would not yield a useful subset of the corpus. It is here that reflection on Newport's work can help clarify matters. For suppose we ask: Is the posited "fit" between the young child's perceptual window and the componential structure of natural language just a lucky coincidence? In the case of natural language, the answer is plausibly "no". For, as Newport (op. cit. p.25) notes, it is tempting to suppose that the componential form of natural language has itself evolved so as to exploit the kind of perceptual window which the young child brings to bear on the learning task. Morphology may have evolved in the light of the young child's early perceptual limitations. Natural language may thus present a special kind of domain in which the learning problem has been posed precisely with an eye to the short-cut computational strategies made available by the early limitations. The "phased memory"/"less is more" style of solution is thus quite plausible for any cases in which the problem may itself have been selected with our early limitations "in view".
But its extension beyond such domains is questionable, as it would require a lucky coincidence between the componential structure of a target domain and the automatic windowing provided by early perceptual limits.

The second limitation is one which threatens to afflict the whole connectionist approach. It is the problem of unlearning or 'catastrophic forgetting' (French (1991)). Very briefly, the problem is that the basic power of connectionist learning lies in its ability to buy generalisation by storing distributed representations of training instances superpositionally, i.e. using overlapping resources of units and weights to store traces of semantically similar items. (For the full story, see Clark (1989).) The upshot of this is that it is always possible, when storing new knowledge, that the amended weights will in effect blank out the old knowledge. This will be a danger whenever the new item is semantically similar to an old one, as these are the cases where the net should store the new knowledge across many of the same weights and connections as the old. Vulnerability of old knowledge to new knowledge sets such networks up for a truly (Monty) Pythonesque fate, viz. exposure to one 'deadly' input could effectively wipe out all the knowledge stored in a careful and hard-won orchestration of weights. The phenomenon is akin to the idea of a deadly joke upon the hearing of which the human cognitive apparatus would be paralysed or destroyed. For a connectionist network, such a scenario is not altogether fanciful. As French comments:

Even when a network is nowhere near its theoretical storage capacity, learning a single new input can completely disrupt all of the previously learned information. (French (1991) p.4)

The potential disruption is a direct result of the superpositional storage technique. It is thus of a piece with the capacity for 'free generalisation' which makes such nets attractive. (It is not caused by any saturation of the net's resources such that there is no room to store new knowledge without deleting the old.) Current networks are protected from the unlearning by a very artificial device, viz. the complete interweaving of the training set. The full set of training cases is cycled past the net again and again, so it is forced to find an orchestration of weights which can fit all the inputs. Thus in a corpus of three facts A, B and C, training will proceed by the successive repetition of the triple and not e.g. by training to success on A, then passing to B and finally to C. Yet this, on the face of it (but see below), is exactly what the phased training/phased memory solutions involve! The spectre of unlearning was directly and artificially controlled in the Norris experiment by stopping all learning in a successful sub-net. As Norris commented:

When subsequent stages start to learn they naturally begin by performing very badly. The learning algorithm responds by adjusting the weights in the network. ... If learning had been left enabled in the early nets then their weights would also have been changed and they would have unlearned their part of the task before the final stage had learned its part. (Norris (1991) p.295)

What protects the Elman nets from this dire effect? The answer, in the case of phased training, seems to be that Elman protects the initial 'building block' knowledge by only allowing the complex cases in gradually, alongside some quite heavy duty reminders of the basics. Thus at phase 2 the net sees a corpus comprising 25% of complex sentences alongside 75% of new simple sentences. Similarly, in the phased memory case, the net at phase 2 sees a random mix of 4 and 5 word fragments, thus gradually allowing in a few more complex cases alongside reminders of the basics. In short, the current vulnerability of nets to unlearning requires us to somehow insulate the vital representational products of early learning from de-stabilization by the net's own first attempts to deal with more complex cases. Such insulation does not seem altogether psychologically realistic, and marks at least one respect in which such networks may be even more sensitive to representational trajectory than their human counterparts.
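The interference effect itself is easy to reproduce in miniature. The following toy sketch (Python, using PyTorch) is not a model of either the Elman or the Norris nets; the patterns, network size and training lengths are arbitrary assumptions, chosen only to illustrate how training on a similar new item alone tends to disrupt an old one, whereas interleaved presentation of both protects them.

```python
import torch
import torch.nn as nn

# Toy illustration of unlearning: learn item A to (near) criterion, then train
# on a similar item B alone and check recall of A; compare with interleaved
# training on both. All patterns and sizes are arbitrary assumptions.

torch.manual_seed(0)

def make_net():
    return nn.Sequential(nn.Linear(8, 16), nn.Sigmoid(), nn.Linear(16, 8), nn.Sigmoid())

def train(net, pairs, epochs):
    opt = torch.optim.SGD(net.parameters(), lr=1.0)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in pairs:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()

def error(net, x, y):
    with torch.no_grad():
        return float(((net(x) - y) ** 2).mean())

# Two 'semantically similar' items: overlapping inputs, different targets.
A = (torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.]), torch.tensor([1., 0., 1., 0., 1., 0., 1., 0.]))
B = (torch.tensor([1., 1., 1., 0., 1., 0., 0., 0.]), torch.tensor([0., 1., 0., 1., 0., 1., 0., 1.]))

seq = make_net()
train(seq, [A], epochs=2000)          # learn A first
err_A_before = error(seq, *A)
train(seq, [B], epochs=2000)          # now learn B alone ...
err_A_after = error(seq, *A)          # ... A is typically disrupted

mixed = make_net()
train(mixed, [A, B], epochs=2000)     # interleaved presentation of both items

print(f"A before B: {err_A_before:.3f}  A after B alone: {err_A_after:.3f}  "
      f"A with interleaving: {error(mixed, *A):.3f}")
```

If the run behaves as expected, the error on A rises after the B-only phase but stays low under interleaving, which is just the 'complete interweaving of the training set' remedy described above.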
The bigger picture, then, is a mixed bag of pros and cons. On the plus side, we have seen how the broad picture of the vital role of representational trajectories in connectionist learning makes unified sense of a superficially disparate set of developmental factors. On the minus side, we have seen that the phased memory solution is limited to domains in which merely statistically-driven undersampling is luckily equivalent to task decomposition, and that the endemic vulnerability of networks to unlearning makes the step-wise acquisition of domain knowledge an especially (and perhaps psychologically unrealistically) delicate operation.

4. Conclusions, and an Aside.

We began with an impossible question - "What comes naturally to a connectionist system?". Understood as a question about what kinds of theoretical space may be amenable to the connectionist treatment, this question was seen to be dangerously underspecified. The question becomes tractable only once a variety of parameters are fixed. These were seen to include obvious items such as the large scale configuration of the system (into sub-nets etc.), and also less obvious ones such as whether training is phased and whether the mature state is reached by a process of incremental 'growth'. The effects of these less obvious and superficially more peripheral factors were seen to be functionally equivalent to those involving the large scale configuration. The key to success, in all cases, was to somehow help the network to decompose a task into an ordered series of sub-tasks. In the absence of such decomposition, we saw that networks have a tendency to get 'lost in space(s)'. They try to account for all the regularities in the data at once; but some of the regularities involve others as 'building blocks'. The result is a kind of snow blindness in which the net cannot see the higher-order regularities because it lacks the building blocks, but nor can it isolate these as it is constantly led off track by its doomed efforts to capture the higher-order regularities. The negotiation of complex theoretical spaces, then, is a delicate matter. Connectionist learning needs, in such cases, to be scaffolded. We saw (section 3) that the functional role of the kinds of scaffolding investigated by Elman and Newport (section 2) could be mimicked by a wide variety of superficially distinct developmental factors. I note as a final aside that this intimacy between connectionist learning and several much wider developmental factors begins to suggest a possible flaw in the so-called 'systematicity argument' presented in Fodor and Pylyshyn (1988).
That argument, recall, begins by defining a notion of systematic cognition such that a thinker counts as a systematic cognizer just in case her potential thoughts form a fully interanimated web. More precisely, a thinker is systematic if her potential thoughts form a kind of closed set, i.e. if, being capable of, say, the thoughts "A has property F" and "B has property G" she is also capable of having the thoughts "A has property G" and "B has property F". A similar closure of relational thoughts is required, as illustrated by the overused pair "John loves Mary" and "Mary loves John". The notion of systematicity, then, is really a notion of closure of a set of potential thoughts under processes of logical combination and re-combination of their component parts. There are many pressing issues here - not least the extent to which daily concepts and ideas such as "loves" and "John" can be properly supposed to isolate component parts of thoughts (see Clark (forthcoming) for a full discussion). For our purposes, however, it will be sufficient to highlight a much more basic defect. To do so, we must look at the argument in which the notion of systematicity operates. It goes like this:

1. Human thought is systematic.
2. Such systematicity comes naturally to systems using classical structured representations and logical processes of symbol manipulation.
3. It does not come naturally to systems using connectionist representations and vector to vector transformations.
4. Hence classicism offers a better model (at the cognitive psychological level) than connectionism.

The argument is, of course, only put forward as an inference to the best explanation, hence it would be unfair to demand that it be logically valid. But even as inference to the best explanation it is surely very shaky. For as we have seen it is a mistake to suppose that the question of what kind of thing a connectionist network will learn is to be settled by reference to the generic form of the architecture and/or learning rules. Many other parameters (such as the system's development over time) may be equal determinants of the kind of knowledge it acquires. Even the observation of a pervasive feature of mature human cognition (e.g. systematicity) need not demand explanation in terms of the basic cognitive architecture. It could instead be a reliable effect of the regular combination of e.g. a basic connectionist architecture with one or more of a variety of specific developmental factors (including, surely, the effects of learning a systematic public language (see Dennett (1991) for an argument that language learning is the root of such systematicity as human thought actually displays)). The contrast that I want to highlight is thus the contrast between systematicity as something forced onto a creature by the basic form of its cognitive architecture versus systematicity as a feature of a domain or domains, i.e. as something to be learnt about by the creature as it tries to make sense of a body of training data. The suggestion (and it is no more than that) is thus that we might profitably view systematicity as a knowledge-driven achievement and hence as one in principle highly sensitive to the setting of all the various parameters to which connectionist learning has been shown (sections 1 - 3 above) to be sensitive. What I am considering is thus a kind of 'gestalt flip' in our thinking about systematicity.
Instead of viewing it as a property to be induced directly and inescapably by our choice of basic underlying architecture, why not try thinking of it as an aspect of the knowledge we (sometimes) want a network to acquire? In so doing we would be treating the space of systematically interanimated concepts as just another theoretical space, a space which may one day be learnt about by a (no doubt highly scaffolded) connectionist learning device. The mature knowledge of such a system will be expressible in terms of a (largely) systematically interwoven set of concepts. But the systematicity will be learnt as a feature of the meanings of the concepts involved. It will flow not from the shallow closure of a logical system under recombinative rules, but from hard-won knowledge of the nature of the domain. A "might" is, alas, well short of a proof. To make the case stick we would need to show in detail exactly how systematicity could arise as a function of a well-scaffolded trajectory of connectionist learning. What we can already say is just this: that given the demonstrable sensitivity of connectionist learning to the settings of an unexpectedly wide variety of developmental parameters, we should beware of arguments which, like Fodor and Pylyshyn's, marginalize the role of such factors and focus attention only on the combination of basic architecture and gross training inputs.

To sum up, I have tried to show (a) that connectionist learning is highly sensitive to differences in 'representational trajectory' - i.e. to the temporal sequence of problem solutions, (b) that a surprisingly wide variety of factors (network growth, motor development, salience and selective attention etc.) may be understood as means of sculpting such trajectories and finally (c) that the debate about what kinds of problem domain are amenable to a connectionist treatment is unreliable if pursued in a developmental vacuum, i.e. without reference to whatever mechanisms of sculpting nature or culture may provide. So much for the high ground. Failing that, and with (I'm told) a minimum of two new journals appearing every month, it should be a relief to learn that sometimes, at least, undersampling the corpus is the key to cognitive success.

Notes

1. Versions of this paper were presented to the 1991 Annual Conference of the British Psychological Society (Developmental Section) and the Perspectives on Mind Conference (Washington University in St. Louis, St. Louis, Mo., 1991). Thanks to the participants of those events for many useful and provocative comments and discussions. Thanks also to Margaret Boden, Christopher Thornton and all the members of the University of Sussex Cognitive Science Seminar, and to two anonymous referees whose comments and criticism proved invaluable in revising the original manuscript. Some of the material presented includes and expands upon Clark (forthcoming, ch.7). Thanks to MIT Press for permission to use it here.

BIBLIOGRAPHY

Bechtel, W. and Abrahamsen, A. (1991) Connectionism and the Mind, Oxford: Basil Blackwell.
Clark, A. (1989) Microcognition: Philosophy, Cognitive Science and Parallel Distributed Processing, Cambridge, MA: MIT Press/Bradford Books.
Clark, A. (1991) "In defence of explicit rules," in Philosophy and Connectionist Theory, ed. W. Ramsey, S. Stich and D. Rumelhart, Hillsdale, NJ: Erlbaum.
Clark, A. and Karmiloff-Smith, A. (in press) "The cognizer's innards: a psychological and philosophical perspective on the development of thought," Mind and Language.
Clark, A. (forthcoming) Associative Engines: Connectionism, Concepts and Representational Change, Cambridge, MA: MIT Press/Bradford Books.
Cooper, R. and Franks, B. (1991) "Interruptability: a new constraint on hybrid systems," AISB Quarterly (Newsletter of the Society for the Study of Artificial Intelligence and Simulation of Behaviour), no. 78, Autumn/Winter 1991, pp. 25-30.
Dennett, D. (1991) "Mother Nature versus the Walking Encyclopedia," in Philosophy and Connectionist Theory, ed. W. Ramsey, S. Stich and D. Rumelhart, Hillsdale, NJ: Erlbaum, pp. 21-30.
Elman, J. (1991-a) "Incremental learning, or the importance of starting small," Technical Report 9101, Center for Research in Language, University of California, San Diego.
Elman, J. (1991-b) "Distributed representations, simple recurrent networks and grammatical structure," Machine Learning, 7, pp. 195-225.
Finch, S. and Chater, N. (1991) "A hybrid approach to the automatic learning of linguistic categories," AISB Quarterly, no. 78, pp. 16-24.
Fodor, J. and Pylyshyn, Z. (1988) "Connectionism and cognitive architecture: a critical analysis," Cognition, 28, pp. 3-71.
French, R. M. (1991) "Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks," CRCC Technical Report 51, Indiana University, Bloomington, Indiana 47408.
French, R. M. (1992) "Semi-distributed representations and catastrophic forgetting in connectionist networks," Connection Science, vol. 4, nos. 3 and 4, pp. 365-377.
Gonzalez, R. and Wintz, P. (1977) Digital Image Processing, Reading, MA: Addison-Wesley.
Hayes, P. (1985) "The second naive physics manifesto," in J. Hobbs and R. Moore (eds.) Formal Theories of the Commonsense World, New Jersey: Ablex.
Newport, E. (1988) "Constraints on learning and their role in language acquisition: studies of the acquisition of American Sign Language," Language Sciences, 10, pp. 147-172.
Newport, E. (1990) "Maturational constraints on language learning," Cognitive Science, 14, pp. 11-28.
Norris, D. (1990) "How to build a connectionist idiot (savant)," Cognition, 35, pp. 277-291.
Norris, D. (1991) "The constraints on connectionism," The Psychologist, vol. 4, no. 7, pp. 293-296.
Plunkett, K. and Sinha, C. (1991) "Connectionism and developmental theory," Psykologisk Skriftserie Aarhus, vol. 16, no. 1, pp. 1-77.
Thornton, C. (1991) "Why connectionist learning algorithms need to be more creative," Conference Preprints for the First Symposium on Artificial Intelligence, Reasoning and Creativity (University of Queensland, August 1991). (Also CSRP 218, Cognitive Science Research Paper, University of Sussex.)