# Lec-22 Heuristics For Back-Propagation

The title of today's talk is Heuristics for Back-Propagation. We have gone through the back-propagation algorithm in sufficient detail; we now know how to adjust the weights of the output as well as the hidden layers, depending upon the induced local field and the local gradients. We also saw some other aspects, like the stopping criteria, and whether we prefer the batch mode or the sequential mode. Having done all this, we should know some of the heuristics that one can apply to the back-propagation algorithm in order to ensure that it functions more efficiently. There are some heuristics which are commonly used for this purpose, and these are the things that we are going to discuss in today's class.

The first heuristic relates to a debate we had already initiated: sequential versus batch mode of update. One can note that the sequential mode is computationally faster than the batch mode. This is a very interesting observation. The sequential mode, as you know, means pattern-by-pattern training, whereas in the batch mode the weights are updated only at the end of every epoch. The sequential mode is often computationally faster because the batch-mode update equations, which we did not go through (we considered the equations for the sequential mode only), can involve the computation of a Hessian matrix, which is highly time consuming. In the sequential, pattern-by-pattern mode, on the other hand, no such computation is required; the computations are quite simple. In fact, the observation that the sequential mode is computationally faster than the batch mode is true especially when the training data set is large and highly correlated.
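The two update schedules can be sketched with a single linear neuron trained by the delta rule (a minimal illustration; the function, data, and learning rate are my own choices, not the lecture's, and the Hessian-based batch machinery mentioned above is omitted):

```python
import numpy as np

def train_delta_rule(X, d, eta=0.1, epochs=200, mode="sequential"):
    """Delta-rule training of a single linear neuron.

    mode="sequential": apply the weight correction after every pattern.
    mode="batch":      accumulate corrections and apply once per epoch.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if mode == "sequential":
            for x, t in zip(X, d):
                e = t - w @ x            # error for this single pattern
                w += eta * e * x         # update immediately
        else:
            grad = np.zeros_like(w)
            for x, t in zip(X, d):
                grad += (t - w @ x) * x  # accumulate over the whole epoch
            w += eta * grad / len(X)     # one update at the end of the epoch
    return w

# Toy data: targets generated by the linear rule d = 2*x1 - x2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
d = X @ np.array([2.0, -1.0])
w_seq = train_delta_rule(X, d, mode="sequential", epochs=500)
w_bat = train_delta_rule(X, d, mode="batch", epochs=2000)
```

Both schedules recover the same weights on this toy problem; the point is only *when* the weight vector is touched in each schedule.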

The second heuristic that one has to consider is the maximization of the information content. This means that if the training patterns contain a good amount of information, the learning should be better and faster. In other words, we should go in for some diversity in the training set: if it consists of a diverse set of patterns, that leads to more information content and better learning. There are two ways by which this maximization of information content can be achieved. The first is to use examples that result in a large training error. This may appear puzzling: why should we prefer examples having a large training error? But a large training error leads to a larger value of the local gradient, and a larger local gradient makes the learning process faster. So, one way is to pick examples having a large training error, and the other is to use examples which are radically different from those previously used.

In pattern-classification problems, we often achieve this by randomizing the presentation of patterns. That is, when we are using the sequential mode of learning, we complete one epoch with the patterns presented in some order, and then, for the next epoch, we randomize the order in which the patterns are presented. This could appear unnecessary to some people, because after all the job is to present the set of patterns constituting the training set, whether they come in one particular order or another. But one thing is very important: if you keep the same ordering from pattern to pattern and from epoch to epoch, there is a risk that the network will learn the ordering also. Learning the ordering means that when you feed the second pattern, the network expects it to be classified into some class, and when you present the third, it expects that one to be classified into another category. So, in order to avoid that and to have a better learning capability, one very often randomizes the order of presentation; we say that the pattern presentations are randomized. This has meaning only in the sequential mode, because in the batch mode the weight adjustments are in any case made only at the end of the epoch, and it does not matter whether you alter the order of pattern presentation or not.
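A minimal sketch of this randomization (the function and variable names are illustrative; the update function is just a stand-in for a real weight update): draw a fresh random permutation of the training set at the start of every epoch, so the network never sees the same ordering twice:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sequential_epochs(X, d, update_fn, epochs=3):
    """Sequential (pattern-by-pattern) training with the presentation
    order re-randomized at the start of every epoch."""
    for _ in range(epochs):
        order = rng.permutation(len(X))   # fresh random order, each epoch
        for i in order:
            update_fn(X[i], d[i])         # one weight update per pattern

# Usage: record the presentation order instead of doing real updates.
X = np.arange(8.0).reshape(4, 2)
d = np.array([0, 1, 1, 0])
presented = []
sequential_epochs(X, d, lambda x, t: presented.append(tuple(x)), epochs=3)
```

Every pattern is still presented exactly once per epoch; only the order changes from epoch to epoch.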

So, this is the second heuristic. The third important heuristic that one considers is the activation function. We have so far explored different types of activation functions, and one requirement that we have already specified for back-propagation networks is that the activation function must necessarily be differentiable. Beyond that, we did not specify any other limitation. Typical activation functions that people use are the logistic functions, sigmoids in the range 0 to 1 of the form phi(v) = 1/(1 + exp(-av)), and the tan hyperbolic functions. If you plot the logistic function against v, it rises from 0 and saturates at a final value of 1, whereas the tan hyperbolic function varies from -1 to +1. Here you can see that the logistic function is not symmetric about v = 0, whereas the tan hyperbolic function is anti-symmetric about v = 0. If you compare the learning capabilities of non-symmetric and anti-symmetric sigmoidal functions, it is seen that networks with anti-symmetric activation functions, the tan hyperbolic for example, have a better, faster learning capability. This holds for any multilayer perceptron that is trained with anti-symmetric activations, where the anti-symmetric property is simply defined as phi(-v) = -phi(v). This condition is not satisfied by the logistic function, whereas it is definitely satisfied by the tan hyperbolic. In fact, for the tan hyperbolic sigmoidal functions, researchers have come up with different standard implementations, and one of them was proposed by LeCun. It specifies a function of the form phi(v) = a tanh(bv), where a and b are constants; in LeCun's implementation, a has the value 1.7159 and b is taken to be 2/3. The significance of this is that we are going to get a characteristic of the following nature. If we take a to be of this value, in that

case, the final value that this phi(v) function achieves at saturation is 1.7159, because phi(v) approaches +a in the positive limit and -a in the negative limit; so the function varies between -1.7159 and +1.7159. For a function of this type, the interesting characteristic is that if you compute phi(1), you get phi(1) = 1, even though the final value is 1.7159; and likewise, since the function is anti-symmetric, phi(-1) = -1. In this characteristic you can notice two portions of the curve: a linear part and a saturation part. The part beyond v = 1, that is, where |v| > 1, belongs to the saturation part of the characteristic, whereas as long as |v| <= 1 the characteristic is approximately linear. So we can say, roughly, that v = 1 lies at the boundary between the linear and saturation zones, and we will use this very shortly.

So, the third heuristic was related to the activation function, and we have seen that anti-symmetric functions are better than non-symmetric ones. The fourth heuristic pertains to the target values. In the training set we have the input vectors as well as the desired output values, and the output values will again be in the range specified by the activation function, isn't it? So, if we are using a tan hyperbolic characteristic of this nature, the expected output dj of a neuron j certainly cannot exceed 1.7159 at one end, or be less than -1.7159 at the other. The question, then, is how we should select the target values: should they lie close to the saturation region of the characteristic, or in the linear region? It is seen that choosing the dj's closer to the linear range of the characteristic is better from a training point of view. But this is not something you can always guarantee, because dj is, after all, the expected output drawn from the examples. It could be that for some of the patterns available as examples, dj is very close to 1.7159; in fact, there is no harm if it is exactly equal to 1.7159.
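The numerical claims about LeCun's recommended sigmoid can be checked directly (a small sketch; only the constants a = 1.7159 and b = 2/3 come from the lecture):

```python
import math

A, B = 1.7159, 2.0 / 3.0

def phi(v):
    """LeCun's recommended anti-symmetric sigmoid: phi(v) = a*tanh(b*v)."""
    return A * math.tanh(B * v)

print(phi(1.0))     # very close to 1.0, although the saturation value is 1.7159
print(phi(-1.0))    # very close to -1.0, by anti-symmetry
print(phi(100.0))   # close to +a = 1.7159, deep in saturation
```

So v = 1 is exactly the point where the characteristic passes through 1, the boundary of the linear zone described above.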

So, in order to ensure better learning, the heuristic one applies is an offset to the desired output before the training is performed. That is, we do not train with the exact dj if it is close to the limits, close to +a or to -a; rather, we take an offset value of dj, so that the dj points are shifted more towards the linear range. For example, if I get dj = 1.7159, the nice thing to do is to apply an offset of -0.7159, so that my dj equals 1, because 1, as I know, is definitely closer to the linear range. So, what we have to do is modify the target value: we must set dj = a - epsilon, for some positive quantity epsilon, when the limiting value is +a; and for a limiting value of -a, we should have dj = -a + epsilon. This is the heuristic on the target values.

The next point that we are going to deal with is the normalization of the inputs. Normally we do not put any restriction on the set of patterns with which we are going to train; in general it will be a set of training patterns in an m-dimensional space. Since we cannot visualize an m-dimensional space, let us consider the simplest case of a two-dimensional space for ease of visualization. Suppose we have an x1-x2 space, with two inputs x1 and x2, and the set of patterns could be very widely distributed in it. Suppose all the points that I have marked in this two-dimensional space are the points with which we are going to train; those are the available input patterns. Now, what characteristic do you observe? The way I have drawn the axes, both x1 and x2 could vary in the positive as well as the negative range, but barring one or two patterns, most of the patterns in this particular example lie in the first quadrant, with x1 and x2 both positive. Is that very good? Surely not, because we are training the back-propagation network for a good generalization capability, which means it should encounter patterns that vary equally between positive and negative values. So, ideally we should take patterns where x1 varies in some range from minus to plus (not this a, of course; I mean some limits of x1 in the positive and the negative), and similarly x2 also in some minus-to-plus range.
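Returning for a moment to the fourth heuristic, the target-offset rule (dj = a - epsilon at the positive limit, dj = -a + epsilon at the negative limit) can be sketched as follows (a minimal illustration; the function name and the particular epsilon = 0.7159 of the lecture's own example are used):

```python
import numpy as np

def offset_targets(d, a=1.7159, eps=0.7159):
    """Pull desired outputs away from the saturation limits +/-a:
    a target at +a becomes a - eps, a target at -a becomes -a + eps,
    i.e. the targets are shifted toward the linear range."""
    d = np.asarray(d, dtype=float)
    return np.clip(d, -a + eps, a - eps)

# A target sitting exactly at the saturation value 1.7159 is moved to 1.0,
# which lies at the edge of the linear part of the tanh characteristic.
d_raw = np.array([1.7159, -1.7159, 0.5])
d_train = offset_targets(d_raw)
```

Targets already inside the linear range (like 0.5 here) are left untouched; only those at the saturation limits are offset.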

And if we can get patterns like that, that is best. But again, as I told you, obtaining the patterns may not be in our hands, because we are drawing the examples from real-life data that is already available, and it could be that only these examples are available. That does not mean, however, that when we test the network with unknown patterns, those patterns cannot lie elsewhere; who says that only very similar patterns will appear? We can have unknown patterns in the other quadrants, and in that case there is no good reason to believe that the network will have a good generalization capability, because it has seen patterns in one region during training but encounters patterns from another region during testing. That is why some manipulation of the input data is needed before we actually train the network with the patterns. So, what kind of normalization do we do?

In this example I have shown you that there is a positive bias to both x1 and x2. So, what if we obtain the mean of the x1 and x2 values and subtract the mean from each one of these observations? In that case we should get a distribution that is equally balanced between the positive and the negative, and the picture, which was restricted to the first quadrant alone, may now look balanced about the origin. This is the process of mean removal, and in this process we are actually altering the input space: we cannot call it the x1-x2 space any more, we have to call it a new space, x1-prime x2-prime. That is the first step. In order to accelerate the back-propagation algorithm, some more steps need to be performed after that, and immediately after the mean removal the step that is normally performed is a decorrelation of the inputs.

For those who have difficulty with the term decorrelation, let me give a simple example. Say you have a set of time-domain samples: you have some voltage data or any other analog data, you have taken time samples of it, and each of those samples is represented by some digitized value. You will find that the time samples are highly correlated with each other; by correlation we mean that the sample taken at instant n will not be remarkably different from the sample taken at n+1 or n-1. Suppose you have a sinusoidal wave and you take many time samples; you will find that the samples are highly correlated. But if you convert that temporal data into frequency-domain data, the picture changes: a sinusoidal wave has a single frequency component, and in the discrete case you observe the sinusoid only through a small window rather than as a continuous wave.
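A concrete sketch of the two normalization steps named so far, mean removal and decorrelation, on a 2-D pattern set (an assumed implementation that decorrelates via an eigen-decomposition of the sample covariance matrix; the final line also scales the new axes to equal, unit variance, the equalization step of the full normalization):

```python
import numpy as np

def normalize_inputs(X):
    """Normalize a (n_samples, n_features) pattern matrix in three steps:
    1. mean removal            -> zero-mean inputs (x'),
    2. decorrelation           -> rotate onto the eigenvectors of the
                                  covariance matrix, so the new
                                  coordinates are uncorrelated (x''),
    3. covariance equalization -> scale each new axis to unit variance (x''')."""
    Xp = X - X.mean(axis=0)                  # step 1: mean removal
    C = np.cov(Xp, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # C is symmetric
    Xpp = Xp @ eigvecs                       # step 2: decorrelated coordinates
    Xppp = Xpp / np.sqrt(eigvals)            # step 3: equal (unit) covariances
    return Xppp

# Correlated, positively biased 2-D patterns, as in the first-quadrant picture.
rng = np.random.default_rng(1)
z = rng.normal(size=(500, 2))
X = z @ np.array([[2.0, 1.5], [0.0, 0.5]]) + np.array([3.0, 4.0])
Xn = normalize_inputs(X)
```

After the transform, the patterns have zero mean and an (approximately) identity covariance matrix, which is exactly the normalized space the heuristic asks for.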

Because of the truncation effect of the sine wave, you will then get some additional components as well, but other than that the frequency components will be highly uncorrelated. The reason we often transform from the time domain to the frequency domain is precisely that we want to decorrelate the data, and decorrelation is very often useful for things like data compression, because if we decorrelate the individual components, it becomes possible to represent the data with a smaller number of samples than in the original domain. In fact, there are different methodologies for performing decorrelation, and one effective way of doing it is principal component analysis. We are going to study principal component analysis shortly, not immediately, but after a few lectures; it is very often used to decorrelate data.

So, the steps we take are these: the first step is mean removal, and the next step is decorrelation. Whatever we had as x1-prime, x2-prime will now be decorrelated, and the decorrelated data set could be more spread out. Because of the mean removal we were already calling the axes x1-prime and x2-prime; with the decorrelation we are going to call them x1-double-prime and x2-double-prime. In fact, the scalings on x1-double-prime and x2-double-prime are quite different, as you can see, so after the decorrelation you have to do one more step, and that is called covariance equalization. What is meant by covariance equalization is that the covariances in the x1 and x2 directions are in this case different, so you scale appropriately to make the covariances in the two directions equal. After the covariance equalization the patterns become equally spread, and we are going to call this new space x1-triple-prime, x2-triple-prime.

In effect, what we are doing is the normalization of the inputs: we have transformed the input space into a new space, x1-triple-prime x2-triple-prime, where, number one, the data have zero mean; number two, the data are highly decorrelated; and number three, their covariances are equalized. If we apply the back-propagation algorithm on this data, the learning is seen to be better. So this is the process of normalization of the input, which we had listed as the fifth heuristic.

The next heuristic, which is perhaps the most crucial and important of these, is initialization. By initialization we mean the question that very often arises when we begin the training of a network, let us say a multilayer perceptron network, which we are going to train with the set of examples that