Archive-name: ai-faq/neural-nets/part3
Last-modified: 2001-05-21
URL: ftp://ftp.sas.com/pub/neural/FAQ3.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)
This is part 3 (of 7) of a monthly posting to the Usenet newsgroup comp.ai.neural-nets. See part 1 of this posting for full information about what it covers.
------------------------------------------------------------------------
Generalization requires prior knowledge, as pointed out by Hume (1739/1978), Russell (1948), and Goodman (1954/1983) and rigorously proved by Wolpert (1995a, 1996a, 1996b). For any practical application, you have to know what the relevant inputs are (you can't simply include every imaginable input). You have to know a restricted class of input-output functions that contains an adequate approximation to the function you want to learn (you can't use a learning method that is capable of fitting every imaginable function). And you have to know that the cases you want to generalize to bear some resemblance to the training cases. Thus, there are three conditions that are typically necessary--although not sufficient--for good generalization:

1. The inputs to the network contain sufficient information pertaining to the target, so that there exists a mathematical function relating correct outputs to inputs with the desired degree of accuracy.

2. The function you are trying to learn (relating inputs to correct outputs) is, in some sense, smooth: a small change in the inputs should, most of the time, produce a small change in the outputs.

3. The training cases are a sufficiently large and representative subset ("sample" in statistical terminology) of the set of all cases that you want to generalize to.
For classification, if you do not need to estimate posterior probabilities, then smoothness is not theoretically necessary. In particular, feedforward networks with one hidden layer trained by minimizing the error rate (a very tedious training method) are universally consistent classifiers if the number of hidden units grows at a suitable rate relative to the number of training cases (Devroye, Györfi, and Lugosi, 1996). However, you are likely to get better generalization with realistic sample sizes if the classification boundaries are smoother.
For Boolean functions, the concept of smoothness is more elusive. It seems intuitively clear that a Boolean network with a small number of hidden units and small weights will compute a "smoother" input-output function than a network with many hidden units and large weights. If you know a good reference characterizing Boolean functions for which good generalization is possible, please inform the FAQ maintainer (saswss@unx.sas.com).
If you have more information about the function, e.g. that the outputs should be linearly related to the inputs, you can often take advantage of this information by placing constraints on the network or by fitting a more specific model, such as a linear model, to improve generalization. Extrapolation is much more reliable in linear models than in flexible nonlinear models, although still not nearly as safe as interpolation. You can also use such information to choose the training cases more efficiently. For example, with a linear model, you should choose training cases at the outer limits of the input space instead of evenly distributing them throughout the input space.
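For instance, in a straight-line fit, the variance of the estimated slope is inversely proportional to the spread sum((x-xbar)^2) of the training inputs, so inputs at the extremes give a more accurate slope estimate. A minimal SAS/IML sketch (the sample sizes are arbitrary illustrations):

   proc iml;
      /* var(slope) is proportional to 1/sum((x - mean(x))**2),
         so a larger sum means a more accurate slope estimate   */
      x_even = t(do(0, 1, 1/9));           /* 10 inputs spread evenly over [0,1] */
      x_ends = j(5,1,0) // j(5,1,1);       /* 10 inputs at the two extremes      */
      ssq_even = ssq(x_even - x_even[:]);  /* about 1.02 */
      ssq_ends = ssq(x_ends - x_ends[:]);  /* 2.50: less than half the variance  */
      print ssq_even ssq_ends;
   quit;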
References:
Caudill, M. and Butler, C. (1990), Naturally Intelligent Systems, Cambridge, MA: The MIT Press.
Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer.
Goodman, N. (1954/1983), Fact, Fiction, and Forecast, 1st/4th ed., Cambridge, MA: Harvard University Press.
Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R. (1986), Induction: Processes of Inference, Learning, and Discovery, Cambridge, MA: The MIT Press.
Howson, C. and Urbach, P. (1989), Scientific Reasoning: The Bayesian Approach, La Salle, IL: Open Court.
Hume, D. (1739/1978), A Treatise of Human Nature, Selby-Bigge, L.A., and Nidditch, P.H. (eds.), Oxford: Oxford University Press.
Plotkin, H. (1993), Darwin Machines and the Nature of Knowledge, Cambridge, MA: Harvard University Press.
Russell, B. (1948), Human Knowledge: Its Scope and Limits, London: Routledge.
Stone, C.J. (1977), "Consistent nonparametric regression," Annals of Statistics, 5, 595-645.
Stone, C.J. (1982), "Optimal global rates of convergence for nonparametric regression," Annals of Statistics, 10, 1040-1053.
Vapnik, V.N. (1995), The Nature of Statistical Learning Theory, NY: Springer.
Wolpert, D.H. (1995a), "The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework," in Wolpert (1995b), 117-214.
Wolpert, D.H. (ed.) (1995b), The Mathematics of Generalization: The Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, Santa Fe Institute Studies in the Sciences of Complexity, Volume XX, Reading, MA: Addison-Wesley.
Wolpert, D.H. (1996a), "The lack of a priori distinctions between learning algorithms," Neural Computation, 8, 1341-1390.
Wolpert, D.H. (1996b), "The existence of a priori distinctions between learning algorithms," Neural Computation, 8, 1391-1420.
------------------------------------------------------------------------
Noise in the actual data is never a good thing, since it limits the accuracy of generalization that can be achieved no matter how extensive the training set is. On the other hand, injecting artificial noise (jitter) into the inputs during training is one of several ways to improve generalization for smooth functions when you have a small training set.
Certain assumptions about noise are necessary for theoretical results. Usually, the noise distribution is assumed to have zero mean and finite variance. The noise in different cases is usually assumed to be independent or to follow some known stochastic model, such as an autoregressive process. The more you know about the noise distribution, the more effectively you can train the network (e.g., McCullagh and Nelder 1989).
If you have noise in the target values, what the network learns depends mainly on the error function. For example, if the noise is independent with finite variance for all training cases, a network that is well-trained using least squares will produce outputs that approximate the conditional mean of the target values (White, 1990; Bishop, 1995). Note that for a binary 0/1 variable, the mean is equal to the probability of getting a 1. Hence, the results in White (1990) immediately imply that for a categorical target with independent noise using 1-of-C coding (see "How should categories be encoded?"), a network that is well-trained using least squares will produce outputs that approximate the posterior probabilities of each class (see Rojas, 1996, if you want a simple explanation of why least-squares estimates probabilities). Posterior probabilities can also be learned using cross-entropy and various other error functions (Finke and Müller, 1994; Bishop, 1995). The conditional median can be learned by least-absolute-value training (White, 1992a). Conditional modes can be approximated by yet other error functions (e.g., Rohwer and van der Rest 1996). For noise distributions that cannot be adequately approximated by a single location estimate (such as the mean, median, or mode), a network can be trained to approximate quantiles (White, 1992a) or mixture components (Bishop, 1995; Husmeier, 1999).
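As a sanity check on the binary case, here is a minimal SAS/IML simulation (entirely illustrative: the logistic probability function and the window width are arbitrary choices) showing that a local average of a 0/1 target, which is what a least-squares smoother estimates, recovers the true posterior probability:

   proc iml;
      call randseed(1);
      n = 100000;
      x = j(n,1);  call randgen(x, "Uniform");   /* inputs in (0,1)   */
      ptrue = 1 / (1 + exp(-10*(x - 0.5)));      /* true P(y=1 | x)   */
      u = j(n,1);  call randgen(u, "Uniform");
      y = (u < ptrue);                           /* binary 0/1 target */
      /* the mean of y near x=0.7 approximates P(y=1 | x=0.7) */
      idx = loc( abs(x - 0.7) < 0.01 );
      est = y[idx][:];
      truth = 1 / (1 + exp(-10*0.2));            /* about 0.88        */
      print est truth;
   quit;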
If you have noise in the target values, the mean squared generalization error can never be less than the variance of the noise, no matter how much training data you have. But you can estimate the mean of the target values, conditional on a given set of input values, to any desired degree of accuracy by obtaining a sufficiently large and representative training set, assuming that the function you are trying to learn is one that can indeed be learned by the type of net you are using, and assuming that the complexity of the network is regulated appropriately (White 1990).
Noise in the target values increases the danger of overfitting (Moody 1992).
Noise in the inputs limits the accuracy of generalization, but in a more complicated way than does noise in the targets. In a region of the input space where the function being learned is fairly flat, input noise will have little effect. In regions where that function is steep, input noise can degrade generalization severely.
Furthermore, if the target function is Y=f(X), but you observe noisy inputs X+D, you cannot obtain an arbitrarily accurate estimate of f(X) given X+D no matter how large a training set you use. The net will not learn f(X), but will instead learn a convolution of f(X) with the distribution of the noise D (see "What is jitter?").
For more details, see one of the statistically-oriented references on neural nets such as:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press, especially section 6.4.
Finke, M., and Müller, K.-R. (1994), "Estimating a-posteriori probabilities using stochastic network models," in Mozer, Smolensky, Touretzky, Elman, & Weigend, eds., Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 324-331.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Husmeier, D. (1999), Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions, Berlin: Springer Verlag, ISBN 185233095.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.
Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", in Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural Information Processing Systems 4, 847-854.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Rohwer, R., and van der Rest, J.C. (1996), "Minimum description length, regularization, and multimodal data," Neural Computation, 8, 595-609.
Rojas, R. (1996), "A short proof of the posterior probability property of classifier neural networks," Neural Computation, 8, 41-43.
White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550. Reprinted in White (1992).
White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics, Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted in White (1992b).
White, H. (1992b), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.
------------------------------------------------------------------------
For an elementary discussion of overfitting, see Smith (1996). For a more rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992) on the bias/variance trade-off (it's not really a dilemma). We are talking about statistical bias here: the difference between the average value of an estimator and the correct value. Underfitting produces excessive bias in the outputs, whereas overfitting produces excessive variance. There are graphical examples of overfitting and underfitting in Sarle (1995, 1999).
The best way to avoid overfitting is to use lots of training data. If you have at least 30 times as many training cases as there are weights in the network, you are unlikely to suffer from much overfitting, although you may get some slight overfitting no matter how large the training set is. For noise-free data, 5 times as many training cases as weights may be sufficient. But you can't arbitrarily reduce the number of weights, because doing so risks underfitting.
Given a fixed amount of training data, there are at least six approaches to avoiding underfitting and overfitting, and hence getting good generalization:

1. Model selection
2. Jittering
3. Early stopping
4. Weight decay
5. Bayesian learning
6. Combining networks
The first five approaches are based on well-understood theory. Methods for combining networks do not have such a sound theoretical basis but are the subject of current research. These six approaches are discussed in more detail under subsequent questions.

The complexity of a network is related to both the number of weights and the size of the weights. Model selection is concerned with the number of weights, and hence the number of hidden units and layers. The more weights there are, relative to the number of training cases, the more overfitting amplifies noise in the targets (Moody 1992). The other approaches listed above are concerned, directly or indirectly, with the size of the weights. Reducing the size of the weights reduces the "effective" number of weights--see Moody (1992) regarding weight decay and Weigend (1994) regarding early stopping. Bartlett (1997) obtained learning-theory results in which generalization error is related to the L_1 norm of the weights instead of the VC dimension.
Overfitting is not confined to NNs with hidden units. Overfitting can occur in generalized linear models (networks with no hidden units) if either or both of the following conditions hold:

1. The number of input variables (and hence the number of weights) is large with respect to the number of training cases.

2. The input variables are highly correlated with one another, a condition that inflates the variance of the weight estimates and is called "multicollinearity" in the statistical literature.
References:
Bartlett, P.L. (1997), "For valid generalization, the size of the weights is more important than the size of the network," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", in Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural Information Processing Systems 4, 847-854.
Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)
Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results," ftp://ftp.sas.com/pub/neural/dojo/dojo.html
Smith, M. (1996), Neural Networks for Statistical Modeling, Boston: International Thomson Computer Press, ISBN 1-850-32842-0.
van Houwelingen, H.C., and le Cessie, S. (????), "Shrinkage and penalized likelihood as methods to improve predictive accuracy," http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf and http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf
Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.
------------------------------------------------------------------------
Training with jitter works because the functions that we want NNs to learn are mostly smooth. NNs can learn functions with discontinuities, but the functions must be piecewise continuous in a finite number of regions if our network is restricted to a finite number of hidden units.
In other words, if we have two cases with similar inputs, the desired outputs will usually be similar. That means we can take any training case and generate new training cases by adding small amounts of jitter to the inputs. As long as the amount of jitter is sufficiently small, we can assume that the desired output will not change enough to be of any consequence, so we can just use the same target value. The more training cases, the merrier, so this looks like a convenient way to improve training. But too much jitter will obviously produce garbage, while too little jitter will have little effect (Koistinen and Holmström 1992).
Consider any point in the input space, not necessarily one of the original training cases. That point could possibly arise as a jittered input as a result of jittering any of several of the original neighboring training cases. The average target value at the given input point will be a weighted average of the target values of the original training cases. For an infinite number of jittered cases, the weights will be proportional to the probability densities of the jitter distribution, located at the original training cases and evaluated at the given input point. Thus the average target values given an infinite number of jittered cases will, by definition, be the Nadaraya-Watson kernel regression estimator using the jitter density as the kernel. Hence, training with jitter is an approximation to training with the kernel regression estimator as target. Choosing the amount (variance) of jitter is equivalent to choosing the bandwidth of the kernel regression estimator (Scott 1992).
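For concreteness, here is a minimal SAS/IML sketch of the Nadaraya-Watson estimator with a Gaussian kernel, corresponding to Gaussian jitter with standard deviation equal to the bandwidth h (the target function and bandwidth are arbitrary illustrations):

   proc iml;
      /* Nadaraya-Watson kernel regression: a weighted average of the
         targets, with weights given by the jitter (kernel) density
         centered at each training input and evaluated at the query x0 */
      start nadaraya_watson(x0, x, y, h);
         w = exp( -0.5 * ((x0 - x)/h)##2 );
         return( (w`*y) / sum(w) );
      finish;

      x = t(do(0, 10, 0.5));                 /* training inputs on a grid */
      y = sin(x);                            /* noise-free targets        */
      est = nadaraya_watson(5.25, x, y, 0.5);
      print est;                             /* close to sin(5.25)        */
   quit;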
When studying nonlinear models such as feedforward NNs, it is often helpful first to consider what happens in linear models, and then to see what difference the nonlinearity makes. So let's consider training with jitter in a linear model. Notation:
x_ij is the value of the jth input (j=1, ..., p) for the ith training case (i=1, ..., n). X={x_ij} is an n by p matrix. y_i is the target value for the ith training case. Y={y_i} is a column vector.

Without jitter, the least-squares weights are B = inv(X'X)X'Y, where "inv" indicates a matrix inverse and "'" indicates transposition. Note that if we replicate each training case c times, or equivalently stack c copies of the X and Y matrices on top of each other, the least-squares weights are inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before.
With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ..., c, where z_ijk is produced by some random number generator, usually with a normal distribution with mean 0 and standard deviation s, and the z_ijk's are all independent. In place of the n by p matrix X, this gives us a big matrix, say Q, with cn rows and p columns. To compute the least-squares weights, we need Q'Q. Let's consider the jth diagonal element of Q'Q, which is
   sum_{i,k} (x_ij+z_ijk)^2  =  sum_{i,k} (x_ij^2 + z_ijk^2 + 2 x_ij z_ijk)

which is approximately, for c large,

   c(sum_i x_ij^2 + ns^2)

which is c times the corresponding diagonal element of X'X plus ns^2. Now consider the u,vth off-diagonal element of Q'Q, which is

   sum_{i,k} (x_iu+z_iuk)(x_iv+z_ivk)

which is approximately, for c large,

   c sum_i x_iu x_iv

which is just c times the corresponding element of X'X. Thus, Q'Q equals c(X'X+ns^2 I), where I is an identity matrix of appropriate size. Similar computations show that the crossproduct of Q with the target values is cX'Y. Hence the least-squares weights with jitter of variance s^2 are given by

   B(ns^2) = inv(c(X'X+ns^2 I))cX'Y = inv(X'X+ns^2 I)X'Y

In the statistics literature, B(ns^2) is called a ridge regression estimator with ridge value ns^2.
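The equivalence is easy to check numerically. In the following minimal SAS/IML sketch (sizes, seeds, and coefficients are arbitrary), least squares on many jittered copies of the data comes out close to the ridge estimator with ridge value ns^2:

   proc iml;
      call randseed(313);
      n = 20;  p = 3;  c = 200;  s = 0.5;   /* cases, inputs, copies, jitter s.d. */
      X = j(n,p);  call randgen(X, "Normal");
      e = j(n,1);  call randgen(e, "Normal", 0, 0.1);
      Y = X * {1, -2, 0.5} + e;

      /* ridge estimator with ridge value n*s**2 */
      B_ridge = inv(X`*X + n*s**2 * I(p)) * X` * Y;

      /* least squares on c jittered copies of the training data */
      Z = j(c*n, p);  call randgen(Z, "Normal", 0, s);
      Qj = repeat(X, c, 1) + Z;
      B_jitter = inv(Qj`*Qj) * Qj` * repeat(Y, c, 1);

      print B_ridge B_jitter;               /* close for large c */
   quit;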
If we were to add jitter to the target values Y, the cross-product X'Y would not be affected for large c for the same reason that the off-diagonal elements of X'X are not affected by jitter. Hence, adding jitter to the targets will not change the optimal weights; it will just slow down training (An 1996).
The ordinary least squares training criterion is (Y-XB)'(Y-XB). Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B, where d is the decay rate. Weight decay can also be implemented by inventing artificial training cases. Augment the training data with p new training cases containing the matrix dI for the inputs and a zero vector for the targets. To put this in a formula, let's use A;B to indicate the matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D. Thus the augmented inputs are X;dI and the augmented targets are Y;0, where 0 indicates the zero vector of the appropriate size. The squared error for the augmented training data is:
   (Y;0-(X;dI)B)'(Y;0-(X;dI)B)
      = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
      = Y'Y - 2Y'XB + B'(X'X+d^2 I)B
      = Y'Y - 2Y'XB + B'X'XB + B'(d^2 I)B
      = (Y-XB)'(Y-XB) + d^2 B'B

which is the weight-decay training criterion. Thus the weight-decay estimator is:
   inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2 I)X'Y

which is the same as the jitter estimator B(d^2), i.e. jitter with variance d^2/n. The equivalence between the weight-decay estimator and the jitter estimator does not hold for nonlinear models unless the jitter variance is small relative to the curvature of the nonlinear function (An 1996). However, the equivalence of the two estimators for linear models suggests that they will often produce similar results even for nonlinear models. Details for nonlinear models, including classification problems, are given in An (1996).
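The same check works for the augmented-data trick. In this minimal SAS/IML sketch (again with arbitrary sizes and coefficients), ordinary least squares on X;dI and Y;0 reproduces the closed-form weight-decay estimator up to rounding:

   proc iml;
      call randseed(42);
      n = 30;  p = 4;  d = 2;               /* cases, inputs, decay constant */
      X = j(n,p);  call randgen(X, "Normal");
      e = j(n,1);  call randgen(e, "Normal", 0, 0.2);
      Y = X * {2, -1, 0, 1} + e;

      /* closed-form weight-decay (ridge) estimator */
      B_decay = inv(X`*X + d**2 * I(p)) * X` * Y;

      /* the same estimator from plain least squares on augmented data:
         stack d*I under X and a p-vector of zeros under Y              */
      Xa = X // (d * I(p));
      Ya = Y // j(p,1,0);
      B_aug = inv(Xa`*Xa) * Xa` * Ya;

      print B_decay B_aug;                  /* identical up to rounding */
   quit;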
B(0) is obviously the ordinary least-squares estimator. It can be shown that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in other words, adding jitter causes the weights to shrink. It can also be shown that under the usual statistical assumptions, there always exists some value of ns^2 > 0 such that B(ns^2) provides better expected generalization than B(0). Unfortunately, there is no way to calculate a value of ns^2 from the training data that is guaranteed to improve generalization. There are other types of shrinkage estimators called Stein estimators that do guarantee better generalization than B(0), but I'm not aware of a nonlinear generalization of Stein estimators applicable to neural networks.
The statistics literature describes numerous methods for choosing the ridge value. The most obvious way is to estimate the generalization error by cross-validation, generalized cross-validation, or bootstrapping, and to choose the ridge value that yields the smallest such estimate. There are also quicker methods based on empirical Bayes estimation, one of which yields the following formula, useful as a first guess:
   s_1^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(0)'B(0)]

You can iterate this a few times:

   s_{l+1}^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(s_l^2)'B(s_l^2)]

Note that the more training cases you have, the less noise you need.
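In code, the first guess and the fixed-point iteration look like this (a minimal SAS/IML sketch; the simulated data are purely illustrative):

   proc iml;
      start ridge_B(X, Y, ridge);
         return( inv(X`*X + ridge * I(ncol(X))) * X` * Y );
      finish;

      call randseed(7);
      n = 50;  p = 3;
      X = j(n,p);  call randgen(X, "Normal");
      e = j(n,1);  call randgen(e, "Normal", 0, 0.5);
      Y = X * {1, 0.5, -1} + e;

      B0 = ridge_B(X, Y, 0);               /* B(0): ordinary least squares   */
      num = p * ssq(Y - X*B0);             /* numerator p(Y-XB(0))'(Y-XB(0)) */
      B = B0;
      do l = 1 to 5;                       /* iterate s2, then refit B(n*s2) */
         s2 = num / ( n * (n-p) * ssq(B) );
         B = ridge_B(X, Y, n*s2);
      end;
      print s2 B;
   quit;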
References:
An, G. (1996), "The effects of adding noise during backpropagation training on a generalization performance," Neural Computation, 8, 643-674.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
Holmström, L. and Koistinen, P. (1992), "Using additive noise in back-propagation training," IEEE Transactions on Neural Networks, 3, 24-38.
Koistinen, P. and Holmström, L. (1992) "Kernel regression and backpropagation training with noise," NIPS4, 1033-1039.
Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.
Scott, D.W. (1992) Multivariate Density Estimation, Wiley.
Vinod, H.D. and Ullah, A. (1981) Recent Advances in Regression Methods, NY: Marcel-Dekker.
------------------------------------------------------------------------
Early stopping has several advantages:

 o It is fast.
 o It can be applied successfully to networks in which the number of weights far exceeds the sample size.
 o It requires only one major decision by the user: what proportion of validation cases to use.
Early stopping is closely related to ridge regression. If the learning rate is sufficiently small, the sequence of weight vectors on each iteration will approximate the path of continuous steepest descent down the error surface. Early stopping chooses a point along this path that optimizes an estimate of the generalization error computed from the validation set. Ridge regression also defines a path of weight vectors by varying the ridge value. The ridge value is often chosen by optimizing an estimate of the generalization error computed by cross-validation, generalized cross-validation, or bootstrapping (see "What are cross-validation and bootstrapping?"). There always exists a positive ridge value that will improve the expected generalization error in a linear model. A similar result has been obtained for early stopping in linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge path lies close to, but does not coincide with, the path of continuous steepest descent; in nonlinear models, the two paths can diverge widely. The relationship is explored in more detail by Sjöberg and Ljung (1992).
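The shrinkage effect is easy to see in the linear case. In this minimal SAS/IML sketch (learning rate, stopping iteration, and ridge value are arbitrary), gradient descent started from zero weights and stopped early yields shrunken weights, qualitatively like a positive ridge value, while running to convergence approaches the least-squares solution:

   proc iml;
      call randseed(5);
      n = 40;  p = 2;  eta = 0.02;
      X = j(n,p);  call randgen(X, "Normal");
      e = j(n,1);  call randgen(e, "Normal", 0, 0.5);
      Y = X * {3, -1} + e;

      B = j(p,1,0);                          /* start from zero weights   */
      do iter = 1 to 500;
         B = B + eta * X` * (Y - X*B) / n;   /* gradient step on the MSE  */
         if iter = 20 then B_early = B;      /* an early-stopped solution */
      end;

      B_ridge = inv(X`*X + 10*I(p)) * X` * Y;
      print B_early B_ridge B;               /* B approaches least squares */
   quit;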
References:
Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. (1995), "Asymptotic Statistical Theory of Overtraining and Cross-Validation," METR 95-06, Department of Mathematical Engineering and Information Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.
Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model selection by nonconvergent methods," Neural Networks, 6, 771-783.
Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural Nets, Reading, MA: Addison-Wesley.
Orr, G.B., and Mueller, K.-R., eds. (1998), Neural Networks: Tricks of the Trade, Berlin: Springer, ISBN 3-540-65311-2.
Prechelt, L. (1998), "Early stopping--But when?" in Orr and Mueller (1998), 55-69.
Prechelt, L. (1994), "PROBEN1--A set of neural network benchmark problems and benchmarking rules," Technical Report 21/94, Universitat Karlsruhe, 76128 Karlsruhe, Germany, ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.gz.
Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)
Sjöberg, J. and Ljung, L. (1992), "Overtraining, Regularization, and Searching for Minimum in Neural Networks," Technical Report LiTH-ISY-I-1297, Department of Electrical Engineering, Linkoping University, S-581 83 Linkoping, Sweden, http://www.control.isy.liu.se .
Wang, C. (1994), A Theory of Generalisation in Learning Machines with Neural Network Application, Ph.D. thesis, University of Pennsylvania.
Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and Effective Machine Complexity in Learning," NIPS6, 303-310.
Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.
------------------------------------------------------------------------
Weight decay adds a penalty term to the error function; the usual penalty is the sum of squared weights times a decay constant. Weight decay is thus a special case of regularization: the penalty term in weight decay, by definition, penalizes large weights. Other regularization methods may involve not only the weights but various derivatives of the output function (Bishop 1995).
The weight decay penalty term causes the weights to converge to smaller absolute values than they otherwise would. Large weights can hurt generalization in two different ways. Excessively large weights leading to hidden units can cause the output function to be too rough, possibly with near discontinuities. Excessively large weights leading to output units can cause wild outputs far beyond the range of the data if the output activation function is not bounded to the same range as the data. To put it another way, large weights can cause excessive variance of the output (Geman, Bienenstock, and Doursat 1992). According to Bartlett (1997), the size (L_1 norm) of the weights is more important than the number of weights in determining generalization.
Other penalty terms besides the sum of squared weights are sometimes used. Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses:
   sum_i (w_i)^2 / ((w_i)^2 + c^2)

where w_i is the ith weight and c is a user-specified constant. Whereas decay using the sum of squared weights tends to shrink the large coefficients more than the small ones, weight elimination tends to shrink the small coefficients more, and is therefore more useful for suggesting subset models (pruning).
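A quick SAS/IML sketch (the weights and the constant c are arbitrary) makes the difference in shrinkage pressure visible: the derivative of the weight-elimination penalty vanishes for large weights, so they are left nearly alone, while the sum-of-squares penalty keeps pulling on them:

   proc iml;
      /* compare sum-of-squares decay with weight elimination, c = 1 */
      w = {0.1, 1, 10};  c = 1;
      pen_decay = w##2;                              /* grows without bound      */
      pen_elim  = w##2 / (w##2 + c**2);              /* saturates at 1           */
      grad_decay = 2*w;                              /* keeps pulling on large w */
      grad_elim  = 2 * c**2 * w / ((w##2 + c**2)##2);/* vanishes for large w     */
      print w pen_decay pen_elim grad_decay grad_elim;
   quit;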
The generalization ability of the network can depend crucially on the decay constant, especially with small training sets. One approach to choosing the decay constant is to train several networks with different amounts of decay and estimate the generalization error for each; then choose the decay constant that minimizes the estimated generalization error. Weigend, Rumelhart, and Huberman (1991) iteratively update the decay constant during training.
There are other important considerations for getting good results from weight decay. You must either standardize the inputs and targets, or adjust the penalty term for the standard deviations of all the inputs and targets. It is usually a good idea to omit the biases from the penalty term.
A fundamental problem with weight decay is that different types of weights in the network will usually require different decay constants for good generalization. At the very least, you need three different decay constants for input-to-hidden, hidden-to-hidden, and hidden-to-output weights. Adjusting all these decay constants to produce the best estimated generalization error often requires vast amounts of computation.
Fortunately, there is a superior alternative to weight decay: hierarchical Bayesian learning. Bayesian learning makes it possible to estimate numerous decay constants efficiently.
References:
Bartlett, P.L. (1997), "For valid generalization, the size of the weights is more important than the size of the network," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann.
------------------------------------------------------------------------
By Radford Neal.
Conventional training methods for multilayer perceptrons ("backprop" nets) can be interpreted in statistical terms as variations on maximum likelihood estimation. The idea is to find a single set of weights for the network that maximize the fit to the training data, perhaps modified by some sort of weight penalty to prevent overfitting.
The Bayesian school of statistics is based on a different view of what it means to learn from data, in which probability is used to represent uncertainty about the relationship being learned (a use that is shunned in conventional--i.e., frequentist--statistics). Before we have seen any data, our prior opinions about what the true relationship might be can be expressed in a probability distribution over the network weights that define this relationship. After we look at the data (or after our program looks at the data), our revised opinions are captured by a posterior distribution over network weights. Network weights that seemed plausible before, but which don't match the data very well, will now be seen as being much less likely, while the probability for values of the weights that do fit the data well will have increased.
Typically, the purpose of training is to make predictions for future cases in which only the inputs to the network are known. The result of conventional network training is a single set of weights that can be used to make such predictions. In contrast, the result of Bayesian training is a posterior distribution over network weights. If the inputs of the network are set to the values for some new case, the posterior distribution over network weights will give rise to a distribution over the outputs of the network, which is known as the predictive distribution for this new case. If a single-valued prediction is needed, one might use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain this prediction is.
Why bother with all this? The hope is that Bayesian methods will provide solutions to such fundamental problems as:

 o How to judge the uncertainty of predictions.
 o How to choose an appropriate network architecture (e.g., the number of hidden layers, the number of hidden units in each layer).
 o How to adapt to the characteristics of the data (e.g., the smoothness of the function, the degree to which different inputs are relevant).
Good solutions to these problems, especially the last two, depend on using the right prior distribution, one that properly represents the uncertainty that you probably have about which inputs are relevant, how smooth the function is, how much noise there is in the observations, etc. Such carefully vague prior distributions are usually defined in a hierarchical fashion, using hyperparameters, some of which are analogous to the weight decay constants of more conventional training procedures. The use of hyperparameters is discussed by MacKay (1992a, 1992b, 1995) and Neal (1993a, 1996), who in particular use an "Automatic Relevance Determination" scheme that aims to allow many possibly-relevant inputs to be included without damaging effects.
Selection of an appropriate network architecture is another place where prior knowledge plays a role. One approach is to use a very general architecture, with lots of hidden units, maybe in several layers or groups, controlled using hyperparameters. This approach is emphasized by Neal (1996), who argues that there is no statistical need to limit the complexity of the network architecture when using well-designed Bayesian methods. It is also possible to choose between architectures in a Bayesian fashion, using the "evidence" for an architecture, as discussed by MacKay (1992a, 1992b).
Implementing all this is one of the biggest problems with Bayesian methods. Dealing with a distribution over weights (and perhaps hyperparameters) is not as simple as finding a single "best" value for the weights. Exact analytical methods for models as complex as neural networks are out of the question. Two approaches have been tried:

1. Find the most probable weights and hyperparameters, using methods similar to conventional training with regularization, and then approximate the posterior distribution over weights by a Gaussian centered at this maximum, as in the "evidence" approach of MacKay (1992a, 1992b).

2. Use a Monte Carlo method to sample from the posterior distribution over weights.
Monte Carlo methods for Bayesian neural networks have been developed by Neal (1993a, 1996). In this approach, the posterior distribution is represented by a sample of perhaps a few dozen sets of network weights. The sample is obtained by simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights and hyperparameters. This technique is known as "Markov chain Monte Carlo (MCMC)"; see Neal (1993b) for a review. The method is exact in the limit as the size of the sample and the length of time for which the Markov chain is run increase, but convergence can sometimes be slow in practice, as for any network training method.
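As a toy illustration of the MCMC idea, here is a plain random-walk Metropolis sampler for a one-weight linear model with a Normal(0,1) prior and known noise level (this is not Neal's more efficient hybrid Monte Carlo, and all settings are arbitrary), in minimal SAS/IML:

   proc iml;
      call randseed(11);
      n = 30;
      x = j(n,1);  call randgen(x, "Normal");
      e = j(n,1);  call randgen(e, "Normal", 0, 0.5);
      y = 2*x + e;

      start logpost(b) global(x, y);
         /* log prior N(0,1) plus log likelihood with noise s.d. 0.5 */
         return( -0.5*b**2 - 2*ssq(y - b*x) );
      finish;

      nsamp = 2000;  b = 0;  sample = j(nsamp,1);
      do i = 1 to nsamp;
         z = j(1,1);  call randgen(z, "Normal", 0, 0.2);
         bnew = b + z;                      /* symmetric random-walk proposal */
         u = j(1,1);  call randgen(u, "Uniform");
         if log(u) < logpost(bnew) - logpost(b) then b = bnew;
         sample[i] = b;
      end;
      print ( sample[1001:nsamp][:] );      /* posterior mean, near 2 */
   quit;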
Work on Bayesian neural network learning has so far concentrated on multilayer perceptron networks, but Bayesian methods can in principle be applied to other network models, as long as they can be interpreted in statistical terms. For some models (e.g., RBF networks), this should be a fairly simple matter; for others (e.g., Boltzmann Machines), substantial computational problems would need to be solved.
Software implementing Bayesian neural network models (intended for research use) is available from the home pages of David MacKay and Radford Neal.
There are many books that discuss the general concepts of Bayesian inference, though they mostly deal with models that are simpler than neural networks. Here are some recent ones:
Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory, New York: John Wiley.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5.
O'Hagan, A. (1994) Bayesian Inference (Volume 2B in Kendall's Advanced Theory of Statistics), ISBN 0-340-52922-9.
Robert, C. P. (1995) The Bayesian Choice, New York: Springer-Verlag.
The following books and papers have tutorial material on Bayesian learning as applied to neural network models:
Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
Lee, H.K.H. (1999), Model Selection and Model Averaging for Neural Networks, Doctoral dissertation, Carnegie Mellon University, Pittsburgh, USA, http://lib.stat.cmu.edu/~herbie/thesis.html
MacKay, D. J. C. (1995) "Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks", available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/network.ps.gz.
Mueller, P. and Insua, D.R. (1995) "Issues in Bayesian Analysis of Neural Network Models," Neural Computation, 10, 571-592, (also Institute of Statistics and Decision Sciences Working Paper 95-31), ftp://ftp.isds.duke.edu/pub/WorkingPapers/95-31.ps
Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York: Springer-Verlag, ISBN 0-387-94724-8.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Thodberg, H. H. (1996) "A review of Bayesian neural networks with an application to near infrared spectroscopy", IEEE Transactions on Neural Networks, 7, 56-72.
Some other references:
Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds., (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland).
Buntine, W. L. and Weigend, A. S. (1991) "Bayesian back-propagation", Complex Systems, 5, 603-643.
MacKay, D. J. C. (1992a) "Bayesian interpolation", Neural Computation, 4, 415-447.
MacKay, D. J. C. (1992b) "A practical Bayesian framework for backpropagation networks," Neural Computation, 4, 448-472.
MacKay, D. J. C. (1993) "Hyperparameters: Optimize or Integrate Out?", available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/alpha.ps.gz.
Neal, R. M. (1993a) "Bayesian learning via stochastic dynamics", in C. L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural Information Processing Systems 5, San Mateo, California: Morgan Kaufmann, 475-482.
Neal, R. M. (1993b) Probabilistic Inference Using Markov Chain Monte Carlo Methods, available at ftp://ftp.cs.utoronto.ca/pub/radford/review.ps.Z.
O'Hagan, A. (1985) "Shoulders in hierarchical models", in J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (editors), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 697-710.
Sarle, W. S. (1995) "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)
Wolpert, D. H. (1993) "On the use of evidence in neural networks", in C. L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural Information Processing Systems 5, San Mateo, California: Morgan Kaufmann, 539-546.
Finally, David MacKay maintains a FAQ about Bayesian methods for neural networks, at http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html .
By Warren Sarle.
Bayesian purists may argue over the proper way to do a Bayesian analysis, but even the crudest Bayesian computation (maximizing over both parameters and hyperparameters) is shown by Sarle (1995) to generalize better than early stopping when learning nonlinear functions. This approach requires the use of slightly informative hyperpriors and at least twice as many training cases as weights in the network. A full Bayesian analysis by MCMC can be expected to work even better under even broader conditions. Bayesian learning works well by frequentist standards--what MacKay calls the "evidence framework" is used by frequentist statisticians under the name "empirical Bayes." Although considerable research remains to be done, Bayesian learning seems to be the most promising approach to training neural networks.
Bayesian learning should not be confused with the "Bayes classifier." In the latter, the distribution of the inputs given the target class is assumed to be known exactly, and the prior probabilities of the classes are assumed known, so that the posterior probabilities can be computed by a (theoretically) simple application of Bayes' theorem. The Bayes classifier involves no learning--you must already know everything that needs to be known! The Bayes classifier is a gold standard that can almost never be used in real life but is useful in theoretical work and in simulation studies that compare classification methods. The term "Bayes rule" is also used to mean any classification rule that gives results identical to those of a Bayes classifier.
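For example, with two univariate normal classes whose distributions and prior probabilities are known exactly, the Bayes classifier reduces to one line of Bayes' theorem (a minimal SAS/IML sketch with arbitrary class distributions):

   proc iml;
      /* posterior probability of class 1 at x by Bayes' theorem */
      prior1 = 0.3;  prior2 = 0.7;
      x = 1.0;
      like1 = pdf("Normal", x, 0, 1);       /* class 1 inputs ~ N(0,1) */
      like2 = pdf("Normal", x, 2, 1);       /* class 2 inputs ~ N(2,1) */
      post1 = prior1*like1 / (prior1*like1 + prior2*like2);
      print post1;                          /* classify as class 1 if post1 > 0.5 */
   quit;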
Bayesian learning also should not be confused with the "naive" or "idiot's" Bayes classifier (Warner et al. 1961; Ripley, 1996), which assumes that the inputs are conditionally independent given the target class. The naive Bayes classifier is usually applied with categorical inputs, and the distribution of each input is estimated by the proportions in the training set; hence the naive Bayes classifier is a frequentist method.
The term "Bayesian network" often refers not to a neural network but to a belief network (also called a causal net, influence diagram, constraint network, qualitative Markov network, or gallery). Belief networks are more closely related to expert systems than to neural networks, and do not necessarily involve learning (Pearl, 1988; Ripley, 1996). Here are some URLs on Bayesian belief networks:
References for comments:
Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Warner, H.R., Toronto, A.F., Veasy, L.R., and Stephenson, R. (1961), "A mathematical model for medical diagnosis--application to congenital heart disease," J. of the American Medical Association, 177, 177-184.
------------------------------------------------------------------------
Methods for combining models go under many names, including ensembles, committees, model averaging, bagging, and boosting (many of these terms are collected on Christoph M. Friedrich's web page).

References:
Clemen, R.T. (1989), "Combining forecasts: A review and annotated bibliography," International Journal of Forecasting, 5, 559-584.
Drucker, H. (1999), "Boosting using neural networks," in Sharkey (1999), pp. 51-78.
Hoeting, J. A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999) "Bayesian Model Averaging: A Tutorial (with discussion)," Statistical Science, 14:4, 382-417. Corrected version available at http://www.stat.washington.edu/www/research/online/hoeting1999.pdf
Sharkey, A.J.C. (1999), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, London: Springer.
Taniguchi, M., and Tresp, V. (1997), "Averaging regularized estimators," Neural Computation, 9, 1163-1178.
------------------------------------------------------------------------
In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality (Sontag 1992). For further discussion, see Bishop (1995, 121-126).
In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the "universal approximation" property (e.g., Hornik, Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop 1995, 130, and Ripley, 1996, 173-180). But there is no theory yet to tell you how many hidden units are needed to approximate any given function.
If you have only one input, there seems to be no advantage to using more than one hidden layer. But things get much more complicated when there are two or more inputs. To illustrate, examples with two inputs and one output will be used so that the results can be shown graphically. In each example there are 441 training cases on a regular 21-by-21 grid. The test sets have 1681 cases on a regular 41-by-41 grid over the same domain as the training set. If you are reading the HTML version of this document via a web browser, you can see surface plots based on the test set by clicking on the file names mentioned in the following text. Each plot is a gif file, approximately 9K in size.
Consider a target function of two inputs, consisting of a Gaussian hill in the middle of a plane (hill.gif). An MLP with an identity output activation function can easily fit the hill by surrounding it with a few sigmoid (logistic, tanh, arctan, etc.) hidden units, but there will be spurious ridges and valleys where the plane should be flat (h_mlp_6.gif). It takes dozens of hidden units to flatten out the plane accurately (h_mlp_30.gif).
Now suppose you use a logistic output activation function. As the input to a logistic function goes to negative infinity, the output approaches zero. The plane in the Gaussian target function also has a value of zero. If the weights and bias for the output layer yield large negative values outside the base of the hill, the logistic function will flatten out any spurious ridges and valleys. So fitting the flat part of the target function is easy (h_mlpt_3_unsq.gif and h_mlpt_3.gif). But the logistic function also tends to lower the top of the hill.
If instead of a rounded hill, the target function were a mesa with a large, flat top with a value of one, the logistic output activation function would be able to smooth out the top of the mesa just as it smooths out the plane below. Target functions like this, with large flat areas with values of either zero or one, are just what you have in many noise-free classification problems. In such cases, a single hidden layer is likely to work well.
When using a logistic output activation function, it is common practice to scale the target values to a range of .1 to .9. Such scaling is bad in a noise-free classification problem, because it prevents the logistic function from smoothing out the flat areas (h_mlpt1-9_3.gif).
For the Gaussian target function, [.1,.9] scaling would make it easier to fit the top of the hill, but would reintroduce undulations in the plane. It would be better for the Gaussian target function to scale the target values to a range of 0 to .9. But for a more realistic and complicated target function, how would you know the best way to scale the target values?
By introducing a second hidden layer with one sigmoid activation function and returning to an identity output activation function, you can let the net figure out the best scaling (h_mlp1_3.gif). Actually, the bias and weight for the output layer scale the output rather than the target values, and you can use whatever range of target values is convenient.
For more complicated target functions, especially those with several hills or valleys, it is useful to have several units in the second hidden layer. Each unit in the second hidden layer enables the net to fit a separate hill or valley. So an MLP with two hidden layers can often yield an accurate approximation with fewer weights than an MLP with one hidden layer (Chester 1990).
To illustrate the use of multiple units in the second hidden layer, the next example resembles a landscape with a Gaussian hill and a Gaussian valley, both elliptical (hillanvale.gif). The table below gives the RMSE (root mean squared error) for the test set with various architectures. If you are reading the HTML version of this document via a web browser, click on any number in the table to see a surface plot of the corresponding network output.
The MLP networks in the table have one or two hidden layers with a tanh activation function. The output activation function is the identity. Using a squashing function on the output layer is of no benefit for this function, since the only flat area in the function has a target value near the middle of the target range.
          Hill and Valley Data: RMSE for the Test Set
             (Number of weights in parentheses)
                        MLP Networks

   HUs in                  HUs in Second Layer
   First   ----------------------------------------------------------
   Layer       0           1           2           3           4
      1    0.204(  5)  0.204(  7)  0.189( 10)  0.187( 13)  0.185( 16)
      2    0.183(  9)  0.163( 11)  0.147( 15)  0.094( 19)  0.096( 23)
      3    0.159( 13)  0.095( 15)  0.054( 20)  0.033( 25)  0.045( 30)
      4    0.137( 17)  0.093( 19)  0.009( 25)  0.021( 31)  0.016( 37)
      5    0.121( 21)  0.092( 23)              0.010( 37)  0.011( 44)
      6    0.100( 25)  0.092( 27)              0.007( 43)  0.005( 51)
      7    0.086( 29)  0.077( 31)
      8    0.079( 33)  0.062( 35)
      9    0.072( 37)  0.055( 39)
     10    0.059( 41)  0.047( 43)
     12    0.047( 49)  0.042( 51)
     15    0.039( 61)  0.032( 63)
     20    0.025( 81)  0.018( 83)
     25    0.021(101)  0.016(103)
     30    0.018(121)  0.015(123)
     40    0.012(161)  0.015(163)
     50    0.008(201)  0.014(203)

For an MLP with only one hidden layer (column 0), Gaussian hills and valleys require a large number of hidden units to approximate well. When there is one unit in the second hidden layer, the network can represent one hill or valley easily, which is what happens with three to six units in the first hidden layer. But having only one unit in the second hidden layer is of little benefit for learning two hills or valleys. Using two units in the second hidden layer enables the network to approximate two hills or valleys easily; in this example, only four units are required in the first hidden layer to get an excellent fit. Each additional unit in the second hidden layer enables the network to learn another hill or valley with a relatively small number of units in the first hidden layer, as explained by Chester (1990). In this example, having three or four units in the second hidden layer helps little, and actually produces a worse approximation when there are four units in the first hidden layer due to problems with local minima.
Unfortunately, using two hidden layers exacerbates the problem of local minima, and it is important to use lots of random initializations or other methods for global optimization. Local minima with two hidden layers can have extreme spikes or blades even when the number of weights is much smaller than the number of training cases. One of the few advantages of standard backprop is that it is so slow that spikes and blades will not become very sharp for practical training times.
More than two hidden layers can be useful in certain architectures such as cascade correlation (Fahlman and Lebiere 1990) and in special applications, such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code recognition (Le Cun et al. 1989).
RBF networks are most often used with a single hidden layer. But an extra, linear hidden layer before the radial hidden layer enables the network to ignore irrelevant inputs (see "How do MLPs compare with RBFs?"). The linear hidden layer allows the RBFs to take elliptical, rather than radial (circular), shapes in the space of the inputs. Hence the linear layer gives you an elliptical basis function (EBF) network. In the hill and valley example, an ORBFUN network requires nine hidden units (37 weights) to get the test RMSE below .01, but by adding a linear hidden layer, you can get an essentially perfect fit with three linear units followed by two radial units (20 weights).
References:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
Chester, D.L. (1990), "Why Two Hidden Layers are Better than One," IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning Architecture," NIPS2, 524-532, ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z.
Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward networks are universal approximators," Neural Networks, 2, 359-366.
Hornik, K. (1993), "Some new results on neural network approximation," Neural Networks, 6, 1069-1072.
Lang, K.J. and Witbrock, M.J. (1988), "Learning to tell two spirals apart," in Touretzky, D., Hinton, G., and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, San Mateo, CA: Morgan Kaufmann.
Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to handwritten ZIP code recognition," Neural Computation, 1, 541-551.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Sontag, E.D. (1992), "Feedback stabilization using two-hidden-layer nets", IEEE Transactions on Neural Networks, 3, 981-990.
------------------------------------------------------------------------
Some books and articles offer "rules of thumb" for choosing an architecture; for example, that the hidden layer should be somewhere between the input layer size and the output layer size (Blum 1992), that you will never require more than twice as many hidden units as inputs (Swingler 1996; Berry and Linoff 1997), or that you should use as many hidden units as are needed to capture 70-90% of the variance of the input data set (Boger and Guterman 1997). Such rules of thumb are nonsense, because they ignore the number of training cases, the amount of noise in the targets, and the complexity of the function to be learned. Even noise-free problems with the same numbers of inputs and outputs can require anywhere from zero to around a hundred hidden units, as the following two classification examples, each with two inputs and one categorical target, illustrate.
The first example is an 8-by-8 chess board pattern, in which the class is determined by the square in which a point falls:

   data chess;
      step = 1/4;
      do x = step/2 to 8-step/2 by step;
         do y = step/2 to 8-step/2 by step;
            c = 8*floor(x) + floor(y) + 1;
            output;
         end;
      end;
   run;

No hidden units are needed.
The second example is the two-spirals problem (Lang and Witbrock 1988):

   data spirals;
      pi = arcos(-1);
      do i = 0 to 96;
         angle = i*pi/16.0;
         radius = 6.5*(104-i)/104;
         x = radius*cos(angle);
         y = radius*sin(angle);
         c = 1;
         output;
         x = -x;
         y = -y;
         c = 0;
         output;
      end;
   run;

With one hidden layer, about 50 tanh hidden units are needed. Many NN programs may actually need closer to 100 hidden units to get zero training error.
An intelligent choice of the number of hidden units depends on whether you are using early stopping or some other form of regularization. If not, you must simply try many networks with different numbers of hidden units, estimate the generalization error for each one, and choose the network with the minimum estimated generalization error. For examples using statistical criteria to choose the number of hidden units, see ftp://ftp.sas.com/pub/neural/dojo/dojo.html.
Using conventional optimization algorithms (see "What are conjugate gradients, Levenberg-Marquardt, etc.?"), there is little point in trying a network with more weights than training cases, since such a large network is likely to overfit.
Using standard online backprop, however, Lawrence, Giles, and Tsoi (1996, 1997) have shown that it can be difficult to reduce training error to a level near the globally optimal value, even when using more weights than training cases. But increasing the number of weights makes it easier for standard backprop to find a good local optimum, so using "oversize" networks can reduce both training error and generalization error.
If you are using early stopping, it is essential to use lots of hidden units to avoid bad local optima (Sarle 1995). There seems to be no upper limit on the number of hidden units, other than that imposed by computer time and memory requirements. Weigend (1994) makes this assertion, but provides only one example as evidence. Tetko, Livingstone, and Luik (1995) provide simulation studies that are more convincing. Similar results were obtained in conjunction with the simulations in Sarle (1995), but those results are not reported in the paper for lack of space. On the other hand, there seems to be no advantage to using more hidden units than you have training cases, since bad local minima do not occur with so many hidden units.
If you are using weight decay or Bayesian estimation, you can also use lots of hidden units (Neal 1996). However, it is not strictly necessary to do so, because other methods are available to avoid local minima, such as multiple random starts and simulated annealing (such methods are not safe to use with early stopping). You can use one network with lots of hidden units, or you can try different networks with different numbers of hidden units, and choose on the basis of estimated generalization error. With weight decay or MAP Bayesian estimation, it is prudent to keep the number of weights less than half the number of training cases.
Bear in mind that with two or more inputs, an MLP with one hidden layer containing only a few units can fit only a limited variety of target functions. Even simple, smooth surfaces such as a Gaussian bump in two dimensions may require 20 to 50 hidden units for a close approximation. Networks with a smaller number of hidden units often produce spurious ridges and valleys in the output surface (see Chester 1990 and "How do MLPs compare with RBFs?") Training a network with 20 hidden units will typically require anywhere from 150 to 2500 training cases if you do not use early stopping or regularization. Hence, if you have a smaller training set than that, it is usually advisable to use early stopping or regularization rather than to restrict the net to a small number of hidden units.
Ordinary RBF networks containing only a few hidden units also produce peculiar, bumpy output functions. Normalized RBF networks are better at approximating simple smooth surfaces with a small number of hidden units (see "How do MLPs compare with RBFs?").
There are various theoretical results on how fast approximation error decreases as the number of hidden units increases, but the conclusions are quite sensitive to the assumptions regarding the function you are trying to approximate. See p. 178 in Ripley (1996) for a summary. According to a well-known result by Barron (1993), in a network with I inputs and H units in a single hidden layer, the root integrated squared error (RISE) will decrease at least as fast as H^(-1/2) under some quite complicated smoothness assumptions. Ripley cites another paper by DeVore et al. (1989) that says if the function has only R bounded derivatives, RISE may decrease as slowly as H^(-R/I). For some examples with 1 to 4 hidden layers, see "How many hidden layers should I use?" and "How do MLPs compare with RBFs?"
For learning a finite training set exactly, bounds for the number of hidden units required are provided by Elisseeff and Paugam-Moisy (1997).
References:
Barron, A.R. (1993), "Universal approximation bounds for superpositions of a sigmoid function," IEEE Transactions on Information Theory, 39, 930-945.
Berry, M.J.A., and Linoff, G. (1997), Data Mining Techniques, NY: John Wiley & Sons.
Blum, A. (1992), Neural Networks in C++, NY: Wiley.
Boger, Z., and Guterman, H. (1997), "Knowledge extraction from artificial neural network models," IEEE Systems, Man, and Cybernetics Conference, Orlando, FL.
Chester, D.L. (1990), "Why Two Hidden Layers are Better than One," IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.
DeVore, R.A., Howard, R., and Micchelli, C.A. (1989), "Optimal nonlinear approximation," Manuscripta Mathematica, 63, 469-478.
Elisseeff, A., and Paugam-Moisy, H. (1997), "Size of multilayer networks for exact learning: analytic approach," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 162-168.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Lawrence, S., Giles, C.L., and Tsoi, A.C. (1996), "What size neural network gives optimal generalization? Convergence properties of backpropagation," Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, http://www.neci.nj.nec.com/homepages/lawrence/papers/minima-tr96/minima-tr96.html
Lawrence, S., Giles, C.L., and Tsoi, A.C. (1997), "Lessons in Neural Network Training: Overfitting May be Harder than Expected," Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI-97, AAAI Press, Menlo Park, California, pp. 540-545, http://www.neci.nj.nec.com/homepages/lawrence/papers/overfitting-aaai97/overfitting-aaai97-bib.html
Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York: Springer-Verlag, ISBN 0-387-94724-8.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)
Swingler, K. (1996), Applying Neural Networks: A Practical Guide, London: Academic Press.
Tetko, I.V., Livingstone, D.J., and Luik, A.I. (1995), "Neural Network Studies. 1. Comparison of Overfitting and Overtraining," J. Chem. Info. Comp. Sci., 35, 826-833.
Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.
------------------------------------------------------------------------
These statistics can also be used as crude estimates of the generalization error in nonlinear models when you have a "large" training set. Correcting these statistics for nonlinearity requires substantially more computation (Moody 1992), and the theory does not always hold for neural networks due to violations of the regularity conditions.
Among the simple generalization estimators that do not require the noise variance to be known, Schwarz's Bayesian Criterion (known as both SBC and BIC; Schwarz 1978; Judge et al. 1980; Raftery 1995) often works well for NNs (Sarle 1995, 1999). AIC and FPE tend to overfit with NNs. Rissanen's Minimum Description Length principle (MDL; Rissanen 1978, 1987, 1999) is closely related to SBC. A special issue of the Computer Journal contains several articles on MDL, which can be found online at
http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
Several articles on SBC/BIC are available at the University of Washington's web site at
http://www.stat.washington.edu/tech.reports
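For regression with Gaussian noise, one common form of these criteria, dropping constants that do not affect model comparisons, is shown in the Python sketch below (the numbers in the example calls are hypothetical, and exact formulas vary slightly between textbooks):

   from math import log

   def selection_criteria(sse, n, p):
       # AIC, SBC/BIC, and FPE for a model with p weights fitted to n
       # cases by least squares; sse is the sum of squared errors.
       mse = sse / n
       aic = n * log(mse) + 2 * p
       sbc = n * log(mse) + p * log(n)  # penalty grows with n
       fpe = mse * (n + p) / (n - p)
       return aic, sbc, fpe

   # Hypothetical comparison of a small and a large network:
   print(selection_criteria(sse=52.0, n=100, p=5))
   print(selection_criteria(sse=40.0, n=100, p=41))

Since log(n) exceeds 2 once n > e^2 ~ 7.4, SBC penalizes each extra weight more heavily than AIC at any realistic sample size, consistent with the remark above that AIC and FPE tend to overfit with NNs.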
For classification problems, the formulas are not as simple as for regression with normal noise. See Efron (1986) regarding logistic regression.
References:
Darlington, R.B. (1968), "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69, 161-182.
Efron, B. (1986), "How biased is the apparent error rate of a prediction rule?" J. of the American Statistical Association, 81, 461-470.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.
Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.
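Judge, G.G., Griffiths, W.E., Hill, R.C., and Lee, T.-C. (1980), The Theory and Practice of Econometrics, NY: John Wiley & Sons.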
Miller, A.J. (1990), Subset Selection in Regression, London: Chapman & Hall.
Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", in Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural Information Processing Systems 4, 847-854.
Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in Marsden, P.V. (ed.), Sociological Methodology 1995, Cambridge, MA: Blackwell, ftp://ftp.stat.washington.edu/pub/tech.reports/bic.ps.z or http://www.stat.washington.edu/tech.reports/bic.ps
Rissanen, J. (1978), "Modelling by shortest data description," Automatica, 14, 465-471.
Rissanen, J. (1987), "Stochastic complexity" (with discussion), J. of the Royal Statistical Society, Series B, 49, 223-239.
Rissanen, J. (1999), "Hypothesis Selection and Testing by the MDL Principle," Computer Journal, 42, 260-269, http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)
Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results," ftp://ftp.sas.com/pub/neural/dojo/dojo.html
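Schwarz, G. (1978), "Estimating the dimension of a model," Annals of Statistics, 6, 461-464.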
Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.
------------------------------------------------------------------------
Note that cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the generalization error, instead of k different subsets; i.e., there is no "crossing". While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.
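To make the contrast concrete, here is a minimal k-fold cross-validation sketch in Python with NumPy. The linear least-squares fit is merely a stand-in for whatever learner you actually use (with a neural net, the k retrainings are the expensive part); everything here is illustrative, not code from any cited source:

   import numpy as np

   def k_fold_cv_mse(x, y, k, rng):
       # Estimate generalization MSE by k-fold cross-validation.
       # Each case is held out exactly once; contrast with the
       # split-sample method, which uses a single validation set.
       n = len(y)
       folds = np.array_split(rng.permutation(n), k)
       sq_errs = []
       for i in range(k):
           test = folds[i]
           train = np.concatenate([folds[j] for j in range(k) if j != i])
           # Stand-in learner: linear least squares with an intercept.
           a_train = np.column_stack([np.ones(len(train)), x[train]])
           w, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
           a_test = np.column_stack([np.ones(len(test)), x[test]])
           sq_errs.extend((y[test] - a_test @ w) ** 2)
       return np.mean(sq_errs)

   rng = np.random.default_rng(0)
   x = rng.normal(size=(50, 3))
   y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
   print(k_fold_cv_mse(x, y, k=10, rng=rng))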
The distinction between cross-validation and split-sample validation is extremely important because cross-validation is markedly superior for small data sets; this fact is demonstrated dramatically by Goutte (1997) in a reply to Zhu and Rohwer (1996). For an insightful discussion of the limitations of cross-validatory choice among several learning methods, see Stone (1977).
Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases. In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses. (For model-selection purposes, this bias can actually help; see the discussion below of Shao, 1993.) A value of 10 for k is popular for estimating generalization error.
Leave-one-out cross-validation can also run into trouble with various model-selection methods. Again, one problem is lack of continuity--a small change in the data can cause a large change in the model selected (Breiman, 1996). For choosing subsets of inputs in linear regression, Breiman and Spector (1992) found 10-fold and 5-fold cross-validation to work better than leave-one-out. Kohavi (1995) also obtained good results for 10-fold cross-validation with empirical decision trees (C4.5). Values of k as small as 5 or even 2 may work even better if you analyze several different random k-way splits of the data to reduce the variability of the cross-validation estimate.
Leave-one-out cross-validation also has more subtle deficiencies for model selection. Shao (1995) showed that in linear models, leave-one-out cross-validation is asymptotically equivalent to AIC (and Mallows' C_p), but leave-v-out cross-validation is asymptotically equivalent to Schwarz's Bayesian criterion (called SBC or BIC) when v = n[1-1/(log(n)-1)], where n is the number of training cases. SBC provides consistent subset-selection, while AIC does not. That is, SBC will choose the "best" subset with probability approaching one as the size of the training set goes to infinity. AIC has an asymptotic probability of one of choosing a "good" subset, but less than one of choosing the "best" subset (Stone, 1979). Many simulation studies have also found that AIC overfits badly in small samples, and that SBC works well (e.g., Hurvich and Tsai, 1989; Shao and Tu, 1995). Hence, these results suggest that leave-one-out cross-validation should overfit in small samples, but leave-v-out cross-validation with appropriate v should do better. However, when true models have an infinite number of parameters, SBC is not efficient, and other criteria that are asymptotically efficient but not consistent for model selection may produce better generalization (Hurvich and Tsai, 1989).
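For concreteness, here is Shao's v = n[1-1/(log(n)-1)] evaluated at a few sample sizes (natural logarithm assumed; the arithmetic is only an illustration):

   from math import log

   for n in (50, 100, 1000, 10000):
       v = n * (1 - 1 / (log(n) - 1))
       print(n, round(v))
   # 50 -> 33, 100 -> 72, 1000 -> 831, 10000 -> 8782;
   # each cross-validation training set then has only n - v cases.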
Shao (1993) obtained the surprising result that for selecting subsets of inputs in a linear regression, the probability of selecting the "best" does not converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1. At first glance, Shao's result seems inconsistent with the analysis by Kearns (1997) of split-sample validation, which shows that the best generalization is obtained with v/n strictly between 0 and 1, with little sensitivity to the precise value of v/n for large data sets. But the apparent conflict is due to the fundamentally different properties of cross-validation and split-sample validation.
To obtain an intuitive understanding of Shao (1993), let's review some background material on generalization error. Generalization error can be broken down into three additive parts, noise variance + estimation variance + squared estimation bias. Noise variance is the same for all subsets of inputs. Bias is nonzero for subsets that are not "good", but it's zero for all "good" subsets, since we are assuming that the function to be learned is linear. Hence the generalization error of "good" subsets will differ only in the estimation variance. The estimation variance is (2p/t)s^2 where p is the number of inputs in the subset, t is the training set size, and s^2 is the noise variance. The "best" subset is better than other "good" subsets only because the "best" subset has (by definition) the smallest value of p. But the t in the denominator means that differences in generalization error among the "good" subsets will all go to zero as t goes to infinity. Therefore it is difficult to guess which subset is "best" based on the generalization error even when t is very large. It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and C_p, do not produce consistent estimates of the "best" subset (e.g., see Stone, 1979).
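A quick numeric illustration of why these differences become undetectable (the values of s^2 and p below are arbitrary):

   # Estimation variance (2p/t)s^2 for the "best" subset (p inputs)
   # versus a "good" subset with one extra input (p+1 inputs).
   s2, p = 1.0, 5
   for t in (25, 100, 1000, 100000):
       best = 2 * p * s2 / t
       good = 2 * (p + 1) * s2 / t
       print(t, best, good, good - best)  # the gap 2*s2/t shrinks to 0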
In leave-v-out cross-validation, t=n-v. The differences of the cross-validation estimates of generalization error among the "good" subsets contain a factor 1/t, not 1/n. Therefore by making t small enough (and thereby making each regression based on t cases bad enough), we can make the differences of the cross-validation estimates large enough to detect. It turns out that to make t small enough to guess the "best" subset consistently, we have to have t/n go to 0 as n goes to infinity.
The crucial distinction between cross-validation and split-sample validation is that with cross-validation, after guessing the "best" subset, we train the linear regression model for that subset using all n cases, but with split-sample validation, only t cases are ever used for training. If our main purpose were really to choose the "best" subset, I suspect we would still have to have t/n go to 0 even for split-sample validation. But choosing the "best" subset is not the same thing as getting the best generalization. If we are more interested in getting good generalization than in choosing the "best" subset, we do not want to make our regression estimate based on only t cases as bad as we do in cross-validation, because in split-sample validation that bad regression estimate is what we're stuck with. So there is no conflict between Shao and Kearns, but there is a conflict between the two goals of choosing the "best" subset and getting the best generalization in split-sample validation.
More information on jackknife and bootstrap confidence intervals is available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text file).
References:
Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction", Neural Computation, 7, 624-638.
Breiman, L. (1996), "Heuristics of instability and stabilization in model selection," Annals of Statistics, 24, 2350-2383.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth.
Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: The X-random case," International Statistical Review, 60, 291-319.
Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical Implications, Proceedings of a workshop held in Groningen, The Netherlands, September 25-26, 1986, Berlin: Springer-Verlag.
Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: SIAM.
Efron, B. (1983), "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Statistical Association, 78, 316-331.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.
Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560.
Goutte, C. (1997), "Note on free lunches and cross-validation," Neural Computation, 9, 1211-1215, ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.
Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.
Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307.
Kearns, M. (1997), "A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split," Neural Computation, 9, 1143-1161.
Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?, http://robotics.stanford.edu/users/ronnyk/
Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann, pp. 391-398.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Shao, J. (1993), "Linear model selection by cross-validation," J. of the American Statistical Association, 88, 486-494.
Shao, J. (1995), "An asymptotic theory for linear model selection," Statistica Sinica ?.
Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag.
Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in time series," in Dijkstra (1988), pp. 56-69.
Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, 64, 29-35.
Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.
Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.
Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.
Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation," Neural Computation, 8, 1421-1426.
------------------------------------------------------------------------
In addition to estimating over-all generalization error, it is often useful to be able to estimate the accuracy of the network's predictions for individual cases.
Let:
   Y      = the target variable
   y_i    = the value of Y for the ith case
   X      = the vector of input variables
   x_i    = the value of X for the ith case
   N      = the noise in the target variable
   n_i    = the value of N for the ith case
   m(X)   = E(Y|X) = the conditional mean of Y given X
   w      = a vector of weights for a neural network
   w^     = the weights obtained by training the network
   p(X,w) = the output of a neural network given input X and weights w
   p_i    = p(x_i,w)
   L      = the number of training (learning) cases, (y_i,x_i), i=1, ..., L
   Q(w)   = the objective function

Assume the data are generated by the model:

   Y = m(X) + N
   E(N|X) = 0
   N and X are independent

The network is trained by attempting to minimize the objective function Q(w), which, for example, could be the sum of squared errors or the negative log likelihood based on an assumed family of noise distributions.
Given a test input x_0, a 100c% prediction interval for y_0 is an interval [LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <= UPB_0) = c, where c is typically .95 or .99, and the probability is computed over repeated random selection of the training set and repeated observation of Y given the test input x_0. A 100c% confidence interval for p_0 is an interval [LCB_0,UCB_0] such that Pr(LCB_0 <= p_0 <= UCB_0) = c, where again the probability is computed over repeated random selection of the training set. Note that p_0 is a nonrandom quantity, since x_0 is given. A confidence interval is narrower than the corresponding prediction interval, since the prediction interval must include variation due to noise in y_0, while the confidence interval does not. Both intervals include variation due to sampling of the training set and possible variation in the training process due, for example, to random initial weights and local minima of the objective function.
Traditional statistical methods for nonlinear models depend on several regularity assumptions (Gallant, 1987), among them that the objective function Q(w) has a unique global minimum at the true weights and that this minimum is proper.
This assumption is not satisfied for neural nets, because networks with hidden units always have multiple global minima, and the global minima are often improper. Hence, confidence intervals for the weights cannot be obtained using standard Hessian-based methods. However, Hwang and Ding (1997) have shown that confidence intervals for predicted values can be obtained, because the predicted values are statistically identified even though the weights are not.
Cardell, Joerding, and Li (1994) describe a more serious violation of this assumption, namely that for some m(X), no finite global minimum exists at all. In such situations, it may be possible to use regularization methods such as weight decay to obtain valid confidence intervals (De Veaux, Schumi, Schweinsberg, and Ungar, 1998), but more research is required on this subject, since the derivation in the cited paper assumes a finite global minimum.
For large samples, the sampling variability in w^ can be approximated in various ways, for example using the Hessian or information matrix of the objective function (White 1989; Spall 1998) or by resampling methods such as the bootstrap (Tibshirani 1996; Heskes 1997).
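As one concrete example of the resampling approach, here is a minimal percentile-bootstrap sketch in Python with NumPy. A linear least-squares fit again stands in for retraining the network on each resample (the expensive step in practice), and all data and numbers are hypothetical:

   import numpy as np

   def bootstrap_ci(x, y, x0, n_boot=200, c=0.95, seed=0):
       # Percentile bootstrap confidence interval for the prediction at
       # input x0, reflecting variability over resampled training sets.
       rng = np.random.default_rng(seed)
       n = len(y)
       preds = []
       for _ in range(n_boot):
           idx = rng.integers(0, n, size=n)  # resample cases with replacement
           a = np.column_stack([np.ones(n), x[idx]])
           w, *_ = np.linalg.lstsq(a, y[idx], rcond=None)  # "retrain"
           preds.append(np.concatenate([[1.0], x0]) @ w)
       return np.percentile(preds, [100 * (1 - c) / 2, 100 * (1 + c) / 2])

   rng = np.random.default_rng(1)
   x = rng.normal(size=(60, 2))
   y = 1 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=60)
   print(bootstrap_ci(x, y, x0=np.array([0.5, -0.5])))

Note that this is a confidence interval for the predicted value; a prediction interval for y_0 would additionally have to account for the noise variance, as explained above.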
References:
Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward networks cannot learn some polynomials," Neural Computation, 6, 761-766.
De Veaux,R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998), "Prediction intervals for neural networks via nonlinear regression," Technometrics, 40, 273-282.
Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.
Heskes, T. (1997), "Practical confidence and prediction intervals," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 176-182.
Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for artificial neural networks," J. of the American Statistical Association, 92, 748-757.
Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T., (eds.) Advances in Neural Information Processing Systems 7, Cambridge, MA: The MIT Press, pp. 489-496.
Spall, J.C. (1998), "Resampling-based calculation of the information matrix in nonlinear statistical models," Proceedings of the 4th Joint Conference on Information Sciences, October 23-28, Research Triangle Park, NC, USA, Vol. 4, pp. 35-39.
Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.
White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013.
------------------------------------------------------------------------
Next part is part 4 (of 7). Previous part is part 2.