As this volume is written mainly for those who take an interest in the logical questions involved, rather than as an introduction to the actual processes of calculation, mathematical details have been throughout avoided as much as possible. For this reason comparatively few references have been made to the exponential equation of the Law of Error, or to the corresponding ‘Probability integral,’ tables of which are given in several handbooks on the subject. There are two points however in connection with these particular topics as to which difficulties are, or should be, felt by so many students that some notice may be taken of them here.

(1) In regard to the ordinary algebraical expression for the law of error, viz. $y = \frac{h}{\sqrt{\pi}} e^{-h^2x^2}$, it will have been observed that I have always spoken of y as being proportional to the number of errors of the particular magnitude x. It would hardly be correct to say, absolutely, that y represents that number, because of course the actual number of errors of any precise magnitude, where continuity of possibility is assumed, must be indefinitely small. If therefore we want to pass from the continuous to the discrete, by ascertaining the actual number of errors between two consecutive divisions of our scale, when, as usual in measurements, all within certain limits are referred to some one precise point, we must modify our formula. In accordance with the usual differential notation, we must say that the number of errors falling into one subdivision (dx) of our scale is $\frac{h}{\sqrt{\pi}} e^{-h^2x^2}\,dx$, where dx is a (small) unit of length, in which both h⁻¹ and x must be measured.

The difficulty felt by most students is in applying the formula to actual statistics, in other words in putting in the correct units. To take an actual numerical example, suppose that 1460 men have been measured in regard to their height “true to the nearest inch,” and let it be known that the modulus here is 3.6 inches. Then dx = 1 (inch); h⁻¹ = 3.6 inches. Now $\sum \frac{h}{\sqrt{\pi}} e^{-h^2x^2}\,dx = 1$; that is, the sum of all the consecutive possible values is equal to unity. When therefore we want the sum, as here, to be 1460, we must express the formula thus;— $y = \frac{1460}{\sqrt{\pi} \times 3.6}\, e^{-(x/3.6)^2}$, or $y = 228\, e^{-(x/3.6)^2}$.

Here x stands for the number of inches measured from the central or mean height, and y stands for the number of men referred to that height in our statistical table. (The values of $e^{-t^2}$ for successive values of t are given in the handbooks.)

For illustration I give the calculated numbers by this formula for values of x from 0 to 8 inches, with the actual numbers observed in the Cambridge measurements recently set on foot by Mr Galton.

inches	calculated	observed
x = 0	y = 228	231
x = 1	y = 212	218
x = 2	y = 166	170
x = 3	y = 111	110
x = 4	y = 66	66
x = 5	y = 32	31
x = 6	y = 11	10
x = 7	y = 4	6
x = 8	y = 1	3

Here the average height was 69 inches: dx, as stated, = 1 inch. By saying, ‘put x = 0,’ we mean, calculate the number of men who are assigned to 69 inches; i.e. who fall between 68.5 and 69.5. By saying, ‘put x = 4,’ we mean, calculate the number who are assigned to 65 or to 73; i.e. who lie between 64.5 and 65.5, or between 72.5 and 73.5. The observed results, it will be seen, keep pretty close to the calculated: in the case of the former the means of equal and opposite divergences from the mean have been taken, the actual results not being always the same in opposite directions.
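The calculation behind the table can be sketched in a few lines of Python. This is only an illustration of the formula as stated, with the total (1460), the modulus (3.6 inches) and the class width (1 inch) taken from the text; the function name `expected_count` is introduced here for convenience. Since it keeps the unrounded coefficient 1460/(√π × 3.6) instead of 228, its output need not agree to the last unit with the figures printed above.

```python
import math

N_MEN = 1460     # total number of men measured (from the text)
MODULUS = 3.6    # modulus 1/h, in inches (from the text)
DX = 1.0         # class width: heights are "true to the nearest inch"


def expected_count(x, n=N_MEN, modulus=MODULUS, dx=DX):
    """Number of men referred to the class lying x inches from the mean."""
    h = 1.0 / modulus
    return n * dx * h / math.sqrt(math.pi) * math.exp(-(h * x) ** 2)


if __name__ == "__main__":
    for x in range(9):
        print(f"x = {x}   y = {expected_count(x):6.1f}")
```

Summing `expected_count(x)` over all classes on both sides of the mean should return very nearly 1460, which is the sense in which the coefficient has been chosen.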

(2) The other point concerns the interpretation of the familiar probability integral, $\frac{2}{\sqrt{\pi}} \int_0^t e^{-t^2}\,dt$. Every one who has calculated the chance of an event, by the help of the tables of this integral given in so many handbooks, knows that if we assign any numerical value to t, the corresponding value of the above expression assigns the chance that an error taken at random shall lie within that same limit, viz. t. Thus put t = 1.5, and we have the result 0.96; that is, only 4 per cent. of the errors will exceed ‘one and a half.’ But when we ask, ‘one and a half’ what? the answer would not always be very ready. As usual, the main difficulty of the beginner is not to manipulate the formulæ, but to be quite clear about his units.

It will be seen at once that this case differs from the preceding in that we cannot now choose our unit as we please. Where, as here, there is only one variable (t), if we were allowed to select our own unit, the inch, foot, or whatever it might be, we might get quite different results. Accordingly some comparatively natural unit must have been chosen for us in which we are bound to reckon, just as in the circular measurement of an angle as distinguished from that by degrees.

The answer is that the unit here is the modulus, and that to put ‘t = 1.5’ is to say, ‘suppose the error half as great again as the modulus’; the modulus itself being an error of a certain assignable magnitude depending upon the nature of the measurements or observations in question. We shall see this better if we put the integral in the form $\frac{2}{\sqrt{\pi}} \int_0^{hx} e^{-h^2x^2}\,d(hx)$; which is precisely equivalent, since the value of a definite integral is independent of the particular variable employed. Here hx is the same as $x : \frac{1}{h}$; i.e. it is the ratio of x to $\frac{1}{h}$, or x measured in terms of $\frac{1}{h}$. But $\frac{1}{h}$ is the modulus in the equation ($y = \frac{h}{\sqrt{\pi}} e^{-h^2x^2}$) for the law of error. In other words the numerical value of an error in this formula is the number of times, whole or fractional, which it contains the modulus.
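The point about units may be made concrete with a short Python sketch. It relies on the standard library function `math.erf`, which computes exactly this integral; the function name `chance_within` and the choice of 5.4 inches as an example are introduced here for illustration, the modulus of 3.6 inches being borrowed from the stature example.

```python
import math


def chance_within(error, modulus):
    """Chance that a random error lies within `error` on either side of the mean.

    Both arguments must be in the same unit (inches, say); only their
    ratio t = error / modulus enters the probability integral.
    """
    t = error / modulus
    return math.erf(t)   # (2/sqrt(pi)) * integral from 0 to t of exp(-u^2) du


if __name__ == "__main__":
    # An error of 5.4 inches with a modulus of 3.6 inches gives t = 1.5.
    print(round(chance_within(5.4, 3.6), 4))
    # The same ratio in any other unit gives the same chance.
    print(round(chance_within(0.45, 0.3), 4))
```

Changing the unit of length alters both arguments in the same proportion and leaves the result untouched, which is why the tables need only one argument.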

1 This kind of mean is called by Fechner and others the “dichteste Werth.” The most appropriate appeal to it that I have seen is by Prof. Lexis (Massenerscheinungen, p. 42) where he shows that it indicates clearly a sort of normal length of human life, of about 70 years; a result which is almost entirely masked when we appeal to the arithmetical average.

This mean ought to be called the ‘probable’ value (a name however in possession of another) on the ground that it indicates the point of likeliest occurrence; i.e. if we compare all the indefinitely small and equal units of variation, the one corresponding to this will tend to be most frequently represented.

2 A diagram illustrative of this number of results was given in Nature (Sept. 1, 1887). In calculating, as above, the different means, I may remark that the original results were given to three decimal places; but, in classing them, only one place was noted. That is, 29.9 includes all values between 29.900 and 29.999. Thus the value most frequently entered in my tables was 30.0, but on the usual principles of interpolation this is reckoned as 30.05.

3 There is some ambiguity in the phraseology in use here. Thus Airy commonly uses the expression ‘Error of Mean Square’ to represent, as here, $\sqrt{\frac{\sum e^2}{n}}$. Galloway commonly speaks of the ‘Mean Square of the Errors’ to represent $\frac{\sum e^2}{n}$. I shall adhere to the former usage and represent it briefly by E.M.S. Still more unfortunate (to my thinking) is the employment, by Mr Merriman and others, of the expression ‘Mean Error,’ (widely in use in its more natural signification,) as the equivalent of this E.M.S.

The technical term ‘Fluctuation’ is applied by Mr F. Y. Edgeworth to the expression $\frac{2\sum e^2}{n}$.

4 Practically, of course, we should allow for the expansion or contraction. But for purposes of logical explanation we may conveniently take this variation as a specimen of one of those disturbances which may be neutralised by resort to an average.

5 More strictly multinomial: the relative frequency of the different numbers being indicated by the coefficients of the powers of x in the development of

(1 + x + x² + … + x⁹)¹⁰.

6 By Mr Merriman, in his work on Least Squares.

CHAPTER XIX.

THE THEORY OF THE AVERAGE AS A MEANS OF APPROXIMATION TO THE TRUTH.

§ 1. In the last chapter we were occupied with the Average mainly under its qualitative rather than its quantitative aspect. That is, we discussed its general nature, its principal varieties, and the main uses to which it could be put in ordinary life or in reasoning processes which did not claim to be very exact. It is now time to enter more minutely into the specific question of the employment of the average in the way peculiarly appropriate to Probability. That is, we must be supposed to have a certain number of measurements,—in the widest sense of that term,—placed before us, and to be prepared to answer such questions as; Why do we take their average? With what degree of confidence? Must we in all cases take the average, and, if so, one always of the same kind?

The subject upon which we are thus entering is one which, under its most general theoretic treatment, has perhaps given rise to more profound investigation, to a greater variety of opinion, and in consequence to a more extensive history and literature, than any other single problem within the range of mathematics.[1] But, in spite of this, the main logical principles underlying the methods and processes in question are not, I apprehend, particularly difficult to grasp: though, owing to the extremely technical style of treatment adopted even in comparatively elementary discussions of the subject, it is far from easy for those who have but a moderate command of mathematical resources to disentangle these principles from the symbols in which they are clothed. The present chapter contains an attempt to remove these difficulties, so far as a general comprehension of the subject is concerned. As the treatment thus adopted involves a considerable number of subdivisions, the reader will probably find it convenient to refer back occasionally to the table of contents at the commencement of this volume.

§ 2. The subject, in the form in which we shall discuss it, will be narrowed to the consideration of the average, on account of the comparative simplicity and very wide prevalence of this aspect of the problem. The problem is however very commonly referred to, even in non-mathematical treatises, as the Rule or Method of Least Squares; the fact being that, in such cases as we shall be concerned with, the Rule of Least Squares resolves itself into the simpler and more familiar process of taking the arithmetical average. A very simple example,—one given by Herschel,—will explain the general nature of the task under a slightly wider treatment, and will serve to justify the familiar designation.

Suppose that a man had been firing for some time with a pistol at a small mark, say a wafer on a wall. We may take it for granted that the shot-marks would tend to group themselves about the wafer as a centre, with a density varying in some way inversely with the distance from the centre. But now suppose that the wafer which marked the centre was removed, so that we could see nothing but the surface of the wall spotted with the shot-marks; and that we were asked to guess the position of the wafer. Had there been only one shot, common sense would suggest our assuming (of course very precariously) that this marked the real centre. Had there been two, common sense would suggest our taking the mid-point between them. But if three or more were involved, common sense would be at a loss. It would feel that some intermediate point ought to be selected, but would not see its way to a more precise determination, because its familiar reliance,—the arithmetical average,—does not seem at hand here. The rule in question tells us how to proceed. It directs us to select that point which will render the sum of the squares of all the distances of the various shot-marks from it the least possible.[2]
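For shot-marks on a flat wall, the point that makes the sum of the squared distances least is the centre of gravity of the marks, that is, the point whose two coordinates are the arithmetical averages of the coordinates of the shots. A minimal Python sketch follows; the coordinates are invented purely for illustration and the function names are introduced here.

```python
def least_squares_centre(shots):
    """Point minimising the sum of squared distances to the shot-marks."""
    n = len(shots)
    return (sum(x for x, _ in shots) / n, sum(y for _, y in shots) / n)


def sum_of_squares(point, shots):
    """Sum of squared distances from `point` to every shot-mark."""
    px, py = point
    return sum((x - px) ** 2 + (y - py) ** 2 for x, y in shots)


# Hypothetical shot-marks, in inches from an arbitrary corner of the wall.
shots = [(1.0, 2.0), (3.0, -1.0), (-2.0, 0.5), (0.0, 1.5)]
centre = least_squares_centre(shots)

if __name__ == "__main__":
    print(centre, sum_of_squares(centre, shots))
    # Any displaced point gives a larger sum of squares.
    print(sum_of_squares((centre[0] + 0.1, centre[1]), shots))
```

With one shot this returns the shot itself, and with two it returns their mid-point, so the rule agrees with common sense wherever common sense has anything to say.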

This is merely by way of illustration, and to justify the familiar designation of the rule. The sort of cases with which we shall be exclusively occupied are those comparatively simple ones in which only linear magnitude, or some quality which can be adequately represented by linear magnitude, is the object under consideration. In respect of these the Rule of Least Squares reduces itself to the process of taking the average, in the most familiar sense of that term, viz. the arithmetical mean; and a single Law of Error, or its graphical equivalent, a Curve of Facility, will suffice accurately to indicate the comparative frequency of the different amounts of the one variable magnitude involved.

§ 3. We may conveniently here again call attention to a misconception or confusion which has been already noticed in a former chapter. It is that of confounding the Law of Error with the Method of Least Squares. These are things of an entirely distinct kind. The former is of the nature of a physical fact, and its production is one which in many cases is entirely beyond our control. The latter,—or any simplified application of it, such as the arithmetical average,—is no law whatever in the physical sense. It is rather a precept or rule for our guidance. The Law states, in any given case, how the errors tend to occur in respect of their magnitude and frequency. The Method directs us how to treat these errors when any number of them are presented to us. No doubt there is a relation between the two, as will be pointed out in the course of the following pages; but there is nothing really to prevent us from using the same method for different laws of error, or different methods for the same law. In so doing, the question of distinct right and wrong would seldom be involved, but rather one of more or less propriety.

§ 4. The reader must understand,—as was implied in the illustration about the pistol shots,—that the ultimate problem before us is an inverse one. That is, we are supposed to have a moderate number of ‘errors’ before us and we are to undertake to say whereabouts is the centre from which they diverge. This resembles the determination of a cause from the observation of an effect. But, as mostly happens in inverse problems, we must commence with the consideration of the direct problem. In other words, so far as concerns the case before us, we shall have to begin by supposing that the ultimate object of our aim,—that is, the true centre of our curve of frequency,—is already known to us: in which case all that remains to be done is to study the consequences of taking averages of the magnitudes which constitute the errors.

§ 5. We shall, for the present, confine our remarks to what must be regarded as the typical case where considerations of Probability are concerned; viz. that in which the law of arrangement or development is of the Binomial kind. The nature of this law was explained in Chap. II., where it was shown that the frequency of the respective numbers of occurrences was regulated in accordance with the magnitude of the successive terms of the expansion of the binomial (1 + 1)ⁿ. It was also pointed out that when n becomes very great, that is, when the number of influencing circumstances is very large, and their relative individual influence correspondingly small, the form assumed by a curve drawn through the summits of ordinates representing these successive terms of the binomial tends towards that assigned by the equation

$y = A e^{-h^2x^2}$.

For all practical purposes therefore we may talk indifferently of the Binomial or Exponential law; if only on the ground that the arrangement of the actual phenomena on one or other of these two schemes would soon become indistinguishable when the numbers involved are large. But there is another ground than this. Even when the phenomena themselves represent a continuous magnitude, our measurements of them,—which are all with which we can deal,—are discontinuous. Suppose we had before us the accurate heights of a million adult men. For all practical purposes these would represent the variations of a continuous magnitude, for the differences between two successive magnitudes, especially near the mean, would be inappreciably small. But our tables will probably represent them only to the nearest inch. We have so many assigned as 69 inches; so many as 70; and so on. The tabular statement in fact is of much the same character as if we were assigning the number of ‘heads’ in a toss of a handful of pence; that is, as if we were dealing with discontinuous numbers on the binomial, rather than with a continuous magnitude on the exponential arrangement.

§ 6. Confining ourselves then, for the present, to this general head, of the binomial or exponential law, we must distinguish two separate cases in respect of the knowledge we may possess as to the generating circumstances of the variable magnitudes.

(1) There is, first, the case in which the conditions of the problem are determinable à priori: that is, where we are able to say, prior to specific experience, how frequently each combination will occur in the long run. In this case the main or ultimate object for which we are supposing that the average is employed,—i.e. that of discovering the true mean value,—is superseded. We are able to say what the mean or central value in the long run will be; and therefore there is no occasion to set about determining it, with some trouble and uncertainty, from a small number of observations. Still it is necessary to discuss this case carefully, because its assumption is a necessary link in the reasoning in other cases.

This comparatively à priori knowledge may present itself in two different degrees as respects its completeness. In the first place it may, so far as the circumstances in question are concerned, be absolutely complete. Consider the results when a handful of ten pence is repeatedly tossed up. We know precisely what the mean value is here, viz. equal division of heads and tails: we know also the chance of six heads and four tails, and so on. That is, if we had to plot out a diagram showing the relative frequency of each combination, we could do so without appealing to experience. We could draw the appropriate binomial curve from the generating conditions given in the statement of the problem.

But now consider the results of firing at a target consisting of a long and narrow strip, of which one point is marked as the centre of aim.[3] Here (assuming that there are no causes at work to produce permanent bias) we know that this centre will correspond to the mean value. And we know also, in a general way, that the dispersion on each side of this will follow a binomial law. But if we attempted to plot out the proportions, as in the preceding case, by erecting ordinates which should represent each degree of frequency as we receded further from the mean, we should find that we could not do so. Fresh data must be given or inferred. A good marksman and a bad marksman will both distribute their shot according to the same general law; but the rapidity with which the shots thin off as we recede from the centre will be different in the two cases. Another ‘constant’ is demanded before the curve of frequency could be correctly traced out.

§ 7. (2) The second division, to be next considered, corresponds for all logical purposes to the first. It comprises the cases in which though we have no à priori knowledge as to the situation about which the values will tend to cluster in the long run, yet we have sufficient experience at hand to assign it with practical certainty. Consider for instance the tables of human stature. These are often very extensive, including tens or hundreds of thousands. In such cases the mean or central value is determinable with just as great certainty as by any à priori rule. That is, if we took another hundred thousand measurements from the same class of population, we should feel secure that the average would not be altered by any magnitude which our measuring instruments could practically appreciate.

§ 8. But the mere assignment of the mean or central value does not here, any more than in the preceding case, give us all that we want to know. It might so happen that the mean height of two populations was the same, but that the law of dispersion about that mean was very different: so that a man who in one series was an exceptional giant or dwarf should, in the other, be in no wise remarkable.

To explain the process of thus determining the actual magnitude of the dispersion would demand too much mathematical detail; but some indication may be given. What we have to do is to determine the constant h in the equation[4] $y = \frac{h}{\sqrt{\pi}} e^{-h^2x^2}$. In technical language, what we have to do is to determine the modulus of this equation. The quantity $\frac{1}{h}$ in the above expression is called the modulus. It measures the degree of contraction or dispersion about the mean indicated by this equation. When it is large the dispersion is considerable; that is, the magnitudes are not closely crowded up towards the centre; when it is small they are thus crowded up. The smaller the modulus in the curve representing the thickness with which the shot-marks clustered about the centre of the target, the better the marksman.

§ 9. There are several ways of determining the modulus. In the first of the cases discussed above, where our theoretical knowledge is complete, we are able to calculate it à priori from our knowledge of the chances. We should naturally adopt this plan if we were tossing up a large handful of pence.
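As a hedged illustration of what such an à priori calculation might look like, the Python sketch below takes a handful of n pence, weights each possible number of heads by its known chance, and applies the rule that the square of the modulus is twice the mean square of the errors. Treating the à priori calculation in this way is an assumption made here for illustration, and the function name is likewise introduced here. The unit of error is one head more or fewer than the mean.

```python
import math


def binomial_modulus(n):
    """Modulus for the number of heads in n tosses, from the known chances."""
    total = 2 ** n
    mean = n / 2
    mean_square = sum(
        math.comb(n, k) * (k - mean) ** 2 for k in range(n + 1)
    ) / total
    return math.sqrt(2 * mean_square)   # modulus squared = twice the mean square


c = binomial_modulus(10)
# Central ordinate of the exponential curve, h / sqrt(pi) with h = 1 / c,
# against the exact chance of five heads and five tails.
curve_centre = 1 / (c * math.sqrt(math.pi))
exact_centre = math.comb(10, 5) / 2 ** 10

if __name__ == "__main__":
    print(c, curve_centre, exact_centre)
```

For ten pence the mean square comes out at 2.5, so the modulus is √5, and the central ordinate of the corresponding exponential curve differs from the exact binomial chance only in the third decimal place.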

The usual à posteriori plan, when we have the measurements of the magnitudes or observations before us, is this:—Take the mean square of the errors, and double this; the result gives the square of the modulus. Suppose, for instance, that we had the five magnitudes, 4, 5, 6, 7, 8. The mean of these is 6: the ‘errors’ are respectively 2, 1, 0, 1, 2. The sum of their squares is 10, and the mean square therefore 2. Doubling this, the ‘modulus squared’ is equal to $\frac{20}{5}$; i.e. the modulus is 2. Had the magnitudes been 2, 4, 6, 8, 10; representing the same mean (6) as before, but displaying a greater dispersion about it, the modulus would have been larger, viz. 4 instead of 2.
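The rule may be written out as a short Python sketch: find the mean, take the mean square of the divergences from it, double this, and extract the root. The function name `modulus` is introduced here; the two sets of five magnitudes are those of the text.

```python
import math


def modulus(values):
    """Modulus by the à posteriori rule: root of twice the mean square error."""
    mean = sum(values) / len(values)
    mean_square = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(2 * mean_square)


if __name__ == "__main__":
    print(modulus([4, 5, 6, 7, 8]))    # 2.0
    print(modulus([2, 4, 6, 8, 10]))   # 4.0
```

Doubling every divergence doubles the modulus, as one would expect of a measure of dispersion expressed in the same unit as the magnitudes themselves.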

Mr Galton's method is more of a graphical nature. It is described in a paper on Statistics by Intercomparison (Phil. Mag. 1875), and elsewhere. It may be indicated as follows. Suppose that we were dealing with a large number of measurements of human stature, and conceive that all the persons in question were marshalled in the order of their height. Select the average height, as marked by the central man of the row. Suppose him to be 69 inches. Then raise (or depress) the scale from this point until it stands at such a height as just to include one half of the men above (or below) the mean. (In practice this would be found to require about 1.71 inches: that is, one quarter of any large group of such men will fall between 69 and 70.71 inches.) Divide this number by 0.4769 and we have the modulus. In the case in question it would be equal to about 3.6 inches.
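The divisor 0.4769 is the value of t for which the probability integral equals one half, that is, the distance (in units of the modulus) within which half of all the errors fall. The Python sketch below recovers it by bisection on the standard library function `math.erf` and then applies the division described above; the function names are introduced here, and the 1.71 inches is the figure quoted in the text.

```python
import math


def quartile_constant(tol=1e-12):
    """Solve erf(t) = 1/2 by bisection; erf is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.erf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def modulus_from_quartile(quartile_distance):
    """Modulus from the distance that encloses half the cases on one side."""
    return quartile_distance / quartile_constant()


if __name__ == "__main__":
    print(round(quartile_constant(), 4))
    print(round(modulus_from_quartile(1.71), 2))
```

The result for 1.71 inches falls a little short of 3.6, which is consistent with the text's "about 3.6 inches".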

Under the assumption with which we start, viz. that the law of error displays itself in the familiar binomial form, or in some form approximating to this, the three methods indicated above will coincide in their result. Where there is any doubt on this head, or where we do not feel able to calculate beforehand what will be the rate of dispersion, we must adopt the second plan of determining the modulus. This is the only universally applicable mode of calculation: in fact that it should yield the modulus is a truth of definition; for in determining the error of mean square we are really doing nothing else than determining the modulus, as was pointed out in the last chapter.

§ 10. The position then which we have now reached is this. Taking it for granted that the Law of Error will fall into the symbolic form expressed by the equation $y = \frac{h}{\sqrt{\pi}} e^{-h^2x^2}$, we have rules at hand by which h may be determined. We therefore, for the purposes in question, know all about the curve of frequency: we can trace it out on paper: given one value,—say the central one,—we can determine any other value at any distance from this. That is, knowing how many men in a million, say, are 69 inches high, we can determine without direct observation how many will be 67, 68, 70, 71, and so on.

We can now adequately discuss the principal question of logical interest before us; viz. why do we take averages or means? What is the exact nature and amount of the advantage gained by so doing? The advanced student would of course prefer to work out the answers to these questions by appealing at once to the Law of Error in its ultimate or exponential form. But I feel convinced that the best method for those who wish to gain a clear conception of the logical nature of the process involved, is to begin by treating it as a question of combinations such as we are familiar with in elementary algebra; in other words to take a finite number of errors and to see what comes of averaging these. We can then proceed to work out arithmetically the results of combining two or more of the errors together so as to get a new series, not contenting ourselves with the general character merely of the new law of error, but actually calculating what it is in the given case. For the sake of simplicity we will not take a series with a very large number of terms in it, but it will be well to have enough of them to secure that our law of error shall roughly approximate in its form to the standard or exponential law.

For this purpose the law of error or divergence given by supposing our effort to be affected by ten causes, each of which produces an equal error, but which error is equally likely to be positive and negative (or, as it might perhaps be expressed, ‘ten equal and indifferently additive and subtractive causes’) will suffice. This is the lowest number formed according to the Binomial law, which will furnish to the eye a fair indication of the limiting or Exponential law.[5] The whole number of possible cases here is 2¹⁰ or 1024; that is, this is the number required to exhibit not only all the cases which can occur (for there are but eleven really distinct cases), but also the relative frequency with which each of these cases occurs in the long run. Of this total, 252 will be situated at the mean, representing the ‘true’ result, or that given when five of the causes of disturbance just neutralize the other five. Again, 210 will be at what we will call one unit's distance from the mean, or that given by six causes combining against four; and so on; until at the extreme distance of five places from the mean we get but one result, since in only one case out of the 1024 will all the causes combine together in the same direction. The set of 1024 efforts is therefore a fair representation of the distribution of an infinite number of such efforts. A graphical representation of the arrangement is given here.

[Figure: Binomial distribution for the tenth power]
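The eleven terms just described are the binomial coefficients of the tenth power, and they can be tabulated with the standard library function `math.comb`. The Python sketch below prints each distance from the mean with its frequency and a rough bar; the scale of the bars (one mark for every six cases) is an arbitrary choice made here.

```python
import math

N_CAUSES = 10   # ten equal causes, each equally likely to add or subtract

# counts[k] = number of cases in which k of the ten causes act positively
counts = [math.comb(N_CAUSES, k) for k in range(N_CAUSES + 1)]

if __name__ == "__main__":
    for k, c in enumerate(counts):
        distance = k - N_CAUSES // 2          # units from the mean
        print(f"{distance:+d}  {c:4d}  {'#' * (c // 6)}")
    print("total", sum(counts))
```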

§ 11. This representing a complete set of single observations or efforts, what will be the number and arrangement in the corresponding set of combined or reduced observations, say of two together? With regard to the number we must bear in mind that this is not a case of the combinations of things which cannot be repeated; for any given error, say the extreme one at F, can obviously be repeated twice running. Such a repetition would be a piece of very bad luck no doubt, but being possible it must have its place in the set. Now the possible number of ways of combining 1024 things two together, where the same thing may be repeated twice running, is 1024 × 1024 or 1048576. This then is the number in a complete cycle of the results taken two and two together.

§ 12. So much for their number; now for their arrangement or distribution. What we have to ascertain is, firstly, how many times each possible pair of observations will present itself; and, secondly, where the new results, obtained from the combination of each pair, are to be placed. With regard to the first of these enquiries;—it will be readily seen that on one occasion we shall have F repeated twice; on 20 occasions we shall have F combined with E (for F coming first we may have it followed by any one of the 10 at E, or any one of these may be followed by F); E can be repeated in 10 × 10, or 100 ways, and so on.

Now for the position of each of these reduced observations, the relative frequency of whose component elements has thus been pointed out. This is easy to determine, for when we take two errors there is (as was seen) scarcely any other mode of treatment than that of selecting the mid-point between them; this mid-point of course becoming identical with each of them when the two happen to coincide. It will be seen therefore that F will recur once on the new arrangement, viz. by its being repeated twice on the old one. G midway between E and F, will be given 20 times. E, on our new arrangement, can be got at in two ways, viz. by its being repeated twice (which will happen 100 times), and by its being obtained as the mid-point between D and F (which will happen 90 times). Hence E will occur 190 times altogether.

The reader who chooses to take the trouble may work out the frequency of all possible occurrences in this way, and if the object were simply to illustrate the principle in accordance with which they occur, this might be the best way of proceeding. But as he may soon be able to observe, and as the mathematician would at once be able to prove, the new ‘law of facility of error’ can be got at more quickly deductively, viz. by taking the successive terms of the expansion of (1 + 1)²⁰. They are given, below the line, in the figure on p. 476.
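Both routes can be followed mechanically. The Python sketch below takes every ordered pair of single observations, files the pair under the sum of its two errors (which is twice the position of the mid-point), and compares the resulting frequencies with the terms of the expansion of the twentieth power. Errors are reckoned in units from the mean, so that F stands at 5, E at 4 and D at 3; this numbering is adopted here for convenience.

```python
import math
from collections import Counter

# Frequency of each single error, in units from the mean (-5 ... +5).
single = {k - 5: math.comb(10, k) for k in range(11)}

# Frequency of each pair, keyed by the sum of the two errors.
# The mid-point of the pair lies at half this sum.
pairs = Counter()
for a, na in single.items():
    for b, nb in single.items():
        pairs[a + b] += na * nb

# The same frequencies taken directly from the expansion of (1 + 1)^20.
from_expansion = {j - 10: math.comb(20, j) for j in range(21)}

if __name__ == "__main__":
    print(pairs[10], pairs[9], pairs[8])      # F, G and E on the new arrangement
    print(dict(pairs) == from_expansion)
    print(sum(pairs.values()))
```

The first three frequencies printed are those worked out by hand in the preceding section for F, G and E, and the total is the 1048576 of the complete cycle.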

§ 13. There are two apparent obstacles to any direct comparison between the distribution of the old set of simple observations, and the new set of combined or reduced ones. In the first place, the number of the latter is much greater. This, however, is readily met by reducing them both to the same scale, that is by making the same total number of each. In the second place, half of the new positions have no representatives amongst the old, viz. those which occur midway between F and E, E and D, and so on. This can be met by the usual plan of interpolation, viz. by filling in such gaps by estimating what would have been the number at the missing points, on the same scale, had they been occupied. Draw a curve through the vertices of the ordinates at A, B, C, &c., and the lengths of the ordinates at the intermediate points will very fairly represent the corresponding frequency of the errors of those magnitudes respectively. When the gaps are thus filled up, and the numbers thus reduced to the same scale, we have a perfectly fair basis of comparison. (See figure on next page.)

Similarly we might proceed to group or ‘reduce’ three observations, or any greater number. The number of possible groupings naturally becomes very much larger, being (1024)³ when they are taken three together. As soon as we get to three or more observations, we have (as already pointed out) a variety of possible modes of treatment or reduction, of which that of taking the arithmetical mean is but one.

§ 14. The following figure is intended to illustrate the nature of the advantage secured by thus taking the arithmetical mean of several observations.

The curve ABCD represents the arrangement of a given number of ‘errors’ supposed to be disposed according to the binomial law already mentioned, when the angles have been smoothed off by drawing a curve through them. A′CD′ represents the similar arrangement of the same number when given not as simple errors, but as averages of pairs of errors. A″BD″, again, represents the similar arrangement obtained as averages of errors taken three together. They are drawn as carefully to scale as the small size of the figure permits.

[Figure: Gaussian distributions for two or three observations]

§ 15. A glance at the above figure will explain to the reader, better than any verbal description, the full significance of the statement that the result of combining two or more measurements or observations together and taking the average of them, instead of stopping short at the single elements, is to make large errors comparatively more scarce. The advantage is of the same general description as that of fishing in a lake where, of the same number of fish, there are more big and fewer little ones than in another water: of dipping in a bag where of the same number of coins there are more sovereigns and fewer shillings; and so on. The extreme importance, however, of obtaining a perfectly clear conception of the subject may render it desirable to work this out a little more fully in detail.
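The scarcity of large errors can also be put into numbers for the ten-cause example. The Python sketch below computes, exactly, the chance that the average of m observations lies three or more units from the true value; the limit of three units and the function name are choices made here for illustration. It uses the fact that the sum of m such observations follows the binomial law for 10m causes.

```python
import math
from fractions import Fraction


def chance_of_large_error(m, limit=3, n=10):
    """Chance that the mean of m observations errs by `limit` units or more.

    Each observation is the outcome of n equal causes; the sum of m of them
    is distributed as the terms of (1 + 1)^(m * n).
    """
    flips = m * n
    centre = Fraction(flips, 2)
    hits = sum(
        math.comb(flips, j)
        for j in range(flips + 1)
        if abs(j - centre) >= limit * m     # error of the mean >= limit
    )
    return Fraction(hits, 2 ** flips)


if __name__ == "__main__":
    for m in (1, 2, 3):
        p = chance_of_large_error(m)
        print(m, p, float(p))
```

For single observations the chance is 7 in 64; for averages of two and of three it falls off rapidly, which is the advantage the figure is meant to display.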

For one thing, then, it must be clearly understood that the result of a set of ‘averages’ of errors is nothing else than another set of ‘errors.’ No device can make the attainment of the true result certain; to suppose the contrary would be to misconceive the very foundations of Probability. No device even can obviate the possibility of being actually worse off as the result of our labour. The average of two, three, or any larger number of single results, may give a worse result, i.e. one further from the ultimate average, than was given by the first observation we made. We must simply fall back upon the justification that big deviations are rendered scarcer in the long run.
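How often the average proves worse than the first observation can be counted exactly in a small case. The sketch below assumes, for illustration only, the binomial law with 10 elements and an average of just two observations. It enumerates every ordered pair and asks whether the average lies further from the centre than the first member of the pair did. The variable names are introduced here and are not from the text.

```python
from fractions import Fraction
from math import comb

N = 10  # assumed number of binomial elements (1024 equally likely cases)

# Deviation from the centre -> number of cases yielding it.
count = {k - N // 2: comb(N, k) for k in range(N + 1)}
cases = sum(count.values()) ** 2  # all ordered pairs of observations

worse = better = equal = 0
for a, ca in count.items():          # a: the first observation
    for b, cb in count.items():      # b: the second observation
        # Compare |(a + b) / 2| with |a| without leaving the integers.
        avg_twice, first_twice = abs(a + b), 2 * abs(a)
        if avg_twice > first_twice:
            worse += ca * cb
        elif avg_twice < first_twice:
            better += ca * cb
        else:
            equal += ca * cb

p_worse = Fraction(worse, cases)
p_better = Fraction(better, cases)
p_equal = Fraction(equal, cases)
print(f"average worse than first observation : {float(p_worse):.4f}")
print(f"average better than first observation: {float(p_better):.4f}")
print(f"no change                            : {float(p_equal):.4f}")
```

The chance of being worse off is by no means negligible, but it is smaller than the chance of being better off, and this is all that the justification in the long run requires.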

Again; it may be pointed out that though, in the above investigation, we have spoken only of the arithmetical average as commonly understood and employed, the same general results would be obtained by resorting to almost any symmetrical and regular mode of combining our observations or errors. The two main features of the regularity displayed by the Binomial Law of facility were (1) ultimate symmetry about the central or true result, and (2) increasing relative frequency as this centre was approached. A very little consideration will show that it is no peculiar prerogative of the arithmetical mean to retain the former of these and to increase the latter. In saying this, however, a distinction must be attended to for which it will be convenient to refer to a figure.

§ 16. Suppose that O, in the line D′OD, was the point aimed at by any series of measurements; or, what comes to the same thing for our present purpose, was the ultimate average of all the measurements made. What we mean by a symmetrical arrangement of the values in regard to O, is that for every error OB, there shall be in the long run a precisely corresponding opposite one OB′; so that when we erect the ordinate BQ, indicating the frequency with which B is yielded, we must erect an equal one B′Q′. Accordingly the two halves of the curve on each side of P, viz. PQ and PQ′ are precisely alike.

[Figure: Symmetry of a distribution]

It then readily follows that the secondary curve, viz. that marking the law of frequency of the averages of two or more simple errors, will also be symmetrical. Consider any three points B, C, D: to these correspond another three B′, C′, D′. It is obvious therefore that any regular and symmetrical mode of dealing with all the groups, of which BCD is a sample, will result in symmetrical arrangement about the centre O. The ordinary familiar arithmetical average is but one out of many such modes. One way of describing it is by saying that the average of B, C, D, is assigned by choosing a point such that the sum of the squares of its distances from B, C, D, is a minimum. But we might have selected a point such that the sum of the cubes, or of the fourth powers, or of any higher powers of those distances should be a minimum. These would all yield curves resembling in a general way the dotted line in our figure. Of course there would be insuperable practical objections to any such courses as these; for the labour of calculation would be enormous, and the results so far from being better would be worse than those afforded by the employment of the ordinary average. But so far as concerns the general principle of dealing with discordant and erroneous results, it must be remembered that the familiar average is but one out of innumerable possible resources, all of which would yield the same sort of help.
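The family of ‘averages’ just described can be illustrated by a short computation. This is a sketch only: the function `centre`, the sample positions 1, 2, 6 for B, C, D, and the use of a ternary search are all assumptions introduced here. The search relies on the sum of the powers of the distances being a convex function of the chosen point for powers of 2 and upwards.

```python
def centre(points, power, tol=1e-9):
    """Locate the point minimizing sum(|x - p| ** power) by ternary search.

    Assumes power >= 2, so that the cost is convex with a single minimum
    lying between the smallest and the largest of the given points.
    """
    def cost(x):
        return sum(abs(x - p) ** power for p in points)

    lo, hi = float(min(points)), float(max(points))
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) < cost(m2):
            hi = m2   # the minimum cannot lie to the right of m2
        else:
            lo = m1   # the minimum cannot lie to the left of m1
    return (lo + hi) / 2


bcd = [1.0, 2.0, 6.0]            # hypothetical positions of B, C, D
mirror = [-x for x in bcd]       # the corresponding B', C', D'

for power in (2, 3, 4):
    print(power, round(centre(bcd, power), 4), round(centre(mirror, power), 4))
```

With the second power the point found is the ordinary arithmetical mean. With the higher powers it is a different point, but in every case the mirrored group yields the mirrored result, and this is what secures the symmetry of the secondary curve.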

§ 17. Once more. We saw that a resort to the average had the effect of ‘humping up’ our curve more towards the centre, expressive of the fact that the errors of averages are of a better, i.e. smaller kind. But it must be noticed that exactly the same characteristics will follow, as a general rule, from any other such mode of dealing with the individual errors. No strict proof of this fact can be given here, but a reference to one of the familiar results of taking combinations of things will show whence this tendency arises. Extreme results, as yielded by an average of any kind, can only be got in one way, viz. by repetitions of extremes in the individuals from which the averages were obtained. But intermediate results can be got at in two ways, viz. either by intermediate individuals, or by combinations of individuals in opposite directions. In the case of the Binomial Law of Error this tendency to thicken towards the centre was already strongly predominant in the individual values before we took them in hand for our average; but owing to this characteristic of combinations we may lay it down (broadly speaking) that any sort of average applied to any sort of law of distribution will give a result which bears the same general relation to the individual values that the dotted lines above bear to the black line.[6]
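The appeal to combinations can be made concrete by simple counting. As an assumed illustration, take five equally frequent error values, −2 to 2, so that the individual values show no tendency whatever to thicken towards the centre, and count the number of ways in which each average of two can arise. The names `values` and `ways` are introduced here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed flat law of error: each of these values is equally frequent.
values = range(-2, 3)

# Number of ordered pairs (a, b) that yield each possible average.
ways = Counter(Fraction(a + b, 2) for a, b in product(values, repeat=2))

for avg in sorted(ways):
    print(f"{float(avg):5.1f}  {ways[avg]} way(s)  {'*' * ways[avg]}")
```

Each extreme average arises in one way only, by a repetition of the extreme, while the central average arises in five ways; so that even a flat arrangement of individual values is humped up towards the centre by the process of averaging.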

§ 18. This being so, the speculative advantages of one method of combining, or averaging, or reducing, our observations, over another method,—irrespective, that is, of the practical conveniences in carrying them out,—will consist solely in the degree of rapidity with which it tends thus to cluster the result about the centre. We shall have to subject this merit to a somewhat further analysis, but for the present purpose it will suffice to say that if one kind of average gave the higher dotted line in the figure on p. 479 and another gave the lower dotted line, we should say that the former was the better one. The advantage is of the same general kind as that which is furnished in algebraical calculation, by a series which converges rapidly towards the true value as compared with one which converges slowly. We can do the work sooner or later by the aid of either; but we get nearer the truth by the same amount of labour, or get as near by a less amount of labour, on one plan than on the other.

As we are here considering the case in which the individual observations are supposed to be grouped in accordance with the Binomial Law, it will suffice to say that in this case there is no doubt that the arithmetical average is not only the simplest and easiest to deal with, but is also the best in the above sense of the term. And since this Binomial Law, or something approximating to it, is of very wide prevalence, a strong primâ facie case is made out for the general employment of the familiar average.
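The superiority claimed for the arithmetical average under the Binomial Law may be checked in one small case. The sketch below is an illustration under assumptions of my own choosing, not the author's procedure: the binomial law is taken with 10 elements, observations are taken three together, the rival mode of reduction is the median of the three, and the closeness of clustering is measured by the mean square of the resulting error.

```python
from itertools import product
from math import comb
from statistics import median

N = 10  # assumed number of binomial elements

# Deviation from the centre -> probability under the binomial law.
pmf = {k - N // 2: comb(N, k) / 2 ** N for k in range(N + 1)}


def mean_square(reducer):
    """Mean square error of `reducer` applied to three independent errors."""
    total = 0.0
    for triple in product(pmf, repeat=3):
        weight = pmf[triple[0]] * pmf[triple[1]] * pmf[triple[2]]
        total += weight * reducer(triple) ** 2
    return total


ms_single = sum(p * x * x for x, p in pmf.items())
ms_mean = mean_square(lambda t: sum(t) / 3)
ms_median = mean_square(median)

print(f"single observation : {ms_single:.4f}")
print(f"mean of three      : {ms_mean:.4f}")
print(f"median of three    : {ms_median:.4f}")
```

Both modes of reduction improve upon the single observation, but in this case the arithmetical mean clusters its results the more closely of the two.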

§ 19. The analysis of a few pages back carried the results of the averaging process as far as could be conveniently done by the help of mere arithmetic. To go further we must appeal to higher mathematics, but the following indication of the sort of results obtained will suffice for our present purpose. After all, the successive steps, though demanding intricate reasoning for their proof, are nothing more than generalizations of processes which could be established by simple arithmetic.[7] Briefly, what we do is this:—

(1) We first extend the proof from the binomial form, with its finite number of elements, to the limiting or exponential form. Instead of confining ourselves to a small number of discrete errors, we then recognize the possibility of any number of errors of any magnitude whatever.

(2) In the next place, instead of confining ourselves to the consideration of an average of two or three only,—already, as we have seen, a tedious piece of arithmetic,—we calculate the result of an average of any number, n. The actual result is extremely simple. If the modulus of the single errors is c, that of the average of n of these will be c ÷ √n.

(3) Finally we draw similar conclusions in reference to the sum or difference of two averages of any numbers. Suppose, for instance, that m errors were first taken and averaged, and then n similarly taken and averaged. These averages will be nearly, but not quite, equal. Their sum or difference,—these, of course, are indistinguishable in the end, since positive and negative errors are supposed to be equal and opposite,—will itself be an ‘error’, every magnitude of which will have a certain assignable probability or facility of occurrence. What we do is to assign the modulus of these errors. The actual result again is simple. If c had been the modulus of the single errors, that of the sum or difference of the averages of m and n of them will be