Standard deviation and average are poor statistical measures of latency - Journal of Omnifarious

Nov. 24th, 2008

10:39 am - Standard deviation and average are poor statistical measures of latency

I've noticed that ping and a few other similar utilities that measure network latency have begun to include an interesting statistic. They show the standard deviation of all the latencies gathered from each individual ping packet. I think this is bad statistics.

If I am not mistaken standard deviation is based on the idea that your sample set follows a normal distribution, a bell-curve. Network ping times do not follow this distribution. I would guess that network ping times follow a power-law curve in which the majority of ping times hover just above the theoretical minimum value for the path, with increasingly rare outliers arbitrarily far from that value.

It would be nice to have some sort of statistical measure that more accurately reflected this kind of distribution. Perhaps something like a measure of how shallow the curve is. The shallower it is, the more uncertainty there is.

That also means that the mean ping time is a poor measure. There should be some measure of a power law curve where you can guess that 50% of the values would be below and 50% would be above.

The reason I'm guessing that ping times follow a power law curve is that I remember seeing research showing that network traffic burstiness displays scale-invariant properties. Basically, a measure of traffic spikes looked approximately the same at almost any scale you wanted to examine. Scale invariance, fractal patterns and power law curves are strongly related.

And this brings to mind another issue. Given the widespread applicability of Benford's Law, it's clear that scale invariance is a property of many statistical sample sets. Yet it seems that standard bell-curve distributions are considered the default. IMHO, power law curve based statistics are what should be taught in High School, not the mean/mode/median/standard-deviation 'normal distribution' based statistics that are currently taught.

Incidentally, the widespread applicability of Benford's Law also lends even more support to the already overwhelming evidence that scale invariance is the default property of almost any network, a hypothesis that is thoroughly explored in Linked: The New Science of Networks.

Current Mood: contemplative

Comments:

From:esoterrica
Date:November 24th, 2008 07:32 pm (UTC)

Teaching moment ahoy!

(Link)
If I am not mistaken standard deviation is based on the idea that your sample set follows a normal distribution, a bell-curve.

Not so--standard deviation is the square root of the variance, which is a property of all parametric statistical distributions and can be calculated for any sample. Perhaps the power-law curve could be expressed by an exponential distribution? Exponential distributions are commonly used for problems which involve waiting for something to occur, and an exponential distribution would be a good theoretical fit for the power-law curve. The variance of an exponential random variable is the square of its expected value, so the standard deviation you mentioned would also be the mean--and now you can express ping time with a well-known and well-behaved distribution!

There should be some measure of a power law curve where you can guess that 50% of the values would be below and 50% would be above.

That would be the median, which is also on the Wikipedia page. If you know the mean and assume an exponential distribution, you can calculate this.
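To make that last step concrete, here is a minimal Python sketch of what an exponential assumption implies once you know the mean. The function name `exponential_summary` and the 100 ms figure are made up for illustration; the formulas are the standard ones for an exponential distribution (standard deviation equal to the mean, median equal to the mean times ln 2). Whether real ping times fit this model is a separate question.

```python
import math

def exponential_summary(mean_ms):
    """Population summaries implied by an exponential distribution with this mean."""
    if mean_ms <= 0:
        raise ValueError("mean must be positive")
    return {
        "mean": mean_ms,
        # The variance of an exponential is the mean squared, so the
        # standard deviation equals the mean.
        "std_dev": mean_ms,
        # The median solves 1 - exp(-x / mean) = 0.5, giving mean * ln(2).
        "median": mean_ms * math.log(2),
    }

# Hypothetical mean ping time of 100 ms.
print(exponential_summary(100.0))
```

Under this assumption the median always sits at roughly 69% of the mean, so half of all pings would be expected to come back faster than that.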

IMHO, power law curve based statistics are what should be taught in High School, not the mean/mode/median/standard-deviation 'normal distribution' based statistics that are currently taught.

I did not encounter statistics until I hit undergrad. I think you are mixing populations and samples--mode is most often considered when dealing with sample summaries, and mean, median and SD can be calculated for any sample, and also well-behaved distributions. (If you want to see a wonky distribution, check out the Cauchy distribution.) In order to understand distributions you have to be fluent in multivariate calculus, and to really understand statistical theory you have to know measure theory, so I would be surprised if any high school course moved beyond calculating summary statistics.
(Reply) (Thread)
[User Picture]
From:omnifarious
Date:November 25th, 2008 12:59 am (UTC)

Re: Teaching moment ahoy!

(Link)

Thanks for pointing me at that wikipedia page. :-) And I appreciate your response, but sometimes it seems like you assume I know very little, and at other times it seems like you assume I know a lot more than I do, and it's sort of frustrating to read, so my response probably sounds a little rough.

I have never taken a statistics course myself. And I agree with you that 'mode' has nothing really to do with what I'm interested in. I only talked about it because I remember it from the basic (and rather lacking in explanations of fundamentals) statistics I had in HS as part of math. It was one of the measures we were told how to derive, along with the mean, the median and the standard deviation (though the latter was considered a bit advanced).

I was unaware that the 'mean' and 'median' were concepts that existed apart from the standard bell-curve distribution. I note that the method of computing the mean of an exponential distribution is not by adding together all your samples and dividing by the number of samples. So while the concept is the same, the method of computing it is different from how you would do so for a standard distribution.

What I meant was that the most useful distribution to teach people the rote means of computing the various parameters for (like the mean) is the exponential distribution. They don't have to understand it in detail, they just have to know how to compute the parameters and some rough idea of what the parameter means for the expected values.



Edited at 2008-11-25 01:02 am (UTC)
(Reply) (Parent) (Thread)
From:esoterrica
Date:November 25th, 2008 01:50 am (UTC)

Re: Teaching moment ahoy!

(Link)
I hope this doesn't sound rude, but I still think you are missing the difference between the sample and the population. (I apologize if I am explaining this on a lower level than I should, but it's a crucial concept in statistics!) I know next to nothing about networks, but I will try to use the ping time example to illustrate this completely. In this case the population of interest would be all ping times for a specific bit of hardware/software/network (can you tell I'm clueless?) over a month, or the lifetime of the network in question, maybe. The sample would be the ping times observed in a day, or two days, or however many days you want to spend collecting data. From the sample data, just a list of numbers, you could calculate the mean, median, mode, and variance/standard deviation. Calculation of these descriptive statistics does not require any distributional assumptions--add up the ping times and divide by the number observed for the mean, pick the middle observation from a list of ordered observations for the median, and so on. These are your sample statistics, and they are calculated the same way for every sample, regardless of what assumptions you make about the underlying distribution.
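The sample calculations described above can be sketched in a few lines of Python. The ping values here are invented for illustration (mostly near 207 ms with two slow outliers), and the function names are just labels. Nothing in the code assumes any particular distribution.

```python
from collections import Counter

def sample_mean(values):
    # Add up the observations and divide by how many there are.
    return sum(values) / len(values)

def sample_median(values):
    # Middle observation of the ordered list (average of the two middle
    # ones when the count is even).
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def sample_mode(values):
    # Most frequently observed value.
    return Counter(values).most_common(1)[0][0]

def sample_std_dev(values):
    # Square root of the sample variance (n - 1 in the denominator).
    m = sample_mean(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

# Made-up ping times in milliseconds.
pings_ms = [207, 207, 208, 207, 209, 207, 450, 208, 207, 1200]

print("mean   ", sample_mean(pings_ms))
print("median ", sample_median(pings_ms))
print("mode   ", sample_mode(pings_ms))
print("std dev", sample_std_dev(pings_ms))
```

With these numbers the two outliers pull the mean well above the median, which stays close to the typical ping time.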

From what you now know about the sample you could make some guesses about the distribution. Since we are dealing with time, all possible ping times are greater than zero. You have hypothesized that Benford's Law makes sense in this case, so the probability of observing a low ping time is quite high, and there are few extreme ping times. Finally, you are waiting for something to occur. The exponential distribution fits these criteria. Assuming you know everything about the distribution is a pretty big jump, but not an unreasonable one in this case. The exponential distribution is characterized by a particular density function, seen here. With the density function you can use calculus to come up with expressions for the population mean (aka expected value), variance (standard deviation squared), median, and other population summaries. These values will probably not match the sample values unless the sample is pretty big.
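A quick simulation can illustrate the last point, that sample values only approach the population values as the sample grows. This is a sketch under the assumption that ping times really are exponential with a mean of 100 ms; both the assumption and the helper name `compare_sample_to_population` are illustrative only.

```python
import math
import random
import statistics

def compare_sample_to_population(mean_ms, sample_size, seed=0):
    """Draw a simulated exponential sample and set its statistics next to the population values."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    sample = [rng.expovariate(1.0 / mean_ms) for _ in range(sample_size)]
    return {
        "sample_mean": statistics.fmean(sample),
        "sample_median": statistics.median(sample),
        "population_mean": mean_ms,
        "population_median": mean_ms * math.log(2),
    }

# Small samples can land far from the population values; large ones get close.
for n in (10, 1000, 100000):
    print(n, compare_sample_to_population(100.0, n))
```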

I hope that made some sort of sense...

Edited at 2008-11-25 01:51 am (UTC)
(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:November 25th, 2008 06:27 am (UTC)

Re: Teaching moment ahoy!

(Link)

Ahh, that makes a whole lot of sense. Thank you for explaining.

My basic complaint is that all that is shown to someone using these programs is the sample analysis. I think it would be much more useful to show an analysis of a particular distribution assuming that the samples fit it.

Ping time basically measures how long it will take for a packet sent from one computer to reach the destination, be sent back and reach the original computer. The amount of time it takes information to reach one computer from another could be guessed to be half the ping time.

Ping time is affected by the traffic load between the source and destination. I am guessing that it fits an exponential distribution because the load over time tends to.

I guess one problem would be that the samples will not always fit that distribution in all networking situations. That might be hard for a program to notice and account for, especially after only a few samples.

But telling me the sample mean isn't very helpful because it doesn't really do a very good job of giving me a solid prediction I can use. And the standard deviation is frequently larger than the sample mean, which tells me that it's also utterly useless for making any decent predictions about what will happen.

If no distribution is assumed, telling me the sample median would be much more helpful. And then it would be most useful to come up with some kind of a variance measure that is bi-valued (i.e. -x +y) because it will likely vary between a little lower than the median and a lot higher.
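One way to sketch such a bi-valued measure is to report the median together with its distance down to a low percentile and up to a high percentile. The choice of the 10th and 90th percentiles, the nearest-rank percentile method, the function names, and the sample data are all assumptions made for illustration, not anything ping itself reports.

```python
import math

def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted list (0 < fraction <= 1)."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]

def median_with_spread(values, low=0.10, high=0.90):
    """Return (median, distance down to the low percentile, distance up to the high one)."""
    ordered = sorted(values)
    mid = percentile(ordered, 0.50)
    return mid, mid - percentile(ordered, low), percentile(ordered, high) - mid

# Made-up ping times in milliseconds.
pings_ms = [207, 207, 208, 207, 209, 207, 450, 208, 207, 1200]
mid, minus, plus = median_with_spread(pings_ms)
print(f"{mid} ms  -{minus} +{plus}")
```

For this sample the result is 207 ms with a spread of -0 +243, which captures the "a little lower, a lot higher" shape. Note that the nearest-rank method picks an actual observation, so its median can differ slightly from the averaged-middle median when the sample size is even.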

(Reply) (Parent) (Thread)
[User Picture]
From:sparklewench
Date:November 25th, 2008 09:58 pm (UTC)

Re: Teaching moment ahoy!

(Link)
I thought that was a great response. It seems a little defensive to say she's assuming you know very little, and then admit not to having had statistics. I mean, why would you expect of yourself mastery of a topic you haven't studied?

I was interested to reply but hesitant because of the whole issue of guys spouting off and feeling inferior if I have something to say that corrects them based on my education. That's a topic in my life right now, not just in reference to this interaction.

All that said, I think school children should be taught to visualize population curves and distributions as a matter of course in grade school. As soon as they can graph, they can begin to conceptualize groups of measurements. I think our society would be better if people could converse at that level of abstraction. For example, I am a lefty liberal who is against the minimum wage concept, or rent control. Why? I want to cut off the lower tail of poverty levels rather than shift the mean up. If we put the hump higher up, we still have tails into the very low (income level) region. Minimum wage and rent control simply shift the distribution up and maybe reduce absolute numbers in those tails on the graph, but don't get at the core problem they try to address. blah blah blah.

If we had to pick a single curve to work with, I would want to teach a Gaussian. The beauty of bell-shaped curves is that they typically represent what we hope for on a grade distribution, so students naturally relate to them more. Plus, symmetry makes people comfy.

So much of math is about being comfortable with the topics. Having the confidence to be wrong, a safe space to be wrong and learn, not be ridiculed. Emotional stuff that gets in the way.

(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:November 26th, 2008 06:02 am (UTC)

Re: Teaching moment ahoy!

(Link)

I thought that was a great response. It seems a little defensive to say she's assuming you know very little, and then admit not to having had statistics. I mean, why would you expect of yourself mastery of a topic you haven't studied?

Well, actually I wouldn't have minded if it was all pitched at a level that was below me. What I found frustrating was the inconsistency of the response. I felt like she chose to tell me that mode wasn't in the same class as median and mean, then assumed I understood the difference between stuff about the sample data vs. stuff about the population as a whole and how that difference applied to what I was saying.

If you noticed, she was much more careful the second time and pitched it all at a level that made sense to me. And it's clear she knows more about and understands statistics better than I do, and it's OK with me.

I agree with you that kids in high school and possibly even elementary school should be taught about population distributions. It's not that hard a concept.

(Reply) (Parent) (Thread)
[User Picture]
From:foxfirefey
Date:November 25th, 2008 04:14 am (UTC)

Re: Teaching moment ahoy!

(Link)
Oh baby. Talk statistically to me.
(Reply) (Parent) (Thread)
From:esoterrica
Date:November 25th, 2008 04:46 am (UTC)

Re: Teaching moment ahoy!

(Link)
I looked around on the intarwebs for statistical pickup lines but none could match the mathematical beauty of: I wish I was your derivative so I could lie tangent to your curves.
(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:November 25th, 2008 06:33 am (UTC)

Re: Teaching moment ahoy!

(Link)

My favorite mixture of the mathematical and the erotic has to be "The Cyberiad" by Stanislaw Lem. But, it isn't exactly a pick-up line. :-)

(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:November 25th, 2008 06:28 am (UTC)

Re: Teaching moment ahoy!

(Link)

*chuckle*

(Reply) (Parent) (Thread)
[User Picture]
From:iceprincess1010
Date:November 29th, 2008 04:35 am (UTC)
(Link)
Another thing you may want to look at is your CI. This would help you to find out how accurate it truly is. Obviously, if you only have a CI of .10 compared to a CI of .05 or even .005 then you would have a huge difference there and it would be considered inaccurate.
(Reply) (Thread)
[User Picture]
From:omnifarious
Date:November 30th, 2008 08:10 pm (UTC)
(Link)

What's CI? :-) I am not particularly well-versed in statistics.

(Reply) (Parent) (Thread)
[User Picture]
From:iceprincess1010
Date:December 1st, 2008 04:14 am (UTC)
(Link)
Confidence Intervals. Also, things like the sample size and how it was chosen to determine the ping interval would be good. Where/how was it originally determined, etc. You are testing for an effect so you have to know this and what you WANT it to be. Inferences concerning the variance using chi-squared could help although I'm not too familiar with exactly what you are looking for. You have to determine your df with the current information first.
(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:December 1st, 2008 04:17 am (UTC)

df?

(Link)

By 'df', do you mean distribution function?

(Reply) (Parent) (Thread)
[User Picture]
From:iceprincess1010
Date:December 1st, 2008 04:18 am (UTC)

Re: df?

(Link)
no, sorry, degrees of freedom
(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:December 1st, 2008 04:20 am (UTC)

Re: df?

(Link)

If I gave you a bunch of data, would you be willing to plough through it and tell me some interesting and useful statistical details I could glean from it and similar data sets?

(Reply) (Parent) (Thread)
[User Picture]
From:iceprincess1010
Date:December 1st, 2008 04:22 am (UTC)

Re: df?

(Link)
I could try, it would probably take me a week because this is finals week so I've been trying to study a bit but yeah.
(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:December 16th, 2008 08:32 pm (UTC)

Re: df?

(Link)

I have the data now. The easiest form for me to give it to you is to give you a URL to some ASCII files that look like this:

60.000000 207
120.000000 207
180.000000 207
240.000000 207
300.000000 207
360.000000 207
420.000000 207
480.000000 207
540.000000 208

The first value is the time since t0 at which the ping was sent, and the second value is how many milliseconds the ping took to get back. One thing I'm not sure how to deal with is pings that don't come back. I'm sure that data might be useful, but there is no good way to represent it in that format.
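For what it's worth, a file in that two-column format could be read with something like the Python sketch below. The handling of lost pings is purely a hypothetical convention, not part of the format described above: it assumes a lost ping would be written as a line with a send time and no second value, and stores it as `None`. The function names and the short example text (including its lost-ping line) are invented for illustration.

```python
def parse_ping_log(text):
    """Parse lines of '<time since t0> <round trip in ms>' into (time, latency) pairs."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sent_at = float(fields[0])
        # Assumption: a lost ping is a line with a send time but no latency.
        latency_ms = int(fields[1]) if len(fields) > 1 else None
        samples.append((sent_at, latency_ms))
    return samples

def loss_rate(samples):
    """Fraction of pings that never came back."""
    if not samples:
        return 0.0
    lost = sum(1 for _, latency in samples if latency is None)
    return lost / len(samples)

# Invented example; the third line shows the hypothetical lost-ping convention.
example = """60.000000 207
120.000000 207
180.000000
240.000000 208
"""

samples = parse_ping_log(example)
print(samples)
print("loss rate:", loss_rate(samples))
```

Keeping lost pings as `None` rather than dropping them means the loss rate can be reported alongside the latency statistics, which would be computed only over the pings that did return.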

(Reply) (Parent) (Thread)