Message Archiving Benchmark: How Many Letters Are in Messages?

Let’s look at the distribution of the number of letters in a message’s body. Note that this is not the byte length; it is the number of Unicode symbols. Cyrillic characters take 2 bytes each in UTF-8, so some messages can be up to twice as long in bytes. Also, AFAIK English sentences are generally shorter than Russian ones, so the average message length should be lower on servers with English-speaking users.
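To make the difference between the two lengths concrete, here is a minimal Python sketch. It assumes that a "Unicode symbol" means a code point, which is what Python's `len()` counts for a `str`; the helper names are made up for this illustration and are not part of the benchmark.

```python
def length_in_symbols(text: str) -> int:
    """Number of Unicode code points in the message body (assumed metric)."""
    return len(text)


def length_in_bytes(text: str) -> int:
    """Number of bytes the same body occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


english = "Hello"
russian = "Привет"

for body in (english, russian):
    # ASCII letters take 1 byte each, Cyrillic letters take 2 bytes each.
    print(body, length_in_symbols(body), length_in_bytes(body))
```

For a body that mixes Cyrillic with ASCII characters such as spaces, digits and punctuation, the byte length falls somewhere between one and two times the symbol count.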

Messages: 474,562
Total letters: 19,152,806
Min: 1
Max: 35,079
Range: 35,078
Interquartile range: 26
Mean: 40.4
Median: 18
Mode: 3
Standard deviation: 255.3
Quartile deviation: 13
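The same set of figures can be computed from a list of message lengths with Python's standard `statistics` module. This is only a sketch: the `summarize` function and the sample data are made up, and the quartile method and the use of the population standard deviation are assumptions, since it is not stated how the numbers above were computed.

```python
import statistics


def summarize(lengths):
    """Summary statistics for a list of message lengths (in symbols)."""
    # Quartile conventions differ between tools; "inclusive" is an assumption.
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "messages": len(lengths),
        "total_letters": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "range": max(lengths) - min(lengths),
        "interquartile_range": iqr,
        "mean": statistics.fmean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        # Population standard deviation; the sample version is statistics.stdev.
        "standard_deviation": statistics.pstdev(lengths),
        # Quartile deviation is half of the interquartile range.
        "quartile_deviation": iqr / 2,
    }


# Made-up sample with one very long message, to mimic the heavy tail.
sample = [1, 3, 3, 5, 8, 13, 21, 34, 400]
for name, value in summarize(sample).items():
    print(f"{name}: {value}")
```

Two of the reported values follow directly from the others: the mean is the total number of letters divided by the number of messages (19,152,806 / 474,562 ≈ 40.4), and the quartile deviation is half of the interquartile range (26 / 2 = 13).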

Here is the histogram of the length distribution.
[image: message length histogram]

It is well known that the number of letters per word and the number of words per sentence are approximately log-normally distributed, so it is no wonder that this distribution also looks log-normal. The green line plots the probability density function (PDF) of Log-N(2.83, 1.15), and you can see that it fits the actual data pretty well.
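As a rough sanity check of the fit, here is a small Python sketch of the log-normal PDF and of the median, mode and mean it implies. It assumes that 2.83 and 1.15 are the mean and standard deviation of the natural logarithm of the length (not the variance); the function names are made up for this illustration.

```python
import math

MU = 2.83     # assumed: mean of ln(length)
SIGMA = 1.15  # assumed: standard deviation of ln(length)


def lognormal_pdf(x, mu=MU, sigma=SIGMA):
    """Probability density of Log-N(mu, sigma) at x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


def lognormal_median(mu=MU):
    return math.exp(mu)


def lognormal_mode(mu=MU, sigma=SIGMA):
    return math.exp(mu - sigma ** 2)


def lognormal_mean(mu=MU, sigma=SIGMA):
    return math.exp(mu + sigma ** 2 / 2)


print("median:", round(lognormal_median(), 1))
print("mode:  ", round(lognormal_mode(), 1))
print("mean:  ", round(lognormal_mean(), 1))
print("pdf(18):", lognormal_pdf(18))
```

Under these assumptions the fitted median is about 17, close to the observed median of 18, while the fitted mean (about 33) is below the observed 40.4, which is what a few very long messages in the tail would do.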

