
Message Archiving Benchmark: How Many Letters Are in Messages?

Posted by Alexey Shchepin on 08 Jul 2007 at 19:58
Let's look at the distribution of the number of letters in a message body. Note that this is not the byte length but the number of Unicode characters. Cyrillic characters take 2 bytes each in UTF-8, so some messages can be up to twice as long in bytes. Also, AFAIK English sentences are generally shorter than Russian ones, so the average message length should be lower on servers with English-speaking users.
Messages: 474,562
Total letters: 19,152,806
Min: 1
Max: 35,079
Range: 35,078
Interquartile range: 26
Mean: 40.4
Median: 18
Mode: 3
Standard deviation: 255.3
Quartile deviation: 13
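
As a rough illustration of how a summary like the one above could be produced, here is a minimal Python sketch. It is not the code used for this benchmark: the function name `length_summary` and the input (a list of message bodies as strings) are assumptions, and the quartile method and the choice of population standard deviation may differ from what was actually used. Lengths are counted in Unicode code points, not bytes.

```python
import statistics

def length_summary(bodies):
    """Summarise message lengths measured in Unicode code points.

    `bodies` is a hypothetical list of message body strings (at least two).
    """
    # len() on a Python str counts code points, not UTF-8 bytes.
    lengths = sorted(len(body) for body in bodies)
    # Quartiles with the default 'exclusive' method (an assumption).
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    iqr = q3 - q1
    return {
        "messages": len(lengths),
        "total_letters": sum(lengths),
        "min": lengths[0],
        "max": lengths[-1],
        "range": lengths[-1] - lengths[0],
        "interquartile_range": iqr,
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        "stdev": statistics.pstdev(lengths),  # population form, an assumption
        "quartile_deviation": iqr / 2,        # half the interquartile range
    }
```

To get the byte length instead, one would measure `len(body.encode("utf-8"))`, which for a purely Cyrillic message is twice the code point count.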

Here is a histogram of the message length distribution.
[image: message length histogram]
It is well known that the number of letters per word and the number of words per sentence are log-normally distributed, so it is no wonder that this distribution is also log-normal. The green line plots the probability density function (PDF) of Log-N(2.83, 1.15), and you can see that it fits the actual data pretty well.
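
For readers who want to reproduce the fitted curve, here is a minimal Python sketch of the log-normal PDF with the parameters quoted above. It assumes that 1.15 is the standard deviation of the logarithm of the length (not its variance); the post does not say which, and the function names are made up for illustration.

```python
import math

MU = 2.83     # location parameter quoted in the post
SIGMA = 1.15  # ASSUMED to be the standard deviation of ln(length), not the variance

def lognormal_pdf(x, mu=MU, sigma=SIGMA):
    """Density of Log-N(mu, sigma) at x; zero for non-positive x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def lognormal_summary(mu=MU, sigma=SIGMA):
    """Closed-form median, mean and mode of Log-N(mu, sigma)."""
    return {
        "median": math.exp(mu),
        "mean": math.exp(mu + sigma ** 2 / 2),
        "mode": math.exp(mu - sigma ** 2),
    }
```

Under that assumption the fitted distribution has a median of roughly 17, close to the observed median of 18. Its mean and mode will not match the sample values exactly, since a handful of very long messages pulls the sample mean and standard deviation upward.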



Comments


Did you also update the implementation to the latest version of the XEP, and if so, is there a test server available that client developers can use to implement support?

Posted by Sander on 09 Jul 2007 at 00:51

Yes, it should be updated, and no servers with it are available yet, I think.

Posted by Alexey Shchepin on 10 Jul 2007 at 22:09
