Let’s look at the distribution of the number of letters in a message body. Note that this is not the byte length but the number of Unicode characters. Cyrillic characters take 2 bytes each in UTF-8, so some messages can be up to twice as long in bytes. Also, AFAIK English sentences are generally shorter than Russian ones, so the average message length should be lower on servers with English-speaking users.
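To make the difference between letters and bytes concrete, here is a minimal sketch in Python. The helper name `char_and_byte_length` is made up for this illustration, and `len()` on a Python string counts Unicode code points, which is assumed to match the "letters" counted here.

```python
def char_and_byte_length(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# ASCII letters take 1 byte each in UTF-8.
print(char_and_byte_length("hello"))   # (5, 5)

# Cyrillic letters take 2 bytes each in UTF-8.
print(char_and_byte_length("привет"))  # (6, 12)
```

A message that mixes Cyrillic with ASCII spaces, digits and punctuation lands somewhere between 1 and 2 bytes per character.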
Messages: 474,562
Total letters: 19,152,806
Min: 1
Max: 35,079
Range: 35,078
Interquartile range: 26
Mean: 40.4
Median: 18
Mode: 3
Standard deviation: 255.3
Quartile deviation: 13
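For reference, numbers like these can be computed from a plain list of message lengths with Python's standard `statistics` module. This is only a sketch: the `summarize` function and the sample list are hypothetical, and the quartile method ("inclusive") and the use of the population standard deviation are assumptions, since different tools make different choices there.

```python
import statistics


def summarize(lengths):
    """Return summary statistics for a list of message lengths (in characters)."""
    # Assumption: "inclusive" quartiles; other methods give slightly different values.
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "messages": len(lengths),
        "total_letters": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "range": max(lengths) - min(lengths),
        "interquartile_range": iqr,
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        # Assumption: population (not sample) standard deviation.
        "standard_deviation": statistics.pstdev(lengths),
        # Quartile deviation is half of the interquartile range.
        "quartile_deviation": iqr / 2,
    }


# Hypothetical sample, not the real data set.
sample = [3, 3, 1, 18, 40, 7, 120]
for name, value in summarize(sample).items():
    print(f"{name}: {value}")
```

Note how the quartile deviation is simply half of the interquartile range, which matches the 26 and 13 reported above.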
Here is a histogram of the message length distribution.
It is well known that the number of letters per word and the number of words per sentence are approximately log-normally distributed, so it is no wonder that this distribution also looks log-normal. The green line plots the probability density function (PDF) of Log-N(2.83, 1.15), and you can see that it fits the actual data pretty well.
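The fitted curve can be reproduced with a few lines of Python. This is a sketch under one assumption: that the parameters of Log-N(2.83, 1.15) are the mean and the standard deviation of the logarithm of the length (if 1.15 is the variance instead, pass its square root as `sigma`). The function name `lognormal_pdf` is made up for this example.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """PDF of a log-normal distribution; mu and sigma describe ln(x)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


# Assumption: 2.83 is mu and 1.15 is sigma (standard deviation of ln(length)).
MU, SIGMA = 2.83, 1.15

# Closed-form median and mode of a log-normal distribution.
median = math.exp(MU)
mode = math.exp(MU - SIGMA ** 2)
print(f"median of the fit: {median:.1f}")
print(f"mode of the fit:   {mode:.1f}")

# Density at a few message lengths.
for length in (1, 3, 18, 40, 100):
    print(length, lognormal_pdf(length, MU, SIGMA))
```

Under that assumption the median of the fitted distribution is roughly 17 letters, close to the observed median of 18.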