As part of this effort, I wanted to use that sample data to get a better idea of how the various headers in request and response messages compare with one another and how much each contributes to the overall size of a message. Given a sample, how often is any single header used, how many bytes do its values consume on average, and so on? I've written a script that reads the sample files and outputs some rather interesting bits of data.
The results can be found here: https://github.com/jasnell/compression-test/tree/master/counts
Looking at the results for the Amazon.com sample data, for instance, we see that given a sample of 366 HTTP Response messages, a total of 110,743 bytes are consumed by Header values (just the values, not the header names and message formatting overhead). In these messages, 44 unique headers are used. Let's take a sampling of some of those headers...
p3p:
Instances: 52
Total: 7610
Average: 146.00
Low: 28
High: 238
Percent of Total Size: 6.871766
Percent of Total Count: 0.967622
The "p3p" header appears 52 times and has a highly variable length, ranging from 28 to 238 bytes. While it makes up only 0.97% of the total number of headers, it accounts for 6.87% of the total header value size. That's a fairly significant ratio. If you look at the specific format of the P3P header value, you'll notice that the text-based format is very inefficient overall.
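The per-header figures above can be reproduced with a few lines of Python. This is only a minimal sketch, not the actual script from the repository: the `header_stats` function and the input shape (a list of messages, each a list of `(name, value)` pairs) are assumptions made for illustration. The averages in the output above look truncated to whole numbers (7610 / 52 is about 146.35, shown as 146.00); the sketch keeps the fraction.

```python
from collections import defaultdict


def header_stats(messages):
    """Compute per-header value statistics.

    `messages` is assumed to be an iterable of messages, each given as a
    list of (name, value) string pairs. Only header *values* are measured,
    not names or message framing.
    """
    lengths = defaultdict(list)
    for headers in messages:
        for name, value in headers:
            lengths[name.lower()].append(len(value.encode("utf-8")))

    total_size = sum(sum(sizes) for sizes in lengths.values())
    total_count = sum(len(sizes) for sizes in lengths.values())

    stats = {}
    for name, sizes in lengths.items():
        total = sum(sizes)
        stats[name] = {
            "instances": len(sizes),
            "total": total,
            "average": total / len(sizes),
            "low": min(sizes),
            "high": max(sizes),
            # Guard against a sample whose values are all empty.
            "pct_of_total_size": 100.0 * total / total_size if total_size else 0.0,
            "pct_of_total_count": 100.0 * len(sizes) / total_count,
        }
    return stats


# Tiny made-up sample: two responses, three headers in total.
sample = [
    [("Date", "Thu, 01 Nov 2012 00:00:00 GMT"), ("P3P", "abcd")],
    [("Date", "Thu, 01 Nov 2012 00:00:01 GMT")],
]
date_stats = header_stats(sample)["date"]
```

Here `date_stats` reports 2 instances totalling 58 bytes, which is 58 of the 62 value bytes in this tiny sample.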
location:
Instances: 15
Total: 1832
Average: 122.00
Low: 22
High: 152
Percent of Total Size: 1.654281
Percent of Total Count: 0.279122
The Location header appears a total of only 15 times in this sample and accounts for 1.65% of the total header size. Given that these are URL values, this does not seem to be too unreasonable.
Below are a few more examples:
set-cookie:
Instances: 48
Total: 5014
Average: 104.00
Low: 56
High: 176
Percent of Total Size: 4.527600
Percent of Total Count: 0.893189
date:
Instances: 365
Total: 10585
Average: 29.00
Low: 29
High: 29
Percent of Total Size: 9.558166
Percent of Total Count: 6.791961
last-modified:
Instances: 302
Total: 8758
Average: 29.00
Low: 29
High: 29
Percent of Total Size: 7.908401
Percent of Total Count: 5.619650
expires:
Instances: 314
Total: 8777
Average: 27.00
Low: 1
High: 29
Percent of Total Size: 7.925557
Percent of Total Count: 5.842948
cache-control:
Instances: 321
Total: 8073
Average: 25.00
Low: 7
High: 67
Percent of Total Size: 7.289851
Percent of Total Count: 5.973204
It's certainly interesting to see just how much space dates contribute to the overall message size. A recent discussion thread on the HTTP mailing list demonstrated that we ought to be able to compactly encode date values in no more than 4 bytes. The average encoded size of dates in the sample messages is 29 bytes. By compactly encoding just the Last-Modified value, for instance, we could save 7,550 bytes without loss of any data (302 values at 29 bytes each is 8,758 bytes, versus 1,208 bytes at 4 bytes each). Apply that same mechanism to the Expires and Date headers, as well as expiration times in Set-Cookie headers, and the savings in bytes on the wire add up quickly. And we haven't even applied any actual compression yet.
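To make the 4-byte claim concrete, here is a minimal sketch that packs an HTTP date into a fixed 32-bit big-endian count of seconds and back again. The function names and the choice of the Unix epoch are assumptions for illustration; the mailing list thread may have settled on a different layout.

```python
import struct
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def encode_http_date(value):
    """Pack an HTTP date string into 4 bytes (unsigned seconds since 1970)."""
    dt = parsedate_to_datetime(value)
    if dt.tzinfo is None:
        # Treat dates without a usable zone as UTC (an assumption).
        dt = dt.replace(tzinfo=timezone.utc)
    return struct.pack(">I", int(dt.timestamp()))


def decode_http_date(data):
    """Unpack 4 bytes back into the canonical 29-byte HTTP date string."""
    (seconds,) = struct.unpack(">I", data)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return format_datetime(dt, usegmt=True)


original = "Thu, 01 Nov 2012 00:00:00 GMT"
packed = encode_http_date(original)
restored = decode_http_date(packed)

# Savings for the 302 Last-Modified values in the Amazon.com sample.
saved = 302 * len(original) - 302 * len(packed)
```

The round trip is lossless for dates in the standard format, and `saved` works out to the 7,550 bytes quoted above.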
Look also at the Cache-Control header. In this sample, it accounts for 5.97% of the total header count and 7.29% of the total value size. The header value ranges in length from 7 to 67 bytes. By optimizing the encoding (at the cost of some extensibility and some added complexity), we could potentially encode the exact same data in as few as 6 to 9 bytes.
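As a rough illustration of that trade-off, the sketch below packs a Cache-Control value into one flags byte plus an optional 4-byte max-age. This layout is entirely hypothetical: it is not the encoding behind the 6 to 9 byte estimate, it handles only a handful of directives, and it rejects anything else, which is exactly the loss of extensibility mentioned above.

```python
import struct

# Hypothetical bit assignments, one bit per valueless directive.
FLAG_DIRECTIVES = [
    "public", "private", "no-cache", "no-store",
    "no-transform", "must-revalidate", "proxy-revalidate",
]
MAX_AGE_PRESENT = 1 << 7  # top bit: a 4-byte max-age follows the flags


def encode_cache_control(value):
    """Encode a Cache-Control value as flags byte [+ 4-byte max-age]."""
    bits = 0
    max_age = None
    # Naive split: quoted values containing commas are not supported.
    for directive in value.split(","):
        directive = directive.strip().lower()
        if not directive:
            continue
        if directive.startswith("max-age="):
            max_age = int(directive[len("max-age="):])
            bits |= MAX_AGE_PRESENT
        elif directive in FLAG_DIRECTIVES:
            bits |= 1 << FLAG_DIRECTIVES.index(directive)
        else:
            # Unknown or extension directives cannot be represented.
            raise ValueError("unsupported directive: %r" % directive)
    encoded = bytes([bits])
    if max_age is not None:
        encoded += struct.pack(">I", max_age)
    return encoded


compact = encode_cache_control("private, max-age=3600")
```

Here `compact` is 5 bytes long, while a lone `no-cache` fits in a single byte.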
Let's look at the Request messages in a second sample, the Craigslist data:
req:
TOTAL HEADER VALUE LENGTH: 10199
NUMBER OF UNIQUE HEADERS: 12
TOTAL NUMBER OF HEADERS: 392
TOTAL NUMBER OF MESSAGES: 33
Here, given only 33 request messages, there are 10,199 bytes of data in the header values, and only 12 unique headers are used.
user-agent:
Instances: 33
Total: 2673
Average: 81.00
Low: 81
High: 81
Percent of Total Size: 26.208452
Percent of Total Count: 8.418367
cookie:
Instances: 33
Total: 1940
Average: 58.00
Low: 32
High: 66
Percent of Total Size: 19.021473
Percent of Total Count: 8.418367
referer:
Instances: 29
Total: 1136
Average: 39.00
Low: 29
High: 48
Percent of Total Size: 11.138347
Percent of Total Count: 7.397959
accept:
Instances: 33
Total: 1044
Average: 31.00
Low: 3
High: 63
Percent of Total Size: 10.236298
Percent of Total Count: 8.418367
The fact that the User-Agent header accounts for 26.21% of the overall header data is actually fairly astounding. The User-Agent syntax is a horrible mess. I hope very much that HTTP/2 gives us a fair chance at doing something about it.
Note also, however, that Cookie values account for 19.02% of the total header value size. Referer headers come in at 11.14% and Accept headers at 10.24%. The question for these is: is there a way that we can increase the density of the data, transmitting fewer bytes on the wire, without incurring any data loss and without sacrificing backwards compatibility *too much*?
My next step is to begin applying alternative encoding mechanisms to this data and to show those results side by side with this output... essentially illustrating how much of a reduction in overall bytes-on-the-wire we can achieve even before applying compression mechanisms.
Update:
I just updated the code and the sample data to include a measurement of each header's variability, that is, the ratio of unique values to total instances of that header within each sample. Each line below shows the header name, its variability, and the number of instances. For example:
TOTAL VARIABILITY FOR ALL HEADERS:
:version 0.0151515151515 66
accept-language 0.030303030303 33
user-agent 0.030303030303 33
:method 0.030303030303 33
accept-encoding 0.030303030303 33
server 0.030303030303 33
:scheme 0.030303030303 33
connection 0.030303030303 66
vary 0.0416666666667 24
content-encoding 0.0454545454545 22
x-frame-options 0.0454545454545 22
cookie 0.0606060606061 33
:status 0.0606060606061 33
:status-text 0.0606060606061 33
transfer-encoding 0.111111111111 9
accept 0.121212121212 33
:host 0.121212121212 33
content-type 0.181818181818 33
cache-control 0.21875 32
referer 0.310344827586 29
accept-ranges 0.5 2
content-length 0.875 24
date 0.878787878788 33
:path 0.939393939394 33
expires 0.958333333333 24
last-modified 0.958333333333 24
location 1.0 1
set-cookie 1.0 1
Headers with lower variability have fewer unique values within the given sample set. We are most interested in headers at both ends of this spectrum.
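The figures in the table are consistent with variability being the number of distinct values divided by the number of instances (for example, 1/33 for user-agent and 2/66 for connection). A minimal sketch under that assumption, with hypothetical function names and input shape:

```python
def variability(values):
    """Ratio of distinct values to total instances for one header."""
    values = list(values)
    if not values:
        raise ValueError("no values observed for this header")
    return len(set(values)) / len(values)


def variability_table(observed):
    """Return (name, variability, instances) rows, least variable first.

    `observed` is assumed to map a header name to the list of values
    seen for it across the sample.
    """
    rows = [(name, variability(vals), len(vals)) for name, vals in observed.items()]
    return sorted(rows, key=lambda row: row[1])


# Made-up sample: one constant header and one that changes every time.
observed = {
    "user-agent": ["ExampleBrowser/1.0"] * 33,
    "set-cookie": ["id=1"],
}
table = variability_table(observed)
```

A constant header across 33 messages scores 1/33 (about 0.0303), while a header whose every value is different scores 1.0.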
2nd Update:
Just checked in a second update that includes experimental binary encoding schemes for select headers (date and numeric headers). These are included to illustrate examples of the savings that can be found in increasing data density without employing compression. For example:
last-modified:
Instances: 158
Total: 4582
Average: 29.00
Low: 29
High: 29
Variability: 0.5190
Percent of Total Size: 10.564907
Percent of Total Count: 5.153294
Encoded:
Total: 632
Average: 4.00
Low: 4
High: 4
Ratio: 86.21
The Ratio field shows the percentage of space saved by changing the encoding: here, 4,582 bytes shrink to 632 bytes, a saving of 86.21%.
The dates are encoded as the number of seconds since midnight on January 1, 1990 captured as a uvarint. This encoding is applied to the Last-Modified, Date, Expires, If-Modified-Since and If-Unmodified-Since headers. Headers with numeric values (:status, content-length, age) are encoded as uvarints.
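The sketch below shows the general idea: seconds since January 1, 1990 written as a variable-length unsigned integer, plus the Ratio calculation. The post does not say which uvarint layout the script uses, so this assumes the common 7-bits-per-byte (LEB128-style) form. With that form a 2012 date needs 5 bytes rather than the 4 reported in the output above, so the script's actual encoding presumably differs. All function names here are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

EPOCH_1990 = datetime(1990, 1, 1, tzinfo=timezone.utc)


def encode_uvarint(n):
    """LEB128-style varint: 7 data bits per byte, high bit = more follows."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_uvarint(data):
    """Decode a single varint from the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated uvarint")


def encode_date_header(value):
    """Encode an HTTP date as uvarint seconds since 1990-01-01 UTC."""
    dt = parsedate_to_datetime(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return encode_uvarint(int((dt - EPOCH_1990).total_seconds()))


def savings_ratio(original_bytes, encoded_bytes):
    """Percentage of space saved, as in the Ratio field."""
    return 100.0 * (1 - encoded_bytes / original_bytes)


encoded = encode_date_header("Thu, 01 Nov 2012 00:00:00 GMT")
ratio = round(savings_ratio(4582, 632), 2)
```

Numeric headers such as content-length can go straight through `encode_uvarint`; a value of 300, for example, fits in two bytes instead of three ASCII digits.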
3rd Update:
The output data now includes frequency tables of the values seen for each header within the sample set. This makes the output very verbose, but also very informative.