As part of this effort, I wanted to use that sample data to get a better idea of how the various headers in request and response messages compare, and how much each contributes to the overall size of a message. Given a sample, how often is any single header used, how many bytes does it consume on average, and so on? I've written a script that reads the sample files and outputs some rather interesting data.
The results can be found here: https://github.com/jasnell/compression-test/tree/master/counts
Looking at the results for the Amazon.com sample data, for instance, we see that given a sample of 366 HTTP Response messages, a total of 110,743 bytes are consumed by Header values (just the values, not the header names and message formatting overhead). In these messages, 44 unique headers are used. Let's take a sampling of some of those headers...
p3p: Instances: 52 Total: 7610 Average: 146.00 Low: 28 High: 238 Percent of Total Size: 6.871766 Percent of Total Count: 0.967622
The "p3p" header appears a total of 52 times and has a highly variable length in the range of 28-238 bytes. While it accounts for only 0.97% of the total number of headers, it accounts for 6.87% of the total header value size. That's a fairly significant ratio. If you look at the specific format of the P3P header value, you should notice that the text-based format is very inefficient overall.
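For readers who want to reproduce figures like these, here is a minimal sketch of how the per-header statistics could be computed. It is not the script from the repository: the input shape (a list of messages, each a list of name/value pairs) and the function name `header_stats` are assumptions made for illustration. Note that the averages in the output above appear to be truncated to whole bytes (7610 / 52 is roughly 146.35, reported as 146.00); the sketch uses true division instead.

```python
from collections import defaultdict


def header_stats(messages):
    """Compute per-header size statistics.

    `messages` is a hypothetical input shape: a list of messages, each a
    list of (name, value) pairs, so repeated headers are counted separately.
    Only header *values* are measured, as in the output above.
    """
    lengths = defaultdict(list)
    for headers in messages:
        for name, value in headers:
            lengths[name.lower()].append(len(value.encode("utf-8")))

    total_size = sum(sum(sizes) for sizes in lengths.values())
    total_count = sum(len(sizes) for sizes in lengths.values())

    stats = {}
    for name, sizes in lengths.items():
        stats[name] = {
            "instances": len(sizes),
            "total": sum(sizes),
            "average": sum(sizes) / len(sizes),  # true division, not truncated
            "low": min(sizes),
            "high": max(sizes),
            "pct_size": 100.0 * sum(sizes) / total_size,
            "pct_count": 100.0 * len(sizes) / total_count,
        }
    return stats


# Tiny made-up sample, just to exercise the function.
sample = [
    [("Date", "Mon, 01 Oct 2012 12:00:00 GMT"), ("P3P", "policyref")],
    [("Date", "Mon, 01 Oct 2012 12:00:05 GMT")],
]
stats = header_stats(sample)
for name, row in sorted(stats.items()):
    print(name, row)
```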
location: Instances: 15 Total: 1832 Average: 122.00 Low: 22 High: 152 Percent of Total Size: 1.654281 Percent of Total Count: 0.279122
The Location header appears only 15 times in this sample and accounts for 1.65% of the total header value size. Given that these are URL values, this does not seem unreasonable.
Below are a few more examples:
set-cookie: Instances: 48 Total: 5014 Average: 104.00 Low: 56 High: 176 Percent of Total Size: 4.527600 Percent of Total Count: 0.893189
date: Instances: 365 Total: 10585 Average: 29.00 Low: 29 High: 29 Percent of Total Size: 9.558166 Percent of Total Count: 6.791961
last-modified: Instances: 302 Total: 8758 Average: 29.00 Low: 29 High: 29 Percent of Total Size: 7.908401 Percent of Total Count: 5.619650
expires: Instances: 314 Total: 8777 Average: 27.00 Low: 1 High: 29 Percent of Total Size: 7.925557 Percent of Total Count: 5.842948
cache-control: Instances: 321 Total: 8073 Average: 25.00 Low: 7 High: 67 Percent of Total Size: 7.289851 Percent of Total Count: 5.973204
It's certainly interesting to see just how much dates contribute to the overall message size. A recent discussion thread on the HTTP mailing list demonstrated that we ought to be able to compactly encode date values into no more than 4 bytes. The average encoded size of dates in the sample messages is 29 bytes. By compactly encoding just the Last-Modified value, for instance, we could save 7,550 bytes (302 instances at 25 bytes saved each) without loss of any data. Apply that same mechanism to the Expires and Date headers, as well as expiration times in Set-Cookie headers, and the savings in bytes-on-the-wire add up quickly -- and we haven't even applied any actual compression yet.
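The arithmetic behind that estimate can be worked through from the instance counts and totals reported for the Amazon.com response sample. The sketch below assumes a flat cost of 4 bytes per encoded date. The Expires figure is only approximate, since its reported low of 1 byte shows that not every Expires value in the sample is a full date.

```python
# (instances, total value bytes) as reported for the Amazon.com response sample.
reported = {
    "date": (365, 10585),
    "last-modified": (302, 8758),
    "expires": (314, 8777),  # some values are not dates (Low: 1), so approximate
}

ENCODED_DATE_BYTES = 4  # assumed fixed cost per compactly encoded date


def bytes_saved(instances, total, encoded=ENCODED_DATE_BYTES):
    """Bytes saved if every instance shrank to `encoded` bytes."""
    return total - instances * encoded


savings = {name: bytes_saved(*row) for name, row in reported.items()}
for name, saved in savings.items():
    print(f"{name}: {saved} bytes saved")
print("combined:", sum(savings.values()))
```

Under those assumptions the three headers together would shed 24,196 of the sample's 110,743 header value bytes, before Set-Cookie expiration times are considered.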
Look also at the Cache-Control header. In this sample, it accounts for 5.97% of the total number of headers and 7.29% of the total value size. The header value ranges in length between 7 and 67 bytes. By optimizing the encoding (at the cost of some extensibility and some additional complexity) we can potentially encode the exact same data in as few as 6 to 9 bytes.
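To make the trade-off concrete, here is a purely hypothetical layout: one flag byte for a handful of common directives, followed by a 4-byte max-age when one is present. The bit assignments and the function name `encode_cache_control` are invented for this illustration and do not come from the script or from any specification. The byte counts it produces are a property of this toy layout, not the 6 to 9 bytes estimated above.

```python
# Hypothetical flag bits; not taken from the script or any specification.
FLAGS = {
    "public": 0x01,
    "private": 0x02,
    "no-cache": 0x04,
    "no-store": 0x08,
    "no-transform": 0x10,
    "must-revalidate": 0x20,
    "proxy-revalidate": 0x40,
}
HAS_MAX_AGE = 0x80


def encode_cache_control(value):
    """Pack a Cache-Control value into 1 flag byte plus an optional 4-byte max-age."""
    flags = 0
    max_age = None
    for directive in value.split(","):
        directive = directive.strip().lower()
        if not directive:
            continue
        if directive.startswith("max-age="):
            max_age = int(directive[len("max-age="):])
            flags |= HAS_MAX_AGE
        elif directive in FLAGS:
            flags |= FLAGS[directive]
        else:
            # This is the extensibility cost: unknown directives have no slot.
            raise ValueError(f"directive not covered by this sketch: {directive}")
    out = bytes([flags])
    if max_age is not None:
        out += max_age.to_bytes(4, "big")
    return out


text = "public, max-age=86400"
packed = encode_cache_control(text)
print(len(text), "bytes as text,", len(packed), "bytes packed")
```

The `ValueError` branch is the point of the exercise: a fixed layout is compact precisely because it cannot carry directives that were not anticipated.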
Let's look at Request messages in a second sample for Craigslist data:
req:
TOTAL HEADER VALUE LENGTH: 10199
NUMBER OF UNIQUE HEADERS: 12
TOTAL NUMBER OF HEADERS: 392
TOTAL NUMBER OF MESSAGES: 33
Here, given only 33 request messages, there are 10,199 bytes of data in the header values; only 12 unique headers are used.
user-agent: Instances: 33 Total: 2673 Average: 81.00 Low: 81 High: 81 Percent of Total Size: 26.208452 Percent of Total Count: 8.418367
cookie: Instances: 33 Total: 1940 Average: 58.00 Low: 32 High: 66 Percent of Total Size: 19.021473 Percent of Total Count: 8.418367
referer: Instances: 29 Total: 1136 Average: 39.00 Low: 29 High: 48 Percent of Total Size: 11.138347 Percent of Total Count: 7.397959
accept: Instances: 33 Total: 1044 Average: 31.00 Low: 3 High: 63 Percent of Total Size: 10.236298 Percent of Total Count: 8.418367
That the User-Agent header accounts for 26.21% of the overall header data is fairly astounding. The User-Agent syntax is a horrible mess. I hope very much that HTTP/2 gives us a fair chance at doing something about it.
Note also, however, that Cookie values account for 19.02% of the total header value size. Referer headers come in at 11.14% and Accept headers at 10.24%. The question for these is: is there a way to increase the density of the data, transmitting fewer bytes on the wire, without incurring any data loss and without sacrificing backwards compatibility *too much*?
My next step is to begin applying alternative encoding mechanisms to this data and to show those results side by side with this output... essentially illustrating how much of a reduction in overall bytes-on-the-wire we can achieve even before applying compression mechanisms.
I just updated the code and the sample data to include a measurement of each header value's variability, that is, a measure of how many unique values a header takes relative to how often it appears within each sample. For example:
TOTAL VARIABILITY FOR ALL HEADERS:
:version 0.0151515151515 66
accept-language 0.030303030303 33
user-agent 0.030303030303 33
:method 0.030303030303 33
accept-encoding 0.030303030303 33
server 0.030303030303 33
:scheme 0.030303030303 33
connection 0.030303030303 66
vary 0.0416666666667 24
content-encoding 0.0454545454545 22
x-frame-options 0.0454545454545 22
cookie 0.0606060606061 33
:status 0.0606060606061 33
:status-text 0.0606060606061 33
transfer-encoding 0.111111111111 9
accept 0.121212121212 33
:host 0.121212121212 33
content-type 0.181818181818 33
cache-control 0.21875 32
referer 0.310344827586 29
accept-ranges 0.5 2
content-length 0.875 24
date 0.878787878788 33
:path 0.939393939394 33
expires 0.958333333333 24
last-modified 0.958333333333 24
location 1.0 1
set-cookie 1.0 1
Headers with a lower variability have fewer unique values within the given sample set. We are most interested in headers at both ends of this spectrum.
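The variability figures are consistent with dividing the number of unique values by the number of instances (for example, cache-control at 0.21875 with 32 instances works out to 7 unique values). That reading is an inference from the numbers rather than something taken from the script, and the function name below is made up for illustration.

```python
def variability(values):
    """Unique values divided by total instances, for one header's values."""
    return len(set(values)) / len(values)


# 33 identical User-Agent strings: one unique value across 33 instances.
same_ua = ["ExampleBrowser/1.0"] * 33
print(variability(same_ua))

# 7 distinct values spread across 32 instances, like cache-control above.
cache_control_like = [f"value-{i % 7}" for i in range(32)]
print(variability(cache_control_like))
```

A value near 0 means the header almost never changes within the sample, while a value of 1.0 means every instance is different.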
I just checked in a second update that includes experimental binary encoding schemes for select headers (date and numeric headers). These are included to illustrate the savings that can be had by increasing data density without employing compression. For example:
last-modified: Instances: 158 Total: 4582 Average: 29.00 Low: 29 High: 29 Variability: 0.5190 Percent of Total Size: 10.564907 Percent of Total Count: 5.153294 Encoded: Total: 632 Average: 4.00 Low: 4 High: 4 Ratio: 86.21
The Ratio field shows the percentage of space saved by changing the encoding. Here, going from 4,582 bytes to 632 bytes saves (4582 - 632) / 4582, or about 86.21%.
The dates are encoded as the number of seconds since midnight on January 1, 1990 captured as a uvarint. This encoding is applied to the Last-Modified, Date, Expires, If-Modified-Since and If-Unmodified-Since headers. Headers with numeric values (:status, content-length, age) are encoded as uvarints.
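A minimal sketch of that idea follows. The uvarint layout used here is the common LEB128-style one (7 payload bits per byte, high bit as a continuation flag), which is an assumption: the script's own layout is not described above. With this layout, any date later than mid-1998 needs 5 bytes rather than the 4 reported in the output, so the script presumably packs its values differently. The function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Epoch described above: midnight on January 1, 1990 (UTC assumed).
EPOCH_1990 = datetime(1990, 1, 1, tzinfo=timezone.utc)


def encode_uvarint(n):
    """LEB128-style unsigned varint (an assumed layout, see lead-in)."""
    if n < 0:
        raise ValueError("uvarint cannot represent negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uvarint(data):
    """Inverse of encode_uvarint."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated uvarint")


def encode_http_date(value):
    """Encode an HTTP date string as seconds since 1990-01-01."""
    dt = parsedate_to_datetime(value)
    if dt.tzinfo is None:  # treat zone-less dates as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return encode_uvarint(int((dt - EPOCH_1990).total_seconds()))


text = "Mon, 01 Oct 2012 12:00:00 GMT"
encoded = encode_http_date(text)
print(len(text), "bytes as text,", len(encoded), "bytes encoded")
print(encode_uvarint(200), encode_uvarint(300))  # e.g. :status, content-length
```

Numeric headers benefit less dramatically than dates, since a three-digit status code only shrinks from 3 bytes to 2, but the decoder no longer has to parse text.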
The output data now includes frequency tables for each value of each header within the sample set. This makes the output very verbose, but also very informative.
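A frequency table of this kind is straightforward to build with the standard library. The sketch below is illustrative only; the function name and the sample values are made up, and the script's actual output format may differ.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count) pairs for one header, most frequent first."""
    return Counter(values).most_common()


# Made-up values for a single header across a few messages.
accept_encoding = ["gzip", "gzip", "identity", "gzip"]
table = frequency_table(accept_encoding)
for value, count in table:
    print(f"{count:4d}  {value}")
```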