2013-01-22

HTTP 2.0 Header Stats...

As many of you know, work is underway on HTTP/2.0. As part of the effort, we are working to collect as much data as possible to evaluate various encoding and compression improvements that could make HTTP/2.0 significantly more performant than 1.1. To this end, Mark Nottingham and others have been collecting samples of HTTP traffic to use as the basis for analyzing various new encoding and compression mechanisms. I'll post more on that later.

As part of this effort, I wanted to use that sample data to get a better idea of how the various headers in request and response messages compare, and how much each contributes to the overall size of a message. Given a sample, how often is any single header used, how many bytes does it consume on average, and so on? I've written a script that reads the sample files and outputs some rather interesting bits of data.

The results can be found here: https://github.com/jasnell/compression-test/tree/master/counts

Looking at the results for the Amazon.com sample data, for instance, we see that given a sample of 366 HTTP Response messages, a total of 110,743 bytes are consumed by Header values (just the values, not the header names and message formatting overhead). In these messages, 44 unique headers are used. Let's take a sampling of some of those headers...


p3p: 
 Instances: 52
 Total:     7610
 Average:   146.00
 Low:       28
 High:      238
 Percent of Total Size: 6.871766
 Percent of Total Count: 0.967622

The "p3p" header appears a total of 52 times and has a highly variable length in the range of 28-238 bytes. While it makes up only 0.97% of the total number of headers, it accounts for 6.87% of the total header value size. That's a fairly significant ratio. (Note that the Average figure appears to be truncated to a whole number of bytes: 7610 / 52 is closer to 146.35.) If you look at the specific format of the P3P header value, you should notice that the text-based format is very inefficient overall.
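
As a rough illustration of how figures like these can be derived, here is a minimal Python sketch. It is not the script from the repository: the function name `header_stats` and the message representation (a list of name/value pairs per message) are assumptions made for the example, and values are assumed to be ASCII so that characters and bytes coincide.

```python
def header_stats(name, messages):
    """Summarise one header across a sample.

    `messages` is a list of messages; each message is a list of
    (header_name, header_value) pairs. This layout is hypothetical.
    """
    # Lengths of every value of the header we are interested in.
    lengths = [len(v) for msg in messages for (n, v) in msg if n == name]
    if not lengths:
        return None  # header never appears in this sample

    # Lengths of every header value in the sample, for the percentages.
    all_lengths = [len(v) for msg in messages for (_, v) in msg]

    total = sum(lengths)
    return {
        "instances": len(lengths),
        "total": total,
        # Integer division, since the averages in the output above
        # look truncated rather than rounded (an inference).
        "average": total // len(lengths),
        "low": min(lengths),
        "high": max(lengths),
        "pct_size": 100.0 * total / sum(all_lengths),
        "pct_count": 100.0 * len(lengths) / len(all_lengths),
    }
```

Calling `header_stats("p3p", messages)` on a parsed sample would return the same seven figures shown in each block of the output.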

location: 
 Instances: 15
 Total:     1832
 Average:   122.00
 Low:       22
 High:      152
 Percent of Total Size: 1.654281
 Percent of Total Count: 0.279122

The Location header appears a total of only 15 times in this sample and accounts for 1.65% of the total header size. Given that these are URL values, this does not seem too unreasonable.

Below are a few more examples:

set-cookie: 
 Instances: 48
 Total:     5014
 Average:   104.00
 Low:       56
 High:      176
 Percent of Total Size: 4.527600
 Percent of Total Count: 0.893189


date: 
 Instances: 365
 Total:     10585
 Average:   29.00
 Low:       29
 High:      29
 Percent of Total Size: 9.558166
 Percent of Total Count: 6.791961


last-modified: 
 Instances: 302
 Total:     8758
 Average:   29.00
 Low:       29
 High:      29
 Percent of Total Size: 7.908401
 Percent of Total Count: 5.619650


expires: 
 Instances: 314
 Total:     8777
 Average:   27.00
 Low:       1
 High:      29
 Percent of Total Size: 7.925557
 Percent of Total Count: 5.842948


cache-control: 
 Instances: 321
 Total:     8073
 Average:   25.00
 Low:       7
 High:      67
 Percent of Total Size: 7.289851
 Percent of Total Count: 5.973204

It's certainly interesting to see just how much dates contribute to the overall message size. A recent thread of discussion on the HTTP Mailing List demonstrated that we ought to be able to compactly encode date values into no more than 4 bytes. The encoded size of each date in the sample messages is 29 bytes. By compactly encoding just the Last-Modified value, for instance, we could save 7,550 bytes (302 instances at 25 bytes each) without loss of any data. Apply that same mechanism to the Expires and Date headers, as well as expiration times in Set-Cookie headers, and the savings in bytes-on-the-wire add up quickly -- and we haven't even applied any actual compression yet.
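
To make the arithmetic concrete, here is a minimal Python sketch of one way to fit an HTTP date into 4 bytes. It assumes a fixed-width unsigned 32-bit count of seconds since 1990-01-01 UTC; the epoch, the fixed width, and the function names are assumptions for illustration, not necessarily the scheme discussed on the mailing list.

```python
import struct
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Assumed epoch for this sketch.
EPOCH_1990 = datetime(1990, 1, 1, tzinfo=timezone.utc)


def encode_http_date(value):
    """Pack an HTTP-date string into 4 bytes (big-endian seconds)."""
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        # Treat dates without a usable zone as UTC.
        when = when.replace(tzinfo=timezone.utc)
    seconds = int((when - EPOCH_1990).total_seconds())
    # struct.pack raises struct.error for dates before the epoch
    # or too far in the future to fit in 32 bits.
    return struct.pack(">I", seconds)


def decode_http_date(data):
    """Turn the 4 packed bytes back into an HTTP-date string."""
    (seconds,) = struct.unpack(">I", data)
    return format_datetime(EPOCH_1990 + timedelta(seconds=seconds), usegmt=True)


def bytes_saved(instances, old_len, new_len):
    """Bytes saved across `instances` values of fixed length."""
    return instances * (old_len - new_len)
```

With the Last-Modified figures above, `bytes_saved(302, 29, 4)` gives the 7,550 bytes quoted, and a value such as `Tue, 22 Jan 2013 10:00:00 GMT` survives the round trip through `encode_http_date` and `decode_http_date` unchanged.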

Look also at the Cache-Control header. In this sample, it accounts for 5.97% of the total headers and 7.29% of the total value size. The header value ranges in length between 7 and 67 bytes. By optimizing the encoding (at the cost of some extensibility and some added complexity) we can potentially encode the exact same data in as few as 6 to 9 bytes.

Let's look at Request messages in a second sample, this time Craigslist data:

req: 
TOTAL HEADER VALUE LENGTH: 10199
NUMBER OF UNIQUE HEADERS:  12
TOTAL NUMBER OF HEADERS:   392
TOTAL NUMBER OF MESSAGES:  33

Here, given only 33 request messages, there are 10,199 bytes of data in the header values; only 12 unique headers are used.


user-agent: 
 Instances: 33
 Total:     2673
 Average:   81.00
 Low:       81
 High:      81
 Percent of Total Size: 26.208452
 Percent of Total Count: 8.418367

cookie: 
 Instances: 33
 Total:     1940
 Average:   58.00
 Low:       32
 High:      66
 Percent of Total Size: 19.021473
 Percent of Total Count: 8.418367

referer: 
 Instances: 29
 Total:     1136
 Average:   39.00
 Low:       29
 High:      48
 Percent of Total Size: 11.138347
 Percent of Total Count: 7.397959

accept: 
 Instances: 33
 Total:     1044
 Average:   31.00
 Low:       3
 High:      63
 Percent of Total Size: 10.236298
 Percent of Total Count: 8.418367

That the User-Agent header accounts for 26.21% of the overall header data is fairly astounding. The User-Agent syntax is a horrible mess. I hope very much that HTTP/2 gives us a fair chance at doing something about it.

Note also, however, that Cookie values account for 19.02% of the total header value size. Referer headers come in at 11.14% and Accept headers at 10.24%. The question for these is: is there a way that we can increase the density of the data -- transmitting fewer bytes on the wire -- without incurring any data loss and without sacrificing backwards compatibility *too much*?

My next step is to begin applying alternative encoding mechanisms to this data and to show those results side by side with this output... essentially illustrating how much of a reduction in overall bytes-on-the-wire we can achieve even before applying compression mechanisms.

Update:

I just updated the code and the sample data to include a measurement of each header value's variability, that is, a measure of the number of unique header values within each sample. For example:


TOTAL VARIABILITY FOR ALL HEADERS:
:version             0.0151515151515                66                  
accept-language      0.030303030303                 33                  
user-agent           0.030303030303                 33                  
:method              0.030303030303                 33                  
accept-encoding      0.030303030303                 33                  
server               0.030303030303                 33                  
:scheme              0.030303030303                 33                  
connection           0.030303030303                 66                  
vary                 0.0416666666667                24                  
content-encoding     0.0454545454545                22                  
x-frame-options      0.0454545454545                22                  
cookie               0.0606060606061                33                  
:status              0.0606060606061                33                  
:status-text         0.0606060606061                33                  
transfer-encoding    0.111111111111                  9                   
accept               0.121212121212                 33                  
:host                0.121212121212                 33                  
content-type         0.181818181818                 33                  
cache-control        0.21875                        32                  
referer              0.310344827586                 29                  
accept-ranges        0.5                             2                   
content-length       0.875                          24                  
date                 0.878787878788                 33                  
:path                0.939393939394                 33                  
expires              0.958333333333                 24                  
last-modified        0.958333333333                 24                  
location             1.0                             1                   
set-cookie           1.0                             1                   

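In the table above, the second column is the variability and the third is the number of instances; the variability works out to the number of unique values divided by the number of instances (for example, 1/66 for :version). Headers with a lower variability have fewer unique values within the given sample set. We are most interested in headers at both ends of this spectrum.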
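
The figures in the table are consistent with a simple ratio of distinct values to instances. A minimal Python sketch of that calculation, with a hypothetical function name and not taken from the repository, might look like this:

```python
def variability(values):
    """Ratio of distinct values to total occurrences for one header.

    `values` is the list of every value seen for that header in the
    sample. Returns 0.0 for a header that never appears.
    """
    if not values:
        return 0.0
    return len(set(values)) / len(values)
```

A header that always carries the same value across 66 instances scores 1/66 (about 0.0152), while one whose value is different every time scores 1.0.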

2nd Update:

Just checked in a second update that includes experimental binary encoding schemes for select headers (date and numeric headers). These are included to illustrate examples of the savings that can be found in increasing data density without employing compression. For example:


last-modified: 
 Instances:   158
 Total:       4582
 Average:     29.00
 Low:         29
 High:        29
 Variability: 0.5190
 Percent of Total Size: 10.564907
 Percent of Total Count: 5.153294
 Encoded: 
  Total:      632
  Average:    4.00
  Low:        4
  High:       4
  Ratio:      86.21

The Ratio field shows the percentage of space saved by changing the encoding: here, 632 encoded bytes in place of 4,582, a saving of about 86.21%.

The dates are encoded as the number of seconds since midnight on January 1, 1990 captured as a uvarint. This encoding is applied to the Last-Modified, Date, Expires, If-Modified-Since and If-Unmodified-Since headers. Headers with numeric values (:status, content-length, age) are encoded as uvarints.
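
For readers unfamiliar with the term, the sketch below shows one common unsigned varint layout: base-128, least significant group first, with the high bit of each byte marking continuation. Whether the script uses exactly this layout is an assumption, and the function names are hypothetical. The last helper reproduces the Ratio calculation.

```python
def encode_uvarint(n):
    """Encode a non-negative integer as a base-128 unsigned varint."""
    if n < 0:
        raise ValueError("uvarint cannot represent negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)


def decode_uvarint(data):
    """Decode a single uvarint from the start of `data`."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7
    raise ValueError("truncated uvarint")


def space_saved_percent(original_total, encoded_total):
    """Percentage of bytes saved, as in the Ratio field."""
    return 100.0 * (1 - encoded_total / original_total)
```

Under this layout a :status of 200 takes two bytes instead of three text characters, and `space_saved_percent(4582, 632)` rounds to 86.21. Note that a 2013 timestamp counted in seconds from 1990 would need five bytes in this particular layout rather than the four shown in the output, so treat it as illustrative rather than as the script's actual format.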

3rd Update:

The output data now includes frequency tables for each value of each header within the sample set. This makes the output very verbose, but also very informative.
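
A frequency table of this kind is straightforward to build. The following Python sketch is one possible approach, not the repository's code; the function name and the message representation (a list of name/value pairs per message) are assumptions.

```python
from collections import Counter, defaultdict


def frequency_tables(messages):
    """Count how often each value occurs for each header.

    `messages` is a list of messages; each message is a list of
    (header_name, header_value) pairs. Header names are lowercased
    so that differently-cased spellings share one table.
    """
    tables = defaultdict(Counter)
    for msg in messages:
        for name, value in msg:
            tables[name.lower()][value] += 1
    return tables
```

For any header, `tables[name].most_common()` then lists its values from most to least frequent.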