We describe BLT, a tool for extracting full HTTP-level as well as TCP-level traces via packet monitoring. This paper presents the software architecture that allows us to collect traces continuously, online, and at any point in the network. The software has been used to extract extensive traces within AT&T WorldNet since spring 1997 as well as at AT&T Labs-Research. BLT offers a much richer alternative to Web proxy logs, client logs, and Web server logs due to the accuracy of the timestamps, the level of detail available by considering several protocol layers (TCP/IP and HTTP events), and its nonintrusive way of gathering data. Traces gathered using BLT have provided the foundation of several Web performance studies [16,29,14,18,24,25].
To improve the performance of the network and the network protocol, it is important to characterize the dominant applications [7,12,13,18,29,32,33]. Only by utilizing data about all events initiated by the Web (including TCP and HTTP events) can one hope to understand the chain of performance problems that current Web users face. Due to the popularity of the Web it is crucial to understand how usage relates to the performance of the network, the servers, and the clients. Such comprehensive information is only available via packet monitoring. Unfortunately, extracting HTTP information from packet sniffer data is non-trivial due to the huge volume of data, the line speed of the monitored links, the need for continuous monitoring, the need to preserve privacy, and the need to be able to monitor at any point in the network. These needs translate into requirements for online processing and online extraction of the relevant data, the topic of this paper.
The software described in this paper runs on the PacketScope monitor developed by AT&T Labs [1]. PacketScope is deployed at several different locations within AT&T WorldNet, a production IP network, and at AT&T Labs-Research. One PacketScope monitors T3 backbone links; another may monitor traffic generated by a large set of modems on an FDDI ring or traffic on other FDDI rings; a third monitors traffic between AT&T Labs-Research and the Internet. First deployed in Spring 1997, the software has run without interruption for weeks at a time, collecting and reconstructing detailed logs of millions of Web downloads with a worst-case packet loss of less than 0.3%.
The rest of this paper is organized as follows. Section 2 discusses the advantages of packet sniffing and Section 3 outlines some of the difficulties of extracting HTTP data from packet traces. The overall software architecture is described in Section 4. Our solution (including the logfile format) is presented in Sections 5 - 7. In Section 8 we revisit some of the studies based upon data collected by BLT and point out how each study benefited from the data. Finally, Section 9 briefly summarizes some of the lessons learned.
There are many ways of gaining access to information about user accesses to the Web:
While each of these methods has its advantages most have severe limitations regarding the detail of information that can be logged. Distributing modified Web browsers to a representative sample of consumers and having them agree to monitor their browsing behavior is problematic, especially since Microsoft Internet Explorer and Netscape's browser became more popular than Mosaic and Lynx. The source code to Microsoft Internet Explorer is not available and the source code to Netscape has just recently become available. Some studies such as Crovella et al. [12] clearly show the benefit of such data sources. Yet, to evaluate changes in Web client access patterns between 1995 and 1999 [6] the same authors augmented a proxy instead of modifying the client.
While the logfiles from Web servers are extremely useful for scaling the performance of the specific Web server, they are not necessarily representative of the overall Web. The access patterns from users to specific files are heavily influenced by what content the Web server is offering [14,28]. Therefore a lot of Web server logs have to be analyzed in order to generalize to the overall Web. While possible [3,11,28], this is nontrivial. Another aspect is that currently the standard log files generated by Web servers do not include sufficient detail regarding the timing of all possible aspects of data retrieval.
Using the Web proxy for logging information can be suboptimal especially if either not all users are encouraged/forced to use the Web proxy, or it is impossible to instrument the Web proxy, or if insufficient detail is available in the logged information. The work of Mogul et al. [31,26] shows how useful Web proxy traces can be. One benefit of data from packet traces over proxy traces is very precise timestamps.
The information gathered from packet monitoring includes the full HTTP header information plus detailed timestamps of HTTP and TCP events. This may include, e.g., the timestamp for when a GET request was issued, when the corresponding HTTP response was sent, when the first/last data packet was sent, etc. In addition, full IP packet headers can be collected. One advantage of monitoring on the wire via packet sniffing is that the methodology is passive and therefore invisible to the user. It does not impact the performance of the network. The amount of detail that can be gathered is sufficient to capture TCP and HTTP interactions. If desired and allowed, BLT has the potential to collect the actual downloaded Web page (including results from CGI scripts). Having this detail available has enabled studies of the effectiveness of delta-encoding and compression [29], the rate of change of Web pages [14], Web cache coherency schemes [25], the benefit of Web caching for heterogeneous bandwidth environments [16,8], and the characterization of IP-flows for WWW traffic [18].
Three other projects have used packet level data to extract Web data. A group from IBM augmented their Web server logs with partial packet level data for the collection of the traces during the Olympics. This data allows TCP level performance characterization and analysis [5]. Still, while it is possible to glean information about the access patterns within a site, it is impossible to learn cross site effects. A group at Berkeley used a packet sniffer on the HomeIP network at the University of California at Berkeley [20]. They wrote their own set of software to continuously extract HTTP information on top of the Internet Protocol Scanning Engine IPSE [19]. Their user-level HTTP module sits on top of the TCP module and mainly logs HTTP level information. Since the main interest of the authors at the time of the sniffer code development was focused on HTTP traces, they currently log neither full HTTP headers nor the full set of timestamps for TCP events. When studying things like Web caching and the burstiness of the arrival pattern of Web requests [16,8,17,20], these missing details can lead to misleading predictions. A group at Virginia Tech has developed HTTPDump [35] to extract HTTP headers from tcpdump traces [21]. The performance of their general tool is not sufficient to collect continuous traces on a 10 Mbit/second Ethernet. The simpler PERL version [36], which only parses the first packet of the first HTTP request/response on a TCP connection, promises good performance but is severely limited in its generality.
Other applications of accessing Web data from packet level data are Layer 5 switching [2] and content-based request distribution schemas [4]. Both redirect HTTP requests towards different servers based upon the content of the HTTP request by either moving the TCP state or rewriting the TCP sequence numbers or a combination of the two methods. Layer 5 switching is easier than layer 5 information extraction because the switch is in the data path and it is close to the Web server. Therefore it can throttle the server and it should see both sides of the packet stream.
The work most closely related to ours in terms of being able to collect passive measurements 24x7 (24 hours a day, 7 days a week) at key locations in the network is Windmill [27]. Windmill offers an extensible experimental platform in which application modules can process the subsets of the packet stream that they need. Our software design is driven by the desire to collect extensive TCP/IP and HTTP level traces. As such we have identified the key events to log for Web performance studies, and since more than 70% of the traffic is Web traffic, we have designed the system for maximal throughput and minimal interaction between data collection and data processing.
Adding support for Web trace extraction goes well beyond the basic idea of packet sniffers like tcpdump [21] that any packet can be processed in isolation. Indeed, the extraction software has to run almost a full TCP and HTTP stack in order to demultiplex the packets and extract the content. The software has to go from packets to TCP connections, from TCP connections to individual HTTP transactions (there may be more than one), and from individual HTTP transactions to HTTP requests, HTTP responses, and the transferred data. While this is hard enough to implement correctly on an end system, it is even harder in this case because the software is incapable of throttling the end system, cannot make any assumptions about the compliance of either the clients or the servers with the TCP and HTTP specifications [22], and may not see both halves of the TCP connection. Even worse, due to packet-by-packet load balancing it may see only every other packet for some, currently small, fraction of the transfers. On the other hand, the software has the advantage that it doesn't have to be perfect. It is our desire to gather continuous traces without downtime on a high speed transmission medium such as FDDI or multiple T3's with capacities greater than 100 Mbit/second. In such a setting it is almost impossible not to lose some small fraction of the HTTP transactions due to packet losses at the sniffer. It is possible to keep packet losses small (e.g., by running the sniffer at higher priority) but it is impossible to guarantee that no packet will ever be lost. Therefore the resulting trace data should only be used for analyses that are statistically robust against losing a very small fraction of the transactions.
To get a better flavor for the problems that need to be addressed consider the following subproblems: assumptions about how Web pages and their meta information is fragmented into packets, assumptions about how TCP connections are used by HTTP, demultiplexing (including reordering and loss) of TCP packets to HTTP transactions, and sanity checking of the extracted information.
Web pages and their meta information are often fragmented into TCP packets in an unexpected fashion:
HTTP uses TCP as its underlying transport protocol leading to the following issues:
Demultiplexing of the TCP packets into HTTP transactions implies dealing with lost packets, retransmitted packets, and reordered packets:
While it seems easy to debug HTTP extraction software, not all bugs may be bugs:
All of the above indicate that one needs a sophisticated tool to extract HTTP information from packet level data and that just inspecting the first x bytes of each Web TCP connection is insufficient.
The hardware and software design for the monitoring system was driven by the desire to gather continuous traces without downtime on a high speed transmission medium. If there is a collection machine that is capable of capturing packets on a medium, then it should be possible to run BLT. The software should be deployable even on backbone links. Due to the asymmetric routing common in today's Internet, backbone links may only see packets of one direction of a TCP connection.
Hardware design: The hardware of the AT&T PacketScope [1] consists of standard hardware components: a DEC Alpha 500 MHz workstation with an 8 Gigabyte RAID disk array and a 7 tape DLT tape robot. For more details on the hardware architecture see Figure 1. Several security precautions have been taken, including using no IP addresses and using read-only device drivers. The DEC Alpha platform was chosen because of the kernel performance optimizations to support packet sniffing by Mogul and Ramakrishnan [30].
Software design, online vs. offline extraction: Given that HTTP headers can easily be larger than 1500 bytes and will span multiple packets, we had no choice but to collect full packet traces off the wire. At speeds of 100 Mbit/second this implies that the processing of the data into the log format has to be done on the monitoring machine itself. No current DLT tape technology can transfer data to tape at a rate anywhere close to 100 Mbit/second. Neither does the disk system allow storage of more than a few hours of data. Besides, processing the logs offline would introduce serious privacy concerns with respect to the data content of the packets. Since the PacketScope, due to its placement in the network, may only see packets directed either to the Web server or to the Web client, no matching of HTTP requests with their HTTP responses is done online. Rather, where possible, this is done offline.
Software design, partitioning of the software: Packet sniffing involves having the packets pass through at least some part of the protocol stack on the monitoring machine at interrupt level. At line rate even pure packet sniffing can stress machines as powerful as the DEC Alphas [30]. The amount of processing per packet for HTTP header extraction is variable and potentially quite large. E.g., one can imagine collecting all packets of a TCP connection and extracting the HTTP information upon receiving a packet with a FIN flag. In this case the processing time for the packet with the FIN flag would be much larger than the processing time for any of the other packets of the same TCP connection, since only then would the much more involved HTTP extraction be executed. Due to the variable processing time we separated the processing priorities of the tasks: high priority for the packet sniffing; lower priorities for the packet extraction and any other software. To avoid interference between the HTTP extraction software and packet sniffing, the extraction software should avoid processing at interrupt level. Splitting the software into two stages introduces the need to pipeline the processing. We chose to use files as buffers between the collection and the processing stages.
We decompose the overall task into four components: packet sniffing, a control script, HTTP header extraction, and HTTP header matching. Figure 2 shows how the first three interact with each other.
Packet sniffing: software based on tcpdump [21] that will copy a fixed number of bytes from each packet to a file. Once it has processed some number of packets, this software will close the current file, move the file to a different directory, and open a new file. In addition all IP addresses are encrypted as they come off the wire before saving them to disk. This process runs at normal priority.
Control script: a perl [34] script that controls the pipeline. It monitors a directory and will start the HTTP header extraction software for each file that the packet sniffing software generates. Once the header extraction software is done it will copy the logfiles to tape and clean up the disk. Besides controlling the copying of files the control script also needs to monitor the tape usage, switch tapes on the tape robot, and allow for personnel at the PacketScope locations to change tape sets at any point in time.
HTTP header extraction: software that will process files generated by tcpdump (containing packets with full data content) and extract logfiles containing full HTTP request and response headers and relevant timestamp information, TCP timestamps, and data summarizing the data portion of the HTTP requests/responses. In addition the software creates pure packet header tcpdump files for the observed traffic. The software extends tcpdump [21] to reconstruct HTTP sessions and is run niced to the maximum possible level.
HTTP header matching: offline post-processing software that will match HTTP request information with HTTP response information where applicable. The match is based upon either a match between sequence numbers and acknowledgment numbers or additional heuristics.
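The role of the control script in this pipeline can be sketched as a single polling pass over the directory that receives completed packet files. The sketch below is written in Python rather than perl, and the names `process_ready_files`, `extract`, and `archive` are hypothetical stand-ins for the header extraction program and the tape copy step; tape robot handling is omitted entirely.

```python
import os


def process_ready_files(ready_dir, extract, archive):
    """Run one polling pass over the directory of completed packet files.

    `extract` is a hypothetical callable that takes the path of a packet
    file and returns the list of logfiles it produced; `archive` is a
    hypothetical callable that copies those logfiles to long-term storage.
    """
    def age_key(name):
        # Oldest file first; the name breaks ties between equal timestamps.
        return (os.path.getmtime(os.path.join(ready_dir, name)), name)

    done = []
    for name in sorted(os.listdir(ready_dir), key=age_key):
        path = os.path.join(ready_dir, name)
        if not os.path.isfile(path):
            continue
        logfiles = extract(path)   # header extraction stage
        archive(logfiles)          # copy the resulting logfiles away
        os.remove(path)            # clean up the disk buffer
        done.append(name)
    return done
```

Because the sniffer moves a file into the monitored directory only after closing it, a pass like this never reads a partially written file; that property comes from the file-based buffering described above, not from the sketch itself.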
The benefit of building most of the software on top of tcpdump is that we can take full advantage of its filtering mechanism and its built-in knowledge about the IP/TCP protocol stack. Using the filtering mechanism is especially useful if BLT is run in an environment where the capturing hardware is at its limits. In this case using a more restrictive filter may free up the cycles that BLT needs. (Note that not all Web traffic uses port 80.) Adding the support for multiple files for the packet sniffing software is a trivial extension of tcpdump. Next, we give more detail on the HTTP header extraction (Section 5), the logfile format (Section 6), and the HTTP header matching software (Section 7).
The software is built along the following lines:
To not impede the packet sniffing effort it is crucial to avoid unnecessary file I/O, and therefore the software should stay memory resident. This makes it impossible to follow the above recipe step by step while continuously monitoring packets. The memory requirements alone for storing about 200,000 packets, each of size 1,500 bytes (roughly 300 Megabytes), exceed the memory of our monitoring machine. Therefore it is necessary to split the steps outlined above into substages. Whenever all packets have been received for one HTTP transaction its information is extracted. Unfortunately, a single transaction can involve thousands of packets; therefore even this step has to be staged. Whenever a sufficient number of packets is received for one HTTP transaction its partial information is extracted. The clean-up step controls this staging.
Instead of using TCP connections we use IP-flows [10,18] to demultiplex packets. This accounts for the possibility of losing packets with TCP flags, monitoring only packets from one side of a TCP connection, and fragmentation of Web pages and their meta information. An IP-flow is a set of packets that are close in time and that have the same IP addresses and the same TCP port numbers of both the source and the destination. Our definition of close in time is somewhat looser than the definition used in [18] by using a 10 minute (a compile time constant) timeout value. For the most part, all packets in an IP-flow correspond to a single unidirectional TCP connection, and all packets in a single unidirectional TCP connection correspond to a single flow. The main data structure is a per flow list of packets and a list of partial information extracted from this flow. The desire is to append any new incoming packet to the correct list of packets and then, at the appropriate time intervals, extract the HTTP information. Figure 3 shows a schematic of how the various steps are done on the per flow data structure.
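The flow-based demultiplexing described above can be sketched as follows. This is a minimal illustration in Python, not the C implementation; the class and field names (`Flow`, `FlowTable`, `add`) are assumptions made for the example, and only the 10 minute timeout is taken from the text.

```python
from dataclasses import dataclass, field

FLOW_TIMEOUT = 600.0  # seconds; the 10 minute compile-time constant


@dataclass
class Flow:
    packets: list = field(default_factory=list)    # packets awaiting extraction
    extracted: list = field(default_factory=list)  # partial information so far
    last_seen: float = 0.0


class FlowTable:
    """Maps a unidirectional (src, sport, dst, dport) key to its flow."""

    def __init__(self, timeout=FLOW_TIMEOUT):
        self.timeout = timeout
        self.flows = {}

    def add(self, ts, src, sport, dst, dport, payload):
        """Append a packet to its flow.

        Returns (flow, expired), where `expired` is the previous flow on
        the same key if it had been idle for at least the timeout.
        """
        key = (src, sport, dst, dport)
        flow = self.flows.get(key)
        expired = None
        if flow is not None and ts - flow.last_seen >= self.timeout:
            # Same addresses and ports, but not close in time: a new flow.
            expired, flow = flow, None
        if flow is None:
            flow = Flow()
            self.flows[key] = flow
        flow.packets.append((ts, payload))
        flow.last_seen = ts
        return flow, expired
```

Since the key is directional, the two halves of a TCP connection land in two different flows, which is what lets the same code run at points in the network that see only one direction.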
If every TCP connection contained exactly one HTTP transaction, the key for indexing the per flow data structure would correspond to a single HTTP request or response. Unfortunately the use of persistent connections [31] is common enough, even in HTTP/1.0, that the key is not sufficient. Instead, the off-line matching of HTTP requests and HTTP responses is done by matching the sequence numbers with the acknowledgments. Indeed there may not be a match because not all HTTP requests will generate an HTTP response message. Given the complications of finding the match between HTTP requests and HTTP responses and the fact that it is separable from the information extraction, this step is done during off-line post-processing.
The extraction stage performs the steps outlined at the beginning of the section on a single, sufficiently long, list of packets. This means that the packets are first ordered according to their sequence numbers. (The sorting is efficient since most packets are only slightly or not at all out of order. The chosen sorting routine handles sequence number overflows correctly.) Next, either all or an initial subset of the packets will be used to extract the TCP/HTTP timing and HTTP request/response information. If no packet has been received for an IP-flow within the last 10 minutes, all packets will be processed. If the list contains more than 300 packets, the first, say, 200 packets (both numbers are compile time constants) within the list are processed. By processing only about 2/3 of the packet list, current gaps in the later part of the list are likely to be filled by packets that are still in transit. (The TCP window size is limited.) In contrast, processing all packets would lead to many more missing packets and a much more incomplete logfile.
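One way to order packets so that 32-bit sequence number overflow is handled, and to pick the subset to process, is sketched below. This is an illustrative Python sketch under stated assumptions, not the routine used in the C code: packets are modeled as `(seq, payload)` tuples, the function names are hypothetical, and the wraparound handling assumes all packets in one list lie within half the sequence space of each other.

```python
SEQ_MOD = 1 << 32   # TCP sequence numbers are 32-bit
MAX_LIST = 300      # list length that triggers partial processing
PARTIAL = 200       # number of packets processed in a partial pass


def seq_offset(seq, base):
    """Signed distance of `seq` from `base` in 32-bit serial arithmetic."""
    d = (seq - base) % SEQ_MOD
    return d - SEQ_MOD if d >= SEQ_MOD // 2 else d


def order_packets(packets):
    """Sort (seq, payload) tuples by sequence number, wraparound-safe."""
    if not packets:
        return []
    base = packets[0][0]
    # Python's sort is adaptive, so nearly ordered input is cheap to sort.
    return sorted(packets, key=lambda p: seq_offset(p[0], base))


def select_for_extraction(packets, idle_seconds, timeout=600.0):
    """Return (process_now, keep_for_later) for one flow's packet list."""
    ordered = order_packets(packets)
    if idle_seconds >= timeout:
        return ordered, []                          # flow timed out: flush all
    if len(ordered) > MAX_LIST:
        return ordered[:PARTIAL], ordered[PARTIAL:]  # leave the tail to fill in
    return [], ordered
```

For example, a list holding sequence numbers 4294967290, 4, and 4294967294 is ordered as 4294967290, 4294967294, 4, because 4 lies just past the wraparound point.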
Due to persistent connections an HTTP header can start at any point during an IP-flow. The HTTP header information is found by looking for one of the following patterns. (For simplicity we denote the patterns using perl notation even though the implementation is in C.)
Here \n may be the UNIX or the MSDOS newline character. The end of the HTTP header is found by looking for two CRLF or when running out of a preset limit of 50000 bytes. This limit is necessary in case the packet containing the newline is lost. Whenever a gap in the sequence numbers is discovered, the flag GAP is set in the logfile and the data should be disregarded in further analysis. In general the packet loss is well below 0.3% and very few HTTP transactions are affected by losses.
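The search for the end of an HTTP header can be illustrated with a short Python sketch. It covers only the terminating blank line and the 50000 byte limit mentioned above; the start-of-header patterns are not reproduced here, and the function name `find_header_end` is an assumption made for the example.

```python
import re

HEADER_LIMIT = 50000  # give up if no blank line is found within this many bytes

# A blank line: two consecutive newlines, each either UNIX (LF) or MSDOS (CRLF).
_HEADER_END = re.compile(rb"\r?\n\r?\n")


def find_header_end(stream, start=0, limit=HEADER_LIMIT):
    """Return the offset just past the header's terminating blank line.

    `stream` is the reassembled byte stream of one flow and `start` is the
    offset where a header begins. Returns None if no terminator lies within
    `limit` bytes, e.g. because the packet carrying it was lost.
    """
    match = _HEADER_END.search(stream, start, start + limit)
    return match.end() if match else None
```

A caller would treat a `None` result, or any sequence number gap inside the scanned range, as grounds for setting the GAP flag on the resulting log entry.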
To support partial processing of packets, our software keeps a state machine for any active flow. The state machine records
It is necessary to record the number of TCP connections that are using a flowid although it is very unlikely that the same customer may reuse a given flowid for a different TCP connection. Still in modem environments this is more likely since a given IP address can be reassigned to different customers and those may use the same port number to visit for example the same Web server. (Port numbers on newly rebooted machines usually start at a fixed port number 1024.)
Since persistent connections can use a given TCP connection for more than one HTTP request, the index of an HTTP request within a TCP connection provides a useful heuristic for the matching of HTTP requests and responses. The next field keeps track of information extracted from each HTTP body, the timestamp of the first packet containing data, and the filename containing the partially reassembled body (where appropriate). The remaining fields store the flow state once a partial list of packets has been processed. The state consists of a bit to indicate if the software is currently extracting an HTTP body, the size of the body extracted so far, and the next sequence number. This state is sufficient under the assumption that partial processing never ends within an HTTP header. To keep this invariant, partial processing is continued beyond the first 200 packets of the current list of packets if an HTTP header is being reconstructed. (While HTTP headers can be spread among many packets, we have not yet found an HTTP header spread among more than 50 packets.) As soon as some complete information, e.g., about a TCP event or about an HTTP header, has been retrieved from the packets, this information is written to the logfile.
This stage is used to age flows. It triggers the extraction stages for an IP-flow if either the list of its packets is larger than 300 packets or if no packet has been received for this flow for the timeout of 10 minutes. Finding the time intervals for scheduling the clean-up stage is complicated by the desire to balance processing overheads versus memory. Currently, the clean-up stage is executed after processing 50,000 packets (another compile-time constant).
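The two decisions made by the clean-up stage, when to run and which flows to hand to the extraction stage, can be sketched as below. This is a hedged Python illustration using the constants quoted in the text; the names `CleanupScheduler` and `flows_to_extract`, and the representation of a flow as a `(packet_count, last_seen)` pair, are assumptions made for the example.

```python
CLEANUP_INTERVAL = 50_000  # packets between clean-up runs
MAX_LIST = 300             # per-flow packet list length that triggers extraction
FLOW_TIMEOUT = 600.0       # seconds of inactivity before a flow is flushed


def flows_to_extract(flows, now):
    """Return the keys of flows that need an extraction pass.

    `flows` maps a flow key to a (packet_count, last_seen) pair.
    """
    return [
        key
        for key, (count, last_seen) in flows.items()
        if count > MAX_LIST or now - last_seen >= FLOW_TIMEOUT
    ]


class CleanupScheduler:
    """Counts processed packets and signals when a clean-up run is due."""

    def __init__(self, interval=CLEANUP_INTERVAL):
        self.interval = interval
        self.seen = 0

    def packet_processed(self):
        self.seen += 1
        if self.seen >= self.interval:
            self.seen = 0
            return True
        return False
```

A smaller interval bounds memory more tightly at the cost of scanning the flow table more often, which is the trade-off between processing overhead and memory mentioned above.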
Finding an appropriate logfile format for BLT is crucial since the online processing paradigm makes it impossible to go back in time and augment the logfile with additional information. The other motivation for a detailed discussion of the logfile format is that it shows the breadth of information that is available by tracing across multiple protocol layers. Our choice of logfile format for BLT was guided by the following concerns:
We want to distill all essential information from the traces; this includes full HTTP header information, full IP/TCP header information, and TCP connection information.
Most Web performance studies do not need the full IP/TCP header information but they gain significantly from accurate timestamp information about HTTP events.
Web performance studies also gain from knowledge of lower layer events such as TCP connection establishment timestamps. These timestamps have a natural equivalent in the application and supply crucial timing information.
The HTTP header information may contain fields that reveal information about the source or destination of the HTTP request. Privacy concerns demand that one separates this information from IP address information.
To meet our goals the HTTP header extraction software splits the information into three different files:
In terms of size the packet header logs are by far the largest. The next smaller ones are the HTTP/TCP logs while the per flow files are the smallest ones. By separating IP/TCP packet headers from HTTP level information we address the conciseness problem. Yet, by keeping strategic TCP events and HTTP events together with the HTTP header information we ensure a level of completeness sufficient for most Web performance studies. In case this level of detail is insufficient, the packet header information is structured to allow an easy join of the datasets. This level of detail is sometimes necessary to verify assumptions and simplifications made using timing information available at the higher levels. For example, we used the packet level information to estimate the impact of slow start on the time savings yielded by first applying delta encoding or compression before transferring the data [29]. By keeping TCP events and HTTP events in the same file it becomes natural to consider cross protocol effects. We keep full HTTP header information since the HTTP protocol is still under development, subject to customization, e.g., cookies, and subject to use by other applications as their transport protocol. (Ignoring any such header can potentially lead to misleading numbers, e.g., ignoring cookies may lead to much higher cache hit rates for Web caching [16].) For privacy reasons it is necessary to separate the per flow information from the HTTP header information since the per flow information contains encrypted IP addresses and the HTTP header may, e.g., contain the hostname of the contacted host. (There is not always a one-to-one correspondence between IP address and hostname.) By keeping the information separate we lessen the impact.
In general the file formats were chosen to facilitate easy processing by scripting languages such as awk and perl. The per flow files contain the encrypted source and destination IP addresses and port numbers. Each entry contains a unique identifier for the flow that is used to cross reference with the HTTP/TCP logfile. We need this cross reference ability in order to match HTTP requests with HTTP responses.
The file format of the HTTP/TCP file is more complicated in part because it needs to record different kinds of information such as TCP events, HTTP events, and HTTP request/response headers. But more importantly the reconstruction procedure may create information for a particular request at any time. We can identify the parts that are associated with the same HTTP transfer by taking advantage of the per flow state. One can identify all TCP events associated with the TCP connection that is used by a particular HTTP transfer by locating all of the TCP events with the same flow identifier and flow count. A TCP connection identified via flow identifier and flow count is a persistent connection if it has more than one HTTP request with the same index.
The file format of the HTTP/TCP file consists of two parts. The first part consists of the basic flow information including the flow identifier, the number of TCP connections seen on this flow identifier, and the number of HTTP requests seen on this flow identifier (see Table 1). Each of these numbers is initialized to 0 in the beginning. The second part consists of a string identifying what kind of record to expect followed by the record-specific information. We distinguish four kinds of records: TCP, DATA, REQUEST, and HTTP headers.
TCP events are identified by the TCP flag they use: SYN, FIN, RST. In addition we differentiate between the first instance of such an event and additional instances. Most analyses are only concerned with when the first such event happened, yet others (e.g., those that track error conditions) care about repeated TCP signaling and therefore about repeated SYNs, FINs, and RSTs. By labeling them differently these are easier to find or eliminate. The specific information is just the timestamp of the packet with the TCP flag.
DATA events summarize the information about an HTTP body, the time of the first packet of the body, the time of the last packet of the body, the length of the body and potentially the filename that contains the data. In addition the information contains a flag that indicates if BLT suspects a missing packet might have created a gap in the data content.
REQUEST events and HTTP headers occur together. The first contains the information if BLT encountered a potential gap and the timestamp information of the first and last packet contributing to this HTTP header. We delimit the raw text of the HTTP header fields with two "random" magic numbers to simplify post-processing of the log files. The HTTP header field starts with the magic number 0xa1b2c3d4 and ends with the magic number 0xb1b2c3d4 on a separate line. In between those two magic numbers we store the header length, the start and end sequence number and the start and end acknowledgment numbers and the actual content of the HTTP header fields.
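The magic numbers make it straightforward to pull header blocks out of a logfile during post-processing. The Python sketch below shows the idea under a loudly stated assumption about layout: it supposes that each magic number sits alone on its own line and that the first line inside the block carries five whitespace-separated integers (header length, start and end sequence number, start and end acknowledgment number). The actual on-disk layout is defined by the file format tables, so treat this layout, and the function names, as hypothetical.

```python
START_MAGIC = "0xa1b2c3d4"
END_MAGIC = "0xb1b2c3d4"


def parse_block(block):
    """Parse one delimited block (ASSUMED layout: numeric line, then text)."""
    length, seq_start, seq_end, ack_start, ack_end = (
        int(value) for value in block[0].split()
    )
    return {
        "length": length,
        "seq": (seq_start, seq_end),
        "ack": (ack_start, ack_end),
        "header": "\n".join(block[1:]),  # raw HTTP header text
    }


def iter_header_blocks(lines):
    """Yield a parsed record for every magic-number-delimited block."""
    block = None
    for line in lines:
        text = line.rstrip("\r\n")
        if block is None:
            if text.strip() == START_MAGIC:
                block = []           # opening magic number: start collecting
        elif text.strip() == END_MAGIC:
            if block:
                yield parse_block(block)
            block = None             # closing magic number: block complete
        else:
            block.append(text)
```

Any line outside a delimited block is ignored by this sketch, so TCP and DATA records can be interleaved with header blocks without confusing the parser.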
Table 2 summarizes the file format of the per HTTP event tables while Table 3 shows a sample entry from one of the logfiles.
This means that for flow 211 and the first TCP connection on this flow the first SYN was observed at time 870839085.884436. The first HTTP request on flow 211 1 started at time 870839086.513424 and ended at time 870839086.513424. The actual HTTP request header was 285 bytes long, starting at sequence number 3871952 and ending at sequence number 3872237. The acknowledgments were for sequence number 68743, and the actual text of the request is: HTTP/1.0 200 OK Date: Wed, 06 Aug 1997 03:40:57 GMT etc.
Most Internet service providers (ISPs) use hot potato routing to hand traffic off to other ISPs as early as possible, creating lots of asymmetric routes in the Internet. Therefore it is very unlikely that a packet monitor will see the HTTP response that is generated by an HTTP request unless the packet monitor is deployed close to either the Web clients or the Web servers. Close here means that there is exactly one path from the Web clients or the Web servers to the rest of the Internet and that the packet monitor is on this path. Since our goal is to be able to deploy BLT at any point in the network, we match HTTP requests with HTTP responses in a separate offline step. This has the additional advantage of reducing processing overheads on the monitoring machine itself. The timeline on the left of Figure 4 shows the basic steps in a Web transfer. In the simplest possible case each line corresponds to a single packet. For the purpose of matching HTTP requests and HTTP responses, note that the HTTP response is the first data that is sent back to the client and will acknowledge the last byte of the HTTP request. Consequently the sequence number of the last byte of the HTTP request should be equal to the acknowledgment number from the HTTP response. In addition the first sequence number of the HTTP response should be equal to the acknowledgment number of the request. This reasoning holds even if the client and server use persistent TCP connections as long as no HTTP requests are pipelined. If HTTP requests are pipelined (see right timeline in Figure 4), detectable by finding more HTTP requests than HTTP responses during a time interval, the above equalities become inequalities. In this case we need additional information; the logfile contains the information about the index of each HTTP request/response on a given TCP connection.
Missing HTTP requests/responses are detected by monitoring the inequalities on the sequence numbers and acknowledgment numbers together with the timing information. Any inconsistencies are handled by adding an offset to, or subtracting an offset from, the request index number. The matching of requests with responses, both for non-pipelined and for pipelined requests/responses, uses the same information that we would have used if the matching had been done online. For pipelined requests, however, the matching may incorrectly pair a request with a response. When matching requests and responses that were collected at different places in the network, one has to be especially careful with regard to the clock synchronization of the monitoring machines.
Besides matching the HTTP requests and responses the HTTP header matching software (written in C) produces a second logfile that contains HTTP request/response pair information. The design of this second logfile format is significantly simpler than the design of the original logfile format (Section 6) since it can be recomputed from the initial logfile format. The choice of logfile format was guided by the following concerns:
To meet these goals the traces that are extracted in an online fashion are processed on a file by file basis. (Request/response pairs that span more than one file are not matched.) While processing the file the software creates an index of all events and parses the HTTP request and response headers. Any HTTP header that is questionable (e.g., because of a missing packet or a misparsing of the HTTP header pattern) is rejected. While parsing the HTTP headers the presence of certain header fields and their values is noted and stored. Once all events have been processed the requests and responses are matched. Next, information about the associated TCP events and DATA transfers is added to the records and a log entry is written.
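The header parsing step, including the rejection of questionable headers, might look roughly like the following Python sketch. This is an illustration under assumptions, not the C implementation: the function name `parse_header`, the set `TRACKED_FIELDS`, and the specific rejection rules are hypothetical. The sketch is deliberately strict and also rejects legal but unusual headers (for example folded continuation lines), in keeping with the policy of discarding anything doubtful.

```python
from typing import Dict, Optional

# Hypothetical subset of the header fields whose presence and value are noted.
TRACKED_FIELDS = {
    "pragma", "referer", "cookie", "set-cookie", "content-length",
    "content-type", "if-modified-since", "last-modified",
}

METHODS = {"GET", "HEAD", "POST"}


def parse_header(raw: str) -> Optional[Dict[str, str]]:
    """Return the tracked fields of one HTTP header, or None if it is questionable."""
    # A header that was never terminated (e.g., a missing packet) is rejected.
    if "\r\n\r\n" not in raw:
        return None
    lines = raw.split("\r\n\r\n", 1)[0].split("\r\n")

    # The first line must look like a request line or a status line.
    parts = lines[0].split()
    is_request = (
        len(parts) == 3 and parts[0] in METHODS and parts[2].startswith("HTTP/")
    )
    is_response = (
        len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit()
    )
    if not (is_request or is_response):
        return None

    fields: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            return None  # not a "name: value" line, so reject the whole header
        key = name.strip().lower()
        if key in TRACKED_FIELDS:
            fields[key] = value.strip()
    return fields


good = parse_header(
    "HTTP/1.0 200 OK\r\nDate: Wed, 06 Aug 1997 03:40:57 GMT\r\n"
    "Content-Length: 120\r\n\r\n"
)
truncated = parse_header("GET /index.html HTTP/1.0\r\nReferer: http://example.com/\r\n")
```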
A logfile entry consists of information that describes the events and of entries that make post-processing simpler, e.g., a unique index for each HTTP request/response pair. To be able to sort all requests in the order in which they were issued, the first element of the log entry is the timestamp of the first packet of the HTTP request. A flag field marks those request/response pairs whose transfers were affected by a packet loss. The per HTTP request information includes the type of request (one of GET, HEAD, POST), the URL, the referrer field, and the size of the HTTP request header. The per HTTP response information includes the response timestamp, the response code, the number of bytes in the response header, the number of bytes of data that were received, the content-length and type from the HTTP response header, and, where appropriate, the name of the file that stores the reassembled content. Additional timestamps are the SYN|FIN|RST timestamps from the sender and the receiver, the timestamp of the first DATA packet, and the timestamp of the last DATA packet. The header fields that are extracted from the HTTP headers include pragma, cache, authorization, authentication, refresh, cookie, set-cookie, expire time, if-modified-since, last-modified, and cache-last-checked directives.
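To make the record layout more concrete, the following Python sketch models a reduced log entry. The class name `LogEntry` and all field names are hypothetical and cover only part of the information listed above; the actual logfile is produced by the C software and its exact on-disk format is not reproduced here. The request timestamp is the first field, so sorting by it orders the requests as they were issued.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    """Reduced, hypothetical model of one request/response log entry."""
    request_ts: float            # timestamp of the first packet of the request
    pair_index: int              # unique index of the request/response pair
    loss_flag: bool              # True if the transfer was affected by packet loss
    method: str                  # one of GET, HEAD, POST
    url: str
    request_header_bytes: int
    response_ts: float
    response_code: int
    response_header_bytes: int
    data_bytes: int              # bytes of data actually received
    content_length: Optional[int] = None   # from the response header, if present
    content_type: Optional[str] = None
    content_file: Optional[str] = None     # file holding reassembled content, if any


def in_issue_order(entries: List[LogEntry]) -> List[LogEntry]:
    """Sort entries by the timestamp of the first request packet."""
    return sorted(entries, key=lambda e: e.request_ts)


# Illustrative entries; all values except the first timestamp are invented.
later = LogEntry(870839090.0, 2, False, "GET", "/b.gif", 300,
                 870839090.4, 200, 180, 2048, 2048, "image/gif")
earlier = LogEntry(870839086.513424, 1, True, "GET", "/a.html", 285,
                   870839087.0, 200, 200, 5120, 5120, "text/html")
ordered = in_issue_order([later, earlier])
```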
A simple indexing schema for our logfile uses the unique flow identifier and the timestamp of the request. Studies [16,29,14] have shown that the logfile contains relevant timing and HTTP header information and as such is fairly complete but also concise and simple. Privacy is achieved by eliminating any reference to even the scrambled IP addresses. The flow identifier still allows the identification of the same source/destination IP addresses.
Data collected by BLT and its predecessors have been used for various Web performance analysis studies. In this section we will point out what part of BLT has enabled each study.
In addition to the studies mentioned above, the data collected by BLT have been used to derive various statistics about the popularity of Web sites, the usage of HTTP header fields [15], and the behavior of consumers and researchers browsing the Web.
Yet another use for BLT involves augmenting active measurements with passive measurements, e.g., to measure the performance of retrieving a Web page from a Web server. The level of detail available via BLT allows us to distinguish the DNS delay from the TCP connection setup delay, from the delay to process the HTTP request, and from the delay to send the data. The biggest benefit of using BLT to augment active measurements is that one does not need a specially instrumented client. Rather, one can use a standard Web client such as Netscape and control it via its remote control features. This approach enables one to separate delays due to rendering at the client from delays due to the network or the Web server.
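One rough way to attribute the delay of a single retrieval to these components is to take differences between consecutive timestamps. The Python sketch below is an assumption-laden illustration, not the analysis used in any of the studies: the function name, the parameter names, and the choice of phase boundaries are hypothetical, and the timestamps are assumed to be integer microseconds from a single clock. In particular, the first phase also includes any client-side delay between the DNS query and the first SYN.

```python
from typing import Dict


def delay_breakdown(
    dns_query_us: int,
    first_syn_us: int,
    request_us: int,
    response_us: int,
    last_data_us: int,
) -> Dict[str, int]:
    """Split one retrieval into phases, in microseconds (hypothetical boundaries)."""
    stamps = [dns_query_us, first_syn_us, request_us, response_us, last_data_us]
    # Reject inconsistent input instead of reporting negative delays.
    if any(later < earlier for earlier, later in zip(stamps, stamps[1:])):
        raise ValueError("timestamps are not in chronological order")
    return {
        "dns": first_syn_us - dns_query_us,            # DNS query until first SYN
        "tcp_setup": request_us - first_syn_us,        # first SYN until request
        "request_processing": response_us - request_us,  # request until response
        "data_transfer": last_data_us - response_us,   # response until last data
    }


# Invented timestamps, relative to the DNS query, in microseconds.
phases = delay_breakdown(0, 120_000, 750_000, 1_100_000, 2_600_000)
total_us = sum(phases.values())
```

By construction the phases sum to the time between the DNS query and the last data packet, which gives a simple consistency check on the input.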
BLT has been designed to allow continuous collection of real world traces at many different locations in the Internet. It has been used to collect several months of real world traces from AT&T WorldNet, a consumer based ISP, and from AT&T Labs-Research in Florham Park. BLT is unique in giving us access to HTTP and TCP level traces at the same time. The collected datasets are novel (1) in the degree of detailed information they provide, (2) in presenting us with a client-side view of the Web, and (3) in the duration of the traces. The latter both challenges and benefits any analysis driven by data collected by BLT. Without datasets such as those collected by BLT, one can only speculate about the Web or construct artificial datasets with all their pitfalls. The richness of the datasets and their completeness have motivated and enabled several studies.
The most important lesson we learned from writing BLT is: expect the unexpected and respect the challenges to the HTTP header reconstruction as discussed in Section 3. Once data about multiple layers of the networking stack is available it provides a playground for many analyses. To avoid preempting these analyses it is important to create well documented, precise, yet complete logfiles (e.g., include the full HTTP headers). In extracting the information it is crucial to avoid assumptions about how well-behaved the clients, the servers, or the network might be [22]. They are not. Other common lessons from the implementation include: don't try to do too much processing in the time-critical steps of the logfile extraction; simplify wherever sensible and reasonable; reduce memory use and disk I/O. But in the end the most crucial lesson was never to expect a perfect logfile. There will always be one more exception or one more misbehaved client/server. Therefore, the matching software and any analysis program should test whatever assumptions the data has to satisfy and eliminate any data that violates them. With enough care the number of requests that are discarded by each step is small.
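The last lesson, testing every assumption and discarding the data that violates it, can be sketched in Python as follows. The helper `keep_consistent`, the record keys, and the two example checks are hypothetical; the point is only that each analysis states its assumptions as explicit checks and keeps count of how many records each check discards.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Check = Tuple[str, Callable[[Record], bool]]


def keep_consistent(
    records: List[Record], checks: List[Check]
) -> Tuple[List[Record], Counter]:
    """Keep the records that pass every check; count discards per failed check."""
    kept: List[Record] = []
    discarded: Counter = Counter()
    for rec in records:
        # Name of the first check this record violates, if any.
        failed = next((name for name, ok in checks if not ok(rec)), None)
        if failed is None:
            kept.append(rec)
        else:
            discarded[failed] += 1
    return kept, discarded


# Example assumptions an analysis might rely on (hypothetical record keys).
checks: List[Check] = [
    ("response_after_request", lambda r: r["response_ts"] >= r["request_ts"]),
    ("non_negative_bytes", lambda r: r["data_bytes"] >= 0),
]
records = [
    {"request_ts": 10.0, "response_ts": 10.5, "data_bytes": 2048},
    {"request_ts": 11.0, "response_ts": 10.9, "data_bytes": 512},
    {"request_ts": 12.0, "response_ts": 12.2, "data_bytes": -1},
]
kept, discarded = keep_consistent(records, checks)
```

Reporting the per-check discard counts alongside the results makes it easy to verify that each step removes only a small number of requests.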
It is currently possible to monitor links of up to 100 Mbit/s using off-the-shelf computer components. As link speeds grow, the memory and CPU performance of these systems become bottlenecks. In this case the processing of the data could be pushed closer to the link, e.g., onto the interface cards. Alternatively one could develop special purpose hardware or restrict the observed traffic to the specific subset of interest. There are two options for the latter approach: select a subset and perform the same computation, or select all and perform a simpler computation that approximates the full computation. The experience collected with tools like BLT is crucial for judging the quality of the resulting datasets.
The software needs to undergo continuous evolution. Even as we are outlining the current design of BLT the next generation is being developed. The new version incorporates, among others, the following significant improvements: (1) There is no notion of files, so request/response pairs will be properly matched. (2) It is not necessary to parse the data content since the new tool can determine the length of the HTTP content from the HTTP header information unless a RST is encountered. (3) This enables a direct split of the HTTP content from the HTTP header information and has the potential to reduce the overhead of protocol information extraction significantly. (4) The linked list of packets is replaced with a modified splay tree routine that will automatically account for retransmitted packets and/or gaps. Another avenue of future work is to extend the protocol awareness to other protocols such as RTSP. Such protocols add the complication of using dynamically assigned UDP ports for exchanging media data. mmdump [9] is a tool that allows users to monitor such multimedia traffic.
I acknowledge all my colleagues at AT&T Labs who are involved in the measurement effort and thank them for their help in developing the software architecture. Special thanks go to A. Greenberg, R. Caceres, N. Duffield, P. Mishra, C. Kalmanek, K.K. Ramakrishnan, and J. Rexford. Many thanks to everyone in WorldNet who made the deployment of the PacketScopes possible.
I am very grateful to J. Rexford and B. Krishnamurthy for many discussions and constructive criticism on the presentation of the material. Many thanks to G. Glass for writing the HTTP header matching software.