Monday, February 20, 2012

Identifying computers behind NAT with pyflag

I've been a bit busy recently as I'm preparing to move across the world to the US to work at a small Internet company in the SF Bay Area. In the mean time though my current employer has been kind enough to let me contribute back some of the code we have written for the pyflag project (the link goes to my github page which has a fork of the project as the upstream site pyflag.net is down right now). Update: An alternate version (without the feature described below) is available on google code

The new features centre around identifying computers that are all lumped together behind a network address translation gateway (NAT). The idea is if you can identify the computers behind the NAT gateway you can attribute traffic to a specific system rather than only down to the network itself. The implementation is some visualisation tools in pyflag that allow you to plot certain packet headers fields against packet numbers or time.

Here's an example:
IPID field plotted against PCAP packet number
The plot takes the IP Identification field from the IP header and plots it sequentially against the PCAP packet number (pyflag also supports plotting against time). It looks like a big mess but you can see some lines and maybe some patterns in there. The IPID field is used to associate fragmented packets together for reassembly and it is generally left untouched by NAT gateways. Usefully different networking stacks have different strategies for picking IPID values.

In my anecdotal (non-scientifically determined) experience:

  •  Windows machines start at 0 when the computer is booted and increment for each packet sent up until 2^16 and then start again. In some cases it seems to wrap at 2^15 which to me suggests a signed integer problem but I haven't conclusively figured out on what versions it happens on. Additionally, I've read (but not seen) that some versions of Windows send the field in host order rather than network byte order.
  • Linux machines pick a random number for the start of the connection and then increment the value for each subsequent packet of the connection. I've heard (but again not seen) that packets with the Don't Fragment bit set get their IPID set to 0 on Linux.
  • BSD machines (including Mac OS X) pick a random number for every packet.
So looking back at our example we can see a haze of small lines and also a couple of longer lines which suggests that we might be looking at one or more Linux boxes along with one or more Windows boxes. To test this theory I looked for any user-agent strings in web traffic and found the following:

User-Agent strings present in the sample PCAP file
Based on those user agent strings it looks like there is at least one Ubuntu system and one Windows system. Also of note is the presence of Java user agent strings as well as Transmission (the Ubuntu Bittorrent client).

If we revisit our previous IPID plot and tell pyflag to colour all the Chrome/Windows user agent string related streams blue we get the following:

IP ID versus PCAP number with Chrome on Windows streams highlighted
From this it becomes clear that there are two distinct lines of IPID growth which implies that behind this NAT gateway are two Windows systems, one which was active for longer and even sent enough packets that the IPID value wrapped. Knowing the shape of these lines means that you can associate other traffic (perhaps traffic with no distinguishing application layer features such as encrypted streams) to a specific computer and any metadata gleamed from other application protocols (like HTTP).  

To make this even clearer there's another header field to consider, this time in the TCP header. There is an optional header in TCP called the timestamp value (defined by RFC1323) which is used to measure packet round trip times. By default Windows systems omit this value while most other systems include it (I've read that Windows can be configured to send timestamps and that in some cases will use timestamps if the client connecting to it uses timestamps). This means that if we exclude packets that have a TCP timestamp we should be left with all Windows traffic (assuming we exclude non-TCP traffic as well).

IPID versus PCAP number for Chrome user-agents, minus packets that have a TCP timestamp
After excluding packets with the TCP timestamp option set most of the background packets have been excluded. The remaining packets that don't fall on the lines are likely parser failures or packets generated by a Linux box that do not have a timestamp value for one reason or another (more investigation is required).

So we're convinced that there are two Windows system on the network and some yet to be determined number of Linux systems, if we change our filter to highlight Firefox on Linux and then plot IPID we get something that looks like this:

IPID versus PCAP number for Firefox sessions on Linux
 The things to note here is that the IPID values change dramatically between connections, also that in general HTTP traffic seems to be in the minority of the non-Windows traffic and finally that we're no closer to determining how many Linux systems are present. However, if we consider the TCP timestamp field for a moment we learn that it's generally determined as:

From: Identifying hosts with natfilterd
The interesting part in this case is that wallclock - boottime should be unique among the hosts that use the TCP timestamp option and it should increment in a predictable fashion. So if we graph the TCP timestamp value of packets versus their PCAP number we get:

TCP Timestamp value versus PCAP packet number (Firefox/Linux traffic highlighted)
Again we can see that the Firefox traffic accounts for only a minority of packets and we also see that there're two distinct lines for the first half on the plot. These two lines suggests that there are two Linux systems and the line fragment at the end probably represents a reboot (and not wrapping because the timestamp values are 32 bit numbers and the values we see are around 2^18 at their highest) of one of the systems or the appearance of a new one.

So at this point I'm convinced that there are two Linux systems and two Windows system and that most of the Windows packets are HTTP traffic (using Chrome) and that while there is HTTP traffic it accounts for only a small amount of the Linux related packets. For the remainder of the Linux traffic I'd guess that at least one of the systems is transferring files using BitTorrent based on the Transmission user-agent that was present before. Maybe if we plot the traffic with the Transmission user-agent we'll be able to tell which computers were running BitTorrent:


TCP Timestamp versus PCAP Packet Number for the user-agent "Transmission"
 At first this looks good, the line with the lower timestamp values is associated with Transmission and the higher one is not. Unfortunately this plot is ambiguous because the third line section is also associated with Transmission traffic and that line could easily belong to the top line section (after a reboot). If instead we ask pyflag to generate a table with only traffic that is not to or from ports 80 or 53 (to eliminate HTTP and DNS) we're left with a lot of connections between high ports transferring lots of  encrypted (looking) data to our NAT gateway address which fits the hypothesis of BitTorrent traffic. When we plot the timestamp values again and highlight any packet from our Not-HTTP/Not-DNS table we get the following:

TCP Timestamp versus PCAP number with non-HTTP/non-DNS traffic highlighted
At this point I'm reasonably confident that both the observed Linux hosts are downloading files over BitTorrent once I combine this plot with some analysis of the ports / stream sizes seen while I'm equally convinced that the Windows systems are not using BitTorrent or at least that there isn't a significant level of BitTorrent traffic observed during this packet capture.
The above little demo is contrived but I have found that this kind of analysis can be really useful in characterising the use of a network. This example was constructed from 5 virtual machines, 2 running Windows XP, 2 running Ubuntu 10.04 and a NAT gateway running Ubuntu 10.04 and using iptables/netfilter to do the NATing. Also, just in case you were wondering the Windows machines were watching youtube (in particular nyan cat and techno viking) while the Ubuntu systems were each using BitTorrent to download ubuntu images (12.04 alpha for different architectures). 


Future Work
  • Spring cleaning of the pyflag source (it's a little annoying to build and use right now)
  • More options on what to graph (maybe a system for generically plotting table information)
  • Ability to choose what to highlight based off the reverse side of a stream
  • Implementing a minimal version of this visualisation outside of pyflag Done! Identifying computers behind NAT with plotpcap

Related Work and Further Reading
Now that I've got the links handy I thought I'd also point at Michael Cohen's work. Michael is one of the authors of pyflag (project lead is probably a better description), and it's his ideas and that lead to the implementation of IP ID processing in pyflag.