The Collisions group did a large amount of mining on the raw DITL data. What follows is information on using the processed data from that work, as well as notes made during the processing. Roy Hooper from Demand Media started this processing effort. Kevin White from JAS Global Advisors contributed additional features.
At the time of this work, there were eight years of DITL data available, 2006-2013. Each year has been processed, producing two sets of results files. Both sets contain the same data rows; they differ only in how the rows are split across files:
- intermediate: one file for each original source pcap file
- by-tld: one file for each new proposed TLD
The processed data has been made available in a directory structure in /mnt/oarc-pool3/collisions. There are two top-level directories, intermediate and by-tld. Inside each are folders for each year, with a run tag appended. The run tags allow different versions of the data to be made available. Current run tags:
- encoding: uses the encoding logic to correctly display every possible octet used in the query name
For example, the 2013 "encoding" run has its own directory under each of the two top-level directories.
Both directories contain a "map" file called "pcapmap". This is a simple file, containing one row per source pcap file. The row contains a numeric file ID, a space, and the full path to the file. Each row in the output contains the same numeric file ID, so that any given row in the dataset can be traced back to its source pcap. The output format is the same in each file. Each row contains the following data elements, separated by spaces:
date time tld sld fileid protocol sourceip.sourceport destip.destport root querytype stringlength string
2013-05-29 11:55:07.998745 home _udp 2 udp 2a01:c0:1:1:be30:5bff:fed0:31eb.1731 2001:503:c27::2:30.53 j-root PTR 20 db._dns-sd._udp.home
Notes:
- The root is determined by destination IP, not by source pcap. You can use the fileid to determine the source pcap folder, if needed.
- Protocol is udp or tcp.
- The string itself uses an encoding, described below.
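Given that format, a row can be split into its fields with ordinary string handling; the one subtlety is that IPv6 addresses contain colons, so the port must be taken from after the last '.'. A minimal sketch (an illustration, not code from the repository):

```python
def parse_row(line):
    """Split one output row into its named fields.

    The query string cannot contain raw spaces (they are encoded as \\20),
    so a plain whitespace split yields exactly 12 fields.
    """
    (date, time, tld, sld, fileid, proto,
     src, dst, root, qtype, length, qname) = line.split()
    # The port follows the LAST '.', which also works for IPv6 addresses.
    src_ip, src_port = src.rsplit(".", 1)
    dst_ip, dst_port = dst.rsplit(".", 1)
    return {
        "date": date, "time": time, "tld": tld, "sld": sld,
        "fileid": int(fileid), "protocol": proto,
        "src": (src_ip, int(src_port)), "dst": (dst_ip, int(dst_port)),
        "root": root, "qtype": qtype,
        "length": int(length), "qname": qname,
    }
```

Applied to the sample row above, this yields, for example, a "root" of "j-root" and a destination of ("2001:503:c27::2:30", 53).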
Query String Encoding
Technically, any octet value from 0-255 is legal in the label of a domain name, so the output needs to be able to encode any value. However, the common use case is LDH (letters, digits, hyphen), so a raw byte encoding would not be very useful. Of the values 0-255, the characters 33 through 126, inclusive, are printable ASCII. 32 is the space, which we chose to encode so as not to break space as a delimiter. 127 is DEL. Values below 32 are control characters, and values 128 and above are not defined in ASCII; all of these are non-printable. We display a non-printable character as a three-character string, \xx, where xx is the 2-digit hex representation of the octet. Printable characters that are also encoded this way are:
- SP: \20
- .: \2e
- \: \5c
Space is encoded to allow the line to be space delimited. '.' is encoded because technically '.' is a valid character inside a label, and would otherwise be indistinguishable from the '.' used to separate labels (in other words, a literal '.' in the output file is known to be a label separator, and any actual period character in the query string itself is represented as \2e). '\' is encoded to distinguish it from the escape-sequence initiator. Parsing this encoding is as simple as taking each character literally, unless the character is a '\': in that case, take the '\' and the following two characters, parse them as hex, and replace them with the corresponding byte.
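The rules above are mechanical, so a short sketch may help; this is an illustration of the encoding, not the code from the repository:

```python
import re

# A \xx escape is a backslash followed by exactly two hex digits.
_ESCAPE = re.compile(r"\\([0-9a-fA-F]{2})")

def decode_qname(s):
    """Replace each \\xx escape with the character it represents."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)

def encode_octet(b):
    """Encode one octet (0-255) per the rules above."""
    if 33 <= b <= 126 and b not in (0x2E, 0x5C):  # printable, not '.' or '\'
        return chr(b)
    return "\\%02x" % b
```

For example, decode_qname(r"db\2e_dns-sd") returns "db._dns-sd", recovering the literal period inside the label.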
JAS Global Advisors has made its version of the code used to do the processing available in a GitHub repository. It is located here: https://github.com/JASdevteam/dns-oarc. The code in the repository consists of a Perl script that began life as the Perl and shell scripting that Roy Hooper from Demand Media wrote. The code actually contains two distinct processing phases, selected by command-line parameter:
- --raw: Process the raw PCAP files
- --jas: Do SLD summarization on the by-tld files
subtld.pl --year YEAR --suffix SUFFIX [--raw] [--jas] [--single tldname] [--jassuffix SUFFIX] [--ldh | --noldh] [--len | --nolen] [--random | --norandom] [--hyphen | --nohyphen]

- --raw: do processing of RAW data
- --jas: do processing of JAS summarization (without --raw or --jas, nothing will be done)
- --jassuffix: suffix used for destination in JAS pass
- --ldh: do checks for LDH at SLD
- --len: do length checks: entire string, SHORTSLD
- --random: do checks for random at left-most label
- --hyphen: do checks for trailing/leading hyphens at all levels
The various yes/no flags affect the JAS summarization step. The --raw phase uses the /mnt/oarc-pool* locations as its source, and sets up a destination directory tree in the user's home. The --jas phase looks first in the local destination directory tree, and then in the system-wide one mentioned above, for its source. The --suffix option adds the run tag. An optional second suffix option, --jassuffix, adds a second tag for a JAS summarization run. The various checks enable/disable different validity checks in the JAS summarization step. The random check here is the simplified random check; JAS's more statistically rigorous random checks happen elsewhere.
At the time of this writing, an1-3 default to the "C" locale when a new user account is made. Most other modern systems use UTF-8 at this point, so sort order appeared inconsistent when results were compared against files sorted on machines such as Red Hat boxes and OS X machines. JAS has transitioned to doing all of its processing and sorting in UTF-8. To set a user account to UTF-8 on FreeBSD, edit ~/.login_conf so that its contents look like the following:
me:\
	:charset=UTF-8:\
	:lang=en_US.UTF-8:
After logging out and in again, the "locale" command should show en_US.UTF-8 for everything. For Linux, add the following to your shell startup somewhere:
LANG=en_US.UTF-8
export LANG
On an2, I placed these lines at the start of .bashrc, which is called from .profile. If you have made a .bash_profile or a .bash_login that does not call .bashrc, you may need to place the commands there as well.
We had noticed that there were many data sets labeled DITL-2010*. We used /mnt/oarc-pool4/DITL-20100413/RAW, because it corresponded to the dates given in the DITL-2010 web page. We asked Duane Wessels for confirmation on that, and he responded:
2010 was a big year for the root servers due to DNSSEC signing of the root zone. Each time a new root server switched from unsigned to the signed (but unvalidatable) zone, there was a "mini-DITL". So all those dates you see are collection events with only root server traffic. The "official" DITL date is the one you've been using (mid-April) and the one with the most root servers participating.
- JAS created a map of IP address to root. The goal was for this map to be "complete", containing all known root IP addresses over the years; it is probably not complete. In addition, several entries were made to the map simply because there were large numbers of queries destined for those addresses, and it seemed incorrect to throw them out. Improvements to the map are always welcome. The map is represented by the variable %ROOTIP in the Perl code.
- 2012 j-root appears to have some large data files that may not actually contain root traffic. These files are large: much larger than other pcap files for that year and root. They also don't contain much data that belongs to the set of new GTLDs: they instead seem to contain queries to valid TLDs at the time. The packets are GRE encapsulated, and neither destination IP matches the list of known root IPs. Processing through the files takes a long time, since they are so huge, and results in very little (and in the case of several of the hosts, no) matched rows. These files are located in /mnt/oarc-pool4/DITL-20120417/RAW/j-root/pcap/devnr*
- The 2009 data appears to no longer contain the RAW data. It was reported that, at some point, a cleanup process may have removed it. Thus, for 2009, I used the CLEAN-ROOTS data.
- Getting the query name byte-perfect required scanning the raw packet data, because tcpdump mangles the names irreversibly upon output. Doing so meant diving deep into the packets and frames in the dumps. Some of the input files have a link type of "1" in the header, which is Ethernet. Packets in these files contain the full Ethernet frame, and many carry VLAN tags, which can be detected by looking at the EtherType field. Other files have type "101", which is LINKTYPE_RAW; these files do not have the Ethernet frame, and instead start with the IPv4 or IPv6 header. The 2008 DITL data contains pcap files with type 108, which is OpenBSD loopback; these files have 4 extra bytes at the start that can be skipped over, taking you to the IP packet.
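The per-frame offsets described above can be summarized in a short sketch (an illustration under the stated assumptions, not the repository's Perl code; nested QinQ VLAN tags are not handled):

```python
import struct

LINKTYPE_ETHERNET = 1    # full Ethernet frame, possibly VLAN-tagged
LINKTYPE_RAW = 101       # frame starts at the IPv4/IPv6 header
LINKTYPE_LOOP = 108      # OpenBSD loopback: 4-byte address-family word first

def ip_offset(linktype, frame):
    """Return the byte offset of the IP header within one captured frame."""
    if linktype == LINKTYPE_RAW:
        return 0
    if linktype == LINKTYPE_LOOP:
        return 4
    if linktype == LINKTYPE_ETHERNET:
        # The EtherType sits at bytes 12-13; 0x8100 marks an 802.1Q VLAN
        # tag, which inserts 4 extra bytes before the real EtherType.
        (ethertype,) = struct.unpack_from("!H", frame, 12)
        return 18 if ethertype == 0x8100 else 14
    raise ValueError("unhandled link type %d" % linktype)
```

The link type itself comes from the pcap file's global header, so one lookup per file suffices; only Ethernet captures need the per-frame EtherType check.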
JAS and simMachines submitted a report in response to ICANN's 05aug13 DNS Name Collisions Public Comment period.
Please see the JAS and simMachines phase one draft report Mitigating the Risk of DNS Namespace Collisions.
We want to recognize the valuable analytical contributions from our longtime partner Arnoldo Muller-Molina and the whole team at simMachines.