As well as running twice-yearly workshops, various public benefit tools and inter-member co-operation platforms, OARC operates a number of large-scale data gathering initiatives, which collect data from its members' infrastructure. One of these, initiated in 2004 in co-operation with CAIDA and funded by the NSF, is "A Day in the Life of the Internet" (DITL). This gathers detailed data-sets of DNS queries to root and top-level DNS operators for a 48-hour period at least once a year. The idea is to have a baseline data archive which can be compared year-on-year. Data has also been gathered during significant change points in the global DNS, such as the IPv6 delegation and DNSSEC signing of the root. Over the past decade, OARC has accumulated a data-set in excess of 40 TB of DITL queries.

During 2012, a potential new obstacle on the path to deployment of ICANN's new TLDs became apparent to the ICANN SSAC (Security and Stability Advisory Committee). A risk was identified that some of the proposed new TLDs were already in widespread internal-only use within enterprises, and on top of this, SSL certificates which had only ever been intended for such internal use had already been issued to these organizations. This could lead to a risk of collisions between valid internal use of these TLDs, and potentially malicious misuse of these certificates on the global Internet.

Clearly this was a potentially significant problem, with a tension between the interests of new TLD operators who want to see their new domains deployed as quickly as possible, versus some very real risks of abusive activity, or even just unintended consequences, either or both of which could have global impact.

When determining policies on how to proceed in such situations, it's important to have data to base them upon. Given the tight deployment time-scales, gathering new data from scratch could have been a significant and time-consuming exercise. Fortunately, it was quickly identified that OARC's DITL data-set could contain the evidence needed to help determine whether the SSAC's concerns were real ones in practice, and if so, how severe they were. The logs of queries to the root and TLD servers contain not just valid top-level domain strings, but also "leakage" of strings intended for internal-only use which escape into the wider Internet due to various mis-configurations. It is exactly these kinds of unintended consequences which can lead to the concerns expressed in the study, making the data gathered a useful sample of what could go astray or be exploited.
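As a rough illustration of how such leakage can be spotted in a query log, the sketch below tallies the rightmost label of each queried name and keeps only those labels that are not delegated TLDs. It is a minimal example under stated assumptions, not the method used in any of the studies: the function names, the tiny `DELEGATED_TLDS` set and the sample query names are all hypothetical, and real DITL data would first need to be extracted from packet captures.

```python
from collections import Counter

# Hypothetical, deliberately tiny stand-in for the list of delegated TLDs.
DELEGATED_TLDS = {"com", "net", "org", "arpa"}


def top_level_label(qname):
    """Return the lower-cased rightmost label of a query name ("" for the root)."""
    labels = qname.rstrip(".").split(".")
    return labels[-1].lower()


def count_leaked_tlds(qnames, delegated=DELEGATED_TLDS):
    """Count queries whose rightmost label is not a delegated TLD."""
    counts = Counter()
    for qname in qnames:
        tld = top_level_label(qname)
        # Skip queries for the root itself and for delegated TLDs.
        if tld and tld not in delegated:
            counts[tld] += 1
    return counts


# Illustrative query names only; not taken from any real data-set.
sample = ["www.example.com.", "fileserver.corp.", "printer.home", "intranet.CORP", "."]
leaked = count_leaked_tlds(sample)
```

Ranking the resulting counts gives a first impression of which undelegated strings appear most often at the servers that were sampled.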

While OARC's DITL data-set was recognized as being highly relevant to this particular need, it is important to understand that it is only one view of the DNS, and by no means a definitive or complete one. For example, it only includes some queries to some root operators for a short period, and does not cover many other TLD operators, or ISPs providing DNS resolver services to their subscribers. It is probably impossible to get a complete view of the DNS by traffic gathering techniques, and the value of multiple different approaches should not be overlooked.

Having identified the problem and the data-set which could be a solution, it became possible to start working on the data using existing OARC facilities. In the meantime, however, a number of requirements needed to be tackled to perform further analysis of the data:

  • DNS data submitted to OARC from jurisdictions across the world is potentially sensitive, and held in trust by OARC under strict confidentiality terms. This allows data submission by a much wider community than would otherwise be possible. However, these terms prevent the copying of the data from OARC's archive to third-party systems.
  • While the systems hosting OARC's growing data set have been regularly updated over the years, much of its supporting infrastructure, including computing resources for doing in-situ data analysis by members and researchers, had not been upgraded since the original NSF bootstrap funding a decade earlier, and was in sore need of upgrading.
  • Many new TLD operators wanted to become OARC members in order to both support its mission and carry out their own analysis of the DITL data sets at OARC, independently of the ICANN-sponsored work that had been carried out by Interisle/RTFM.
  • This was all happening in the context of the pressing timescales of new TLD deployment and ICANN's analysis and comments timescales.

Fortunately, as a result of a re-development plan committed to by OARC's Board earlier in 2013, a major hardware and software refresh was already under way. By the time the Collisions Strings study requirement was identified, OARC's Systems Engineer was ready to deploy the new compute resources needed.

OARC was thus quickly able to take delivery of, and bring into service, a significant equipment donation of several Dell R820-grade servers from various interested OARC members. These are very high spec machines, with 64-core processors and at least 48 GB of RAM. They take OARC's analysis capability firmly into the present, and will be of immense value not just for ongoing Collisions studies, but for the general-purpose needs of OARC members and researchers for some years into the future.

The results of these various studies will be presented and discussed at OARC's upcoming twice-yearly Workshop in Phoenix, Arizona on 5th October 2013.

OARC has been in the business of "Big Data" for much of its existence, but it is only recently that the value of such large-scale data gathering has been widely defined and recognized. With this major contribution, and further donations of equipment and space to host it pending, OARC looks forward to participating in the innovation revolution of Cloud Computing and Big Data.

OARC's ability to provide a solution to a problem that was not envisaged at its founding underlines the value of neutral general-purpose data gathering from the DNS in the wider context of Internet Science.