Principal Investigators: Michael Reiter and Fabian Monrose
Funding Agency: Johns Hopkins University
Agency Number: 2000457356
The availability of realistic network data plays a significant role in fostering collaboration and ensuring U.S. technical leadership in network security research. Unfortunately, a host of technical, legal, policy, and privacy issues limit the ability of operators to produce datasets for information security testing. In an effort to help overcome these limitations, the Department of Homeland Security (DHS) has endeavored to create a national repository of network traces under the Protected Repository for the Defense of Infrastructure against Cyber Threats (PREDICT) program. A key technique used in this program to assure low-risk, high-value data is that of trace anonymization—a process of sanitizing data before release so that information of concern cannot be extracted. Indeed, many believe that proven anonymization techniques are the missing link that will enable cyber security researchers to tap real Internet traffic and develop effective solutions tailored to current risks.
Recently, however, the utility of these techniques in protecting host identities, network topologies, and network security practices within enterprise networks has come under scrutiny. Much of our own work, for example, has shown that deanonymization of public servers, recovery of network structure, and identification of browsing habits may not be as difficult as first thought. Given the significant reliance on anonymized network traces for security research, we argue that a more exhaustive and principled analysis of the trace anonymization problem is in order.
The naive solution to this problem (i.e., new anonymization techniques that directly address the specifics of these attacks) does not fully account for the underlying dynamic of dataset publication: the trade-off between dataset quality and privacy. While important, isolated advances will simply shift the information-encoding burden to other properties of the traces, resulting in future breaches. To truly address this problem, we argue that what is needed is a framework for evaluating the risk in anonymization techniques and datasets. Based on novel information-theoretic concepts, we propose techniques that will allow the network or security practitioner to evaluate the risk inherent in their choice of policy and anonymization technique, along with ways to minimize the risk of deanonymization. We propose to implement and evaluate this framework in the context of traces, techniques, and policies from our own networks as well as those offered as part of the PREDICT project. In addition, we propose to investigate the public policy implications of this work, particularly those pertaining to types of networking research that rely on trace attributes that cannot be effectively anonymized.