This could lead to the next big breakthrough in common sense AI

November 10, 2020

Researchers are teaching giant language models how to “see” to help them understand the world.

November 6, 2020

You’ve probably heard us say this countless times: GPT-3, the gargantuan AI that spews uncannily human-like language, is a marvel. It’s also largely a mirage. You can tell with a simple trick: Ask it the color of sheep, and it will suggest “black” as often as “white”—reflecting the phrase “black sheep” in our vernacular.

That’s the problem with language models: because they’re only trained on text, they lack common sense. Now researchers from the University of North Carolina, Chapel Hill, have designed a new technique to change that. They call it “vokenization,” and it gives language models like GPT-3 the ability to “see.”

It’s not the first time people have sought to combine language models with computer vision. This is actually a rapidly growing area of AI research. The idea is that both types of AI have different strengths. Language models like GPT-3 are trained through unsupervised learning, which requires no manual data labeling, making them easy to scale. Image models like object recognition systems, by contrast, learn more directly from reality. In other words, their understanding doesn’t rely on the kind of abstraction of the world that text provides. They can “see” from pictures of sheep that they are in fact white.

AI models that can parse both language and visual input also have very practical uses. If we want to build robotic assistants, for example, they need computer vision to navigate the world and language to communicate about it to humans.

But combining both types of AI is easier said than done. It isn’t as simple as stapling together an existing language model with an existing object recognition system. It requires training a new model from scratch with a data set that includes text and images, otherwise known as a visual-language data set.

The most common approach for curating such a data set is to compile a collection of images with descriptive captions. A picture like the one below, for example, would be captioned “An orange cat sits in the suitcase ready to be packed.” This differs from typical image data sets, which would label the same picture with only one noun, like “cat.” A visual-language data set can therefore teach an AI model not just how to recognize objects but how they relate to and act on one another, using verbs and prepositions.

But you can see why this data curation process would take forever. This is why the visual-language data sets that exist are so puny. A popular text-only data set like English Wikipedia (which indeed includes nearly all the English-language Wikipedia entries) might contain nearly 3 billion words. A visual-language data set like Microsoft Common Objects in Context, or MS COCO, contains only 7 million. It’s simply not enough data to train an AI model for anything useful.

“Vokenization” gets around this problem, using unsupervised learning methods to scale the tiny amount of data in MS COCO to the size of English Wikipedia. The resultant visual-language model outperforms state-of-the-art models in some of the hardest tests used to evaluate AI language comprehension today.

“You don’t beat state of the art on these tests by just trying a little bit,” says Thomas Wolf, the cofounder and chief science officer of the natural-language processing startup Hugging Face, who was not part of the research. “This is not a toy test. This is why this is super exciting.”

From tokens to vokens

Let’s first sort out some terminology. What on earth is a “voken”?

In AI speak, the words that are used to train language models are known as tokens. So the UNC researchers decided to call the image associated with each token in their visual-language model a voken. Vokenizer is what they call the algorithm that finds vokens for each token, and vokenization is what they call the whole process.

The point of this isn’t just to show how much AI researchers love making up words. (They really do.) It also helps break down the basic idea behind vokenization. Instead of starting with an image data set and manually writing sentences to serve as captions—a very slow process—the UNC researchers started with a language data set and used unsupervised learning to match each word with a relevant image (more on this later). This is a highly scalable process.

The unsupervised learning technique, here, is ultimately the contribution of the paper. How do you actually find a relevant image for each word?


Let’s go back for a moment to GPT-3. GPT-3 is part of a family of language models known as transformers, which represented a major breakthrough in applying unsupervised learning to natural-language processing when the first one was introduced in 2017. Transformers learn the patterns of human language by observing how words are used in context and then creating a mathematical representation of each word, known as a “word embedding,” based on that context. The embedding for the word “cat” might show, for example, that it is frequently used around the words “meow” and “orange” but less often around the words “bark” or “blue.”

This is how transformers approximate the meanings of words, and how GPT-3 can write such human-like sentences. It relies in part on these embeddings to tell it how to assemble words into sentences, and sentences into paragraphs.

There’s a parallel technique that can also be used for images. Instead of scanning text for word usage patterns, it scans images for visual patterns. It tabulates how often a cat, say, appears on a bed versus on a tree, and creates a “cat” embedding with this contextual information.

The insight of the UNC researchers was that they should use both embedding techniques on MS COCO. They converted the images into visual embeddings and the captions into word embeddings. What’s really neat about these embeddings is that they can then be graphed in a three-dimensional space, and you can literally see how they are related to one another. Visual embeddings that are closely related to word embeddings will appear closer in the graph. In other words, the visual cat embedding should (in theory) overlap with the text-based cat embedding. Pretty cool.

You can see where this is going. Once the embeddings are all graphed and compared and related to one another, it’s easy to start matching images (vokens) with words (tokens). And remember, because the images and words are matched based on their embeddings, they’re also matched based on context. This is useful when one word can have totally different meanings. The technique successfully handles that by finding different vokens for each instance of the word.

For example:

Here is her contact.
Some cats love human contact.

The token is the word “contact” in both examples. But in the first sentence, context suggests that the word refers to contact information, so the voken is the contact icon. In the second sentence, the context suggests the word refers to touch, so the voken shows a cat being stroked.
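The matching step can be pictured as a nearest-neighbor lookup. The following is a minimal sketch, assuming each token already has a context-dependent vector and each candidate image has a vector in the same space. The three-dimensional numbers, the image names, and the helper `nearest_voken` are hypothetical and are not taken from the researchers’ vokenizer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_voken(token_vec, vokens):
    """Return the name of the image whose embedding is closest to the token's."""
    return max(vokens, key=lambda name: cosine(token_vec, vokens[name]))

# Hypothetical image embeddings, hand-picked for illustration only.
vokens = {
    "contact_icon": [0.9, 0.1, 0.0],
    "cat_being_stroked": [0.1, 0.8, 0.4],
}

# The same word "contact" gets a different vector in each sentence
# because the surrounding context differs (values are made up).
contact_in_first_sentence = [1.0, 0.2, 0.1]    # "Here is her contact."
contact_in_second_sentence = [0.2, 0.9, 0.3]   # "Some cats love human contact."

print(nearest_voken(contact_in_first_sentence, vokens))
print(nearest_voken(contact_in_second_sentence, vokens))
```

Because the lookup uses the contextual vector rather than the word itself, the two occurrences of “contact” can land on different images.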

The researchers used the visual and word embeddings they created with MS COCO to train their vokenizer algorithm. Once trained, the vokenizer was then able to find vokens for the tokens in English Wikipedia. It’s not perfect. The algorithm only found vokens for roughly 40% of the tokens. But that’s still 40% of a data set with nearly 3 billion words.

With this new data set, the researchers retrained a language model known as BERT, an open-source transformer developed by Google that predates GPT-3. They then tested the new and improved BERT on six different language comprehension tests, including SQuAD, the Stanford Question Answering Dataset, which asks models to answer reading comprehension questions about a series of articles, and SWAG, which tries to trip up models with subtleties of the English language to probe whether they’re merely mimicking and memorizing. The improved BERT performed better on all of them, which Wolf says is nothing to sneeze at.

The researchers, Hao Tan, a PhD student, and Mohit Bansal, his advisor, will be presenting their new vokenization technique in two weeks at the Conference on Empirical Methods in Natural Language Processing. While the work is still early, Wolf sees their work as an important conceptual breakthrough in getting unsupervised learning to work for visual-language models. It was a similar spark that helped dramatically advance natural-language processing back in the day.

“In NLP, we had this huge breakthrough over two years ago, and then suddenly NLP was a field where a lot of things were happening and it kind of got ahead of all the other AI fields,” he says. “But we have this problem of connecting text with other things. So it’s like this robot that is only able to talk but cannot see, cannot hear.”

“This paper is one example where they managed to connect it to another modality and it works better,” he says. “You could imagine that maybe some of these techniques could be reused when you want to leverage this really powerful language model in a robot. Maybe you use the same thing to connect the robot’s senses to text.”

NC Begins Smartphone Coronavirus Contact Tracing, But Will Enough People Use The App? [Prof. Samarjit Chakraborty interviewed]

November 6, 2020

WFAE | By Greg Barnes | North Carolina Health News
Published October 31, 2020 at 8:51 AM EDT

This story originally appeared in North Carolina Health News.

About a quarter of a million people in North Carolina have now downloaded a cell phone app that alerts them when they come into close contact with someone who has tested positive for the coronavirus.

The idea behind the app, launched by the N.C. Department of Health and Human Services on Sept. 22, is to help quickly track infections and slow the spread of COVID-19, which is increasing dramatically across the state and the country as the weather turns colder.

The app, called SlowCOVIDNC, relies on users to voluntarily and anonymously report a close contact with an infected person and to then get tested and self-quarantine if necessary. The app can be downloaded for free on iPhone or Android cell phones through the Apple App Store and the Google Play Store.

Since the app’s launch, DHHS has focused its use at the state’s colleges and universities and is now also seeking to involve more businesses, DHHS spokeswoman Kelly Haight Connor said in an email last week.

So far, Haight Connor wrote, the app has sent 346 exposure notifications, most of those beginning Oct. 3. The app has led 46 people to anonymously notify others of their positive COVID-19 test results.

It is not clear where those 346 exposure notifications came from or whether the bulk of people downloading the app are college students.

“I don’t think we have any demographic info about those who have downloaded the app since it’s completely anonymous,” Haight Connor said in another email.

Kimberly Powers, an associate professor at UNC’s Gillings School of Public Health, sees issues with the app, including whether people who use it will get tested and self-quarantine if necessary. But she says the benefits outweigh any issues.

“As with any single intervention against COVID-19, apps like these are unlikely to be, you know, a miracle or a silver bullet or a panacea, but, again, I think they offer an additional tool to help us combat the spread,” Powers said.

Bluetooth Reliability
Reliability of the app could also be an issue, said Samarjit Chakraborty, a professor in UNC’s Department of Computer Science.

“The contact tracing app should be used, and it will certainly help, so it is a good thing,” Chakraborty said. “The only pitfall is that we have to, all of us, and especially the doctors and policymakers, realize that these contact tracing apps are not 100% reliable.”

Chakraborty was among the authors of a report that studied why smartphone-based contact tracing could be unreliable, and what steps could be taken to improve its reliability.

In it, the researchers explain that contact tracing apps rely on a mechanism called Neighbor Discovery, which involves smartphones transmitting and scanning for Bluetooth signals to record their mutual presence whenever they are in close proximity.

“The hardware support and the software protocols used for ND in smartphones, however, were not designed for reliable contact tracing,” the researchers reported. “Even though their Bluetooth radios support the essential features necessary for contact tracing, tracing reliability will always be limited, potentially leading to false positives and/or missed contacts.”

Chakraborty explained that Bluetooth works by sending out beacons, searching to pair with another Bluetooth device in close proximity. It can take up to 5 minutes for two cell phones to pair, he said.

“You know, you have your phone and I have my phone and both our phones are in our pockets or in our bags, and we cross each other and the phones are supposed to register this and that registration process will not happen with 100% reliability,” he said.
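A toy timing model shows why a brief encounter can go unrecorded. In the sketch below, one phone sends a beacon at a fixed interval while the other listens only during a short window at the start of each scan cycle. All of the numbers and the function name `first_discovery_time` are assumptions made for illustration; the model ignores signal strength, interference, and any randomization a real radio applies.

```python
def first_discovery_time(adv_interval, adv_offset, scan_interval, scan_window, horizon):
    """Return the time (ms) of the first beacon that lands inside a scan window.

    Beacons are sent at adv_offset, adv_offset + adv_interval, and so on.
    The scanner listens during the first scan_window ms of every scan_interval.
    Returns None if no beacon is heard up to and including `horizon` ms.
    """
    t = adv_offset
    while t <= horizon:
        if t % scan_interval < scan_window:
            return t
        t += adv_interval
    return None

# Assumed timings in milliseconds, chosen only to illustrate the effect.
heard_at = first_discovery_time(
    adv_interval=270, adv_offset=400,
    scan_interval=5000, scan_window=300,
    horizon=60_000,
)
print(heard_at)  # 5260: the first beacon is heard a little over 5 seconds in

# If the two people pass each other in 3 seconds, nothing is recorded.
brief_encounter = first_discovery_time(270, 400, 5000, 300, horizon=3000)
print(brief_encounter)  # None
```

Under these assumed timings the phones need more than five seconds to notice each other, so a three-second pass in a hallway is missed entirely.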

Chakraborty’s paper concludes that an as-yet-undeveloped wearable device, such as a wristband, could eliminate many of the shortcomings of a smartphone app and be much more effective at contact tracing.

Chakraborty acknowledges that such a wearable device would face logistical hurdles, including speed of development and access to the masses. But he said it could be beneficial to use in places such as schools, where children share close spaces and might not be able to use smartphones.


Both private citizens and some computer science experts are nervous that the SlowCOVIDNC app is too invasive.

Privacy Concerns
Privacy concerns are the biggest reason people might not use the app.

Scientists in the United States and around the globe are taking the issue of anonymity — and privacy — extremely seriously. In April, hundreds of scientists from 28 countries signed a memo stating their privacy concerns with apps that use GPS as a tracking device.

“Research has demonstrated that solutions based on sharing geolocation (i.e., GPS) to discover contacts lack sufficient accuracy and also carry privacy risks because the GPS data is sent to a centralized location,” the memo states. “For this reason, Bluetooth-based solutions for automated contact tracing are strongly preferred when available.”

The researchers agreed that the apps “must only be used to support public health measures for the containment of COVID-19. The system must not be capable of collecting, processing, or transmitting any more data than what is necessary to achieve this purpose.”

Among those who signed the memo was Anupam Das, an assistant professor in N.C. State’s Department of Computer Science who specializes in privacy issues. Although Das said he has not analyzed the app that DHHS is using, he suspects it was developed by Google and Apple.

“If so, it should be fine,” Das said in an email.

The app uses Bluetooth technology based on the Exposure Notification System developed by Google and Apple. Many other states and countries are using the same system.

Regardless, people may be reluctant to download the app because of privacy concerns.



Model Gives Promising Data
At the University of Oxford, researchers developed a model that found that in Washington state, coronavirus infections could be reduced by 8% and deaths by 6% if just 15% of the population used the tracing smartphone app in addition to traditional contact tracing.

“Our models show we can stop the epidemic if approximately 60% of the population use the app, and even with lower numbers of app users, we still estimate a reduction in the number of coronavirus cases and deaths,” Oxford professor Christophe Fraser said in a university news release in September.

That may sound great on paper, but getting a large percentage of people to download the app could be a major challenge.

A survey conducted by Avira, a computer security software company, found that more than 71% of Americans who responded said they don’t plan to download a contact tracing app, mostly because of concerns over digital privacy.

This article first appeared on North Carolina Health News and is republished here under a Creative Commons license.

CS hackathons find new home online

October 30, 2020

When classes at UNC shifted fully online, so did events, such as corporate recruiting and annual hackathons. In October, UNC Computer Science student leaders and staff worked together to hold two popular hackathons, Carolina Data Challenge and HackNC, all-online for the first time.

A hackathon is a coding competition in which participants team up to develop brand new software projects. At the end of a short competition period, typically only 24 to 48 hours, the finished projects are presented to a panel of faculty and industry judges for prizes. For in-person events, the goal is to drive community and hands-on learning opportunities with social activities and skill development workshops. This fall, student leaders worked to recreate those events in a virtual environment – requiring creativity and determination to combat the fatigue associated with long periods on telecommunication platforms.

COVID calls for creative solutions

Carolina Data Challenge held its fourth annual 24-hour datathon on October 5-6. Participants worked on a dataset from either the financial, technology, or non-profit sector, and prizes were awarded to the teams who provided the best data visualization, most valuable insights, and best use of outside data, as well as to the top beginner team.

HackNC, North Carolina’s largest hackathon, was held on October 16-18. For its seventh annual event, HackNC organized its projects into four tracks: accessibility and inclusivity, education, healthcare, and sustainability, with an additional non-profit challenge.

The past seven months have demonstrated how teams can adapt to online work. The collaboration tools we use daily were creatively incorporated into both hackathons. For workshops, the team at HackNC coordinated a live stream on Twitch using several video conferencing tools, allowing students to access content synchronously or, through HackNC’s YouTube channel, asynchronously.

To foster community, Carolina Data Challenge created social events around the clock and launched a meme sharing competition, all accessible in real time via Discord.

While the department’s in-person hackathons typically draw participants from the East Coast and the southeast, the virtual editions of Carolina Data Challenge and HackNC saw participants from all over the United States and even other countries.

Project submissions reflect current times

The two hackathons brought together more than 1,300 students and mentors, with more than 100 unique projects submitted.

Projects drew inspiration from our current environment, with submissions ranging from COVID tracking to self-care apps. Carolina Data Challenge awarded winners in each submission category: finance, health & sciences, humanities, and pop culture. Winners were also selected for best data visualization and use of visual data tools.

LoganNehaLucySilas, winner of the health & sciences category, observed the relationship between the August Complex Fires, a group of 38 fires in California, and the levels of particulate matter smaller than 2.5 micrometers (PM2.5) in the San Francisco area. The team demonstrated a relationship between the fires and increased PM2.5 levels, as well as increased levels of carbon monoxide and black carbon, in the area from August 21-23 as the fires spread. The team developed a variety of data visualizations based on different hypotheses, examining the connection between wind direction, sensor proximity to the fires, and peak PM2.5 readings.

Drizzle, the first place hack at HackNC, produced a customized lo-fi hip-hop music creator that worked by combining a library of instrumental samples and machine learning algorithms. Lo-fi hip-hop music has become popular background music for studying, and the program creates a sample and customizes it further by using location data to match current time and weather forecasts to project corresponding images with the music. For optimal study conditions, the development team also added automatic reminders for users to look away from the screen every 20 minutes and to look 20 feet away for 20 seconds, implementing the 20/20/20 rule designed to reduce eye strain.

To redirect funding typically spent on a venue and food, HackNC supported donations to the winning hackers’ charity of choice. In total, $10,000 was raised and split among a variety of non-profits serving underserved and marginalized communities.

To see all projects, check out the Carolina Data Challenge site and HackNC DevPost page.

In addition to the support from the UNC Department of Computer Science, Carolina Data Challenge and HackNC were made possible by the following sponsors: CapTech, EY, NCSU’s Institute for Advanced Analytics, Metlife, NCDS, RENCI, SAS, Visual Data Tools, Credit Suisse, Capital One, Genesys, John Deere, Postman, Square, IQVIA, Optum, CoStar, Lionode, Millennium Advisors, Vanguard, and Deutsche Bank.

Looking forward to more virtual hackathons

With the findings and best practices from these events, student leaders are collaborating on more upcoming virtual hackathons. December 2020 will bring the inaugural queer_hack, a hackathon serving the LGBTQ+ community, and Pearl Hacks, one of the nation’s longest-running hackathons for women and non-binary students, will return for its eighth event in February 2021.

Open Course: Extending COMP 110 beyond Carolina

October 30, 2020

Each year, over 1,000 students in COMP 110: Introduction to Programming are introduced to computer science by Teaching Professor and UNC CS alumnus Kris Jordan and his team of 45 enthusiastic Undergraduate Teaching Assistants (UTAs). Jordan and his dedicated UTA team have driven COMP 110 to become one of the most popular courses at UNC.

With the move to remote learning, Jordan quickly pivoted coursework to allow for streaming of all lectures, as well as make them freely and publicly available. Now, anyone wishing for a primer on introductory programming can access pre-recorded lessons, slides, and hands-on lab tutorials by subscribing to Jordan’s YouTube channel.

“As a student in western North Carolina, I did not have access to any programming courses at my high school. The same is true for many of my LAs who help me run the course. My hope with offering these learning experiences more broadly to the state of North Carolina is that it might spark an early interest,” stated Jordan.

Increased access to programming curriculum has been a long-time goal for Jordan.

“A small silver lining in the move to remote learning due to the COVID-19 pandemic,” Jordan said, “is that it provided the impetus to make this happen.”

Guided by strategies to support increased access and opportunity in computer science, Jordan is committed to finding ways for all interested students to find their place in tech. In this truncated fall semester alone, visitors to Jordan’s YouTube channel have spent more than 10,000 hours viewing his instructional videos. For more information about the course and to access the online materials, check out Jordan’s dedicated COMP 110 site.

Islam, Embedded Intelligence Lab develop for a future with fewer batteries

October 29, 2020
Members of the Embedded Intelligence Lab work in Sitterson Hall (Jon Gardiner/UNC-Chapel Hill)

Continuous health monitoring is the future of healthcare, and wearable technology is helping to lead the charge by collecting useful baseline data between check-ups and detecting changes that are early signs of illness. Wearable devices can even help detect signs that would be missed in traditional health screenings.

Similarly, the Internet of Things (IoT) has enabled us to increase efficiency and monitor the world using always-on devices in our homes, our places of work, and even in remote areas of the world. We use IoT devices to detect pollutants in our air and water, lower our energy consumption in buildings, and act as personal assistants.

Unfortunately, the current trend in embedded and wearable systems is unsustainable. Nearly a decade ago, tech experts predicted that the world would reach 1 trillion connected devices in the coming years. If each of those trillion devices has a battery that lasts 10 years, we would need to replace nearly 274 million batteries every day.
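The daily replacement figure follows from simple division. The sketch below reproduces it, assuming replacements are spread evenly over time and a year is counted as 365 days; both simplifications are ours, while the device count and battery lifetime come from the text.

```python
devices = 1_000_000_000_000   # 1 trillion connected devices (figure from the text)
battery_life_years = 10       # battery lifetime used in the text
days_per_year = 365           # assumption: leap days ignored

# If every battery is replaced once per lifetime, and replacements are
# spread evenly, this many batteries need swapping each day.
replacements_per_day = devices / (battery_life_years * days_per_year)
print(f"{replacements_per_day:,.0f} batteries per day")  # roughly 274 million
```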

Lithium-ion batteries are used in portable electronic devices, electric vehicles, and even in aerospace applications due to their high energy density and long discharge cycles. Only a small percentage of these batteries are recycled correctly, however, and most of the metals and other valuable materials that can be harvested from used batteries end up in landfills. It is estimated that the world will produce 2 million metric tons of used lithium batteries per year by 2030, and most of that waste will likely not be recycled. Furthermore, experts predict an impending shortage of lithium by the mid-2020s, so alternative materials and methods for storing energy and powering devices will be necessary. And those alternatives will be needed soon.

One obvious solution to the problem of batteries is to build devices without them, but the concept is much simpler than the implementation. The Embedded Intelligence (EI) Lab in the Department of Computer Science designs and programs batteryless systems that are optimized for low-power operation. These devices can power themselves by harvesting energy from changes in light or temperature, from vibrations, and even from radio-frequency (RF) microwaves. The issue with these systems, though, is that power can be sporadic. A solar powered device, for example, must be able to operate as intended through long, daily periods of darkness as well as less predictable intermittent periods of cloud cover. Lower power in a device means that processing takes longer. More compute-intensive tasks have to wait to run until sufficient power can be harvested. During periods where energy is scarce, tasks may not be able to run at all.

Undaunted by these limitations, doctoral student Bashima Islam and her advisor and EI Lab director Shahriar Nirjon have developed task scheduling frameworks to enable tasks to be run effectively on batteryless systems. Additionally, these frameworks have been optimized in order to operate within defined time constraints, making their implementation consistent and predictable.

Bashima Islam

The first scheduling algorithm, Celebi, balances the trade-off between mutually exclusive cycles of power charging and computation in batteryless systems. Because these systems are unable to charge while executing computational tasks, the time needed to complete a job is a function of both the time needed to compute each task and the time needed to harvest enough energy to power the device through the computations. Harvesting more energy than is necessary adds to the runtime. Celebi focuses on both sets of constraints to maximize efficiency by determining exactly how much energy is necessary for each task and optimizing the schedule to accommodate as many tasks as possible in a given time frame. In testing, the online version of Celebi was able to schedule between 8 and 22 percent more jobs than existing algorithms.
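The charge-then-compute trade-off can be illustrated with a small model. This is a minimal sketch and not the Celebi algorithm itself: it assumes a constant harvesting rate, charges exactly the energy each task needs, and packs tasks shortest-first into a time budget. The task names, energy figures, and the functions `task_duration` and `schedule` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    compute_s: float   # seconds of computation
    energy_mj: float   # millijoules needed to run to completion

def task_duration(task, harvest_mw):
    """Charge exactly the energy the task needs, then compute."""
    charge_s = task.energy_mj / harvest_mw   # mJ / mW = seconds
    return charge_s + task.compute_s

def schedule(tasks, harvest_mw, time_budget_s):
    """Greedy shortest-first packing of tasks into the time budget."""
    chosen, elapsed = [], 0.0
    for task in sorted(tasks, key=lambda t: task_duration(t, harvest_mw)):
        duration = task_duration(task, harvest_mw)
        if elapsed + duration > time_budget_s:
            break
        chosen.append(task.name)
        elapsed += duration
    return chosen, elapsed

# Made-up workload: (name, compute seconds, millijoules required).
tasks = [
    Task("classify", 2.0, 12.0),
    Task("sense", 0.5, 2.0),
    Task("transmit", 1.0, 6.0),
]

chosen, elapsed = schedule(tasks, harvest_mw=2.0, time_budget_s=10.0)
print(chosen, elapsed)
```

With a 2 mW harvester, most of each task’s duration is spent charging rather than computing, which is why harvesting more energy than a task needs directly reduces how many tasks fit in the window.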

The second scheduling algorithm, Zygarde, focuses on the computational demand of deep neural networks (DNNs) on a microcontroller in an embedded system. Monitoring systems like security cameras, toxin detectors, and voice assistant devices are an ideal implementation of batteryless systems. Unfortunately, running video and audio recognition tasks requires a relatively large amount of energy on these devices, and getting meaningful results with unpredictable energy availability can be complicated. When given a deep learning job to execute, Zygarde simplifies the job by determining the minimum set of tasks that need to run in order to make an accurate inference. Zygarde prioritizes those tasks to ensure that the mandatory tasks will finish on time in the event of an energy shortage. After prioritizing the mandatory tasks, the optional tasks are executed to improve the accuracy of the inference as time allows. Sacrificing a small amount of processing time and accuracy can make a large difference in the runtime of a machine learning task through intermittent power.
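The split between mandatory and optional work can be sketched as follows. This is an illustration of the general idea only, not Zygarde’s actual scheduler: it assumes each unit of a job has a fixed energy and time cost, that optional units build on one another in order, and that budgets are known up front. The unit names, costs, and the function `run_job` are hypothetical.

```python
def run_job(units, energy_budget, time_budget):
    """Run mandatory units first, then optional ones while budgets allow.

    Each unit is a tuple (name, mandatory, energy_cost, time_cost).
    Returns (names of executed units, True if every mandatory unit ran).
    """
    executed = []
    # Stable sort: mandatory units (key False) come before optional ones (key True).
    ordered = sorted(units, key=lambda unit: not unit[1])
    for name, mandatory, energy_cost, time_cost in ordered:
        if energy_cost > energy_budget or time_cost > time_budget:
            if mandatory:
                return executed, False   # a mandatory unit could not run
            break                        # later optional units depend on this one
        energy_budget -= energy_cost
        time_budget -= time_cost
        executed.append(name)
    return executed, True

# Made-up job: two mandatory layers, two optional refinement layers.
units = [
    ("layer1", True, 3, 2),
    ("layer2", True, 3, 2),
    ("layer3", False, 4, 2),
    ("layer4", False, 4, 2),
]

print(run_job(units, energy_budget=11, time_budget=10))
print(run_job(units, energy_budget=5, time_budget=10))
```

In the first call the job finishes its mandatory layers and one optional layer before energy runs out; in the second, energy is too scarce even for the mandatory work, which the caller can detect from the returned flag.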

Islam is excited about the range of applications for her work. IBM’s project Rhino, for example, monitors a herd of impalas as an early warning for rhino poachers. Motion detectors on the impalas could be powered by kinetic energy as the animals move, and Celebi would ensure that energy harvest is sufficient to keep the sensors active. Zygarde would optimize the systems to alert the rangers to a poacher threat as quickly as possible with minimal trade-off in accuracy. The frameworks could also be useful in continuous monitoring of industrial machinery and HVAC systems, enabling preventive maintenance that minimizes unplanned downtime and costly repairs. There are numerous other applications, including methane gas monitoring in underground mines and temperature and humidity monitoring in warehouses.

The pioneering work of the EI Lab will hopefully reduce our reliance on batteries in embedded and wearable sensor systems. In addition to task scheduling research, the group has projects related to low power communication, sustainable energy harvesting, low power recognition optimization, in-home assistive healthcare, and more.

Islam’s research prompted her selection to UC Berkeley’s Rising Stars Workshop 2020, a highly selective academic career workshop for women in computer science, computer engineering, and electrical engineering. In recognition of her work, Islam was named a finalist for the Gaetano Borriello Outstanding Student Award at the ACM International Joint Conference on Pervasive and Ubiquitous Computing and International Symposium on Wearable Computers (UbiComp/ISWC) 2020. More information about Islam’s research can be found at

With help from open source community, Graphene grows from prototype to product

October 27, 2020

A guiding research principle of the Department of Computer Science is the idea, put forth by founder Frederick P. Brooks, Jr., that a computer scientist is a toolsmith, creating resources that enhance the work of others. Sometimes, researchers are surprised by the usefulness of the tools that they build. Associate Professor Donald Porter has seen a research project grow far beyond the parameters of its initial paper.

In 2014, Porter, then a faculty member at Stony Brook University, led the effort to build Graphene, a Linux-compatible library operating system (sometimes referred to as a “unikernel”) that seamlessly and efficiently executed both single and multi-process applications, all while ensuring security isolation of distrusting applications on the same host system. Recent developments in library OS research had demonstrated orders-of-magnitude reduction of memory requirements when compared to running the same single-process applications on an OS kernel in a virtual machine. But prior to Graphene, those security and efficiency benefits had not been extended to multi-process applications, meaning that the library OS framework was unable to execute many commonly used tasks. Graphene addressed this problem by presenting multiple collaborative library OS instances that appeared as a single shared OS.

At its core, Graphene is a project about portability. The goal was to effectively re-deploy software from an older, less secure system onto another more secure and efficient system, while avoiding or limiting compromises that negate the benefits of the new system. This is especially valuable in commercial cloud computing environments, where hardware is shared by multiple unrelated clients, and potential security risks include side channel attacks from other tenants and unwanted access or hardware damage from cloud provider employees. Furthermore, some environments utilize security protocols that require code modification for applications to run.

At the time, Graphene was a research prototype that lacked the robustness needed for widespread adoption. But the project showed enough promise to attract attention from Intel, which wanted to use it with CPUs running Intel’s new Software Guard Extensions (SGX). Intel SGX hardware enables an application to protect itself from a malicious OS or cloud hypervisor by creating secure enclaves within RAM that are invisible to both the user and a potential attacker, but many potential users were discouraged by the perception that Linux code would need to be heavily modified in order to run efficiently on SGX. With help from Intel, Porter and his collaborator Chia-Che Tsai, then a doctoral student and now an assistant professor at Texas A&M University, ported Graphene to SGX, creating a version of the OS that would enhance the security benefits of SGX and allow unmodified Linux code to run on it with minimal performance overhead.

Graphene started as a research prototype in Porter’s group and was maintained for some time by a two-person team in their spare time. Since being published for SGX, it has rapidly expanded to cover new uses and features, and it has grown to the point that Intel, Golem, and Invisible Things Lab have dedicated resources and developer effort to turning the project into production-quality software. In 2018, a working group was formed to organize the development and build a contributors’ community. The developers are currently working on integrations for platform-as-a-service products like Docker and languages like Go and Java, as well as support for Microsoft Windows as a host. Thanks to support from the broader open source community, Porter envisions Graphene being ready for mainstream use as early as this year.

Alumni Profile: Cassidy Soutter (B.S. 2020)

October 26, 2020

When UNC moved to remote instruction in March 2020 due to the COVID-19 pandemic, so much changed for the Class of 2020 and their peers. Today, as we continue to navigate this unprecedented situation and try to recreate the campus experience, while remaining committed to quality teaching and research and providing opportunity for students, our UNC Computer Science alumni are rising to the challenge. For Cassidy Soutter (B.S. 2020), her commitment to supporting others and bettering society grounded her Carolina experience, and thus prepared her for today’s challenges. She’s found ways to continue to give and support her UNC CS community, while also starting a career that aligns with her experiences and beliefs as a Forward Deployed Engineer at Palantir.

In late spring, bearing witness to the consequences of COVID-19 on society, including the grief, loss, and impacts on mental health, Soutter searched for ways to create opportunity for her fellow UNC CS students. Specifically, Soutter hoped to help those students who had lost an internship or job opportunity. The result was UNC CS Summer of Code, a program for UNC CS students whose internships had been canceled due to COVID-19. Soutter connected with UNC CS staff and then led the program, which paired approximately 50 students with nine community partners who needed custom technology solutions, many of which stemmed from adapting to COVID-19. Additionally, 15 alumni and faculty mentors participated, answering students’ questions and guiding them through the process of creating a contract with the client, designing a product, and then developing it. The six-week program culminated with a day of presentations to showcase the work of all of the teams.

Reflecting on the program and process, Soutter said, “It’s incredibly difficult to make a one-size-fits-all solution for students of all different backgrounds, but we had a successful first launch and I look forward to improving the program for upcoming summers.”

The technology developed by students had immediate impact and still serves the clients well today. For students, the opportunity to participate in Summer of Code may not have been how they imagined spending their summers, but it still provided growth and skill development, with many sharing that this experience filled the gap created by the cancellation of their internship.

How did Soutter arrive at leading efforts to support the Summer of Code program? As a North Carolina native, Soutter’s choice to attend UNC was a no-brainer. Initially, Soutter had her sights set on a future as an opera singer, singing in the world’s most famous opera houses. After arriving at Carolina, she realized that her calling was in a much different place: computer science. Having a family member working in tech influenced her change in course, but a larger influence was the recognition of the impact one could make by working in the tech field.

“Since I was little, I have felt that my purpose in life is to make the world a better place and uplift those around me. Seeing how dependent our society is and continues to be on technology led me to understand what an expansive impact I could have by bringing my perspective to the problems that tech tries to tackle. Technology empowers us to reach parts of the world and the universe that we never thought imaginable,” Soutter said.

While at UNC, Soutter was involved in a myriad of opportunities that helped her to understand that contributing to work that resulted in a positive impact on society was crucial for her moving forward. Soutter was one of the founding members of the student organization CS+Social Good and later served as president for two years. This and other opportunities solidified what was important to her.

“Because technology is so powerful, it is our responsibility to treat it delicately, considering implications of our work and protecting the privacy of the people,” she said, reflecting on what she valued in her career search.

Currently, as a Forward Deployed Engineer at Palantir in New York City, Soutter works in the US government space, helping some of the world’s largest organizations see how their data from many different sources is connected.

For Soutter, Palantir could not align any better with her goal of making a positive impact on society: “Every day, I get to practice my skills while helping people and growing into the technologist that I want to be.”

Moving forward, Soutter hopes to continue to stay connected to the department through mentorship opportunities, such as her current role with CS+Social Good, where she is serving as liaison between the student organization and Rewriting the Code, a non-profit whose mission is to support women in technology. Soutter was involved with Rewriting the Code as an undergraduate student and currently serves on the Alumni Board. She also hopes to continue her work with Summer of Code, providing more opportunities for students in the upcoming summer.

“The professors encouraged me, the staff of the department joined me in my efforts, and I consistently felt supported to raise the bar for what I felt I was capable of. No one gets to where we are alone. I gained so much from my time at UNC that it’s my turn to give back and continue trying to make an impact,” Soutter shared.

The UNC CS community is made better every day through the work and dedication of our alumni – we recognize that this is one of many stories of our alumni making an impact during this challenging time. If you are interested in connecting with the department to share about your current work or experiences, please contact Erin Lane at

Data science builds bridges

October 19, 2020

Collaborative data science enhances discoveries by Carolina’s computer scientists, biostatisticians and health care professionals.

Scott Jared, The Well, Monday, October 19th, 2020

Carolina’s faculty, researchers and scientists continue to tear down silos and build bridges by sharing data.

Silos from years past have rapidly disappeared from the University’s academic landscape over the past decade. Increasingly, through interdisciplinary collaborations, researchers across campus have used data science to enhance each other’s discoveries and help people in their everyday lives.

“Data science is the most important work we can do,” said Vice Chancellor for Research Dr. Terry Magnuson. “Our researchers and faculty are generating so much data, and there’s a need for expertise to analyze that data. That includes artificial intelligence and the ability to use data to come to conclusions and generate hypotheses.”

Terry Magnuson

Many of Carolina’s natural scientists, social scientists and humanists are finding new ways to use data in the digital age, and various committees are participating in formulating what a data science initiative at Carolina should be. “That effort relies on people who develop new algorithms and new ways of analyzing data and people who apply data science to solve problems,” Magnuson said.

By generating foundational data, large-scale access to data and funding opportunities, Magnuson said, the outcome will be convergent research, collaboration and innovation aimed at solving the world’s major problems.

“Enriches the whole process and our findings”

At the Carolina School of Nursing, Assistant Professor Saif Khairat’s data science collaborations are, among many things, helping to deliver health care in rural North Carolina and to assess the benefits of telehealth in prisons. For Khairat, collaboration and data science go together.

“Multi or interdisciplinary data science work is where the world is going. No longer will a group of computer scientists or a group of clinicians or a group of business-minded folks do a project on their own,” said Khairat, who also is a core faculty member at the highly interdisciplinary Carolina Health Informatics Program. “I do this day in, day out, so I need representatives from every discipline to participate. That enriches the whole process and our findings.”

Khairat was one of about 100 faculty, staff and students who were part of feasibility planning for a new Data Science @ Carolina initiative, a pan-University committee charged with charting the pedagogy, practice and application of data science at UNC-Chapel Hill.

“When the Data Science Initiative subcommittees met, we had people from the law school, school of medicine, nursing, liberal arts, and we had such an enriching conversation. I defined data science in a way that is so different from someone in sociology or journalism. Some of their definitions were completely different than mine and had never crossed my mind, and I was like, ‘Yeah, that makes a lot of sense.’”

The bridges of interdisciplinary data science cross to all corners of campus. Here’s a look at a few.

AURA: the virtual nurse

In Carolina’s Computer Science Department, Assistant Professor Shahriar Nirjon asked: “After surgery, how could patients replace much or all of post-operative, in-home care with a device that reminds them to take medicine, to eat or drink appropriately or to perform other activities prescribed by physicians?”

The answer may be wireless fidelity waves commonly known as WiFi.

Shahriar Nirjon’s group is working with others, including School of Nursing faculty, to develop a smart device that learns a surgery patient’s movements and could replace some in-person post-surgery care by reminding the patient to drink more, change a dressing, etc. In this photo, Tamzeed Islam reviews a video clip of the testing. (Jon Gardiner/UNC-Chapel Hill)

Nirjon and his collaborators have partnered on an NIH-funded project called AURA (Connecting Audio and Radio Sensing Systems). They will create an in-home system in which voice-assistant devices such as Amazon Echo and Google Home simultaneously collect and process WiFi data to help patients improve their care, decrease costs and relieve caregiving families.

The team includes fellow computer science Assistant Professor Mohit Bansal and an in-home testing team directed by School of Nursing Associate Professor Lixin “Lee” Song. Song is also a Lineberger Cancer Center fellow with expertise in health care delivery through mobile devices, symptom management and family-based cancer research.

Shahriar Nirjon

Lixin Song

AURA’s foundation is wireless signals, which we constantly interrupt through walking, talking, eating and daily activities, said Nirjon. The interruptions take unique shapes that Nirjon’s team captures as data and catalogs with a unique identifier in a database.

As a patient moves around at home, a router transmits identifiers to a voice-assistant device programmed to recognize them. The device then tells the patient at appropriate times to do something like drink more, take medicine, change a dressing or move around more. The router also senses and records radio frequencies of the patient’s movements to personalize and refine the database based on rules set by doctors and nurses.
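The matching step described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the three-number "signatures", the activity names, the time limits and the reminder wording are invented, and a simple nearest-neighbor lookup stands in for whatever learning algorithms the AURA team actually uses.

```python
import math

# Hypothetical catalog: each activity identifier maps to a reference signature.
# The numbers are made up and stand in for measured WiFi signal disturbances.
CATALOG = {
    "drinking": [0.9, 0.1, 0.2],
    "walking": [0.2, 0.8, 0.7],
    "resting": [0.1, 0.1, 0.1],
}

# Hypothetical care rules: (hours allowed without the activity, reminder text).
RULES = {
    "drinking": (2.0, "Please drink some water."),
    "walking": (4.0, "Please move around for a few minutes."),
}


def identify(sample):
    """Return the cataloged identifier whose signature is closest to the sample."""
    return min(CATALOG, key=lambda name: math.dist(CATALOG[name], sample))


def due_reminders(hours_since_last_seen):
    """List the reminders for activities not observed within their time limit."""
    return [
        message
        for activity, (limit, message) in RULES.items()
        if hours_since_last_seen.get(activity, float("inf")) > limit
    ]


observed = identify([0.85, 0.15, 0.25])
reminders = due_reminders({"drinking": 3.0, "walking": 1.0})
```

In this toy run the sample is closest to the "drinking" signature, and the only reminder due is the one for drinking, because three hours exceeds its two-hour limit.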

“The idea is to make embedded-systems intelligence that runs learning algorithms. These sensing systems would be capable of learning, adapting and evolving,” Nirjon said.

The virtual nurse can also store a patient’s health records for the entire care team’s use.

Song’s team will transition the work from Nirjon’s lab in Brooks Hall to tests with patients and their caregivers.

“With health care costs increasing and technologies rapidly developing, patients go home earlier and sicker,” Song said. “A lot of caregiving burdens fall on family members who often are not well prepared for the tasks that professionals have been providing. By developing AURA as an inexpensive tool, we hope to better triage patients based on the severity of their symptoms and complications, provide support from different sources and improve the quality of lives of cancer patients and their caregivers as well as reduce emergency room use and readmission.”

AURA’s interdisciplinary aspect shows in Song’s team, which includes Dr. Matthew Nielsen, a genitourinary cancer surgeon at UNC Health; biostatistician Xianming Tan at the Gillings School of Global Public Health; Rebecca S. McElyea, a UNC Health wound, ostomy and continence nurse and part-time researcher at the School of Nursing; and Shenmeng Xu, a 2019 doctoral graduate of the School of Information and Library Science and nursing intern postdoc.

The AURA team hopes to eliminate the need for patients to wear multiple monitors for, say, oxygen levels or air quality or heart rate. Such devices, Nirjon said, run advanced machine-learning algorithms that send data on a delayed round trip to a server for processing then back to the patient. “Even if it is a few milliseconds, it’s still a significant delay sometimes.”

Nirjon foresees a real-time system that erases those milliseconds, predicts a heart attack or asthma attack then releases a medication to stop it.

In Carolina’s computer science department, Shahriar Nirjon’s lab members (l-r) Yubo Luo, Bashima Islam, Shiwei Fang and Tamzeed Islam are developing a smart device that learns a surgery patient’s movements and could replace some in-person post-surgery care. (Jon Gardiner/UNC-Chapel Hill)

Safer medication prescriptions for veterans and the elderly

At the Eshelman School of Pharmacy, Associate Professor Carolyn Thorpe has been mining health care data gold for most of the past decade. She looks at data to find the best use of medications for geriatric populations, especially those with complex and chronic conditions.

Beginning in 2012, in collaboration with her spouse, Joshua Thorpe, also an associate professor of pharmacy at UNC-Chapel Hill, Thorpe was part of one of the first research teams to analyze data from a national Veterans Administration (VA) medical record database that includes Medicare Part D prescription drug records. Their study showed that obtaining medications from multiple health systems can lead to the prescribing of too many medicines for a single participant and more than double the odds of exposure to a potentially unsafe medication, which might cause adverse effects and harmful interactions.

Joshua Niznik

Carolyn Thorpe

Interdisciplinary data analysis is key to other Thorpe-led studies. With Joshua Niznik, assistant professor in the UNC School of Medicine and Eshelman School of Pharmacy, Thorpe and her colleagues are looking at the effects of discontinuing or decreasing the intensity of cholesterol, blood pressure and diabetes medications for patients who are approaching the end of life. For instance, does decreasing medications alter health outcomes such as being hospitalized for a heart attack?

“Our study focuses on nursing home residents who have multiple chronic conditions, many of whom are really frail and only have a couple of months or maybe a year or two of life left,” said Thorpe. “Many have advanced dementia and we’re interested in seeing how their chronic diseases are being managed and whether or not we can back off on some of their medications without causing harm.”

The study includes veterans throughout America and links data from the comprehensive physical, psychological and functional health assessment done when patients are admitted to a nursing home to their medication record.

In another study of Medicare beneficiaries in nursing homes, Niznik and Thorpe used datasets to examine the effects of discontinuing anti-dementia medications in patients who had already progressed to advanced dementia and might not benefit from continued treatment. They found that discontinuing these medications was associated with a reduced risk of serious falls and did not increase the overall risk of negative events, such as hospitalizations or emergency room visits. They also determined that behavioral symptoms of dementia did not increase.

Delivering health care to North Carolina’s rural and low-income areas

Exhibit A for informatics expert Khairat’s passion for increasing health care access for vulnerable populations? The 5,000 patients in rural, low-income areas of North Carolina who received medical care through the new UNC Virtual Urgent Care service that he’s helping to perfect through data analysis.

With a background in computer science, health policy management and health informatics, Khairat has worked in data science and informatics with the schools of medicine, nursing, and information and library science, as well as the computer science department.

Saif Khairat

Now, he’s working with doctors and nurses at the UNC Health Virtual Care Center to evaluate the virtual service, which offers patients an on-demand talk with a physician either by phone or online.

Among the data collected from the first group of patients, the top diagnoses showed that people with urinary tract infections, ear pain and sinus infections tended to prefer a phone consultation to an online one. Males were also more likely to prefer phone to online. The data show that the service saves people time traveling, time in waiting rooms and money.

But most interesting to Khairat was what he calls “reachability,” meaning the urgent care team provided care to people in parts of the state where UNC Health does not have physical presence.

About 5,000 people used the service after UNC Health promoted it in rural areas at the state’s four compass points, using billboards and a targeted push to 12,000 people, among other methods.

“We used data science methods like geospatial analysis to run zip-code-level analysis,” Khairat said. “Who are these people? What are they known for? We found some zip codes were under the poverty line or had no access to health care or no Internet broadband connection.”

Khairat also looked at how the virtual care affected health disparities in areas with high concentrations of American Indians, African Americans and mothers with children who receive food stamps. “These are really vulnerable populations in rural areas, people that typically don’t have any access,” Khairat said.

About 75% of participants chose a phone call for the virtual consultation. “Some people said things like ‘I have to drive up to the bridge just to get a cell signal, let alone a video feed.’ That tells you that maybe they’re not equipped with an Internet connection or even a video camera,” Khairat said.

A follow-up survey included the question: What would you have done if this service was not available to you? “Delay my care” was the number one answer, topping the options of visiting an emergency room, urgent care or primary physician’s office.

“That makes us happy to know that this way of offering care helps and we will keep improving it,” Khairat said.

How scientists secure the data driving autism research [Martin Styner was interviewed for the story]

October 1, 2020

The box came from SPARK, the largest genetic study of autism to date. To participate, Maya will have to ship the family’s samples back to a DNA testing lab in Wisconsin. But she keeps wavering.

On the one hand, Maya applauds SPARK’s mission to speed autism research by collecting genetic data from more than 50,000 families affected by the condition. (SPARK is funded by the Simons Foundation, Spectrum’s parent organization.) She hopes the effort might lead to better means of early diagnosis and treatment. Mark did not know until college that he has autism; by contrast, their children, diagnosed at 23 and 32 months, benefitted from early therapy.

But Maya also worries about giving her family’s DNA and health information to a third party. When she was in graduate school, she was initially denied a job after a prospective employer found an article about her having Marfan syndrome, a genetic condition that affects connective tissue.

The SPARK data are stripped of identifiers, such as a person’s name and birth date. And with rare exceptions, none of the DNA data are shared without a participant’s consent. But Maya questions how well those protections work. Could unauthorized individuals get access to the data and find a way to identify her and her family? Could that affect her children’s future? Most autism research databases allow participants to later withdraw their data. But if those data have already been used in a study, they generally cannot be extracted because doing so could change the study’s results, experts say.

“I want to be really sure that the data will be anonymous,” Maya says. “I don’t want my decisions now to affect my child’s employability 10, 20 years from now.”

Maya is not alone in her unease. Many families who are enthusiastic about participating in autism research also fear that their personal health information could leak out online or get into the wrong hands, exposing them to stigma or discrimination. Their concern is not entirely unjustified: Privacy laws in the United States do nothing to stop a small employer or life-insurance company from discriminating against someone based on their genetic information. And even when data are anonymized, scientists have shown how hackers can match names to genomes and brain scans stored in databases.

But sharing data with a research institution is less risky than sharing them with healthcare providers or with many commercial genetic-testing companies, experts say. Research databases have more safeguards in place, such as data encryption and restricting data access to trusted researchers — measures that have largely dissuaded hackers so far. “Researchers are definitely the best and direct-to-consumer companies in general are definitely the worst, because there are dozens of these companies, and many either don’t have a privacy policy or don’t follow it,” says Mark Rothstein, director of the Institute for Bioethics, Health Policy and Law at the University of Louisville in Kentucky.

No matter where DNA or brain-imaging data go, they are never completely secure — sticking people like Maya with a difficult decision. For now, most participants should feel reassured. “If the scientific databases are properly protected, the risk of data theft is relatively low,” says Jean-Pierre Hubaux, who heads the data-security laboratory at Ecole Polytechnique Fédérale de Lausanne in Switzerland. But researchers need to stay ahead of that curve if they want to preserve their study participants’ trust.

Identity crisis:

Autism research increasingly relies on big data, and as more studies share data, some privacy concerns only become more pressing. Larger databases potentially make for bigger targets, especially in combination with digital information that is publicly available.

The MSSNG project, run jointly by four groups, including the advocacy group Autism Speaks and Verily (formerly Google Life Sciences), has sequenced more than 10,000 whole genomes of autistic people and their family members. The National Database for Autism Research at the U.S. National Institutes of Health (NIH) stores information about more than 100,000 autistic people and their relatives, including sequences of their exomes (protein-coding regions of the genome), brain scans and behavioral profiles. The Simons Simplex Collection contains whole genomes from 2,600 trios, or families with one autistic child. And as of late 2019, SPARK — the study Maya may participate in — had exome sequences and genotyping data for more than 27,000 participants, 5,279 of them with autism. The study also has health, trait and behavioral data for more than 150,000 people, 59,000 of them on the spectrum.

Other servers house collections of brain scans. The Autism Brain Imaging Data Exchange (ABIDE), for example, pairs brain scans with clinical data from more than 1,000 autistic people and a similar number of controls. From 2012 to 2018, a project called EU-AIMS collected brain scans and whole-genome sequences from 450 people with autism and 300 ‘baby sibs’ — younger siblings of people with autism, who have elevated odds of being diagnosed with the condition themselves.

All participants in these research projects sign documents that outline how their data will be collected, de-identified and shared. This ‘informed consent’ process is supposed to let them weigh privacy and other risks before they sign up, and it is required by law in the U.S. and most other places. But these documents can be difficult to parse. “Even if you’re very well educated, [the language] is still probably not as clear as it could be,” says Kevin Pelphrey, a neuroscientist and autism researcher at the University of Virginia in Charlottesville.

Informed-consent documents also don’t provide the complete picture. For example, most studies specify that the data will be stripped of identifying information such as names, birth dates and cities of birth. Studies routinely replace those facts with alphanumeric codes, such as global unique identifiers. The codes provide an anonymous way to track individuals across studies, but they don’t make data secure. In fact, as the amount of digital data for each person grows, it becomes easier for outsiders to piece together a person’s identity and health background from different sources.
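A minimal sketch of that coding step shows both what it achieves and what it leaves behind. The field names, the sample record and the use of random UUIDs are assumptions for illustration; real studies may derive their identifiers differently.

```python
import uuid

# Assumed list of directly identifying fields; real studies define their own.
IDENTIFYING_FIELDS = ("name", "birth_date", "birth_city")


def pseudonymize(records):
    """Replace identifying fields with a random code, reusing one code per person."""
    codes = {}
    cleaned_records = []
    for record in records:
        person_key = tuple(record.get(field) for field in IDENTIFYING_FIELDS)
        if person_key not in codes:
            codes[person_key] = str(uuid.uuid4())
        cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        cleaned["participant_id"] = codes[person_key]
        cleaned_records.append(cleaned)
    return cleaned_records


# Made-up example: the same person appears in two studies.
raw = [
    {"name": "Pat Example", "birth_date": "2001-02-03", "birth_city": "Durham",
     "zip": "27514", "study": "genetics"},
    {"name": "Pat Example", "birth_date": "2001-02-03", "birth_city": "Durham",
     "zip": "27514", "study": "imaging"},
]
coded = pseudonymize(raw)
```

Both output records carry the same `participant_id`, so the person can be tracked across studies without a name. Fields such as `zip` pass through untouched, though, which is the kind of leftover detail an outsider could combine with other sources.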

Someone who has access to a person’s genome from one source can readily determine if that genome is present in another database, researchers showed in 2008. The team used genetic markers called single-nucleotide polymorphisms (SNPs) as benchmarks. They compared how often thousands of SNPs appear in a person’s genome with how often those same SNPs appear in both the database and in a population with similar ancestry. If the frequencies in the person’s genome are closer to those in the database than to those in the reference population, the person’s genome is likely to be in the database. If the database centers on a particular condition, the identified individual would be associated with that condition.
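The comparison can be sketched in a few lines. This is a simplified illustration rather than the researchers' actual statistical test: it uses five made-up SNPs instead of thousands, and the function name and all frequencies are hypothetical. For each SNP it asks whether the person's allele frequency (0, 0.5 or 1) sits closer to the database's frequency or to the reference population's, then sums the differences.

```python
def membership_score(person, database_freq, reference_freq):
    """Sum, over SNPs, of |person - reference| - |person - database|.

    A positive total means the person looks more like the database than like
    the reference population, which hints that the genome may be in it.
    """
    return sum(
        abs(p - r) - abs(p - d)
        for p, d, r in zip(person, database_freq, reference_freq)
    )


# Made-up allele frequencies for five SNPs.
database_freq = [0.8, 0.5, 0.1, 0.9, 0.6]
reference_freq = [0.5, 0.3, 0.4, 0.6, 0.2]

likely_member = [1.0, 0.5, 0.0, 1.0, 0.5]
likely_outsider = [0.5, 0.5, 0.5, 0.5, 0.0]

member_score = membership_score(likely_member, database_freq, reference_freq)
outsider_score = membership_score(likely_outsider, database_freq, reference_freq)
```

With these toy numbers the first genome scores above zero and the second below it. A handful of SNPs would prove nothing in practice; the signal becomes meaningful only when it is accumulated over thousands of markers and checked for statistical significance.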

Even without access to a participant’s genome, it may be possible to identify the person. Another team of researchers used a computer program that extracts sequences of repeating genetic markers from anonymous genomic data to create genetic profiles of the Y chromosome of 50 men whose genomes were sequenced in the 1000 Genomes Project, a study of human genetic variation. The same profiles exist in a public genealogy database, linking them to family names. The team put the names together with each man’s age, hometown and family tree — as listed on the 1000 Genomes website — to identify them in public records.
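The chain of lookups in that study can be illustrated with a toy example. All data here are invented, the function names are hypothetical, and exact matching on three markers stands in for the far more careful profile comparison a real analysis would need.

```python
def infer_surnames(profile, genealogy_db):
    """Return surnames whose recorded marker values all match the profile."""
    return [
        entry["surname"]
        for entry in genealogy_db
        if all(entry["markers"].get(marker) == value for marker, value in profile.items())
    ]


def narrow_candidates(surnames, public_records, age, hometown):
    """Keep public records that match an inferred surname, the age and the hometown."""
    return [
        record
        for record in public_records
        if record["surname"] in surnames
        and record["age"] == age
        and record["hometown"] == hometown
    ]


# Invented repeat counts at three Y-chromosome markers.
anonymous_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11}

genealogy_db = [
    {"surname": "Alder", "markers": {"DYS19": 14, "DYS390": 24, "DYS391": 11}},
    {"surname": "Birch", "markers": {"DYS19": 15, "DYS390": 23, "DYS391": 10}},
]

public_records = [
    {"surname": "Alder", "age": 52, "hometown": "Springfield", "record_id": 1},
    {"surname": "Alder", "age": 30, "hometown": "Riverton", "record_id": 2},
    {"surname": "Birch", "age": 52, "hometown": "Springfield", "record_id": 3},
]

surnames = infer_surnames(anonymous_profile, genealogy_db)
candidates = narrow_candidates(surnames, public_records, age=52, hometown="Springfield")
```

Each step is an ordinary filter over public data. The surname alone leaves two candidates, and adding age and hometown narrows the list to one record, which is why removing details such as ages from public sites makes this kind of linkage harder.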

Repositories of brain scans have similar vulnerabilities. Facial-recognition software, for example, can be used to match publicly available photos of people with features that incidentally show up in some brain scans, one 2019 study shows.

Countless other strategies that don’t call for high-level hacking skills can pin names and other information to genetic and health data. “Any person who has some background on genomics or has some background about statistics can do these types of things,” says Erman Ayday, a security and privacy researcher at Case Western Reserve University in Cleveland, Ohio.

Security breaches aside, health data can leak out in less sinister ways: Millions of times each year, people sign authorization forms that give employers and insurance providers permission to access their health records when they apply for certain jobs, such as a police officer, or when they request life insurance, workers’ compensation or Social Security disability benefits.

And more than 30 million people have sent their DNA to genetic-testing companies such as 23andMe. That company, along with six similar companies, has agreed to follow voluntary guidelines for protecting privacy, including promising not to share genetic data with employers or insurance companies without permission. But a 2018 survey of 55 similar testing companies in the U.S. revealed that many lack basic privacy protections or do not explain them; 40 companies did not state in their documentation who owns the genetic material or data, and only a third adequately described the security measures used to protect those data.

Patchwork protections:

So far, major research databases have escaped the attention of rogue actors, experts say. “There are not really instances where malevolent forces have hacked these research databases and caused any real harm,” says Benjamin Berkman, a bioethicist at the NIH in Bethesda, Maryland. But that may be, in part, because healthcare providers with lackluster security are more tempting targets. Health providers account for more than 36 percent of all publicly known security breaches — the most of any single type of organization — according to an analysis of more than 9,000 data breaches from 2005 to 2018.

After the first high-profile demonstrations of re-identifying supposedly anonymous data appeared, the NIH and some research institutions tightened privacy protections, for example by removing SNP frequencies from websites the public can access or by removing some identifying information, such as ages, from the 1000 Genomes site. But in 2018, as it became evident that virtually no data breaches were actually taking place, the NIH loosened its rules again, providing public access to the genomic data it had taken off public sites a decade earlier. (Researchers leading genetic studies of specific groups can still request that the NIH limit public access.)

“Sometimes the science changes and we, meaning the people who are in charge of protecting the public, we overreact,” says Thomas Lehner, a scientific director at the New York Genome Center who used to coordinate genomics research at the National Institute of Mental Health.

Brain-scan data may also be less vulnerable than last year’s experiment suggests. Experts say that identifying members of the general public in a large database of brain scans is much harder than matching scans to a few dozen photos that were designed to be similar in luminance, size and other features, as happened in that study. Also, autism researchers can use software to remove facial features from brain images in databases — and some of these tools come bundled with image analysis programs. “It’s easy to just remove the face — nobody will ever reconstruct who’s who,” says Martin Styner, a computer scientist at the University of North Carolina at Chapel Hill.

Many universities actively protect DNA and brain-scan data by restricting access to them: Researchers must apply for access through a university ethics committee and explain how they intend to use the data. And many studies, such as ABIDE, have protocols for making sure the data they collect from various research groups are de-identified or ‘defaced.’ “We give them scripts for defacing,” says Michael Milham, who directs the International Neuroimaging Data-Sharing Initiative, which supports ABIDE. “Before we ever share [data], we go through and check to make sure the defacing is as it should be.”

Beyond the technical challenges, decoding identities from anonymized data also breaks federal law. “If any of my colleagues tried to do something like identify a particular person, I would expect them to lose their jobs, pay an enormous fine and probably go to jail,” Pelphrey says. In 2010, a medical researcher at the University of California, Los Angeles spent four months in prison for looking into the confidential medical records of his boss, coworkers and celebrity clients such as Tom Hanks, Drew Barrymore and Arnold Schwarzenegger. The year before, in 2009, the University of North Carolina demoted a cancer researcher for negligence and cut her salary almost in half when a breast-imaging database she oversaw was hacked, putting the personal data of 100,000 women at risk. “[The lapse] had quite strong consequences, leading to her retirement,” Styner says.

Researchers who are granted access to large autism research databases such as MSSNG also sign agreements that specify harsh penalties. “Besides legal action, Autism Speaks would revoke privileges to the researchers and institution through our controlled-access point to the database,” says Dean Hartley, Autism Speaks’ senior director of discovery and translational science.

Some U.S. federal data-privacy laws may protect people from harm if their personal data fall into the wrong hands. The U.S. Genetic Information Nondiscrimination Act (GINA), for instance, prevents health insurance providers and large employers from discriminating against people based on a genetic predisposition to a particular condition. But the law does not apply to small businesses, to life or disability insurance providers, or to people who already have a health condition. The Affordable Care Act of 2010 provides more complete privacy protection than GINA by extending protection to people with a confirmed diagnosis and not just to those with a genetic predisposition.

Some states have passed laws to fill gaps in the federal laws and give people the right to seek redress for violations of their privacy. Still, many privacy and security experts remain concerned as more personal health data get shared across more databases. “There are a number of people who have been talking about [whether] we really need to look at GINA in the context of big data and the merging of these databases,” says Karen Maschke, a research scholar at The Hastings Center, a nonprofit bioethics research institute in Garrison, New York.

Even with stronger legal protections, law enforcement or courts can demand access to a research database. To shield the data from such requests, research institutions can obtain a ‘certificate of confidentiality’ from the U.S. Department of Health and Human Services. This protection is not iron-clad, however. Evidence for its effectiveness relies upon a small number of legal cases, and if researchers are unaware that they have the certificate, as many are, they will not invoke it, experts say. What’s more, the certificate becomes moot when laws require the reporting of information about infectious diseases, such as COVID-19, for the sake of public health.

Illustration shows a network of people with identities obscured, except for one figure that is emerging more clearly.

Saving a smile:

As an autism researcher and the parent of two autistic children, Pelphrey understands both sides of the privacy dilemma. Pelphrey and his autistic children have contributed their DNA through five separate studies to databases such as the National Database for Autism Research, and they remain open to future contributions. But he understands why some people hesitate to get involved. “I think a smart way for scientists to proceed is [to] think about what they would want their family doing,” Pelphrey says.

As part of that, researchers have the responsibility to explain the privacy protections they put in place, and to provide examples of how a participant’s health data might be used, he says. “We will make a point of going through the consent form and saying, ‘In this section about data sharing, this could mean data is shared with other researchers, and those researchers may be collaborating with companies,’” Pelphrey says. “We won’t list your name and identifying information, but it is your data that has pictures of your brain and information about your genome.”

Scientific institutions typically guard the data they store with multiple layers of security. Many autism databases are stored on cloud platforms that use security chips and keys along with data-encryption tools, while also allowing vetted researchers to copy and download data onto local servers. And experts are investigating even more secure ways of storing and sharing sensitive data, says Adrian Thorogood, a legal and privacy expert at the Global Alliance for Genomics and Health. One approach involves allowing access only via the cloud, blocking researchers from copying or downloading any data. Another strategy is to use ‘data stewards’ to provide information to researchers, who would not be able to directly access the data but could submit queries or models.
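The ‘data steward’ idea can be illustrated with a minimal sketch. Everything in it is hypothetical: the `DataSteward` class, the record fields and the minimum group size are assumptions made for illustration, not a description of any real database or of the systems Thorogood describes. The point is only that a researcher submits a query and gets back an aggregate, never the underlying records.

```python
from statistics import mean


class DataSteward:
    """Hypothetical steward that keeps records private and answers only aggregate queries."""

    def __init__(self, records, min_group_size=5):
        self._records = list(records)          # never handed out to callers
        self._min_group_size = min_group_size  # assumed threshold, chosen for illustration

    def average(self, field, where=None):
        """Return the mean of `field` over matching records, or None if too few match."""
        matches = [r[field] for r in self._records if where is None or where(r)]
        if len(matches) < self._min_group_size:
            return None  # refuse: a very small group could single out individuals
        return mean(matches)


# Made-up example records: five participants at site A, one at site B.
records = [
    {"age": age, "site": site}
    for age, site in [(6, "A"), (7, "A"), (8, "A"), (9, "A"), (10, "A"), (12, "B")]
]
steward = DataSteward(records)

site_a_mean = steward.average("age", where=lambda r: r["site"] == "A")
site_b_mean = steward.average("age", where=lambda r: r["site"] == "B")
```

Here the query about site A returns an average, while the query about site B is refused because only one record matches. A real steward would need far more than a group-size check, but the shape of the exchange is the same.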

Data-privacy tools are also turning up in the software applications autism researchers use. The makers of one screening app, which flags key behaviors in videos captured by smartphone cameras, are developing a privacy filter to obscure sensitive information in the videos. The filter can, for example, obscure a person’s gender or maybe even her ethnicity while still capturing facial expressions useful for analyzing behavior. “If I want to detect a smile, I could filter the image such that only points corresponding to regions of the face relevant to a smile are preserved, each such point simply represented by a moving dot,” says Guillermo Sapiro, an engineering professor at Duke University in Durham, North Carolina, who leads the project.
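As a rough sketch of the filtering idea Sapiro describes, suppose each video frame has already been reduced to labeled face points. The labels, coordinates and the `keep_smile_points` function below are assumptions made for illustration; this is not the Duke team’s filter, only a toy version of keeping the points relevant to a smile and discarding the rest.

```python
# Regions treated as relevant to a smile. This choice is an assumption for the sketch.
SMILE_REGIONS = {"mouth"}


def keep_smile_points(landmarks):
    """Keep only the (x, y) dots whose region label is relevant to a smile.

    `landmarks` is a list of (region, x, y) tuples for one video frame.
    """
    return [(x, y) for region, x, y in landmarks if region in SMILE_REGIONS]


# One made-up frame: eyes and nose are dropped, mouth points survive as bare dots.
frame = [
    ("left_eye", 30, 40),
    ("right_eye", 70, 40),
    ("nose", 50, 60),
    ("mouth", 40, 80),
    ("mouth", 50, 84),
    ("mouth", 60, 80),
]
dots = keep_smile_points(frame)
```

What remains in `dots` is three coordinates near the mouth, with no pixels, skin tone or other facial detail attached.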

Despite such progress, participants in genetic studies still shoulder a degree of risk to their privacy. In exchange, some hope to gain knowledge about their own genetic makeup, although many large autism research projects are not designed to turn up individual results.

In 2011, Maya and her family signed up for the Genes Related to Autism Spectrum Disorders study, designed to identify genetic differences between boys and girls with autism. They had hoped that their participation in the study would enable Maya’s husband and autistic son to get the genome sequencing recommended by their son’s physician. But participants in that study could only request that the researchers contact a doctor of their choice for follow-up testing if a clinically relevant genetic variant turned up; there was no option to get results directly, says lead investigator Lauren Weiss, a human geneticist at the University of California, San Francisco.

Sometimes participants are willing to take the privacy risks involved just to help move science forward. If Maya decides to participate in SPARK, she does not expect to directly benefit, she says, but hopes that such research fuels progress in the area of early autism diagnosis. “I don’t think I expect the research we participate in to help my family — research is a long process,” Maya says. “But if we can help families who haven’t yet had an autistic child, then that’s worth it.”

Meanwhile, the box of tubes sits unopened.

UNC student finds TikTok fame, joy through coding

September 22, 2020

What started as a video celebrating the completion of a long project for computer science class COMP110, Introduction to Programming, became a viral TikTok within hours. The 59-second video, posted by sophomore Ibrahim Shakhtour, has drawn around 862.6K views and 183.2K likes in less than a week.

Shakhtour did not originally plan on posting the TikTok, but he decided to share it with his 60 followers at the time and woke up the next morning TikTok famous.

“I’m like, ‘What on Earth is going on?’” Shakhtour said. “People were posting positive comments and telling me it looks great. It made me feel really good because I spent so long on that project.”

Shakhtour’s project took approximately 15 hours to complete and comprised 286 lines of code.

“There were some assignments that were very tedious and it’s difficult until you finally get the code to work,” Shakhtour said. “That’s basically like that rush of euphoria when you finally get (the code) to work. That’s what I love about comp sci.”

Many of the comments on Shakhtour’s video came from users sharing their personal experiences with coding and congratulating him on his hard work. One of the top comments acknowledged Shakhtour’s excitement over the finished project.

“I was so happy in the video, and that resonates with a lot of people,” Shakhtour said. “People could connect to the fact that after they finish something so difficult, that’s how happy they are, especially people who have programmed.”

The completed project Shakhtour filmed was for the class’s Turtle Scene, a project in which students create a design by giving a digital turtle instructions for its movement. The turtle drags a marker as it moves and is able to jump around the canvas, filling in the shapes.
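For readers who have not seen turtle graphics, here is a minimal sketch of the idea. It is not Shakhtour’s project or the course’s assignment code. The `RecordingPen` class is a hypothetical stand-in that only tracks where the turtle goes, so the example runs without opening a window, and `draw_polygon` is an illustrative helper.

```python
import math


class RecordingPen:
    """Hypothetical stand-in for a turtle: tracks position and heading, records drawn lines."""

    def __init__(self):
        self.x, self.y = 0.0, 0.0
        self.heading = 0.0   # degrees; 0 points east, angles grow counterclockwise
        self.is_down = True  # the marker starts on the canvas
        self.segments = []   # every line drawn, as ((x0, y0), (x1, y1))

    def penup(self):
        self.is_down = False

    def pendown(self):
        self.is_down = True

    def goto(self, x, y):
        self._move_to(float(x), float(y))

    def forward(self, distance):
        angle = math.radians(self.heading)
        self._move_to(self.x + distance * math.cos(angle),
                      self.y + distance * math.sin(angle))

    def left(self, angle):
        self.heading = (self.heading + angle) % 360

    def _move_to(self, x, y):
        if self.is_down:  # only a lowered marker leaves a line behind
            self.segments.append(((self.x, self.y), (x, y)))
        self.x, self.y = x, y


def draw_polygon(pen, sides, length):
    """Draw a regular polygon by repeating the same two instructions."""
    for _ in range(sides):
        pen.forward(length)
        pen.left(360 / sides)


pen = RecordingPen()
pen.penup()           # lift the marker to jump without drawing
pen.goto(-50, -50)
pen.pendown()
draw_polygon(pen, 4, 100)  # a square with 100-unit sides
```

Python’s built-in `turtle` module offers methods with these same names (`forward`, `left`, `penup`, `pendown`, `goto`), along with `begin_fill` and `end_fill` for filling shapes, so passing a `turtle.Turtle()` to `draw_polygon` should draw the same square on screen.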

The course is trying out the Python programming language for the first time, which teaching assistant professor Kris Jordan said is well suited for data science problems. Python includes a turtle graphics feature, which students have used to create artwork.

“One of the early projects we have this semester was a turtle graphics project,” Jordan said. “Students were to design and come up with a theme on their own of their own design for just producing some interesting work of art that has some requirements for how they structured their program.”

Ezri White, an undergraduate teaching assistant for the course, said she started tutoring children over the summer using Python’s turtle feature. It was also over the summer, White said, that they decided to have students do the turtle project.

“(Turtle is) so visual, it’s really easy to put two and two together between the code that you’re writing and then what’s actually happening on the screen,” White said. “I think it’s an exciting project for more visual learners as well.”

Jordan said that the tradition of programming with turtle graphics in introductory courses has a rich history over the past 50 years. He described the project as a way of looking at the process of writing programs that are structured and have repeating elements.

“Our idea with projects has been to leave them pretty open-ended,” White said. “So that students can really like get the most out of it and truly express some creativity. I don’t know if we were quite expecting the number of students that really put their full effort to this project this semester.”

White said that they invited students to say whether or not they wanted to submit their projects to be displayed somewhere, and soon heard from an overwhelming 115 students who wanted their designs to be shown.

The course now has a Museum of Turtle Art page on the course website, showcasing the projects students have submitted.

“The joy is palpable,” Jordan said. “I think for anyone who spends time programming, they are very quickly and easily able to relate to that sort of euphoric feeling of seeing your program actually work and doing what you wanted to do.”