Big Red Data

Data scientists never really know where their work is going to take them. David Shmoys, the Laibe/Acheson Professor of Business Management and Leadership Studies at Cornell Engineering, has applied his mathematical tools to topics ranging from woodpecker populations to bike-sharing programs. And when a global pandemic broke out, he was ready to shift his attention to the biggest crisis of our time.

Nobody saw the novel coronavirus coming, but data scientists at Cornell were nonetheless prepared for the moment. Over the years, they have been developing models and mathematical techniques to address some of the world’s most vexing problems, whether pandemics, climate change, or transportation. At Cornell, data science is a collaborative effort involving researchers from many different fields, including biology, the social sciences, physics, and engineering. Much of the work happens in three key hubs: the School of Operations Research and Information Engineering (ORIE), the Center for Data Science for Enterprise and Society (which Shmoys directs), and the Institute for Computational Sustainability. The work crosses disciplines and borders, with direct impacts on critical care units in New York City, hydroelectric dams on the Amazon, and many places in between.

Using “big data” techniques to address real-life problems is the guiding mission of all three entities, Shmoys says. As he explains, big data could be applied to all sorts of esoteric topics, but Cornell researchers aren’t running numbers just for the sake of the mathematical challenge. “We’re using computational tools to improve decision-making capabilities,” Shmoys says. “We are especially attuned to problems that impact society.”

Data in a time of coronavirus

Of course, there’s no bigger problem right now than the novel coronavirus. As the pandemic gained steam, Cornell data scientists were called to action. In response to a request for expertise from the office of New York Governor Andrew Cuomo, Shmoys and colleagues have been tracking the spread of the virus and developing models to predict the need for ventilators and other vital pieces of equipment. Meanwhile, Peter Frazier, an associate professor in ORIE, has been crunching numbers on the best way to test large groups of people and potentially get them back into the workforce. “Even though we didn’t expect this pandemic, it’s very much in our wheelhouse,” Frazier says. “Our goal is to develop and apply math that’s useful and practical.”

The pandemic has raised questions that are as urgent as they are complex. For example, a wide range of variables can affect the need for ventilators in New York City, and the facts on the ground are in constant flux. To build models that predict the most likely outcomes, Shmoys and colleagues have collected data from around the world to learn more about transmission rates and the chances that an infection leads to significant illness. They’ve also taken a hyper-local view by tracking coronavirus infections by zip code, a critical piece of information for understanding the demographics of the disease at a neighborhood-by-neighborhood level. “You can think of this epidemic as multiple micro-epidemics,” Shmoys says, each with its own challenges.
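
To make the idea of micro-epidemics concrete, here is a minimal sketch, not the team’s actual model, of a discrete-time SIR (susceptible-infected-recovered) simulation run separately for each zip code, each with its own transmission rate. Every population, case count, and rate below is an illustrative assumption.

```python
# Minimal discrete-time SIR sketch: treat each zip code as its own
# "micro-epidemic" with its own transmission rate. All numbers are
# illustrative assumptions, not parameters from the Cornell models.

def simulate_sir(population, initial_infected, beta, gamma=0.1, days=120):
    """Return the daily count of active infections for one zip code.

    beta:  expected transmissions per infectious person per day
    gamma: daily recovery rate (1 / average infectious period)
    """
    s, i, r = population - initial_infected, float(initial_infected), 0.0
    history = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append(i)
    return history

# Hypothetical zip codes: (population, initial cases, transmission rate).
zip_codes = {"10001": (60_000, 50, 0.30), "10451": (45_000, 80, 0.45)}
for name, (pop, cases, beta) in zip_codes.items():
    peak = max(simulate_sir(pop, cases, beta))
    print(f"zip {name}: projected peak of about {peak:,.0f} active infections")
```

Giving each neighborhood its own transmission rate is what lets the same machinery capture very different local outbreaks, and peak active infections is the kind of quantity that feeds a ventilator forecast.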

Shmoys notes that Cornell data scientists have a history of responding to public health crises. During the anthrax scare of 2001, John Muckstadt, then the Acheson/Laibe Professor of Business Management and Leadership in ORIE, worked closely with Dr. Nathaniel Hupert at Weill Cornell Medicine to develop possible approaches for mass distribution of antibiotics against the potential biological weapon. “That set off a number of projects that explored both operational issues and system modeling approaches to improving medical care and understanding the progression of a disease,” Shmoys says. “It’s been a recurring theme of work within ORIE at Cornell.”

Thinking big about coronavirus testing

For his part, Frazier is exploring the possibility of “group testing,” an approach first developed during World War II to screen soldiers for syphilis. The basic idea, then as now, is to combine samples from a large number of people and test them all at once. If a combined sample from one hundred people tests negative for the novel coronavirus, everyone who contributed to it can be assumed free of the infection. In theory, those hundred people could then go about their day without fear of spreading the virus to others. If the combined sample tests positive, the testers go back to the individual samples to zero in on the infected people.
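
The two-stage scheme Frazier describes is the classic Dorfman procedure from that World War II syphilis work. Here is a minimal sketch of it in code, where test_pool() is a hypothetical stand-in for running a single lab test on a pooled sample:

```python
# Minimal sketch of two-stage ("Dorfman") group testing. test_pool() is
# a hypothetical stand-in for running one lab test on a pooled sample.
import random

def test_pool(samples):
    """Return True if any sample in the pool is positive (hypothetical)."""
    return any(samples)

def dorfman_screen(samples, group_size):
    """Return indices of positive samples and the number of tests used."""
    positives, tests = [], 0
    for start in range(0, len(samples), group_size):
        group = samples[start:start + group_size]
        tests += 1                        # one test for the whole pool
        if test_pool(group):              # pool positive: retest one by one
            for offset, sample in enumerate(group):
                tests += 1
                if sample:
                    positives.append(start + offset)
    return positives, tests

# Illustration: 1,000 samples at 0.1% prevalence, pooled 32 at a time.
random.seed(1)
samples = [random.random() < 0.001 for _ in range(1000)]
found, used = dorfman_screen(samples, group_size=32)
print(f"found {len(found)} positives using {used} tests instead of 1,000")
```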

The approach is relatively simple in concept, but it raises extremely complicated mathematical questions. For starters, what’s the optimum number of people to test at a time, and how often do they need to be tested? The answers depend on many factors, including the prevalence of the virus. “If one out of ten people had the virus, testing one hundred people at a time would be a waste because almost every sample would come back positive,” Frazier says. “But if the infection rate is more like one in a thousand, testing large groups of people makes more sense.”
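
Frazier’s trade-off can be made precise. Under the two-stage scheme, with prevalence p and pools of n people, the expected number of tests per person is 1/n + 1 - (1 - p)^n: one pooled test shared by n people, plus n individual retests whenever the pool comes back positive. A short sketch, with illustrative prevalences rather than estimates from Frazier’s work:

```python
# Expected tests per person under two-stage group testing, assuming
# independent infections and a perfectly accurate test:
#   cost(n, p) = 1/n + 1 - (1 - p)**n

def expected_tests_per_person(n, p):
    return 1 / n + 1 - (1 - p) ** n

for p in (0.1, 0.001):   # illustrative prevalences: 10% and 0.1%
    best_n = min(range(2, 501), key=lambda n: expected_tests_per_person(n, p))
    cost = expected_tests_per_person(best_n, p)
    print(f"prevalence {p}: best pool size {best_n}, {cost:.3f} tests per person")
```

With these assumptions the optimum works out to pools of about four people at 10 percent prevalence but about 32 at 0.1 percent, echoing Frazier’s point that rare infections favor large pools; at 10 percent prevalence, pools of one hundred would actually cost slightly more than one test per person.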

Things only get more complicated from there. Widespread group testing would raise thorny logistical issues, Frazier says. “If you’re collecting saliva from 320 million people once a week, and delivering it to one of the 12 labs in the U.S., that’s going to take a lot of cars and planes,” he says. If that day comes, Frazier and other Cornell data scientists will be ready to offer mathematical guidance. “We have a lot of experience dealing with uncertainty,” he says.

Frazier had been working on the mathematics of group testing in a much different context long before the new coronavirus came on the scene. Instead of searching for signs of disease in a group of people, he was looking for faces in photographs. Pictures from a cousin’s wedding seemingly don’t have much in common with pandemics, but Frazier explains that the mathematical concepts behind the two searches are very similar. When looking for faces, it helps to sample multiple locations at once. If there’s no face in that sample (in other words, if the sample tests negative), the search can continue elsewhere. If it comes back positive, the computer can take a closer look.
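
The same test-a-pool-first logic drives the image search. Here is a minimal sketch, assuming a hypothetical detector has_face(image, region) that reports whether a region contains any face at all: test a large region once, skip it if the “pool” is negative, and subdivide it when the pool is positive.

```python
# Sketch of group-testing-style face search: test a whole region at
# once; if it "tests negative" (no face anywhere inside), skip it; if
# positive, subdivide and look closer. has_face() is a hypothetical
# stand-in for a real face detector.

MIN_SIZE = 32  # stop subdividing below this width/height (pixels)

def find_faces(image, region, has_face):
    """Recursively return small regions that contain faces.

    region is (x, y, width, height); has_face(image, region) -> bool.
    """
    x, y, w, h = region
    if not has_face(image, region):        # the "pooled test" is negative
        return []
    if w <= MIN_SIZE and h <= MIN_SIZE:    # positive and small: report it
        return [region]
    half_w, half_h = max(w // 2, 1), max(h // 2, 1)
    quadrants = [(x, y, half_w, half_h),
                 (x + half_w, y, w - half_w, half_h),
                 (x, y + half_h, half_w, h - half_h),
                 (x + half_w, y + half_h, w - half_w, h - half_h)]
    hits = []
    for quad in quadrants:                 # only positive pools get retested
        hits.extend(find_faces(image, quad, has_face))
    return hits
```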

The wide world of computational sustainability

Just as Frazier and Shmoys apply their mathematical techniques to widely varying problems, Carla Gomes, professor of computer science and the director of the Institute for Computational Sustainability, takes a broad view of a field that she helped pioneer. As the name implies, computational sustainability uses data science to address sustainability problems that affect human well-being. Gomes works on complicated projects involving bird migrations, fishery quotas in Alaska, and hydroelectric power in the Amazon, but she always follows the same rule when choosing a topic: she’ll only tackle the mathematics if she can collaborate with top scientists in the relevant fields who can guide her through the core issues. “I work on problems when I have access to the highest levels of expertise,” she says.

For her work on hydroelectric dams, Gomes collaborates closely with Alex Flecker, a professor of ecology and evolutionary biology, and a large international, interdisciplinary group of ecologists, hydrologists, social scientists, and other researchers, many from the Amazon region. Together, they are using data science to understand how and where dams could be placed in the Amazon River basin to deliver the greatest benefits with the fewest environmental downsides. “Everybody thinks that hydropower is automatically clean energy, but there are a lot of trade-offs,” she says. As Gomes, Flecker, and an international team of co-authors recently described in a paper in Nature Communications, the giant reservoirs created by dams can become sources of methane, a potent greenhouse gas. “If you don’t plan properly, hydroelectric energy can be dirtier than coal,” she says.

With hundreds of potential dam sites under consideration, choosing the best approach quickly becomes an exercise in astoundingly large numbers. “I have a computer with one terabyte of memory that’s completely dedicated to keeping track of possibilities,” Gomes says. “The number of potential combinations exceeds the number of atoms in the universe.”
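
The arithmetic behind that claim is straightforward: n candidate sites yield 2^n possible portfolios of dams, and 2^300 is roughly 10^90, more than the estimated 10^80 atoms in the observable universe. The sketch below shows the underlying idea at toy scale, with entirely made-up site data: enumerate the portfolios and keep only the Pareto-optimal ones, those that no other portfolio beats on both energy and methane.

```python
# Brute-force sketch of the planning problem: enumerate every subset
# ("portfolio") of candidate dam sites and keep the Pareto-optimal ones,
# i.e. those where no other portfolio gives at least as much energy with
# no more methane. All site numbers below are hypothetical.
from itertools import combinations

# (site name, energy in MW, methane in kilotonnes CO2-equivalent / year)
SITES = [("A", 120, 40), ("B", 300, 220), ("C", 80, 10),
         ("D", 450, 500), ("E", 200, 90)]

portfolios = []
for k in range(len(SITES) + 1):
    for subset in combinations(SITES, k):
        energy = sum(site[1] for site in subset)
        methane = sum(site[2] for site in subset)
        portfolios.append((energy, methane, [site[0] for site in subset]))

# Keep p only if no q is at least as good on both objectives and
# strictly better on at least one.
pareto = [p for p in portfolios
          if not any(q[0] >= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                     for q in portfolios)]
for energy, methane, names in sorted(pareto):
    print(f"{names}: {energy} MW, {methane} kt CO2e/yr")
```

The real project relies on far smarter algorithms, since this brute force is hopeless beyond a few dozen sites, but the Pareto frontier it produces is the kind of object planners need: a menu of best-possible trade-offs rather than a single answer.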

As part of her sustainability mission, Gomes is also collaborating with R. Bruce van Dover, chair of the Department of Materials Science and Engineering, and others on developing materials for fuel cells, devices that convert fuel such as hydrogen into usable electricity. To aid the search for the right materials, Gomes, van Dover, and colleagues are developing a robot named SARA (Scientific Autonomous Reasoning Agent) that uses artificial intelligence algorithms to propose and test possible options.
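
The details of SARA are beyond the scope of this article, but the flavor of autonomous experimentation is easy to sketch: a closed loop that picks a candidate, runs an experiment, records the result, and uses what it has seen so far to pick the next candidate. Everything below is hypothetical, right down to the measure_conductivity() stand-in for a real experiment.

```python
# Heavily simplified caricature of a propose/measure/update loop for
# autonomous materials search. Everything here is hypothetical:
# measure_conductivity() stands in for a real experiment, and the
# selection rule (try the point farthest from anything measured) is a
# crude proxy for the uncertainty-driven choices a real system makes.
import random

random.seed(0)

def measure_conductivity(composition):
    """Hypothetical experiment: score a candidate material in [0, 1]."""
    return random.random()

# One hypothetical composition "knob", e.g. the fraction of one element.
candidates = [round(x * 0.05, 2) for x in range(21)]  # 0.00, 0.05, ..., 1.00
results = {}

for _ in range(5):                                    # budget of 5 experiments
    untried = [c for c in candidates if c not in results]
    # Pick the untried candidate farthest from everything measured so far.
    choice = max(untried,
                 key=lambda c: min((abs(c - m) for m in results), default=1.0))
    results[choice] = measure_conductivity(choice)

best = max(results, key=results.get)
print(f"most promising composition so far: {best} (score {results[best]:.2f})")
```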

Endless data

With so many possible applications for their mathematical approaches, data scientists will never run short of targets. If and when the coronavirus pandemic fades or fuel cells are perfected or Brazil installs its last hydroelectric dam, other problems will be waiting. And if another unexpected global crisis arises, the tools that data scientists have already built will almost certainly come into play again. “The advances in computational methods provide many opportunities,” Shmoys says.
