When Yudong Chen was a child in Guangzhou, China, he liked to build things with his father. “My father was an electrical engineer and a self-taught carpenter,” says Chen. “He made all of the furniture in our house and I used to like to help him build things.” Today, Chen is an assistant professor in the School of Operations Research and Information Engineering at Cornell, and he still likes to build things, though these days the things he builds are more likely to be made of computer code than of wood.
Chen’s research interests include machine learning, high-dimensional and robust statistics, and convex optimization. Working at the intersection of statistics and optimization, he combines his knowledge of computer science with his strength in statistics to write algorithms that are useful when dealing with massive datasets. Chen does not refer to Big Data when talking about his work. Instead, he calls it Big and Noisy Data.
A clear example of such data comes from how people use their personal computers. The data collected describing your use of your laptop would include every website you visited, every document you opened, and every e-mail you read. It would encompass how long you spent on each screen you had open, how far down a page you scrolled, every ad you clicked on, and each ad you silenced with the mute button. There are times when you get up and walk away from your computer while it remains open to a particular page. There are other times when you click on a link accidentally. In looking at the Big Data collected from your computer use, a researcher could not differentiate between intentional and unintentional clicks or intervals on a particular page. This is why Chen says the data is Big and Noisy.
Chen’s focus is on writing algorithms that can account for this “noise” in the data. If a particular data set has a lot of inaccurate or useless data, then it can be a waste of computing resources to analyze it with a slow algorithm capable of very high accuracy. The answers it comes up with may be useless due to the messiness of the original data. “My goal is to write algorithms that are accurate to the noise level of the particular data set,” says Chen.
One possible use of the sorts of algorithms Chen builds is to make recommendations. A company like Netflix or Amazon can use the massive set of data about you and your computer use to recommend items you may be interested in. The closer these recommendations come to your actual wants and preferences, the more you may use Netflix or Amazon.
In the end, Chen’s interest in the problem of large, messy data sets combines both the computational and statistical ways of thinking. “Traditionally, computer scientists focus on analyzing data thoroughly and efficiently,” says Chen. “I am interested in bringing a statistician’s approach to accounting for variance and noise. I find these sorts of problems very interesting.”
Chen received his undergraduate degree and his Master’s in control science and engineering from Tsinghua University in Beijing. He earned his Ph.D. from the University of Texas at Austin in electrical and computer engineering. He was at UC Berkeley for two years as a postdoctoral scholar before coming to Cornell in 2015.
Chen was excited to join the faculty at ORIE. “Cornell ORIE is very well known and the best of its kind,” says Chen. “Now here I am working with so many of the people I have looked up to. It is not intimidating; it is exciting.”
When he is not teaching or doing research, Chen likes to run, play soccer, ping pong, and basketball, and watch movies recommended by Netflix.