Sramana Mitra: It is fascinating and fun what you are talking about. But let me understand the framework of how these are done. First, who is sponsoring the prize money, and who is providing the data?
Jeremy Howard: The same company in both cases – Heritage Provider Network. This network is run by Dr. Richard Merkin. Dr. Merkin is a visionary guy. He has been very successful in business, he is aware of the challenges in the U.S. healthcare industry, and he believes that the use of data could lead to much better health outcomes. His company – his initiative – is both making the data available for the competition and making the $3 million prize available as well.
SM: And in the second case? If GE is providing the prize money, where is the data coming from?
JH: The prize money is $250,000 from GE. GE has businesses that are very much focusing at the moment on something called the industrial Internet – or the Internet of things – which is the idea that pieces of equipment are full of sensors and networking. The GE Quest platform that we are running for them was actually launched at a conference by their CEO, Jeff Immelt. Jeff literally had next to him on the stage a $30 million or $60 million jet plane engine. Each engine that GE makes spits out 10 terabytes of data every day. In this case the data is being used to try to understand the impact of congestion and weather in the U.S. airspace. The data comes from a company called FlightStats. They bring together all the radar and traffic control data from the U.S. and make it available to people to help them with their flight planning optimization.
SM: What about the private competitions? What are examples of the private competitions that you have been or are running?
JH: I can’t really give you any details, since they are private competitions by definition. But in general they all tend to be using data that is more sensitive. This is for companies in need of critical algorithms, where they need the best results from the best modus operandi, and they need to keep the data secret.
SM: How about the types of problems you are trying to solve in these private competitions?
JH: They cover everything from finance, marketing, and retail optimization to pharmaceutical discoveries – every area that companies strategically compete on nowadays.
SM: How many data scientists or technologists are working on your platform and competing in these contests right now?
JH: I think we reached around 75,000 registered users on the Kaggle platform.
SM: How many of those actually enter contests?
JH: In general, there are far more people who download the data than ones who upload solutions, because people have a tendency not to upload solutions unless they have the sense that they are making good progress. Generally there are a few thousand people on more successful competitions who download the data and try to build models.
SM: How many will upload solutions?
JH: That depends on the complexity of the problem. The GE Flight Quest problem, for example, which requires simulating the entire U.S. airspace, is maybe the most complex competition. There are maybe 100 or 200 people who successfully navigated that.
SM: That is good. So 10% to 20% of the people who download the data set are coming up with solutions that are compelling to some degree. Those are impressive numbers.
JH: It is certainly impressive how many people have managed to come up with a solution to that problem. I was quite surprised. In some of our simpler competitions, 1,000 to 2,000 people will submit models. Then it is quite common for people to submit 10 or 20 models over the course of the competition as they come up with new ideas.