Sramana Mitra: What about the data repositories on which these algorithms are being run? Is that still fitting inside corporate data warehouses, and R software plugs in to them?
Jeremy Howard: The hard thing generally is training the algorithms, not so much running them. Training an algorithm, to oversimplify, is figuring out the coefficients in a formula. Once you have figured out what they are, what you are left with is a very simple mathematical formula.
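[Editor's note: Howard's point, that training searches for coefficients and prediction afterward is just a cheap formula, can be sketched with a toy least-squares fit. This is purely illustrative; the data, model, and names here are made up and are not from Kaggle.]

```python
import random

# Toy data generated from y = 3*x + 2, plus a little noise.
random.seed(0)
data = [(x, 3 * x + 2 + random.uniform(-0.1, 0.1)) for x in range(20)]

# "Training": a least-squares search for the two coefficients
# (slope and intercept). This is the expensive, data-hungry step.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
    (x - mean_x) ** 2 for x, _ in data
)
intercept = mean_y - slope * mean_x

# Once the coefficients are found, the trained model is just
# a simple mathematical formula applied to new inputs.
def predict(x):
    return slope * x + intercept

print(slope, intercept)  # should recover values near 3 and 2
```

The same split holds for far more complex models: fitting may take hours on millions of rows, while applying the fitted coefficients to a new record is nearly free.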
SM: How do you do the simulation? Do you run experiments to see what is working and what is not? If you are trying to train the algorithm, it needs to run on large data sets so you can see whether what you are trying to do is working. Is that correct?
JH: We would think of them as large data sets, but they fit easily onto a laptop computer. Training these machine learning models does not generally improve if you use larger computers.
SM: So, the sample set that you need to work with in order to test its efficacy is not too large – whatever space you need fits on a laptop.
JH: That is right. We are still talking millions of [pieces of information], which to start with you would consider large, but it doesn’t require server [space].
SM: And corporations are providing those sample data sets to the contestants?
JH: There are two types of competitions – private and public. In both cases the sponsoring organization provides the data set necessary to train and test the models. In the case of public competitions, the data set is made available to everybody on the Internet to download. In the case of private competitions, a subset of 10 or 15 of the most successful Kaggle competitors is invited to compete in secret. They have to sign a non-disclosure agreement before being given access to that private data.
SM: Let’s talk about a few different metrics. How many public competitions on average are you running these days?
JH: We generally try to run five or six competitions at any one time. We find that to be the ideal number in terms of maximizing engagement and interest.
SM: Can you give examples of the types of public competitions you are running right now or have run recently?
JH: Our two largest competitions are running right now. The largest in terms of prize money is called the Heritage Health Prize. The prize money for that is $3 million for the winner. The data they provide is anonymized health records from Californian patients, covering their claims history, their lab results, and their prescriptions. The goal of the competition is to predict which of these patients will be hospitalized, because it is estimated that $40 billion is wasted in America on unnecessary hospitalization. The idea is to come up with an algorithm that identifies the patients who most urgently need a higher level of care, so as to keep people out of hospitals when different, earlier healthcare could have helped them.
The next-largest competition in terms of prize money, and the largest in the number of people signed up, is the GE Flight Quest. It is sponsored by GE and aims to improve the ability to predict which flights will be delayed. Competitors get a complete picture of U.S. airspace over a two-month period, and they have to predict, for every flight, when it will land and when it will arrive at the gate.