Saed Syad: The second thing is that many of those techniques share an issue, which is the assumption of independence between variables. As a remedy, people created n-grams, just to capture the correlation between different keywords. But if I have a data set with 100 variables, I cannot just build n-grams over every combination of the values I have. It is a huge amount of work, and there are endless combinations of those values. We needed something more, something that could support the majority of the predictive modeling we do, at least at a linear level.
The way I designed real-time learning machines is that I first cut the connection between the data and the model. Then there is another component called the learner. The learner processes the data and updates a table, which we call the basic element table (BET). One of the components of this basic element table is frequency. At that level, if you use only the counts, the basic element table is equivalent to a search engine's structure. Besides the count, we store the sum of x, the sum of y, the sum of xy, the sum of x², the sum of x³, or the sum of x⁴ if we need it. But any component we add to the basic element table must support the six core features. We cannot add minimum or maximum to that table: those are incrementally scalable but not decrementally scalable. We could find a workaround, but generally those types of parameters are not scalable.
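The idea above can be sketched as a single BET cell holding running statistics for one pair of variables. This is a minimal illustration, not the actual RTLM implementation; the class and method names are hypothetical. Note how every stored parameter can be both added to and subtracted from, which is exactly the property a minimum or maximum lacks.

```python
class BetCell:
    """One cell of a basic element table: sufficient statistics
    for a pair of variables (x, y)."""

    def __init__(self):
        self.n = 0           # frequency (count)
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xy = 0.0
        self.sum_x2 = 0.0
        self.sum_y2 = 0.0

    def add(self, x, y):
        """Incremental update: fold one observation into the cell."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_x2 += x * x
        self.sum_y2 += y * y

    def remove(self, x, y):
        """Decremental update: retract one observation.
        A running min or max cannot be maintained this way,
        which is why such parameters are excluded from the table."""
        self.n -= 1
        self.sum_x -= x
        self.sum_y -= y
        self.sum_xy -= x * y
        self.sum_x2 -= x * x
        self.sum_y2 -= y * y
```

Because each update touches only one cell, different cells can be updated by different learners in parallel, which is where the linear scalability comes from.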
Using the BET, we can add as many parameters as we like, provided those parameters support the six main features, and having those components means we can build a model without touching the raw data. The learner is completely independent from the modeler. The modeler connects to the BET and can independently read that data and build a model. The learner is another machine, or set of machines, that runs in parallel to update the BET. The structure of the BET is basically a table of connections between pairs of variables, with each cell, one variable pair, updatable on a separate core. That means the level of parallelism is the level of a single cell; it is a completely linearly scalable structure. Having the BET gives us huge flexibility and speed. For example, we build a linear discriminant analysis (LDA) model to predict the probability of a click. Using LDA, we have about 50 to 60 variables, but data in online marketing is very noisy, so instead we build a model on three to five variables. That model is much more effective than one with 50 to 60 variables, because we have removed all the noisy ones. To reach those three to five variables, we need to build between 500 and 1,000 LDA models, and we build those in less than a minute.
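A small example of "building a model without using the data": a modeler can compute, say, a Pearson correlation between two variables straight from a cell's stored sums, never re-reading the raw observations. This is a sketch under that assumption; the function name and signature are hypothetical, and correlation stands in here for the more elaborate LDA fitting described above.

```python
import math

def pearson_from_cell(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson correlation of (x, y) computed purely from the
    sufficient statistics stored in one BET cell."""
    cov = n * sum_xy - sum_x * sum_y              # n^2 * covariance
    var_x = n * sum_x2 - sum_x ** 2               # n^2 * variance of x
    var_y = n * sum_y2 - sum_y ** 2               # n^2 * variance of y
    return cov / math.sqrt(var_x * var_y)
```

Because each candidate model only needs a handful of such cell reads, trying hundreds of variable subsets (the 500 to 1,000 LDA models above) is cheap: no pass over the data is ever repeated.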
SM: Your problem has a few different components. One is speed: you have to deliver results that have a high probability of people clicking on them. You have to do this at scale, so there is a scalability issue. Then you have the learning issue: the program has to learn from behavior to predict which ad people are most likely to click on. What is your key strategy for gaining on time, scale, and prediction accuracy?
SS: For scalability we use RTLM (real-time learning machines) to read the data in a continuous way: we get the data, put it in the system, and process it. Streaming the data is how we scale the reading and processing stage. We achieve accuracy through the power of RTLM, building thousands of models on the fly in less than a minute to find the best subset of variables. We achieve speed by using that smaller set of variables. We have a very small equation, and that is why the scoring process is so fast. We can easily score 50,000 bids in a second.
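The "very small equation" can be pictured as a linear score over the three to five selected variables. The following is a minimal sketch, not the production scorer: the function name is hypothetical, and the logistic squash to a click probability is my assumption on top of the linear LDA-style score described above.

```python
import math

def score_bid(features, weights, bias):
    """Score one bid request with a small linear model.
    With only a handful of variables this is a few multiplies
    and adds, which is what makes 50,000 scores per second feasible."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))   # squash to a click probability
```

With three to five terms per score, throughput is bounded mostly by I/O rather than arithmetic.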