Sramana Mitra: Let me ask you a definition question: How do you define big data? Business analytics has been around for a long time. A lot of what you are doing has been around for a long time in other business analytics formats. How do you define the difference between what is happening in the more generic analytics business world and in big data? What is the differentiating factor?
Oliver Downs: Great point. Because of the space we are in, we are dealing with what is arguably the original big data. It is the reason AT&T invented the original hierarchical database, for example. What I like to do is draw the modern analogy to people. Few people realize that, on average, a single prepaid mobile customer generates around 29 pieces of transactional data that we can observe per day. That sounds quite small until you realize that the average Facebook user generates only three and the average Twitter user fewer than one. Facebook and Twitter might be the more common notion of what big data is. Of course, Facebook has around one billion users. Carriers, by contrast, are in the 70 million to 100 million user range, but with far more pieces of transactional information per customer per day.
You will realize that this is truly a big data problem. For us, that means a medium-sized carrier – perhaps in the 5 million to 10 million subscriber range – will generate about 10 petabytes of data per year. The scale has always been large. What has changed in the past five years is the set of tools and approaches available for getting into that data. It is much more common today to have an EDW [enterprise data warehouse] that computes aggregates on a daily cycle – customer-level information on a daily basis. But with conventional non–big data technologies, it is very hard to drill into sequences of customer behavior and representations of patterns of events – much harder than when you have that kind of technology at the heart of your data management systems.
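The scale Downs describes can be sanity-checked with back-of-envelope arithmetic. In the sketch below, the 29 events per day and the subscriber range come from the interview; the subscriber count chosen and the per-event size derived are illustrative assumptions, not figures from the text:

```python
# Back-of-envelope check of the data volumes mentioned above.
EVENTS_PER_SUBSCRIBER_PER_DAY = 29   # from the interview
SUBSCRIBERS = 7_000_000              # assumed mid-point of the 5M-10M range
DAYS_PER_YEAR = 365

events_per_year = EVENTS_PER_SUBSCRIBER_PER_DAY * SUBSCRIBERS * DAYS_PER_YEAR
print(f"transactional events/year: {events_per_year:,}")  # ~74 billion

# The quoted 10 PB/year implies a large average footprint per event,
# which suggests the total includes raw network data beyond the 29
# customer-level transactions (an inference, not a claim from the text).
PETABYTE = 10**15
implied_bytes_per_event = 10 * PETABYTE / events_per_year
print(f"implied bytes per event: {implied_bytes_per_event:,.0f}")
```

Even at this rough level, the arithmetic makes clear why daily aggregates are tractable in a conventional warehouse while event-level sequences are not.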
SM: Would you say that part of that tool and technology set that has become available or prevalent in the big data era includes learning technologies?
OD: I think what has happened is that the tool sets have expanded, and they don’t necessarily require a PhD-level data scientist to apply them. In fact, in many cases a reasonably sophisticated software developer may be able to apply those technologies, thanks to the advancement in machine learning tool kits that are now available. The challenge is that there is still a fragility to the success of those types of initiatives. Understanding the algorithm and understanding how it might fit or not fit a given problem has a huge impact on whether the application of machine learning to the problem in question will be successful.
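Downs's point about algorithm–problem fit can be illustrated with a minimal, hypothetical sketch (pure Python; not an example from the interview): a perceptron, the simplest linear classifier, learns AND perfectly but can never classify XOR, because XOR is not linearly separable. Tool kits make both models equally easy to run; knowing whether the model family can represent the problem is what determines success.

```python
# A linear classifier (perceptron) applied to two toy problems:
# AND is linearly separable and is learned perfectly; XOR is not,
# so no setting of the weights can classify all four points.

def train_perceptron(data, epochs=20, lr=1.0):
    """Classic perceptron update rule with a bias term."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            w[0] += lr * (y - pred) * x1
            w[1] += lr * (y - pred) * x2
            b += lr * (y - pred)
    return w, b

def accuracy(data, w, b):
    correct = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == y
        for (x1, x2), y in data
    )
    return correct / len(data)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_data = [(x, int(x[0] and x[1])) for x in inputs]
xor_data = [(x, int(x[0] != x[1])) for x in inputs]

and_acc = accuracy(and_data, *train_perceptron(and_data))
xor_acc = accuracy(xor_data, *train_perceptron(xor_data))
print(f"AND accuracy: {and_acc}")  # 1.0 - linearly separable
print(f"XOR accuracy: {xor_acc}")  # below 1.0 - no linear fit exists
```

The same asymmetry holds at carrier scale: a model family that cannot represent the pattern in the data will fail no matter how polished the tool kit around it is.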
SM: What is the status of the machine learning tool kits?
OD: There have been good developments. Historically, the very strong tool kits were proprietary – think of tools like SAS and MATLAB. Since then we have seen significant public investment in data science technologies and the rise of the technology stack around [the programming language] R. We now also see rapid development of machine learning technologies on top of Hadoop. There is a much better tool set to draw on, and much better connectivity with those proprietary tools, which are very good for analyzing data but very poor at operationalizing the findings of the data analysis. With a Python-based technology stack, for example, you can go from interactive visualization on a scientist's desktop to a high-performance scientific computing program running at scale.