Sramana Mitra: That is very interesting. Talk to me a little bit about the big data angle of this. What kinds of data scales and technology stacks are you dealing with?
Jim Swift: There are several different components to it. I can’t give you a perfect definition of big data. Whenever the data starts to get a little unruly for the problems you are trying to solve, and it is really testing the skills of your organization and the limits of the technology (the speed), I think that is when you start getting into big data. I saw it at several different points back in the credit card industry in the 1990s. We were doing things that the technology just couldn’t support. We were not able to do full-file analysis. We had to take samples out of our complete databases so that we could run real-time stats and views on them. You would try to take a representative sample, take the results, and project the counts and the trends back against the full population. But we didn’t have the horsepower or the storage capacity to be able to do it on the full file.
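The sample-and-project workaround he describes can be sketched in a few lines. This is a minimal illustration, not his firm's actual code; the population of card balances and the 55,000-record sample size (mentioned later in the interview) are assumed for the example.

```python
import random

def project_from_sample(population, sample_size, predicate):
    """Estimate how many records in the full population satisfy a
    predicate by measuring a random sample and scaling the count up."""
    sample = random.sample(population, sample_size)
    hits = sum(1 for record in sample if predicate(record))
    # Project the sample rate back onto the full population.
    return round(hits / sample_size * len(population))

# Hypothetical population of card balances; estimate how many exceed 500.
random.seed(7)
balances = [random.randint(0, 1000) for _ in range(1_000_000)]
estimate = project_from_sample(balances, 55_000, lambda b: b > 500)
actual = sum(1 for b in balances if b > 500)
```

The trade-off he points to is exactly what this hides: the projection is only as good as the sample's representativeness, which is why full-file analysis became preferable once hardware allowed it.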
That has changed. Now storage is a lot less expensive, computers are faster, file management systems are better, and indexing is not our big limiter anymore, so we are able to do full-file analysis. It is amazing; it wasn’t that long ago that we were pulling out 55,000-record samples just so we could manage the data. I have seen it in all different cases. The last company I was part of built a massively parallel framework and had it in production in late 1999, when we decided we didn’t want to build index solutions. They were too slow and too limiting, and it took a week to build indexes. We decided we were going to load data up in memory strung across massively parallel PC nodes in our data center. That way we would not have to build indexes. We would do a file scan, read every record in real time, and be able to process the most complex queries subsequently. Now there are Hadoop and other things that run in a similar fashion.
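The index-free architecture he describes — shard the data across nodes, hold each shard in memory, and answer queries by scanning every record — can be sketched as follows. This is a single-process illustration of the pattern, with made-up records; in the real system each partition would live in RAM on a separate node and be scanned in parallel.

```python
def scan_partition(partition, predicate):
    """Full scan of one in-memory partition: no index, just read
    every record and keep the matches."""
    return [rec for rec in partition if predicate(rec)]

def parallel_query(partitions, predicate):
    """Fan the same scan out to every partition and merge the results.
    Here the partitions are scanned in one process for clarity; the
    architecture described would run one scan per node."""
    results = []
    for partition in partitions:
        results.extend(scan_partition(partition, predicate))
    return results

# Hypothetical records, sharded across four "nodes".
records = [{"id": i, "amount": i * 3 % 100} for i in range(10_000)]
partitions = [records[i::4] for i in range(4)]
big = parallel_query(partitions, lambda r: r["amount"] > 95)
```

Because the query is just a predicate applied to every record, arbitrarily complex conditions cost the same as simple ones — the property that made the scan approach attractive compared with pre-built indexes.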
To me, there are a couple of dimensions to it. One is natural language processing. There is just more and more data around us on the web all the time, and we are trying to make sense of it. We have to take all this unstructured data and we have to do a few things: We have to figure out what it is somehow and classify it. We have to figure out what it is talking about, which is a really big challenge – the entity resolution. Whether it is a business, a person, a place, a product, or some other type of thing, we need to reliably resolve what entity we are talking about. What we are doing is processing all of these kinds of unstructured text or records that we are pulling in – things like public records from thousands of sources, and overall many tens of thousands of data sources, with millions of records flying around at any given time. Once it comes in, we have to have very fast and efficient ETL processes to transform the data into something that is more useful. That means matching it the right way, resolving the entities as I mentioned, “householding” it into the right groups, and calculating whatever derivations and scores are needed on it. It needs to be really fast. The size of the data and what you have to do to manage it become key considerations.
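One step of the ETL chain he outlines — resolving an incoming record to a known entity — can be sketched with a name-normalization match. This is a deliberately crude illustration with invented company names; real entity resolution would also use fuzzy matching, addresses, identifiers, and the householding he mentions.

```python
import re

def normalize(name):
    """Crude normalization step of a hypothetical ETL pipeline:
    lowercase, strip punctuation and common corporate suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\b(inc|llc|corp|co)\b", "", name).strip()

def resolve_entity(raw_name, canonical):
    """Match an incoming record to a known entity by normalized name,
    returning the entity id or None if no match is found."""
    return canonical.get(normalize(raw_name))

# Hypothetical canonical entity table, keyed by normalized name.
canonical = {normalize(name): eid for eid, name in
             [(1, "Acme Corp."), (2, "Globex, Inc."), (3, "Initech LLC")]}

match = resolve_entity("ACME Corp", canonical)  # resolves to entity 1
```

The point of normalizing both sides to the same key is that "ACME Corp", "Acme Corp.", and "acme corp" all collapse to one entity — the "figure out what it is talking about" problem at its smallest scale.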
Once I have all that, I have to make it usable. Whenever we present new data to the market, we find it takes time for people to adapt to it. Even when they understand the concept of the new information, they have to internalize it. They have to make it consumable for them and figure out how they are going to drive it into the business processes they use. Whether it is an application we are building or an XML API we are exposing to them, we have to follow that chain. It always seems that the data is bigger than your ability to handle it. I guess that is why there is so much innovation on the technology side.
SM: Are you using machine learning algorithms in any of your work?
JS: There are different forms of machine learning. I have seen some very high-end work that we have done in the intelligence community and other areas in the past. I used to be on the board of a company called Fetch, and we were doing some interesting things there. In the machine learning that we are doing, there are a lot of iterative things around the natural language processing and the analytical lookalike model tailoring we do. We are definitely applying some of the concepts of machine learning. It is such a broad field, and it is really exciting to think about what we are going to be able to do by taking the people out of the loop of the iterations. I think it is going to accelerate that learning cycle.
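The lookalike modeling he mentions can be illustrated at its simplest: score each candidate by its similarity to a set of seed records. The feature vectors and similarity measure below are assumptions for the sketch, not his firm's method; a production version would iterate and retrain as results come back, which is the human-out-of-the-loop cycle he alludes to.

```python
def lookalike_scores(seeds, candidates):
    """Tiny sketch of a lookalike model: score each candidate by its
    average cosine similarity to a set of seed feature vectors."""
    def similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
        return dot / norm if norm else 0.0
    return [sum(similarity(seed, cand) for seed in seeds) / len(seeds)
            for cand in candidates]

# Hypothetical 3-feature profiles of known good customers (seeds).
seeds = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.0]]
candidates = [[1.0, 1.0, 0.0],   # resembles the seeds
              [0.0, 0.1, 1.0]]   # does not
scores = lookalike_scores(seeds, candidates)
```

Ranking candidates by these scores is the whole trick: prospects that "look like" existing good records float to the top.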