Sramana Mitra: Imagine you are a young entrepreneur today starting a company. Where do you see some open problems that you think are worth working on?
Josh Rogers: There is a debate happening on what Hadoop is. I would suggest that people look at it as an operating system. Think about the set of services that this operating system offers you, and think about the applications you can build on top of it that allow people to get significant value out of it quickly.
SM: So you are suggesting that people should look at the kinds of use cases you were describing, build applications around them, and sell those applications as opposed to an integration approach?
JR: Any sort of application that requires the ability to process huge amounts of data that historically would have been either too expensive or not technically feasible because of the scale requirement.
SM: What form does that take, from your point of view? You are working on a horizontal infrastructure layer. Is that correct?
JR: That is correct. We are focused on delivering the best-in-class ETL application that we run on top of Hadoop.
SM: If an entrepreneur wanted to build something like that and had deep domain knowledge about a particular industry problem, could they come to you and take advantage of that technology?
JR: Absolutely. There are a number of organizations that are looking to build applications on top of Hadoop, and one of the things they need is the ability to move data into the environment and to provide users an easy way to construct the data flows and data integration tasks that are required in the application. We can provide that.
SM: So you are looking for companies that are building on top of Hadoop to partner with in that case?
JR: That is correct. We are focused on the Hadoop space.
SM: Is there anything you would like to add?
JR: I would like to give you more history on sort and how it ties into our unique value proposition. You had indicated that your readers are fairly technical, and there are some interesting technical components to that. If you look at the MapReduce framework, technically speaking, it takes a problem or a question and allows you to do the data processing associated with answering that question in a distributed fashion. It breaks that problem down into two steps: the map step, which breaks the problem down into smaller problems to compute partial answers, and the reduce step, which brings all of them together, hence the name MapReduce.
If you look at those two phases in the process, there is a sort that happens on the map side, and there is what they call the reduce merge, where you bring the blocks together, which is effectively a big sort. What you have here is a compute framework that allows you to tackle much bigger questions, but it is very sort intensive.
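To make the two sorts concrete, here is a minimal single-process sketch in Python of a word count that follows the same shape: each map task sorts its own output, and the reduce side merges those sorted runs before aggregating. This is only an illustration of the idea; the function names (`map_task`, `reduce_merge`, `reduce_task`, `word_count`) are assumptions made for this example and are not Hadoop or Syncsort APIs.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def map_task(lines):
    """Emit (word, 1) pairs for one input split, sorted by key (the map-side sort)."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    return sorted(pairs, key=itemgetter(0))

def reduce_merge(sorted_runs):
    """Merge the already-sorted map outputs into one sorted stream (the reduce merge)."""
    return heapq.merge(*sorted_runs, key=itemgetter(0))

def reduce_task(merged_pairs):
    """Sum the counts per key; this relies on equal keys being adjacent after the merge."""
    return {key: sum(count for _, count in group)
            for key, group in groupby(merged_pairs, key=itemgetter(0))}

def word_count(splits):
    """Run one map task per split, then merge and reduce the sorted runs."""
    runs = [map_task(split) for split in splits]
    return reduce_task(reduce_merge(runs))

if __name__ == "__main__":
    # Two hypothetical input splits, each a list of text lines.
    print(word_count([["big data big sort"], ["sort data"]]))
```

Every record passes through a sort on the map side and a merge on the reduce side, which is why the framework as a whole is so sort intensive.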
What Syncsort has done is take steps to increase the functionality of the sort within the MapReduce framework by introducing APIs at the map sort and the reduce merge steps that allow you to call out to an external sort engine. You can do that with any sort engine. So there is a benefit to the developer, who can now call out to a high-performing sort, and it also opens up new use cases. There are approaches to manipulating data where you may not want to go all the way through the MapReduce framework; you may want to leverage things like hash joins or hash aggregations. The open source community makes that easier. The second thing is that it does not just enable customers to leverage a high-performance sort; it actually allows us or our customers to plug a fully featured ETL tool directly into their Hadoop environments.
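As a rough illustration of those two ideas, the Python sketch below shows an aggregation whose sort step is a replaceable parameter, next to a hash aggregation that skips the sort entirely. It is a toy model under stated assumptions: `sort_aggregate`, `hash_aggregate`, `external_sorter` and the `sorter` parameter are hypothetical names invented for this example, and they do not represent the actual Hadoop plug-in interfaces or Syncsort's engine.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def external_sorter(pairs, key):
    """Hypothetical stand-in for an external sort engine with the same contract as sorted()."""
    items = list(pairs)
    items.sort(key=key)
    return items

def sort_aggregate(pairs, sorter=sorted):
    """Sort-based aggregation; `sorter` is the assumed plug-in point for another sort engine."""
    ordered = sorter(pairs, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

def hash_aggregate(pairs):
    """Hash-based aggregation: accumulate totals per key without sorting the input."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

if __name__ == "__main__":
    sales = [("east", 5), ("west", 3), ("east", 2)]  # hypothetical (region, amount) records
    print(sort_aggregate(sales))                      # built-in sort
    print(sort_aggregate(sales, sorter=external_sorter))  # swapped-in sort engine
    print(hash_aggregate(sales))                      # no sort at all
```

All three calls produce the same totals; what changes is whether a sort runs and which engine performs it, which is the flexibility being described.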
SM: Thank you very much. It’s been very interesting.
JR: You too.
This segment is part 5 in the series: Thought Leaders in Big Data: Interview with Josh Rogers, SVP of Data Integration Business at Syncsort