
Thought Leaders in Big Data: Nenshad Bardoliwalla, VP of Products, Paxata (Part 4)

Posted on Friday, Oct 10th 2014

Nenshad Bardoliwalla: Typically what we find is that once we prepare the data, that data is then pulled by the consuming applications. However, it’s also important to note that we ourselves do persist the data. A very important part of our value proposition is the notion of data governance. We maintain and cache a copy of all the data that flows through our system, which means that we also keep every version of data that is ever loaded into the system, every version of data that is transformed in our system, and every version of data that is exported. From the moment a data element hits our system to the moment it goes out, we keep very strong lineage around the data as it moves through the data preparation life cycle. We maintain a copy, but we also make that data available to the downstream consuming applications.
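(For readers who want a concrete picture of the version-everything-plus-lineage idea described above, here is a minimal Python sketch. It is a hypothetical illustration, not Paxata's implementation; the `LineageStore` class and every name in it are invented.)

```python
import hashlib
import json
import time

class LineageStore:
    """Hypothetical sketch of versioned storage with lineage metadata.

    Every load, transform, and export writes an immutable version and
    records which upstream version(s) it was derived from, so any result
    can be traced back to the raw data it came from.
    """

    def __init__(self):
        self.versions = {}  # version_id -> metadata + payload

    def _put(self, payload, stage, parents):
        version_id = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.versions[version_id] = {
            "stage": stage,        # "loaded" | "transformed" | "exported"
            "parents": parents,    # upstream version ids
            "timestamp": time.time(),
            "payload": payload,
        }
        return version_id

    def load(self, payload):
        return self._put(payload, "loaded", parents=[])

    def transform(self, parent_id, fn):
        new_payload = fn(self.versions[parent_id]["payload"])
        return self._put(new_payload, "transformed", parents=[parent_id])

    def lineage(self, version_id):
        """Walk back from a version to the raw data it came from."""
        chain, todo = [], [version_id]
        while todo:
            vid = todo.pop()
            meta = self.versions[vid]
            chain.append((vid, meta["stage"]))
            todo.extend(meta["parents"])
        return chain

# Usage: load raw rows, cleanse them, then audit where the result came from.
store = LineageStore()
raw = store.load([{"name": " alice "}, {"name": "BOB"}])
clean = store.transform(
    raw, lambda rows: [{"name": r["name"].strip().title()} for r in rows]
)
print(store.lineage(clean))  # [(clean_id, 'transformed'), (raw_id, 'loaded')]
```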

Sramana Mitra: We’re looking at very large volumes of data. You’re cleansing the data, maintaining a copy, and passing on another copy downstream. There is unprocessed data sitting in Hadoop. There are multiple levels of cleansed data sitting at different points, right?

Nenshad Bardoliwalla: That is true. We have two offerings. The first is a multi-tenant cloud offering, where an analyst can purchase the solution and just get going. Honestly, they don’t care about Hadoop or MongoDB. They just want an easy-to-use, interactive, self-service data preparation solution. For our on-premise customers, such as very large financial services institutions that have decided that their next-generation enterprise analytics platform will be built on top of the Hadoop ecosystem, the design pattern we are seeing is that they are landing data in Hadoop. It’s their raw storage. We can then put Paxata on top of their Hadoop infrastructure, because our execution environment is built on Apache Spark, which many of the Hadoop distributions, such as Cloudera’s, now bundle. It’s becoming part of the core data infrastructure.

Then Paxata sits on top of that as an app and serves as a refinery for taking that raw data, turning it into something that’s usable, and then feeding these other applications. You’re correct in one sense that the data will move from source A into Hadoop. The data will then be transformed inside Paxata and landed back in the same Hadoop cluster. But what our customers have told us again and again is that the cost of storage is trivial compared to the value of the governance that comes from having all of the data stored.
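(In PySpark terms, the raw-to-refined round trip described above might look like the following. This is a simplified sketch of the general pattern, not Paxata's pipeline; the paths and column names are invented for illustration.)

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark running on the customer's existing Hadoop cluster (e.g. via YARN).
spark = SparkSession.builder.appName("refinery-sketch").getOrCreate()

# 1. Raw data has already been landed in HDFS by upstream systems.
raw = spark.read.json("hdfs:///landing/customers/raw/")  # hypothetical path

# 2. "Refine" it: the kind of cleansing a data-prep tool automates.
refined = (
    raw.dropDuplicates(["customer_id"])                  # hypothetical column
       .withColumn("name", F.initcap(F.trim(F.col("name"))))
       .filter(F.col("customer_id").isNotNull())
)

# 3. Land the refined version back in the same cluster, alongside the raw
#    copy, so both remain available for governance and downstream apps.
refined.write.mode("overwrite").parquet("hdfs:///refined/customers/v1/")
```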

Sramana Mitra: I don’t think there is any alternative. If you take in raw data, you have to process it and put it somewhere. There’s no choice, really. It’s not that I’m criticizing you. I’m just trying to understand how it works.

This segment is part 4 in the series : Thought Leaders in Big Data: Nenshad Bardoliwalla, VP of Products, Paxata
