Sramana Mitra: We have been talking to many people in this space. One thing is very clear – all the real opportunities to differentiate in terms of applications are in the layer where actual heuristics and business logic can be created. That presupposes that you have metadata and rules based on metadata. You have to have a domain-specific understanding of whatever problem you are trying to solve. In your case you are dealing with file sharing and “who has access to what.” In the case of Netflix, some of it has to do with recommendation engines for films. The metadata they collect is all related to film. It is very domain-specific logic based on how you organize your data, and the heuristics of what you do with that data. That is where we are seeing most of the new stuff. The infrastructure layer already exists. The interesting stuff is all happening on the domain-specific layer. Is that in tune with your observations?
DG: On the human-generated side, most of the metadata doesn’t exist. One of the most important pieces of our technology is our filter technology. For most organizations, the audit trail is not just sitting there to be collected. It is actually surprising to most people that there is no record of who is accessing what file. There are native auditing functionalities on all those platforms, but nobody turns them on. One example is a native auditing function. Very rarely is that enabled. The client didn’t have file system auditing initially. Also it has not enabled that function. AIX auditing is very seldom enabled.
The audit trail isn’t necessarily there for a lot of people. Let’s say it is. Then there is another problem. A lot of organizations are starting to build out Hadoop and different clusters, but they are still much more familiar with vanilla hardware. When you are dealing with terabytes of data, if you start collecting audit activities, metadata can very quickly start to dwarf the data itself. That is an important point. How do we normalize this stuff and create an intelligent data structure to really represent massive amounts of metadata at scale and get meaningful value out of it on this vanilla hardware? It is not that easy to get and harness this metadata on this human-generated data. There are a lot of different applications where it is more available and it is a bit more straightforward. But on the human-generated stuff it is a little bit more of a new thing.
SM: What you are trying to say is that, the way this data is stored and organized, it is not easy to query because the metadata has not been put in place.
DG: That is exactly right. Let’s leave the audit trail aside. Let’s look at querying the metadata of access controls. The access control list is a group or a user, usually a group, that has permissions on a directory. To understand what that means, you have to research whatever that group is referring to, wherever that is stored, etc. Sometimes that group has subgroups in addition to its members. There are a ton of functional relationships out there, and on the file system itself, when you query a directory, all you get is the group. You are not really getting that metadata in any useful format. A lot of the work is in taking that metadata out of the file systems and the directory services and putting it together in a way that makes the relationships easy to map.
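As a rough illustration of the point above, here is a minimal Python sketch of joining file-system ACLs with directory-service group membership, including nested groups. The folder paths, group names, user names, and function names are all hypothetical, and this is not a description of the product discussed in the interview.

```python
# Hypothetical directory-service data: group -> direct members.
# A member that is itself a key in this mapping is a nested group.
GROUP_MEMBERS = {
    "finance": {"alice", "bob", "finance-managers"},
    "finance-managers": {"carol"},
    "engineering": {"dave", "alice"},
}

# Hypothetical file-system data: folder -> groups listed on its ACL.
# This is all a directory query returns: group names, nothing more.
FOLDER_ACL = {
    "/shares/budget": {"finance"},
    "/shares/source": {"engineering"},
    "/shares/payroll": {"finance-managers"},
}


def expand_group(group, members=GROUP_MEMBERS, _seen=None):
    """Return every user in a group, following nested groups.

    The _seen set guards against groups that contain each other.
    """
    seen = set() if _seen is None else _seen
    if group in seen:
        return set()
    seen.add(group)
    users = set()
    for member in members.get(group, ()):
        if member in members:  # nested group: recurse into it
            users |= expand_group(member, members, seen)
        else:  # plain user
            users.add(member)
    return users


def effective_access(acl=FOLDER_ACL, members=GROUP_MEMBERS):
    """Map each folder to the set of users who can actually reach it."""
    result = {}
    for folder, groups in acl.items():
        users = set()
        for group in groups:
            users |= expand_group(group, members)
        result[folder] = users
    return result


if __name__ == "__main__":
    for folder, users in sorted(effective_access().items()):
        print(folder, sorted(users))
```

The ACL alone says only that "finance" can read /shares/budget; it takes the join with the group data to see that carol also has access through the nested group.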
Suppose you tried to do this with plain string comparisons. What we see is that each terabyte of data has 50,000 folders, 2,500 of which have unique permissions. On each of those folders, there are three to five Active Directory groups, and in each of those groups there are 15 to 20 members. On one terabyte of data you have around 150,000 functional relationships – just talking about access control. What if I remove this person from this group, what are they going to lose access to? To do that with strings is not going to work. You have to distill that to do it on normal computer infrastructure. There is a ton of work to extract this metadata out of these systems and put it in a usable framework that allows you to query it and make sense out of it.
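The arithmetic and the "what if I remove this person" question can be sketched as follows. This is only an illustration under stated assumptions: names are mapped to small integer IDs as one possible way to "distill" the strings, groups are treated as flat (no nesting), and every name and function here is hypothetical rather than the speaker's actual design.

```python
# Figures quoted in the interview, per terabyte of data.
FOLDERS_WITH_UNIQUE_PERMISSIONS = 2_500
GROUPS_PER_FOLDER = (3, 5)
MEMBERS_PER_GROUP = (15, 20)


def relationship_range():
    """Low and high bounds on folder/group/member relationships."""
    low = FOLDERS_WITH_UNIQUE_PERMISSIONS * GROUPS_PER_FOLDER[0] * MEMBERS_PER_GROUP[0]
    high = FOLDERS_WITH_UNIQUE_PERMISSIONS * GROUPS_PER_FOLDER[1] * MEMBERS_PER_GROUP[1]
    return low, high


class Interner:
    """Assign each distinct name a small integer ID (assumed technique)."""

    def __init__(self):
        self._ids = {}
        self.names = []

    def id(self, name):
        if name not in self._ids:
            self._ids[name] = len(self.names)
            self.names.append(name)
        return self._ids[name]


def folders_for(user_id, group_members, folder_groups):
    """Folders the user can reach through at least one group."""
    return {
        folder
        for folder, groups in folder_groups.items()
        if any(user_id in group_members.get(g, frozenset()) for g in groups)
    }


def access_lost(user_id, group_id, group_members, folder_groups):
    """Folders the user would lose if removed from one group."""
    before = folders_for(user_id, group_members, folder_groups)
    trimmed = dict(group_members)
    trimmed[group_id] = group_members.get(group_id, frozenset()) - {user_id}
    after = folders_for(user_id, trimmed, folder_groups)
    return before - after


# Tiny hypothetical data set, stored as integer IDs rather than strings.
ids = Interner()
alice, bob = ids.id("alice"), ids.id("bob")
finance, audit = ids.id("finance"), ids.id("audit")
budget, payroll = ids.id("/shares/budget"), ids.id("/shares/payroll")

group_members = {finance: frozenset({alice, bob}), audit: frozenset({alice})}
folder_groups = {budget: {finance, audit}, payroll: {finance}}

if __name__ == "__main__":
    print(relationship_range())  # (112500, 250000)
    lost = access_lost(alice, finance, group_members, folder_groups)
    print([ids.names[f] for f in lost])  # ['/shares/payroll']
```

The quoted figure of around 150,000 relationships sits inside the 112,500 to 250,000 range these bounds give. In the example, alice keeps /shares/budget after leaving "finance" because she still reaches it through "audit", which is the kind of answer that is hard to get from raw ACL strings alone.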