Sramana Mitra: Give us some information on what kinds of metrics those are.
Michael Wu: If you send a message, that message can be a blog post, an idea for people to vote on, etc. So each message has a different context; not all messages are of the same type. A person could create 100 ideas but never post a single blog post. Another user may write 50 blog posts but create only three ideas. Those are different behaviors that give you an idea of what people like to do. Some people like to share ideas and express themselves. Some people like to answer questions. They post a lot of answers, but they don’t participate much in the ideas part or the blogging part.
We also track how people post messages. Is it through mobile devices, tablets or desktops, for example? You can get a sense of what these people like to do and how they do it. If our clients believe it is helpful for them to have users answer a lot of questions, or maybe one client says “I believe that it is really helpful for people to create videos. I want to reward people who upload videos,” they can use gamification to drive that behavior. There are 400 different actions a user can take. Each of those actions is tracked and is attributable to that user. Typically, we log the user who took the action, where in the community he took it, when he took it, and how he took it, whether through a device, through syndication, etc. We also integrate with Facebook. That gives us an idea of how users engage with our client.
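As a rough illustration of the who/where/when/how logging MW describes, the sketch below models one tracked action as a small record and counts a user's actions by type. The field names, action names, and the `action_profile` helper are all hypothetical; the interview does not describe the actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActionEvent:
    """One tracked user action (hypothetical schema, not the vendor's)."""
    user_id: str         # who took the action
    action: str          # which action, e.g. "idea_created"
    location: str        # where in the community it happened
    timestamp: datetime  # when it happened
    channel: str         # how: "mobile", "tablet", "desktop", ...


def action_profile(events, user_id):
    """Count one user's actions by type to show what they like to do."""
    return Counter(e.action for e in events if e.user_id == user_id)


# Arbitrary sample data, for illustration only.
when = datetime(2000, 1, 1, tzinfo=timezone.utc)
events = [
    ActionEvent("alice", "idea_created", "ideas/board-1", when, "mobile"),
    ActionEvent("alice", "idea_created", "ideas/board-1", when, "desktop"),
    ActionEvent("bob", "blog_posted", "blogs/main", when, "tablet"),
]
profile = action_profile(events, "alice")
```

A profile like this is the kind of signal a client could use to decide which behaviors to reward through gamification.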
SM: And you put these communities on the URL of your client?
MW: Yes. It is on their own domain. If it is AT&T, for example, it would be community.att.com or www.att.com/community. It is under the client’s domain.
SM: What do you do on your end on the data side? What kind of big data infrastructure do you use to capture and analyze all this?
MW: The events and logs are all captured and stored in HDFS – basically Hadoop. Typically, we write scripts that do some pre-aggregation. Then we store those aggregates in a NoSQL solution. Those NoSQL solutions power the reporting engine that eventually serves this aggregate data to our clients. I don’t think anyone has the capacity to look at big data in its raw form. It needs to be aggregated in some way. Hadoop is actually very slow in that respect – the responsiveness is not very high. When you write a query, it takes several minutes to run. That is not very fast and not good for real-time querying. That is why we use pre-aggregation steps to pre-process data, and then we store the results in a NoSQL solution.
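The pre-aggregate-then-serve pattern MW describes can be sketched in a few lines. This is a minimal illustration under assumptions, not the actual pipeline: an in-memory list stands in for raw logs in HDFS, a plain dictionary stands in for the NoSQL store, and the names `pre_aggregate` and `AggregateStore` are hypothetical.

```python
from collections import defaultdict


def pre_aggregate(raw_events):
    """Batch step: roll raw (user_id, action, day) events up into counts.

    In the setup described, this would run as a scheduled job over the
    raw logs; here it is an ordinary loop over an in-memory list.
    """
    totals = defaultdict(int)
    for user_id, action, day in raw_events:
        totals[(user_id, action, day)] += 1
    return dict(totals)


class AggregateStore:
    """Stand-in for a key-value NoSQL store holding the aggregates."""

    def __init__(self):
        self._data = {}

    def put_all(self, aggregates):
        self._data.update(aggregates)

    def get(self, user_id, action, day):
        # A single key lookup: no scan over raw events at query time.
        return self._data.get((user_id, action, day), 0)


# Arbitrary sample data, for illustration only.
raw_events = [
    ("alice", "answer_posted", "day-1"),
    ("alice", "answer_posted", "day-1"),
    ("bob", "video_uploaded", "day-1"),
    ("alice", "answer_posted", "day-2"),
]
store = AggregateStore()
store.put_all(pre_aggregate(raw_events))
alice_day1 = store.get("alice", "answer_posted", "day-1")
```

The design choice is to pay the cost of scanning the raw data once, in batch, so that each reporting query becomes a key lookup instead of a multi-minute job.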