Friday, 25 May 2012

What data science can’t do

While in London recently, I attended some of the Big Data Week events. One big draw was a Community Meetup featuring a panel of Big Data celebs. One question that came up: “How important is business knowledge?” To make a long story short, the panel did not band together and rush to the defense of business knowledge. Therein lies the difference between the philosophy of data mining, which was created to empower the business person, and the emerging culture of data science.

As nearly as I can recall her words, bit.ly’s Hilary Mason said that smart people could solve any problem. She did use the words “smart people,” and that was a real thorn in my side. Hilary is a force in the analytics community and she is much to be admired, yet my experience leaves me at odds with her on this one. It’s not that I doubt that people can adapt to new business situations or unfamiliar issues. On that point, I’m a believer. But it shouldn’t be done in a vacuum.

What’s wrong with letting an analyst dive into a problem without business knowledge? For one thing, it’s inefficient. The patterns that the analyst finds may not be meaningful – like predicting an event based on factors that only happen afterward. Or making assumptions that aren’t reasonable for the situation. And then, there is reinventing the wheel. Just recently I heard about a fabulous new analytic technology, complete with a group of fans, and I was dying to see it in action. After a huge buildup, I finally got to see the stuff in action, and my heart fell. It was nearly identical to something I worked with in the nineties that was a big flop with clients. The developers clearly hadn’t researched the history of their market.

It’s not realistic to expect that every project must be tackled by an analyst who is an expert in the business behind it. Maybe that wouldn’t even be desirable. But if the analyst doesn’t know the business, then working closely with someone who does is a must.


Graph Databases and Analytics

While in London, I attended a talk by Nicki Watt and Michal Bachman, two Open Credo software developers who shared their experience building a recommendation engine based on the Neo4J graph database. I asked why they had chosen that particular platform, and got a simple answer – they didn’t, the decision had been made before they came on to the project. But they did explain some useful things about what graph databases are and what they do well, not to mention what they don’t do so well.

Graph databases are said to be “schema-less”. They don’t have the relatively rigid structure that we expect in relational databases. Instead, they can store a wide variety of information, from numbers to video and more, organized in a relatively flexible structure described by a changeable graph. Neo4J is only one of many schema-less databases; others that you may encounter include MongoDB (a document store rather than a graph database), AllegroGraph and FlockDB. The advantages of the graph structure include rapid creation of and changes to a database, and excellent performance for many routine operations.

What graph databases aren’t made for is analytics. They don’t lend themselves to operations that might require aggregating large quantities of data, or random sampling, or classical statistical analysis. Analytics can easily bog graph databases down to a standstill.

There are practical situations where you can work around these limitations and end up with good results. So, for example, when making a recommendation, the trick might be to use a relatively small number of easily accessible cases and choose the best among them. Think of how people find partners – getting to know the people in the vicinity and evaluating them as potential partners, rather than traveling far and wide in search of an optimum mate. Another strategy includes serving an old result while waiting for a new one to be calculated, so the user never experiences a long wait for response. Graph databases perform well for transactional applications and those where a quick analysis of a modest number of similar cases fills the bill.
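
To make the "small number of easily accessible cases" idea concrete, here is a minimal sketch in plain Python. It is not the Open Credo implementation and it does not use Neo4J; the names (likes, friends, recommend) and the toy data are assumptions made up for illustration. The point is only that the function visits a person's direct neighbours and never scans the whole graph.

```python
from collections import Counter

# Hypothetical toy data: each person maps to the set of items they liked.
likes = {
    "ann": {"jazz", "blues", "folk"},
    "bob": {"jazz", "blues", "rock"},
    "cat": {"jazz", "rock", "soul"},
    "dan": {"opera"},
}

# Hypothetical friendship edges (undirected, stored in both directions).
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann"},
    "cat": {"ann"},
    "dan": set(),
}


def recommend(person, limit=3):
    """Suggest items liked by direct neighbours that `person` has not liked yet.

    Only the immediate neighbourhood is visited, so the work grows with the
    number of friends rather than with the size of the whole graph.
    """
    already = likes.get(person, set())
    votes = Counter()
    for friend in friends.get(person, set()):
        for item in likes.get(friend, set()) - already:
            votes[item] += 1
    # Sort by vote count (descending), then by name so ties are stable.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:limit]]
```

Calling recommend("ann") looks only at Bob and Cat, which is the "people in the vicinity" strategy in miniature. A real system would likely add the second strategy as well, returning the last computed list while a fresh one is worked out.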

So what about classical statistical analysis, data mining and exploration? What about operations research? My take is that we will still do best to keep as much of that work away from transactional systems as possible, and that planning to create and maintain a relational database for analyst use should be part of the process when architecting new applications.


Planning for ROI in Text Analytics

New article on Smart Data Collective now, Planning for ROI in Text Analytics.


Text analytics glossary

I’ve been working on a glossary of text analytics terms, and I still have some work to do on it. Let me know if there are particular terms you’d like to see included.

In the meanwhile, here’s one from the folks at Clarabridge.


Sentiment Map Rendering

ESRI is the 800-pound gorilla in the world of maps. Back when I worked for an equally weighty analytics software vendor, I was deeply jealous that a competitor had a partnership with ESRI, while my company partnered with – how shall I put this – a lesser mapping vendor. Our maps stunk, while ESRI’s maps were functional and cool.

Now, through my text analytics work with LinguaSys, I have the good fortune to partner with ESRI to bring text analytics and mapping together. ESRI’s Mansour Raad has been creating functional and cool geographic visualizations of sentiment, and he’s bringing them to the world. Next week he’ll be speaking at ESRI Developer Summit, inspiring fellow developers with a new example that we created – a visual study of sentiment toward the US Transportation Security Administration (TSA).

So often sentiment analysis ends up as a simple pie or bar chart – what percent like my brand, what percent hate it? That type of analysis doesn’t give the client any basis for action. Instead, imagine putting open-ended comments – from surveys, social media, service requests, and other sources – on a map. And enhancing the data with analytics – not limited to simple positive/negative, but with subtleties such as whether the writer is using indicators of disapproval or emotion, obscenities or even references to Nazism. Imagine using color, interactive behavior and other indicators to visualize and identify meaningful patterns in the data. It’s pretty, but it’s more than pretty. Done right, this kind of visualization enables decision makers to derive actionable information.

Read more about Mansour Raad’s work in his post, Enter the Fifth Dimension; Sentiment Map Rendering


You’re not analyzing unstructured data

In a recent post called Why Nobody Is Actually Analyzing Unstructured Data, Bill Franks explained an important point. Statisticians, data miners, analysts of any quantitative stripe, don’t analyze unstructured data. Oh, we may do text analytics, video or audio analytics, but when we do, there is always a step that converts the unstructured text, video or audio to some sort of structured data first. Once that’s done, we can use our ordinary analytics toolkit on the newborn structured data.

It’s worth reading the article. Bill makes his point with a terrific example of how fingerprint data gets the structure needed for comparisons.

My working definition of text analytics is now “the process of converting text into some simpler form of data, such as a category or score.” It’s true that there are usually many more steps needed to produce an actionable analysis, but once the text is simplified, the rest isn’t so much text analytics as it is just analytics.
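
Here is about the simplest possible illustration of that definition, sketched in Python. The word lists and function names are invented for the example, and a real text analytics product does far more than count words, but the shape is the same: text goes in, a score or category comes out, and everything after that is ordinary analytics.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"good", "great", "helpful", "fast"}
NEGATIVE = {"bad", "slow", "rude", "broken"}


def score_text(text):
    """Return (positive word count) minus (negative word count) for a comment."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def categorize(text):
    """Map the numeric score to a coarse category."""
    score = score_text(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Once every comment carries a number or a label like this, it can be aggregated, charted or fed into a model just like any other structured field.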


More analytics than they asked for

A prospective client came to me, very unhappy. A large investment in software had failed to enable the company to obtain the information they really wanted. They gave me a list of a dozen or so metrics, and told me that if I could prove my product could produce those numbers, they would buy.

The request was reasonable. The metrics were straightforward, not particularly hard to calculate. They were conversion rates – how many first time customers return for a second purchase, how many return a third time, and so on. Frankly, the product they already had was adequate to do the job. The problem was more a training failure than a software failure.
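
For readers who want to see how little is involved, here is a rough sketch of that kind of calculation in Python. The purchase log and the function name are hypothetical, and the client’s actual metrics may have been defined differently (by time window, for instance); this version simply asks what share of customers who made purchase n went on to make purchase n + 1.

```python
from collections import Counter

# Hypothetical purchase log: one customer id per purchase.
purchases = ["c1", "c2", "c1", "c3", "c1", "c2", "c4"]


def repeat_rates(purchase_log, max_n=3):
    """Share of customers with at least n purchases who also made purchase n + 1.

    Returns a dict keyed by the purchase number reached (2, 3, ... max_n).
    """
    counts = Counter(purchase_log)
    rates = {}
    for n in range(1, max_n):
        reached_n = sum(1 for c in counts.values() if c >= n)
        reached_next = sum(1 for c in counts.values() if c >= n + 1)
        if reached_n:  # skip the step if nobody got this far
            rates[n + 1] = reached_next / reached_n
    return rates
```

In the toy log, four customers bought once, two of them came back a second time, and one of those two came back a third time, so both conversion rates are one half.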

But there was another, more important, thing missing. The metrics were interesting, but not actionable. The customer was requesting a report, and I was in the predictive analytics business. I could and would provide the report, but I knew that I could do better for them. Doing better for the client called for introducing the concept of predictive analytics and showing proof that it could be done with the data available. And this time, the client’s staff would have to be properly trained.

The client didn’t seem to be doing a very good job of getting the basic information that was supposed to be included in their loyalty program registration. Wondering if it mattered, I took the data they had, and tested a simple model to see if the registration data was predictive of spending. It was, and the connection was dramatic – this was the proof I would need to make my case. The registration data would tell a lot about future spending – so it could be used to identify likely big spenders and use that information for marketing. Knowing that the information was valuable, the client also learned that it was worthwhile to make an effort to get that data consistently. And the model identified key factors associated with spending – factors that could be used for customer acquisition planning to increase profitability.

The customer liked what they saw. They got training and took it seriously. And now, even while times are tough and other retailers are contracting, that client is growing.

Do you have stories of doing right by a client by doing more than they asked for? Please share.


Responses to Big Data Blasphemy

Here are a couple of interesting comments that have come up for my recent article on Smart Data Collective: “Big Data Blasphemy: Why Sample?”

From Simon Geletta, Associate professor at Des Moines University, commenting in the SAS Analytics & BI group on LinkedIn:

“That (everything being equal) bigger samples result in narrower confidence intervals is not a matter of opinion… The argument for sampling as presented in this blog, would benefit (become worth following) if the blogger can demonstrate that estimations that are based on sample (from a bad sampling frame) yield better results as compared to estimations that are based on the bad sampling frame itself.”

My thoughts:

Yes, it is a matter of fact that bigger samples result in narrower confidence intervals.

The case for sampling does not depend on samples from a data resource producing better estimates than using all of the data available. There are cases in which the sample data can be more carefully inspected, corrected and otherwise cleaned, and in those cases estimates may, indeed, be better than those which would be made using a larger, dirtier, data resource. However, even when there is no improvement in data quality for a sample, there is always the issue of balancing the resources consumed with the value of the information obtained. The question is not whether the estimate obtained using a sample is better than the estimate that would be obtained by using all available data. Rather, it is a matter of obtaining “good-enough” information to address the business problem at hand, and of doing so in a manner that does not waste resources which could yield greater returns if used in some other way.
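
A little arithmetic shows why “good-enough” is a sensible target. The sketch below assumes simple random sampling and the usual normal approximation for a mean; the function names and the figures are mine, chosen for illustration. It says nothing about a bad sampling frame, which no sample size can repair.

```python
import math

Z_95 = 1.96  # approximate two-sided 95% normal critical value


def half_width(std_dev, n):
    """Half-width of a normal-approximation 95% confidence interval for a mean."""
    return Z_95 * std_dev / math.sqrt(n)


def sample_size_for(std_dev, target_half_width):
    """Smallest n whose interval half-width is at most the target."""
    return math.ceil((Z_95 * std_dev / target_half_width) ** 2)
```

Because the width shrinks with the square root of n, quadrupling the sample only halves the interval. With a standard deviation of 10, roughly 385 cases already give a half-width of about 1, and every further halving costs four times as much data.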

From Blaise Egan, Lead Network Infrastructure Analyst at British Telecommunications PLC, commenting in the Predictive Analytics group on LinkedIn:

“This reflects my own experience encountering data miners with a background in computer science rather than statistical science. To some extent they have been hoodwinked by misleading sales material from vendors of large-scale computing systems, both hardware and software.

It’s an important message that you’re putting out.”


Big Data Blasphemy: Why Sample?

New post on Smart Data Collective today, “Big Data Blasphemy: Why Sample?”


Where will data scientists come from?

An EMC industry survey on data science indicates that 34% of respondents believe the best new source of data science talent is students studying computer science, and 12% believe the best source is today’s business intelligence professionals. Think about that – nearly half of the respondents feel that the best source for new experts in data science is people who typically have no exposure to statistical analysis.

Who are these respondents? The methodology section describes them as “497 data scientists and business intelligence professionals from around the world… pre-screened for information technology decision making authority”.

Pre-screened for IT decision-making authority? Since when are IT decision makers experts in drawing meaningful insight from data?

How about the other half of respondents? Most expected the best source of new talent to be either students (24%) or current professionals (27%) in fields other than computer science. No details on which other fields.