Wednesday, 22 of February of 2017

Text Analytics Summit West 2013

Just back from Text Analytics Summit West 2013!

Heard some terrific talks.

One of the things I particularly liked was Mark Eduljee’s concise set of seven principles for useful analysis. I’ll be writing about the details soon!

Mingzhu Lu mentioned embarrassingly parallel computing, another topic begging for more explanation – maybe I should do a distributed computing piece similar to my Bluffer’s Guide to NoSQL Databases.

Janine Johnson led a GATE workshop. This gave participants the opportunity to see GATE (a text analytics tool for developers) in action and get a good sense of how developers can work with it. Some of the crowd installed GATE and tried it hands-on. The rest of us watched as Janine demonstrated – and I, for one, saw more of what the tool could do in the two-hour workshop than I probably could have worked out on my own in two days. It probably would have taken me more than two hours just to install it and get it running!

New book coming out

Returning from my summer away from blogging, with news.

IBM SPSS Modeler Cookbook, my new book with coauthors Keith McCormick, Dean Abbott, Tom Khabaza and Scott Mutchler, is expected out in November from Packt Publishing.

This is a how-to book for those who have a little experience and want to sharpen their skills.

Seminar: Powering Predictive Analytics With Big-Data

My upcoming seminar:
Powering Predictive Analytics With Big-Data
February 27


Today marks one year since this blog began. In the first year, I published 100 posts (this is number 101) and had some rather remarkable responses, most of which did not appear as comments here. To my amazement, total strangers now walk up to me at conferences to discuss things I’ve written. One post, a little story about an old boss of mine, had an amazing spike in visitors – it turned out that the famed Avinash Kaushik had tweeted the link. A post suggesting that the Strata conference’s claims of commitment to diversity are a load of crap was picked up by a video blog devoted to analytics, which named me one of its Top Data Women of the Week. Go figure.

O’Reilly Strata: Deluded About Diversity?

O’Reilly Conferences’ Strata 2012 New York City agenda features women in fifteen speaking slots, up from 10 last year. That’s a 50% increase! What a commitment to diversity! O’Reilly really takes inclusion seriously, don’t they? Well, no, they don’t.

Have a look at these bold statements of commitment to diversity, excerpted from a diversity statement posted on the Strata website and from some of its communications:

O’Reilly Media believes in spreading the knowledge of innovators. We believe that innovation is enhanced by a variety of perspectives, and our goal is to create an inclusive, respectful conference environment that invites participation from people of all races, ethnicities, genders, ages, abilities, religions, and sexual orientation.

We’re actively seeking to increase the diversity of our attendees, speakers, and sponsors through our calls for proposals, other open submission processes, and through dialogue with the larger communities we serve.

Sounds good. It’s got a great beat, you can dance to it! But the proof is in the pudding. What’s the result? It’s challenging to investigate some aspects of inclusion – it hardly seems appropriate to inquire about a speaker’s religion or sexual orientation, for instance. But gender is a fairly public matter, and there is data to support the discussion, so let’s have at it with that in mind.

Of all the STEM (Science, Technology, Engineering and Mathematics) fields, computing has the dubious distinction of being the only area where participation by women has been dropping consistently over the long term. We’ve lost a lot of ground since 1987, when 42% of American software developers were women. Today, 25% of the computing workforce is female.

Still, women in computing are no rarity. When you walk into a tech workplace or conference, expect that about 1 in every 4 people you meet will be female. If not, ask, “Why not?”

At the other end of the spectrum are mathematics and statistics, STEM fields where women have the greatest participation levels. The number of women in those professions equals the number of men. Yes, you read that right – the number of women mathematicians and statisticians equals the number of men.

So, if you walk into a workplace or conference and the focus is on analytics, expect that about 1 in every 2 people you meet will be a woman. If not, ask, “Why not?”

Now let’s say you are headed to a technology conference, with a focus on analytics. What would you expect?

Paul Doscher attended the conference last Spring and reported, “Attendance remains around 90% men, 5% women and 5% unknown.” Another source estimated the last New York City conference attendance at 20% women. Clearly these are not dependable estimates, but the implication is that there were not heaps of women present, certainly nothing like what you’d expect if the attendees were representative of the analytics community as a whole.

We have better information regarding speakers, as they are all listed in the agenda. Using the highly complex techniques of picking out the female speakers and counting, I determined that 15 speaking slots went to women at the upcoming conference. (To be clear, some talks had multiple speakers, and some people are speaking more than once, but I counted every name listing equally. The actual number of women involved is fewer than 15.)

Using the even more sophisticated technique of counting the total and dividing, I determined that women represented about 12% of the speaking slots, and again, some of the women had more than one slot. So the representation of women among the speakers is far less than the proportion of women in the analytics professions, far less even than the proportion of women in computing today.

But wait, there’s more! There are many more speaking slots at this year’s conference – which encompasses Strata and Hadoop World – than last year. The total is about twice as large. The 10 slots taken by women last year represented about 16% of the total. So that’s 16% last year, 12% this year. In other words, the proportion of women speaking at the conference actually dropped.
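For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The slot counts and percentages are the ones quoted above; the totals are not published figures but are back-calculated from those rounded percentages, so treat them as approximations. The variable names are mine.

```python
# Figures quoted in the post: speaking slots held by women, and their share of all slots.
women_slots = {"last year": 10, "this year": 15}
share = {"last year": 0.16, "this year": 0.12}

# ASSUMPTION: totals are back-calculated from rounded percentages, so they are approximate.
implied_total = {year: women_slots[year] / share[year] for year in women_slots}

# Growth in the raw count of slots held by women (the "50% increase").
count_growth = women_slots["this year"] / women_slots["last year"] - 1

# Growth in the overall size of the agenda, and the change in women's share of it.
total_growth = implied_total["this year"] / implied_total["last year"]
share_change = share["this year"] - share["last year"]

print(f"Women's slots grew by {count_growth:.0%}")
print(f"Implied total slots grew about {total_growth:.1f}x")
print(f"Women's share changed by {share_change * 100:+.0f} percentage points")
```

The count goes up while the share goes down, because the agenda grew faster than the number of slots given to women.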

What a remarkable embarrassment for an organization so deeply committed to diversity.

O’Reilly’s diversity statement goes so far as to list a variety of things we can do to help them achieve diversity. Among their suggestions is this tidbit:

…Suggest ways that the onsite conference experience can be more welcoming and supportive, free from intimidation and marginalization (send an email to

Hmmm. That’s suuuuch a temptation.

Now, I should disclose that I proposed a talk for the upcoming conference, and it was rejected (or as they put it, “not accepted.”) No biggie, I’m just one speaker, and perhaps they were unimpressed with my topic, my position, or me. Fair enough. But a couple of days later, a colleague, also female and very knowledgeable, mentioned that she also proposed a talk and was rejected. Her topic sounded like one that would be quite relevant to the Strata audience. And then another highly qualified woman mentioned that she had the same experience.

In data mining, we have a technical term for that. It’s called a “pattern.”

My rejection letter claims that there were nine proposals for every available slot. Taking that as gospel, let me conjecture that the reject pile contains many proposals from qualified women who would like to speak. You can claim any one person or proposal isn’t good enough, but if you exhibit a pattern of rejecting qualified women while women remain seriously underrepresented among speakers, people are going to think you are (gasp) not committed to diversity.

Which, I fear, O’Reilly is not.

Big Data: A Big Trap for Product Development?

Kathleen Morrissey, a Partner at Strategy 2 Market (s2m), will present “Big Data: A Big Trap for Product Development?” at the Chicago Product Management Association on Thursday, August 9, 2012.

What an important topic to tackle! How easy it can be to invest a fortune in a solution in search of a problem. I’ll be in the audience, and I hope some of my data-focused colleagues will also attend and add some life to the discussion.

Two pioneering women programmers

The National Center for Women & Information Technology (NCWIT) is wrapping up its summit today. I’ve attended much of the first two days, and the presentations on research and projects related to women’s opportunities in computing have been some of the best I’ve encountered anywhere. Lots to write about in those presentations! Let me begin with a little story about two remarkable programmers, Lucy Simon Rakov and Patricia Palombo.

Lucy Simon Rakov and Patricia Palombo were the recipients of the NCWIT Pioneer Award, and girl, were they ever pioneers! These two women were programmers for the Mercury space program, the first to send a person into space and home again. They did all of this with about 120 KB of raw computing power! (The next time I hear some would-be Steve Jobs tell me his code is elegant, I’m gonna laugh in his face.)

Mark Guzdial has a nice post on these great women on his Computing Education Blog:

NCWIT Pioneer Awards to two women of Project Mercury: Following their passions

My trans-analytic voyage

New piece in a new publication: My trans-analytic Voyage: Text Analytics on Both Sides of the Atlantic contrasts my observations at analytics conferences in the US and Europe.

Chicago Web Analytics Meetup has a new home

The Chicago Web, Game and Social Media Analytics Meetup has been around for several years and has developed a substantial membership. Now, the group has a new home. Thoughtworks, a global IT consultancy based in Chicago, will host meetings at their headquarters at 200 East Randolph. Last week, I presented “Crossing the Language Chasm: Extracting Information from Foreign-Language Text” for the group at the new location, and it was a pleasure. The space is roomy, comfortable and a great match for this use. The meeting was well attended, and I expect that the new space will help to build attendance.

If you didn’t get to attend the presentation, you can read the original article on Smart Data Collective:

Crossing the Language Chasm: Extracting Information from Foreign-Language Text

Graph Databases and Analytics

While in London, I attended a talk by Nicki Watt and Michal Bachman, two Open Credo software developers who shared their experience building a recommendation engine based on the Neo4J graph database. I asked why they had chosen that particular platform, and got a simple answer: they didn’t. The decision had been made before they came on to the project. But they did explain some useful things about what graph databases are and what they do well, not to mention what they don’t do so well.

Graph databases are said to be “schema-less”. They don’t have the relatively rigid structure that we expect in relational databases. Instead, they can store a wide variety of information, from numbers to video and more, organized in a relatively flexible structure described by a changeable graph. Neo4J is only one of many schema-less databases; others that you may encounter include AllegroGraph and FlockDB, which are also graph databases, and MongoDB, which is a document store rather than a graph database. The advantages of the graph structure include rapid creation of and changes to a database, and excellent performance for many routine operations.

What graph databases aren’t made for is analytics. They don’t lend themselves to operations that might require aggregating large quantities of data, or random sampling, or classical statistical analysis. Analytics can easily bog graph databases down to a standstill.

There are practical situations where you can work around these limitations and end up with good results. So, for example, when making a recommendation, the trick might be to use a relatively small number of easily accessible cases and choose the best among them. Think of how people find partners – getting to know the people in the vicinity and evaluating them as potential partners, rather than traveling far and wide in search of an optimum mate. Another strategy includes serving an old result while waiting for a new one to be calculated, so the user never experiences a long wait for response. Graph databases perform well for transactional applications and those where a quick analysis of a modest number of similar cases fills the bill.

So what about classical statistical analysis, data mining and exploration? What about operations research? My take is that we will still do best to keep as much of that work away from transactional systems as possible, and that planning to create and maintain a relational database for analyst use should be part of the process when architecting new applications.