While there has been no shortage of commentary about what makes data scientists tick (see my brief history of data science), a new in-depth discussion of how they work was recently published by the Harvard Business Review, just in time to inform this debate.
In "Data Scientist: The Sexiest Job of the 21st Century," business and technology consultant Tom Davenport and D.J. Patil, a data scientist in residence at Greylock Partners, tell us about Jonathan Goldman, the data scientist who came up with the "people you may know" feature on LinkedIn: "He began forming theories, testing hunches, and finding patterns that allowed him to predict whose networks a given profile would land in." Based on this and other observations of data scientists, Davenport and Patil generalize about how they work:
"What data scientists do is make discoveries while swimming in data... [their] dominant trait is intense curiosity -- a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field.
"Perhaps it's becoming clear that the word 'scientist' fits this emerging role... their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes."
Davenport and Patil don't provide a concise definition of a data scientist, but for the purpose of this debate, let me offer a working definition: A data scientist is an engineer who employs the scientific method and applies data-discovery tools to find new insights in data. The scientific method -- the formulation of a hypothesis, the testing, the careful design of experiments, the verification by others -- comes from their knowledge of and training in statistics. The application (and tweaking) of tools comes from their engineering or, more specifically, computer science and programming background. The best data scientists are product-and-process innovators and, sometimes, developers of new data-discovery tools.
You need humans to do this kind of work. Data science as a discipline is in its infancy, and as it evolves we'll no doubt see some activities that are done manually today (e.g., data cleansing) automated in the future. But a person -- with the right qualifications -- will always have to be there to tell the machine what to do.
Even more important, making a statement about the future automation of any activity or profession assumes that what we do today will stay the same forever. This is obviously a ridiculous assumption, nowhere more ridiculous than in the domain in question, the application of computer technology. While there was excited talk about how automation would replace software engineers, some members of this soon-to-be-endangered species went on to develop Hadoop and other foundation technologies and tools of the big-data ecosystem. There will always be new challenges, and it will always be humans, not machines, identifying the need and building the solutions.
The need for communication -- i.e., the ability to explain to clueless business executives what the data means -- is another argument sometimes voiced in support of the notion that tools can't replace data scientists. I tend to be a bit more generous toward business executives (having been one in the past), so I would suggest that the discoveries expected of data scientists happen not only when the data scientist is communing with the data.
New insights often emerge when data scientists and business executives (or anyone else with strong domain expertise) discuss and brainstorm what questions to ask, what the results of the analysis actually mean, and what the next iteration should be. This is what lies behind the requirement for people skills or business acumen often included in the basket of skills expected of a data scientist. The ability to communicate the results of the analysis is indeed important, but it can be replaced, at least to some extent, by good data visualization tools. But brainstorming with a machine? I don't think so.
Finally, whenever we discuss humans vs. machines, advocates for machines usually don't fail to mention the human follies and foibles that are obviously absent from inanimate matter. In our context, the famous dictum from Marissa Mayer, then a Google search executive and now Yahoo CEO, comes to mind: "Data is apolitical." At her former employer, Google (and one would assume now at Yahoo), data, not politics, drives all decisions. Really? As Michael Schrage, a research fellow at MIT Sloan School's Center for Digital Business, noted in an HBR blog, "some data are apparently more apolitical than others: the closure of Google Labs, for example, as well as its $12.5 billion purchase of Motorola Mobility are likely not models of data-driven 'best-practice.' "
The bigger point is that data could be "political" because people use it and people have agendas. They can ask the wrong questions, design the wrong tests, and make the wrong interpretations to fit their agendas and what they want the answer to be. The answer to human biases is not to replace humans with machines or data (which anyway could be guided by human biases). The answer is the scientific method. This is how and why science has made progress in the last 350 years, insulating scientific inquiry from human bias.
Which is why the new profession of data scientists promises to inject -- to some extent -- new objectivity into the decision-making process whenever it is possible to conduct an experiment and draw conclusions from the data. But the formulation of the hypotheses, the design of the experiments, and the interpretation of the results will not be done by machines. I'm certain machines will also fail to invent new products and processes.
To what extent do you see automation playing out in the world of big data and data science? Read our Point post, and share your opinions on this debate on the message boards.