Data scientists are professionals who use the most appropriate tools and methodologies to get their jobs done. The best data scientists avail themselves of the complete set of knowledge- and pattern-discovery approaches that involve statistical analysis.
How should we refer to the sum total of data science techniques? Often, they are lumped under the term "advanced analytics." This phrase is deliberately vague in that it is intended as a catch-all for everything from statistical analysis and data mining to predictive modeling, natural language processing, support vector machines, and so on.
In the popular mind, most of this scope is known as "data mining," often with a pejorative spin that focuses on privacy violation and surveillance applications. To my mind, that's a bit like calling every species of bird a "vulture." The reason is that data mining is applied to structured data only and generally involves specific techniques, such as regression analysis and decision trees, that are not typically used when the content being analyzed is unstructured.
Increasingly, the term "machine learning" is also beginning to acquire a catch-all status. Or, at the very least, machine learning has become a convenient handle that today's data scientists use to refer to the wide range of leading-edge techniques for automating knowledge and pattern discovery from fresh data, much of it unstructured. People's working definitions of machine learning seem to be creeping into broader, vaguer territory.
That's my impression from reading the recent article "Learning and Teaching Machine Learning: A Personal Journey." In it, author Joseph R. Barr of San Diego State University and True Bearing Analytics discusses both the history of machine learning and his own education in the topic. He states that "it's safe to regard machine learning, data mining, predictive analysis, and advanced analytics as more or less synonymous."
I'm not sure that lumping machine learning with all of these other techniques makes sense. As noted above, machine learning primarily applies to unstructured data, whereas data mining is specific to structured data sets. Also, machine learning, like data mining, is principally concerned with finding diverse patterns in historical data, whereas predictive analysis focuses specifically on finding those predictive patterns that can be tested empirically by gathering fresh data in the future. And whereas machine learning, data mining, and predictive analysis are all narrowly scoped, advanced analytics is the broader category that encompasses them all.
It seems to me that machine learning has one foot in data science and the other in computer science. That's how I interpret what Barr has to say here: "Machine learning grew out of several not-necessarily disjoint mathematical subjects, notable among these are mathematical statistics, computing and algorithm, information theory, and mathematical optimization.... In those ancient times, machine learning was bundled with AI.... [M]ost topics in machine learning lie in the convex hull of (the theories of) probability, combinatorics, convexity and optimization, statistics, information, and computing. To this list I would add the three extra dimensions: heuristics, empirics, and applications."
That's a lot to bite off and chew on! As this discussion makes clear, machine learning has a formidable learning curve, for which years of classroom and laboratory work at the university level may prove essential. And that in fact is the crux of Barr's article: his own machine learning schooling as a professional data scientist, plus the challenges he now faces in defining the right machine learning curriculum for tomorrow's data scientists.
The definitional scope creep afflicting the machine learning arena mirrors these challenges. The disparate disciplines under this umbrella will continue to cross-fertilize in innovative ways that will stretch every data scientist's thinking as well as the terminology they use to define machine learning.
This story, "What's machine learning? It depends on who you ask," was originally published by InfoWorld.