Big-data analysis, prized for its predictive power but until now confined to human intuition, is increasingly being taken on by computer systems; business analytics is one of the fastest-growing fields in the IT software arena.
MIT researchers Max Kanter and his adviser Kalyan Veeramachaneni have succeeded in taking the human element out of big-data analysis with a new system that not only searches for patterns but designs the feature set, too. They have tested their prototype in three data science competitions, in which it competed against human teams.
Of the 906 teams participating in the three competitions, the researchers’ “Data Science Machine” finished ahead of 615. In two of the three competitions, the predictions made by the Data Science Machine were 94% and 96% as accurate as the winning submissions.
In the third, the figure was a more modest 87%. But where the human teams typically spent months working on their prediction algorithms, the Data Science Machine took between 2 and 12 hours to produce each of its entries.
“We view the Data Science Machine as a natural complement to human intelligence,” says Max Kanter, whose MIT master’s thesis in computer science is the basis of the Data Science Machine. “There’s so much data out there to be analyzed. And right now it’s just sitting there not doing anything. So maybe we can come up with a solution that will at least get us started on it, at least get us moving.”
Kanter and Veeramachaneni, of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), describe the Data Science Machine in a paper they will present next week at the IEEE International Conference on Data Science and Advanced Analytics. Veeramachaneni co-leads the Anyscale Learning for All group at CSAIL, which applies machine-learning techniques to practical problems in big-data analysis.
They use a couple of tricks to manufacture candidate features for data analyses. One is to exploit structural relationships inherent in database design. Databases typically store different types of data in different tables, indicating the correlations between them using numerical identifiers. The Data Science Machine tracks these correlations, using them as a cue to feature construction.
For instance, one table might list retail items and their costs; another might list items included in individual customers’ purchases. The Data Science Machine would begin by importing costs from the first table into the second. Then, taking its cue from the association of several different items in the second table with the same purchase number, it would execute a suite of operations to generate candidate features: total cost per order, average cost per order, minimum cost per order, and so on. As numerical identifiers proliferate across tables, the Data Science Machine layers operations on top of each other, finding minima of averages, averages of sums, and so on.
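The paper itself doesn't include code, but this join-and-aggregate step can be sketched in pandas. The table and column names below are hypothetical stand-ins for the article's example of item costs and customer purchases:

```python
import pandas as pd

# Hypothetical tables: one lists retail items and their costs, the
# other lists the items included in individual customers' purchases.
items = pd.DataFrame({
    "item_id": [1, 2, 3],
    "cost": [5.0, 12.0, 3.0],
})
purchases = pd.DataFrame({
    "order_id": [101, 101, 102, 102, 102],
    "item_id":  [1, 2, 1, 3, 3],
})

# Step 1: import costs from the first table into the second, following
# the shared numerical identifier (item_id).
purchases = purchases.merge(items, on="item_id")

# Step 2: several rows share the same order_id, so aggregate over each
# order to generate candidate features: total, average, minimum cost.
features = purchases.groupby("order_id")["cost"].agg(
    total_cost="sum", avg_cost="mean", min_cost="min",
)
print(features)
```

Layering operations, as the system does across many tables, would simply mean feeding aggregates like these into further rounds of merging and grouping (averages of sums, minima of averages, and so on).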
It also looks for so-called categorical data, which appear to be restricted to a limited range of values, such as days of the week or brand names. It then generates further feature candidates by dividing up existing features across categories.
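A minimal sketch of that categorical split, again with hypothetical data: a day-of-week column takes only a limited range of values, so an existing feature such as total order cost can be divided across its categories:

```python
import pandas as pd

# Hypothetical order-level table with a categorical column (day of week).
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "total_cost": [17.0, 11.0, 25.0, 9.0],
    "day": ["Mon", "Mon", "Tue", "Tue"],
})

# Divide the existing total_cost feature across the categories,
# yielding one new candidate feature per day of the week.
per_day = orders.groupby("day")["total_cost"].mean()
print(per_day)
```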
Once it’s produced an array of candidates, it reduces their number by identifying those whose values seem to be correlated. Then it starts testing its reduced set of features on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield.
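One common way to do that first pruning step, shown here as an assumed implementation rather than the system's actual one, is to drop one member of any pair of features whose correlation exceeds a threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical candidate-feature table; f_dup is nearly a linear copy
# of f1, so the pair will be strongly correlated.
X = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
})
X["f_dup"] = X["f1"] * 2.0 + rng.normal(scale=0.01, size=100)

# Keep only the upper triangle of the absolute correlation matrix, so
# each feature pair is examined once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature that is highly correlated with an earlier one.
drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X.drop(columns=drop)
print(list(X_reduced.columns))
```

The surviving features would then be tested on sample data and recombined, as the article describes, to optimize predictive accuracy.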