On Data Wrangling & Machine Learning

Luigi Vacca, Data Scientist at NGDATA, October 28, 2014

Recently, I came across a piece in The New York Times that examined how prevalent laborious, manual data formatting is and how quickly data “wrangling” can become unwieldy. Some hands-on work will always be required in data preparation, because data ultimately has to be analyzed by human data scientists. But machine learning can be an invaluable tool for those struggling with data wrangling because, oddly enough, it requires less formatting to produce useful results. More importantly, machine learning operating in online mode can drastically improve the productivity of data scientists.

I’d argue that one of the most important contributions of machine learning to recommender systems and pattern recognition is the ability to operate in online mode. Traditionally, datasets are stored in a database and modelled in one batch. This approach has several limitations. First, the entire dataset must be available before it can be fed to a model, so huge amounts of data must be stored in one place and nothing can be modelled until collection is complete. Once the model has been created, it typically stays the same for all future applications. Second, batch mode limits the amount of data that data scientists can process, because of random access memory (RAM) constraints and disk access speed.
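To make the contrast concrete, here is a minimal Python sketch that uses a simple statistic (the mean) as a stand-in for a model. The function names and the sample values are hypothetical and chosen only for illustration. The batch version needs the whole dataset in memory before it can return anything, while the incremental version keeps only a count and a running estimate.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch mode: the full dataset must be in memory before anything is computed."""
    return sum(values) / len(values)


def running_means(stream: Iterable[float]) -> Iterator[float]:
    """Online mode: update the estimate one record at a time, in constant memory."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update of the running mean
        yield mean


# Hypothetical sample data, for illustration only.
readings = [2.0, 4.0, 6.0, 8.0]

print(batch_mean(readings))           # available only once all data is collected
print(list(running_means(readings)))  # an estimate is available after every record
```

Both approaches arrive at the same final value, but the incremental version never holds more than one record at a time, which is the property that lets online methods sidestep the RAM limits described above.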

Thanks to online mode, data can be processed as it is ingested, in parallel. This makes it possible to process terabyte-scale datasets, because the model is updated incrementally as data arrives instead of being rebuilt from the full dataset. As a bonus, online mode can adapt to new data if the data’s behavior changes over time (this is called “adaptive learning”).
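As a rough illustration of online and adaptive learning, the Python sketch below trains a one-feature linear model one record at a time using stochastic gradient descent. This is an assumed, simplified stand-in and not a description of any particular product: the class `OnlineLinearModel`, the learning rate, and the synthetic streams are all hypothetical. Halfway through, the relationship in the data changes, and the model follows it without being retrained from scratch.

```python
from typing import Iterable, Tuple


class OnlineLinearModel:
    """Hypothetical one-feature linear model y ~ w * x, trained record by record."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.w = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x

    def update(self, x: float, y: float) -> None:
        # One stochastic-gradient step on the squared error for this record.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x


def train_on_stream(model: OnlineLinearModel,
                    stream: Iterable[Tuple[float, float]]) -> None:
    """Consume records as they arrive; nothing is stored after the update."""
    for x, y in stream:
        model.update(x, y)


model = OnlineLinearModel()
xs = [1.0, 2.0, 0.5, 1.5]  # synthetic inputs, for illustration only

# Phase 1: the stream follows y = 2x.
train_on_stream(model, ((x, 2.0 * x) for _ in range(50) for x in xs))
w_before = model.w

# Phase 2: the data's behavior changes to y = -x; the same model keeps learning.
train_on_stream(model, ((x, -1.0 * x) for _ in range(50) for x in xs))
w_after = model.w

print(round(w_before, 3), round(w_after, 3))
```

The streams are generators, so records are produced and consumed one at a time. A batch-trained model fitted during the first phase would keep predicting with the old relationship until someone rebuilt it.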

I appreciate The New York Times’s attention to the challenges and productivity obstacles of data formatting. However, I urge organizations to focus instead on the benefits of machine learning (operating in online mode, of course): by automating much of the data formatting process, it frees data scientists to devote their time to higher-level tasks more worthy of their intelligence, and gives them more accurate data analysis to leverage.