17 February 2017. A software package in development aims to simplify the process of cleaning up large data sets, known as curation, to improve the quality of analyses drawn from those data. The software, called Vizier, is being developed in a University at Buffalo computer science lab, funded by a three-year, $2.7 million award from the National Science Foundation made in January 2017.
Vizier is a creation of the University at Buffalo’s Online Data Interactions, or ODin, Lab in Amherst, New York, led by Oliver Kennedy, a computer science and engineering professor. Kennedy and his ODin Lab colleagues study databases, with the goal of making self-service analytics possible for subject-matter experts who neither want nor need to also become experts in database planning and design.
In this project, Kennedy and colleagues seek to make it easier for keepers of large data sets to prepare their data for the rigors of analytics by first locating the errors that find their way into even the best-kept collections. This curation process is needed to organize the data, remove duplicates, and find missing and erroneous entries as data sets are refined and merged. While curation is needed to prevent crippling analytical errors later on, it is also slow and costly.
In its proposal, the ODin Lab team gave as an example the mass of data generated by taxis in New York City. Meters in taxis capture data on some 500,000 trips, transporting 600,000 people each day. The meters collect GPS data on pick-up and drop-off locations, along with times, fares, and tip amounts, all accumulated by the city’s Taxi and Limousine Commission. The city government uses these data to better understand living and working patterns, with implications for transportation and housing policies. A quality review of the data set by one of the co-investigators, however, highlighted errors such as negative values for miles traveled and fares, GPS coordinates outside the U.S., and a tip valued at $938.02.
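The kinds of sanity checks described above are straightforward to express in code. The sketch below is purely illustrative: the record field names, the rough continental-U.S. bounding box, and the tip threshold are all assumptions, not details of the TLC data set or of Vizier itself.

```python
# Flag trip records with the error types noted in the quality review:
# negative miles or fares, out-of-range GPS coordinates, outsized tips.
# Field names and thresholds here are hypothetical.

def find_errors(trip):
    """Return a list of data-quality problems found in one trip record."""
    errors = []
    if trip["miles"] < 0:
        errors.append("negative miles")
    if trip["fare"] < 0:
        errors.append("negative fare")
    # Rough continental-U.S. bounding box for a GPS plausibility check.
    lat, lon = trip["pickup_lat"], trip["pickup_lon"]
    if not (24.0 <= lat <= 50.0 and -125.0 <= lon <= -66.0):
        errors.append("pickup outside the U.S.")
    # Arbitrary cutoff: a tip more than 100x the fare is suspect.
    if trip["tip"] > 100 * max(trip["fare"], 1):
        errors.append("implausibly large tip")
    return errors

trips = [
    {"miles": 2.1, "fare": 9.5, "tip": 2.0,
     "pickup_lat": 40.75, "pickup_lon": -73.99},
    {"miles": -3.0, "fare": -4.5, "tip": 938.02,
     "pickup_lat": 0.0, "pickup_lon": 0.0},
]
flagged = {i: find_errors(t) for i, t in enumerate(trips)
           if find_errors(t)}
```

Running this on the two sample records flags only the second, with all four error types; the point is that each check is cheap individually, but writing, tracking, and maintaining many of them by hand is what makes curation slow.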
Vizier, say its developers, will make it possible to highlight those kinds of errors and improve data quality as part of a routine workflow, rather than as a separate, laborious step in compiling a database. The software will offer an interface similar to familiar spreadsheets and notebooks, with built-in steps for data curation. Vizier will also track the step-by-step history of curation processes, providing an audit trail if previous steps need to be reversed. Capturing the history will also make it possible to recommend further curation steps based on the context of what came before.
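The idea of a recorded, reversible curation history can be illustrated with a small sketch. This is a toy model of the concept only, under my own assumptions, and not Vizier's actual design or API.

```python
# A toy curation history: each cleaning step is applied through a
# wrapper that snapshots the result and logs the step name, so the
# pipeline can be audited or rolled back step by step.

class CurationHistory:
    def __init__(self, data):
        self.versions = [list(data)]   # snapshot after each step
        self.log = []                  # human-readable audit trail

    def apply(self, name, step):
        """Run a cleaning step on the current data and record it."""
        self.versions.append(step(self.versions[-1]))
        self.log.append(name)

    def undo(self):
        """Reverse the most recent curation step."""
        if len(self.versions) > 1:
            self.versions.pop()
            self.log.pop()

    @property
    def current(self):
        return self.versions[-1]

hist = CurationHistory([3, -1, 3, 7])
hist.apply("drop negatives", lambda d: [x for x in d if x >= 0])
hist.apply("deduplicate", lambda d: sorted(set(d)))
```

After the two steps, `hist.current` is `[3, 7]` and `hist.log` reads `["drop negatives", "deduplicate"]`; calling `hist.undo()` restores the pre-deduplication data. Keeping snapshots is wasteful for large data sets, which is one reason real provenance systems record operations rather than full copies.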
“We are creating a tool,” says Kennedy in a university statement, “that’ll let you work with the data you have, and also unobtrusively make helpful observations like ‘Hmm… have you noticed that two out of a million records make a 10 percent difference in this average?’”
Co-principal investigators for the Vizier project are Juliana Freire, professor of computer science and engineering at New York University, and Boris Glavic, assistant professor in the Department of Computer Science at the Illinois Institute of Technology. The team plans to release the software as a free, open-source package.