Ask A Data Scientist: The Worst Parts of Being A Data Scientist
by Pat Lapomarda, on August 18, 2016
I recently found this question on Quora that I get asked all the time:
What are things that Data Scientists have to spend time on that they'd rather not?
The context clearly calls out the most likely suspect:
“Data munging and ETL are often called out, but that encompasses a broad set of things. Within data munging and outside it, where are data scientists still spending a lot of time that doesn’t feel like the best use of their skills?”
There are some great responses, but when I read the question, I immediately thought:
Data Janitor
Most professionals working in the data space feel that's exactly what they are doing too: cleaning up data that wasn't collected, organized, transformed, or analyzed correctly. The examples listed by Sean Owen really hit home, but a unique one I've run into is:
Examining the values of the data to correct the unit of measurement.
This happens when an application lets the user enter a value in either whole dollars or millions, with a related unit of measurement field recording which one was used. It's quite common for users to forget to set the unit and value consistently, so the value may be entered in millions while the unit is set to whole dollars.
The only way to find this is to recalculate some ratios that the system stores, using a field whose unit is fixed (always whole dollars or always millions), and check whether the value and unit together reproduce the stored ratio.
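Here is a minimal sketch of that check in Python. The record layout is hypothetical, not taken from any particular system: I'm assuming each record has a `value`, a user-set `unit`, a `reference` field that is always in whole dollars, and a `stored_ratio` that the system computed as the true dollar value divided by `reference`. The idea is to recompute the ratio under each possible unit and see which one reproduces the stored number.

```python
import math

# Assumed unit labels and their multipliers to whole dollars (illustrative names).
UNIT_MULTIPLIERS = {"dollars": 1, "millions": 1_000_000}


def infer_unit(value, reference, stored_ratio, rel_tol=0.01):
    """Return the unit under which value / reference reproduces stored_ratio.

    Returns None when no unit, or more than one unit, matches (for example
    when the value is zero), because the check is inconclusive in that case.
    """
    if not reference:
        return None  # cannot recompute a ratio against a zero or missing reference
    matches = [
        unit
        for unit, multiplier in UNIT_MULTIPLIERS.items()
        if math.isclose(value * multiplier / reference, stored_ratio, rel_tol=rel_tol)
    ]
    return matches[0] if len(matches) == 1 else None


def find_unit_mismatches(records):
    """List (id, recorded_unit, inferred_unit) for records whose unit looks wrong."""
    flagged = []
    for rec in records:
        inferred = infer_unit(rec["value"], rec["reference"], rec["stored_ratio"])
        if inferred is not None and inferred != rec["unit"]:
            flagged.append((rec["id"], rec["unit"], inferred))
    return flagged


# Made-up sample records: record 2 was typed in millions but left as "dollars".
records = [
    {"id": 1, "value": 2_500_000, "unit": "dollars", "reference": 10_000_000, "stored_ratio": 0.25},
    {"id": 2, "value": 2.5, "unit": "dollars", "reference": 10_000_000, "stored_ratio": 0.25},
    {"id": 3, "value": 4, "unit": "millions", "reference": 16_000_000, "stored_ratio": 0.25},
]

print(find_unit_mismatches(records))  # [(2, 'dollars', 'millions')]
```

Because the two units differ by a factor of a million, a loose tolerance (1% here, an arbitrary choice) is enough to tell them apart while still absorbing rounding in the stored ratio. Records where the check is inconclusive are left alone rather than guessed at.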
Let me know if you have any unique data munging experiences, and click here to read other data scientists' answers on Quora!