Ask A Data Scientist: Better Data or Better Algorithms?
by Pat Lapomarda, on July 21, 2016
While at the Open Data Science Conference in Boston recently, I was reminded of a long-running debate within data science: data vs. algorithms. Usually this debate is about the quantity of data vs. the quality of the algorithm, but I was asked a slight variation on it: Which would you rather have, a better algorithm or better data quality?
Several years ago, I was working on a scoring project and made an interesting discovery. The project was a small-business risk score, and one of the factors we tested in our model was Industry.
According to the industry classification in the Customer Relationship Management (CRM) system, more than 30% of the firms were agricultural. That was far more than either Dun & Bradstreet (3%) or the US Census (0.3%) reported for the agriculture industry.
A multiple of 10 or 100 didn't make sense: the lender didn't focus on agriculture as an industry.
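The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the three percentages are the ones quoted above (with 30% treated as a lower bound), while the function name and the 5x review threshold are assumptions of mine, not part of the original project.

```python
# Share of firms classified as agricultural, as quoted in the text.
crm_share = 0.30  # CRM classification ("more than 30%", so a lower bound)
benchmarks = {"Dun & Bradstreet": 0.03, "US Census": 0.003}


def overrepresentation(observed, expected):
    """Return how many times larger the observed share is than the expected one."""
    return observed / expected


ratios = {
    name: overrepresentation(crm_share, share)
    for name, share in benchmarks.items()
}

# Hypothetical rule of thumb: flag any category whose share is more than
# 5x its external benchmark so a person can look into it.
THRESHOLD = 5.0
flagged = [name for name, ratio in ratios.items() if ratio > THRESHOLD]

for name, ratio in ratios.items():
    print(f"CRM share is about {ratio:.0f}x the {name} figure")
```

Both benchmarks end up flagged, which is the signal that the field deserves a closer look before it goes anywhere near a model.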
What could be going on?
After some research, we found that the industry was coded in the CRM system by salespeople when they first contacted the company.
Industry was a drop-down list; the first option on the list was Agriculture.
It was a required field.
Mystery solved!
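A check for this kind of default-option bias could look something like the sketch below. It is not what we ran at the time; the function names, the list of drop-down options, the sample records, and the factor of 2 are all made up for illustration. The idea is simply to ask whether the first option in a required drop-down shows up far more often than an even spread across options would suggest.

```python
from collections import Counter


def first_option_share(values, options):
    """Share of records that hold the first option of the drop-down."""
    if not values:
        raise ValueError("values must not be empty")
    return Counter(values)[options[0]] / len(values)


def looks_like_default_bias(values, options, factor=2.0):
    """Crude heuristic: flag the field when the first option appears more than
    `factor` times as often as it would under an even spread across options."""
    even_share = 1 / len(options)
    return first_option_share(values, options) > factor * even_share


# Hypothetical drop-down order and toy records (synthetic, not real data).
options = [
    "Agriculture", "Construction", "Finance", "Healthcare", "Manufacturing",
    "Retail", "Services", "Technology", "Transportation", "Wholesale",
]
sample = [
    "Agriculture", "Agriculture", "Agriculture", "Retail", "Retail",
    "Services", "Services", "Construction", "Manufacturing", "Wholesale",
]

print(looks_like_default_bias(sample, options))
```

A flag from a heuristic like this is only a prompt to investigate, since some lenders really do concentrate on whatever industry happens to be listed first.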
There was no validation of the industry coded by sales and no incentive for them to get it right; there were only some prohibited industries, and Agriculture wasn't one of them. As a result, we were able not only to use an industry coded at the point of underwriting (using D&B, which is still 10X higher than the US Census), but also to eliminate a step in the sales process, allowing sales to do more selling!
If we had depended on this industry classification to train an algorithm, it would never have predicted well. So when someone asks me "Would you rather have a better algorithm or better data?", I pick better data without a doubt!