Retail Competitive Intelligence via Web Scraping

A professor at Yale recently asked me the following question: “If you were going to improve the current state of business analytics, where would you focus?”  It’s a good question, but the improvement isn’t in analytics, it’s in the foundation.  Any advanced analytics project is only as good as the data it is built upon.  Analytics projects usually face two types of data issues:

1) Bad data, or “Garbage In, Garbage Out”
This is the better-known, and more easily-diagnosed, challenge for analytics.  If you have an inaccurate or imprecise measure, there are pretty standard ways of dealing with it.  And data quality keeps improving as collection costs decrease, so this is no longer the biggest challenge.

2) Missing Data, causing Omitted Variable Bias (OVB)
Business analytics projects never capture all the factors that could impact the results, so OVB needs to be diagnosed in every single model.  Since it is unavoidable and omnipresent, we can only try to estimate the magnitude of the problem when evaluating the outputs.
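To make OVB concrete, here is a minimal simulation (the numbers and variable names are made up for illustration): if sales truly depend on both our price and a competitor’s price, and the two prices are correlated, then dropping the competitor’s price from the regression biases the coefficient on our own price.

```python
import numpy as np

# Hypothetical illustration of omitted variable bias (OVB).
# True model: sales depend on our price AND a competitor's price,
# but the competitor's price is missing from our data set.
rng = np.random.default_rng(0)
n = 10_000

competitor_price = rng.normal(10, 2, n)
# Our price tends to track the competitor's (prices are correlated).
our_price = 0.8 * competitor_price + rng.normal(0, 1, n)
# True relationship: sales fall with our price, rise with the competitor's.
sales = 100 - 3.0 * our_price + 2.0 * competitor_price + rng.normal(0, 1, n)

# Full model: regress sales on both prices (intercept included).
X_full = np.column_stack([np.ones(n), our_price, competitor_price])
beta_full = np.linalg.lstsq(X_full, sales, rcond=None)[0]

# Short model: competitor price omitted.
X_short = np.column_stack([np.ones(n), our_price])
beta_short = np.linalg.lstsq(X_short, sales, rcond=None)[0]

print(f"our-price coefficient, full model:  {beta_full[1]:.2f}")
print(f"our-price coefficient, rival omitted: {beta_short[1]:.2f}")
```

In the full model the estimated price coefficient lands near the true value of −3; with the competitor’s price omitted, the correlated, positive effect of the rival’s price leaks into our own coefficient and pulls it sharply upward, which is exactly the kind of distortion you cannot see from inside the model.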

So if I were to point to the single place to focus efforts to improve results, it wouldn’t be in improving known data sets.  It would be bolstering data sources, and more specifically, finding metrics that represent competitive data.  And as with everything else these days, it’s the internet to the rescue.

The most significant data usually missing from consumer models is competitive: price, assortment, discounting, product launches, and advertising.  Two leading providers of this data are RivalWatch and Fetch.  Both scrape websites to gather data that can then represent each of these factors for both you and your competitors.  Supplemented with syndicated competitive media and print data, at least one large category of omitted variables can be addressed, and models can be much more robust.
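At its core, the scraping these providers do amounts to fetching competitor product pages and extracting structured fields from the HTML. The sketch below shows that extraction step on a made-up product-page fragment using only Python’s standard library; real pages, class names, and selectors will differ, and production scrapers also have to handle pagination, JavaScript rendering, and rate limits.

```python
from html.parser import HTMLParser

# Hypothetical product-page fragment; real competitor pages differ.
SAMPLE_PAGE = """
<div class="product">
  <span class="name">Acme Widget</span>
  <span class="price">$19.99</span>
</div>
"""

class PriceParser(HTMLParser):
    """Collect numeric prices from elements whose class is 'price'."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            self.prices.append(float(text.lstrip("$")))

parser = PriceParser()
parser.feed(SAMPLE_PAGE)
print(parser.prices)  # [19.99]
```

Run daily across a competitor’s catalog, extractions like this become the time series of prices, assortment, and discounts that the models above are missing.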