Screen scraping

One thing I believe in is constant change and constant learning, ideally one new thing per day. Sometimes it’s learning how to cook the best eggs Benedict ever for breakfast; sometimes it’s helping out a friend with a special request.

Today I was asked by a colleague if I could help extract data from a web site. As an architect, the first thing I look for in the “code” is a clean separation between what is presentation and what is data. Obviously, I did not find that, which made me realise how bad, bad, bad frameworks that render HTML mixed with data really are. Why can’t everything follow MVVM with binding of some sort?

Anyway, we needed a solution, and what we whipped up was screen scraping.

My first attempt was to write a small HTML page that loads jQuery, makes an AJAX call to the web page we needed data from, and then extracts the data from its DOM… It turned out to be easy to write, but I was met with an error: CORS headers not found for file://dlg/extractData.html. GRRRRR
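
For the curious, here is my rough mental model of why the browser refused. This is a simplified sketch, not the real browser algorithm: it only covers a plain request without credentials, and the function name cors_allows is made up for illustration. As far as I understand it, a page opened from file:// has no real origin and sends "null", so unless the target site answers with a matching Access-Control-Allow-Origin header (or a wildcard), the response is blocked.

```python
def cors_allows(page_origin: str, response_headers: dict) -> bool:
    """Simplified model of the browser's CORS check for a simple request.

    Assumption: no credentials and no preflight; real browsers do more.
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    if allowed is None:
        # No CORS header at all: the browser blocks the response.
        return False
    allowed = allowed.strip()
    return allowed == "*" or allowed == page_origin


# A page loaded from file:// reports its origin as the string "null".
blocked = cors_allows("null", {"Content-Type": "text/html"})
permitted = cors_allows("null", {"Access-Control-Allow-Origin": "*"})
print(blocked)    # False
print(permitted)  # True
```

Since I had no control over the target site's headers, fixing this from my little HTML page was not an option.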

New strategy!

I opened Chrome, searched for screen scraping extensions and, lo and behold, I found this.

An extension that lets me navigate to any page, look up how it is built, understand its use of CSS selectors, and voilà. Any page that reuses the same CSS selector to represent repeating data (as in a list) can have that data extracted to a JSON or CSV file.
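
To make the idea concrete, here is a minimal sketch of the same principle, using only the Python standard library. It is not how the extension works internally (I have not looked), and it only matches a single class name rather than a full CSS selector. The class name "item", the sample HTML and the ClassTextExtractor helper are all invented for illustration.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.rows = []
        self._depth = 0    # > 0 while we are inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.rows.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical page fragment: the same class repeats for every list entry.
SAMPLE = """
<ul>
  <li class="item">Alpha</li>
  <li class="item">Beta <b>bold</b></li>
  <li class="other">Skipped</li>
</ul>
"""

parser = ClassTextExtractor("item")
parser.feed(SAMPLE)
rows = parser.rows

# Export the same rows as JSON and as CSV (with a one-column header).
as_json = json.dumps(rows)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["text"])
writer.writerows([row] for row in rows)
as_csv = out.getvalue()

print(as_json)
print(as_csv)
```

The whole trick is that the repeated class is the only "schema" the page offers, so the scraper leans on it to find each record.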

Well, thanks DL for getting me to learn something new today!

