A couple of readers suggested that a full-fledged example would be a good follow-up to my previous post introducing the iterator pattern. It is a good suggestion, as there are few meaningful examples of the iterator pattern that demonstrate its intent and usefulness. This is probably because the iterator pattern is a built-in construct in most programming languages. However, built-in iterators are designed to traverse native collections such as Arrays and Dictionaries. To traverse a custom data structure, we need to develop an iterator from the ground up. In this example, we will develop a webpage scraper, like Googlebot, that recursively harvests information from web pages.
Why is the iterator pattern a good candidate for developing a webpage scraper? As described in my previous post, the iterator pattern provides a uniform way to traverse and access elements in a collection. A web page is a collection of elements. To harvest the elements (tags), we need to traverse and access the elements in the collection (the HTML). The iterator pattern light bulb should go off at this point.
Having a uniform interface to access different elements in the web page is very desirable. Why? Because there are multiple ways to traverse a page, and different kinds of elements to access. With a common interface, we can develop several concrete iterators, each accessing a different tag. In this example, we will develop two concrete iterators: one to access hyperlinks and the other to access images.
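To make the idea concrete, here is a minimal sketch in Python, used purely for illustration. Everything in it is an assumption of mine rather than the scraper's actual code: the `has_next`/`next`/`reset` interface, the class names `ElementIterator`, `LinkIterator` and `ImageIterator`, the `harvest` helper, and the simple attribute patterns used to find tags (they only handle double-quoted attributes). What it is meant to show is that the client code never needs to know which kind of tag it is harvesting.

```python
import re
from abc import ABC, abstractmethod


class ElementIterator(ABC):
    """Uniform traversal interface (hypothetical names, for illustration only)."""

    @abstractmethod
    def has_next(self) -> bool: ...

    @abstractmethod
    def next(self) -> str: ...

    @abstractmethod
    def reset(self) -> None: ...


class PatternIterator(ElementIterator):
    """Shared traversal logic; each concrete iterator only supplies a pattern."""

    pattern: re.Pattern  # assumed to be set by every subclass

    def __init__(self, html: str) -> None:
        # Collect the captured attribute values once, then walk them by index.
        self._items = self.pattern.findall(html)
        self._index = 0

    def has_next(self) -> bool:
        return self._index < len(self._items)

    def next(self) -> str:
        item = self._items[self._index]
        self._index += 1
        return item

    def reset(self) -> None:
        self._index = 0


class LinkIterator(PatternIterator):
    # Simplified: matches href values in double quotes only.
    pattern = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)


class ImageIterator(PatternIterator):
    # Simplified: matches src values in double quotes only.
    pattern = re.compile(r'<img\s[^>]*src="([^"]*)"', re.IGNORECASE)


def harvest(iterator: ElementIterator) -> list[str]:
    """Client code depends only on the ElementIterator interface."""
    found = []
    while iterator.has_next():
        found.append(iterator.next())
    return found


PAGE = (
    '<p><a href="/about.html">About</a> '
    '<img src="logo.png" alt=""> '
    '<a href="/contact.html">Contact</a></p>'
)

links = harvest(LinkIterator(PAGE))
images = harvest(ImageIterator(PAGE))
```

The same `harvest` function serves both iterators, and a third kind of tag would only need another small subclass.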
The example will be developed in two parts. My initial novice attempt will be described first (I'll call this version 1). I initially treated a web page as an XML document, so that E4X could be used to traverse and identify elements. However, this introduced a major limitation in that only well-formed web pages could be scraped. This didn't mean that the web pages had to be declared as XHTML Strict per se, but each page had to be structured according to the rules defined in Section 2.1 of the XML 1.0 Recommendation. So, any malformed web page with missing closing tags or funky characters would fail the test. In my second attempt (version 2), I treated web pages as text documents and used regular expressions to identify elements. This introduced another, more serious limitation in that my knowledge of regular expression pattern matching was minimal. So, version 2 was more of an adventure in slaying the regular expression dragon than anything else. However, the utility of the iterator was amply demonstrated, as I could extend the scraper app to meet the new requirement without changing any existing code, which is the ultimate test of reusability. Here is the initial class diagram.
Webpage scraper – version 1

Class diagram of version 1
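The well-formedness limitation of version 1 is easy to reproduce. The sketch below is in Python, purely for illustration, and uses the standard library's `xml.etree.ElementTree` as a stand-in for E4X. The sample pages, the function names `links_via_xml` and `links_via_regex`, and the regular expression are all hypothetical. It shows an XML parser rejecting a page with a single unclosed tag, while a text-based search in the spirit of version 2 still finds the link.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical pages that differ only in how the line break is written.
WELL_FORMED = '<html><body><a href="a.html">A</a><br/></body></html>'
MALFORMED = '<html><body><a href="a.html">A</a><br></body></html>'  # <br> is never closed


def links_via_xml(page: str) -> list[str]:
    """Version 1 style: parse the page as XML, then walk the <a> elements."""
    root = ET.fromstring(page)  # raises ET.ParseError if the page is not well-formed
    return [a.get("href", "") for a in root.iter("a")]


def links_via_regex(page: str) -> list[str]:
    """Version 2 style: treat the page as plain text (double-quoted href only)."""
    return re.findall(r'<a\s[^>]*href="([^"]*)"', page, re.IGNORECASE)


try:
    links_via_xml(MALFORMED)
    xml_failed = False
except ET.ParseError:
    # The unclosed <br> makes the whole document unparseable as XML.
    xml_failed = True
```

The trade-off is the one described above: the XML route gives structured access but demands well-formed input, while the text route tolerates sloppy markup at the price of writing and maintaining the patterns.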


Bill Sanders