Archive for the 'Iterator' Category

Iterator Pattern Example: Developing a Webpage Scraper

A couple of readers suggested that a full-fledged example would be a good follow-up to my previous post introducing the iterator pattern. It is a good suggestion, as there are few meaningful examples of the iterator pattern that demonstrate its intent and usefulness. This is probably because iterators are a built-in construct in most programming languages. However, built-in iterators are designed to traverse native collections such as Arrays and Dictionaries. To traverse a custom data structure, we need to develop an iterator from the ground up. In this example, we will develop a webpage scraper, like Googlebot, that recursively harvests information from web pages.

Why is the iterator pattern a good candidate for developing a webpage scraper? As described in my previous post, the iterator pattern provides a uniform way to traverse and access elements in a collection. A web page is a collection of elements. To harvest the elements (tags), we need to traverse and access the elements in the collection (the HTML). The iterator pattern light bulb should go on at this point.

Having a uniform interface to access different elements in the web page is very desirable. Why? Because there are multiple ways to traverse and access different elements. We can develop several concrete iterators, each accessing a different kind of tag. In this example, we will develop two concrete iterators: one to access hyperlinks and the other to access images.
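To make the idea concrete, here is a minimal sketch in Python rather than the language used for the actual example. The names `TagIterator`, `HyperlinkIterator`, `ImageIterator`, `has_next`, `next`, `reset` and `harvest` are my own assumptions for illustration, and the parsing relies on Python's standard `html.parser` module, which is not what the scraper in this post uses.

```python
from html.parser import HTMLParser


class _AttributeCollector(HTMLParser):
    """Collects one attribute's values from every occurrence of one tag."""

    def __init__(self, wanted_tag, wanted_attr):
        super().__init__()
        self._wanted_tag = wanted_tag
        self._wanted_attr = wanted_attr
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != self._wanted_tag:
            return
        for name, value in attrs:
            if name == self._wanted_attr and value is not None:
                self.found.append(value)


class TagIterator:
    """Hypothetical uniform interface: has_next(), next(), reset()."""

    TAG = ""        # overridden by each concrete iterator
    ATTRIBUTE = ""  # overridden by each concrete iterator

    def __init__(self, html):
        collector = _AttributeCollector(self.TAG, self.ATTRIBUTE)
        collector.feed(html)
        collector.close()
        self._items = collector.found
        self._index = 0

    def has_next(self):
        return self._index < len(self._items)

    def next(self):
        item = self._items[self._index]
        self._index += 1
        return item

    def reset(self):
        self._index = 0


class HyperlinkIterator(TagIterator):
    """Concrete iterator over the href values of <a> tags."""
    TAG = "a"
    ATTRIBUTE = "href"


class ImageIterator(TagIterator):
    """Concrete iterator over the src values of <img> tags."""
    TAG = "img"
    ATTRIBUTE = "src"


def harvest(iterator):
    """Client code: depends only on the uniform interface, not on the tag type."""
    results = []
    while iterator.has_next():
        results.append(iterator.next())
    return results
```

The client function `harvest` never needs to know which kind of tag it is collecting; passing it a `HyperlinkIterator` or an `ImageIterator` built from the same page is the only thing that changes.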

The example will be developed in two parts. My initial novice attempt will be described first (I'll call this version 1). I initially treated a web page as an XML document, so that E4X could be used to traverse and identify elements. However, this introduced a major limitation: only well-formed web pages could be scraped. This didn't mean that the web pages had to be declared as XHTML Strict per se, but each page had to be structured according to the rules defined in Section 2.1 of the XML 1.0 Recommendation. So any malformed web page with missing closing tags or funky characters would fail the test.

In my second attempt (version 2), I treated web pages as text documents and used regular expressions to identify elements. This introduced another, more serious limitation: my knowledge of regular expression pattern matching was minimal. So version 2 was more of an adventure in slaying the regular expression dragon than anything else. However, the utility of the iterator was amply demonstrated, as I could extend the scraper app to meet the new requirement without changing any existing code, which is the ultimate test of reusability. Here is the initial class diagram.

Webpage scraper – version 1

[Figure: Class diagram of the web scraper example, version 1]
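The difference between the two versions described earlier can be reproduced in a few lines. The sketch below is in Python, not the E4X and regular expression code of the actual example: it parses a deliberately malformed page strictly as XML, which fails, and then scans the same page as plain text, which still finds the hyperlink. The sample page, the function names and the pattern are hypothetical, and the pattern is intentionally simplistic (it only handles quoted `href` values).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: <br> and <img> are never closed, so it is not well-formed XML.
MALFORMED_PAGE = (
    '<html><body><a href="/home">Home</a><br><img src="logo.png"></body></html>'
)


def links_via_xml(page):
    """Version 1 style: treat the page as XML. Raises ET.ParseError if malformed."""
    root = ET.fromstring(page)
    return [a.get("href") for a in root.iter("a") if a.get("href") is not None]


# Simplistic pattern (an assumption, not the pattern used in the post):
# an <a ...> tag with a quoted href attribute somewhere inside it.
HREF_PATTERN = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE
)


def links_via_regex(page):
    """Version 2 style: treat the page as text and match tags with a regex."""
    return HREF_PATTERN.findall(page)


if __name__ == "__main__":
    try:
        links_via_xml(MALFORMED_PAGE)
    except ET.ParseError as error:
        print("XML parsing failed:", error)
    print("Regex found:", links_via_regex(MALFORMED_PAGE))
```

The text-based approach tolerates the unclosed tags, but at the cost of having to get the pattern right for every variation of markup that real pages contain.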

Continue reading ‘Iterator Pattern Example: Developing a Webpage Scraper’


The Iterator Pattern: Flexible implementation of collections

We tend to think of the iterator pattern as simply a mechanism to access a collection of items in some sort of order. However, the motivation for the iterator pattern goes quite a bit beyond this everyday requirement. It is specifically designed to hide how the collection is structured while still providing a uniform interface to traverse and access its elements. We will look at a couple of examples where we change how the collection is implemented but keep the client oblivious to the change.
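As a rough preview of that idea, here is a minimal Python sketch (not the code from this post) in which the same client function works against two collections with different internal structures. All class and method names (`ArrayCollection`, `LinkedCollection`, `create_iterator`, `has_next`, `next`, `collect`) are assumptions chosen for illustration.

```python
class ArrayIterator:
    """Walks a list by index."""

    def __init__(self, items):
        self._items = items
        self._index = 0

    def has_next(self):
        return self._index < len(self._items)

    def next(self):
        item = self._items[self._index]
        self._index += 1
        return item


class ArrayCollection:
    """Collection stored internally as a Python list."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def create_iterator(self):
        return ArrayIterator(self._items)


class _Node:
    def __init__(self, value):
        self.value = value
        self.next_node = None


class LinkedListIterator:
    """Walks a singly linked list node by node."""

    def __init__(self, head):
        self._node = head

    def has_next(self):
        return self._node is not None

    def next(self):
        value = self._node.value
        self._node = self._node.next_node
        return value


class LinkedCollection:
    """Same public interface, but stored internally as a linked list."""

    def __init__(self):
        self._head = None
        self._tail = None

    def add(self, item):
        node = _Node(item)
        if self._head is None:
            self._head = node
        else:
            self._tail.next_node = node
        self._tail = node

    def create_iterator(self):
        return LinkedListIterator(self._head)


def collect(collection):
    """Client code: knows only create_iterator(), has_next() and next()."""
    iterator = collection.create_iterator()
    items = []
    while iterator.has_next():
        items.append(iterator.next())
    return items
```

Swapping `ArrayCollection` for `LinkedCollection` requires no change to `collect`, which is the sense in which the client stays oblivious to how the collection is implemented.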

Let’s take a look at the class diagram to determine the relationships between the classes in the iterator pattern.


Continue reading ‘The Iterator Pattern: Flexible implementation of collections’
