Iterator Pattern Example: Developing a Webpage Scraper
A couple of readers suggested that a full-fledged example might be a good follow-up to my previous post introducing the iterator pattern. This is a good suggestion as there are few meaningful examples of the iterator pattern that demonstrate its intent and usefulness. This is probably due to the iterator pattern being a built-in construct in most programming languages. However, built-in iterators are designed to traverse native collections such as Arrays and Dictionaries. To traverse a custom data structure, we need to develop an iterator from the ground-up. In this example, we will develop a webpage scraper, like Googlebot, that recursively harvests information from web pages.
Why is an iterator pattern a good candidate for use in developing a webpage scraper? As described in my previous post, the iterator pattern provides a uniform way to traverse and access elements in a collection. A web page is a collection of elements. To harvest the elements ( tags ) we need to traverse and access the elements in the collection (HTML). The iterator pattern light bulb should go off at this point.
Having a uniform interface to access different elements in the web page is very desirable. Why? because there are multiple ways to traverse and access different elements. We can develop several concrete iterators to access different tags. In this example, we will develop two concrete iterators: one to access hyperlinks and the other to access images.
The example will be developed in two parts. My initial novice attempt will be described first ( I’ll call this version 1 ). I initially treated a web page as an XML document, so that E4X could be used to traverse and identify elements. However, this introduced a major limitation in that only well-formed web pages could be scraped. This didn’t mean that the web pages had to be declared as XHTML Strict per se, but each page had to be structured according to the rules defined in Section 2.1 of the XML 1.0 Recommendation. So, any malformed web pages with missing closing tags, or funky characters would fail the test. In my second attempt (version 2), I treated web pages as text documents and used regular expressions to identify elements. This introduced another more serious limitation in that my knowledge of regular expression pattern matching was minimal. So, version 2 was more of an adventure in slaying the regular expression dragon than anything else. However, the utility of the iterator was amply demonstrated as I could extend the scraper app to meet the new reqruiement without changing any existing code – the ultimate test in reusability. Here is the initial class diagram.
Webpage scraper – version 1

Class diagram of version 1
Two concrete iterators called HyperlinkIterator and ImageIterator traverse and access elements in the XHTMLPage concrete aggregate.
Class Hierarchy
The com.as3dp.patterns.iterator.scraper package contains the concrete aggregate and concrete iterators. The package facilitates encapsulation by hiding implementation details from the client. This will be apparent when we look at some code later on. Here is a screenshot of the project panel in Flash CS3 showing the class files and hierarchy.

Project panel in Flash CS3 showing the class hierarchy for version 1
The Interfaces
The public interfaces for the aggregate and iterator are quite straightforward. The method signature in the IIterableAggregate interface is different from the previous post. The createIterator method now takes an argument specifying the type of iterator. This allows the client flexibility to get different types of iterators to traverse the same collection.
package com.as3dp.interfaces.iterator
{
public interface IIterableAggregate
{
function createIterator( type : String = null ) : IIterator
}
}
package com.as3dp.interfaces.iterator
{
public interface IIterator
{
function reset() : void
function next() : *
function hasNext() : Boolean
}
}
The Concrete Aggregate: XHTMLPage
The XHTMLPage class concrete aggregate is the most complex, so let’s deal with it first. Web pages are loaded asynchonously in AS3. The client not only has to specify the URL of the web page to load, but will have to wait until it is loaded. The client can specify a callback function that will be called when the page is loaded. If this is the case, XHTMLPage will need the ability to dispatch an event to call the callback function. To meet all these requirements without re-inventing the wheel, we can implement XHTMLPage by extending the URLLoader class (this choice will eventually break encapsulation – can you see why? More on that later ).
package com.as3dp.patterns.iterator.scraper
{
import flash.errors.*;
import flash.events.*;
import flash.net.*;
import com.as3dp.interfaces.iterator.*;
import com.as3dp.patterns.iterator.scraper.*;
public class XHTMLPage extends URLLoader implements IIterableAggregate
{
public static const HYPERLINKS :String = 'Iterate over hyperlinks';
public static const IMAGES :String = 'Iterate over images';
public static const LOADED :String = 'page loaded';
internal var xml : XML = null;
internal var url : String;
public function XHTMLPage( pageURL:String, callbackFn:Function )
{
url = pageURL;
var urlRequest:URLRequest = new URLRequest( url );
addEventListener( IOErrorEvent.IO_ERROR, loadError );
addEventListener( Event.COMPLETE, loadComplete );
addEventListener( LOADED, callbackFn ); // register client as listener
load( urlRequest );
}
private function loadComplete( evt:Event ) : void
{
try // validate XML
{
xml = new XML( evt.target.data );
} catch( e:Error ) {
trace( e.message )
}
dispatchEvent( new Event( LOADED ) ); // dispatch event to client
}
private function loadError( evt:Event ) : void
{
trace( evt );
}
public function createIterator( type : String = null ) : IIterator
{
if ( xml )
{
switch( type )
{
case null:
case HYPERLINKS:
return new HyperlinkIterator( this );
break;
case IMAGES:
return new ImageIterator( this );
break;
default:
throw new ArgumentError('Invalid iterator type specified');
}
} else {
throw new IOError('Cannot create an iterator for a page that is not loaded');
}
}
}
}
The constructor takes two parameters: the URL of the page and the callback function. Passing the callback function, as opposed to requiring the client to register a listener, simplifies things as the client is registered as a listener for the LOADED event in the constructor ( line 26 ). The loadComplete method is interesting in that it first assigns the web page source to a variable of type XML (line 34). If the page source is well formed, a LOADED Event will be dispatched (line 40). If not, an exception will be thrown.
Design Pattern aficionados will notice that the iterator is very similar to the Factory Method Pattern. In fact, the createIterator method is a parameterized factory method that returns a concrete iterator. It takes a type parameter that indicates the type of iterator that will be returned. The factory method prevents the client from instantiating an iterator directly (using the new keyword) and reduces the coupling between the client and iterator. Herein lies the advantage of the factory method. Iterators can be extended, swapped out and otherwise changed without the clients knowledge. There is also an additional check (line 48) to prevent an iterator from being returned unless the page is loaded.
Also note that the aggregate passes itself as an argument when constructing the iterator (lines 54 and 57) - a form of dependency injection. This is why the xml and and url properties are defined as internal. They are visible to the iterator, as it is within the same package, but invisible to the client, which is outside the package. This sums up the unique relationship between aggregates and iterators: they know and depend on each other – signified by the arrows going both ways in the class diagram.
The Concrete Iterators: HyperlinkIterator and ImageIterator
We can now develop the two concrete iterators that traverse the anchor <a> and image <img> elements . The advantage of injecting the concrete aggregate into the iterator through its constructor is evident in the next() method. The iterator has access to all the properties of the concrete aggregate ( not only the collection ). The next() method checks if the hyperlink is an absolute or relative URL and does some cleanup before returning it. Having access to the url property declared in the concrete aggregate is essential in this case.
package com.as3dp.patterns.iterator.scraper
{
import com.as3dp.interfaces.iterator.*;
public class HyperlinkIterator implements IIterator
{
private var col : XHTMLPage;
private var index : Number;
public function HyperlinkIterator( aCollection : XHTMLPage )
{
col = aCollection;
if ( col.xml.namespace('') != undefined )
{
// set the default XML namespace in current scope
default xml namespace = col.xml.namespace('');
}
reset();
}
public function reset() : void
{
index = -1;
}
public function next() : *
{
var url:String = col.xml..a[ ++index ].@href;
// clean up URL
if ( url.charAt( 0 ) == '/' )
{
return col.url + url;
} else {
return url;
}
}
public function hasNext() : Boolean
{
return ( index < col.xml..a.length() - 1 );
}
}
}
package com.as3dp.patterns.iterator.scraper
{
import com.as3dp.interfaces.iterator.*;
public class ImageIterator implements IIterator
{
private var col : XHTMLPage;
private var index : Number;
public function ImageIterator( aCollection : XHTMLPage )
{
col = aCollection;
if ( col.xml.namespace('') != undefined )
{
// set the default XML namespace in current scope
default xml namespace = col.xml.namespace('');
}
reset();
}
public function reset() : void
{
index = -1;
}
public function next() : *
{
var url:String = col.xml..img[ ++index ].@src;
// clean up URL
if ( url.charAt( 0 ) == '/' )
{
return col.url + url;
} else {
return url;
}
}
public function hasNext() : Boolean
{
return ( index < col.xml..img.length() - 1 );
}
}
}
The Client
First, I needed a dummy website to test our little scraper. I never had any use for Apple’s iWeb application before this, but I must say that it creates a nice dummy website in no time flat!
Now we can develop some client code to scrape one of its pages for hyperlinks and image links. Main is the document class for the Flash CS3 document.
package
{
import flash.display.Sprite;
import flash.events.*;
import flash.net.*;
import com.as3dp.interfaces.iterator.*;
import com.as3dp.patterns.iterator.scraper.*;
/**
* Main Class
* @ purpose: Document class for movie
*/
public class Main extends Sprite
{
private var webPage : IIterableAggregate;
public function Main()
{
var url:String = 'http://tester.as3dp.com/Welcome.html';
// create a concrete aggregate - an XHTML page
webPage = new XHTMLPage( url, pageLoaded );
}
private function pageLoaded( evt:Event ) : void
{
trace( "r***URLs of all hyperlinks: r");
// get an iterator for hyperlinks
var itr:IIterator = webPage.createIterator( XHTMLPage.HYPERLINKS );
// iterate over hyperlinks
while ( itr.hasNext() )
{
trace ( itr.next() );
}
trace( "r*** URLs of all images: r");
// get an iterator for images
itr = webPage.createIterator( XHTMLPage.IMAGES );
// iterate over image links
while ( itr.hasNext() )
{
trace ( itr.next() );
}
}
}
}
The client side of web scraping is straightforward. Create an XHTMLPage aggregate by passing the page URL and callback function. Within the callback function create iterators to access hyperlinks and image links. The power of the iterator is exemplified by the simple interface used to traverse and access elements in what could be a very complex collection.
The Output
1 2 3 4 5 6 7 8 9 | ***URLs of all hyperlinks: About%20Me.html Photos.html Movie.html *** URLs of all images: Welcome_files/44143A3_a.jpg |
Source file webscraper-ver1.zip
Limitations! limitations!
More often than not, you will encounter a malformed web page: one that is not structured according to the rules defined in Section 2.1 of the XML 1.0 Recommendation. The primary culprit would be missing tags or incorrectly nested elements. This causes the XML validation check in the XHTMLPage class to throw an exception like the following:
Error #1090: XML parser failure: element is malformed.
The scraper also doesn’t handle redirects. For example, http://tester.as3dp.com redirects to http://tester.as3dp.com/Welcome.html. However, unlike a web browser, our scraper does not check for this. It will happily load the initial page and proclaim that it doesn’t have any images or hyperlinks.
Design decision broke encapsulation!
Remember my original decision to subclass URLLoader to implement XHTMLPage. Well, that was a bad choice as URLLoader has a public property called data that exposes the loaded web page, the collection in this case, to the client. This defeats the original intent of the iterator pattern to hide implementation details of the concrete aggregate. The danger here is that a meddlesome client can mess with the collection while traversing it, and cause all sorts of havoc. Encapsulation is a good thing.
The larger lesson here is a principle that is bandied about quite freely but don’t see too many examples of why it should be the case:
Favor object composition over class inheritance
This is a a great example as to why you should look into object composition even though subclassing would seem to be the easy solution. You are at the mercy of the parent class when you inherit as all its public properties and methods will be exposed. Kind of like standing in the middle of the street in your skivvies. Of course you can override the public methods, but the the simpler solution is composition, especially if the parent class is a beast with many public methods.
So, let’s address some of these issues and build a better scraper. We will not change existing code, but will extend the application to add new concrete aggregates and iterators.
Building a better web scraper – version 2

Web scraper class diagram for version 2
We will now treat the web page source as a text string instead of an xml object. It is still a collection, but we cannot use E4X to traverse and access elements. We will use regular expressions for this purpose. Now we have a concrete aggregate called HTMLPage and two concrete iterators HyperlinkIteratorRegex and ImageIteratorRegex.
The similarity to the Factory Method pattern should be quite evident here as you will see two parallel class hierarchies that depend on each other to make things work. An aggregate and iterator set that uses E4X and another set that uses Regular Expressions to do its magic.
Slaying the Regular Expression Dragon
Version 1 helped me realize very quickly that URLs can take many forms, all of which are quite excruciating to deal with. Much time was spent trying to figure out how to parse URLs and decompose them into bare components. Without turning this already lengthy post into a discussion on regular expressions, I’m going to list the resources that were most helpful.
The tools:
- RegExr the online regular expression testing tool ( and its Adobe Air variant RegExr Desktop ) developed by Grant Skinner. This tool is cool in so many ways. The primary one is that it explains in plain english what each miniscule part of the regular expression is doing and what it matches.
- RegExhibit ( OSX only ) is very useful to test multiple matches and groups of tokens.
Tutorials:
- Learning to use Regular Expressions, first published by IBM DeveloperWorks, updated by David Mertz. This taught me one of the most important lessons in pattern matching: reformulating the problem. Instead of asking “what am I trying to match” ask yourself “what do I need to avoid matching.”
- Parsing Incomplete or Malformed URLs with Regular Expressions, a definitive resource by Vince Filby.
The first step was to develop a utility class to parse URLs. I was flailing around looking for a URL parsing regular expression before finding Vince Filby’s writeup. His method is to treat the URL as a set of components:
foo://example.com:8042/over/there?name=ferret#nose _/ ______________/_________/ _________/ __/ | | | | | scheme authority path query fragment
And construct a regular expression that captures each component using capture groups.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(?([^#]*))?(#(.*))?
It was pretty simple to develop a utility class based on this regular expression. The parentDirPath() method returns the parent directory of a given URL. This is required to figure out the fully qualified URL of relative URLs on web pages.
package com.as3dp.utils
{
public class ParseURL
{
private var components : Object;
public function ParseURL( anURL:String )
{
var pattern:RegExp = /^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(?([^#]*))?(#(.*))?/i;
components = pattern.exec( anURL );
if ( ! components ) throw new URIError( 'Malformed URL - cannot parse' );
}
public function get input():String
{
return components.input;
}
public function get scheme():String
{
return components[ 2 ];
}
public function get authority():String
{
return components[ 4 ];
}
public function get path():String
{
return components[ 5 ];
}
public function get query():String
{
return components[ 7 ];
}
public function get fragment():String
{
return components[ 9 ];
}
public function get parentDirPath():String
{
var dirPat:RegExp = /.*/(?=[^/.]+.[^/.]+$)/i;
var dircomp:Object = dirPat.exec( components[ 5 ] );
var dirPath:String = ( dircomp ) ? dircomp[ 0 ] : components[ 5 ];
return ( dirPath.search( //$/ ) >= 0 )? dirPath : dirPath + '/';
}
public function toString():String
{
return String( components );
}
}
}
New class hierarchy for version 2

Flash CS3 Project Panel showing the new class hierarchy
The new Concrete Aggregate: HTMLPage
The primary difference here is that HTMLPage treats the web page as a text string as opposed to an XML structure. I also squashed the encapsulation issue by subclassing EventDispatcher and composing an URLLoader object to load the URL.
package com.as3dp.patterns.iterator.scraper
{
import flash.errors.*;
import flash.events.*;
import flash.net.*;
import com.as3dp.interfaces.iterator.*;
import com.as3dp.patterns.iterator.scraper.*;
public class HTMLPage extends EventDispatcher implements IIterableAggregate
{
public static const HYPERLINKS :String = 'Iterate over hyperlinks';
public static const IMAGES :String = 'Iterate over images';
public static const LOADED :String = 'page loaded';
internal var htmlsrc : String = null;
internal var url : String;
public function HTMLPage( pageURL:String, callbackFn:Function )
{
url = pageURL;
var urlRequest:URLRequest = new URLRequest( url );
var loader:URLLoader = new URLLoader();
loader.addEventListener( IOErrorEvent.IO_ERROR, ioErrorHandler );
loader.addEventListener( Event.COMPLETE, loadComplete );
loader.load( urlRequest );
addEventListener( LOADED, callbackFn ); // register client as listener
}
private function loadComplete( evt:Event ) : void
{
htmlsrc = evt.target.data;
dispatchEvent( new Event( LOADED ) ); // dispatch event to client
}
private function ioErrorHandler( evt:IOErrorEvent ):void {
trace( 'ioErrorHandler: ', evt );
}
public function createIterator( type : String = null ) : IIterator
{
if ( htmlsrc )
{
switch( type )
{
case null:
case HYPERLINKS:
return new HyperlinkIteratorRegex( this );
break;
case IMAGES:
return new ImageIteratorRegex( this );
break;
default:
throw new ArgumentError('Invalid iterator type specified');
}
} else {
throw new IOError('Cannot create an iterator for a page that is not loaded');
}
}
}
}
Two new concrete iterators: HyperlinkIteratorRegex and ImageIteratorRegex
The new iterators are very similar to each other. The constructor builds an array of all URLs using regular expression pattern matching. Note the repeated application of the exec method in the RegExp class to find multiple matches. The next() method utilizes the ParseURL utility class to convert all relative and absolute URLs into fully qualified ones.
package com.as3dp.patterns.iterator.scraper
{
import com.as3dp.interfaces.iterator.*;
import com.as3dp.patterns.iterator.scraper.*;
import com.as3dp.utils.*;
public class HyperlinkIteratorRegex implements IIterator
{
private var col : HTMLPage;
private var aList : Array;
private var index : Number;
private var parsedURL : ParseURL;
public function HyperlinkIteratorRegex( aCollection : HTMLPage )
{
col = aCollection;
parsedURL = new ParseURL( col.url );
var pattern:RegExp = "/= 0 ) // absolute URL
{
return parsedURL.scheme + '://' + parsedURL.authority + hyperlink;
}
else // relative URL
{
return parsedURL.scheme + '://' + parsedURL.authority +
parsedURL.parentDirPath + hyperlink;
}
}
public function hasNext() : Boolean
{
return ( index < aList.length - 1 );
}
}
}
The new Client
The client can now do some serious scraping. The pageLoaded callback method recursively traverses each hyperlink and harvests the images on each web page.
package
{
import flash.display.Sprite;
import flash.events.*;
import flash.net.*;
import com.as3dp.interfaces.iterator.*;
import com.as3dp.patterns.iterator.scraper.*;
/**
* Main Class
* @ purpose: Document class for movie
*/
public class Main extends Sprite
{
private var url:String = 'http://tester.as3dp.com/Welcome.html';
private static const MAXPAGES:uint = 100; // max # of web pages to traverse
private static var webPageList:Array = [];
private static var imageList:Array = [];
public function Main()
{
// load a concrete aggregate - an HTML page
new HTMLPage( url, pageLoaded );
}
/**
* pageLoaded method
* @ purpose: Event handler method - called when HTML page is loaded
*/
private function pageLoaded( evt:Event ) : void
{
var webPage:IIterableAggregate = IIterableAggregate( evt.target );
// get an iterator for hyperlinks
var linkItr:IIterator = webPage.createIterator( XHTMLPage.HYPERLINKS );
// iterate over the hyperlinks
while ( linkItr.hasNext() )
{
var hyperlink:String = linkItr.next();
if ( ( ! inList( hyperlink, webPageList )) && ( webPageList.length <= MAXPAGES ) )
{
// get an iterator for images
var imageItr:IIterator = webPage.createIterator( XHTMLPage.IMAGES );
while ( imageItr.hasNext() )
{
// iterate over image links
var imageLink:String = imageItr.next();
if ( ! inList( imageLink, imageList ) )
{
trace ( imageLink );
}
}
new HTMLPage( hyperlink, pageLoaded ); // load new HTML page
}
}
}
/**
* inList Utility Method
* @ purpose: Check if passed String (URL) is in the passed Array
* If not, push it into the Array.
*/
private function inList ( anURL:String, aList:Array ) : Boolean
{
for each (var item in aList)
{
if ( anURL == item ) return true;
}
aList.push( anURL );
return false;
}
}
}
Limiting the recursive traversal is important. Ideally, the web scraper should be limited to scraping pages in a single host or to stop at a particular hierarchical depth on a site. However, I took the easy way out by limiting by the number of web pages treaversed. The inList method serves an important purpose by keeping track of all hyperlinks and image links. This is to prevent traversal of pages already accessed and also to stop the recursion when when the MAXPAGES limit is reached.
Output
1 2 3 4 5 6 | http://tester.as3dp.com/Welcome_files/44143A3_a.jpg
http://tester.as3dp.com/Movie_files/movie_stripe.jpg
http://tester.as3dp.com/Photos_files/AA041434_a.jpg
http://tester.as3dp.com/Photos_files/image.png
ioErrorHandler: [IOErrorEvent type="ioError" bubbles=false cancelable=false eventPhase=2
text="Error #2032: Stream Error. URL: javascript:void(0)"] |
Now it traverses the whole site. But there is an error as one of the hyperlinks is a javascript link. Looks like we have to modify the regular expression to only match <a> tags that don’t have onclick attributes. I’ll leave that for one of you regular expression wizards. I’m sure there are more errors due to edge cases.
I wanted to walk through a couple of iterations of developing an application that utilizes the iterator pattern. We managed to discuss some important OOP concepts along the way, which is a bonus. This is a more functional example than the ones we typically develop on this blog. It’s always a hit-or-miss as to whether the detail and the length of the post get in the way of understanding the core pattern – comments welcome!
References:
Gamma, Erich; Richard Helm, Ralph Johnson, and John Vlissides (1995). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
Related posts:

Bill Sanders
Recent Comments