#+BEGIN_QUOTE
  I've chosen to write about a feed aggregator because it's one of the
  things I'm working on at [[http://rtgi.eu/][RTGI]] (along with web
  crawler work, gluing data together with a search engine, etc.).
#+END_QUOTE

For the feed aggregator, I will use *Moose*, *KiokuDB* and our
*DBIx::Class* schema. Before we get started, I'd like to give a short
introduction to Moose and KiokuDB.

Moose is "a postmodern object system for Perl 5". Moose brings some
really nice concepts to OO Perl: roles, better syntax, a "free"
constructor and destructor, and more. If you don't already know Moose,
check [[http://www.iinteractive.com/moose/][its website]] for more
information.

KiokuDB is a Moose based frontend to various data stores [...] Its
purpose is to provide persistence for "regular" objects with as little
effort as possible, without sacrificing control over how persistence is
actually done, especially for harder to serialize objects. [...] KiokuDB
is meant to solve two related persistence problems:

- Storing arbitrary objects without changing their class definitions or
  worrying about schema details, and without needing to conform to the
  limitations of a relational model.
- Persisting arbitrary objects in a way that is compatible with
  existing data/code (for example interoperating with another app using
  *CouchDB* with *JSPON* semantics).

I will store each feed entry in KiokuDB. I could have chosen to store
the entries as plain text, in JSON files, in my DBIx::Class model, etc.
But since I want to show you new and modern stuff, I will store them in
KiokuDB using the BDB backend.

*** And now for something completely different, code!

First, we will create a base module named *MyAggregator*.

#+BEGIN_EXAMPLE
  % module-setup MyAggregator
#+END_EXAMPLE

We will now edit *lib/MyAggregator.pm* and write the following code:

#+BEGIN_SRC perl
  package MyAggregator;
  use Moose;
  1;
#+END_SRC

As you can see, there is no =use strict; use warnings= here: Moose
automatically turns on these pragmas. We don't have to write the =new=
method either, as it's provided by Moose.

For parsing feeds, we will use *XML::Feed*, wrapped in a role. If you
don't know what roles are:

#+BEGIN_QUOTE
  Roles have two primary purposes: as interfaces, and as a means of code
  reuse. Usually, a role encapsulates some piece of behavior or state
  that can be shared between classes. It is important to understand that
  roles are not classes. You cannot inherit from a role, and a role
  cannot be instantiated.
#+END_QUOTE

So, we will write our first role, *lib/MyAggregator/Roles/Feed.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Roles::Feed;
  use Moose::Role;
  use XML::Feed;
  use feature 'say';

  sub feed_parser {
      my ($self, $content) = @_;

      # XML::Feed->parse can either die or simply return undef and set
      # errstr, so check the result rather than just $@
      my $feed = eval { XML::Feed->parse($content) };
      unless ($feed) {
          my $error = XML::Feed->errstr || $@;
          say "error while parsing feed: $error";
      }
      $feed;
  }
  1;
#+END_SRC

This one is pretty simple: it takes some content, tries to parse it, and
returns an XML::Feed object. If the feed can't be parsed, the error is
printed and the result is undef.
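If you want to try the role on its own, here is a minimal sketch; the
=FeedDemo= package and the inline RSS document are throwaway examples
(run it from the project directory with =-Ilib=). We pass the XML as a
scalar reference, which is one of the forms =XML::Feed->parse= accepts:

#+BEGIN_SRC perl
  package FeedDemo;
  use Moose;
  with 'MyAggregator::Roles::Feed';

  package main;
  use feature 'say';

  # a tiny, made-up RSS document just for testing the role
  my $xml
      = '<?xml version="1.0"?><rss version="2.0"><channel>'
      . '<title>demo feed</title><link>http://example.com/</link>'
      . '<description>just a demo</description>'
      . '<item><title>first post</title><link>http://example.com/1</link></item>'
      . '</channel></rss>';

  my $demo = FeedDemo->new;
  my $feed = $demo->feed_parser(\$xml);
  if ($feed) {
      say $feed->title;                   # demo feed
      say $_->link for $feed->entries;    # http://example.com/1
  }
#+END_SRC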
Now, a second role will fetch the feed and do some basic caching,
*lib/MyAggregator/Roles/UserAgent.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Roles::UserAgent;
  use Moose::Role;
  use LWP::UserAgent;
  use HTTP::Request;
  use Cache::FileCache;
  use URI;

  has 'ua' => (
      is      => 'ro',
      isa     => 'Object',
      lazy    => 1,
      default => sub { LWP::UserAgent->new(agent => 'MyUberAgent') },
  );
  has 'cache' => (
      is      => 'rw',
      isa     => 'Cache::FileCache',
      lazy    => 1,
      default => sub { Cache::FileCache->new({namespace => 'myaggregator'}) },
  );

  sub fetch_feed {
      my ($self, $url) = @_;

      my $req = HTTP::Request->new(GET => URI->new($url));
      my $ref = $self->cache->get($url);
      if (defined $ref && $ref->{LastModified} ne '') {
          $req->header('If-Modified-Since' => $ref->{LastModified});
      }

      my $res = $self->ua->request($req);
      $self->cache->set(
          $url,
          {   ETag         => $res->header('Etag')          || '',
              LastModified => $res->header('Last-Modified') || '',
          },
          '5 days',
      );
      $res;
  }
  1;
#+END_SRC

This role has two attributes: *ua* and *cache*. The *ua* attribute is
our user agent; =lazy= means it won't be constructed until the first
time I call =$self->ua=.

I use *Cache::FileCache* for basic caching so I don't fetch or parse a
feed unnecessarily, and I use the ETag and Last-Modified headers to
check the validity of my cache.

The only method in this role is *fetch\_feed*. It fetches a URL, sending
an If-Modified-Since header when we already have a cached Last-Modified
value, and returns an *HTTP::Response* object.

Now, I create an Entry class in *lib/MyAggregator/Entry.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Entry;
  use Moose;
  use Digest::SHA qw(sha256_hex);

  has 'author'  => (is => 'rw', isa => 'Str');
  has 'content' => (is => 'rw', isa => 'Str');
  has 'title'   => (is => 'rw', isa => 'Str');
  has 'id'      => (is => 'rw', isa => 'Str');
  has 'date'    => (is => 'rw', isa => 'Object');
  has 'permalink' => (
      is       => 'rw',
      isa      => 'Str',
      required => 1,
      trigger  => sub {
          my $self = shift;
          $self->id(sha256_hex $self->permalink);
      },
  );
  1;
#+END_SRC

Here the *permalink* attribute has a trigger: each entry has a unique
*id*, built from the SHA-256 hash of the *permalink*. So whenever we set
the *permalink*, the *id* is automatically updated.

We can now change our *MyAggregator* module like this:

#+BEGIN_SRC perl
  package MyAggregator;
  use feature ':5.10';
  use MyModel;
  use Moose;
  use MyAggregator::Entry;
  use KiokuDB;
  use Digest::SHA qw(sha256_hex);
  with 'MyAggregator::Roles::UserAgent', 'MyAggregator::Roles::Feed';

  has 'context' => (is => 'ro', isa => 'HashRef');
  has 'schema' => (
      is      => 'ro',
      isa     => 'Object',
      lazy    => 1,
      default => sub { MyModel->connect($_[0]->context->{dsn}) },
  );
  has 'kioku' => (
      is      => 'rw',
      isa     => 'Object',
      lazy    => 1,
      default => sub {
          my $self = shift;
          KiokuDB->connect($self->context->{kioku_dir}, create => 1);
      },
  );

  sub run {
      my $self = shift;

      my $feeds = $self->schema->resultset('Feed')->search();
      while (my $feed = $feeds->next) {
          my $res = $self->fetch_feed($feed->url);
          if (!$res || !$res->is_success) {
              say "can't fetch " . $feed->url;
          }
          else {
              $self->dedupe_feed($res, $feed->id);
          }
      }
  }

  sub dedupe_feed {
      my ($self, $res, $feed_id) = @_;

      my $feed = $self->feed_parser(\$res->content);
      return if (!$feed);
      foreach my $entry ($feed->entries) {
          next
              if $self->schema->resultset('Entry')
                  ->find(sha256_hex $entry->link);
          my $meme = MyAggregator::Entry->new(
              permalink => $entry->link,
              title     => $entry->title,
              author    => $entry->author,
              date      => $entry->issued,
              content   => $entry->content->body,
          );

          $self->kioku->txn_do(
              scope => 1,
              body  => sub {
                  $self->kioku->insert($meme->id => $meme);
              }
          );
          $self->schema->txn_do(
              sub {
                  $self->schema->resultset('Entry')->create(
                      {   entryid   => $meme->id,
                          permalink => $meme->permalink,
                          feedid    => $feed_id,
                      }
                  );
              }
          );
      }
  }
  1;
#+END_SRC

- the =with= function composes roles into a class, so my MyAggregator
  class has the fetch\_feed and feed\_parser methods, and all the
  attributes of our roles
- *context* is a HashRef that contains the configuration
- *schema* is our MyModel schema
- *kioku* is a connection to our KiokuDB backend

There are two methods in this class: =run= and =dedupe_feed=.

The =run= method gets the list of feeds via =search= on the Feed
resultset. For each feed returned, we try to fetch it, and if the fetch
succeeds we dedupe its entries. To dedupe, we check whether the
permalink is already in the database via =find= on the Entry resultset.
If we already have this entry, we skip it and move on to the next one.
If it's a new entry, we create a *MyAggregator::Entry* object with the
content, date, title and so on, store this object in KiokuDB inside a
transaction, and create a new entry in the MyModel database, also inside
a transaction.

And to run this, a little script:

#+BEGIN_SRC perl
  use strict;
  use warnings;
  use MyAggregator;
  use YAML::Syck;

  my $agg = MyAggregator->new(context => LoadFile shift);
  $agg->run;
#+END_SRC

So we can run our aggregator like this:
=perl bin/aggregator.pl conf.yaml=
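The =context= hash only needs the two keys used above: =dsn= for
=MyModel->connect= and =kioku_dir= for =KiokuDB->connect=. A *conf.yaml*
could look something like this (example values only, here assuming an
SQLite database for the schema and the Berkeley DB backend for KiokuDB):

#+BEGIN_EXAMPLE
  # conf.yaml -- example values, adjust to your setup
  dsn: dbi:SQLite:dbname=myaggregator.db
  kioku_dir: bdb:dir=kioku-data
#+END_EXAMPLE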
And it's done :) We now have a really basic aggregator. If you want to
improve it, start with the dedupe process, using the permalink, the date
and/or the title, as the current one is far too simplistic. In the next
article we will write some tests for this aggregator using Test::Class.

Big thanks to [[http://bunniesincyberspace.wordpress.com/][tea]] and
[[http://code.google.com/p/tinyaml/][blob]] for reviewing and fixing my
broken English in the first two parts.

[[http://git.lumberjaph.net/p5-ironman-myaggregator.git/][The code is
available on my git server]].

Parts 3 and 4 next week.
