#+BEGIN_QUOTE
  I've chosen to write about a feed aggregator because it's one of the
  things I'm working on at [[http://rtgi.eu/][RTGI]] (along with web
  crawler work, gluing data together with a search engine, etc.).
#+END_QUOTE

For the feed aggregator, I will use *Moose*, *KiokuDB* and our
*DBIx::Class* schema. Before we get started, I'd like to give a short
introduction to Moose and KiokuDB.

Moose is "a postmodern object system for Perl 5". Moose brings some
really nice concepts to OO Perl: roles, better syntax, a "free"
constructor and destructor, and more. If you don't already know Moose,
check [[http://www.iinteractive.com/moose/][its website]] for more
information.

KiokuDB is a Moose based frontend to various data stores [...] Its
purpose is to provide persistence for "regular" objects with as little
effort as possible, without sacrificing control over how persistence is
actually done, especially for harder to serialize objects. [...] KiokuDB
is meant to solve two related persistence problems:

- Storing arbitrary objects without changing their class definitions or
  worrying about schema details, and without needing to conform to the
  limitations of a relational model.
- Persisting arbitrary objects in a way that is compatible with
  existing data/code (for example interoperating with another app using
  *CouchDB* with *JSPON* semantics).

I will store each feed entry in KiokuDB. I could have chosen to store
the entries as plain text, in JSON files, in my DBIx::Class model, etc.
But since I want to show you new and modern stuff, I will store them in
KiokuDB using the BDB backend.

*** And now for something completely different, code!

First, we will create a base module named *MyAggregator*.

#+BEGIN_EXAMPLE
  % module-setup MyAggregator
#+END_EXAMPLE

We will now edit *lib/MyAggregator.pm* and write the following code:

#+BEGIN_SRC perl
  package MyAggregator;
  use Moose;
  1;
#+END_SRC

As you can see, there is no =use strict; use warnings= here: Moose
automatically turns on these pragmas. We don't have to write the =new=
method either, as it's provided by Moose.

For parsing feeds, we will use *XML::Feed*, wrapped in a role. If you
don't know what roles are:

#+BEGIN_QUOTE
  Roles have two primary purposes: as interfaces, and as a means of code
  reuse. Usually, a role encapsulates some piece of behavior or state
  that can be shared between classes. It is important to understand that
  roles are not classes. You cannot inherit from a role, and a role
  cannot be instantiated.
#+END_QUOTE

So, we will write our first role, *lib/MyAggregator/Roles/Feed.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Roles::Feed;
  use Moose::Role;
  use XML::Feed;
  use feature 'say';

  sub feed_parser {
      my ($self, $content) = @_;

      # XML::Feed->parse can either die or simply return undef and set
      # errstr, so check the result rather than just $@
      my $feed = eval { XML::Feed->parse($content) };
      unless ($feed) {
          my $error = XML::Feed->errstr || $@;
          say "error while parsing feed: $error";
      }
      $feed;
  }
  1;
#+END_SRC

This one is pretty simple: it takes some content, tries to parse it, and
returns an XML::Feed object. If the feed can't be parsed, the error is
printed and the result is undef.
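If you want to try the role on its own, here is a minimal sketch; the
=FeedDemo= package and the inline RSS document are throwaway examples
(run it from the project directory with =-Ilib=). We pass the XML as a
scalar reference, which is one of the forms =XML::Feed->parse= accepts:

#+BEGIN_SRC perl
  package FeedDemo;
  use Moose;
  with 'MyAggregator::Roles::Feed';

  package main;
  use feature 'say';

  # a tiny, made-up RSS document just for testing the role
  my $xml
      = '<?xml version="1.0"?><rss version="2.0"><channel>'
      . '<title>demo feed</title><link>http://example.com/</link>'
      . '<description>just a demo</description>'
      . '<item><title>first post</title><link>http://example.com/1</link></item>'
      . '</channel></rss>';

  my $demo = FeedDemo->new;
  my $feed = $demo->feed_parser(\$xml);
  if ($feed) {
      say $feed->title;                   # demo feed
      say $_->link for $feed->entries;    # http://example.com/1
  }
#+END_SRC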
Now, a second role will fetch the feed and do some basic caching,
*lib/MyAggregator/Roles/UserAgent.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Roles::UserAgent;
  use Moose::Role;
  use LWP::UserAgent;
  use HTTP::Request;
  use Cache::FileCache;
  use URI;

  has 'ua' => (
      is      => 'ro',
      isa     => 'Object',
      lazy    => 1,
      default => sub { LWP::UserAgent->new(agent => 'MyUberAgent') },
  );
  has 'cache' => (
      is      => 'rw',
      isa     => 'Cache::FileCache',
      lazy    => 1,
      default => sub { Cache::FileCache->new({namespace => 'myaggregator'}) },
  );

  sub fetch_feed {
      my ($self, $url) = @_;

      my $req = HTTP::Request->new(GET => URI->new($url));
      my $ref = $self->cache->get($url);
      if (defined $ref && $ref->{LastModified} ne '') {
          $req->header('If-Modified-Since' => $ref->{LastModified});
      }

      my $res = $self->ua->request($req);
      $self->cache->set(
          $url,
          {   ETag         => $res->header('Etag')          || '',
              LastModified => $res->header('Last-Modified') || '',
          },
          '5 days',
      );
      $res;
  }
  1;
#+END_SRC

This role has two attributes: *ua* and *cache*. The *ua* attribute is
our user agent; =lazy= means it won't be constructed until the first
time I call =$self->ua=.

I use *Cache::FileCache* for basic caching so I don't fetch or parse a
feed unnecessarily, and I use the ETag and Last-Modified headers to
check the validity of my cache.

The only method in this role is *fetch\_feed*. It fetches a URL, sending
an If-Modified-Since header when we already have a cached Last-Modified
value, and returns an *HTTP::Response* object.

Now, I create an Entry class in *lib/MyAggregator/Entry.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Entry;
  use Moose;
  use Digest::SHA qw(sha256_hex);

  has 'author'  => (is => 'rw', isa => 'Str');
  has 'content' => (is => 'rw', isa => 'Str');
  has 'title'   => (is => 'rw', isa => 'Str');
  has 'id'      => (is => 'rw', isa => 'Str');
  has 'date'    => (is => 'rw', isa => 'Object');
  has 'permalink' => (
      is       => 'rw',
      isa      => 'Str',
      required => 1,
      trigger  => sub {
          my $self = shift;
          $self->id(sha256_hex $self->permalink);
      },
  );
  1;
#+END_SRC

Here the *permalink* attribute has a trigger: each entry has a unique
*id*, built from the SHA-256 hash of the *permalink*. So whenever we set
the *permalink*, the *id* is automatically updated.

We can now change our *MyAggregator* module like this:

#+BEGIN_SRC perl
  package MyAggregator;
  use feature ':5.10';
  use MyModel;
  use Moose;
  use MyAggregator::Entry;
  use KiokuDB;
  use Digest::SHA qw(sha256_hex);
  with 'MyAggregator::Roles::UserAgent', 'MyAggregator::Roles::Feed';

  has 'context' => (is => 'ro', isa => 'HashRef');
  has 'schema' => (
      is      => 'ro',
      isa     => 'Object',
      lazy    => 1,
      default => sub { MyModel->connect($_[0]->context->{dsn}) },
  );
  has 'kioku' => (
      is      => 'rw',
      isa     => 'Object',
      lazy    => 1,
      default => sub {
          my $self = shift;
          KiokuDB->connect($self->context->{kioku_dir}, create => 1);
      },
  );

  sub run {
      my $self = shift;

      my $feeds = $self->schema->resultset('Feed')->search();
      while (my $feed = $feeds->next) {
          my $res = $self->fetch_feed($feed->url);
          if (!$res || !$res->is_success) {
              say "can't fetch " . $feed->url;
          }
          else {
              $self->dedupe_feed($res, $feed->id);
          }
      }
  }

  sub dedupe_feed {
      my ($self, $res, $feed_id) = @_;

      my $feed = $self->feed_parser(\$res->content);
      return if (!$feed);
      foreach my $entry ($feed->entries) {
          next
              if $self->schema->resultset('Entry')
                  ->find(sha256_hex $entry->link);
          my $meme = MyAggregator::Entry->new(
              permalink => $entry->link,
              title     => $entry->title,
              author    => $entry->author,
              date      => $entry->issued,
              content   => $entry->content->body,
          );

          $self->kioku->txn_do(
              scope => 1,
              body  => sub {
                  $self->kioku->insert($meme->id => $meme);
              }
          );
          $self->schema->txn_do(
              sub {
                  $self->schema->resultset('Entry')->create(
                      {   entryid   => $meme->id,
                          permalink => $meme->permalink,
                          feedid    => $feed_id,
                      }
                  );
              }
          );
      }
  }
  1;
#+END_SRC

- the =with= function composes roles into a class, so my MyAggregator
  class has the fetch\_feed and feed\_parser methods, and all the
  attributes of our roles
- *context* is a HashRef that contains the configuration
- *schema* is our MyModel schema
- *kioku* is a connection to our KiokuDB backend

There are two methods in this class: =run= and =dedupe_feed=.

The =run= method gets the list of feeds via =search= on the Feed
resultset. For each feed returned, we try to fetch it, and if the fetch
succeeds we dedupe its entries. To dedupe, we check whether the
permalink is already in the database via =find= on the Entry resultset.
If we already have this entry, we skip it and move on to the next one.
If it's a new entry, we create a *MyAggregator::Entry* object with the
content, date, title and so on, store this object in KiokuDB inside a
transaction, and create a new entry in the MyModel database, also inside
a transaction.

And to run this, a little script:

#+BEGIN_SRC perl
  use strict;
  use warnings;
  use MyAggregator;
  use YAML::Syck;

  my $agg = MyAggregator->new(context => LoadFile shift);
  $agg->run;
#+END_SRC

So we can run our aggregator like this:
=perl bin/aggregator.pl conf.yaml=
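The =context= hash only needs the two keys used above: =dsn= for
=MyModel->connect= and =kioku_dir= for =KiokuDB->connect=. A *conf.yaml*
could look something like this (example values only, here assuming an
SQLite database for the schema and the Berkeley DB backend for KiokuDB):

#+BEGIN_EXAMPLE
  # conf.yaml -- example values, adjust to your setup
  dsn: dbi:SQLite:dbname=myaggregator.db
  kioku_dir: bdb:dir=kioku-data
#+END_EXAMPLE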
And it's done :) We now have a really basic aggregator. If you want to
improve it, start with the dedupe process, using the permalink, the date
and/or the title, as the current one is far too simplistic. In the next
article we will write some tests for this aggregator using Test::Class.

Big thanks to [[http://bunniesincyberspace.wordpress.com/][tea]] and
[[http://code.google.com/p/tinyaml/][blob]] for reviewing and fixing my
broken English in the first two parts.

[[http://git.lumberjaph.net/p5-ironman-myaggregator.git/][The code is
available on my git server]].

Parts 3 and 4 next week.
