#+BEGIN_QUOTE
  I've chosen to write about a feed aggregator because it's one of the
  things I'm working on at [[http://rtgi.eu/][RTGI]] (along with web
  crawler work, gluing data together with a search engine, etc.)
#+END_QUOTE

For the feed aggregator, I will use *Moose*, *KiokuDB* and our
*DBIx::Class* schema. Before we get started, I'd like to give a short
introduction to Moose and KiokuDB.

Moose is "a postmodern object system for Perl 5". Moose brings some
really nice concepts to OO Perl: roles, a better syntax, a "free"
constructor and destructor, and more. If you don't already know Moose,
check [[http://www.iinteractive.com/moose/][it here]] for more
information.

KiokuDB is a Moose based frontend to various data stores [...] Its
purpose is to provide persistence for "regular" objects with as little
effort as possible, without sacrificing control over how persistence is
actually done, especially for harder to serialize objects. [...] KiokuDB
is meant to solve two related persistence problems:

- Store arbitrary objects without changing their class definitions or
  worrying about schema details, and without needing to conform to the
  limitations of a relational model.
- Persisting arbitrary objects in a way that is compatible with
  existing data/code (for example interoperating with another app using
  *CouchDB* with *JSPON* semantics).

I will store each feed entry in KiokuDB. I could have chosen to store
them as plain text in JSON files, in my DBIx::Class model, etc. But as I
want to show you new and modern stuff, I will store them in KiokuDB
using its DBI backend.

*** And now for something completely different, code!

First, we will create a base module named *MyAggregator*.

#+BEGIN_EXAMPLE
  % module-setup MyAggregator
#+END_EXAMPLE

We will now edit *lib/MyAggregator.pm* and write the following code:

#+BEGIN_SRC perl
  package MyAggregator;
  use Moose;
  1;
#+END_SRC

As you can see, there is no =use strict; use warnings= here: Moose
automatically turns on these pragmas. We don't have to write a =new=
method either, as it's provided by Moose.

For parsing feeds, we will use *XML::Feed*, and we will use it in a
role. If you don't know what roles are:

#+BEGIN_QUOTE
  Roles have two primary purposes: as interfaces, and as a means of code
  reuse. Usually, a role encapsulates some piece of behavior or state
  that can be shared between classes. It is important to understand that
  roles are not classes. You cannot inherit from a role, and a role
  cannot be instantiated.
#+END_QUOTE

So, we will write our first role, *lib/MyAggregator/Roles/Feed.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Roles::Feed;
  use Moose::Role;
  use XML::Feed;
  use feature 'say';

  sub feed_parser {
      my ($self, $content) = @_;

      # XML::Feed->parse dies on some errors and returns undef on
      # others, so check both the return value and $@
      my $feed = eval { XML::Feed->parse($content) };
      if (!$feed) {
          my $error = XML::Feed->errstr || $@;
          say "error while parsing feed: $error";
      }
      $feed;
  }
  1;
#+END_SRC

This one is pretty simple. It takes a reference to the feed content,
tries to parse it, and returns an XML::Feed object. If the feed can't be
parsed, the error is printed and the result is undef.
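
Since a role cannot be instantiated, you have to compose it into a class
to try it out. Here is a minimal sketch, assuming a hypothetical =Demo=
class and a local =feed.xml= file:

#+BEGIN_SRC perl
  # compose the role into a throwaway class just to exercise it
  package Demo;
  use Moose;
  with 'MyAggregator::Roles::Feed';

  package main;
  use feature 'say';

  # slurp a local feed file (hypothetical path) and hand a reference
  # to its content to feed_parser, as the aggregator will do later
  my $content = do {
      local $/;
      open my $fh, '<', 'feed.xml' or die "can't open feed.xml: $!";
      <$fh>;
  };
  my $feed = Demo->new->feed_parser(\$content);
  say $feed->title if $feed;
#+END_SRC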

Now, a second role will be used to fetch the feed and do some basic
caching, *lib/MyAggregator/Roles/UserAgent.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Roles::UserAgent;
  use Moose::Role;
  use LWP::UserAgent;
  use HTTP::Request;
  use Cache::FileCache;
  use URI;

  has 'ua' => (
      is      => 'ro',
      isa     => 'Object',
      lazy    => 1,
      default => sub { LWP::UserAgent->new(agent => 'MyUberAgent'); }
  );
  has 'cache' => (
      is      => 'rw',
      isa     => 'Cache::FileCache',
      lazy    => 1,
      default =>
          sub { Cache::FileCache->new({namespace => 'myaggregator',}); }
  );

  sub fetch_feed {
      my ($self, $url) = @_;

      my $req = HTTP::Request->new(GET => URI->new($url));

      # send the cached validators so the server can answer 304
      my $ref = $self->cache->get($url);
      if (defined $ref) {
          $req->header('If-Modified-Since' => $ref->{LastModified})
              if $ref->{LastModified} ne '';
          $req->header('If-None-Match' => $ref->{ETag})
              if $ref->{ETag} ne '';
      }

      my $res = $self->ua->request($req);
      $self->cache->set(
          $url,
          {   ETag         => $res->header('ETag')          || '',
              LastModified => $res->header('Last-Modified') || ''
          },
          '5 days',
      );
      $res;
  }
  1;
#+END_SRC

This role has two attributes: *ua* and *cache*. The *ua* attribute is
our user agent. =lazy= means the object won't be constructed until the
attribute is first used, here when =$self->ua->request= is called.

I use *Cache::FileCache* for basic caching so I don't fetch or parse a
feed unnecessarily, and I use the ETag and Last-Modified headers to
check the validity of my cache.

The only method of this role is *fetch\_feed*. It sends a conditional
GET for the URL, using the cached validators when we have them, and
returns an *HTTP::Response* object.

Now, I create an Entry class in *lib/MyAggregator/Entry.pm*:

#+BEGIN_SRC perl
  package MyAggregator::Entry;
  use Moose;
  use Digest::SHA qw(sha256_hex);

  has 'author'  => (is => 'rw', isa => 'Str');
  has 'content' => (is => 'rw', isa => 'Str');
  has 'title'   => (is => 'rw', isa => 'Str');
  has 'id'      => (is => 'rw', isa => 'Str');
  has 'date'    => (is => 'rw', isa => 'Object');
  has 'permalink' => (
      is       => 'rw',
      isa      => 'Str',
      required => 1,

      # fires whenever permalink is set, including from the constructor
      trigger  => sub {
          my $self = shift;
          $self->id(sha256_hex $self->permalink);
      }
  );
  1;
#+END_SRC

Here the *permalink* attribute has a trigger: each entry has a unique
*id*, the SHA-256 digest of its *permalink*. So whenever we set the
*permalink*, the *id* is set automatically.
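
To see the trigger in action, here is a small sketch (the permalink
value is hypothetical):

#+BEGIN_SRC perl
  use MyAggregator::Entry;
  use feature 'say';

  my $entry = MyAggregator::Entry->new(
      permalink => 'http://example.com/posts/42',    # hypothetical URL
      title     => 'A post',
  );

  # the trigger fired during construction, so id is already set
  say $entry->id;    # SHA-256 hex digest of the permalink
#+END_SRC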

We can now change our *MyAggregator* module like this:

#+BEGIN_SRC perl
  package MyAggregator;
  use feature ':5.10';
  use MyModel;
  use Moose;
  use MyAggregator::Entry;
  use KiokuDB;
  use Digest::SHA qw(sha256_hex);
  with 'MyAggregator::Roles::UserAgent', 'MyAggregator::Roles::Feed';

  has 'context' => (is => 'ro', isa => 'HashRef');
  has 'schema' => (
      is      => 'ro',
      isa     => 'Object',
      lazy    => 1,
      default => sub { MyModel->connect($_[0]->context->{dsn}) },
  );
  has 'kioku' => (
      is      => 'rw',
      isa     => 'Object',
      lazy    => 1,
      default => sub {
          my $self = shift;
          KiokuDB->connect($self->context->{kioku_dir}, create => 1);
      }
  );

  sub run {
      my $self = shift;

      my $feeds = $self->schema->resultset('Feed')->search();
      while (my $feed = $feeds->next) {
          my $res = $self->fetch_feed($feed->url);
          if (!$res || !$res->is_success) {
              say "can't fetch " . $feed->url;
          }
          else {
              $self->dedupe_feed($res, $feed->id);
          }
      }
  }

  sub dedupe_feed {
      my ($self, $res, $feed_id) = @_;

      my $feed = $self->feed_parser(\$res->content);
      return if (!$feed);
      foreach my $entry ($feed->entries) {

          # skip entries whose permalink is already in the database
          next
              if $self->schema->resultset('Entry')
              ->find(sha256_hex $entry->link);

          my $meme = MyAggregator::Entry->new(
              permalink => $entry->link,
              title     => $entry->title,
              author    => $entry->author,
              date      => $entry->issued,
              content   => $entry->content->body,
          );

          # store the full object in KiokuDB ...
          $self->kioku->txn_do(
              scope => 1,
              body  => sub {
                  $self->kioku->insert($meme->id => $meme);
              }
          );

          # ... and a lightweight row in the relational database
          $self->schema->txn_do(
              sub {
                  $self->schema->resultset('Entry')->create(
                      {   entryid   => $meme->id,
                          permalink => $meme->permalink,
                          feedid    => $feed_id,
                      }
                  );
              }
          );
      }
  }
  1;
#+END_SRC

- the =with= function composes roles into a class, so my MyAggregator
  class has the fetch\_feed and feed\_parser methods, and all the
  attributes of our roles
- context is a HashRef that contains the configuration
- schema is our MyModel schema
- kioku is a connection to our KiokuDB backend

This object has two methods: =run= and =dedupe_feed=.

The =run= method gets the list of feeds via the =search= on the Feed
resultset. For each feed returned by the search, we try to fetch it, and
if the fetch is successful, we dedupe the entries. To dedupe the
entries, we check via the =find= whether the permalink is already in the
database. If we already have this entry, we skip it and move on to the
next one. If it's a new entry, we create a *MyAggregator::Entry* object
with the content, date, title, and so on; we store this object in
KiokuDB inside a transaction, and create a new entry in the MyModel
database, also inside a transaction.

And to run this, a little script:

#+BEGIN_SRC perl
  use strict;
  use warnings;
  use MyAggregator;
  use YAML::Syck;

  my $agg = MyAggregator->new(context => LoadFile shift);
  $agg->run;
#+END_SRC

We can now run our aggregator like this:
=perl bin/aggregator.pl conf.yaml=.
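
The configuration file just needs to provide the two keys read from
=context=: =dsn= for the MyModel schema and =kioku_dir= for the KiokuDB
connection. Here is a minimal sketch with hypothetical SQLite values
(any DSNs accepted by =MyModel->connect= and =KiokuDB->connect= will
work):

#+BEGIN_EXAMPLE
  # conf.yaml -- hypothetical values
  dsn: dbi:SQLite:dbname=myaggregator.db
  kioku_dir: dbi:SQLite:dbname=kioku.db
#+END_EXAMPLE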

And it's done :) We now have a really basic aggregator. If you want to
improve it, start with the dedupe process: use the permalink, the date
and/or the title, because deduping on the permalink alone is too
simplistic. In the next article we will write some tests for this
aggregator using Test::Class.

Big thanks to [[http://bunniesincyberspace.wordpress.com/][tea]] and
[[http://code.google.com/p/tinyaml/][blob]] for reviewing and fixing my
broken English in the first two parts.

[[http://git.lumberjaph.net/p5-ironman-myaggregator.git/][The code is
available on my git server]].

Parts 3 and 4 next week.