author     Franck Cuny <franckcuny@gmail.com>  2016-08-04 11:45:44 -0700
committer  Franck Cuny <franckcuny@gmail.com>  2016-08-04 11:45:44 -0700
commit     585b48b6a605cb71ef99dd767880e1b7ee5bf24e (patch)
tree       c65377350d12bd1e62e0bdd58458c1044541c27b /posts/2009-04-28-a-simple-feed-aggregator-with-modern-perl-part-2.org
parent     Use Bullet list for the index. (diff)
parent     Mass convert all posts from markdown to org. (diff)
download   lumberjaph-585b48b6a605cb71ef99dd767880e1b7ee5bf24e.tar.gz
Merge branch 'convert-to-org'
Diffstat (limited to 'posts/2009-04-28-a-simple-feed-aggregator-with-modern-perl-part-2.org')
-rw-r--r--  posts/2009-04-28-a-simple-feed-aggregator-with-modern-perl-part-2.org  302
1 file changed, 302 insertions, 0 deletions
diff --git a/posts/2009-04-28-a-simple-feed-aggregator-with-modern-perl-part-2.org b/posts/2009-04-28-a-simple-feed-aggregator-with-modern-perl-part-2.org
new file mode 100644
index 0000000..4043ca6
--- /dev/null
+++ b/posts/2009-04-28-a-simple-feed-aggregator-with-modern-perl-part-2.org
@@ -0,0 +1,302 @@
+#+BEGIN_QUOTE
+  I've chosen to write about a feed aggregator because it's one of the
+  things I'm working on at [[http://rtgi.eu/][RTGI]] (along with web
+  crawler stuff, gluing data together with a search engine, etc.)
+#+END_QUOTE
+
+For the feed aggregator, I will use *Moose*, *KiokuDB* and our
+*DBIx::Class* schema. Before we get started, I'd like to give a short
+introduction to Moose and KiokuDB.
+
+Moose is "a postmodern object system for Perl 5". Moose brings some
+really nice concepts to OO Perl, like roles, a cleaner syntax, a "free"
+constructor and destructor, and more. If you don't already know Moose,
+check [[http://www.iinteractive.com/moose/][the Moose website]] for more
+information.
+
+KiokuDB is a Moose based frontend to various data stores [...] Its
+purpose is to provide persistence for "regular" objects with as little
+effort as possible, without sacrificing control over how persistence is
+actually done, especially for harder to serialize objects. [...] KiokuDB
+is meant to solve two related persistence problems:
+
+- Store arbitrary objects without changing their class definitions or
+ worrying about schema details, and without needing to conform to the
+ limitations of a relational model.
+- Persisting arbitrary objects in a way that is compatible with
+ existing data/code (for example interoperating with another app using
+ *CouchDB* with *JSPON* semantics).
+
+I will store each feed entry in KiokuDB. I could have chosen to store
+them as plain text in JSON files, in my DBIx::Class model, etc. But as I
+want to show you new and modern stuff, I will store them in KiokuDB
+using the DBI backend.
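+
+To give a concrete idea of what that looks like, here is a minimal,
+standalone sketch of opening a KiokuDB directory on top of SQLite
+through the DBI backend (the DSN and file name are only examples):
+
+#+BEGIN_SRC perl
+  use strict;
+  use warnings;
+  use KiokuDB;
+
+  # connect KiokuDB to a SQLite database via KiokuDB::Backend::DBI;
+  # create => 1 deploys the tables on first use
+  my $kioku = KiokuDB->connect(
+      'dbi:SQLite:dbname=myaggregator.db',
+      create => 1,
+  );
+
+  # objects are stored and looked up inside a live scope
+  my $scope = $kioku->new_scope;
+#+END_SRC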
+
+*** And now for something completely different, code!
+
+First, we will create a base module named *MyAggregator*.
+
+#+BEGIN_EXAMPLE
+ % module-setup MyAggregator
+#+END_EXAMPLE
+
+We will now edit *lib/MyAggregator.pm* and write the following code:
+
+#+BEGIN_SRC perl
+ package MyAggregator;
+ use Moose;
+ 1;
+#+END_SRC
+
+As you can see, there is no =use strict; use warnings= here: Moose
+automatically turns on these pragmas for us. We don't have to write a
+=new= method either, as it's provided by Moose.
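+
+Just to illustrate those two points, this tiny snippet (purely
+illustrative) already works with the class above:
+
+#+BEGIN_SRC perl
+  use MyAggregator;
+
+  # new() is generated by Moose, we never wrote it ourselves
+  my $aggregator = MyAggregator->new;
+  print ref($aggregator), "\n";    # MyAggregator
+#+END_SRC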
+
+For parsing feeds, we will use *XML::Feed*, and we will use it in a
+Role. If you don't know what roles are:
+
+#+BEGIN_QUOTE
+ Roles have two primary purposes: as interfaces, and as a means of code
+ reuse. Usually, a role encapsulates some piece of behavior or state
+ that can be shared between classes. It is important to understand that
+ roles are not classes. You cannot inherit from a role, and a role
+ cannot be instantiated.
+#+END_QUOTE
+
+So, we will write our first role, *lib/MyAggregator/Roles/Feed.pm*:
+
+#+BEGIN_SRC perl
+ package MyAggregator::Roles::Feed;
+ use Moose::Role;
+ use XML::Feed;
+ use feature 'say';
+
+ sub feed_parser {
+ my ($self, $content) = @_;
+ my $feed = eval { XML::Feed->parse($content) };
+      # XML::Feed->parse() may die on broken input (hence the eval) or
+      # simply return undef and set errstr, so check the result itself
+      if (!$feed) {
+          my $error = XML::Feed->errstr || $@ || 'unknown error';
+          say "error while parsing feed: $error";
+      }
+ $feed;
+ }
+ 1;
+#+END_SRC
+
+This one is pretty simple. It takes some content, tries to parse it,
+and returns an XML::Feed object. If the feed can't be parsed, the error
+is printed and the method returns undef.
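+
+As a quick sanity check of the role, here is a minimal sketch of a
+throwaway class consuming it (the =FeedDumper= name and the feed URL are
+only placeholders):
+
+#+BEGIN_SRC perl
+  package FeedDumper;
+  use Moose;
+  with 'MyAggregator::Roles::Feed';
+
+  package main;
+  use LWP::Simple qw(get);
+
+  my $content = get('http://example.com/feed.atom') or die "fetch failed";
+  my $feed    = FeedDumper->new->feed_parser(\$content);
+  print $feed->title, "\n" if $feed;
+#+END_SRC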
+
+Now, a second role will be used to fetch the feed, and do basic caching,
+*lib/MyAggregator/Roles/UserAgent.pm*:
+
+#+BEGIN_SRC perl
+ package MyAggregator::Roles::UserAgent;
+ use Moose::Role;
+  use LWP::UserAgent;
+  use HTTP::Request;
+ use Cache::FileCache;
+ use URI;
+
+ has 'ua' => (
+ is => 'ro',
+ isa => 'Object',
+ lazy => 1,
+ default => sub { LWP::UserAgent->new(agent => 'MyUberAgent'); }
+ );
+ has 'cache' => (
+ is => 'rw',
+ isa => 'Cache::FileCache',
+ lazy => 1,
+ default =>
+ sub { Cache::FileCache->new({namespace => 'myaggregator',}); }
+ );
+
+ sub fetch_feed {
+ my ($self, $url) = @_;
+
+ my $req = HTTP::Request->new(GET => URI->new($url));
+ my $ref = $self->cache->get($url);
+      if (defined $ref) {
+          # replay the validators saved from the previous fetch, so the
+          # server can answer 304 Not Modified when nothing changed
+          $req->header('If-Modified-Since' => $ref->{LastModified})
+              if $ref->{LastModified} ne '';
+          $req->header('If-None-Match' => $ref->{ETag})
+              if $ref->{ETag} ne '';
+      }
+
+ my $res = $self->ua->request($req);
+ $self->cache->set(
+ $url,
+ { ETag => $res->header('Etag') || '',
+ LastModified => $res->header('Last-Modified') || ''
+ },
+ '5 days',
+ );
+ $res;
+ }
+ 1;
+#+END_SRC
+
+This role has two attributes: *ua* and *cache*. The *ua* attribute is
+our user agent. =lazy= means that it will not be constructed until the
+first time =$self->ua= is actually called.
+
+I use *Cache::FileCache* to do some basic caching, so I don't fetch or
+parse a feed when it's unnecessary, and I use the ETag and Last-Modified
+headers to check the validity of my cache.
+
+The only method of this role is *fetch\_feed*. It fetches a URL, sending
+the cached validators so the server can answer 304 Not Modified when
+nothing changed, and returns an *HTTP::Response* object.
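+
+Here is a minimal sketch of how the role can be exercised on its own
+(the =FeedFetcher= class and the URL are only placeholders):
+
+#+BEGIN_SRC perl
+  package FeedFetcher;
+  use Moose;
+  with 'MyAggregator::Roles::UserAgent';
+
+  package main;
+
+  my $fetcher = FeedFetcher->new;
+  my $res     = $fetcher->fetch_feed('http://example.com/feed.atom');
+
+  if ($res->is_success) {
+      printf "fetched %d bytes\n", length $res->content;
+  }
+  elsif ($res->code == 304) {
+      print "not modified since the last fetch\n";
+  }
+  else {
+      print "fetch failed: ", $res->status_line, "\n";
+  }
+#+END_SRC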
+
+Now, I create an Entry class in *lib/MyAggregator/Entry.pm*:
+
+#+BEGIN_SRC perl
+ package MyAggregator::Entry;
+ use Moose;
+ use Digest::SHA qw(sha256_hex);
+ has 'author' => (is => 'rw', isa => 'Str');
+ has 'content' => (is => 'rw', isa => 'Str');
+ has 'title' => (is => 'rw', isa => 'Str');
+ has 'id' => (is => 'rw', isa => 'Str');
+ has 'date' => (is => 'rw', isa => 'Object');
+ has 'permalink' => (
+ is => 'rw',
+ isa => 'Str',
+ required => 1,
+ trigger => sub {
+ my $self = shift;
+ $self->id(sha256_hex $self->permalink);
+ }
+ );
+ 1;
+#+END_SRC
+
+Here the *permalink* attribute has a trigger: each entry has a unique
+*id*, built from the SHA-256 hash of the *permalink*. So when we set the
+*permalink*, the *id* is filled in automatically.
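+
+As a quick illustration (the URL is just an example):
+
+#+BEGIN_SRC perl
+  use MyAggregator::Entry;
+
+  my $entry = MyAggregator::Entry->new(
+      permalink => 'http://example.com/posts/hello-world',
+      title     => 'Hello world',
+  );
+
+  # the trigger on permalink has already computed the id for us
+  print $entry->id, "\n";    # sha256_hex of the permalink
+#+END_SRC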
+
+We can now change our *MyAggregator* module like this:
+
+#+BEGIN_SRC perl
+ package MyAggregator;
+ use feature ':5.10';
+ use MyModel;
+ use Moose;
+ use MyAggregator::Entry;
+ use KiokuDB;
+ use Digest::SHA qw(sha256_hex);
+ with 'MyAggregator::Roles::UserAgent', 'MyAggregator::Roles::Feed';
+
+ has 'context' => (is => 'ro', isa => 'HashRef');
+ has 'schema' => (
+ is => 'ro',
+ isa => 'Object',
+ lazy => 1,
+ default => sub { MyModel->connect($_[0]->context->{dsn}) },
+ );
+ has 'kioku' => (
+ is => 'rw',
+ isa => 'Object',
+ lazy => 1,
+ default => sub {
+ my $self = shift;
+ KiokuDB->connect($self->context->{kioku_dir}, create => 1);
+ }
+ );
+
+ sub run {
+ my $self = shift;
+
+ my $feeds = $self->schema->resultset('Feed')->search();
+ while (my $feed = $feeds->next) {
+ my $res = $self->fetch_feed($feed->url);
+ if (!$res || !$res->is_success) {
+ say "can't fetch " . $feed->url;
+ }
+ else {
+ $self->dedupe_feed($res, $feed->id);
+ }
+ }
+ }
+
+ sub dedupe_feed {
+ my ($self, $res, $feed_id) = @_;
+
+ my $feed = $self->feed_parser(\$res->content);
+ return if (!$feed);
+ foreach my $entry ($feed->entries) {
+ next
+ if $self->schema->resultset('Entry')
+ ->find(sha256_hex $entry->link);
+ my $meme = MyAggregator::Entry->new(
+ permalink => $entry->link,
+ title => $entry->title,
+ author => $entry->author,
+ date => $entry->issued,
+ content => $entry->content->body,
+ );
+
+ $self->kioku->txn_do(
+ scope => 1,
+ body => sub {
+ $self->kioku->insert($meme->id => $meme);
+ }
+ );
+ $self->schema->txn_do(
+ sub {
+ $self->schema->resultset('Entry')->create(
+ { entryid => $meme->id,
+ permalink => $meme->permalink,
+ feedid => $feed_id,
+ }
+ );
+ }
+ );
+ }
+ }
+ 1;
+#+END_SRC
+
+- the =with= function composes roles into a class. So my MyAggregator
+  class gets the fetch\_feed and feed\_parser methods, and all the
+  attributes of our roles
+- context is a HashRef that contains the configuration
+- schema is our MyModel schema
+- kioku is a connection to our KiokuDB backend
+
+Two methods in this class: =run= and =dedupe_feed=.
+
+The =run= method gets the list of feeds (via the =search= on the =Feed=
+resultset). For each feed returned by the search, we try to fetch it,
+and if the fetch is successful, we dedupe its entries. To dedupe the
+entries, we check whether the permalink is already in the database (via
+the =find= on the =Entry= resultset). If we already have an entry, we
+skip it and move on to the next one. If it's a new entry, we create a
+*MyAggregator::Entry* object with the content, date, title, etc. We
+store this object in KiokuDB (inside a KiokuDB transaction, via
+=txn_do=), and create a new row in the MyModel database (inside a
+DBIx::Class transaction as well).
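+
+Getting an entry back out of KiokuDB later is just as short. A minimal
+sketch, assuming we are inside MyAggregator (so =$self->kioku= is
+available) and =$id= holds the sha256 value of a stored permalink:
+
+#+BEGIN_SRC perl
+  # lookups must happen inside a live KiokuDB scope
+  my $scope = $self->kioku->new_scope;
+
+  my $entry = $self->kioku->lookup($id);
+  say $entry->title if $entry;
+#+END_SRC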
+
+And to run this, a little script:
+
+#+BEGIN_SRC perl
+  use strict;
+  use warnings;
+  use MyAggregator;
+  use YAML::Syck;
+
+  # the first argument is the path to the YAML configuration file
+  my $agg = MyAggregator->new(context => LoadFile shift);
+  $agg->run;
+#+END_SRC
+
+So we can run our aggregator like this:
+=perl bin/aggregator.pl conf.yaml=
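+
+The configuration only needs the two keys the code reads from the
+context, =dsn= and =kioku_dir=; the values below are just examples (any
+DBI DSN for the schema and any KiokuDB DSN for the entry store will do):
+
+#+BEGIN_EXAMPLE
+  # conf.yaml
+  dsn: dbi:SQLite:dbname=myaggregator.db
+  kioku_dir: dbi:SQLite:dbname=kioku.db
+#+END_EXAMPLE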
+
+And it's done :) We now have a really basic aggregator. If you want to
+improve it, a good place to start is the dedupe process, using the
+permalink, the date and/or the title, as the current one is really too
+simplistic. In the next article we will write some tests for this
+aggregator using Test::Class.
+
+Big thanks to [[http://bunniesincyberspace.wordpress.com/][tea]] and
+[[http://code.google.com/p/tinyaml/][blob]] for reviewing and fixing my
+broken English in the first two parts.
+
+[[http://git.lumberjaph.net/p5-ironman-myaggregator.git/][The code is
+available on my git server]].
+
+Parts 3 and 4 will follow next week.