1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
|
---
layout: post
summary: In which I talk about Web::Scraper.
title: Modules I like Web::Scraper
---
For [$work](http://rtgi.fr) I need to write scrapers. It used to be boring and painful. But thanks to [miyagawa](http://search.cpan.org/~miyagawa/), this is not true anymore. [Web::Scraper](http://search.cpan.org/perldoc?Web::Scraper) offer a nice API: you can write your rules using XPath, you can chaine rules, a nice and simple syntax, etc.
I wanted to export my data from my last.fm account but there is no API for this, so I would need to scrap them. All the data are available [as a web page](http://www.last.fm/user/franckcuny/tracks) that list your music. So the scraper need to find how many pages, and find the content on each page to extract a list of your listening.
For the total of pages, it's easy. Let's take a look at the HTML code and search for something like this:
{% highlight html %}
<a class="lastpage" href="/user/franckcuny/tracks?page=272">272</a>
{% endhighlight %}
the information is in a class **lastpage**.
Now we need to find our data: I need the artist name, the song name and the date I played this song.
All this data are in a **table**, and each new entry is in a **td**.
{% highlight html %}
<tr id="r9_1580_1920248170" class="odd">
[...]
<td class="subjectCell">
<a href="/music/Earth">Earth</a>
<a href="/music/Earth/_/Sonar+and+Depth+Charge">Sonar and Depth Charge</a>
</td>
[...]
<td class="dateCell last">
<abbr title="2009-05-13T15:18:25Z">13 May 3:18pm</abbr>
</td>
{% endhighlight %}
It's simple: information about a song are stored in **subjectcell**, and the artist and song title are each in a tag **a**. The date is in a **dateCell**, and we need the **title** from the **abbr** tag.
The scraper we need to write is
{% highlight perl %}
my $scrap = scraper {
process 'a[class="lastpage"]', 'last' => 'TEXT';
process 'tr', 'songs[]' => scraper {
process 'abbr', 'date' => '@title';
process 'td[class="subjectCell"]', 'song' => scraper {
process 'a', 'info[]' => 'TEXT';
};
}
};
{% endhighlight %}
The first rule extract the total of page. The second iter on each **tr** and store the content in an array named **songs**. This **tr** need to be scraped. So we look the the **abbr** tag, and store in **date** the property **title**. Then we look for the song and artitst information. We look for the **td** with a class named **subjectCell**, a extract all links.
Our final script will look like this:
{% highlight perl %}
#!/usr/bin/perl -w
use strict;
use feature ':5.10';
use Web::Scraper;
use URI;
use IO::All -utf8;
my $username = shift;
my $output = shift;
my $scrap = scraper {
process 'a[class="lastpage"]', 'last' => 'TEXT';
process 'tr', 'songs[]' => scraper {
process 'abbr', 'date' => '@title';
process 'td[class="subjectCell"]', 'song' => scraper {
process 'a', 'info[]' => 'TEXT';
};
}
};
my $url = "http://www.last.fm/user/" . $username . "/tracks?page=";
scrap_lastfm(1);
sub scrap_lastfm {
my $page = shift;
my $scrap_uri = $url . $page;
say $scrap_uri;
my $res = $scrap->scrape(URI->new($scrap_uri));
my $lastpage = $res->{last};
foreach my $record (@{$res->{songs}}) {
my $line = join("\t", @{$record->{song}->{info}}, $record->{date});
$line . "\n" >> io $output;
}
$page++;
scrap_lastfm($page) if $page <= $lastpage;
}
{% endhighlight %}
You can use this script like this:
{% highlight bash %}
% perl lastfmscraper.pl franckcuny store_data.txt
{% endhighlight %}
|