Diffstat:
 -rw-r--r--  _posts/2009-06-06-modules-i-like-web-scraper.md (renamed from _posts/2009-06-06-modules-i-like-web-scraper.textile) | 16
 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/_posts/2009-06-06-modules-i-like-web-scraper.textile b/_posts/2009-06-06-modules-i-like-web-scraper.md
index ab78fae..e2808cc 100644
--- a/_posts/2009-06-06-modules-i-like-web-scraper.textile
+++ b/_posts/2009-06-06-modules-i-like-web-scraper.md
@@ -1,12 +1,12 @@
---
layout: post
-category: perl
+summary: In which I talk about Web::Scraper
title: modules I like Web::Scraper
---
-For "$work":http://rtgi.fr I need to write scrapers. It used to be boring and painful. But thanks to "miyagawa":http://search.cpan.org/~miyagawa/, this is not true anymore. "Web::Scraper":http://search.cpan.org/perldoc?Web::Scraper offer a nice API: you can write your rules using XPath, you can chaine rules, a nice and simple syntax, etc.
+For [$work](http://rtgi.fr) I need to write scrapers. It used to be boring and painful, but thanks to [miyagawa](http://search.cpan.org/~miyagawa/), this is not true anymore. [Web::Scraper](http://search.cpan.org/perldoc?Web::Scraper) offers a nice API: you can write your rules using XPath, you can chain rules, the syntax is nice and simple, etc.
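Here is a quick, generic illustration of the API (a sketch only; the URL and selector are made up and not part of the last.fm scraper built below):

{% highlight perl %}
use strict;
use warnings;
use Web::Scraper;
use URI;

# A scraper is a set of "process" rules: each one takes a CSS selector or
# an XPath expression, a key to store the result under, and what to extract
# ('TEXT', an attribute like '@href', or a nested scraper).
my $links = scraper {
    process 'a', 'urls[]' => '@href';
};

# scrape() fetches the page and returns a hashref with the extracted data.
my $res = $links->scrape( URI->new('http://example.com/') );
print "$_\n" for @{ $res->{urls} || [] };
{% endhighlight %}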
-I wanted to export my data from my last.fm account but there is no API for this, so I would need to scrap them. All the data are available "as a web page":http://www.last.fm/user/franckcuny/tracks that list your music. So the scraper need to find how many pages, and find the content on each page to extract a list of your listening.
+I wanted to export the data from my last.fm account, but there is no API for this, so I need to scrape it. All the data is available [as a web page](http://www.last.fm/user/franckcuny/tracks) that lists your music. So the scraper needs to find how many pages there are, then extract the list of tracks you listened to from each page.
Finding the total number of pages is easy. Let's take a look at the HTML and search for something like this:
@@ -14,11 +14,11 @@ For the total of pages, it's easy. Let's take a look at the HTML code and search
<a class="lastpage" href="/user/franckcuny/tracks?page=272">272</a>
{% endhighlight %}
-the information is in a class *lastpage*.
+The information is in an **a** tag with the class **lastpage**.
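A minimal rule for that, assuming the markup above (a sketch; the variable and key names are mine):

{% highlight perl %}
use strict;
use warnings;
use Web::Scraper;
use URI;

# Grab the text of the <a class="lastpage"> link, e.g. "272".
my $pager = scraper {
    process '//a[@class="lastpage"]', 'last_page' => 'TEXT';
};

my $res = $pager->scrape( URI->new('http://www.last.fm/user/franckcuny/tracks') );
print "total pages: $res->{last_page}\n";
{% endhighlight %}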
Now we need to find our data: I need the artist name, the song name and the date I played this song.
-All this data are in a *table*, and each new entry is in a *td*.
+All this data is in a **table**, and each new entry is a **tr** whose fields are in **td** cells.
{% highlight html %}
<tr id="r9_1580_1920248170" class="odd">
@@ -33,7 +33,7 @@ All this data are in a *table*, and each new entry is in a *td*.
</td>
{% endhighlight %}
-It's simple: information about a song are stored in *subjectcell*, and the artist and song title are each in a tag *a*. The date is in a *dateCell*, and we need the *title* from the *abbr* tag.
+It's simple: the information about a song is stored in the **subjectCell** cell, and the artist and song title are each in an **a** tag. The date is in the **dateCell**, and we need the **title** attribute of the **abbr** tag.
The scraper we need to write is
@@ -49,7 +49,7 @@ my $scrap = scraper {
};
{% endhighlight %}
-The first rule extract the total of page. The second iter on each *tr* and store the content in an array named *songs*. This *tr* need to be scraped. So we look the the *abbr* tag, and store in *date* the property *title*. Then we look for the song and artitst information. We look for the *td* with a class named *subjectCell*, a extract all links.
+The first rule extracts the total number of pages. The second iterates over each **tr** and stores the content in an array named **songs**. Each **tr** needs to be scraped in turn: we look at the **abbr** tag and store its **title** attribute in **date**. Then we look for the song and artist information: we look for the **td** with the class **subjectCell** and extract all of its links.
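The body of the scraper is cut from this hunk, but based on the description above it could look something like this (a sketch, not necessarily the exact code from the post; the `last_page` and `links` keys are my own names):

{% highlight perl %}
# One nested scraper per <tr>: the date comes from the title attribute of
# the <abbr> tag, the artist and song from the links inside td.subjectCell.
my $row = scraper {
    process 'abbr',             'date'    => '@title';
    process 'td.subjectCell a', 'links[]' => 'TEXT';
};

my $scrap = scraper {
    process '//a[@class="lastpage"]', 'last_page' => 'TEXT';
    process 'tr',                     'songs[]'   => $row;
};
{% endhighlight %}

Calling `$scrap->scrape($uri)` on a tracks page would then return a hashref with `last_page` and a `songs` array of `{ date, links }` entries.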
Our final script will look like this:
@@ -96,5 +96,5 @@ sub scrap_lastfm {
You can use this script like this:
{% highlight bash %}
-perl lastfmscraper.pl franckcuny store_data.txt
+% perl lastfmscraper.pl franckcuny store_data.txt
{% endhighlight %}