Diffstat (limited to '')
| -rw-r--r-- | _posts/2009-06-06-modules-i-like-web-scraper.md (renamed from _posts/2009-06-06-modules-i-like-web-scraper.textile) | 16 |
1 files changed, 8 insertions, 8 deletions
diff --git a/_posts/2009-06-06-modules-i-like-web-scraper.textile b/_posts/2009-06-06-modules-i-like-web-scraper.md
index ab78fae..e2808cc 100644
--- a/_posts/2009-06-06-modules-i-like-web-scraper.textile
+++ b/_posts/2009-06-06-modules-i-like-web-scraper.md
@@ -1,12 +1,12 @@
 ---
 layout: post
-category: perl
+summary: In which I talk about Web::Scraper
 title: modules I like Web::Scraper
 ---
 
-For "$work":http://rtgi.fr I need to write scrapers. It used to be boring and painful. But thanks to "miyagawa":http://search.cpan.org/~miyagawa/, this is not true anymore. "Web::Scraper":http://search.cpan.org/perldoc?Web::Scraper offer a nice API: you can write your rules using XPath, you can chaine rules, a nice and simple syntax, etc.
+For [$work](http://rtgi.fr) I need to write scrapers. It used to be boring and painful. But thanks to [miyagawa](http://search.cpan.org/~miyagawa/), this is not true anymore. [Web::Scraper](http://search.cpan.org/perldoc?Web::Scraper) offer a nice API: you can write your rules using XPath, you can chaine rules, a nice and simple syntax, etc.
 
-I wanted to export my data from my last.fm account but there is no API for this, so I would need to scrap them. All the data are available "as a web page":http://www.last.fm/user/franckcuny/tracks that list your music. So the scraper need to find how many pages, and find the content on each page to extract a list of your listening.
+I wanted to export my data from my last.fm account but there is no API for this, so I would need to scrap them. All the data are available [as a web page](http://www.last.fm/user/franckcuny/tracks) that list your music. So the scraper need to find how many pages, and find the content on each page to extract a list of your listening.
 
 For the total of pages, it's easy. Let's take a look at the HTML code and search for something like this:
 
@@ -14,11 +14,11 @@ For the total of pages, it's easy. Let's take a look at the HTML code and search
 <a class="lastpage" href="/user/franckcuny/tracks?page=272">272</a>
 {% endhighlight %}
 
-the information is in a class *lastpage*.
+the information is in a class **lastpage**.
 
 Now we need to find our data: I need the artist name, the song name and the date I played this song.
 
-All this data are in a *table*, and each new entry is in a *td*.
+All this data are in a **table**, and each new entry is in a **td**.
 
 {% highlight html %}
 <tr id="r9_1580_1920248170" class="odd">
@@ -33,7 +33,7 @@ All this data are in a *table*, and each new entry is in a *td*.
 </td>
 {% endhighlight %}
 
-It's simple: information about a song are stored in *subjectcell*, and the artist and song title are each in a tag *a*. The date is in a *dateCell*, and we need the *title* from the *abbr* tag.
+It's simple: information about a song are stored in **subjectcell**, and the artist and song title are each in a tag **a**. The date is in a **dateCell**, and we need the **title** from the **abbr** tag.
 
 The scraper we need to write is
 
@@ -49,7 +49,7 @@ my $scrap = scraper {
 };
 {% endhighlight %}
 
-The first rule extract the total of page. The second iter on each *tr* and store the content in an array named *songs*. This *tr* need to be scraped. So we look the the *abbr* tag, and store in *date* the property *title*. Then we look for the song and artitst information. We look for the *td* with a class named *subjectCell*, a extract all links.
+The first rule extract the total of page. The second iter on each **tr** and store the content in an array named **songs**. This **tr** need to be scraped. So we look the the **abbr** tag, and store in **date** the property **title**. Then we look for the song and artitst information. We look for the **td** with a class named **subjectCell**, a extract all links.
 
 Our final script will look like this:
 
@@ -96,5 +96,5 @@ sub scrap_lastfm {
 You can use this script like this:
 
 {% highlight bash %}
-perl lastfmscraper.pl franckcuny store_data.txt
+% perl lastfmscraper.pl franckcuny store_data.txt
 {% endhighlight %}
