Jun 5, 2010

[Tutorial] How I automatically grabbed songs from Zing MP3 and uploaded them to ListenMe

Background

Zing MP3 is huge, damn huge, in terms of both its song database and its number of song submitters.
ListenMe.net (my hobby music-sharing website) is (slightly) less huge, and I'm more or less its only song submitter.
I usually upload the songs I like manually, but I also wanted a way to quickly build up the ListenMe.net song database with very little effort on my part.

Solution

Automatically "grab" songs from Zing MP3 and upload them to ListenMe.net
Tools:

cURL: to retrieve web pages from PHP
Regular expressions: to parse the selected parts of a page's HTML. This tool is excellent for testing your regular expressions.

Details

The snippet below "opens" the Zing MP3 search page and looks for $artist, where $artist can be 'MLTR', 'Linkin' or 'Ưng+Hoàng+Phúc'.

$indexURL = "http://mp3.zing.vn/tim-kiem/bai-hat.html?t=artist&p=1&q=".$artistCURL;
$result = songActions::curlURL($indexURL);
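The songActions::curlURL helper is not shown in this post. As a rough sketch only, here is what building the search URL and fetching the page could look like. The function names buildSearchUrl and fetchPage, and the timeout values, are assumptions made for illustration, not the actual helper.

```php
<?php
// Hypothetical helpers; names and options are illustrative assumptions.

// Build the search URL for an artist and a result page number.
function buildSearchUrl(string $artist, int $page = 1): string
{
    // urlencode() turns spaces into "+" and percent-encodes non-ASCII bytes.
    return 'http://mp3.zing.vn/tim-kiem/bai-hat.html?t=artist&p=' . $page
        . '&q=' . urlencode($artist);
}

// Fetch a page and return its HTML as a string, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // seconds allowed to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // seconds allowed for the whole request
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}
```

With these, fetchPage(buildSearchUrl('Ưng Hoàng Phúc')) would take care of the "+" encoding that the raw $artist string otherwise has to contain.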

We now want to loop over all the result pages. But first, we need to know how many pages there are.

We use a regular expression to grab the number between "Tìm được" and "bài", which is 2174 in this example. Since each result page contains 20 items, we can then calculate $totalPages.

//find the total number of songs found
preg_match('%Kết quả tìm được <strong>(.*?)</strong> bài hát%', $result, $matches);
$totalStubs = $matches[1];

//calculate the number of pages, 20 songs per page
$totalPages = ceil($totalStubs / 20);
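To make the arithmetic concrete, here is a small self-contained sketch of the same calculation. The function name countResultPages and the sample HTML are made up for illustration, and stripping non-digits is a precaution based on the assumption that the count might contain separators.

```php
<?php
// Sketch: extract the total song count and derive the page count.
// The sample HTML below is invented to mirror the pattern used above.
function countResultPages(string $html, int $perPage = 20): int
{
    if (!preg_match('%Kết quả tìm được <strong>(.*?)</strong> bài hát%', $html, $m)) {
        return 0; // counter not found: treat as no results
    }
    // Keep digits only, in case the count uses separators (an assumption).
    $total = (int) preg_replace('/\D/', '', $m[1]);
    return (int) ceil($total / $perPage);
}

$sample = '<p>Kết quả tìm được <strong>2174</strong> bài hát</p>';
echo countResultPages($sample), "\n"; // 2174 / 20 = 108.7, rounded up to 109
```

Checking the return value of preg_match also avoids reading $matches[1] when the page layout changes and the counter is no longer there.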

Now we go through each of the pages

for ($p = 1; $p <= $totalPages; $p++)

and see this

How do we get the title and direct link for each song? Again, with a regular expression. First, look at the HTML source code

we can try this

//get items for current page

preg_match_all('%<a class="_trackLink" tracking="_frombox=search_song" title="[^>]*" href="(/bai-hat/.*?.html)">(.*?)</a>(?s)(.*?)(\d*?)kb/s(?s)(.*?)t nghe: (.*?)</p>%',
$result, $matches, PREG_PATTERN_ORDER);

//$matches[1] : songURL
//$matches[2] : songTitle
//$matches[4] : bitrate
//$matches[6] : hit

This retrieves the relative URL of each song, its title, its bitrate and its view count.
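To show how the PREG_PATTERN_ORDER result is laid out, here is a self-contained sketch. Both the markup and the pattern are simplified stand-ins invented for this example (four groups instead of six, so the group numbers differ from the pattern above); the real Zing MP3 page is more complex.

```php
<?php
// Made-up markup loosely modelled on a search result item; the real page differs.
$html = <<<HTML
<a class="_trackLink" href="/bai-hat/Take-Me-To-Your-Heart/ABC123.html">Take Me To Your Heart</a>
<p>128kb/s | Lượt nghe: 12.345</p>
<a class="_trackLink" href="/bai-hat/Take-Me-To-Ur-Heart/XYZ789.html">Take Me To Ur Heart</a>
<p>320kb/s | Lượt nghe: 987</p>
HTML;

// Simplified pattern with four groups: URL, title, bitrate, hits.
// The trailing "s" modifier lets "." match newlines between the parts.
$pattern = '%<a class="_trackLink" href="(/bai-hat/.*?\.html)">(.*?)</a>.*?(\d+)kb/s.*?nghe: (.*?)</p>%s';

preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);

// With PREG_PATTERN_ORDER, $matches[N][$i] is group N of the $i-th item.
$songs = [];
foreach ($matches[1] as $i => $url) {
    $songs[] = [
        'url'     => $url,
        'title'   => trim($matches[2][$i]),
        'bitrate' => (int) $matches[3][$i],
        'hits'    => $matches[4][$i], // still a string such as "12.345"
    ];
}
print_r($songs);
```

Collecting each item into one array per song makes the later steps easier to read than indexing $matches by group number everywhere.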

Why do we need the bitrate and view count?

To later decide which version of a song has the highest quality, since people often upload duplicates. I prefer to pick the one with the highest bitrate, then the highest view count.
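That preference order (bitrate first, view count as the tie-breaker) can be written as a small comparison function. This is a sketch under assumptions: isBetterVersion is a hypothetical name, and plain arrays stand in for the stub objects.

```php
<?php
// Hypothetical helper: is $candidate a better version than $current?
// Both arguments are arrays with integer 'bitrate' and 'views' keys.
function isBetterVersion(array $candidate, array $current): bool
{
    if ($candidate['bitrate'] !== $current['bitrate']) {
        return $candidate['bitrate'] > $current['bitrate']; // bitrate decides first
    }
    return $candidate['views'] > $current['views'];         // views break the tie
}

// A 320 kb/s upload beats a more popular 128 kb/s one.
var_dump(isBetterVersion(
    ['bitrate' => 320, 'views' => 10],
    ['bitrate' => 128, 'views' => 9999]
)); // bool(true)
```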

We now process all the songs on that page and construct a Stub object for each one (it contains all the information needed to retrieve the real song later)

//get key from song URL
preg_match('%/bai-hat/.*[/](.*?).html%', $matches[1][$i], $regs);
$stub->stubkey = $regs[1];

$stub->title = ListenMe::convertFirstCharacterOfUnicodeStringToLowerCase(trim($matches[2][$i]));

$stub->bitrate = intval($matches[4][$i]);
$stub->views = intval(str_replace(".", "", $matches[6][$i])); //remove the dots inside the hit count BEFORE converting to an integer
$stub->status = 1;
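The two parsing steps here (song key and hit count) are easy to get subtly wrong, so here is a standalone sketch of both. The helper names, the sample URL, and the "12.345" number format are assumptions chosen for illustration.

```php
<?php
// Illustrative helpers; the URL shape and number format are assumptions.

// Extract the song key: the last path segment without ".html".
function extractStubKey(string $songUrl): ?string
{
    if (preg_match('%/bai-hat/.*/(.*?)\.html%', $songUrl, $regs)) {
        return $regs[1];
    }
    return null; // URL did not have the expected shape
}

// Convert a hit count such as "12.345" into the integer 12345.
function parseHitCount(string $raw): int
{
    return intval(str_replace('.', '', trim($raw)));
}

var_dump(extractStubKey('/bai-hat/Take-Me-To-Your-Heart/ABC123.html')); // string(6) "ABC123"
var_dump(parseHitCount('12.345')); // int(12345)
var_dump(intval('12.345'));        // int(12): intval() alone stops at the dot
```

The last line shows why the order matters: the dots have to be removed before the string is converted, not after.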

Now, as I already mentioned, people often upload the same song many times (on Zing MP3, a song usually has 10+ duplicate versions).

Which one do I choose to upload to ListenMe.net? The one with the higher bitrate and the higher view count.

$toBeAdded = 1;

if (count($validStubs) > 0)
{
    foreach ($validStubs as $st) {
        //compare the lowercased titles; $percent receives the similarity in percent
        similar_text(
            ListenMe::convertUnicodeStringToLowercase(trim($st->title)),
            ListenMe::convertUnicodeStringToLowercase(trim($stub->title)),
            $percent
        );

        if ($percent > $matchThreshold) {
            if ($st->views >= $stub->views) { //don't add
                $toBeAdded = 0;
            } else { //add and remove existing stub
                $st->status = 0;
            }
        }
    }
}

$validStubs stores the "chosen" songs (those that passed our test). For each new song, we check whether it matches any song in $validStubs; if there is a match, we compare the hit counts (views).

Note the use of the similar_text function: it matches songs whose titles differ slightly (for example through a typo), such as "Take me to your heart" and "Take me to ur heart".

Once we finish looping over all the pages, we save all the stubs to the DB and later upload them to ListenMe, again using cURL and the direct links retrieved in this step.

Since Zing redirects the download link, we need to enable CURLOPT_FOLLOWLOCATION.
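Here is a small standalone sketch of that title comparison. The names titleSimilarity and lowerTitle are hypothetical, mb_strtolower stands in for the ListenMe helper (which is not shown in this post), and the threshold of 90 is an assumed value, since the real $matchThreshold is not stated.

```php
<?php
// Lowercase a title; falls back to strtolower() if mbstring is unavailable.
function lowerTitle(string $title): string
{
    $title = trim($title);
    return function_exists('mb_strtolower')
        ? mb_strtolower($title, 'UTF-8')
        : strtolower($title);
}

// Return the similarity of two titles in percent (0 to 100).
function titleSimilarity(string $a, string $b): float
{
    // similar_text() writes the percentage into its third argument.
    similar_text(lowerTitle($a), lowerTitle($b), $percent);
    return $percent;
}

$matchThreshold = 90; // assumed value; the post does not state the real one

$typo  = titleSimilarity('Take me to your heart', 'Take me to ur heart');
$other = titleSimilarity('Take me to your heart', 'Numb');

var_dump($typo > $matchThreshold);  // the typo variant counts as the same song
var_dump($other > $matchThreshold); // an unrelated title does not
```

The threshold is a trade-off: set it too low and different songs get merged, set it too high and typo variants slip through as separate songs.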


curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);// allow redirects
curl_setopt($ch, CURLOPT_MAXREDIRS , 1);// redirect 1 level
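For context, here is a sketch of how those two options might fit into a complete download routine that streams the song to disk. The function name downloadFile, the use of CURLOPT_FILE and CURLOPT_FAILONERROR, and the clean-up behaviour are assumptions for illustration, not the code that runs on ListenMe.

```php
<?php
// Hypothetical helper: stream a remote file to disk, following redirects.
function downloadFile(string $url, string $destPath, int $maxRedirects = 1): bool
{
    $fp = fopen($destPath, 'wb');
    if ($fp === false) {
        return false; // destination is not writable
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);                // write the body straight to the file
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // allow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects); // cap the redirect chain
    curl_setopt($ch, CURLOPT_FAILONERROR, true);        // treat HTTP status >= 400 as failure
    $ok = curl_exec($ch);
    fclose($fp);
    if ($ok === false) {
        unlink($destPath); // do not keep a partial or empty file
        return false;
    }
    return true;
}
```

Writing directly to a file handle keeps memory use flat for large MP3 files, compared with returning the whole body as a string.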
