[Tutorial] How I *automatically* grabbed songs from Zing MP3 and uploaded them to ListenMe

Background

  • Zing MP3 is huge, damn huge, in terms of both its song database and its number of song submitters

  • ListenMe.net (my hobby music sharing website) is (slightly) less huge and I'm more or less the only song submitter

  • I usually upload the songs I like manually, but I also want a way to quickly build up the ListenMe.net song database with very little effort on my part

Solution

  • Automatically "grab" songs from Zing MP3 and upload to ListenMe.net

  • Tools:

  • cURL: to retrieve web pages from PHP

  • Regular expressions: to parse the selected part of the page's HTML. This tool is excellent for testing your regular expressions.

Details

  • The snippet below "opens" the Zing MP3 search page and searches for $artist; $artist can be 'MLTR', 'Linkin' or 'Ưng+Hoàng+Phúc'
$indexURL = "http://mp3.zing.vn/tim-kiem/bai-hat.html?t=artist&p=1&q=".$artistCURL;
$result = songActions::curlURL($indexURL);
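
The helper songActions::curlURL is not shown in this post. Here is a minimal sketch of what such a helper could look like, together with a function that builds the search URL. The names buildSearchUrl and fetchPage are hypothetical, and the sketch assumes that `p` is the page number and `q` the URL-encoded artist name.

```php
<?php
// Hypothetical helper: builds the search URL used above.
// Assumes `q` is the URL-encoded artist and `p` is the page number.
function buildSearchUrl(string $artist, int $page = 1): string
{
    return 'http://mp3.zing.vn/tim-kiem/bai-hat.html?t=artist&p=' . $page
        . '&q=' . urlencode($artist);
}

// Hypothetical stand-in for songActions::curlURL: returns the page body,
// or null when the request fails.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// urlencode() turns the space into "+", as in the 'Ưng+Hoàng+Phúc' example
echo buildSearchUrl('Linkin Park', 2), "\n";
```

Returning null on failure lets the caller skip a page instead of running the regular expressions on an empty result.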
  • We now want to loop over all the result pages. But first, we need to know how many pages there are.

We use a regular expression to grab the number between "Tìm được" ("found") and "bài" ("songs"), which is 2174 here. Each result page contains 20 items, so we can calculate $totalPages.

//find total found songs
preg_match('%Kết quả tìm được <strong>(.*?)</strong> bài hát%', $result, $matches);
$totalStubs = $matches[1];

//calculate number of pages, 20 songs per page
$totalPages = ceil($totalStubs / 20);
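
As a worked example of the calculation, here is a small sketch that runs the same pattern on a made-up sample line. The function name countResultPages and the sample HTML are hypothetical, and stripping "." from the total is an assumption (the hit counts use "." as a thousands separator, so the total may too).

```php
<?php
// Sketch: derive the page count from the "total found" line of a result page.
// Returns 0 when the line is not found.
function countResultPages(string $html, int $perPage = 20): int
{
    if (!preg_match('%Kết quả tìm được <strong>(.*?)</strong> bài hát%', $html, $m)) {
        return 0;
    }
    // Assumption: the total may contain "." thousands separators, like the hit counts.
    $total = (int) str_replace('.', '', trim($m[1]));
    return (int) ceil($total / $perPage);
}

// Hypothetical sample of the relevant line
$sample = '<p>Kết quả tìm được <strong>2174</strong> bài hát</p>';
echo countResultPages($sample), "\n"; // 2174 / 20 = 108.7, rounded up to 109
```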
  • Now we visit each of the pages
for ($p = 1; $p <= $totalPages; $p++)

and look at the list of results on each one.

  • How do we get the title and direct link for each song? Again, with a regular expression. First, look at the HTML source code of the result page.

Then we can try this:

//get items for current page
preg_match_all('%<a class="_trackLink" tracking="_frombox=search_song" title="[^>]*" href="(/bai-hat/.*?\.html)">(.*?)</a>(?s)(.*?)(\d*?)kb/s(?s)(.*?)t nghe: (.*?)</p>%',
    $result, $matches, PREG_PATTERN_ORDER);

//$matches[1] : songURL (relative)
//$matches[2] : songTitle
//$matches[4] : bitrate (kb/s)
//$matches[6] : hit (view count)

This retrieves the relative URL of the song, its title, its bitrate and its view count.
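
To see which capture group holds what, here is a sketch that runs the pattern on a made-up fragment. The sample HTML, the song keys ABC123 and XYZ789, and the function name parseResultPage are all hypothetical; the real page markup may differ. The only change to the pattern is that the dot in ".html" is escaped.

```php
<?php
// Sketch: turn one result page into a list of song records.
function parseResultPage(string $html): array
{
    $pattern = '%<a class="_trackLink" tracking="_frombox=search_song" title="[^>]*" '
        . 'href="(/bai-hat/.*?\.html)">(.*?)</a>(?s)(.*?)(\d*?)kb/s(?s)(.*?)t nghe: (.*?)</p>%';
    preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);

    $songs = [];
    foreach ($matches[1] as $i => $url) {
        $songs[] = [
            'url'     => $url,
            'title'   => trim($matches[2][$i]),
            'bitrate' => (int) $matches[4][$i],
            // hit counts use "." as a thousands separator
            'hits'    => (int) str_replace('.', '', $matches[6][$i]),
        ];
    }
    return $songs;
}

// Made-up fragment shaped to fit the pattern
$sample = <<<'HTML'
<a class="_trackLink" tracking="_frombox=search_song" title="Take Me To Your Heart - MLTR" href="/bai-hat/Take-Me-To-Your-Heart-MLTR/ABC123.html">Take Me To Your Heart</a>
<p>320kb/s | Lượt nghe: 1.234.567</p>
<a class="_trackLink" tracking="_frombox=search_song" title="Take Me To Ur Heart - MLTR" href="/bai-hat/Take-Me-To-Ur-Heart-MLTR/XYZ789.html">Take Me To Ur Heart</a>
<p>128kb/s | Lượt nghe: 980</p>
HTML;

$songs = parseResultPage($sample);
print_r($songs);
```

The `(?s)` in the middle of the pattern lets the later `.` match newlines, which is what allows the bitrate and hit count to sit on a different line from the link.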

Why do we need the bitrate and view count?

To decide later which version of a song has the highest quality, since people often upload duplicates. I prefer to pick the one with the highest bitrate, then the highest view count.

  • We now process all the songs on that page and construct a Stub object for each one (it contains all the information needed to retrieve the real song later)
//get key from song URL
preg_match('%/bai-hat/.*[/](.*?)\.html%', $matches[1][$i], $regs);
$stub->stubkey = $regs[1];

$stub->title = ListenMe::convertFirstCharacterOfUnicodeStringToLowerCase(trim($matches[2][$i]));

$stub->bitrate = intval($matches[4][$i]);
//remove the dots inside the hit count before converting, otherwise intval() stops at the first dot
$stub->views = intval(str_replace(".", "", $matches[6][$i]));
$stub->status = 1;
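
Here is a self-contained sketch of the same step on made-up values. The function name makeStub and the key ABC123 are hypothetical, a plain stdClass stands in for the Stub class, and the ListenMe title helper is left out because it is not shown in this post.

```php
<?php
// Sketch: build a stub from one parsed song. Field names follow the post.
function makeStub(string $songUrl, string $title, string $bitrate, string $hits): stdClass
{
    $stub = new stdClass();
    // The key is the last path segment, without the ".html" suffix.
    $stub->stubkey = preg_match('%/bai-hat/.*/(.*?)\.html%', $songUrl, $regs) ? $regs[1] : null;
    $stub->title   = trim($title);
    $stub->bitrate = intval($bitrate);
    // Strip the "." thousands separators first: intval("1.234.567") alone gives 1.
    $stub->views   = intval(str_replace('.', '', $hits));
    $stub->status  = 1;
    return $stub;
}

$stub = makeStub(
    '/bai-hat/Take-Me-To-Your-Heart-MLTR/ABC123.html',
    ' Take Me To Your Heart ',
    '320',
    '1.234.567'
);
echo $stub->stubkey, ' ', $stub->views, "\n";
```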
  • Now, as I already mentioned, people often upload the same song many times (on Zing MP3, a song usually has 10+ duplicate versions)

Which one do I choose to upload to ListenMe.net? The one with the higher bitrate and higher view count (the snippet below compares only the view count).

$toBeAdded = 1;

if (count($validStubs) > 0)
{
    foreach ($validStubs as $st) {
        similar_text(ListenMe::convertUnicodeStringToLowercase(trim($st->title)), ListenMe::convertUnicodeStringToLowercase(trim($stub->title)), $percent);

        if ($percent > $matchThreshold) {
            if ($st->views >= $stub->views) { //don't add
                $toBeAdded = 0;
            } else { //add and remove existing stub
                $st->status = 0;
            }
        }
    }
}

$validStubs stores the "chosen" songs (those that passed our test). For each new song, we check whether it matches any song in $validStubs; if there is a match, we compare the hit counts (views) and keep the song with more views.

Note the use of the similar_text function: it matches song titles that differ only slightly (typos), e.g. "Take me to your heart" and "Take me to ur heart".
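
Here is a self-contained sketch of the de-duplication idea that applies the "highest bitrate, then highest view count" preference. It is a variation, not the exact code above: it uses plain arrays instead of Stub objects, the names normalizeTitle and dedupeStubs are hypothetical, normalizeTitle stands in for the ListenMe helpers, and the threshold of 90 is an arbitrary value chosen for illustration.

```php
<?php
// Lowercase a UTF-8 title; stands in for the ListenMe:: helpers, which are not shown.
function normalizeTitle(string $title): string
{
    $title = trim($title);
    return function_exists('mb_strtolower') ? mb_strtolower($title, 'UTF-8') : strtolower($title);
}

// Keep one stub per near-duplicate title: higher bitrate wins, then higher views.
// $threshold is a similarity percentage; 90 is an arbitrary illustrative value.
function dedupeStubs(array $stubs, float $threshold = 90.0): array
{
    $kept = [];
    foreach ($stubs as $stub) {
        $matchIndex = null;
        foreach ($kept as $i => $existing) {
            similar_text(normalizeTitle($existing['title']), normalizeTitle($stub['title']), $percent);
            if ($percent > $threshold) {
                $matchIndex = $i;
                break;
            }
        }
        if ($matchIndex === null) {
            $kept[] = $stub; // first time we see this title
            continue;
        }
        $existing = $kept[$matchIndex];
        $order = ($stub['bitrate'] <=> $existing['bitrate']) ?: ($stub['views'] <=> $existing['views']);
        if ($order > 0) {
            $kept[$matchIndex] = $stub; // the newcomer is the better version
        }
    }
    return $kept;
}

$stubs = [
    ['title' => 'Take me to your heart', 'bitrate' => 128, 'views' => 5000],
    ['title' => 'Take me to ur heart',   'bitrate' => 320, 'views' => 100],
    ['title' => 'In the end',            'bitrate' => 192, 'views' => 900],
];
foreach (dedupeStubs($stubs) as $s) {
    echo $s['title'], ' (', $s['bitrate'], "kb/s)\n";
}
```

In this sample the two "Take me to ..." titles come out at roughly 95% similar, so they are treated as the same song and the 320kb/s version is kept even though it has fewer views.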

  • Once we finish looping over all the pages, we save all the stubs to the DB and later upload them to ListenMe, again using cURL and the direct links retrieved in this step

Because Zing now redirects the download link, we need to enable CURLOPT_FOLLOWLOCATION


curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
curl_setopt($ch, CURLOPT_MAXREDIRS, 1); // follow at most 1 redirect
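
For context, here is a minimal sketch of how those two options might fit into a download function that saves the file to disk. The function name downloadSong is hypothetical, and the way the download URL is obtained from a stub is not covered here.

```php
<?php
// Sketch: save a (redirected) download to a local file.
// Returns true on success; $url and $targetPath are supplied by the caller.
function downloadSong(string $url, string $targetPath): bool
{
    $fp = fopen($targetPath, 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the body straight to disk
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // allow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 1);         // follow at most one redirect
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status >= 400 as a failure
    $ok = curl_exec($ch);
    fclose($fp);
    if ($ok === false) {
        @unlink($targetPath); // drop the partial file
        return false;
    }
    return true;
}
```

Writing through CURLOPT_FILE avoids holding a whole MP3 in memory, which matters when many songs are downloaded in one run.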
