Tag: Python

2 Posts

POJ site spider
Description

Design a crawler for the problems on poj.org, crawl the details of 100 problems, and save them as a *.json file.

Preparation

The lab can be divided into these steps:

1. Send a request and receive a response.
2. Parse the response and save the desired data.
3. Repeat steps 1 and 2.

Observation on URL pattern

The URLs of problems on POJ all follow this format: http://poj.org/problem?id=<id>, where <id> is an integer starting at 1000. So we only need to send requests to this pattern, with <id> ranging from 1000 to 1099.

Strategies against anti-crawler measures

As a friendly crawler, we should not put much pressure on POJ's server. This is something we should think about before we start coding. To avoid being banned, here are some simple strategies:

- Use fake User-Agent headers.
- Sleep for a few seconds between two requests.
- If we still get banned, implement an IP pool.

Coding

Parse the response and save the data we want

We can manually obtain the HTML source of a page using Firefox's view-source function, so we will start with the parsing. In this section, we will import…
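The request loop described above (URL pattern, fake User-Agent, pause between requests) could be sketched roughly as follows. This is only a minimal sketch using the standard library, not the post's actual code: the names `problem_urls`, `fetch`, `crawl`, and `save_json`, the User-Agent string, and the 2-second delay are all assumptions made for illustration, and the parsing step is left out because the excerpt stops before it.

```python
import json
import time
import urllib.request

BASE_URL = "http://poj.org/problem?id={}"
# Assumed header value; any realistic browser string would serve the same purpose.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def problem_urls(start=1000, count=100):
    """Return the URLs for `count` consecutive problem ids, beginning at `start`."""
    return [BASE_URL.format(pid) for pid in range(start, start + count)]


def fetch(url, timeout=10):
    """Download one page with a browser-like User-Agent and return its HTML text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetcher=fetch, delay=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between two requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # be gentle with the server
        pages[url] = fetcher(url)
    return pages


def save_json(data, path):
    """Write the collected data to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)
```

Calling `save_json(crawl(problem_urls()), "problems.json")` would store the raw HTML of problems 1000 to 1099; a real crawler would parse each page into structured fields before saving. Passing `fetcher` and `sleep` as parameters keeps the loop easy to try out without touching the network.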
Crawl song lyrics in NetEase
Introduction

As beginners in Python, @Zei_Wai and I wanted to do a simple task instead of diving into the sea of Python documentation. So we chose this: generating a WordCloud picture from the lyrics of a NetEase Music song list.

Using the urllib package

First we found a NetEase API: https://music.163.com/api/song/lyric?id=<song.id>&lv=1&kv=1&tv=-1 where <song.id> is the id of a song in the NetEase database. When we use the web client to play music, the URL of the playing page contains this id. As a simple example, if we want to get the lyrics of Bad Apple!!, we just need to do this:

Then we just need to get the songs' ids automatically, and we can generate a text file. But when we use this method to send a request to the NetEase song list page, the response doesn't contain the song ids. This is the code we used:[ref]【python爬虫自学笔记】-----爬取网易云歌单中歌曲歌词 (Python crawler self-study notes: crawling the lyrics of songs in a NetEase Cloud Music song list)[/ref]

    import json
    import requests
    import re
    import urllib
    from bs4 import *

    def get_html(url):
        myheaders = {
            "User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11 ",
            "Referer": "http://music.163.com/",
            "Host": "music.163.com"
        }
        try:…
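For the lyric API mentioned above, a request for a single song might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the response layout (lyric text under `lrc` then `lyric`) is an assumption that should be checked against a real response. No song id is filled in, since the excerpt does not give one.

```python
import json
import urllib.request

# URL pattern quoted in the post; only the song id varies.
LYRIC_API = "https://music.163.com/api/song/lyric?id={song_id}&lv=1&kv=1&tv=-1"


def lyric_url(song_id):
    """Build the lyric API URL for one song id."""
    return LYRIC_API.format(song_id=song_id)


def extract_lyric(payload):
    """Pull the lyric text out of the decoded JSON.

    ASSUMPTION: the text sits under payload["lrc"]["lyric"]; an empty
    string is returned when that key is missing.
    """
    return (payload.get("lrc") or {}).get("lyric", "")


def fetch_lyric(song_id, timeout=10):
    """Request the lyric JSON for `song_id` and return the lyric text."""
    headers = {
        "User-Agent": "Mozilla/5.0",  # assumed value, any browser-like string
        "Referer": "http://music.163.com/",
    }
    request = urllib.request.Request(lyric_url(song_id), headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return extract_lyric(payload)
```

Keeping `extract_lyric` separate from the network call makes it easy to adjust if the actual JSON layout differs from the assumed one.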