Tag: Spider

1 Post

POJ site spider
Description

Design a crawler for the problems on poj.org, crawl the details of 100 problems, and save them as *.json files.

Preparation

The lab can be divided into these steps:

1. Send a request and receive a response.
2. Parse the response and save the desired data.
3. Repeat steps 1 and 2.

Observation on URL pattern

The URLs of problems on POJ all follow this format: http://poj.org/problem?id=<id> where <id> is an integer starting at 1000. So we only need to send requests to URLs of this pattern, with <id> ranging from 1000 to 1099.

Strategies against anti-crawler measures

As a friendly crawler, we should not put much pressure on POJ's server. This is something we should think about before we start coding. To avoid being banned, here are some simple strategies:

1. Use fake User-Agent headers.
2. Sleep for a few seconds between two consecutive requests.
3. If we still get banned, implement an IP pool.

Coding

Parse response and save the data we want

We can manually obtain the HTML source of a page using Firefox's view-source function, so we will start with the parsing. In this section, we will import…
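As a rough illustration of the plan above (build the 100 URLs, send each request with a browser-like User-Agent, sleep between requests, and write one JSON file per problem), here is a minimal Python sketch. It is not the post's own code: the names `problem_urls`, `fetch`, and `crawl`, the User-Agent string, the 2-second delay, and the JSON layout are all assumptions made for this example, and the HTML parsing step is left out, so the raw page is stored instead of parsed fields.

```python
import json
import time
import urllib.request
from pathlib import Path

# URL pattern taken from the post; <id> runs from 1000 to 1099.
BASE_URL = "http://poj.org/problem?id={id}"
# Assumed browser-like User-Agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def problem_urls(first=1000, count=100):
    """Return (id, url) pairs for ids first .. first + count - 1."""
    return [(pid, BASE_URL.format(id=pid)) for pid in range(first, first + count)]


def fetch(url, timeout=10):
    """Download one page, sending the User-Agent header defined above."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(out_dir, first=1000, count=100, delay=2.0, fetch_page=fetch):
    """Fetch each problem page with a pause in between; save one JSON file per id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, (pid, url) in enumerate(problem_urls(first, count)):
        if index > 0:
            time.sleep(delay)  # pause between consecutive requests
        html = fetch_page(url)
        # Parsing is omitted in this sketch, so the raw HTML is stored.
        record = {"id": pid, "url": url, "html": html}
        path = out / f"{pid}.json"
        path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
```

Under these assumptions, calling `crawl("problems")` would write `problems/1000.json` through `problems/1099.json`. The `fetch_page` parameter exists so that the download step can be swapped out, for example with a stub during testing or with a version that routes requests through an IP pool.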