Problem

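Visit http://www.heibanke.com/lesson/crawler_ex00/. In the first level, you append the number shown on the page to the end of the current URL and visit that address. The new page shows another number, which replaces the trailing number in the URL. Repeat this until the page shows the success message, as in the screenshot below.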

Intermediate page
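
The rule described above (read the number from the page, append it to the base URL, stop when no number appears) can be sketched as a small pure function. This is only a sketch under assumptions: the helper name `next_url` and the sample heading strings are hypothetical, and it assumes the number is the first run of digits in the heading text.

```python
import re

BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'


def next_url(heading_text, base_url=BASE_URL):
    """Return the URL of the next page, or None when the text has no number."""
    match = re.search(r"\d+", heading_text)
    if match is None:
        # No digits in the heading: treat this as the final page.
        return None
    return base_url + match.group(0)


# Hypothetical heading texts, for illustration only.
print(next_url("The next number is 12345"))
print(next_url("Congratulations, you passed"))
```

Keeping this step free of network calls makes it easy to check on its own before wiring it into a crawler loop.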

BeautifulSoup implementation

# coding=utf-8

import re

import bs4
import requests

base_url = 'http://www.heibanke.com/lesson/crawler_ex00/'
url = base_url

while True:
    # download the page
    print("forward to page %s ..." % url)
    response = requests.get(url)
    print("the return code : " + str(response.status_code))

    soup = bs4.BeautifulSoup(response.text, "html.parser")

    # the number for the next page is in the first <h3> heading
    headings = soup.select('h3')
    print(headings[0].getText())
    number = re.findall(r"\d+", headings[0].getText())
    if not number:
        # no number in the heading: this is the final page
        print('The end.')
        break
    # build the next URL from the first number found
    url = base_url + number[0]

Program output
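
The script above loops with `while True`, so it only stops when a page without a number turns up. One possible variation, sketched here under assumptions, adds a step limit and takes the page-fetching step as a parameter so the loop can be tried without touching the network. The names `follow_chain`, `fetch_heading`, and `max_steps`, and the fake pages under `example.test`, are all hypothetical.

```python
import re


def follow_chain(fetch_heading, base_url, max_steps=1000):
    """Follow number-suffixed URLs until a heading contains no number.

    fetch_heading is any callable that maps a URL to the page's heading text.
    Returns the list of URLs visited; raises RuntimeError if max_steps is hit.
    """
    url = base_url
    visited = []
    for _ in range(max_steps):
        visited.append(url)
        match = re.search(r"\d+", fetch_heading(url))
        if match is None:
            return visited
        url = base_url + match.group(0)
    raise RuntimeError("no final page after %d steps" % max_steps)


# Stand-in for the real site: three fake pages keyed by URL (made-up data).
fake_pages = {
    "http://example.test/": "next number: 11",
    "http://example.test/11": "next number: 22",
    "http://example.test/22": "done",
}
print(follow_chain(fake_pages.__getitem__, "http://example.test/"))
```

For the real site, `fetch_heading` could be a function that downloads the page with requests and returns the text of its first h3 element, as the script above does inline.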

Selenium implementation

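The selenium module lets Python control a browser directly: it actually clicks links and fills in login forms, almost as if a human user were interacting with the page. Compared with Requests and Beautiful Soup, Selenium lets you interact with web pages in far more sophisticated ways. But because it launches a web browser, it is somewhat slow and hard to run in the background if all you want is to download a few files from the web.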

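Selenium needs a driver program to talk to the browser you choose. Download the WebDriver that matches your browser and add its location to the system PATH environment variable. Firefox, for example, needs geckodriver.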

Browser	Driver download page
Chrome	https://sites.google.com/a/chromium.org/chromedriver/downloads
Edge	https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox	https://github.com/mozilla/geckodriver/releases
Safari	https://webkit.org/blog/6900/webdriver-support-in-safari-10/
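
Before launching the browser, it can help to confirm that the driver really is reachable through PATH. This is a small sketch using only the standard library; the helper name `find_driver` is hypothetical, and it assumes the executable keeps its default name (`geckodriver` for Firefox).

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH; download it and add its folder to PATH")
else:
    print("using driver at", path)
```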

# coding=utf-8

import re

from selenium import webdriver
from selenium.webdriver.common.by import By

base_url = 'http://www.heibanke.com/lesson/crawler_ex00/'
url = base_url

browser = webdriver.Firefox()

while True:
    # load the page in the browser
    print("Forward to page %s ..." % url)
    browser.get(url)
    # find_element_by_tag_name was removed in Selenium 4; use find_element with By
    elem = browser.find_element(By.TAG_NAME, 'h3')

    # the number for the next page is in the <h3> heading
    print(elem.text)
    number = re.findall(r"\d+", elem.text)
    if not number:
        # no number in the heading: this is the final page
        print('The end.')
        browser.quit()
        break
    # build the next URL from the first number found
    url = base_url + number[0]

Final page

For more on using Selenium, see the official documentation.
