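Notes from working through the book 「Python 编程快速上手」 (the Chinese edition of "Automate the Boring Stuff with Python"), mainly covering regular expressions, file handling, and web scraping.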

Regular expressions

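? matches zero or one of the preceding group.
* matches zero or more of the preceding group.
+ matches one or more of the preceding group.
{n} matches exactly n of the preceding group.
{n,} matches n or more of the preceding group.
{,m} matches zero to m of the preceding group.
{n,m} matches at least n and at most m of the preceding group.
{n,m}? or *? or +? performs a non-greedy match of the preceding group.
^spam means the string must begin with spam.
spam$ means the string must end with spam.
. matches any character except a newline.
\d, \w, and \s match a digit, a word character, and a whitespace character, respectively.
\D, \W, and \S match any character except a digit, a word character, and a whitespace character, respectively.
[abc] matches any one character between the brackets (such as a, b, or c).
[^abc] matches any one character that is not between the brackets.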
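
A few of the rules above can be checked directly in Python. This is only a small sketch; the sample strings ('HaHaHaHaHa', '1234', and so on) are made up for illustration and do not come from the book.

```python
import re

# Greedy vs. non-greedy: {3,5} takes as many repeats as it can,
# {3,5}? takes as few as it can.
greedy = re.search(r'(Ha){3,5}', 'HaHaHaHaHa').group()
lazy = re.search(r'(Ha){3,5}?', 'HaHaHaHaHa').group()

# ^ and $ anchor the pattern to the start and end of the string.
whole_number = re.search(r'^\d+$', '1234') is not None
not_whole_number = re.search(r'^\d+$', '12 34') is not None

# A character class and its negation.
vowels = re.findall(r'[aeiou]', 'regular expression')
non_vowel_count = len(re.findall(r'[^aeiou]', 'regex'))

print(greedy)  # HaHaHaHaHa
print(lazy)    # HaHaHa
```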

Basic use of regular expressions

>>> import re
>>> phone = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phone.search('My number is 415-555-4242.')
>>> mo.group()
'415-555-4242'
>>>

# Grouping with parentheses
>>> phone = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phone.search('My number is 415-555-4242.')
>>> mo.group()
'415-555-4242'
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.groups()
('415', '555-4242')

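findall()

Rather than a Match object, findall() returns a list of strings (as long as the regular expression contains no groups; with groups it returns a list of tuples).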

>>> phone = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phone.search('Cell: 415-555-9999 Work: 212-555-0000')
>>> mo.group()
'415-555-9999'
>>> mo = phone.findall('Cell: 415-555-9999 Work: 212-555-0000')
>>> mo
['415-555-9999', '212-555-0000']
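
The return type of findall() depends on whether the pattern contains groups. The sketch below reuses the phone number text from above; the variable names are my own.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups in the pattern: findall() returns a list of strings.
no_groups = re.compile(r'\d{3}-\d{3}-\d{4}')
plain = no_groups.findall(text)

# Groups in the pattern: findall() returns a list of tuples,
# one tuple per match and one string per group.
with_groups = re.compile(r'(\d{3})-(\d{3}-\d{4})')
tuples = with_groups.findall(text)

print(plain)
print(tuples)
```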

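Reading and writing files

File paths and os.path

On Windows, paths are written with backslashes (\) as the separator between folder names. OS X and Linux use the forward slash (/) as their path separator.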

>>> import os

# On Windows
>>> os.path.join('usr','bin','test')
'usr\\bin\\test'

# On Linux
>>> os.path.join('usr','bin','test')
'usr/bin/test'

Getting the current directory with getcwd()

# Get the current working directory
>>> os.getcwd()
'/mnt/e/liuhao_data/OneDrive/code/Python'
>>>

# Change directory
>>> os.chdir('/mnt/e/liuhao_data/OneDrive/')
>>> os.getcwd()
'/mnt/e/liuhao_data/OneDrive'

# Changing to a directory that does not exist
>>> os.chdir('/mnt/e/liuhao_data/OneDrive1/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/liuhao_data/OneDrive1/'

Creating directories with makedirs()

>>> os.makedirs('liuhao/test/')
>>> os.chdir('/mnt/e/liuhao_data/OneDrive/code/Python/liuhao/test')
>>> os.getcwd()
'/mnt/e/liuhao_data/OneDrive/code/Python/liuhao/test'

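The os.path module

Finding file sizes and folder contents

os.path.getsize(path) returns the size in bytes of the file at path.
os.listdir(path) returns a list of file name strings, one for each entry in the folder at path (this function lives in the os module, not os.path).
To find the total size of all the files in a directory, use os.path.getsize() and os.listdir() together.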

>>> totalSize = 0
>>> for file in os.listdir('/mnt/e/liuhao_data/OneDrive/code'):
...     totalSize += os.path.getsize(os.path.join('/mnt/e/liuhao_data/OneDrive/code', file))
...
>>> print(totalSize)
31180
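
The loop above also calls getsize() on subfolders, whose reported size is not the size of their contents. A possible variant that counts regular files only is sketched below. The function name total_file_size and the temporary demo files are my own, not from the book.

```python
import os
import tempfile

def total_file_size(folder):
    """Sum the sizes of the regular files directly inside folder."""
    total = 0
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip subfolders
            total += os.path.getsize(path)
    return total

# Demo in a throwaway directory: two files (5 and 10 bytes) and one subfolder.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'a.txt'), 'wb') as f:
        f.write(b'12345')
    with open(os.path.join(demo_dir, 'b.txt'), 'wb') as f:
        f.write(b'1234567890')
    os.makedirs(os.path.join(demo_dir, 'sub'))
    demo_total = total_file_size(demo_dir)

print(demo_total)  # 15
```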

Checking path validity

os.path.exists(path) returns True if the file or folder at path exists.
os.path.isfile(path) returns True if path exists and is a file.
os.path.isdir(path) returns True if path exists and is a folder.

>>> os.path.exists('/mnt/e/liuhao_data/OneDrive/code')
True
>>> os.path.exists('/mnt/e/liuhao_data/OneDrive/code2')
False
>>> os.path.isfile('/mnt/e/liuhao_data/OneDrive/code')
False
>>> os.path.isfile('/mnt/e/liuhao_data/OneDrive/code/xen.py')
True
>>> os.path.isdir('/mnt/e/liuhao_data/OneDrive/code/xen.py')
False
>>> os.path.isdir('/mnt/e/liuhao_data/OneDrive/code/')
True
>>> os.path.isdir('/mnt/e/liuhao_data/OneDrive/code')
True

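The file reading and writing process

Reading or writing a file in Python takes three steps:
1. Call the open() function, which returns a File object.
2. Call the read() or write() method on the File object.
3. Call the close() method on the File object to close the file.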
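
The three steps can also be written with a with statement, which closes the file automatically when the block ends, even if an error occurs. This is a small sketch; the file name hello.txt and the temporary directory are only for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 'hello.txt')

    # Steps 1 and 2: open the file and write to it.
    # Step 3 (close) happens automatically at the end of the with block.
    with open(path, 'w') as f:
        f.write('Hello, world!\n')

    with open(path) as f:
        content = f.read()

    closed_after_block = f.closed  # True once the with block has ended

print(repr(content))
```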

Opening and reading files

>>> sonnet = open('sonnet29.txt')
>>> sonnet.read()
"When, in disgrace with fortune and men's eyes,\nI all alone beweep my outcast state,\nAnd trouble deaf heaven with my bootless cries,\nAnd look upon myself and curse my fate,\n"

>>> sonnet = open('sonnet29.txt')
>>> sonnet.readlines()
["When, in disgrace with fortune and men's eyes,\n", 'I all alone beweep my outcast state,\n', 'And trouble deaf heaven with my bootless cries,\n', 'And look upon myself and curse my fate,\n']

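Writing to files

Write mode overwrites the existing file and starts from scratch; pass 'w' as the second argument to open() to open the file in write mode. Append mode, by contrast, adds text to the end of the existing file instead of overwriting it; pass 'a' as the second argument to open() to open the file in append mode.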

>>> sonnet = open('sonnet29.txt', 'w')
>>> sonnet.write('test write file.\n')
17
>>> sonnet.close()

>>> sonnet = open('sonnet29.txt')
>>> sonnet.read()
'test write file.\n'

>>> sonnet = open('sonnet29.txt', 'a')
>>> sonnet.write('this is line 2.\n')
16
>>> sonnet.close()

>>> sonnet = open('sonnet29.txt')
>>> sonnet.read()
'test write file.\nthis is line 2.\n'
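
The difference between the two modes can be seen in one short run. This sketch writes to a temporary file (notes.txt is a made-up name) so that it does not touch sonnet29.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 'notes.txt')

    with open(path, 'w') as f:   # creates the file
        f.write('first\n')
    with open(path, 'w') as f:   # 'w' again: the old content is discarded
        f.write('second\n')
    with open(path, 'a') as f:   # 'a': the new text goes after the old content
        f.write('third\n')

    with open(path) as f:
        lines = f.readlines()

print(lines)  # ['second\n', 'third\n']
```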

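Organizing files

The shutil module

Copying files and directories

shutil.copy(sourceFile, destination) copies a file and returns the path of the copy.
shutil.copytree(sourceDir, destination) copies a directory, including all of its files and subfolders, and returns the path of the destination directory.

>>> import os, shutil
>>> os.getcwd()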
'/mnt/e/liuhao_data/OneDrive/code/Python/liuhao'
>>> os.listdir()
['hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test', 'test.t']

# Copy a file
>>> shutil.copy('hello.txt', 'hello.copy')
'hello.copy'
>>> os.listdir()
['hello.copy', 'hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test', 'test.t']

# Copy a directory
>>> shutil.copytree('../liuhao', '../liuhao-bak')
'../liuhao-bak'
>>> os.listdir('../liuhao-bak')
['hello.copy', 'hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test', 'test.t']

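Moving and renaming files and directories

shutil.move(source, destination) moves source to destination and returns the new path.
If destination points to a folder, the source file is moved into destination and keeps its original file name.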

>>> os.listdir()
['hello.copy', 'hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test', 'test.t']
# Rename a file
>>> shutil.move('hello.copy', 'hello.bak')
'hello.bak'
>>> os.listdir()
['hello.bak', 'hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test', 'test.t']

# Rename a directory
>>> shutil.move('test', 'test-bak')
'test-bak'
>>> os.listdir()
['hello.bak', 'hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test-bak', 'test.t']

# Move a file
>>> shutil.move('test.t', 'test-bak')
'test-bak/test.t'
>>> os.listdir()
['hello.bak', 'hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test-bak']

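Permanently deleting files and folders

os.unlink(path) deletes the file at path.
os.rmdir(path) deletes the folder at path. The folder must be empty.
shutil.rmtree(path) recursively deletes the folder at path, including everything it contains.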

# Delete a file
>>> os.unlink('hello.bak')
>>> os.listdir()
['hello.txt', 'randomQuizGen.py', 'sonnet29.txt', 'test-bak']

# os.rmdir cannot delete a non-empty directory
>>> os.rmdir('test-bak')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 39] Directory not empty: 'test-bak'

# Recursively delete a directory
>>> shutil.rmtree('test-bak')
>>> os.listdir()
['hello.txt', 'randomQuizGen.py', 'sonnet29.txt']
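
Because these deletions are permanent, it can be safer to try the calls in a throwaway directory first. The sketch below repeats the copy, move, and delete steps inside a temporary directory; all file and folder names in it are made up for the demo.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    src = os.path.join(demo_dir, 'hello.txt')
    with open(src, 'w') as f:
        f.write('hi\n')
    backup_dir = os.path.join(demo_dir, 'backup')
    os.makedirs(backup_dir)

    # Copy into a folder: the copy keeps the original file name.
    copied = shutil.copy(src, backup_dir)

    # Move to a new file name in the same folder, which is a rename.
    moved = shutil.move(src, os.path.join(demo_dir, 'renamed.txt'))
    after_move = sorted(os.listdir(demo_dir))

    # Delete the file, then the non-empty backup folder.
    os.unlink(moved)
    shutil.rmtree(backup_dir)
    after_delete = os.listdir(demo_dir)

print(after_move)    # ['backup', 'renamed.txt']
print(after_delete)  # []
```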

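Walking a directory tree with os.walk()

os.walk() takes a single string value: the path of a folder. You can use os.walk() in a for loop to walk a directory tree, much as you use range() to walk over a range of numbers. On each iteration of the loop, os.walk() yields three values:
1. A string with the current folder's name.
2. A list of strings naming the subfolders in the current folder.
3. A list of strings naming the files in the current folder.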

import os
for root, dirs, files in os.walk(".", topdown=False):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        print(os.path.join(root, name))

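Inside a subdirectory, operating on a file by its bare name does not work: printing the working directory shows that the process is still in the folder where the walk started. Use os.path.join(root, name) to get the file's full path.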

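import os, re, shutil

# Match file names ending in .jpg or .pdf (the dot is escaped so it is literal)
filereg = re.compile(r'\.(jpg|pdf)$')
for foldername, subfolders, filenames in os.walk('.'):
    for filename in filenames:
        if filereg.search(filename) is not None:
            print(os.getcwd())
            print(filename)
            # Join with foldername so that files in subfolders are found too
            shutil.move(os.path.join(foldername, filename), '/mnt/e/liuhao_data/OneDrive/code/Python/')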

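Output from running a variant of this script that also printed each folder and its subfolders. The move fails because 1.jpg already exists at the destination: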

liuhao@liuhao-pc:/mnt/e/liuhao_data/OneDrive/code/Python/liuhao$ python mvfile.py
The current folder is: .
Subfolders: ['test']
The current folder is: ./test
Subfolders: []
/mnt/e/liuhao_data/OneDrive/code/Python/liuhao
1.jpg
Traceback (most recent call last):
  File "mvfile.py", line 14, in <module>
    shutil.move(filename, '/mnt/e/liuhao_data/OneDrive/code/Python/')
  File "/usr/lib/python3.5/shutil.py", line 536, in move
    raise Error("Destination path '%s' already exists" % real_dst)
shutil.Error: Destination path '/mnt/e/liuhao_data/OneDrive/code/Python/1.jpg' already exists
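
To see why the full path matters, the sketch below builds a small tree in a temporary directory and collects matching files with os.path.join(foldername, filename). The helper name find_by_extension and the demo file names are my own.

```python
import os
import tempfile

def find_by_extension(top, extensions):
    """Return the full paths of files under top whose names end with one of extensions."""
    matches = []
    for foldername, subfolders, filenames in os.walk(top):
        for filename in filenames:
            if filename.lower().endswith(extensions):
                # filename alone is relative to foldername, not to the working directory
                matches.append(os.path.join(foldername, filename))
    return sorted(matches)

with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, 'test'))
    for rel in ('1.jpg', os.path.join('test', '2.pdf'), 'notes.txt'):
        with open(os.path.join(demo_dir, rel), 'w') as f:
            f.write('x')
    found = [os.path.relpath(p, demo_dir)
             for p in find_by_extension(demo_dir, ('.jpg', '.pdf'))]

print(found)
```

The file in the test subfolder is found only because its path is joined with foldername; a bare '2.pdf' would be looked up in the working directory.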

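Scraping information from the Web

webbrowser: comes with Python; opens a browser to a specific page.
requests: downloads files and web pages from the Internet.
Beautiful Soup: parses HTML, the format that web pages are written in.
selenium: launches and controls a web browser. selenium can fill in forms and simulate mouse clicks in that browser.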

Opening a URL with the webbrowser module

The webbrowser.open() function launches a new browser and opens the given URL.

>>> import webbrowser
>>> webbrowser.open('www.baidu.com')
True

# Run on Linux without a desktop session, so no browser could be opened
>>> webbrowser.open('www.baidu.com')
False

The requests module

requests.get()

The requests.get() function takes a string with the URL to download and returns a Response object.

>>> import requests
>>> response = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> response.status_code
200
>>> len(response.text)
178981
>>> response.text[:100]
'\ufeffThe Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare\r\n\r\nThis eBook is for the us'
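>>> # Assumption: res is the Response for the Baidu home page, as the file name suggests
>>> res = requests.get('http://www.baidu.com')
>>> file = open('baidu.txt', 'wb')
>>> for chunk in res.iter_content(100000):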
...     file.write(chunk)
...
2381
>>> file.close()
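
The chunked write can be wrapped in a small helper. The sketch below is not from the book: save_chunks is a hypothetical name, and it accepts any iterable of bytes, so with requests one would pass response.iter_content(100000). A list of fake chunks stands in for the network response so the example runs offline.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as f:  # 'wb': binary mode keeps the bytes unchanged
        for chunk in chunks:
            written += f.write(chunk)
    return written

# Stand-in for response.iter_content(100000): one full chunk and one partial chunk.
fake_chunks = [b'a' * 100000, b'b' * 2381]

with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, 'download.bin')
    total = save_chunks(fake_chunks, target)
    size_on_disk = os.path.getsize(target)

print(total)  # 102381
```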

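Parsing HTML with the BeautifulSoup module

The bs4.BeautifulSoup() function is called with a string containing the HTML to parse and returns a BeautifulSoup object. Naming the parser explicitly (here Python's built-in 'html.parser') avoids a warning about a guessed parser.

>>> import requests,bs4
>>> res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> noSoup = bs4.BeautifulSoup(res.text, 'html.parser')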
>>> type(noSoup)
<class 'bs4.BeautifulSoup'>

You can also pass a File object to bs4.BeautifulSoup() to load an HTML file from your hard drive.

>>> file = open('example.html')
>>> soup = bs4.BeautifulSoup(file, 'html.parser')
>>> type(soup)
<class 'bs4.BeautifulSoup'>

Create a test HTML file, example.html:

<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

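Finding elements with the select() method

Selector passed to select()	Matches
`soup.select('div')`	All elements named `<div>`
`soup.select('#author')`	The element with an id attribute of author
`soup.select('.notice')`	All elements with a CSS class attribute of notice
`soup.select('div span')`	All `<span>` elements within a `<div>` element
`soup.select('div > span')`	All `<span>` elements directly within a `<div>` element, with no other element in between
`soup.select('input[name]')`	All elements named `<input>` that have a name attribute with any value
`soup.select('input[type="button"]')`	All elements named `<input>` that have a type attribute with the value button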

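The select() method returns a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list contains one Tag object for every match in the BeautifulSoup object's HTML. Tag values can be passed to str() to show the HTML tags they represent. Tag values also have an attrs attribute that holds all of the tag's HTML attributes as a dictionary.

>>> file = open('example.html')
>>> soup = bs4.BeautifulSoup(file.read(), 'html.parser')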
>>> elems = soup.select('#author')
>>> elems
[<span id="author">Al Sweigart</span>]
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Al Sweigart'
>>> elems[0]
<span id="author">Al Sweigart</span>
>>> elems[0].attrs
{'id': 'author'}

>>> elems = soup.select('.slogan')
>>> elems
[<p class="slogan">Learn Python the easy way!</p>]

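Getting data from an element's attributes

The get() method of Tag objects makes it simple to read attribute values from an element. Pass it a string with the attribute's name and it returns that attribute's value.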

>>> elems = soup.select('span')
>>> elems
[<span id="author">Al Sweigart</span>]
>>> elems = soup.select('span')[0]
>>> elems
<span id="author">Al Sweigart</span>
>>> str(elems)
'<span id="author">Al Sweigart</span>'
>>> elems.get('id')
'author'
>>> elems.attrs
{'id': 'author'}

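Example: scraping a website

The goal is to download every comic from http://xkcd.com/. The front page has a Prev button that takes the user to the previous comic. Downloading each comic by hand would take a long time, but a script can do it in a few minutes.
The code needs to do the following:

Download the page with the requests module.
Find the URL of the comic image on the page with Beautiful Soup.
Download the comic image with iter_content() and save it to the hard drive.
Find the URL of the previous comic's link, then repeat.

Open a new file editor window and save it as downloadXkcd.py.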

#! python3
# -*- coding: utf-8 -*-
# downloadXkcd.py - download every xkcd comic by following the Prev links

import os
from urllib.parse import urljoin

import bs4
import requests

url = 'http://xkcd.com'
os.makedirs('xkcd', exist_ok=True)

while not url.endswith('#'):
    # Download the page
    print('Downloading page %s ...' % url)
    response = requests.get(url)
    print('the return code : ' + str(response.status_code))

    soup = bs4.BeautifulSoup(response.text, 'lxml')

    # Find the URL of the comic image
    comic = soup.select('#comic img')  # <img> elements inside the element with id="comic"
    if not comic:
        print('cannot get the comic image')
    else:
        # src may be relative or protocol-relative, so resolve it against the page URL
        imageUrl = urljoin(url, comic[0].get('src'))
        print('downloading the image %s...' % imageUrl)
        response = requests.get(imageUrl)
        print('the return code : ' + str(response.status_code))

        # Save the image to ./xkcd
        print(os.path.basename(imageUrl))
        with open(os.path.join('xkcd', os.path.basename(imageUrl)), 'wb') as image:
            for chunk in response.iter_content(1000):
                image.write(chunk)

    # Get the URL of the Prev button
    prev = soup.select('a[rel="prev"]')[0]  # <a> elements whose rel attribute is prev
    url = 'http://xkcd.com' + prev.get('href')

print('Done.')
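
When turning src or href attribute values into absolute URLs, urllib.parse.urljoin from the standard library is one option, since it handles absolute paths, relative paths, and protocol-relative values such as //host/path. The sketch below only illustrates the resolution rules; the comic number and file names in it are hypothetical, not taken from a real page.

```python
from urllib.parse import urljoin

page_url = 'http://xkcd.com/1234/'  # hypothetical comic page

# Protocol-relative src: the scheme comes from the page URL.
image_url = urljoin(page_url, '//imgs.xkcd.com/comics/example.png')

# Absolute path: replaces the path of the page URL.
prev_url = urljoin(page_url, '/1233/')

# Relative path: resolved against the page's folder.
relative_url = urljoin(page_url, 'info.json')

print(image_url)     # http://imgs.xkcd.com/comics/example.png
print(prev_url)      # http://xkcd.com/1233/
print(relative_url)  # http://xkcd.com/1234/info.json
```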