A Python 3 web crawler for Zhihu

Disclaimer: the code in this project follows the robots.txt protocol, is meant only for personal Python study, and has no practical use.

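I wrote this so-called crawler before the New Year, using the same approach as req-ccnu (API + simulated login). What the two have in common is that they replay a curl request and then process the JSON the backend returns. At the time I had not looked at object-oriented programming or properly learned the syntax, so from now on I should study bs4, lxml, and scrapy and write a crawler in the real sense.
Simple source code: https://github.com/wwyqianqian/ShuYu/blob/master/ZhiHuCom2File.py
Dictionary returned by the backend: https://github.com/wwyqianqian/ShuYu/blob/master/ZHC.js
This post only records how I solved two small problems I ran into back then, and it is already out of date: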

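The first problem was that the terminal printed strange characters while I was debugging. The fix was to remove this entry from the headers: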
'accept-encoding': 'gzip, deflate, br',
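requests already sends its own Accept-Encoding header and decompresses gzip and deflate responses automatically. The copied header also advertises br (Brotli), which requests can only decode when a Brotli package is installed; without one the body most likely arrives still compressed and prints as garbage. With the entry removed, the server only uses a compression that requests can undo by itself, so the returned content is readable.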
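
As a minimal sketch of that fix, assuming the headers were pasted from a browser's "copy as curl" output: the helper below (a hypothetical name, not part of requests) returns a copy of the dict without any Accept-Encoding entry, so requests falls back to its own default. The header values are placeholders, not Zhihu's real ones.

```python
def without_accept_encoding(headers):
    """Return a copy of `headers` with any Accept-Encoding entry removed.

    Header names are case-insensitive, so the comparison ignores case.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "accept-encoding"
    }


# Illustrative headers only; the values are placeholders.
copied_headers = {
    "accept": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0",
}
clean_headers = without_accept_encoding(copied_headers)
```

The result can then be passed along as requests.get(url, headers=clean_headers), where url stands for whatever API address is being crawled.
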
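The second problem showed up in real runs: some comments were crawled correctly while others came out garbled. That suggests the crawling logic itself is fine and this is another encoding error. So how can it be fixed?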

40. 为什么不做成猫粮狗粮？
41. èµ°é¤�é¦†æ¸ é�“ï¼Œç‰¹åˆ«æ˜¯ç�«é”…ï¼Œé…¸è�œé±¼
42. é±¼ä¸¸å•Šã€‚
43. æ²¡æœ‰æ¶ˆå¤±ã€‚éƒ½è½¬åˆ°äº†å†œæ�‘

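The cause is that, when paging, Requests does not detect the correct encoding, and Zhihu does not declare one either, so we have to set it by hand. Adding the line response.encoding = 'utf-8' before reading the text is enough.
After that change everything runs correctly.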
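
The garbled lines above look like UTF-8 bytes decoded with a single-byte encoding. The sketch below reproduces that with the standard library only, on the assumption that ISO-8859-1 was the wrong guess (a terminal may render some of those bytes differently). The sample string is one of the comments shown above.

```python
original = "鱼丸啊。"

# What the server sends: UTF-8 bytes (three bytes per character here).
raw = original.encode("utf-8")

# What a wrong guess produces: every byte becomes one Latin-1 character.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 gives the text back.
fixed = raw.decode("utf-8")

# Text that was garbled this way can be repaired by reversing the mistake.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(fixed, repaired)
```

Setting response.encoding = 'utf-8' on a requests response plays the role of the `fixed` line: the bytes stay the same and only the decoding changes.
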

Excerpt from the requests source code: https://github.com/requests/requests/blob/75bdc998e2d430a35d869b2abf1779bd0d34890e/requests/utils.py#L855

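The guessing code takes the first four bytes and picks an encoding by counting the \0 bytes among them and checking where they sit. When none of its patterns match, requests has no suitable encoding and the result is None. The garbled output above would also be consistent with the ISO-8859-1 fallback in get_encoding_from_headers, which applies to text content types that carry no charset.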

import codecs

# Module-level helpers that guess_json_utf relies on (defined in requests/utils.py)
_null = '\x00'.encode('ascii')
_null2 = _null * 2
_null3 = _null * 3

def guess_json_utf(data):
    """
    :rtype: str
    """
    # JSON always starts with two ASCII characters, so detection is as
    # easy as counting the nulls and from their location and count
    # determine the encoding. Also detect a BOM, if present.
    sample = data[:4]
    if sample in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
        return 'utf-32'     # BOM included
    if sample[:3] == codecs.BOM_UTF8:
        return 'utf-8-sig'  # BOM included, MS style (discouraged)
    if sample[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf-16'     # BOM included
    nullcount = sample.count(_null)
    if nullcount == 0:
        return 'utf-8'
    if nullcount == 2:
        if sample[::2] == _null2:   # 1st and 3rd are null
            return 'utf-16-be'
        if sample[1::2] == _null2:  # 2nd and 4th are null
            return 'utf-16-le'
        # Did not detect 2 valid UTF-16 ascii-range characters
    if nullcount == 3:
        if sample[:3] == _null3:
            return 'utf-32-be'
        if sample[1:] == _null3:
            return 'utf-32-le'
        # Did not detect a valid UTF-32 ascii-range character
    return None
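
The null-counting idea can be checked without requests. This is only a small sketch with a made-up JSON sample: it encodes the same text in several encodings and counts the NUL bytes among the first four bytes, which is the signal guess_json_utf looks at.

```python
SAMPLE = '{"a": 1}'  # made-up JSON text; starts with two ASCII characters


def first_four_null_count(encoding):
    """Encode SAMPLE and count NUL bytes among its first four bytes."""
    return SAMPLE.encode(encoding)[:4].count(b"\x00")


counts = {
    encoding: first_four_null_count(encoding)
    for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
}
for encoding, count in counts.items():
    print(encoding, count)
```

UTF-8 gives no NUL bytes, the UTF-16 variants give two, and the UTF-32 variants give three, so a UTF-8 JSON body should come out of this check as 'utf-8'.
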

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.
    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
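
To see when the ISO-8859-1 fallback kicks in, here is a simplified stand-in written for this post. The function name and the parsing are my own assumptions, not the requests implementation; it only mirrors the three outcomes of the function above.

```python
def charset_from_content_type(content_type):
    """Simplified stand-in: pick a charset from a Content-Type value."""
    if not content_type:
        return None

    media_type, _, rest = content_type.partition(";")

    # Collect "key=value" parameters that follow the media type.
    params = {}
    for item in rest.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            params[key.strip().lower()] = value.strip().strip("'\"")

    if "charset" in params:
        return params["charset"]
    if "text" in media_type.lower():
        return "ISO-8859-1"  # the fallback that turns UTF-8 bytes into mojibake
    return None


for value in ("text/html; charset=UTF-8", "text/html", "application/json", None):
    print(value, "->", charset_from_content_type(value))
```

A text/* response without a charset ends up as ISO-8859-1, which is the case where setting the encoding by hand helps.
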


def stream_decode_response_unicode(iterator, r):
    """Stream decodes a iterator."""

    if r.encoding is None:
        for item in iterator:
            yield item
        return

    decoder = codecs.getincrementaldecoder(r.encoding)(errors='replace')
    for chunk in iterator:
        rv = decoder.decode(chunk)
        if rv:
            yield rv
    rv = decoder.decode(b'', final=True)
    if rv:
        yield rv
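
The reason for the incremental decoder is that a multi-byte character can be split across two chunks. A small sketch with the standard library, where decode_chunks is a hypothetical helper that copies the loop above without the response object:

```python
import codecs


def decode_chunks(chunks, encoding):
    """Decode an iterable of byte chunks, keeping partial characters buffered."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is still buffered at the end of the stream.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


raw = "鱼丸".encode("utf-8")   # six bytes, three per character
chunks = [raw[:2], raw[2:]]    # the first character is cut in the middle

streamed = "".join(decode_chunks(chunks, "utf-8"))
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

print(streamed)
print(repr(naive))
```

Decoding each chunk separately damages the character that was cut, while the incremental decoder waits for the remaining bytes.
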

CC BY-NC-SA 4.0.