Baidu Tieba Thread Backup Crawler

May 15, 2019

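## Preface
Tieba recently hid every thread posted before 2017 and has not said when they will be restored. At the suggestion of a certain xx大王, I set out to back them up.
At first I planned to do the backup by hand, but there were simply too many threads. Incidentally, pre-2017 threads can still be found through [this link](http://tieba.baidu.com/mo/q-0--E22B510258AED42EE6CB244CD49B91A7%3AFG%3D1-sz%40480_800%2C-1-3-0--2--wapp_1557832930176_212 "link"), although only their text and images are shown.
**Baidu has since blocked this, and the link no longer works.**
With that much volume, automation was the only option, so I wrote this crawler.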
## Code
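```python
import os
import re
import time

import pyautogui
from selenium import webdriver
from selenium.webdriver.common.by import By


def addToClipBoard(text):
    # Windows only: pipe the text into the built-in `clip` command.
    # Titles containing shell metacharacters (&, |, <, >) will break this.
    command = 'echo ' + text.strip() + '| clip'
    os.system(command)


# Thread links on the list page contain m?kz=<10-digit id>&is_bakan=0&...
pattern = r'm\?kz=[0-9]{10}&is_bakan=0&lp=5010&pinf=1_1_'
strs = 'http://tieba.baidu.com/mo/q-0--E22B510258AED42EE6CB244CD49B91A7%3AFG%3D1-sz%40480_800%2C-1-3-0--2--wapp_1557832930176_212/m?kw=taritari&lp=5011&lm=4&pinf=1_1_0&pn='
driver = webdriver.Chrome()

for page in range(14):
    # The list is paginated through the pn parameter in steps of 20.
    driver.get(strs + str(page * 20))
    time.sleep(1)
    length = len(driver.find_elements(By.TAG_NAME, "a"))

    for i in range(1, length):
        # Re-query after every driver.back(): old element references go stale.
        links = driver.find_elements(By.TAG_NAME, "a")
        if i >= len(links):
            break
        url = links[i].get_attribute('href')
        print(i)
        # Anchors without an href return None, which re.search cannot handle.
        if url and re.search(pattern, url):
            driver.get(url)
            time.sleep(5)
            # The thread title is the first <strong> element on the page.
            h1str = driver.find_element(By.TAG_NAME, 'strong').text
            print(h1str)

            # Drive the browser's "Save as" dialog: open it, paste the
            # title as the file name, then send Enter twice as before.
            pyautogui.hotkey('ctrl', 's')
            time.sleep(1)
            addToClipBoard(h1str)
            pyautogui.hotkey('ctrl', 'v')
            pyautogui.press('enter')
            pyautogui.press('enter')
            print('ok')
            driver.back()
```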
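The two pieces of logic in the script that do not need a browser, building the list-page URLs and picking out thread links, can be pulled into small pure functions. The sketch below is only an illustration: `list_page_urls` and `thread_links` are hypothetical names, not part of the original script. The step of 20 and the link pattern are taken from the code above. Collecting the matching `href` values up front is one possible way to avoid re-querying the `<a>` elements after every `driver.back()`.

```python
import re

# Same pattern as in the script above: a thread link carries a 10-digit kz id.
THREAD_LINK = re.compile(r"m\?kz=[0-9]{10}&is_bakan=0&lp=5010&pinf=1_1_")
# The script advances the pn parameter by 20 per list page.
THREADS_PER_PAGE = 20


def list_page_urls(base_url, pages):
    """Return the list-page URLs for the first `pages` pages.

    `base_url` is expected to end with 'pn=' as in the script above.
    """
    return [base_url + str(n * THREADS_PER_PAGE) for n in range(pages)]


def thread_links(hrefs):
    """Keep only thread links, dropping None values and duplicates.

    The original order is preserved.
    """
    seen = set()
    result = []
    for href in hrefs:
        if href and THREAD_LINK.search(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result
```

With Selenium, the input to `thread_links` would be the `href` attribute of every `<a>` element on the list page, after which each returned URL could be visited in turn without navigating back.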
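## Known issues
1. Threads that load slowly get saved too early, which then leads to duplicate file names.
2. Some threads fail to save; the cause is unknown.
3. High-resolution images inside threads are not saved.
4. There is no pagination within a thread, so only its first page is saved.
5. In replies with a lot of text, the collapsed part is not saved.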
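For the duplicate-name problem in item 1, one possible mitigation is to derive the file name in Python before it is pasted into the save dialog, replacing characters that Windows rejects and adding a numeric suffix when a name has already been used. This is a minimal sketch, not something the original script does: `safe_filename` and its `used` set are hypothetical. It only prevents collisions between names and does not fix the underlying cause, which is saving before the page has finished loading.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, used, fallback="untitled"):
    """Return a file name derived from `title` that is not yet in `used`.

    `used` is a set of lower-cased names handed out so far; it is updated
    in place. The comparison is case-insensitive, like the Windows file system.
    """
    # Replace invalid characters, then trim spaces and dots from both ends.
    name = INVALID_CHARS.sub("_", title).strip(" .") or fallback
    candidate = name
    counter = 2
    while candidate.lower() in used:
        candidate = f"{name} ({counter})"
        counter += 1
    used.add(candidate.lower())
    return candidate
```

In the crawler, a single `used = set()` would be created before the loop, and `safe_filename(h1str, used)` would be pasted into the dialog instead of the raw title.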

## Updates
<details>
  <summary>TBD</summary>
</details>

Last modification：March 22nd, 2020 at 05:03 pm
