[Python] Web Scraping
1. Fetching a web page with requests
Fetch the content of a page:
import requests
url = 'https://example.com'
response = requests.get(url)
html = response.text
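In practice, sending a browser-like User-Agent header and a timeout often makes requests more reliable; a minimal sketch (the header string is an illustrative assumption):
headers = {'User-Agent': 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'}  # Illustrative value
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)  # 200 indicates success
html = response.text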
2. Parsing HTML with BeautifulSoup
Parse the HTML and extract data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify()) # Pretty-print the HTML
3. Traversing the HTML tree
Find elements by tag:
title = soup.title.text # Get the page title
headings = soup.find_all('h1') # List of all <h1> tags
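A few more navigation patterns on the same soup object, sketched under the assumption that the page actually contains these tags:
first_link = soup.find('a')  # First <a> tag, or None if absent
if first_link is not None:
    print(first_link.get('href'))  # Returns None instead of raising if the attribute is missing
    print(first_link.parent.name)  # Name of the enclosing tag
for heading in headings:
    print(heading.get_text(strip=True))  # Text of each <h1>, whitespace trimmed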
4. Using CSS selectors
Select elements with CSS selectors:
articles = soup.select('div.article') # All <div> elements with class 'article'
5. Extracting data from tags
Extract text and attributes from HTML elements:
for article in articles:
    title = article.h2.text  # Text inside the <h2> tag
    link = article.a['href']  # 'href' attribute of the <a> tag
    print(title, link)
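Individual articles may be missing tags, so a slightly more defensive variant (same assumed structure as above) avoids AttributeError and KeyError:
for article in articles:
    h2 = article.find('h2')
    a = article.find('a')
    if h2 is None or a is None:
        continue  # Skip articles that lack the expected tags
    title = h2.get_text(strip=True)
    link = a.get('href', '')  # Empty string if the attribute is absent
    print(title, link)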
6. Handling relative URLs
Convert relative URLs to absolute URLs:
from urllib.parse import urljoin
absolute_urls = [urljoin(url, link) for link in relative_urls]  # relative_urls: hrefs collected earlier
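For example, urljoin resolves relative paths, root-relative paths, and already-absolute URLs against the base:
print(urljoin('https://example.com/blog/', 'post-1'))    # https://example.com/blog/post-1
print(urljoin('https://example.com/blog/', '/about'))    # https://example.com/about
print(urljoin('https://example.com/blog/', 'https://other.example/'))  # https://other.example/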
7. Handling pagination
Scrape content spread across multiple pages:
base_url = "https://example.com/page/"
for page in range(1, 6): # For 5 pages
    page_url = base_url + str(page)
    response = requests.get(page_url)
    # Process each page's content
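A slightly fuller sketch that pauses between requests and stops at a missing page (treating a 404 as the end of pagination is an assumption about the target site):
import time

for page in range(1, 6):
    response = requests.get(base_url + str(page), timeout=10)
    if response.status_code == 404:  # Assumed signal that there are no more pages
        break
    # ... parse response.text here ...
    time.sleep(1)  # Be polite: wait between requests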
8. Handling AJAX requests
Scrape data loaded by AJAX requests:
# Find the URL of the AJAX request (using browser's developer tools) and fetch it
ajax_url = 'https://example.com/ajax_endpoint'
data = requests.get(ajax_url).json() # Assuming the response is JSON
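If the endpoint returns a JSON array of objects, the items can be iterated directly; the 'title' key below is purely an assumption about the payload:
for item in data:
    print(item.get('title'))  # Hypothetical field name, adjust to the real payload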
9. Using regular expressions in web scraping
Extract data with regular expressions:
import re
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', html)
10. Respecting robots.txt
Check robots.txt for permission to scrape:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_scrape = rp.can_fetch('*', url)
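The result can then gate the actual request:
if can_scrape:
    response = requests.get(url, timeout=10)
else:
    print('Disallowed by robots.txt, skipping:', url)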
11. Using sessions and cookies
Maintain a session and handle cookies:
session = requests.Session()
session.get('https://example.com/login')
session.cookies.set('key', 'value') # Set cookies, if needed
response = session.get('https://example.com/protected_page')
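Logging in usually means POSTing the form fields first; the field names and login path below are assumptions that vary by site:
login_data = {'username': 'user', 'password': 'secret'}  # Hypothetical form fields
session.post('https://example.com/login', data=login_data)
response = session.get('https://example.com/protected_page')  # Reuses the cookies set at login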
12. Scraping with browser automation (the selenium library)
Scrape dynamic content rendered by JavaScript:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://example.com')
content = browser.page_source
# Parse and extract data using BeautifulSoup, etc.
browser.quit()
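JavaScript-heavy pages may finish rendering after the initial load, so an explicit wait before reading page_source is often needed; a sketch assuming the content appears in a div with class 'article':
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://example.com')
# Wait up to 10 seconds for the assumed 'div.article' element to appear
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.article'))
)
content = browser.page_source
browser.quit()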
13. Error handling in web scraping
Handle errors and exceptions:
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # Raises an error for bad status codes
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
14. Asynchronous web scraping
Scrape sites asynchronously for faster data retrieval:
import aiohttp
import asyncio

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def main(urls):
    return await asyncio.gather(*(fetch(url) for url in urls))  # Fetch all pages concurrently

urls = ['https://example.com/page1', 'https://example.com/page2']
pages = asyncio.run(main(urls))
15. Storing data (CSV, databases)
Store the scraped data in a CSV file or a database:
import csv

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'URL'])
    for article in articles:  # Assuming each article is a dict with 'title' and 'url' keys
        writer.writerow([article['title'], article['url']])
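The heading also mentions databases; a minimal SQLite sketch (the table and column names are illustrative):
import sqlite3

conn = sqlite3.connect('output.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)')
conn.executemany(
    'INSERT INTO articles (title, url) VALUES (?, ?)',
    [(article['title'], article['url']) for article in articles],
)
conn.commit()
conn.close()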