Downloading a web page to a txt file with Python requests and BeautifulSoup, with the HTML tags removed
import requests
from bs4 import BeautifulSoup

url = "https://www.5a8.com"
filename = "www5a8com.txt"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Detect the encoding from the response body
    response.encoding = response.apparent_encoding
    # Extract plain text with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")
    visible_text = soup.get_text(separator="\n", strip=True)  # separate text fragments with newlines
    # Save the extracted text
    with open(filename, "w", encoding="utf-8") as f:
        f.write(visible_text)
    print(f"Visible text extracted to {filename}")
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")
except Exception as e:
    print(f"Error during processing: {e}")

How to run
D:\code\python\get>python geturl1.py
Visible text extracted to www5a8com.txt
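
Depending on the Beautiful Soup version, get_text() may also return the contents of script and style elements, which are not visible text. A minimal sketch of one way to guard against that is to remove those elements before extracting the text. The helper name extract_visible_text and the sample HTML are assumptions made for illustration; they are not part of the script above.

```python
from bs4 import BeautifulSoup


def extract_visible_text(html: str) -> str:
    """Return the text of an HTML document without script/style contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are never rendered as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Strip each text fragment and join the non-empty ones with newlines.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical sample page used only to demonstrate the helper.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
print(extract_visible_text(sample))
```

In the script, the same helper could be called with response.text in place of the sample string, and its return value written to the file.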