Python from Beginner to Pitfall-Free, Case Study: A Simple Web Crawler

liftword · 5 hours ago · Technical Articles

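Web crawling is one of Python's most practical applications. This article walks through simple crawler examples suited to beginners, using Python's requests and BeautifulSoup libraries to fetch and parse web pages.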

Preparation

First, install the required libraries:

pip install requests beautifulsoup4

Case 1: Get the page title and all links

import requests
from bs4 import BeautifulSoup

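def simple_crawler(url):
    try:
        # Send the HTTP request; the timeout keeps a stalled server from hanging the script
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Get the page title; soup.title is None when the page has no <title> tag
        title = soup.title.string if soup.title else None
        print(f"Page title: {title}")

        # Get all links
        print("\nLinks on the page:")
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('http'):  # Show absolute URLs only
                print(href)

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")

# Usage example
url = input("Enter the URL of the page to crawl: ")
simple_crawler(url)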
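
Case 1 prints only hrefs that already start with http, so relative links such as /about are dropped. The following is a minimal sketch of one way to keep them, assuming the standard-library urllib.parse functions are acceptable here. The helper name resolve_links is hypothetical and not part of the original example.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def resolve_links(base_url, hrefs):
    """Turn raw href values into unique absolute http(s) URLs.

    resolve_links is a hypothetical helper for illustration only.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # Skip <a> tags that have no href attribute
        # Resolve relative paths against the page URL, then drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        # Ignore non-web schemes such as mailto: or javascript:
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:  # Keep the first occurrence, preserve order
            seen.add(absolute)
            result.append(absolute)
    return result
```

Inside simple_crawler, the hrefs could be collected with [link.get('href') for link in soup.find_all('a')] and passed to resolve_links(url, hrefs) before printing.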

Case 2: Extract specific content (for example, news headlines)

import requests
from bs4 import BeautifulSoup

def news_crawler():
    url = "https://news.baidu.com/"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        news_titles = soup.select('.hotnews a')  # Adjust the selector to match the page's actual structure

        print("Baidu hot news:")
        for i, title in enumerate(news_titles, 1):
            print(f"{i}. {title.get_text(strip=True)}")

    except Exception as e:
        print(f"An error occurred: {e}")

news_crawler()
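
Because the selector depends on the live page's structure, it can help to try it against a small HTML snippet before pointing it at a real site. The sketch below assumes made-up sample markup: SAMPLE_HTML and the helper extract_titles are illustrative only and do not reflect the actual structure of news.baidu.com.

```python
from bs4 import BeautifulSoup

# Made-up markup for offline experiments; not the real page structure
SAMPLE_HTML = """
<div class="hotnews">
  <ul>
    <li><a href="https://example.com/a"> First headline </a></li>
    <li><a href="https://example.com/b">Second headline</a></li>
  </ul>
</div>
<div class="other"><a href="https://example.com/c">Unrelated link</a></div>
"""

def extract_titles(html, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(selector)]

if __name__ == "__main__":
    for i, title in enumerate(extract_titles(SAMPLE_HTML, ".hotnews a"), 1):
        print(f"{i}. {title}")
```

A selector that matches nothing returns an empty list rather than raising an error, which is why a crawler can silently print no headlines when the site changes its layout.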

Case 3: A simple image downloader

import requests
from bs4 import BeautifulSoup
import os

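def image_downloader(url, save_dir='images'):
    try:
        # Create the save directory if it does not exist yet
        os.makedirs(save_dir, exist_ok=True)

        # Fetch the page content
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        img_tags = soup.find_all('img')

        print(f"Found {len(img_tags)} <img> tags, starting download...")

        for i, img in enumerate(img_tags):
            img_url = img.get('src')
            # Only absolute URLs are downloaded; relative src values are skipped
            if img_url and img_url.startswith('http'):
                try:
                    img_response = requests.get(img_url, timeout=10)
                    img_response.raise_for_status()  # Do not save error pages as images
                    # Every file gets a .jpg name regardless of its real format
                    with open(os.path.join(save_dir, f"image_{i+1}.jpg"), 'wb') as f:
                        f.write(img_response.content)
                    print(f"Downloaded: image_{i+1}.jpg")
                except (requests.exceptions.RequestException, OSError) as e:
                    print(f"Skipped {img_url}: {e}")
                    continue

        print("Download finished!")

    except Exception as e:
        print(f"An error occurred: {e}")

# Usage example
url = input("Enter the URL of a page that contains images: ")
image_downloader(url)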
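
The downloader names every file image_N.jpg, even when the source is a PNG or GIF. One possible refinement, sketched here under the assumption that the URL path carries a usable extension, is to derive the file name from the URL. The helper image_filename and the ALLOWED_EXTENSIONS set are hypothetical names introduced for this illustration.

```python
import os
from urllib.parse import urlparse

# Assumed whitelist of extensions to trust; adjust as needed
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def image_filename(img_url, index, default_ext=".jpg"):
    """Build a file name such as image_3.png from an image URL.

    Falls back to default_ext when the URL path has no recognised extension.
    """
    path = urlparse(img_url).path          # Drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = default_ext
    return f"image_{index}{ext}"
```

In the download loop, the result could replace the hard-coded name, for example os.path.join(save_dir, image_filename(img_url, i + 1)). The URL extension is only a hint; a stricter version might inspect the response's Content-Type header instead.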

Case 4: A simple Douban Movie Top 250 crawler

import requests
from bs4 import BeautifulSoup
import csv
import time

def douban_top250():
    base_url = "https://movie.douban.com/top250"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    movies = []

    # 10 pages of 25 movies each: start=0, 25, ..., 225
    for start in range(0, 250, 25):
        url = f"{base_url}?start={start}"
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Do not parse error pages
            soup = BeautifulSoup(response.text, 'html.parser')

            for item in soup.select('.item'):
                title = item.select_one('.title').get_text(strip=True)
                rating = item.select_one('.rating_num').get_text(strip=True)
                movies.append((title, rating))

        except Exception as e:
            print(f"Failed to fetch data: {e}")

        time.sleep(1)  # Pause between pages to limit the request rate

    # Save to a CSV file
    with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Rating'])
        writer.writerows(movies)

    print("Data saved to douban_top250.csv")

douban_top250()
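
The paging and pacing logic in this case can also be separated from the parsing, which makes it easier to test without touching the network. The following is a rough sketch of that idea. The helpers page_urls and fetch_all are hypothetical, and fetch stands for any callable the reader supplies, such as a small wrapper around requests.get.

```python
import time

def page_urls(base_url, total=250, page_size=25):
    """Build the list of paged URLs: ?start=0, ?start=25, and so on."""
    return [f"{base_url}?start={start}" for start in range(0, total, page_size)]

def fetch_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    `fetch` is supplied by the caller; `sleep` is injectable so the pacing
    can be checked in tests without actually waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # No pause before the first request
        results.append(fetch(url))
    return results
```

With requests installed, a caller might pass fetch=lambda u: requests.get(u, headers=headers, timeout=10).text and then parse each returned page with BeautifulSoup as in the case above.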

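Crawler guidelines

Respect robots.txt: check the target site's robots.txt file to learn which content may be crawled.
Set request headers: add a User-Agent header so the request resembles a browser visit.
Limit the request rate: avoid putting excessive load on the server.

import time
time.sleep(1)  # Wait 1 second between requests

Handle errors: network requests can fail, so handle exceptions properly.
Stay legal: make sure your crawler's purpose is lawful and does not violate the site's terms of use.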
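
The robots.txt check can be automated with the standard library's urllib.robotparser module. The sketch below parses a made-up robots.txt string so it runs offline. SAMPLE_ROBOTS, build_parser, is_allowed and the user agent name "my-crawler" are all assumptions for illustration; a real crawler would load the file from the target site instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; real sites publish their own robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_text):
    """Create a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS)
    for path in ("/public/page.html", "/private/data.html"):
        url = f"https://example.com{path}"
        print(url, is_allowed(rules, "my-crawler", url))
```

To read a live file rather than a string, RobotFileParser also offers set_url() followed by read(), which downloads and parses the robots.txt at the given address.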

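Summary

These simple crawler examples cover:

Sending HTTP requests with the requests library
Parsing HTML content with BeautifulSoup
Extracting specific elements and data
Simple data storage (CSV files)
Basic error handling

For beginners, these examples help build an understanding of how web crawlers work and how to use the relevant Python libraries. As you progress, you can move on to more advanced crawler frameworks such as Scrapy, as well as techniques for handling JavaScript-rendered pages.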
