This Python script for searching files for a keyword is really handy: copy it now!

liftword · 3 months ago (03-29) · Technical articles · 21

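Use case

Picture this: you have a large number of files, and you need to find a serial number, xxxxxx, and work out which file it is in.

Without a script, you would probably open the files one at a time and press CTRL+F in each to search.

Since we already know Python, why not write a tool ourselves? This article builds one, and all of the source code is in the article, so you can copy, paste, and use it.

Approach

The main idea is to open a folder, collect its files, and check them one by one for the keyword:

[Flowchart: open folder → collect files → search each file for the keyword → report the files that match]

As you can see, the idea is very simple, so the implementation is not hard either.

This article supports only a handful of file types; readers will need to implement any others themselves: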

txt
docx
csv
xlsx
pptx

1. Reading txt

Install the library

pip install chardet

Code

import chardet


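def detect_encoding(file_path, sample_size=64 * 1024):
    # Sample the start of the file rather than only its first line: a first
    # line that is pure ASCII tells chardet nothing about the rest of the file.
    with open(file_path, 'rb') as f:
        raw_data = f.read(sample_size)
    result = chardet.detect(raw_data)
    # chardet reports None for empty or undetectable input; fall back to UTF-8.
    return result['encoding'] or 'utf-8'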


def read_txt(file_path, keywords=''):
    is_in = False
    encoding = detect_encoding(file_path)
    with open(file_path, 'r', encoding=encoding) as f:
        for line in f:
            if line.find(keywords) != -1:
                is_in = True
                break

    return is_in

We use the chardet library to detect the encoding of each txt file, so that files saved in different encodings can all be read.
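
Encoding detection is a guess, and a wrong guess makes `open()` raise `UnicodeDecodeError`. As a hedged alternative (not the article's method), the sketch below skips detection and simply tries a short list of encodings in order. The name `read_text_with_fallback` and the candidate list are assumptions for illustration; adjust the list to the encodings you actually expect.

```python
from pathlib import Path

# Hypothetical candidate list: UTF-8 (with or without BOM) first, then GB18030.
CANDIDATE_ENCODINGS = ('utf-8-sig', 'gb18030')


def read_text_with_fallback(file_path, encodings=CANDIDATE_ENCODINGS):
    """Return the file's text using the first candidate encoding that decodes it."""
    raw = Path(file_path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    # Last resort: decode as UTF-8 and replace the bytes that do not fit.
    return raw.decode('utf-8', errors='replace')
```

The order matters: strict encodings such as UTF-8 should come before permissive ones, because a permissive encoding can "succeed" on bytes it was never meant for.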

2. Reading docx

Install the library

pip install python-docx

Code

from docx import Document


def read_docx(file_path, keywords=''):
    doc = Document(file_path)
    is_in = False

    for para in doc.paragraphs:
        if para.text.find(keywords) != -1:
            is_in = True
            break

    return is_in
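
One thing to be aware of: `doc.paragraphs` only covers the paragraphs of the document body, so text inside tables is not searched by the function above. The following is a minimal standard-library sketch of a different technique, assuming the file is a regular .docx (a ZIP archive whose main text lives in `word/document.xml`). The function name `docx_body_contains` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace of the w:p (paragraph) and w:t (text) elements.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_body_contains(file_path, keyword):
    """Return True if keyword occurs in any paragraph of word/document.xml."""
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    # iter() walks the whole tree, so paragraphs inside table cells are included.
    for paragraph in root.iter(W_NS + 'p'):
        # Join the text pieces so a keyword split across runs is still found.
        text = ''.join(node.text or '' for node in paragraph.iter(W_NS + 't'))
        if keyword in text:
            return True
    return False
```

This sketch reads only the main document part; headers and footers are stored in separate parts of the archive and are not covered.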

3. Reading csv

Code

import csv


def read_csv(file_path, keywords=''):
    is_in = False

    encoding = detect_encoding(file_path)
    with open(file_path, mode='r', encoding=encoding) as f:
        reader = csv.reader(f)

        for row in reader:
            row_text = ''.join([str(v) for v in row])
            if row_text.find(keywords) != -1:
                is_in = True
                break

    return is_in
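
Because `read_csv` joins every cell of a row into one string, a keyword can match across a cell boundary: the row `ab,cd` becomes `abcd`, which contains `bc`. If that is not what you want, a sketch that checks each cell on its own could look like this. The name `csv_contains` and the default `encoding='utf-8'` are assumptions; in the article's setup you would pass in the result of `detect_encoding`.

```python
import csv


def csv_contains(file_path, keyword, encoding='utf-8'):
    """Return True if any single cell of the CSV file contains keyword."""
    # newline='' is what the csv module documentation recommends for reader input.
    with open(file_path, mode='r', encoding=encoding, newline='') as f:
        for row in csv.reader(f):
            # csv.reader yields every cell as a str, so no conversion is needed.
            if any(keyword in cell for cell in row):
                return True
    return False
```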

4. Reading xlsx

Install the library

pip install openpyxl

Code

from openpyxl import load_workbook


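def read_xlsx(file_path, keywords=''):
    wb = load_workbook(file_path)
    sheet_names = wb.sheetnames

    is_in = False
    for sheet_name in sheet_names:
        sheet = wb[sheet_name]
        for row in sheet.iter_rows(values_only=True):
            # Skip empty cells so they are not turned into the text 'None'.
            row_text = ''.join([str(v) for v in row if v is not None])
            if row_text.find(keywords) != -1:
                is_in = True
                break
        if is_in:
            # The break above only leaves the row loop; stop scanning sheets too.
            break

    wb.close()

    return is_in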

5. Reading pptx

Install the library

pip install python-pptx

Code

from pptx import Presentation


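def read_ppt(ppt_file, keywords=''):
    prs = Presentation(ppt_file)
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                text_frame = shape.text_frame
                for paragraph in text_frame.paragraphs:
                    # Join the runs first: a keyword can be split across runs
                    # when part of it is formatted differently.
                    text = ''.join(run.text for run in paragraph.runs)
                    if text.find(keywords) != -1:
                        # Return at once; a break would only leave the innermost loop.
                        return True

    return False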

6. Recursing into folders

To cope with nested folders, we also walk the directory tree recursively.

Code

from pathlib import Path


def list_files_recursive(directory):
    file_paths = []

    for path in Path(directory).rglob('*'):
        if path.is_file():
            file_paths.append(str(path))

    return file_paths
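
`rglob('*')` returns every file, including types the readers above cannot handle. As a hedged variation, the sketch below filters while walking: it keeps only the five suffixes covered in this article, compares them case-insensitively, and skips the `~$...` lock files that Office creates next to an open document. The names `SUPPORTED_SUFFIXES` and `list_supported_files` are assumptions for illustration.

```python
from pathlib import Path

# The file types handled in this article.
SUPPORTED_SUFFIXES = {'.txt', '.docx', '.csv', '.xlsx', '.pptx'}


def list_supported_files(directory, suffixes=SUPPORTED_SUFFIXES):
    """Return the sorted paths of supported files anywhere under directory."""
    file_paths = []
    for path in Path(directory).rglob('*'):
        if not path.is_file():
            continue
        # Lock files such as '~$report.docx' are not real documents.
        if path.name.startswith('~$'):
            continue
        # lower() makes '.TXT' and '.txt' equivalent.
        if path.suffix.lower() in suffixes:
            file_paths.append(str(path))
    return sorted(file_paths)
```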

Complete code

# -*- coding: utf-8 -*-
from pptx import Presentation
import chardet
from docx import Document
import csv
from openpyxl import load_workbook
from pathlib import Path


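def detect_encoding(file_path, sample_size=64 * 1024):
    # Sample the start of the file rather than only its first line: a first
    # line that is pure ASCII tells chardet nothing about the rest of the file.
    with open(file_path, 'rb') as f:
        raw_data = f.read(sample_size)
    result = chardet.detect(raw_data)
    # chardet reports None for empty or undetectable input; fall back to UTF-8.
    return result['encoding'] or 'utf-8'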


def read_txt(file_path, keywords=''):
    is_in = False
    encoding = detect_encoding(file_path)
    with open(file_path, 'r', encoding=encoding) as f:
        for line in f:
            if line.find(keywords) != -1:
                is_in = True
                break

    return is_in


def read_docx(file_path, keywords=''):
    doc = Document(file_path)
    is_in = False

    for para in doc.paragraphs:
        if para.text.find(keywords) != -1:
            is_in = True
            break

    return is_in


def read_csv(file_path, keywords=''):
    is_in = False

    encoding = detect_encoding(file_path)
    with open(file_path, mode='r', encoding=encoding) as f:
        reader = csv.reader(f)

        for row in reader:
            row_text = ''.join([str(v) for v in row])
            if row_text.find(keywords) != -1:
                is_in = True
                break

    return is_in


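def read_xlsx(file_path, keywords=''):
    wb = load_workbook(file_path)
    sheet_names = wb.sheetnames

    is_in = False
    for sheet_name in sheet_names:
        sheet = wb[sheet_name]
        for row in sheet.iter_rows(values_only=True):
            # Skip empty cells so they are not turned into the text 'None'.
            row_text = ''.join([str(v) for v in row if v is not None])
            if row_text.find(keywords) != -1:
                is_in = True
                break
        if is_in:
            # The break above only leaves the row loop; stop scanning sheets too.
            break

    wb.close()

    return is_in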


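def read_ppt(ppt_file, keywords=''):
    prs = Presentation(ppt_file)
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                text_frame = shape.text_frame
                for paragraph in text_frame.paragraphs:
                    # Join the runs first: a keyword can be split across runs
                    # when part of it is formatted differently.
                    text = ''.join(run.text for run in paragraph.runs)
                    if text.find(keywords) != -1:
                        # Return at once; a break would only leave the innermost loop.
                        return True

    return False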


def list_files_recursive(directory):
    file_paths = []

    for path in Path(directory).rglob('*'):
        if path.is_file():
            file_paths.append(str(path))

    return file_paths


if __name__ == '__main__':
    keywords = 'test keyword'  # the keyword to search for
    file_paths = list_files_recursive(r'test_folder')  # the folder to scan
    for file_path in file_paths:
        if file_path.endswith('.txt'):
            is_in = read_txt(file_path, keywords)
        elif file_path.endswith('.docx'):
            is_in = read_docx(file_path, keywords)
        elif file_path.endswith('.csv'):
            is_in = read_csv(file_path, keywords)
        elif file_path.endswith('.xlsx'):
            is_in = read_xlsx(file_path, keywords)
        elif file_path.endswith('.pptx'):
            is_in = read_ppt(file_path, keywords)
        else:
            # Unsupported type: skip it. Otherwise is_in would keep the previous
            # file's result, or be undefined if this is the first file.
            continue

        if is_in:
            print(file_path)
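
The main loop can also be written with a lookup table instead of an if/elif chain, which makes adding a new file type a one-line change. The sketch below is one possible shape, not the article's code: `search_files` and its `readers` argument are hypothetical names, and the `try`/`except` is an added assumption that a damaged or locked file should be reported and skipped rather than stop the scan.

```python
from pathlib import Path


def search_files(file_paths, keyword, readers):
    """Return the paths whose reader reports that the file contains keyword.

    readers maps a lowercase suffix such as '.txt' to a function
    reader(file_path, keyword) that returns True or False.
    """
    matches = []
    for file_path in file_paths:
        reader = readers.get(Path(file_path).suffix.lower())
        if reader is None:
            continue  # unsupported type: nothing to check
        try:
            if reader(file_path, keyword):
                matches.append(file_path)
        except Exception as exc:
            # Report the problem and carry on with the remaining files.
            print(f'skipped {file_path}: {exc}')
    return matches
```

With the functions from this article, the table would be `{'.txt': read_txt, '.docx': read_docx, '.csv': read_csv, '.xlsx': read_xlsx, '.pptx': read_ppt}`.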

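Wrap-up

You can now use code to find out which of your files contain a keyword, and the code works as soon as you copy it.

If you enjoyed this article, give it a like!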
