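When you work with data in Python, you will often run into NaN (Not a Number) values. These missing values can silently skew your calculations and introduce bugs. Let's walk through what NaN is and how to handle it, step by step.

What Is NaN and Why Does It Matter?

NaN is not an ordinary value: it is the floating-point way of saying "this number is missing or undefined". Think of it as a placeholder that marks "no data here". Let's look at the different ways NaN, and the closely related None, can show up in code: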

import math
import numpy as np

# Here are all the different types of NaN you might see
nan_values = [
    float('nan'),     # Python's basic NaN - used in regular Python
    math.nan,         # Math module's NaN - used in mathematical operations
    np.nan,           # NumPy's NaN - used in numerical arrays
    None             # Python's None - often treated like NaN in data analysis
]

# Let's see what type each one is
print([type(x) for x in nan_values])
# Output: [<class 'float'>, <class 'float'>, <class 'float'>, <class 'NoneType'>]

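Notice something interesting? The first three are all floats (float('nan'), math.nan, and np.nan are the same kind of value), but None is its own type, NoneType. This matters because the two have to be detected differently when we clean data.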
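One property is worth seeing in action before cleaning anything: NaN never compares equal to anything, itself included. That is why the examples in this guide rely on math.isnan rather than ==. A minimal sketch (the variable names are only illustrative):

```python
import math

nan = float('nan')

# NaN is never equal to anything, including itself
equal_to_itself = (nan == nan)

# So a filter based on != removes nothing: the comparison is True for every item
data = [1.0, nan, 3.0]
still_dirty = [x for x in data if x != float('nan')]

# math.isnan is the reliable test
clean = [x for x in data if not math.isnan(x)]

print(equal_to_itself)   # False
print(len(still_dirty))  # 3
print(clean)             # [1.0, 3.0]
```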

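Cleaning Simple Lists: The Basic Approach

Let's start with the most common task: removing NaN from a regular Python list. We'll cover two approaches and explain why each part of the code matters: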

import math

# Create a test list with some NaN values mixed in
data = [1, float('nan'), 3, math.nan, 5, None]

# Method 1: Remove just NaN values
cleaned = [x for x in data if not (isinstance(x, float) and math.isnan(x))]
print(cleaned)  # [1, 3, 5, None]

# Method 2: Remove both NaN and None values
cleaned = [x for x in data if not (isinstance(x, float) and math.isnan(x)) and x is not None]
print(cleaned)  # [1, 3, 5]

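Let's break down what this code does:

1. 'isinstance(x, float)' checks whether each value is a float
- We need this because, in a plain Python list, only float values can be NaN
- Without it, math.isnan raises a TypeError on values such as None or strings

2. 'math.isnan(x)' checks whether the float is NaN
- It accepts only real numbers (ints and floats)
- That is why the isinstance check comes first

3. 'x is not None' checks for None values
- We use 'is' rather than '==' because None is a singleton object
- This is the idiomatic way to test for None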
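If the same three checks are needed in several places, they can be wrapped in a small helper. The sketch below is one possible way to do that; the name is_missing is hypothetical and not part of any library.

```python
import math

def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper name)."""
    if value is None:
        return True
    # The isinstance guard keeps math.isnan away from strings and other types
    return isinstance(value, float) and math.isnan(value)

data = [1, float('nan'), 'text', math.nan, 5, None]
cleaned = [x for x in data if not is_missing(x)]
print(cleaned)  # [1, 'text', 5]
```

Keeping the logic in one function means the list comprehension stays short and the rule for "missing" can be changed in a single place.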

Using NumPy: Faster and More Powerful

When you work with larger datasets, NumPy makes life much easier. Here is how to use it:

import numpy as np

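# Create a float NumPy array; dtype=float converts None to NaN
# (without it the array has dtype=object and np.isnan raises a TypeError)
arr = np.array([1, np.nan, 3, np.nan, 5, None], dtype=float)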

# Method 1: Using np.isnan (fastest method)
cleaned = arr[~np.isnan(arr)]
print(cleaned)  # [1. 3. 5.]

# Method 2: Using pandas (more flexible)
import pandas as pd
cleaned = pd.Series(arr).dropna().values
print(cleaned)  # [1. 3. 5.]

# Method 3: Replace NaN with zeros (or any other value)
filled = np.nan_to_num(arr, nan=0)
print(filled)  # [1. 0. 3. 0. 5. 0.]

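Here is why each method is useful:

1. 'np.isnan(arr)':
- Builds a True/False mask with one entry per value
- '~' flips True to False and vice versa
- Very fast on large arrays
- Works on numeric (float) arrays only

2. 'pd.Series(arr).dropna()':
- More flexible: also treats None and NaT as missing
- Works with more complex data structures
- Slightly slower, but richer in features

3. 'np.nan_to_num()':
- Replaces NaN instead of removing it
- Useful when the array must keep the same size
- Lets you specify a custom replacement value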
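To make the masking step less abstract, here is a short sketch that splits the one-liner arr[~np.isnan(arr)] into its three stages. The intermediate names mask and keep are only for illustration.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

mask = np.isnan(arr)   # True where the value is NaN
keep = ~mask           # inverted: True where the value is valid
cleaned = arr[keep]    # boolean indexing keeps only the True positions

print(mask.tolist())     # [False, True, False, True, False]
print(keep.tolist())     # [True, False, True, False, True]
print(cleaned.tolist())  # [1.0, 3.0, 5.0]
```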

A Practical Example: Cleaning Sensor Data

Let's look at a practical example you might meet in real life: cleaning data from a sensor:

import numpy as np

def clean_sensor_data(readings):
    """Clean sensor readings and provide detailed statistics"""
    
    # Convert the readings to a numpy array
    data = np.array(readings, dtype=float)
    
    # Gather information about the data
    total_readings = len(data)
    nan_count = np.isnan(data).sum()
    
    # Remove NaN values
    cleaned_data = data[~np.isnan(data)]
    
    # Calculate useful statistics
    stats = {
        'original_count': total_readings,    # How many readings we started with
        'nan_count': nan_count,             # How many NaN values we found
        'clean_count': len(cleaned_data),   # How many readings remain
        'nan_percentage': (nan_count/total_readings) * 100,  # Percentage of bad data
        'mean': np.mean(cleaned_data),      # Average reading
        'std': np.std(cleaned_data)         # How much readings vary
    }
    
    return cleaned_data, stats

# Test the function with some sample sensor data
sensor_readings = [
    23.5, np.nan, 24.1, 23.8, np.nan,
    24.3, 23.9, np.nan, 24.0, 23.7
]

cleaned, stats = clean_sensor_data(sensor_readings)

print("Cleaned sensor data:", cleaned)
print("\nDetailed Statistics:")
for key, value in stats.items():
    print(f"{key}: {value:.2f}")

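This function does several important things:
1. Converts the data to a NumPy float array for faster processing
2. Counts the missing readings
3. Removes all NaN values
4. Computes useful statistics about the data
5. Returns the cleaned data and the statistics together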
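A side effect of the dtype=float conversion in step 1 is that None readings are turned into NaN, so they are counted and removed along with the real NaN values. The sketch below shows just that conversion on a made-up sample; the readings are hypothetical.

```python
import numpy as np

readings = [23.5, None, 24.1, np.nan]   # hypothetical sample containing a None

data = np.array(readings, dtype=float)  # None becomes nan during conversion
nan_count = int(np.isnan(data).sum())
cleaned = data[~np.isnan(data)]

print(nan_count)         # 2
print(cleaned.tolist())  # [23.5, 24.1]
```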

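Handling Nested Lists (Lists Within Lists)

Sometimes you have lists inside lists, which makes removing NaN values trickier. Here is a robust recursive solution: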

import math
import numpy as np

def clean_nested_list(nested_data):
    """Remove NaN values from lists within lists"""
    
    # If we're looking at a list, process each item in it
    if isinstance(nested_data, list):
        # Process each item, but only keep non-NaN values
        return [
            clean_nested_list(item) 
            for item in nested_data 
            if not (isinstance(item, float) and math.isnan(item))
            and item is not None
        ]
    
    # If it's not a list, return it unchanged
    return nested_data

# Test with a complex nested structure
nested_data = [
    [1, float('nan'), [3, None]],
    [math.nan, 5, [6, float('nan')]],
    [7, [8, None, 9]]
]

cleaned = clean_nested_list(nested_data)
print("Cleaned nested data:")
print(cleaned)
# Output: [[1, [3]], [5, [6]], [7, [8, 9]]]

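This recursive function:
1. Checks whether the current object is a list
2. If it is, filters out NaN and None items and recurses into each remaining item
3. If it is not a list, returns the value unchanged
4. Builds a new cleaned list at every level, leaving the original untouched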
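One thing the recursive cleaner does not do is discard sublists that end up empty after cleaning. Whether that matters depends on your data. If it does, a variant along these lines might work; clean_nested_drop_empty is a hypothetical name, and dropping empty sublists is an assumption about what you want, not something the original function promises.

```python
import math

def clean_nested_drop_empty(nested):
    """Like the recursive cleaner above, but also discards sublists left empty."""
    result = []
    for item in nested:
        if item is None or (isinstance(item, float) and math.isnan(item)):
            continue                      # skip missing scalars
        if isinstance(item, list):
            item = clean_nested_drop_empty(item)
            if not item:
                continue                  # skip sublists that became empty
        result.append(item)
    return result

nested = [[1, [float('nan'), None]], [None], [2, [3]]]
print(clean_nested_drop_empty(nested))  # [[1], [2, [3]]]
```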

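Working With Time Series Data: A Detailed Approach

Time series data (data points recorded over time) needs special care because the order and spacing of the points matter. Here is how to handle it properly: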

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create realistic time series data with some missing values
# 'h' is the hourly alias from pandas 2.2 on; older versions spell it 'H'
dates = pd.date_range(start='2024-01-01', periods=10, freq='h')
temperatures = [23.5, np.nan, 24.1, 23.8, np.nan, 24.3, 23.9, np.nan, 24.0, 23.7]

# Create a DataFrame (a table-like structure)
df = pd.DataFrame({
    'timestamp': dates,
    'temperature': temperatures
})

# Method 1: Simply drop NaN values
cleaned_drop = df.dropna()

# Method 2: Fill NaN with the previous value (forward fill)
# df.ffill() replaces fillna(method='ffill'), which is deprecated since pandas 2.1
cleaned_ffill = df.ffill()

# Method 3: Fill NaN by estimating between points (interpolation)
cleaned_interp = df.copy()
cleaned_interp['temperature'] = df['temperature'].interpolate()

# Print all versions to compare
print("Original Data:")
print(df)
print("\nAfter Dropping NaN:")
print(cleaned_drop)
print("\nAfter Forward Filling:")
print(cleaned_ffill)
print("\nAfter Interpolation:")
print(cleaned_interp)

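Let's look at each method in detail:

1. 'dropna()':
- Simply removes rows that contain NaN
- Works best when you don't need a value for every timestamp
- Keeps the remaining data accurate but loses time points

2. 'ffill()' (the older spelling is 'fillna(method='ffill')'):
- Copies the last valid value forward
- Suits sensors that occasionally miss a reading
- Assumes conditions stay the same until the next reading

3. 'interpolate()':
- Estimates missing values from the surrounding points
- Best for smoothly varying quantities such as temperature
- Usually gives more realistic estimates than forward fill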
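The difference between the three strategies is easiest to see on a tiny series with round numbers. The values below are made up for illustration, and interpolate() is used with its default linear method.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 20.0, np.nan, np.nan, 50.0])

dropped = s.dropna()      # removes the three missing points
forward = s.ffill()       # repeats the last valid value
interp = s.interpolate()  # linear estimate between neighbours

print(dropped.tolist())  # [10.0, 20.0, 50.0]
print(forward.tolist())  # [10.0, 10.0, 20.0, 20.0, 20.0, 50.0]
print(interp.tolist())   # [10.0, 15.0, 20.0, 30.0, 40.0, 50.0]
```

Forward fill holds 20.0 across the two-point gap, while interpolation climbs steadily from 20.0 to 50.0, which is usually closer to how a physical quantity behaves.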

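Performance Comparison: Which Method Is Fastest?

1. NumPy:
- Usually the fastest for plain numeric data
- Best suited to large arrays
- Limited to numeric data

2. List comprehension:
- Fine for small lists
- More flexible about data types
- Gets slower as the list grows

3. pandas:
- Slightly slower than NumPy
- More features and flexibility
- Better suited to complex data structures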
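These rankings are general tendencies, so it is worth measuring on your own data. The following is a rough benchmarking sketch using the standard timeit module; the array size, the 10% NaN rate, and the function names are all arbitrary assumptions, and the timings will differ from machine to machine.

```python
import math
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 floats with roughly 10% NaN
rng = np.random.default_rng(0)
values = rng.random(100_000)
values[rng.random(100_000) < 0.1] = np.nan
as_list = values.tolist()
as_series = pd.Series(values)

def with_comprehension():
    return [x for x in as_list if not math.isnan(x)]

def with_numpy():
    return values[~np.isnan(values)]

def with_pandas():
    return as_series.dropna()

# Time each approach over 20 runs and report the average per run
for func in (with_comprehension, with_numpy, with_pandas):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds / 20 * 1000:.3f} ms per run")
```

Each function starts from data already in its native container, so the comparison does not include the cost of converting between lists, arrays, and Series.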

Common Mistakes and How to Avoid Them

Here are the mistakes people make most often when handling NaN values, and how to fix them:

import math
import numpy as np

# Mistake 1: Not checking data types first
data = [1, float('nan'), '3', np.nan]  # Mixed types

# Wrong way (will crash):
try:
    cleaned = [x for x in data if not math.isnan(x)]
except TypeError as e:
    print(f"Error: {e}")

# Right way:
cleaned = [x for x in data 
          if not (isinstance(x, float) and math.isnan(x))]
print("Correctly cleaned:", cleaned)

# Mistake 2: Losing data structure in pandas
import pandas as pd
df = pd.DataFrame({
    'A': [1, np.nan, 3], 
    'B': [4, 5, 6]
})

# Wrong way (loses DataFrame structure):
wrong_clean = df.values[~np.isnan(df.values)]
print("\nWrong way shape:", wrong_clean.shape)

# Right way (keeps DataFrame structure):
right_clean = df.dropna()
print("Right way shape:", right_clean.shape)

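Why these mistakes matter:

1. Not checking types:
- NaN exists only as a floating-point value
- Calling math.isnan on a string or None raises a TypeError
- Check the type with isinstance() first

2. Losing the data structure:
- Some cleaning methods destroy the organization of your data (boolean indexing on df.values flattens the table into a 1-D array)
- Important relationships between columns can be lost
- Prefer methods that preserve the data structure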
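If pandas is already a dependency, both mistakes have a pandas-based alternative. The sketch below uses pd.isna, which accepts any scalar and also treats None as missing, and shows that dropna can remove either rows or columns while the result stays a DataFrame. The sample data is made up.

```python
import numpy as np
import pandas as pd

data = [1, float('nan'), '3', np.nan, None]

# pd.isna accepts any scalar, so no isinstance guard is needed
cleaned = [x for x in data if not pd.isna(x)]
print(cleaned)  # [1, '3']

# dropna works on rows (default) or columns and keeps the table structure
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)
print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```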

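Best Practices Summary

Here is when to use each method:

1. Use a list comprehension when:
- Working with plain Python lists
- Handling mixed data types
- The list is relatively small (< 10,000 items)

2. Use NumPy when:
- Working with numeric data only
- You need fast processing
- Handling large datasets
- All the data has the same type

3. Use pandas when:
- Working with tabular data
- You need to preserve the data structure
- You need more sophisticated cleaning options
- Handling time series data

Remember:
- Always check your data types first
- Consider whether a NaN value carries meaning
- Choose methods that preserve important relationships in the data
- Test performance with your own data sizes

You now have the tools to handle NaN values in the most common Python data structures. The key is to pick the method that fits your situation and data type.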
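As a way of tying the guidelines together, the selection rule can be written as a small dispatcher that picks a strategy from the input type. This is only a sketch: remove_nan is a hypothetical name, it assumes 1-D numeric arrays for the NumPy branch, and it treats None as missing in lists.

```python
import math

import numpy as np
import pandas as pd

def remove_nan(data):
    """Hypothetical dispatcher: pick a cleaning strategy from the input type."""
    if isinstance(data, (pd.Series, pd.DataFrame)):
        return data.dropna()               # keeps index and columns
    if isinstance(data, np.ndarray):
        return data[~np.isnan(data)]       # 1-D numeric arrays only
    if isinstance(data, list):
        return [x for x in data
                if x is not None
                and not (isinstance(x, float) and math.isnan(x))]
    raise TypeError(f"unsupported type: {type(data).__name__}")

print(remove_nan([1, float('nan'), 'a', None]))           # [1, 'a']
print(remove_nan(np.array([1.0, np.nan, 3.0])).tolist())  # [1.0, 3.0]
print(remove_nan(pd.Series([1.0, np.nan])).tolist())      # [1.0]
```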
