[Python Code Template] Data Preprocessing, Data Analysis, Hypothesis Testing, Machine Learning

liftword, 6 months ago (12-07), Technical Articles

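1 Data Overview

The data for this analysis comes from "yc_data.csv", which contains details of the companies funded by the Y Combinator (YC) startup accelerator:

The file has multiple columns, such as company ID, company name, short description, long description, YC batch, company status, tags, location, and country.
The data covers companies from early YC batches (such as S05 and W06) through recent ones (such as W24 and S24).
Company status takes values such as Active, Acquired, and Inactive.
The data includes many well-known companies, such as Reddit, Twitch, and Scribd.
Each company record includes the number of founders, founder names, team size, website, Crunchbase link, and LinkedIn link.
The tags column describes each company's business area or technology focus, such as AI, fintech, or SaaS.
The location data shows the geographic distribution of companies: most are in the United States, but other countries are represented as well.
The year information gives each company's founding year, ranging from the early years to the most recent ones.
Team size ranges from single digits to several thousand people, reflecting different stages of company growth.
Companies in the most recent batches reflect current startup trends, such as growth in artificial intelligence, open-source software, and developer tools.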

2 Data Preprocessing

First, we read the CSV file with pandas and look at the first few rows:

Python

import pandas as pd
df = pd.read_csv("yc_data.csv")
print(df.head())

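The output shows that the dataset has 17 columns:

batch_idx: batch index
company_id: company ID
company_name: company name
short_description: short description
long_description: long description
batch: YC batch
status: company status
tags: tags
location: location
country: country
year_founded: founding year
num_founders: number of founders
founders_names: founder names
team_size: team size
website: website
cb_url: Crunchbase link
linkedin_url: LinkedIn link

Next, we look at the overall structure of the data and count missing values per column:

Python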

print(df.info())
print(df.isnull().sum())

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4586 entries, 0 to 4585
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   batch_idx          4586 non-null   int64  
 1   company_id         4586 non-null   int64  
 2   company_name       4586 non-null   object 
 3   short_description  4432 non-null   object 
 4   long_description   4266 non-null   object 
 5   batch              4586 non-null   object 
 6   status             4586 non-null   object 
 7   tags               4586 non-null   object 
 8   location           4324 non-null   object 
 9   country            4331 non-null   object 
 10  year_founded       3563 non-null   float64
 11  num_founders       4586 non-null   int64  
 12  founders_names     4586 non-null   object 
 13  team_size          4515 non-null   float64
 14  website            4585 non-null   object 
 15  cb_url             2540 non-null   object 
 16  linkedin_url       2980 non-null   object 
dtypes: float64(2), int64(3), object(12)
memory usage: 609.2+ KB
None
...
website                 1
cb_url               2046
linkedin_url         1606
dtype: int64

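The output shows that the dataset has 4586 rows and that several columns contain missing values, including short_description, long_description, location, country, year_founded, team_size, cb_url, and linkedin_url.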
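To make the missing-value picture easier to read, the counts can be put next to their share of all rows. The following is a minimal sketch, not the article's own code: the helper name `missing_summary` and the small sample frame are made up for illustration, and only standard pandas calls are used.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and fractions for columns that have any."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),    # absolute number of missing cells
        "fraction": df.isnull().mean(),  # share of rows that are missing
    })
    # Keep only columns with gaps, worst first
    return summary[summary["missing"] > 0].sort_values("fraction", ascending=False)


# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({
    "company_name": ["A", "B", "C", "D"],
    "year_founded": [2005, np.nan, 2020, np.nan],
    "cb_url": [None, None, None, "https://example.com"],
})
report = missing_summary(sample)
print(report)
```

On the real data this would be called as `missing_summary(df)`; columns with a large fraction of gaps, such as cb_url, are probably better left out of modeling than imputed.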

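3 Data Cleaning

To make the later analysis easier, we clean and preprocess the data.

Python

import numpy as np
# Handle missing values
df['short_description'] = df['short_description'].fillna('No description')
df['year_founded'] = df['year_founded'].fillna(df['year_founded'].median())
df['team_size'] = df['team_size'].fillna(df['team_size'].median())
# Add a column flagging whether a company is successful (assumes Acquired or Active counts as success)
df['is_successful'] = df['status'].isin(['Acquired', 'Active'])
# Extract the year from the batch column, returning NaN for malformed values
def extract_year(batch):
    try:
        year = batch[-2:]  # last two characters of the string, e.g. '05' in 'S05'
        return int('20' + year)  # convert to a four-digit integer year
    except (TypeError, ValueError):
        return np.nan
df['batch_year'] = df['batch'].apply(extract_year)
# Inspect the unique values of batch_year to check for remaining problems
print(df['batch_year'].unique())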

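4 Exploratory Data Analysis

Now that the data is cleaned, let's look for some interesting patterns:

4.1 Distribution of Company Status

Python

import matplotlib.pyplot as plt
status_counts = df['status'].value_counts()
plt.figure(figsize=(10, 6))
status_counts.plot(kind='bar')
plt.title('Distribution of Company Statuses')
plt.xlabel('Status')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This code produces a bar chart of how many companies fall into each status.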

4.2 Number of Companies per Batch

This step draws a line chart of how the number of companies changes from batch to batch.
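The plotting code for this step is not included in the article. The following is a minimal sketch under stated assumptions: batch labels look like a season letter followed by a two-digit year (as in S05 or W24), winter is taken to precede summer within a year, and the helper names `batch_sort_key`, `companies_per_batch`, and `plot_companies_per_batch` are hypothetical.

```python
import re

import pandas as pd

_BATCH_RE = re.compile(r"^([A-Z])(\d{2})$")


def batch_sort_key(batch) -> tuple:
    """Sort key that orders batches chronologically, e.g. W06 < S06 < W07."""
    match = _BATCH_RE.match(str(batch))
    if match is None:
        return (9999, 9)  # unparsable labels go to the end
    season_order = {"W": 0, "S": 1}  # assumption: winter precedes summer
    season, year = match.group(1), int(match.group(2))
    return (2000 + year, season_order.get(season, 2))


def companies_per_batch(df: pd.DataFrame) -> pd.Series:
    """Count companies in each batch, ordered chronologically."""
    counts = df["batch"].value_counts()
    ordered = sorted(counts.index, key=batch_sort_key)
    return counts.reindex(ordered)


def plot_companies_per_batch(counts: pd.Series) -> None:
    """Draw the line chart (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 6))
    plt.plot(list(counts.index), counts.values, marker="o")
    plt.title("Number of Companies per Batch")
    plt.xlabel("Batch")
    plt.ylabel("Count")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()


# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({"batch": ["S05", "W06", "S05", "W24", "S24", "W24"]})
batch_counts = companies_per_batch(sample)
print(batch_counts)
```

On the real data, `plot_companies_per_batch(companies_per_batch(df))` would draw the chart. Sorting by a parsed key rather than alphabetically keeps W24 before S24 and keeps all winter batches from being grouped after all summer ones.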

4.3 Most Common Tags

This chart shows the 20 most common tags.
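The code for this chart is not included in the article, and the storage format of the tags column is not stated. The sketch below assumes each cell is either a stringified Python list (such as "['AI', 'SaaS']") or a comma-separated string, and handles both. The helper names `parse_tags`, `top_tags`, and `plot_top_tags` are hypothetical.

```python
import ast
from collections import Counter

import pandas as pd


def parse_tags(raw) -> list:
    """Turn one cell of the tags column into a list of tag strings."""
    if not isinstance(raw, str):
        return []  # missing or non-string cell
    text = raw.strip()
    if text.startswith("["):
        try:
            parsed = ast.literal_eval(text)  # assumption: stringified list
            return [str(t).strip() for t in parsed if str(t).strip()]
        except (ValueError, SyntaxError):
            pass  # fall back to comma splitting below
    return [t.strip() for t in text.strip("[]").split(",") if t.strip()]


def top_tags(df: pd.DataFrame, n: int = 20) -> pd.Series:
    """Count tags over all companies and return the n most common."""
    counter = Counter()
    for raw in df["tags"]:
        counter.update(parse_tags(raw))
    return pd.Series(dict(counter.most_common(n)), dtype="int64")


def plot_top_tags(top: pd.Series) -> None:
    """Draw a horizontal bar chart (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    top.sort_values().plot(kind="barh", figsize=(10, 8))
    plt.title("Most Common Tags")
    plt.xlabel("Count")
    plt.tight_layout()
    plt.show()


# Made-up sample rows covering both assumed formats
sample = pd.DataFrame({"tags": ["['AI', 'SaaS']", "['AI', 'fintech']", "AI, SaaS", "[]"]})
top = top_tags(sample, n=2)
print(top)
```

On the real data, `plot_top_tags(top_tags(df, n=20))` would produce the chart described above, provided the tags column matches one of the two assumed formats.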

4.4 Success Rate over Time

The chart shows that the success rate of YC companies trends upward overall and has stayed at a high level in recent years.
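The code behind this chart is not included in the article. One plausible way to compute it is to group by the `batch_year` column created during cleaning and average the `is_successful` flag. The following is a sketch under that assumption; the helper names are hypothetical. Because Active counts as successful, recent batches may score high partly because their companies have had little time to become inactive.

```python
import pandas as pd


def success_rate_by_year(df: pd.DataFrame) -> pd.Series:
    """Share of companies flagged successful in each batch year."""
    return df.groupby("batch_year")["is_successful"].mean().sort_index()


def plot_success_rate(rates: pd.Series) -> None:
    """Draw the line chart (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 6))
    plt.plot(rates.index, rates.values, marker="o")
    plt.title("Success Rate by Batch Year")
    plt.xlabel("Batch year")
    plt.ylabel("Share of successful companies")
    plt.ylim(0, 1)
    plt.tight_layout()
    plt.show()


# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({
    "batch_year": [2005, 2005, 2006, 2024, 2024, 2024],
    "is_successful": [True, False, False, True, True, True],
})
yearly_rates = success_rate_by_year(sample)
print(yearly_rates)
```

On the real data, `plot_success_rate(success_rate_by_year(df))` would draw the trend.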

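5 Hypothesis Testing

Next, we use independent two-sample t-tests to check whether several variables differ between successful and unsuccessful companies. We first define a function that runs the test for a given variable, then apply it to each variable of interest:

Python

from scipy import stats
def perform_t_test(variable):
    # Split the variable's values by success flag
    successful_values = df[df['is_successful']][variable]
    unsuccessful_values = df[~df['is_successful']][variable]
    
    t_stat, p_value = stats.ttest_ind(successful_values, unsuccessful_values)
    
    print(f"Variable: {variable}")
    print(f"T-statistic: {t_stat}")
    print(f"P-value: {p_value}")
    print("---")
# Run the test for each variable shown in the output
for variable in ['year_founded', 'num_founders', 'team_size', 'batch_year']:
    perform_t_test(variable)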

Output

Variable: year_founded
T-statistic: 4.2584208077988706
P-value: 2.0999812247726262e-05
---
Variable: num_founders
T-statistic: 3.5994256038457904
P-value: 0.0003222920811796079
---
Variable: team_size
T-statistic: 0.2147248161445528
P-value: 0.8299914248081315
---
Variable: batch_year
T-statistic: 27.695299446266723
P-value: 3.067399233387115e-156
---

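The output shows:

year_founded, num_founders, and batch_year are significantly associated with success: their means differ between successful and unsuccessful companies (p < 0.05).
team_size shows no significant difference between the two groups (p > 0.05).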
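The test above uses the default settings of `stats.ttest_ind`, which assume equal variances in both groups. As a robustness check, a variant that does not make this assumption (Welch's t-test, selected with `equal_var=False`) can be run alongside the group means, which also shows the direction of each difference. This is a minimal sketch rather than the article's method; the helper name `compare_groups` and the sample data are made up.

```python
import pandas as pd
from scipy import stats


def compare_groups(df: pd.DataFrame, variable: str) -> dict:
    """Welch's t-test for one variable, plus the mean of each group."""
    successful = df.loc[df["is_successful"], variable].dropna()
    unsuccessful = df.loc[~df["is_successful"], variable].dropna()
    # equal_var=False selects Welch's test (no equal-variance assumption)
    result = stats.ttest_ind(successful, unsuccessful, equal_var=False)
    return {
        "variable": variable,
        "mean_successful": successful.mean(),
        "mean_unsuccessful": unsuccessful.mean(),
        "t_stat": result.statistic,
        "p_value": result.pvalue,
    }


# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({
    "is_successful": [True] * 5 + [False] * 5,
    "num_founders": [2, 3, 3, 2, 3, 1, 1, 2, 1, 2],
})
summary = compare_groups(sample, "num_founders")
print(summary)
```

A positive t-statistic here means the successful group has the larger mean. On the real data the function would be called as `compare_groups(df, 'num_founders')`, and likewise for the other variables.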

6 Predictive Model

Finally, we try a random forest model to predict whether a company is successful:

Python

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
features = ['year_founded', 'num_founders', 'team_size']
X = df[features]
y = df['is_successful']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
feature_importance = pd.DataFrame({'feature': features, 'importance': rf_model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)

Output

Accuracy: 0.8540305010893247
Classification Report:
              precision    recall  f1-score   support
       False       0.61      0.51      0.56       164
        True       0.90      0.93      0.91       754
    accuracy                           0.85       918
   macro avg       0.75      0.72      0.73       918
weighted avg       0.85      0.85      0.85       918
Feature Importance:
        feature  importance
2     team_size    0.562311
0  year_founded    0.368118
1  num_founders    0.069571

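The output shows that the random forest reaches 85.4% accuracy on the test set. This is only modestly above the roughly 82% that always predicting "successful" would achieve, since 754 of the 918 test companies are successful, and recall on unsuccessful companies is just 0.51. By feature importance, team size contributes most to the predictions, followed by founding year and then number of founders.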
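Because successful companies make up most of the data (754 of the 918 test samples in the report above), it can help to compare the model's accuracy against a trivial baseline that always predicts the majority class. The following is a minimal sketch using scikit-learn's `DummyClassifier`; the helper name `majority_baseline` and the demo arrays are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score


def majority_baseline(X_train, y_train, X_test, y_test) -> dict:
    """Score a classifier that always predicts the most frequent training class."""
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(X_train, y_train)
    y_pred = dummy.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # Mean of per-class recall; 0.5 for any constant binary prediction
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    }


# Made-up demo data: 8 successful and 2 unsuccessful companies
X_demo = np.zeros((10, 3))
y_demo = np.array([True] * 8 + [False] * 2)
baseline = majority_baseline(X_demo, y_demo, X_demo, y_demo)
print(baseline)
```

With the article's variables this would be called as `majority_baseline(X_train, y_train, X_test, y_test)`. If the random forest's accuracy is close to the baseline's, balanced accuracy or the per-class recall in the classification report gives a more informative picture than accuracy alone.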

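7 Summary

From this analysis of YC company data, we draw the following main conclusions:

The success rate of YC companies trends upward overall and has stayed at a high level in recent years.
Founding year, number of founders, and batch year are significantly associated with success, while team size is not.
Successful companies have significantly more founders on average than unsuccessful ones.
A random forest model predicts success with 85.4% test accuracy; team size, founding year, and number of founders are, in that order, the most important predictors.

These findings can serve as a useful reference for founders and investors.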
