Data Sampling in Python: Methods and Techniques for Efficient Processing and Analysis


2025-05-05

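In Python, data sampling can be done in several ways; commonly used libraries include pandas, numpy, and scikit-learn. The most common sampling methods are described below with example code: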

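1. Simple random sampling

Draws a specified number or fraction of rows at random from the data set, with or without replacement (controlled by the replace parameter).

Using pandas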

import pandas as pd

# Create example data
data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})

# Randomly draw 10 rows (without replacement, which is the default)
sample1 = data.sample(n=10, random_state=42)

# Draw 10% of the rows
sample2 = data.sample(frac=0.1, random_state=42)

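Using numpy

import numpy as np

# Randomly choose 10 row labels from the index (without replacement)
# Note: this call is unseeded, so the result differs between runs
indices = np.random.choice(data.index, size=10, replace=False)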
sample = data.loc[indices]
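
The examples above all sample without replacement. The following minimal sketch shows what changes when replace=True is set. The toy DataFrame and the variable names (bootstrap, error_message) are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Hypothetical toy data with the same shape as the example above
data = pd.DataFrame({'A': range(100), 'B': range(100, 200)})

# With replacement, the same row may be drawn more than once,
# so the sample may be larger than the original data set
bootstrap = data.sample(n=150, replace=True, random_state=42)

# Without replacement, n cannot exceed the number of rows
error_message = None
try:
    data.sample(n=150, replace=False)
except ValueError as exc:
    error_message = str(exc)

print(len(bootstrap))             # 150
print(bootstrap.index.is_unique)  # False: 150 draws from 100 rows must repeat
print(error_message is not None)  # True
```

Sampling with replacement is the basis of bootstrap resampling; sampling without replacement is the usual choice when every selected row should be distinct.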

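2. Stratified sampling

Preserves the proportion of each class in the sample; commonly used for imbalanced data sets in classification tasks.

Using scikit-learn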

from sklearn.model_selection import train_test_split

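# Example data: the DataFrame above has no 'label' column, so add a
# hypothetical imbalanced one here (80 rows of class 0, 20 rows of class 1)
data['label'] = [0] * 80 + [1] * 20
X = data.drop(columns='label')
y = data['label']

# Stratified split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,  # preserve class proportions in both splits
    random_state=42
)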

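Manual stratified sampling

# Draw a fixed number of rows from each class (assumes a 'label' column).
# A fixed count balances the classes rather than preserving their proportions.
# GroupBy.sample requires pandas 1.1 or later.
samples_per_class = 5
stratified_sample = data.groupby('label').sample(n=samples_per_class, random_state=42)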
stratified_sample = stratified_sample.reset_index(drop=True)
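
A fixed count per class gives a balanced sample. To keep the original class ratio instead, the same fraction can be drawn from every class. The sketch below assumes pandas 1.1 or later (for GroupBy.sample) and uses a hypothetical 80/20 data set; the names df, proportional, and counts are illustrative.

```python
import pandas as pd

# Hypothetical imbalanced data: 80 rows of class 'a', 20 rows of class 'b'
df = pd.DataFrame({
    'feature': range(100),
    'label': ['a'] * 80 + ['b'] * 20,
})

# Draw 25% of each class, so the sample keeps the 80/20 class ratio
proportional = df.groupby('label').sample(frac=0.25, random_state=42)

counts = proportional['label'].value_counts()
print(counts['a'], counts['b'])  # 20 5
```

Use n= when the goal is a balanced sample and frac= when the sample should mirror the class distribution of the full data set.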

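3. Weighted sampling

Draws rows with probability proportional to a per-row weight.

Using pandas

# Assumes data has a numeric, non-negative column named 'weights'.
# pandas rescales the weights to sum to 1, so manual normalization is not required.
sample = data.sample(n=10, weights='weights', random_state=42)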

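Using numpy

# Choose row labels at random according to the weights.
# np.random.choice samples with replacement by default;
# pass replace=False to match the pandas example above.
indices = np.random.choice(
    data.index,
    size=10,
    p=data['weights'] / data['weights'].sum()  # p must sum to 1, so normalize
)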
sample = data.loc[indices]
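
One way to check that weighted sampling behaves as intended is to draw many times and compare the observed frequencies with the normalized weights. The sketch below does this with a seeded numpy Generator; the three-row table and its weights are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three items with weights 1, 2 and 7
items = pd.DataFrame({'name': ['x', 'y', 'z'], 'weights': [1, 2, 7]})

# Seeded generator, so the draws are reproducible
rng = np.random.default_rng(42)

# Probabilities passed to choice() must sum to 1
p = (items['weights'] / items['weights'].sum()).to_numpy()

# Draw 10,000 row labels with replacement
draws = rng.choice(items.index.to_numpy(), size=10_000, replace=True, p=p)

# Observed share of each row label, ordered by label (0, 1, 2)
observed = pd.Series(draws).value_counts(normalize=True).sort_index()
print(observed.round(2))
```

The printed shares should land close to 0.1, 0.2, and 0.7, with small deviations due to random variation.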

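4. Time series sampling

Resamples time series data by time window or frequency.

Using pandas resampling

# Example data indexed by daily timestamps
time_series = pd.DataFrame({
    'value': range(100)
}, index=pd.date_range('2023-01-01', periods=100, freq='D'))

# Downsample to weekly means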
weekly_sample = time_series.resample('W').mean()
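
Weekly averaging is only one way to change the sampling rate of a series. The sketch below rebuilds the same daily data and shows three variations: several aggregates per week, picking every 7th observation without aggregating, and upsampling to a 12-hour grid with forward fill. The variable names are illustrative.

```python
import pandas as pd

# Same daily series as in the example above
ts = pd.DataFrame(
    {'value': range(100)},
    index=pd.date_range('2023-01-01', periods=100, freq='D'),
)

# Downsample: compute several aggregates per weekly bin
weekly_stats = ts['value'].resample('W').agg(['mean', 'min', 'max'])

# Systematic sampling: keep every 7th row, no aggregation
every_7th = ts.iloc[::7]

# Upsample to a 12-hour grid; new timestamps take the last known value
half_daily = ts.resample('12h').ffill()

print(len(every_7th))   # 15
print(len(half_daily))  # 199
```

Aggregating (mean, min, max) summarizes each window, while positional slicing keeps actual observations; which one fits depends on whether values inside a window should influence the result.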