Scikit-learn: Data preprocessing (scaling, encoding, splitting data)

Scikit-learn

Goal: Provide powerful tools for machine learning and data preprocessing.

Preprocessing support: scaling, label encoding, missing-value handling, train/test splitting, and more.
Machine learning algorithms: regression, classification, clustering, dimensionality reduction, and more.
A simple, consistent interface that integrates well with NumPy and Pandas.
One of the most widely used machine learning libraries in Python.
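Missing-value handling is listed above but does not appear in the basic example, so here is a minimal sketch of it. It assumes mean imputation with `SimpleImputer` is acceptable for the data; the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample data with one missing value in each column
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40],
    'income': [3000, 4000, np.nan, 6000],
})

# Replace each missing value with the mean of the observed values in its column
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled)
```

Other strategies such as `'median'` or `'most_frequent'` may suit skewed or categorical columns better; the right choice depends on the data.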

Basic example:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'age': [25, 30, 35, 40],
    'income': [3000, 4000, 5000, 6000],
    'label': ['yes', 'no', 'yes', 'no']
})

# Encode the string labels as integers ('no' -> 0, 'yes' -> 1)
le = LabelEncoder()
df['label_encoded'] = le.fit_transform(df['label'])

# Split into train/test sets first, so the test rows do not influence the scaler
X = df[['age', 'income']]
y = df['label_encoded']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # fixed seed for a reproducible split
)

# Standardize the features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
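
Once the data is split, the same preprocessing steps can feed a model. The sketch below is one possible way to do that, not the only one: it bundles `StandardScaler` with a `LogisticRegression` classifier using `make_pipeline`, so that the scaler is fitted on the training rows only. The choice of classifier and the eight-row dataset are assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data (not from a real dataset); label is already encoded as 0/1
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'income': [3000, 4000, 5000, 6000, 3500, 4500, 5500, 6500],
    'label': [1, 0, 1, 0, 1, 0, 1, 0],
})
X = df[['age', 'income']]
y = df['label']

# Hold out 25% of the rows (2 of 8) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline scales the features, then trains the classifier on the scaled values
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Predict on the held-out rows and compute the accuracy
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(predictions, accuracy)
```

With so few rows the accuracy is not meaningful; the point is only the workflow of splitting, fitting on the training set, and evaluating on the test set.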
