Linear Regression Practice (Scikit-learn & One-hot Encoding)
1. Scikit-learn
: An open-source library implementing traditional machine learning algorithms in Python
Advantages of Scikit-learn
- Integrates well with other Python libraries
- A unified interface across the whole library makes it very easy to apply many different algorithms
1. Load a dataset
sklearn.datasets.load_[DATA]()
df = pd.read_excel()
data = np.array(df)
2. Split the data into train / test sets
sklearn.model_selection.train_test_split(X, Y, test_size)
3. Create a model object (model instance)
model = sklearn.linear_model.LinearRegression()
model = sklearn.linear_model.LogisticRegression()
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors)
model = sklearn.cluster.KMeans(n_clusters)
model = sklearn.decomposition.PCA(n_components)
model = sklearn.svm.SVC(kernel, C, gamma)
4. Fit the model (model fitting)
model.fit(train_X, train_Y)
5. Predict on new data (predict on test data)
model.predict(test_X)
model.predict_proba(test_X)
sklearn.metrics.mean_squared_error(test_Y, predict_Y) # sklearn metrics take (y_true, y_pred)
sklearn.metrics.accuracy_score(test_Y, predict_Y)
sklearn.metrics.precision_score
sklearn.metrics.recall_score
sklearn.metrics.r2_score
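Steps 1-5 above chain together into one short script. As a minimal sketch (not from the original notes), here they are applied to the built-in iris dataset with a KNN classifier:

```python
from sklearn import datasets, model_selection, neighbors, metrics

# 1. Load a built-in dataset
iris = datasets.load_iris()

# 2. Split into train / test sets
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# 3. Create an empty model object
model = neighbors.KNeighborsClassifier(n_neighbors=5)

# 4. Fit the model on the training data
model.fit(x_train, y_train)

# 5. Predict on the test data and evaluate
pred_y = model.predict(x_test)
print(metrics.accuracy_score(y_test, pred_y))
```

Swapping the model class in step 3 while keeping the rest unchanged is exactly the "unified interface" advantage mentioned above.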
2. Least Squares Estimation (Scikit-learn LinearRegression, predicting house prices)
1. Loading the data (Boston, USA housing prices)
1) Features
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df_data = pd.read_excel('boston_house_data.xlsx', index_col=0) # read the Excel file
df_data.head()
df_data[8].value_counts(sort=False).index # .keys()
from collections import Counter
Counter(df_data[3])
# Feature Normalization
- Numerical columns (variables) → min-max scaling or standardization
- Categorical columns → one-hot encoding (not applied to decision trees)
- For linear models, one-hot encoding of categorical columns is practically mandatory
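As a quick illustration of one-hot encoding on a categorical column (a hypothetical toy DataFrame, not the Boston data), pandas' `get_dummies` does the expansion:

```python
import pandas as pd

# A toy categorical column (hypothetical data)
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})

# get_dummies expands one categorical column into one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=['color'])
print(encoded.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Each row now has exactly one of the three indicator columns set, which is what a linear model needs to treat the categories independently.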
2) Target
df_target = pd.read_excel('boston_house_target.xlsx', index_col=0)
df_target.head()
3) Examining the features and target together
df_main = pd.concat([df_data, df_target], axis=1) # concatenate
df_main.head()
# Rename all the columns at once
df_main.columns = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df_main.head()
df_main.describe() # description
# Convert the DataFrame into an np.array (matrix)
boston_data = np.array(df_data)
boston_target = np.array(df_target)
# Check the array dimensions (506 samples, 13 data features)
boston_data.shape # 'shape' of the array
# Check the array dimensions (506 label values)
boston_target.shape
2. Selecting a feature
# Use only one feature
# Features must always be extracted as a 2-D matrix before being passed to the model
boston_X = boston_data[:, 12:13] # LSTAT: % lower status of the population
boston_X
boston_Y = boston_target
3. Splitting into training & test sets
from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston_X, boston_Y, test_size=0.3, random_state=0)
# random_state (random_seed or seed) : make the result reproducible
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
random_state=42
In The Hitchhiker's Guide to the Galaxy, 42 is the computer's answer to the ultimate question of life.
4. Creating an empty model object
from sklearn import linear_model
model = linear_model.LinearRegression() # linear regression
5. Fitting the model object (on training data)
# Train the model using the training sets
model.fit(x_train, y_train) # 'fit' the model to the data
print('Coefficients: ', model.coef_) # coefficient: a in y = ax + b -> a ≈ -0.968
print('Intercepts: ', model.intercept_) # b ≈ 34.78
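The printed `coef_` and `intercept_` are the a and b of y = ax + b. As a self-contained sketch on synthetic data (the Boston Excel file is not reproduced here), for a single feature `model.predict(x)` is exactly `coef_ * x + intercept_`:

```python
import numpy as np
from sklearn import linear_model

# Synthetic data for illustration: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3 * x[:, 0] + 2 + rng.normal(0, 0.1, size=100)

model = linear_model.LinearRegression()
model.fit(x, y)

# For one feature, predict(x) reproduces the line coef_ * x + intercept_
manual = model.coef_[0] * x[:, 0] + model.intercept_
assert np.allclose(model.predict(x), manual)
print(model.coef_[0], model.intercept_)  # should be close to 3 and 2
```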
6. Testing the trained model (on test data)
model.predict(x_train) # predict
# Mean squared error of the model on the 354 training samples
print('MSE(Training data) : ', np.mean((model.predict(x_train) - y_train) ** 2))
# Use this!
from sklearn.metrics import mean_squared_error
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
# Mean squared error of the model on the 152 test samples
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))
# RMSE (root mean squared error): the square root of the MSE
np.sqrt( mean_squared_error(model.predict(x_test), y_test) )
MSE(Training data) : 37.933978172880295
MSE(Test data) : 39.81715050474418
6.310083240714355
The training and test errors are similar, so the model is not overfitting.
7. Visualizing the model
plt.figure(figsize=(10, 10))
plt.scatter(x_test, y_test, color="black") # Test data
plt.scatter(x_train, y_train, color="red", s=1) # Train data
plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
plt.show()
Full code summary
from sklearn import model_selection, linear_model
from sklearn.metrics import mean_squared_error
# 1. Prepare the data (array!)
boston_data = np.array(df_data)
boston_target = np.array(df_target)
# 2. Feature selection
boston_X = boston_data[:, 12:13]
boston_Y = boston_target
# 3. Train/Test split
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston_X, boston_Y, test_size=0.3, random_state=0)
# 4. Create model object
model = linear_model.LinearRegression()
# 5. Train the model
model.fit(x_train, y_train)
# 6. Test the model
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))
# 7. Visualize the model
plt.figure(figsize=(10, 10))
plt.scatter(x_test, y_test, color="black") # Test data
plt.scatter(x_train, y_train, color="red", s=1) # Train data
plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
plt.show()
3. Least Squares Estimation (Scikit-learn LinearRegression, predicting diabetes)
1. Loading the data (diabetes progression)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, model_selection, linear_model
from sklearn.metrics import mean_squared_error
diabetes = datasets.load_diabetes()
diabetes.keys()
diabetes['target'].shape
# dir(diabetes)
print(diabetes['DESCR'])
df = pd.DataFrame(diabetes['data']) # Array to dataframe
df.head()
# The dataset is already stored as arrays, so it can be used directly
print(diabetes.data.shape) # shape
print(diabetes.target.shape)
diabetes.data[0, :]
2. Selecting a feature
# Use one or many feature (visualization is only for one feature)
diabetes_X = diabetes.data[:, 2:3] # Try column index 2 first (Body Mass Index)
diabetes_X
# diabetes_X = diabetes.data[:, 2]
# diabetes_X.reshape(-1, 1)
diabetes_Y = diabetes.target
diabetes_Y
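The commented-out lines above hint at why the `2:3` slice is used instead of plain `2`: scikit-learn expects a 2-D feature matrix, and `reshape(-1, 1)` turns a 1-D vector into one. A minimal illustration:

```python
import numpy as np

a = np.arange(6)           # shape (6,): a 1-D vector, as data[:, 2] would give
col = a.reshape(-1, 1)     # shape (6, 1): a column matrix, as data[:, 2:3] gives
print(a.shape, col.shape)  # (6,) (6, 1)
```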
3. Splitting into training & test sets
from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(diabetes_X, diabetes_Y, test_size=0.3, random_state=0)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
4. Creating an empty model object
# Try simple linear regression first
model = linear_model.LinearRegression()
dir(linear_model)
5. Fitting the model object (on training data)
# Train the model using the training sets
model.fit(x_train, y_train)
6. Testing the trained model (on test data)
model.predict(x_test)
# Mean squared error of the model on the training data
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
# Mean squared error of the model on the test data
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))
# Square root of error
np.sqrt( mean_squared_error(model.predict(x_test), y_test) )
MSE(Training data) : 3892.7208150824304
MSE(Test data) : 3921.3720274248517
62.62085936351282
The training and test errors are similar, so the model is not overfitting.
7. Visualizing the model
plt.figure(figsize=(10, 10))
plt.scatter(x_test, y_test, color="black") # Test data
plt.scatter(x_train, y_train, color="red", s=1) # Train data
plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
plt.show()
Full code summary
* Using all columns
from sklearn import datasets, model_selection, linear_model
from sklearn.metrics import mean_squared_error
# 1. Prepare the data (array!)
diabetes = datasets.load_diabetes()
# 2. Feature selection
diabetes_X = diabetes.data #[:, 2:3]
diabetes_Y = diabetes.target
# 3. Train/Test split
x_train, x_test, y_train, y_test = model_selection.train_test_split(diabetes_X, diabetes_Y, test_size=0.3, random_state=0)
# 4. Create model object
model = linear_model.LinearRegression()
# 5. Train the model
model.fit(x_train, y_train)
# 6. Test the model
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))
# 7. Visualize the model
# plt.figure(figsize=(10, 10))
# plt.scatter(x_test, y_test, color="black") # Test data
# plt.scatter(x_train, y_train, color="red", s=1) # Train data
# plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
# plt.show()
MSE(Training data) : 2804.122899724064
MSE(Test data) : 3097.146138387797
No severe overfitting; since the test-data MSE decreased, we can say using all features improved the model somewhat.
from sklearn import ensemble
from sklearn import datasets, model_selection, linear_model
from sklearn.metrics import mean_squared_error
# 1. Prepare the data (array!)
diabetes = datasets.load_diabetes()
# 2. Feature selection
diabetes_X = diabetes.data #[:, 2:3]
diabetes_Y = diabetes.target
# 3. Train/Test split
x_train, x_test, y_train, y_test = model_selection.train_test_split(diabetes_X, diabetes_Y, test_size=0.3, random_state=0)
# 4. Create model object
model = ensemble.GradientBoostingRegressor()
# 5. Train the model
model.fit(x_train, y_train)
# 6. Test the model
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))
# 7. Visualize the model
# plt.figure(figsize=(10, 10))
# plt.scatter(x_test, y_test, color="black") # Test data
# plt.scatter(x_train, y_train, color="red", s=1) # Train data
# plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
# plt.show()
MSE(Training data) : 778.3345855072206
MSE(Test data) : 3688.673032338571
The code above overfits: the training MSE is far lower than the test MSE.
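One common way to tame this overfitting (not covered in the original notes) is to regularize the booster, for example with shallower trees and fewer of them; a sketch under those assumptions:

```python
from sklearn import datasets, ensemble, model_selection
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    diabetes.data, diabetes.target, test_size=0.3, random_state=0)

# Shallower, fewer trees act as regularization for the boosted ensemble
model = ensemble.GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)
model.fit(x_train, y_train)

mse_train = mean_squared_error(y_train, model.predict(x_train))
mse_test = mean_squared_error(y_test, model.predict(x_test))
print('MSE(Training data) : ', mse_train)
print('MSE(Test data) : ', mse_test)
```

The gap between training and test MSE should narrow compared with the default settings, at the cost of a slightly higher training error.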