Silver bullet

Linear Regression Practice (Scikit-learn & One-hot Encoding)

밀크쌀과자 2024. 7. 8. 19:43

1. Scikit-learn

: An open-source library that implements traditional machine learning algorithms in Python

 

Advantages of Scikit-learn

  • Interoperates well with other Python libraries
  • A unified interface across the whole library makes it very simple to apply many different algorithms
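As a small illustration of that unified interface, the sketch below (assuming the bundled iris dataset) trains two very different classifiers with the exact same fit / score calls:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    model.fit(x_train, y_train)                 # identical call for both models
    scores[type(model).__name__] = model.score(x_test, y_test)  # mean accuracy
```

Swapping in a different algorithm is just a matter of changing the constructor; the rest of the pipeline stays the same.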

 

1. Loading a dataset

sklearn.datasets.load_[DATA]()
df = pd.read_excel()
data = np.array(df)

 

2. Splitting the data into train / test sets

sklearn.model_selection.train_test_split(X, Y, test_size)

 

3. Creating a model object (model instance)

model = sklearn.linear_model.LinearRegression()
model = sklearn.linear_model.LogisticRegression()
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors)
model = sklearn.cluster.KMeans(n_clusters)
model = sklearn.decomposition.PCA(n_components)
model = sklearn.svm.SVC(kernel, C, gamma)

 

4. Training the model (model fitting)

model.fit(train_X, train_Y)

 

5. Predicting new data with the model (predict on test data)

model.predict(test_X)
model.predict_proba(test_X)

sklearn.metrics.mean_squared_error(test_Y, predict_Y) # convention: y_true first, y_pred second
sklearn.metrics.accuracy_score(test_Y, predict_Y)     # (order matters for precision/recall)
sklearn.metrics.precision_score
sklearn.metrics.recall_score
sklearn.metrics.r2_score
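The five steps above can be chained end-to-end; a minimal sketch on the bundled diabetes dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)                       # 1. load a dataset
x_train, x_test, y_train, y_test = train_test_split(         # 2. train/test split
    X, y, test_size=0.3, random_state=0)
model = LinearRegression()                                   # 3. create the model object
model.fit(x_train, y_train)                                  # 4. train the model
mse = mean_squared_error(y_test, model.predict(x_test))      # 5. predict and evaluate
```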

 

 


2. Least Squares Estimation (Scikit-learn LinearRegression, house-price prediction)

1. Loading the data (Boston, USA house prices)

1) Features

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_data = pd.read_excel('boston_house_data.xlsx', index_col=0) # read the Excel file
df_data.head()

df_data[8].value_counts(sort=False).index # .keys()
from collections import Counter
Counter(df_data[3])

# Feature normalization
- Numerical column (variable) → min-max scaling or standardization
- Categorical column → one-hot encoding (not applied to decision trees)

- For linear models, one-hot encoding of categorical columns is all but mandatory.
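A minimal one-hot encoding sketch using pandas' `get_dummies` (the frame and the `NBHD` column here are made up for illustration; scikit-learn's `OneHotEncoder` is an alternative):

```python
import pandas as pd

# Hypothetical toy frame: one numerical column and one categorical column.
df = pd.DataFrame({'RM': [6.5, 5.9, 7.1], 'NBHD': ['A', 'B', 'A']})

# One-hot encode the categorical column; numerical columns pass through untouched.
encoded = pd.get_dummies(df, columns=['NBHD'])
# encoded now has columns: RM, NBHD_A, NBHD_B
```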

 

2) Target

df_target = pd.read_excel('boston_house_target.xlsx', index_col=0)
df_target.head()

 

3) Examining features & target together

df_main = pd.concat([df_data, df_target], axis=1) # concatenate
df_main.head()

# Replace all column names at once
df_main.columns = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'] 
df_main.head()

df_main.describe() # description

# Convert each DataFrame to an np.array (matrix)
boston_data = np.array(df_data)
boston_target = np.array(df_target)

# Check the array's shape (506 samples, 13 data features)
boston_data.shape

# Check the array's shape (506 target values)
boston_target.shape

 

2. Selecting a feature

# Use only one feature 

# Features must always be passed to the model as a 2-D matrix
boston_X = boston_data[:, 12:13] # LSTAT: % lower status of the population
boston_X

boston_Y = boston_target

 

3. Splitting into training & test sets

from sklearn import model_selection

x_train, x_test, y_train, y_test = model_selection.train_test_split(boston_X, boston_Y, test_size=0.3, random_state=0)
# random_state (random_seed or seed) : make the result reproducible

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

random_state=42

In The Hitchhiker's Guide to the Galaxy, 42 is the computer's answer to the ultimate question of life, the universe, and everything.
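A quick sketch showing that a fixed random_state makes the split reproducible (the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
labels = np.arange(10)

# Same random_state → identical split every time this code runs.
a_train, a_test, _, _ = train_test_split(data, labels, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(data, labels, test_size=0.3, random_state=42)
```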

 

4. Creating an empty model object

from sklearn import linear_model

model = linear_model.LinearRegression() # linear regression

 

5. Training the model object (on training data)

# Train the model using the training sets

model.fit(x_train, y_train) # 'fit' the model to the data

print('Coefficients: ', model.coef_) # slope a in y = ax + b -> a ≈ -0.968
print('Intercepts: ', model.intercept_) # intercept b ≈ 34.78

 

6. Testing the trained model (on test data)

model.predict(x_train) # predictions on the training data

# Model's mean squared error on the 354 training samples
print('MSE(Training data) : ', np.mean((model.predict(x_train) - y_train) ** 2))
# Use this!
from sklearn.metrics import mean_squared_error

print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))

# Model's mean squared error on the 152 test samples
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))

# RMSE (root mean squared error): square root of the MSE
np.sqrt( mean_squared_error(model.predict(x_test), y_test) )
# Output:
# MSE(Training data) :  37.933978172880295
# MSE(Test data) :  39.81715050474418
# 6.310083240714355

Not overfitting: training and test MSE are close.

 

7. Visualizing the model

plt.figure(figsize=(10, 10))

plt.scatter(x_test, y_test, color="black") # Test data
plt.scatter(x_train, y_train, color="red", s=1) # Train data

plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line

plt.show()

 

 

 

Full code summary

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection, linear_model
from sklearn.metrics import mean_squared_error

# df_data, df_target: DataFrames loaded earlier with pd.read_excel

# 1. Prepare the data (array!)
boston_data = np.array(df_data)
boston_target = np.array(df_target)

# 2. Feature selection
boston_X = boston_data[:, 12:13] 
boston_Y = boston_target

# 3. Train/Test split
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston_X, boston_Y, test_size=0.3, random_state=0)

# 4. Create model object 
model = linear_model.LinearRegression()

# 5. Train the model 
model.fit(x_train, y_train)

# 6. Test the model
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))

# 7. Visualize the model
plt.figure(figsize=(10, 10))
plt.scatter(x_test, y_test, color="black") # Test data
plt.scatter(x_train, y_train, color="red", s=1) # Train data
plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
plt.show()

3. Least Squares Estimation (Scikit-learn LinearRegression, diabetes prediction)

1. Loading the data (diabetes disease progression)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets, model_selection, linear_model
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()

diabetes.keys()

diabetes['target'].shape

# dir(diabetes)
print(diabetes['DESCR'])

df = pd.DataFrame(diabetes['data']) # Array to dataframe
df.head()

# Already stored as arrays, so they can be used directly
print(diabetes.data.shape)
print(diabetes.target.shape)

diabetes.data[0, :]

 

2. Selecting features

# Use one or many features (the visualization below assumes a single feature)

diabetes_X = diabetes.data[:, 2:3] # Try column index 2 first (Body Mass Index)
diabetes_X

# diabetes_X = diabetes.data[:, 2]
# diabetes_X.reshape(-1, 1)

diabetes_Y = diabetes.target
diabetes_Y
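The `[:, 2:3]` slicing (and the commented-out `reshape(-1, 1)`) exists because scikit-learn estimators expect a 2-D `(n_samples, n_features)` matrix; a small sketch of the difference, on a made-up array:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)   # pretend: 4 samples, 3 features

col_1d = X[:, 2]                  # shape (4,)   -- 1-D, rejected by model.fit
col_2d = X[:, 2:3]                # shape (4, 1) -- 2-D column matrix, accepted
col_reshaped = X[:, 2].reshape(-1, 1)  # same 2-D result via reshape
```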

 

3. Splitting into training & test sets

from sklearn import model_selection

x_train, x_test, y_train, y_test = model_selection.train_test_split(diabetes_X, diabetes_Y, test_size=0.3, random_state=0)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

 

4. Creating an empty model object

# Try simple linear regression first
model = linear_model.LinearRegression()

dir(linear_model)

 

5. Training the model object (on training data)

# Train the model using the training sets

model.fit(x_train, y_train)

 

6. Testing the trained model (on test data)

model.predict(x_test)

# Model's mean squared error on the training data
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))

# Model's mean squared error on the test data
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))

# Square root of error
np.sqrt( mean_squared_error(model.predict(x_test), y_test) )
# Output:
# MSE(Training data) :  3892.7208150824304
# MSE(Test data) :  3921.3720274248517
# 62.62085936351282

Not overfitting: training and test MSE are close.

7. Visualizing the model

plt.figure(figsize=(10, 10))

plt.scatter(x_test, y_test, color="black") # Test data
plt.scatter(x_train, y_train, color="red", s=1) # Train data

plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line

plt.show()

 

Full code summary

* Using all columns

from sklearn import datasets, model_selection, linear_model
from sklearn.metrics import mean_squared_error

# 1. Prepare the data (array!)
diabetes = datasets.load_diabetes()

# 2. Feature selection
diabetes_X = diabetes.data #[:, 2:3] 
diabetes_Y = diabetes.target

# 3. Train/Test split
x_train, x_test, y_train, y_test = model_selection.train_test_split(diabetes_X, diabetes_Y, test_size=0.3, random_state=0)
# 4. Create model object 
model = linear_model.LinearRegression()

# 5. Train the model 
model.fit(x_train, y_train)

# 6. Test the model
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))

# 7. Visualize the model
# plt.figure(figsize=(10, 10))
# plt.scatter(x_test, y_test, color="black") # Test data
# plt.scatter(x_train, y_train, color="red", s=1) # Train data
# plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
# plt.show()
# Output:
# MSE(Training data) :  2804.122899724064
# MSE(Test data) :  3097.146138387797

No severe overfitting.

Since the test-data MSE decreased compared with the single-feature model, we can say the fit has improved somewhat.

 

from sklearn import ensemble
from sklearn import datasets, model_selection, linear_model
from sklearn.metrics import mean_squared_error

# 1. Prepare the data (array!)
diabetes = datasets.load_diabetes()

# 2. Feature selection
diabetes_X = diabetes.data #[:, 2:3] 
diabetes_Y = diabetes.target

# 3. Train/Test split
x_train, x_test, y_train, y_test = model_selection.train_test_split(diabetes_X, diabetes_Y, test_size=0.3, random_state=0)
# 4. Create model object 
model = ensemble.GradientBoostingRegressor()

# 5. Train the model 
model.fit(x_train, y_train)

# 6. Test the model
print('MSE(Training data) : ', mean_squared_error(model.predict(x_train), y_train))
print('MSE(Test data) : ', mean_squared_error(model.predict(x_test), y_test))

# 7. Visualize the model
# plt.figure(figsize=(10, 10))
# plt.scatter(x_test, y_test, color="black") # Test data
# plt.scatter(x_train, y_train, color="red", s=1) # Train data
# plt.plot(x_test, model.predict(x_test), color="blue", linewidth=3) # Fitted line
# plt.show()
# Output:
# MSE(Training data) :  778.3345855072206
# MSE(Test data) :  3688.673032338571

The code above overfits: the training MSE is far lower than the test MSE.
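One common way to rein this in is to limit the ensemble's capacity. The sketch below reruns the experiment with shallower trees and fewer boosting rounds (`max_depth`, `n_estimators`, and `learning_rate` are standard GradientBoostingRegressor parameters, but this particular setting is only an illustration, not a tuned configuration):

```python
from sklearn import datasets, ensemble, model_selection
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    diabetes.data, diabetes.target, test_size=0.3, random_state=0)

# Shallower trees and fewer rounds restrain the model's capacity,
# shrinking the gap between training and test error.
model = ensemble.GradientBoostingRegressor(
    max_depth=2, n_estimators=50, learning_rate=0.1, random_state=0)
model.fit(x_train, y_train)

mse_train = mean_squared_error(y_train, model.predict(x_train))
mse_test = mean_squared_error(y_test, model.predict(x_test))
```

Comparing `mse_train` and `mse_test` before and after such changes is the same diagnostic used throughout this post.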