Data Science

Day 1. Intro To Machine Learning

Gauss1 2020. 7. 28. 01:23

Pandas

The main library used to explore and manipulate data.

import pandas as pd

# Read the CSV file
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# summary statistics
melbourne_data.describe()

# Print the columns
print(melbourne_data.columns)

# Drop rows with missing values before selecting the target and features
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Select the prediction target: the Price column
y = filtered_melbourne_data.Price

# Select the list of feature columns to use as model inputs
feature_names = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[feature_names]

# Print the first few rows
print(X.head())
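It can also help to see why the dropna above is needed; here is a short sketch of basic checks on the melbourne_data frame loaded earlier (these extra checks are not in the original notes):

# Number of rows/columns and the data type of each column
print(melbourne_data.shape)
print(melbourne_data.dtypes)

# Count missing values per column, showing only columns that have any
missing_counts = melbourne_data.isnull().sum()
print(missing_counts[missing_counts > 0])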

scikit-learn

The most widely used library for modeling data held in a DataFrame.

 

Building a model involves the four steps below (a minimal end-to-end sketch follows the list):

1. Define: decide what type of model to use, e.g. a decision tree

2. Fit: capture patterns from the data

3. Predict: generate predictions from the fitted model

4. Evaluate: determine how accurate the model's predictions are
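As a minimal, self-contained sketch of these four steps, here is a toy example; the DataFrame and its values are invented purely for illustration and are not from the dataset used below.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Toy data: six houses with two features and a price
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5, 2],
    "Landsize": [150, 300, 450, 250, 600, 120],
    "Price": [400000, 650000, 900000, 600000, 1200000, 380000],
})
toy_X = toy[["Rooms", "Landsize"]]
toy_y = toy.Price

toy_model = DecisionTreeRegressor(random_state=0)  # 1. Define
toy_model.fit(toy_X, toy_y)                        # 2. Fit
toy_preds = toy_model.predict(toy_X)               # 3. Predict
print(mean_absolute_error(toy_y, toy_preds))       # 4. Evaluate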

 

# Define the model
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

# Predict
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

# Model validation using mean absolute error (MAE)
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
# Note: this is an "in-sample" score computed on the same data the model
# was trained on, so it overstates how well the model will do on new data
mean_absolute_error(y, predicted_home_prices)

# Split the data into training and validation sets
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
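To make the gap concrete, the same model can be scored on both splits; the training error is typically much lower than the validation error, which is the overfitting symptom discussed next (a sketch, not in the original notes):

# Compare training MAE against validation MAE for the same model
train_predictions = melbourne_model.predict(train_X)
print("Train MAE:", mean_absolute_error(train_y, train_predictions))
print("Validation MAE:", mean_absolute_error(val_y, val_predictions))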


# Testing model options: tuning the tree size (max_leaf_nodes)
# Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
# Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
    
# Random forest: builds many decision trees and averages their predictions
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
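As one more sketch (not in the original notes), the number of trees in the forest can be varied through the n_estimators parameter and compared on the same validation split; the specific sizes tried here are arbitrary.

# Compare validation MAE for different forest sizes
for n_trees in [10, 50, 100]:
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    rf.fit(train_X, train_y)
    rf_preds = rf.predict(val_X)
    print("n_estimators: %d \t Mean Absolute Error: %d" % (n_trees, mean_absolute_error(val_y, rf_preds)))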

 
