[ML/DL study note] 2-1 Train Set and Test Set

Study Source 혼자 공부하는 머신러닝/딥러닝 Chapter 01-1
Study Date 2021/11/01

Supervised Learning and Unsupervised Learning

ML algorithms can be classified into Supervised Learning and Unsupervised Learning.

In Supervised Learning, we need data and its answer (label) to train.

In the Bream and Smelt example, the data was length and weight, and the answer was whether the fish is a Bream or not.

We call these the input and the target. Together, they are called the Training Data.

And as I told you before, the length and weight used as the input are called Features.

Supervised Learning trains the algorithm to produce the correct answer
because it already has the answers (targets).

Unsupervised Learning? We will learn about that in Ch6.


If we only have the input, we need to use the Unsupervised one.

But here we had both of `em, so we can use the Supervised one.

Train Set and Test Set

If you take a test with the answer sheet in hand, you can easily get a full score. Right?

ML is the same. If you train with the Bream & Smelt data and test with that same data,

the model already has the answers. So we should separate the dataset into a Train Set and a Test Set.


Back to Colab, open BreamAndSmelt.

fish_length = [25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31.0, 31.0, 31.5, 32.0, 32.0, 32.0, 33.0, 33.0, 33.5, 33.5, 34.0, 34.0, 34.5, 35.0, 35.0, 35.0, 35.0, 36.0, 36.0, 37.0, 38.5, 38.5, 39.5, 41.0, 41.0, 9.8, 10.5, 10.6, 11.0, 11.2, 11.3, 11.8, 11.8, 12.0, 12.2, 12.4, 13.0, 14.3, 15.0]
fish_weight = [242.0, 290.0, 340.0, 363.0, 430.0, 450.0, 500.0, 390.0, 450.0, 500.0, 475.0, 500.0, 500.0, 340.0, 600.0, 600.0, 700.0, 700.0, 610.0, 650.0, 575.0, 685.0, 620.0, 680.0, 700.0, 725.0, 720.0, 714.0, 850.0, 1000.0, 920.0, 955.0, 925.0, 975.0, 950.0, 6.7, 7.5, 7.0, 9.7, 9.8, 8.7, 10.0, 9.9, 9.8, 12.2, 13.4, 12.2, 19.7, 19.9]

fish_data = [[l, w] for l, w in zip(fish_length, fish_weight)]
fish_target = [1] * 35 + [0] * 14

Now we have our sample data list (49 samples in total).

The first 35 will be used as the Train Set, and the other 14 will be used as the Test Set.

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()

We use Python's slicing here. Please make sure you are comfortable with it.
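If slicing feels rusty, here is a tiny refresher of my own (the list here is made up for illustration):

```python
nums = list(range(10))   # [0, 1, 2, ..., 9]

head = nums[:5]          # indices 0..4 -> the first 5 elements
tail = nums[5:]          # index 5 to the end -> the remaining 5

print(head)              # [0, 1, 2, 3, 4]
print(tail)              # [5, 6, 7, 8, 9]
```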

train_input  = fish_data[:35]
train_target = fish_target[:35]
test_input   = fish_data[35:]
test_target  = fish_target[35:]

We made our sets; now let's train.

kn = kn.fit(train_input, train_target)
kn.score(test_input, test_target)
result: 0.0

Sampling Bias

Uh oh, the result is 0.0. Something is wrong!

This is because the Train Set includes only Bream, and the Test Set includes only Smelt.

When the Train Set and Test Set are not mixed well like this, it is called Sampling Bias.
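You can see the bias directly by counting each class in the naive split. This is a small check of my own; it just rebuilds fish_target exactly as above:

```python
# fish_target as defined above: 35 Bream (1) followed by 14 Smelt (0)
fish_target = [1] * 35 + [0] * 14

train_target = fish_target[:35]
test_target = fish_target[35:]

# Count how many of each class ended up in each set
print(train_target.count(1), train_target.count(0))   # 35 0 -> Bream only
print(test_target.count(1), test_target.count(0))     # 0 14 -> Smelt only
```

The model never saw a single Smelt during training, so it has no chance on a test set made entirely of Smelt.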

To mix this well, we will use another library: NumPy. 👏 GIVE IT AN APPLAUSE!! 👏

Numpy

NumPy is the representative array library in Python.

import numpy as np	# of course, Colab already has the most important libraries installed.

input_arr  = np.array(fish_data)
target_arr = np.array(fish_target)

Now that we have prepared the NumPy arrays, let's choose samples randomly.

! Be careful ! You must move the input values and target values together, so each input stays paired with its target.

np.random.seed(42)
index = np.arange(49)
np.random.shuffle(index)

train_input  = input_arr[ index[:35]]
train_target = target_arr[index[:35]]

test_input   = input_arr[ index[35:]]
test_target  = target_arr[index[35:]]
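As a sanity check of my own, we can count the classes in each shuffled set with np.bincount (it returns the counts for class 0, Smelt, and class 1, Bream). The counts agree with the test_target printout at the end of this note:

```python
import numpy as np

# Rebuild the target array and the shuffled index exactly as above
target_arr = np.array([1] * 35 + [0] * 14)
np.random.seed(42)
index = np.arange(49)
np.random.shuffle(index)

train_target = target_arr[index[:35]]
test_target = target_arr[index[35:]]

# bincount -> [count of 0s (Smelt), count of 1s (Bream)]
print(np.bincount(train_target))   # [ 8 27] -> both classes present
print(np.bincount(test_target))    # [6 8]   -> both classes present
```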

OK, looks good. All the data is prepared. Let's draw it as a scatter plot.

import matplotlib.pyplot as plt
plt.scatter(train_input[:,0], train_input[:,1])
plt.scatter(test_input[ :,0], test_input[ :,1])
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

Did you get your graph?

If you did, train the model yourself this time.

I will hide the answer below.

Show more
kn = kn.fit(train_input, train_target)
kn.score(test_input, test_target)		# You must get 1.0 here
print(kn.predict(test_input))
print(test_target)						# compare this two!
										# [0 0 1 0 1 1 1 0 1 1 0 1 1 0]
										# [0 0 1 0 1 1 1 0 1 1 0 1 1 0]



Well, that's it for this post.


Hope to see you again in next post!👏


If you enjoyed this post or found any errors,
just contact me in the comments, on the 42 network Slack (kkim), or at kwanho0096@gmail.com.