[ML/DL study note] 2-1 Train Set and Test Set

Study Source 혼자 공부하는 머신러닝/딥러닝 Chapter 01-1
Study Date 2021/11/01

Supervised Learning and Unsupervised Learning

ML algorithms can be classified into Supervised Learning and Unsupervised Learning.

In Supervised Learning, we need data and its answer (label) to train.

In the Bream and Smelt example, the data was length and weight, and the answer was whether the fish is a Bream or not.

We call these the input and the target. Together, they are called the Training Data.

And as I told you before, the length and weight used as the input are called Features.

Supervised Learning trains the algorithm to produce the correct answer
because it already has the answers (targets).

Unsupervised Learning? We will learn about that in Ch6.


If we only have the input, we need to use the Unsupervised one.

But here we had both of `em, so we can use the Supervised one.

Train Set and Test Set

If you take a test with the answer sheet in hand, you can easily get a full score. Right?

ML is the same. If you train with the Bream & Smelt data and test with that same data,

the model already has the answers. So we should separate the dataset into a Train Set and a Test Set.


Back to Colab, open BreamAndSmelt.

fish_length = [25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31.0, 31.0, 31.5, 32.0, 32.0, 32.0, 33.0, 33.0, 33.5, 33.5, 34.0, 34.0, 34.5, 35.0, 35.0, 35.0, 35.0, 36.0, 36.0, 37.0, 38.5, 38.5, 39.5, 41.0, 41.0, 9.8, 10.5, 10.6, 11.0, 11.2, 11.3, 11.8, 11.8, 12.0, 12.2, 12.4, 13.0, 14.3, 15.0]
fish_weight = [242.0, 290.0, 340.0, 363.0, 430.0, 450.0, 500.0, 390.0, 450.0, 500.0, 475.0, 500.0, 500.0, 340.0, 600.0, 600.0, 700.0, 700.0, 610.0, 650.0, 575.0, 685.0, 620.0, 680.0, 700.0, 725.0, 720.0, 714.0, 850.0, 1000.0, 920.0, 955.0, 925.0, 975.0, 950.0, 6.7, 7.5, 7.0, 9.7, 9.8, 8.7, 10.0, 9.9, 9.8, 12.2, 13.4, 12.2, 19.7, 19.9]

fish_data = [[l, w] for l, w in zip(fish_length, fish_weight)]
fish_target = [1] * 35 + [0] * 14

Now we have our sample data list (49 samples in total).

The first 35 will be used as the Train Set, and the other 14 will be used as the Test Set.

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()

We use Python's slicing here. Please make sure you are comfortable with it.
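If slicing feels rusty, here is a tiny refresher of my own (the list here is made up for illustration):

```python
nums = list(range(10))   # [0, 1, 2, ..., 9]

head = nums[:5]          # indices 0..4 -> the first 5 elements
tail = nums[5:]          # index 5 to the end -> the remaining 5

print(head)              # [0, 1, 2, 3, 4]
print(tail)              # [5, 6, 7, 8, 9]
```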

train_input  = fish_data[:35]
train_target = fish_target[:35]
test_input   = fish_data[35:]
test_target  = fish_target[35:]

We made our sets; now let's train.

kn = kn.fit(train_input, train_target)
kn.score(test_input, test_target)
result: 0.0

Sampling Bias

Uh oh, the result is 0.0. Something is wrong!

This is because the Train Set includes only Bream, and the Test Set includes only Smelt.

When the Train Set and Test Set are not mixed well like this, it is called Sampling Bias.
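You can see the bias directly by counting each class in the naive split. This is a small check of my own; it just rebuilds fish_target exactly as above:

```python
# fish_target as defined above: 35 Bream (1) followed by 14 Smelt (0)
fish_target = [1] * 35 + [0] * 14

train_target = fish_target[:35]
test_target = fish_target[35:]

# Count how many of each class ended up in each set
print(train_target.count(1), train_target.count(0))   # 35 0 -> Bream only
print(test_target.count(1), test_target.count(0))     # 0 14 -> Smelt only
```

The model never saw a single Smelt during training, so it has no chance on a test set made entirely of Smelt.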

To mix this well, we will use another library: NumPy. 👏 GIVE IT AN APPLAUSE!! 👏

Numpy

NumPy is the representative array library in Python.

import numpy as np	# of course, Colab already has the most important libraries installed.

input_arr  = np.array(fish_data)
target_arr = np.array(fish_target)

Now that we have prepared the NumPy arrays, let's choose samples randomly.

! Be careful ! You must move the input values and target values together, so each input stays paired with its target.

np.random.seed(42)
index = np.arange(49)
np.random.shuffle(index)

train_input  = input_arr[ index[:35]]
train_target = target_arr[index[:35]]

test_input   = input_arr[ index[35:]]
test_target  = target_arr[index[35:]]
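As a sanity check of my own, we can count the classes in each shuffled set with np.bincount (it returns the counts for class 0, Smelt, and class 1, Bream). The counts agree with the test_target printout at the end of this note:

```python
import numpy as np

# Rebuild the target array and the shuffled index exactly as above
target_arr = np.array([1] * 35 + [0] * 14)
np.random.seed(42)
index = np.arange(49)
np.random.shuffle(index)

train_target = target_arr[index[:35]]
test_target = target_arr[index[35:]]

# bincount -> [count of 0s (Smelt), count of 1s (Bream)]
print(np.bincount(train_target))   # [ 8 27] -> both classes present
print(np.bincount(test_target))    # [6 8]   -> both classes present
```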

OK, looks good. All the data is prepared. Let's draw it as a scatter plot.

import matplotlib.pyplot as plt
plt.scatter(train_input[:,0], train_input[:,1])
plt.scatter(test_input[ :,0], test_input[ :,1])
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

Did you get your graph?

If you did, train the model yourself this time.

I will hide the answer below.

Show more
kn = kn.fit(train_input, train_target)
kn.score(test_input, test_target)		# You must get 1.0 here
print(kn.predict(test_input))
print(test_target)						# compare this two!
										# [0 0 1 0 1 1 1 0 1 1 0 1 1 0]
										# [0 0 1 0 1 1 1 0 1 1 0 1 1 0]



Well, that's it for this post.


Hope to see you again in next post!👏


If you enjoyed this post or found any errors,
just contact me in the comments, on the 42 network Slack (kkim), or at kwanho0096@gmail.com.