R을 통한 데이터 정규화와 전처리 이해하기

Posted Jan 28, 2025

By 박예성

15 min read

전처리 model은 정확도를 높이기 위해 feature engineering가 필요합니다. R을 통해 정규화, 데이터 전처리를 실습해 봅시다. 따라 하다보면 금방 이해가 될 것입니다. 직접 입력하고 이해해 보는 것을 추천드립니다.

통계학 기본 개념

합계 (Sum)
평균 (Mean)
분산 (Var)
표준편차 (SD)
표준오차 (Standard Error): 신뢰구간 계산 시 사용

모든 것을 분석할 수 없을 경우 표본을 뽑아 진행한다. 이때 얼마나 신뢰할 수 있는지 유의 수준 5% (표본편차 2배수)까지 해서 95% 내에서 분석한다.

표본분산:
$s^2 = \frac{\sum_{i=1}^{n} (x[i] - \bar{x})^2}{n-1}$
표본표준편차:
$s = \sqrt{s^2}$
표준오차:
$SE = \frac{s}{\sqrt{n}}$
변동계수 [표준 편차 / 평균] : 모집단의 크기에 따라 멀어지는 것도 달라지므로 변동을 비교 하기 위해(표준편차가 같더라도 평균이 다른 것), 평균이 다른 것을 얼마나 분산이 있나 비율로 비교하기 위해. 즉 측정 단위가 다른 자료를 비교

정규화

Min-Max 정규화:
$X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$
Z-점수 표준화
Z-점수
- 표준화된 척도로 데이터가 평균에서 얼마나 떨어져 있는지 측정
- 평균 0, 표준편차 1인 표준 정규 분표
- 즉, 모든 평균마다 표준편차를 구하는 것이 아니라 표준정규분포에 맞추는 것 $Z = \frac{X - \mu}{\sigma}$
Robust 정규화:
중위수와 IQR 사용 (X - 중위수 / IQR) 평균은 양 극단값에 영향을 많이 받으니 양 끝값을 빼서 정규화 하는 방식

데이터 전처리 기법

결측치 처리: 제거, 평균 대체, KNN 알고리즘(최근접 이웃 알고리즘, 비슷하게 이웃한 것으로 처리)
이상치 처리: 상한/하한 값 filtering
범주화 (Factorizing): 원-핫 인코딩 (0과 1로 변환)
이미지 정규화: 픽셀 값을 255로 나눔

예제

dim, 이상치

  
iris # 종별 50 / 150 head(iris) dim(iris) # 차원 확인 length(iris) # 열수 nrow(iris) # 행수 ncol(iris) # 열수 names(iris) # 열이름 summary(iris) str(iris) # 설명 : 대부분 에러는 데이터 타입문제니 확인할 때 유용 class(iris) sapply(iris, class) # 기본이 열, 열로 class로 알려줌 boxplot(iris) # 동그라미 : 이상치 확인 (상한치, 하한치 넘는 것) 

결측치

  
install.packages("mlbench") # 표준 데이터 셋 제공 library(mlbench) # 로딩 data ("BostonHousing", package="mlbench") original <- BostonHousing # 백업 set.seed(100) # sampling : 컴퓨터 난수는 의사난수 : 정해져 있음 str(BostonHousing) # 집값(medv) 예측하기 위해 14 종류 변수 506개가 있다 sum(is.na(BostonHousing)) # na => 결측치 확인 BostonHousing<-na.omit(BostonHousing) # 결측치 삭제 BostonHousing[sample(1:nrow(BostonHousing), 40), "rad"] <- NA #결측치 40개 만들기 BostonHousing[sample(1:nrow(BostonHousing), 40), "ptratio"] install.packages("mice") library(mice) sum(is.na(BostonHousing)) # 결측치 수 확인 # 함수 -> 입력 (sapply는 열데이터) sapply(BostonHousing, function(x) sum(is.na(x))) # 함수 작성 후 전달, 한열에 na가 몇개 있는가 

선형회귀

  
# lm : 선형회귀 모델 t=ax+by+c : 계수(a,b) 구함 (즉, medv에 ptratio랑 rad가 얼마나 영향인지지) # 종속변수 ~ 독립변수 lm(medv ~ ptratio + rad, data=BostonHousing, na.action = na.omit) # medv(종속변수) ~ (독립변수)ptratio와 rad로, na.action = na.omit(결측치 처리)는 자동으로 결측치 처리 #(Intercept) : 절편(c) ptratio rad  #56.4829 -1.7405 -0.1971 # 즉, y(medv) = -1.7405*ptratio -0.1971*rad + 56.4829 # 예측 오차는 발생 

중위수 대체

  
install.packages("Hmisc") library(Hmisc) impute(BostonHousing$rad, mean) # R의 특성 : 원본 데이터를 보존 => 다시 대입 # 함수는 원본에 영향x 재정의 해줘야 함 BostonHousing$rad <- impute(BostonHousing$rad, median) # 결측치를 median으로 sapply(BostonHousing, function(x) sum(is.na(x))) 

범주화

  
# 범주화 (x <- -5:5) cut(x, breaks = 2) # breaks : 구간bin의 개수 : 2개의 구간 : -5~0는 -5~0구간에 있다. 1~5는 0에서 5구간. 01은 포함x cut(x, breaks = c(-6, 2,5)) # 구간 지정 : -6~2, 2~5 age <- c(0, 12, 89, 14, 25, 2, 65) cut(age, breaks = c(0, 19, 24, 40, 65, Inf), #0(하한값)은 범위값에 포함 x labels = c("minor", "Children", "Youth", "Adult", "senior")) 

범주화 & ggplot2 : 시각화

  
library(ggplot2) #시각화 library(MASS) str(Cars93) ?Cars93 hist(Cars93$MPG.highway) disc_1 <- Cars93[,c("Model","MPG.highway")] # 2개의 열데이터만 추출 (열추출) head(disc_1) within( Cars93, {MPG.highway >=20 & MPG.highway <25}) #within 안에서만 사용(Cars93$ 생략), 20이상 25미만 range(disc_1["MPG.highway"]) # range : 범위값 출력 #데이터 프레임에 데이터 실시간 추가가 가능 disc_1 <- within( disc_1, { MPG.highway_cd = character(0) #초기화 MPG.highway_cd[ MPG.highway >=20 & MPG.highway <25 ] = "20~25" #범위(문자열)로 채워짐 MPG.highway_cd[ MPG.highway >=25 & MPG.highway <30 ] = "25~30" MPG.highway_cd[ MPG.highway >=30 & MPG.highway <35 ] = "30~35" MPG.highway_cd[ MPG.highway >=35 & MPG.highway <40 ] = "35~40" MPG.highway_cd[ MPG.highway >=40 & MPG.highway <45 ] = "40~45" MPG.highway_cd[ MPG.highway >=45 & MPG.highway <50 ] = "45~50" MPG.highway_cd = factor(MPG.highway_cd, # 문자열에서 factor로 범주화 (차원 축소 : 복잡한 것을 간단하게게) level = c("20~25", "25~30", "30~35", "35~40", "40~45", "45~50")) }) disc_1 table(disc_1$MPG.highway_cd) # 도수분포표 (수 세줌) 

위에 것에 이어서 그풉핑을 통해 기술통계를 해봅시다.

기술통계

  
# 그룹핑( -> 집계함수) : MPG.highway를 MPG.highway_cd그룹별로 계산 (데이터, 범주, 계산방식) tapply(disc_1$MPG.highway, disc_1$MPG.highway_cd, sum) # 함수가 아니라 주소만 절다해서 ()가 없음 tapply(disc_1$MPG.highway, disc_1$MPG.highway_cd, mean) tapply(disc_1$MPG.highway, disc_1$MPG.highway_cd, sd) # 표준편차 # 이산적(범주화해서) (연속적x) # y축 = (범주별로)자동으로 카운트 ggplot(disc_1, aes(x=MPG.highway_cd, fill=MPG.highway_cd)) + geom_bar() # x = 구간, fill = 채우는 것. geom_bar() : 막대그래프 ggplot(disc_1, aes(x=MPG.highway_cd, fill=MPG.highway_cd)) + geom_dotplot(binwidth = 0.1) 

원-핫 인코딩 예제

  
# : 범주형변수의 정규화 cust_id <- c("대한", "민국", "만세", "영원", "무궁", "발전", "행복") age <-c(25,45,31,30,49,53,27) cust_profile <- data.frame(cust_id, age, stringsAsFactors = F) # factor로 안만들기 위해 stringsAsFactors = F cust_profile # 새로운 변수의 추가 : 파생변수 (나이에 따라서) # R에서의 ifelse 3항 연산자 (elif) cust_profile <- transform(cust_profile, age_20 = ifelse(age >= 20 & age < 30, 1, 0 ), age_30 = ifelse(age >= 30 & age < 40, 1, 0 ), age_40 = ifelse(age >= 40 & age < 50, 1, 0 ), age_50 = ifelse(age >= 50 & age < 60, 1, 0 )) cust_profile 

min-max 정규화

  
# feature 영향력을 균등하게 (편향을 제거) min_max_norm <- function(x) { # (주로 data.frame) x => 열 단위로 들어옴 (x - min(x)) / (max(x) - min(x)) } str(iris) lapply(iris[1:4], min_max_norm) # 5번째는 factor라서 제거, l-> list로 나옴 ($~) iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm)) head(iris_norm) iris_norm$Species <- iris$Species # 제거했던 Species(종속변수) 추가 head(iris_norm) 

z 점수 정규화

  
iris$Sepal.Width <- (iris$Sepal.Width - mean(iris$Sepal.Width)) / sd(iris$Sepal.Width) mean(iris$Sepal.Width) sd(iris$Sepal.Width) 

문제 : iris 데이터를 통해 z 점수 정규화 함수를 작성하고 iris 데이터를 정규화해서 재표현해보시오

정답 확인

z <- function(x){(x-mean(x))/sd(x)} iris_z <- as.data.frame(lapply(iris[1:4], z)) iris_z$Species <- iris$Species head(iris_z)

z 점수 정규화 또 다른 방법

  
class(scale(iris[1:4])) # 데이터 타입이 matrix, array iris_standardize <- as.data.frame(scale(iris[1:4])) # 데이터 프레임으로 변환 iris_standardize$Species <- iris$Species iris_standardize 

이상치

  
data("iris") re = boxplot(iris) re $stats #통계 : summary # [,1] [,2] [,3] [,4] [,5] # [1,] 4.3 2.2 1.00 0.1 1 min # [2,] 5.1 2.8 1.60 0.3 1 1사분위수 # [3,] 5.8 3.0 4.35 1.3 2 mean # [4,] 6.4 3.3 5.10 1.8 3 3사분위수 # [5,] 7.9 4.0 6.90 2.5 3 max #  $n :각열의 데이터 개수 # [1] 150 150 150 150 150 $conf : confidence 신뢰구간, 중위수의 신뢰구간 # 95% 신뢰구간 평균 +- 1.96*SE(standard error 표준오차) SE = 표준편차 / root(n) # [,1] [,2] [,3] [,4] [,5] # [1,] 5.632292 2.935497 3.898477 1.10649 1.741987 # [2,] 5.967708 3.064503 4.801523 1.49351 2.258013 $out # 이상치 # [1] 4.4 4.1 4.2 2.0 $group # 이상치가 속한 그룹 # [1] 2 2 2 2 $names # [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"  # [5] "Species" re$out re$group iris$Sepal.Width[iris$Sepal.Width %in% re$out] <- NA # 이상치 제외 summary(iris$Sepal.Width) 

문제 : iris 데이터에서 결측치의 개수를 확인하고, 결측치를 Sepal.width의 평균값으로 대체

정답 확인

sapply(iris, function(x) sum(is.na(x))) library(mice) md.pattern(iris) iris$Sepal.Width <- impute(iris$Sepal. Width, mean) # mean 또는 median sum(is.na(iris))

사분위, 중위값, IQR

  
uantile(iris$Sepal.Width, 0.25) # 1사분위 quantile(iris$Sepal.Width, 0.75) # 3사분위 quantile(iris$Sepal.Width, 0.5) # 중위값 IQR(iris$Sepal.Width) # 3사분위 - 1사분위  quantile(iris$Sepal.Width, 0.75) - quantile(iris$Sepal.Width, 0.25) 

이상치 제거

  
(iris$Sepal.Width= ifelse(iris$Sepal.Width > 3, iris$Sepal.Width, NA)) # 1보다 큰 값은 그대로 두고, 아니면 NA # iris %>% filter(!is.na(iris)) # na가 아닌 것이 필터/ %>% must be a logical vector is.na(iris[,1]) is.na(iris) sum(is.na(iris)) fivenum(iris[,1], na.rm = TRUE) # fivenum(중위수) : min, lower-hinge, median, upper-hinge, maximun outdata <- iris[1:4] ncol(outdata)-1 for(i in 1:(ncol(outdata)-1)){ uppercut=fivenum(outdata[,i], na.rm=T)[4]+1.5*IQR(outdata[,i],na.rm=T) # na는 제외하고, 상한치 lowercut=fivenum(outdata[,i], na.rm=T)[4]-1.5*IQR(outdata[,i],na.rm=T) # 하한치 out<-filter(outdata, outdata[,i]<=uppercut , outdata[,i]>=lowercut) # 이상치 제거 } str(out) 

상한치, 하한치

  
install.packages("nycflights13") library(nycflights13) weather<-nycflights13::weather attach(weather) # 데이터를 패키지 로등히듯이 메모리에 로딩 # $weather을 생략해도 되게 됨 search() # 메모리에 로딩된 패키지를 확인 함수 names(weather) # 어떤 이름이 있는지 확인 str(weather)와 다르게 이름만만 length(temp) IQR(temp, na.rm=T) median(temp, na.rm=T) upper <- quantile(temp, 0.75, na.rm = T) + 1.5*IQR(temp, na.rm = T) lower <- quantile(temp, 0.25, na.rm = T) + 1.5*IQR(temp, na.rm = T) weather <- weather[weather$temp < upper, ] # 인덱스를 이용하여 filtering weather <- weather[weather$temp > lower, ] length(temp) # 이상치는 없고 결측치만 있음 

데이터 생성 및 분석 예제

  
# 데이터 생성 state_table <- data.frame( key=c("SE", "DJ", "DG", "SH", "QD"), name=c("서울", "대전", "대구", "상해", "칭따오"), country =c("한국","한국","한국","중국","중국")) state_table #년도 month_table <- data.frame(key=1:12, desc=c("Jan","Feb","Mar","Apr","May","Jun","Jul", # describe : 서술 "Aug","Sep","Oct","Nov","Dec"), quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3", # 4사분면 "Q3","Q3","Q4","Q4","Q4")) month_table #상품 테이블 prod_table <- data.frame(key=c("Printer","Tablet","Laptop"), price=c(225,570,1120)) prod_table (prod_table1 <- data.frame(Printer=225, Tablet=570,Laptop=1120)) (prod_table1 <- t(prod_table1)) gen_sales <- function(no_of_recs){ loc <- sample(state_table$key, no_of_recs, replace = T, prob = c(2,2,1,1,1)) # prob 확률 조작 : 2번(서울, 대전) 1번(대구, 상해, 칭따오) time_month <- sample(month_table$key, no_of_recs, replace=T) # replace : 복원추출 (뽑고 다시 넣음) time_year <- sample(c(2012,2013), no_of_recs, replace=T) prod <- sample(prod_table$key, no_of_recs, replace = T, prob = c(1,3,2)) unit <- sample(c(1,2), no_of_recs, replace = T, prob = c(10,3)) # 개수 1개 2개를 10대 3으로 amount <- unit*prod_table1[prod,1] sales <- data.frame(month=time_month, year=time_year, loc=loc, prod=prod, unit=unit, amount=amount) # sort vs order : 정렬 # sort는 데이터 자체를 정렬 (ex: ㄱㄴㄷ 순서) # order는 정렬된 데이터의 인덱스 (ex : 정렬된 인덱스 ; 2, 1, 3) sales <- sales[order(sales$year, sales$month),] row.names(sales) <- NULL return(sales) } sales_fact <- gen_sales(500) str(sales_fact) head(sales_fact) tail(sales_fact) # 월별 제품 판매 현황 (년도별 지역별 / 상품별(행) 월별(열) 판매현황) # 그루핑 => 집계함수(여러개 데이터)가 와야함 (revenue_cube <- tapply(sales_fact$amount, sales_fact[,c("prod", "month", "year","loc")], # 그룹이 되는 열 FUN=function(x){return(sum(x))})) # 각 열 합합 dimnames(revenue_cube) revenue_cube # 연도와 지역은 2*5 = 10가지로 상품,월별 표로 나옴 

R, 데이터

This post is licensed under CC BY 4.0 by the author.