Tokenizing, POS tagging, Stopwords, Lemmatization


밀크쌀과자 2024. 7. 1. 20:08

The process of data analysis for text data

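1. Prepare the text data as a str.

2. Tokenize (morphological analysis) → POS tagging (part-of-speech tagging) → remove stopwords

3. Count words & build a word dictionary → visualize the data based on the word dictionary → (+ apply machine learning / deep learning models)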
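Step 3 can be sketched with the standard library alone. This is a minimal illustration, not the post's own code: the `tokens` list is hypothetical sample data standing in for the cleaned output of step 2, and `vocab` is just one possible layout for a word dictionary (word → integer index).

```python
from collections import Counter

# Hypothetical, already-cleaned tokens standing in for the output of step 2.
tokens = ["movie", "great", "plot", "movie", "actor", "great", "movie"]

# Count how many times each token occurs.
counts = Counter(tokens)

# Build a word dictionary that maps each word to an integer index,
# ordered from most to least frequent (ties keep first-seen order).
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

print(counts.most_common(2))  # [('movie', 3), ('great', 2)]
print(vocab)                  # {'movie': 0, 'great': 1, 'plot': 2, 'actor': 3}
```

The `counts` object can be passed more or less directly to a plotting library for the visualization step, while an index mapping like `vocab` is a common input format for machine learning models.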

1. Tokenizing an English sentence

import nltk

# nltk.download()  # opens the downloader for the packages NLTK needs for text processing

# Download the following packages
# Corpora : stopwords, wordnet
# Models : averaged_perceptron_tagger, maxent_treebank_pos_tagger, punkt

# Store the sentence to preprocess in a string variable
sentence = 'NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.'

# Tokenize the sentence and output the result
nltk.word_tokenize(sentence)  # tokenizes the sentence at the word level

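2. POS tagging an English sentence

# Tokenize the sentence, then tag each token with its part of speech

# tokens is a list of word tokens
tokens = nltk.word_tokenize(sentence)  # tokenize the sentence
nltk.pos_tag(tokens)  # tag the tokens; returns a list of (token, tag) pairs

# The first letter of a tag indicates the part of speech
# N : noun
# V : verb
# J : adjective (the tagger's tags start with J; WordNet uses 'a' for adjectives)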
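The first-letter rule above can be wrapped in a small helper that converts a tag from `nltk.pos_tag` into the POS letter WordNet expects. This is a sketch under assumptions: `to_wordnet_pos` is a hypothetical name, the adverb branch (`RB` → `'r'`) is an addition that the post itself does not use, and anything unrecognized falls back to `'n'`, which is also the default `pos` of `WordNetLemmatizer.lemmatize`.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if treebank_tag.startswith("J"):   # JJ, JJR, JJS -> adjective
        return "a"
    if treebank_tag.startswith("V"):   # VB, VBD, VBG, ... -> verb
        return "v"
    if treebank_tag.startswith("RB"):  # RB, RBR, RBS -> adverb (assumed extension)
        return "r"
    return "n"                         # nouns and everything else


for tag in ["NNS", "VBD", "JJR", "RB", "IN"]:
    print(tag, "->", to_wordnet_pos(tag))
```

The adverb check uses `RB` rather than `R` so that the particle tag `RP` is not mistaken for an adverb.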

3. Removing stopwords

# Import the stopwords corpus directly from the nltk module
from nltk.corpus import stopwords

stopwords.fileids()
# NLTK has no Korean stopword list. For Korean NLP we will use a library specialized for Korean.

# Load the English stopwords and store them in a variable (a list of stopword strings)
stopWords = stopwords.words('english') # supported languages: stopwords.fileids()

# The stopword list is a plain list, so the developer can adjust it as needed

# Remove stopwords from the sentence

result = []  # list that collects the tokens left after stopword removal

for token in tokens:  # check each token against the stopword list
    if token.lower() not in stopWords:  # if the lowercased token is not in stopWords:
        result.append(token)  # keep the token

print(result)  # output the result

# Add the comma (,) and period (.) to the stopwords and apply them again

stop_words = stopwords.words("english") # stop_words is a list
stop_words.append(',')
stop_words.append('.')

result = []  # list that collects the tokens left after stopword removal

for token in tokens:  # check each token against the stopword list
    if token.lower() not in stop_words:  # if the lowercased token is not in stop_words:
        result.append(token)  # append the token to the list

print(result)  # output the result

* Important: NLTK's stopwords are all lowercase, so always call lower() on a token before comparing it against the list.
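The same filtering can be written as a reusable function. The sketch below is an illustration rather than the post's method: `remove_stopwords` and the tiny `sample_stop_words` list are hypothetical, chosen so the example runs without downloading any NLTK data. In practice the list from `stopwords.words('english')` would be passed in instead. Converting the list to a set makes each membership test constant-time, and `string.punctuation` covers single-character punctuation without appending marks one by one.

```python
import string


def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is neither a stopword nor punctuation."""
    # A set gives fast membership tests; string.punctuation adds ',', '.', '!', etc.
    blocked = set(stop_words) | set(string.punctuation)
    return [token for token in tokens if token.lower() not in blocked]


# Small hand-written stopword list, for illustration only.
sample_stop_words = ["is", "a", "for", "to", "with", "the"]
sample_tokens = ["NLTK", "is", "a", "leading", "platform", ",", "The", "end", "."]

print(remove_stopwords(sample_tokens, sample_stop_words))
# ['NLTK', 'leading', 'platform', 'end']
```

Note that `"The"` is removed only because the comparison lowercases it first. Multi-character punctuation tokens (for example `...`) are not single characters in `string.punctuation`, so they would still need to be added by hand.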

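4. Preprocessing movie review data - Lemmatizing

Lemmatization : reducing a word to its lemma, the base dictionary form, by analyzing its morphology and dictionary entry and stripping away inflected and derived forms.

# Create a WordNetLemmatizer object
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Without a pos argument, the word is treated as a noun
print(lemmatizer.lemmatize("cats"))   # cat
print(lemmatizer.lemmatize("geese"))  # goose

print(lemmatizer.lemmatize("better"))           # better (looked up as a noun)
print(lemmatizer.lemmatize("better", pos="a"))  # good (looked up as an adjective)

print(lemmatizer.lemmatize("ran"))       # ran (looked up as a noun)
print(lemmatizer.lemmatize("ran", 'v'))  # run (looked up as a verb)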

# Stopwords
stop_words = stopwords.words("english")
stop_words.append(',')
stop_words.append('.')

# Open the file in read mode ('r') with 'utf-8' encoding; the with block closes it automatically
with open('moviereview.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()  # read the text file into a list of lines

sentence = lines[1]  # the second line of the file
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)

# Remove stopwords and lemmatize in a single loop
lemmas = []  # list that collects the lemmatized tokens
for token, pos in tagged_tokens:
    if token.lower() not in stop_words:  # if the lowercased token is not a stopword:

        # Pass the matching WordNet POS so the lemmatizer looks up the right word class
        if pos.startswith('N'):
            lemmas.append(lemmatizer.lemmatize(token, pos='n'))  # noun
        elif pos.startswith('J'):
            lemmas.append(lemmatizer.lemmatize(token, pos='a'))  # adjective
        elif pos.startswith('V'):
            lemmas.append(lemmatizer.lemmatize(token, pos='v'))  # verb
        else:
            lemmas.append(lemmatizer.lemmatize(token))  # default: treated as a noun

print('Lemmas of : ' + sentence)  # output the original sentence and its lemmas
print(lemmas)
