How to remove stop words using nltk or python

copycodes 2020. 8. 26. 08:01

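I have a dataset from which I would like to remove stop words using

stopwords.words('english')

I'm struggling with how to use this within my code to simply take out these words. I already have a list of the words from this dataset. The part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.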

from nltk.corpus import stopwords
# ...
stop_words = set(stopwords.words('english'))  # load once; set lookups are fast
filtered_words = [word for word in word_list if word not in stop_words]
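To try the same comprehension without the NLTK corpus installed, a small hand-written set can stand in for the real list. This is only a minimal sketch: `DEMO_STOP_WORDS` and `remove_stop_words` are hypothetical names, not part of NLTK, and the set is a stand-in for `stopwords.words('english')`.

```python
# Hypothetical stand-in for set(stopwords.words('english')); not the real NLTK list.
DEMO_STOP_WORDS = {"a", "an", "the", "in", "is", "of", "on"}


def remove_stop_words(words, stop_words=DEMO_STOP_WORDS):
    """Return the words that are not in stop_words, keeping order and duplicates."""
    stop_set = set(stop_words)  # set membership tests are fast
    return [word for word in words if word not in stop_set]


word_list = ["the", "cat", "is", "on", "the", "mat"]
filtered_words = remove_stop_words(word_list)
print(filtered_words)  # ['cat', 'mat']
```

Passing `stopwords.words('english')` as the second argument should give the NLTK behavior, assuming the corpus has been downloaded.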

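You could also do a set difference, for example as follows. Note that a set difference drops duplicate words and does not preserve the original word order.

import nltk
# sentence and pattern are assumed to be defined elsewhere
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))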
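The difference between the set approach and a list comprehension can be seen in a small sketch. `DEMO_STOP_WORDS` and the token list are made-up illustration data, not NLTK output.

```python
# Hypothetical stand-in for the NLTK English stop word list.
DEMO_STOP_WORDS = {"the", "on", "a"}
tokens = ["the", "cat", "sat", "on", "the", "cat", "bed"]

# Set difference: each remaining word appears once, in arbitrary order.
unique_remaining = set(tokens) - DEMO_STOP_WORDS

# List comprehension: original order and repeated words are preserved.
ordered_remaining = [t for t in tokens if t not in DEMO_STOP_WORDS]

print(sorted(unique_remaining))  # ['bed', 'cat', 'sat']
print(ordered_remaining)         # ['cat', 'sat', 'cat', 'bed']
```

The set version is fine for building a vocabulary; the comprehension is the safer choice when word counts or word order matter.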

Suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

stop_words = stopwords.words('english')  # load the stop word list once
filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original list
    if word in stop_words:
        filtered_word_list.remove(word)  # remove the stop word from the copy

To exclude all types of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         # About 900 stopwords
nltk_words = list(stopwords.words('english'))   # About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
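Merging two lists with `extend` leaves duplicates wherever the sources overlap. A set union is one way to avoid that. In this sketch, `list_a` and `list_b` are short hypothetical stand-ins for the two real stop word lists.

```python
# Hypothetical stand-ins; the real lists would come from
# get_stop_words('en') and stopwords.words('english').
list_a = ["a", "the", "of"]
list_b = ["the", "of", "and"]

# Union of both sources; overlapping words are stored only once.
combined = set(list_a) | set(list_b)

word_list = ["the", "king", "and", "queen", "of", "hearts"]
output = [w for w in word_list if w not in combined]

print(len(combined))  # 4
print(output)         # ['king', 'queen', 'hearts']
```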

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to use the library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

Using filter:

from nltk.corpus import stopwords
# ...
stop_words = set(stopwords.words('english'))  # load once instead of on every call
filtered_words = list(filter(lambda word: word not in stop_words, word_list))

You can use this function. Note that you need to lowercase all the words.

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stop_words = set(stopwords.words("english"))  # load once per call
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stop_words:
            processed_word_list.append(word)
    return processed_word_list
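If the original capitalization of the kept words matters, the comparison can be lowercased while the word itself is left untouched. This is a minimal sketch under that assumption: `remove_stopwords_keep_case` and `DEMO_STOP_WORDS` are hypothetical names, and the set stands in for the NLTK list.

```python
# Hypothetical stand-in for set(stopwords.words("english")).
DEMO_STOP_WORDS = {"the", "is", "a"}


def remove_stopwords_keep_case(word_list, stop_words=DEMO_STOP_WORDS):
    """Drop stop words case-insensitively but return kept words unchanged."""
    stop_set = {w.lower() for w in stop_words}
    # Lowercase only for the membership test, not for the returned word.
    return [word for word in word_list if word.lower() not in stop_set]


print(remove_stopwords_keep_case(["The", "Eiffel", "Tower", "IS", "a", "landmark"]))
# ['Eiffel', 'Tower', 'landmark']
```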

print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]  # renamed from `list` to avoid shadowing the built-in
another_list = []
for x in userstring:
    if x not in stop_list:      # keep only the words that are not in the stop list
        another_list.append(x)  # it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]  # renamed from `list` to avoid shadowing the built-in
for x in userstring[:]:         # iterate over a copy so that removal does not skip words
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
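One thing to watch for with the `.remove` approach: removing items from a list while looping over that same list shifts the remaining items left, so the word right after a removed one is skipped. The sketch below contrasts the two loops; the function names are hypothetical and exist only for illustration.

```python
stop_list = ["a", "an", "the", "in"]


def remove_in_place_buggy(words):
    for x in words:               # iterating over the list being modified
        if x in stop_list:
            words.remove(x)       # later items shift left; the next one is skipped
    return words


def remove_in_place_safe(words):
    for x in words[:]:            # iterate over a shallow copy instead
        if x in stop_list:
            words.remove(x)
    return words


print(remove_in_place_buggy(["in", "the", "garden"]))  # ['the', 'garden']
print(remove_in_place_safe(["in", "the", "garden"]))   # ['garden']
```

The buggy version only misbehaves when two stop words are adjacent, which makes the problem easy to miss in a quick test.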

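There is a very simple, lightweight Python package for this called stop-words.

First install the package using: pip install stop-words

Then you can remove the stop words in one line using a list comprehension:

from stop_words import get_stop_words

stop_words = set(get_stop_words('english'))  # look the list up once
filtered_words = [word for word in dataset if word not in stop_words]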

This package is very lightweight to download (unlike nltk), works with both Python 2 and Python 3, and has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

Reference URL: https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python
