
Python : tf-idf-cosine : Finding Document Similarity

copycodes 2020. 9. 23. 07:45



Part 1 & Part 2 에서 사용할 수있는 튜토리얼을 따르고있었습니다 . 불행히도 저자는 실제로 두 문서 사이의 거리를 찾기 위해 코사인 유사성을 사용하는 마지막 섹션에 대한 시간이 없었습니다. 나는 stackoverflow 의 다음 링크를 사용하여 기사의 예제를 따랐 습니다. 위 링크에 언급 된 코드가 포함되어 있습니다 (삶을 더 쉽게 만들기 위해)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

As a result of the above code, I have the following matrices:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

Now I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity with respect to two vectors of similar length, but here I am not sure how to identify the two vectors.


First off, if you want to extract count features and apply TF-IDF normalization and row-wise euclidean normalization, you can do it in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 1787553 stored elements in Compressed Sparse Row format>
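
For reference, TfidfVectorizer is documented as equivalent to chaining CountVectorizer and TfidfTransformer, which is exactly what the question's code does in two steps; a minimal sketch of the equivalence:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> counts = CountVectorizer().fit_transform(twenty.data)   # raw term counts
>>> tfidf_two_step = TfidfTransformer().fit_transform(counts)  # same values as tfidf above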

Now, to find the cosine distances of one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, as the tfidf vectors are already row-normalized. The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 89 stored elements in Compressed Sparse Row format>
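
Because TfidfVectorizer L2-normalizes every row by default, the dot product of this row with all rows directly yields cosine similarities; a hand-rolled sketch, equivalent to the linear_kernel call below:

>>> cosine_similarities = tfidf[0:1].dot(tfidf.T).toarray().ravel()
>>> cosine_similarities.shape
(11314,)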

scikit-learn already provides pairwise metrics (a.k.a. kernels, in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, also known as the linear kernel:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

Hence, to find the top related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, hence they are at the end of the sorted indices array; note that [:-5:-1] actually returns the four highest-scoring indices).
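
To see what the negative slice does, here is a tiny standalone example with made-up scores, followed by the real call:

>>> import numpy as np
>>> scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
>>> scores.argsort()            # indices in ascending score order
array([0, 4, 2, 3, 1])
>>> scores.argsort()[:-5:-1]    # walk backwards from the end: the 4 highest, descending
array([1, 3, 2, 4])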

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])

The first result is a sanity check: we find the query document itself as the most similar document, with a cosine similarity score of 1. It contains the following text:

>>> print(twenty.data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message, hence it has many common words:

>>> print(twenty.data[958])
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26

In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my
thing) writes:
>
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In
addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.

>    ---- brought to you by your neighborhood Lerxst ----

Rush fan?

--
Robert Seymour              rseymour@reed.edu
Physics and Philosophy, Reed College    (NeXTmail accepted)
Artificial Life Project         Reed College
Reed Solar Energy Project (SolTrain)    Portland, OR

With the help of @excray's comment, I managed to figure out the answer. What we need to do is actually write a simple for loop to iterate over the two arrays that represent the train data and the test data.

First, implement a simple lambda function to hold the formula for the cosine calculation:

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
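
As a quick check, applying it to the two count vectors from the output above reproduces the first score that appears further down:

# [1 0 1 0] . [0 1 1 1] = 1; the norms are sqrt(2) and sqrt(3) -> 1/sqrt(6) ≈ 0.408
print(cosine_function(np.array([1, 0, 1, 0]), np.array([0, 1, 1, 1])))  # 0.408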

Then just write a simple for loop to iterate over the two vector arrays. The logic is: "For each vector in trainVectorizerArray, you have to find the cosine similarity with the vector in testVectorizerArray."

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)

transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

I know it's an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. Here is my code for finding the cosine similarity. The question was how to calculate the cosine similarity with this package, and here is my code for that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

f = open("/root/Myfolder/scoringDocuments/doc1")
doc1 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc2")
doc2 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc3")
doc3 = str.decode(f.read(), "UTF-8", "ignore")

train_set = ["president of India",doc1, doc2, doc3]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  #finds the tfidf score with normalization
print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train)  #here the first element of tfidf_matrix_train is matched with other three elements

Here, suppose the query is the first element of train_set and doc1, doc2 and doc3 are the documents which I want to rank with the help of cosine similarity. Then I can use this code.

Also, the tutorials provided in the question were very useful. Here are all the parts for it: part-I, part-II, part-III.

The output will be as follows:

[[ 1.          0.07102631  0.02731343  0.06348799]]

Here, 1 represents that the query is matched with itself, and the other three are the scores for matching the query with the respective documents.
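
If you want the documents ordered by relevance rather than raw scores, here is a short follow-up sketch (np.argsort is the only addition to the code above):

import numpy as np

scores = cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train).ravel()
ranked = np.argsort(scores)[::-1]
ranked = ranked[ranked != 0]  # drop index 0, the query's match with itself
print("documents ranked by relevance:", ranked)  # e.g. [1 3 2] for the scores above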


Let me give you another tutorial written by me. It answers your question, but it also explains why we are doing some of these things. I also tried to make it concise.

So you have a list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document from list_of_documents that is the most similar to document.

Let's combine them together: documents = list_of_documents + [document]

Let's start with dependencies. It will become clear why we use each of them.

from nltk.corpus import stopwords
import string
from nltk.tokenize import wordpunct_tokenize as tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine

One of the approaches that can be used is a bag-of-words approach, where we treat each word in the document independently of the others and just throw all of them together in a big bag. From one point of view, it loses a lot of information (like how the words are connected), but from another point of view it makes the model simple.

In English and in any other human language there are a lot of "useless" words like 'a', 'the', 'in', which are so common that they do not possess a lot of meaning. They are called stop words, and it is a good idea to remove them. Another thing that one can notice is that words like 'analyze', 'analyzer', 'analysis' are really similar. They have a common root and all can be converted to just one word. This process is called stemming, and there exist different stemmers which differ in speed, aggressiveness and so on. So we transform each of the documents into a list of word stems, without stop words. We also discard all the punctuation.

porter = PorterStemmer()
stop_words = set(stopwords.words('english'))

# str.translate with two arguments is Python 2 only; in Python 3, build a translation
# table that deletes punctuation
punct_table = str.maketrans('', '', string.punctuation)
modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(punct_table)) if i.lower() not in stop_words] for d in documents]
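
To sanity-check the preprocessing, you can peek at the stems of the first document (the output depends on your documents):

# each entry of modified_arr is a list of lowercased stems, with stop words
# and punctuation removed
print(modified_arr[0][:10])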

So how will this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same thing happens with our documents (only the vectors will be way longer). Now we see that we removed a lot of words and stemmed others to decrease the dimensions of the vectors. There is one interesting observation here: longer documents will have way more positive elements than shorter ones, which is why it is nice to normalize the vector. This is called term frequency, TF; people also use additional information about how often the word is used in other documents, the inverse document frequency, IDF. Together we have the metric TF-IDF, which has a couple of flavors. This can be achieved with one line in sklearn :-)

modified_doc = [' '.join(i) for i in modified_arr]  # convert our list of lists into the list of strings that the vectorizer expects
tf_idf = TfidfVectorizer().fit_transform(modified_doc)
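
As an aside, the toy bags above can be reproduced with CountVectorizer (a small illustrative sketch; note the custom token_pattern, since the default pattern drops single-character tokens):

from sklearn.feature_extraction.text import CountVectorizer

bags = ["a b c", "a c a", "b c d"]
counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(bags)
print(counts.toarray())
# [[1 1 1 0]
#  [2 0 1 0]
#  [0 1 1 1]]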

Actually, the vectorizer allows you to do a lot of things, like removing stop words and lowercasing. I have done them in a separate step only because sklearn does not have non-English stop words, but nltk does.

So we have all the vectors calculated. The last step is to find which one is the most similar to the last one. There are various ways to achieve that; one of them is Euclidean distance, which is not so great for the reason discussed here. Another approach is cosine similarity. We iterate over all the documents, calculating the cosine distance between each document and the last one (note that scipy's cosine is a distance, i.e. 1 minus the similarity, so smaller means more similar):

l = len(documents) - 1          # index of the query document (the last one)
minimum = (2, None)             # (cosine distance, index); 2 exceeds any possible distance
for i in range(l):
    # scipy's cosine() returns a distance (1 - similarity), so we keep the minimum
    minimum = min((cosine(tf_idf[i].todense(), tf_idf[l].todense()), i), minimum)
print(minimum)

Now minimum holds the index of the best document and its cosine distance.
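
To recover the matched document itself (a one-line follow-up, assuming documents as defined earlier):

best_distance, best_index = minimum
print(documents[best_index])  # the document most similar to the query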


This should help you.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  

# assumes train_set is a list of documents whose last element is the query,
# e.g. the train/test sentences used earlier in this post
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_matrix)
length = len(train_set)  # the last row of tfidf_matrix is the query
cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
print(cosine)

and the output will be:

[[ 0.34949812  0.81649658  1.        ]]

Here is a function that compares your test data against the training data, with the Tf-Idf transformer fitted on the training data. The advantage is that you can quickly pivot or group by to find the n closest elements, and that the calculations are done matrix-wise.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def create_tokenizer_score(new_series, train_series, tokenizer):
    """
    return the tf idf score of each possible pair of documents
    Args:
        new_series (pd.Series): new data (to compare against train data)
        train_series (pd.Series): train data (to fit the tf-idf transformer)
        tokenizer: a vectorizer such as TfidfVectorizer or CountVectorizer
    Returns:
        pd.DataFrame
    """

    train_tfidf = tokenizer.fit_transform(train_series)
    new_tfidf = tokenizer.transform(new_series)
    X = pd.DataFrame(cosine_similarity(new_tfidf, train_tfidf), columns=train_series.index)
    X['ix_new'] = new_series.index
    score = pd.melt(
        X,
        id_vars='ix_new',
        var_name='ix_train',
        value_name='score'
    )
    return score

train_set = pd.Series(["The sky is blue.", "The sun is bright."])
test_set = pd.Series(["The sun in the sky is bright."])
tokenizer = TfidfVectorizer() # instantiate your own tokenizer here (TfidfVectorizer, CountVectorizer, with stop words...)
score = create_tokenizer_score(train_series=train_set, new_series=test_set, tokenizer=tokenizer)
score

   ix_new   ix_train    score
0   0       0       0.617034
1   0       1       0.862012
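
As mentioned above, the long-format frame makes it easy to group by; for instance, to keep the single closest train document for each new document (a small usage sketch):

# highest score first, then keep one row per new document
best = score.sort_values('score', ascending=False).groupby('ix_new').head(1)
print(best)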

Reference URL: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity
