Efficient way to retrieve all _ids in ElasticSearch

copycodes 2020. 12. 7. 08:22

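What is the fastest way to get all the _ids of a given index from ElasticSearch? Is it possible with a simple query? One of my indexes has around 20,000 documents.

Edit: please also read @Aleck Landgraf's answer.

Do you want the elasticsearch-internal _id field, or an id field from within your documents?

For the former, try: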

curl http://localhost:9200/index/type/_search?pretty=true -d '
{ 
    "query" : { 
        "match_all" : {} 
    },
    "stored_fields": []
}
'

Note, 2017 update: the post originally used "fields": [], but the parameter has since been renamed and stored_fields is the new name.

The result will contain only the "metadata" of your documents:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}
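
Once a response like the one above has been parsed into a Python dict, collecting the ids is a one-line list comprehension. The following is a minimal sketch, assuming the response has the shape shown above; the helper name extract_ids is hypothetical and not part of any Elasticsearch client.

```python
def extract_ids(response):
    """Return the _id of every hit in a parsed _search response."""
    # hits.hits is the list of matching documents; each entry carries
    # its metadata (_index, _type, _id, _score) at the top level.
    return [hit["_id"] for hit in response["hits"]["hits"]]


# Trimmed copy of the sample response shown above.
sample = {
    "hits": {
        "total": 4,
        "max_score": 1.0,
        "hits": [
            {"_index": "index", "_type": "type", "_id": "36", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "38", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "39", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "34", "_score": 1.0},
        ],
    }
}

print(extract_ids(sample))  # ['36', '38', '39', '34']
```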

For the latter, if you want to include a field from your document, simply add it to the fields array:

curl http://localhost:9200/index/type/_search?pretty=true -d '
{ 
    "query" : { 
        "match_all" : {} 
    },
    "fields": ["document_field_to_be_returned"]
}
'

Better to use scroll and scan to get the result list so elasticsearch doesn't have to rank and sort the results.

With the elasticsearch-dsl python lib this can be accomplished by:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)

s = s.fields([])  # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]

Console log:

GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...

Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can update); scan disables sorting. The scan helper function returns a python generator which can be safely iterated through.
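
To make the scroll mechanics concrete, here is a minimal sketch of the loop that a helper like scan runs for you. It assumes es is an elasticsearch-py client whose search, scroll and clear_scroll methods accept the keyword arguments used below; the function name iter_ids and the default scroll and page_size values are illustrative choices, not taken from the answers. It sorts on _doc rather than passing search_type=scan, since to my knowledge the scan search type was deprecated in Elasticsearch 2.1 and removed in 5.0.

```python
def iter_ids(es, index, scroll="2m", page_size=1000):
    """Yield every _id in `index` by paging through the scroll API.

    `es` is assumed to be an elasticsearch-py client; any object that
    exposes search, scroll and clear_scroll with these keyword
    arguments will work.
    """
    body = {
        "query": {"match_all": {}},
        "_source": False,  # ids only, skip the document bodies
        "sort": ["_doc"],  # index order, so nothing has to be ranked
    }
    page = es.search(index=index, body=body, scroll=scroll, size=page_size)
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit["_id"]
            # Each call returns the next batch and renews the cursor timeout.
            page = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side cursor, even if the caller stops early.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

Against a live cluster this would be used as ids = list(iter_ids(Elasticsearch(), "my_index")). Because it is a generator, the ids can also be consumed one at a time without holding the whole list in memory.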

For elasticsearch 5.x, you can use the "_source" field.

GET /_search
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

"fields" has been deprecated. (Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")
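
Since the parameter name depends on the server version, a client that has to talk to both old and new clusters could pick the body accordingly. This is only a sketch: the helper name ids_only_body is hypothetical, and the cut-over at major version 5 is an assumption inferred from the error quoted above, so check it against the release you actually run.

```python
def ids_only_body(es_major_version):
    """Build a match_all request body that asks for hit metadata only."""
    body = {"query": {"match_all": {}}}
    if es_major_version >= 5:
        # Assumed cut-over: 5.x rejects "fields" with the error quoted
        # above. An empty stored_fields list requests no stored fields,
        # leaving just the metadata of each hit.
        body["stored_fields"] = []
    else:
        # Older releases use the original parameter name.
        body["fields"] = []
    return body


print(ids_only_body(2))
print(ids_only_body(5))
```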

Another option

curl 'http://localhost:9200/index/type/_search?pretty=true&fields='

will return _index, _type, _id and _score.

You can also do it in Python, which gives you a proper list:

import elasticsearch
es = elasticsearch.Elasticsearch()

res = es.search(
    index=your_index, 
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})

ids = [d['_id'] for d in res['hits']['hits']]
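
A single request returns at most size hits, so the list is silently truncated once the index holds more documents than that. As far as I know, recent Elasticsearch versions also reject requests whose from + size exceeds index.max_result_window (10,000 by default), which is one more reason to prefer scrolling for large indexes. The sketch below checks whether a response covers every match; the helper name is_complete is hypothetical. It assumes hits.total is either an integer, as in the sample response earlier, or an object with value and relation keys, as returned by Elasticsearch 7 and later.

```python
def is_complete(response):
    """Return True if the response holds every document that matched."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # Elasticsearch 7+ shape: {"value": N, "relation": "eq" or "gte"}.
        if total.get("relation") == "gte":
            # The count is only a lower bound, so be conservative.
            return False
        total = total["value"]
    return len(response["hits"]["hits"]) >= total


truncated = {"hits": {"total": 5, "hits": [{"_id": "1"}, {"_id": "2"}]}}
full = {"hits": {"total": 2, "hits": [{"_id": "1"}, {"_id": "2"}]}}
print(is_complete(truncated), is_complete(full))  # False True
```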

Elaborating on the 2 answers by @Robert-Lujo and @Aleck-Landgraf (someone with the permissions can gladly move this to a comment): if you do not want to print but get everything inside a list from the returned generator, here is what I use:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts=[YOUR_ES_HOST])
# Same scan helper the other answers use; it returns a generator of hits.
hits = helpers.scan(es, query={"query": {"match_all": {}}}, scroll='1m', index=INDEX_NAME)

ids = [hit['_id'] for hit in hits]

Inspired by @Aleck-Landgraf's answer, what worked for me was using the scan function directly from the standard elasticsearch Python API:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
es = Elasticsearch()
for dobj in scan(es, 
                 query={"query": {"match_all": {}}, "fields" : []},  
                 index="your-index-name", doc_type="your-doc-type"): 
    print(dobj["_id"], end=" ")  # Python 3 form of `print dobj["_id"],`

Url -> http://localhost:9200/<index>/<type>/_query
http method -> GET
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}

Reference URL: https://stackoverflow.com/questions/17497075/efficient-way-to-retrieve-all-ids-in-elasticsearch
