고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

developer tip

고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

copycodes 2020. 10. 16. 07:34

고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

고정 너비 라인을 보유하는 파일을 구문 분석하는 효율적인 방법을 찾으려고합니다. 예를 들어 처음 20자는 열을 나타내며 21:30부터 다른 열을 나타냅니다.

행에 100 개의 문자가 있다고 가정 할 때 행을 여러 구성 요소로 구문 분석하는 효율적인 방법은 무엇입니까?

줄마다 문자열 슬라이싱을 사용할 수 있지만 줄이 크면 조금 못 생겼습니다. 다른 빠른 방법이 있습니까?

Python 표준 라이브러리의 struct모듈을 사용하는 것은 C로 작성 되었기 때문에 매우 쉽고 빠릅니다.

원하는 작업을 수행하는 데 사용할 수있는 방법은 다음과 같습니다. 또한 필드의 문자 수에 음수 값을 지정하여 문자 열을 건너 뛸 수 있습니다.

import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                        for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

산출:

fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

다음 수정은 Python 2 또는 3에서 작동하도록 조정하고 유니 코드 입력을 처리합니다.

import sys

fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to byte string and results back to unicode string
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

고려하고 있었지만 너무 추악해질 수 있다는 우려가 있었기 때문에 문자열 슬라이스로 수행하는 방법이 있습니다. 그것에 대한 좋은 점은 그다지 추악하지 않은 것 외에도 Python 2와 3 모두에서 변경되지 않고 작동하고 유니 코드 문자열을 처리 할 수 있다는 것입니다. 나는 그것을 벤치마킹하지 않았지만 struct모듈 버전 과 속도면에서 경쟁 할 수 있다고 생각합니다 . 패딩 필드를 갖는 기능을 제거하여 약간의 속도를 높일 수 있습니다.

try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x

try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

산출:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

이것이 효율적인지 잘 모르겠지만 (수동으로 슬라이싱하는 것과는 반대로) 읽을 수 있어야합니다. slices문자열과 열 길이를 가져오고 하위 문자열을 반환 하는 함수 를 정의했습니다 . 나는 그것을 생성기로 만들었으므로 정말 긴 줄의 경우 임시 하위 문자열 목록을 작성하지 않습니다.

def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

예

In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']

In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']

In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')

그러나 한 번에 모든 열이 필요하면 발전기의 장점이 손실된다고 생각합니다. 이점을 얻을 수있는 곳은 루프에서 열을 하나씩 처리하려는 경우입니다.

이미 언급 한 솔루션보다 더 쉽고 더 아름다운 두 가지 옵션 :

첫 번째는 팬더를 사용하는 것입니다.

import pandas as pd

path = 'filename.txt'

# Using Pandas with a column specification
col_specification = [(0, 20), (21, 30), (31, 50), (51, 100)]
data = pd.read_fwf(path, colspecs=col_specification)

그리고 numpy.loadtxt를 사용하는 두 번째 옵션 :

import numpy as np

# Using NumPy and letting it figure it out automagically
data_also = np.loadtxt(path)

실제로 데이터를 사용하려는 방식에 따라 다릅니다.

아래 코드는 심각한 고정 열 너비 파일 처리가 필요한 경우 수행 할 작업에 대한 스케치를 제공합니다.

"심각한"= 여러 파일 유형의 여러 레코드 유형, 최대 1000 바이트의 레코드, 레이아웃 정의 자 및 "반대"생산자 / 소비자는 태도가있는 정부 부서이며 레이아웃 변경으로 인해 사용되지 않는 열이 발생하며 최대 백만 개의 레코드가 있습니다. 파일에서 ...

기능 : 구조체 형식을 미리 컴파일합니다. 원하지 않는 열을 무시합니다. 입력 문자열을 필수 데이터 유형으로 변환합니다 (sketch에서 오류 처리 생략). 레코드를 개체 인스턴스 (또는 사전 또는 원하는 경우 명명 된 튜플)로 변환합니다.

암호:

import struct, datetime, io, pprint

# functions for converting input fields to usable data
cnv_text = rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y") # ddmmyyyy
# etc

# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
    ]

fieldspecs.sort(key=lambda x: x[1]) # just in case

# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print unpack_len, unpack_fmt
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass
    # or use named tuples

raw_data = """\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""

f = cStringIO.StringIO(raw_data)
headings = f.next()
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
    pprint.pprint(r.__dict__)
    print "Customer name:", r.given_names, r.surname

산출:

78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
 'given_names': 'Algernon Marmaduke',
 'start_date': datetime.datetime(2005, 1, 1, 0, 0),
 'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh

> str = '1234567890'
> w = [0,2,5,7,10]
> [ str[ w[i-1] : w[i] ] for i in range(1,len(w)) ]
['12', '345', '67', '890']

여기에 NumPy와는 후드 아래 사용하는 것입니다 (아직 훨씬 간단하지만 -이 코드가 발견되어 LineSplitter class내에서 _iotools module)

import numpy as np

DELIMITER = (20, 10, 10, 20, 10, 10, 20)

idx = np.cumsum([0] + list(DELIMITER))
slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]

def parse(line):
    return [line[s] for s in slices]

열 무시에 대한 음수 구분 기호를 처리하지 않으므로.만큼 다재다능 struct하지는 않지만 더 빠릅니다.

여기에 따라 파이썬 3를위한 간단한 모듈의 존 머신의 대답은 - 필요에 따라 적용 :

"""
fixedwidth

Parse and iterate through a fixedwidth text file, returning record objects.

Adapted from https://stackoverflow.com/a/4916375/243392


USAGE

    import fixedwidth, pprint

    # define the fixed width fields we want
    # fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define the fieldtype conversion functions
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': int,
    }

    # iterate over record objects in the file
    with open(f, 'rb'):
        for record in fixedwidth.reader(f, fieldspecs, fieldtype_fns):
            pprint.pprint(record.__dict__)

    # output:
    {'FILEID': 'SF1ST', 'LOGRECNO': 2, 'POP100': 1, 'STUSAB': 'TX', 'SUMLEV': '040'}
    {'FILEID': 'SF1ST', 'LOGRECNO': 3, 'POP100': 2, 'STUSAB': 'TX', 'SUMLEV': '040'}    
    ...

"""

import struct, io


# fieldspec columns
iName, iDescription, iStart, iWidth, iType = range(5)


def get_struct_unpacker(fieldspecs):
    """
    Build the format string for struct.unpack to use, based on the fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    Returns a string like "6s2s3s7x7s4x9s".
    """
    unpack_len = 0
    unpack_fmt = ""
    for fieldspec in fieldspecs:
        start = fieldspec[iStart] - 1
        end = start + fieldspec[iWidth]
        if start > unpack_len:
            unpack_fmt += str(start - unpack_len) + "x"
        unpack_fmt += str(end - start) + "s"
        unpack_len = end
    struct_unpacker = struct.Struct(unpack_fmt).unpack_from
    return struct_unpacker


class Record(object):
    pass
    # or use named tuples


def reader(f, fieldspecs, fieldtype_fns):
    """
    Wrap a fixedwidth file and return records according to the given fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldtype_fns is a dictionary of functions used to transform the raw string values, 
    one for each type.
    """

    # make sure fieldspecs are sorted properly
    fieldspecs.sort(key=lambda fieldspec: fieldspec[iStart])

    struct_unpacker = get_struct_unpacker(fieldspecs)

    field_indices = range(len(fieldspecs))

    for line in f:
        raw_fields = struct_unpacker(line) # split line into field values
        record = Record()
        for i in field_indices:
            fieldspec = fieldspecs[i]
            fieldname = fieldspec[iName]
            s = raw_fields[i].decode() # convert raw bytes to a string
            fn = fieldtype_fns[fieldspec[iType]] # get conversion function
            value = fn(s) # convert string to value (eg to an int)
            setattr(record, fieldname, value)
        yield record


if __name__=='__main__':

    # test module

    import pprint, io

    # define the fields we want
    # fieldspecs are [name, description, start, width, type]
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define a conversion function for integers
    def to_int(s):
        """
        Convert a numeric string to an integer.
        Allows a leading ! as an indicator of missing or uncertain data.
        Returns None if no data.
        """
        try:
            return int(s)
        except:
            try:
                return int(s[1:]) # ignore a leading !
            except:
                return None # assume has a leading ! and no value

    # define the conversion fns
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': to_int,
        # 'N': int,
        # 'D': lambda s: datetime.datetime.strptime(s, "%d%m%Y"), # ddmmyyyy
        # etc
    }

    # define a fixedwidth sample
    sample = """\
SF1ST TX04089000  00000023748        1 
SF1ST TX04090000  00000033748!       2
SF1ST TX04091000  00000043748!        
"""
    sample_data = sample.encode() # convert string to bytes
    file_like = io.BytesIO(sample_data) # create a file-like wrapper around bytes

    # iterate over record objects in the file
    for record in reader(file_like, fieldspecs, fieldtype_fns):
        # print(record)
        pprint.pprint(record.__dict__)

문자열 슬라이싱은 조직적으로 유지하는 한 추악 할 필요가 없습니다. 필드 너비를 사전에 저장 한 다음 관련 이름을 사용하여 객체를 생성하는 것이 좋습니다.

from collections import OrderedDict

class Entry:
    def __init__(self, line):

        name2width = OrderedDict()
        name2width['foo'] = 2
        name2width['bar'] = 3
        name2width['baz'] = 2

        pos = 0
        for name, width in name2width.items():

            val = line[pos : pos + width]
            if len(val) != width:
                raise ValueError("not enough characters: \'{}\'".format(line))

            setattr(self, name, val)
            pos += width

file = "ab789yz\ncd987wx\nef555uv"

entry = []

for line in file.split('\n'):
    entry.append(Entry(line))

print(entry[1].bar) # output: 987

내 이전 작업은 종종 1 백만 줄의 고정 폭 데이터를 처리하기 때문에 Python을 사용하기 시작할 때이 문제에 대해 조사했습니다.

FixedWidth에는 두 가지 유형이 있습니다.

ASCII FixedWidth (ascii 문자 길이 = 1, 2 바이트 인코딩 문자 길이 = 2)
유니 코드 FixedWidth (ascii 문자 및 2 바이트 인코딩 문자 길이 = 1)

리소스 문자열이 모두 ASCII 문자로 구성된 경우 ASCII FixedWidth = Unicode FixedWidth

다행히도 py3에서는 문자열과 바이트가 다르기 때문에 2 바이트 인코딩 문자 (eggbk, big5, euc-jp, shift-jis 등)를 처리 할 때 많은 혼동이 줄어 듭니다.
"ASCII FixedWidth"처리를 위해 문자열은 일반적으로 바이트로 변환 된 다음 분할됩니다.

타사 모듈
totalLineCount = 1 million, lineLength = 800 byte, FixedWidthArgs = (10,25,4, ....)를 가져 오지 않고 Line을 약 5 가지 방식으로 분할하고 다음과 같은 결론을 얻습니다.

구조체가 가장 빠름 (1x)
전처리가 아닌 루프 전용 FixedWidthArgs가 가장 느립니다 (5x +).
slice(bytes) 보다 빠릅니다 slice(string)
소스 문자열은 바이트 테스트 결과입니다 : struct (1x), operator.itemgetter (1.7x), 미리 컴파일 된 sliceObject & list comprehensions (2.8x), re.patten object (2.9x)

대용량 파일을 다룰 때 종종 with open ( file, "rb") as f:.
이 메서드는 약 2.4 초 동안 위 파일 중 하나를 순회합니다.
1 백만 행의 데이터를 처리하는 적절한 핸들러는 각 행을 20 개의 필드로 분할하고 2.4 초도 채 걸리지 않습니다.

난 단지 그것을 발견 stuct하고 itemgetter요구 사항을 충족

ps : 일반 디스플레이를 위해 유니 코드 str을 바이트로 변환했습니다. 2 바이트 환경에있는 경우이 작업을 수행 할 필요가 없습니다.

from itertools import accumulate
from operator import itemgetter

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    # Negative parameter field index
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    # Get slice args and Ignore fields of negative length
    ig_Args = tuple(item for i, item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    # Generate `operator.itemgetter` object
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

lineb = b'abcdefghijklmnopqrstuvwxyz\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4\xb6\xee\xb7\xa2\xb8\xf6\xba\xcd0123456789'
line = lineb.decode("GBK")

# Unicode Fixed Width
fieldwidthsU = (13, -13, 4, -4, 5,-5) # Negative width fields is ignored
# ASCII Fixed Width
fieldwidths = (13, -13, 8, -8, 5,-5) # Negative width fields is ignored
# Unicode FixedWidth processing
parse = oprt_parser(fieldwidthsU)
fields = parse(line)
print('Unicode FixedWidth','fields: {}'.format(tuple(map(lambda s: s.encode("GBK"), fields))))
# ASCII FixedWidth processing
parse = oprt_parser(fieldwidths)
fields = parse(lineb)
print('ASCII FixedWidth','fields: {}'.format(fields))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)
parse = oprt_parser(fieldwidths)
fields = parse(line)
print(f"fields: {fields}")

산출:

Unicode FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
ASCII FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

oprt_parser4x make_parser(목록 이해력 + 슬라이스)

연구 과정에서 CPU 속도가 빠를수록 re방법 의 효율성이 빠르게 증가 하는 것으로 나타 났습니다.
테스트 할 더 좋은 컴퓨터가 많지 않기 때문에 내 테스트 코드를 제공하십시오. 관심이있는 사람이 있으면 더 빠른 컴퓨터로 테스트 할 수 있습니다.

실행 환경 :

os : win10
파이썬 : 3.7.2
CPU : amd athlon x3450
HD : 씨게이트 1T

import timeit
import time
import re
from itertools import accumulate
from operator import itemgetter

def eff2(stmt,onlyNum= False,showResult=False):
    '''test function'''
    if onlyNum:
        rl = timeit.repeat(stmt=stmt,repeat=roundI,number=timesI,globals=globals())
        avg = sum(rl) / len(rl)
        return f"{avg * (10 ** 6)/timesI:0.4f}"
    else:
        rl = timeit.repeat(stmt=stmt,repeat=10,number=1000,globals=globals())
        avg = sum(rl) / len(rl)
        print(f"【{stmt}】")
        print(f"\tquick avg = {avg * (10 ** 6)/1000:0.4f} s/million")
        if showResult:
            print(f"\t  Result = {eval(stmt)}\n\t  timelist = {rl}\n")
        else:
            print("")

def upDouble(argList,argRate):
    return [c*argRate for c in argList]

tbStr = "000000001111000002222真2233333333000000004444444QAZ55555555000000006666666ABC这些事中文字abcdefghijk"
tbBytes = tbStr.encode("GBK")
a20 = (4,4,2,2,2,3,2,2, 2 ,2,8,8,7,3,8,8,7,3, 12 ,11)
a20U = (4,4,2,2,2,3,2,2, 1 ,2,8,8,7,3,8,8,7,3, 6 ,11)
Slng = 800
rateS = Slng // 100

tStr = "".join(upDouble(tbStr , rateS))
tBytes = tStr.encode("GBK")
spltArgs = upDouble( a20 , rateS)
spltArgsU = upDouble( a20U , rateS)

testList = []
timesI = 100000
roundI = 5
print(f"test round = {roundI} timesI = {timesI} sourceLng = {len(tStr)} argFieldCount = {len(spltArgs)}")


print(f"pure str \n{''.ljust(60,'-')}")
# ==========================================
def str_parser(sArgs):
    def prsr(oStr):
        r = []
        r_ap = r.append
        stt=0
        for lng in sArgs:
            end = stt + lng 
            r_ap(oStr[stt:end])
            stt = end 
        return tuple(r)
    return prsr

Str_P = str_parser(spltArgsU)
# eff2("Str_P(tStr)")
testList.append("Str_P(tStr)")

print(f"pure bytes \n{''.ljust(60,'-')}")
# ==========================================
def byte_parser(sArgs):
    def prsr(oBytes):
        r, stt = [], 0
        r_ap = r.append
        for lng in sArgs:
            end = stt + lng
            r_ap(oBytes[stt:end])
            stt = end
        return r
    return prsr
Byte_P = byte_parser(spltArgs)
# eff2("Byte_P(tBytes)")
testList.append("Byte_P(tBytes)")

# re,bytes
print(f"re compile object \n{''.ljust(60,'-')}")
# ==========================================


def rebc_parser(sArgs,otype="b"):
    re_Args = "".join([f"(.{{{n}}})" for n in sArgs])
    if otype == "b":
        rebc_Args = re.compile(re_Args.encode("GBK"))
    else:
        rebc_Args = re.compile(re_Args)
    def prsr(oBS):
        return rebc_Args.match(oBS).groups()
    return prsr
Rebc_P = rebc_parser(spltArgs)
# eff2("Rebc_P(tBytes)")
testList.append("Rebc_P(tBytes)")

Rebc_Ps = rebc_parser(spltArgsU,"s")
# eff2("Rebc_Ps(tStr)")
testList.append("Rebc_Ps(tStr)")


print(f"struct \n{''.ljust(60,'-')}")
# ==========================================

import struct
def struct_parser(sArgs):
    struct_Args = " ".join(map(lambda x: str(x) + "s", sArgs))
    def prsr(oBytes):
        return struct.unpack(struct_Args, oBytes)
    return prsr
Struct_P = struct_parser(spltArgs)
# eff2("Struct_P(tBytes)")
testList.append("Struct_P(tBytes)")

print(f"List Comprehensions + slice \n{''.ljust(60,'-')}")
# ==========================================
import itertools
def slice_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    slice_Args = tuple(zip((0,)+tl,tl))
    def prsr(oBytes):
        return [oBytes[s:e] for s, e in slice_Args]
    return prsr
Slice_P = slice_parser(spltArgs)
# eff2("Slice_P(tBytes)")
testList.append("Slice_P(tBytes)")

def sliceObj_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    tl2 = tuple(zip((0,)+tl,tl))
    sliceObj_Args = tuple(slice(s,e) for s,e in tl2)
    def prsr(oBytes):
        return [oBytes[so] for so in sliceObj_Args]
    return prsr
SliceObj_P = sliceObj_parser(spltArgs)
# eff2("SliceObj_P(tBytes)")
testList.append("SliceObj_P(tBytes)")

SliceObj_Ps = sliceObj_parser(spltArgsU)
# eff2("SliceObj_Ps(tStr)")
testList.append("SliceObj_Ps(tStr)")


print(f"operator.itemgetter + slice object \n{''.ljust(60,'-')}")
# ==========================================

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    ig_Args = tuple(item for i,item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

Oprt_P = oprt_parser(spltArgs)
# eff2("Oprt_P(tBytes)")
testList.append("Oprt_P(tBytes)")

Oprt_Ps = oprt_parser(spltArgsU)
# eff2("Oprt_Ps(tStr)")
testList.append("Oprt_Ps(tStr)")

print("|".join([s.split("(")[0].center(11," ") for s in testList]))
print("|".join(["".center(11,"-") for s in testList]))
print("|".join([eff2(s,True).rjust(11," ") for s in testList]))

산출:

Test round = 5 timesI = 100000 sourceLng = 744 argFieldCount = 20
...
...
   Str_P | Byte_P | Rebc_P | Rebc_Ps | Struct_P | Slice_P | SliceObj_P|SliceObj_Ps| Oprt_P | Oprt_Ps
-----------|-----------|-----------|-----------|-- ---------|-----------|-----------|-----------|---- -------|-----------
     9.6315| 7.5952| 4.4187| 5.6867| 1.5123| 5.2915| 4.2673| 5.7121| 2.4713| 3.9051

이것이 필드가 시작되고 끝나는 곳을 포함하는 사전으로 해결 한 방법입니다. 시작점과 끝점을 제공하면 칼럼 길이의 변경 사항을 관리하는데도 도움이되었습니다.

# fixed length
#      '---------- ------- ----------- -----------'
line = '20.06.2019 myname  active      mydevice   '
SLICES = {'date_start': 0,
         'date_end': 10,
         'name_start': 11,
         'name_end': 18,
         'status_start': 19,
         'status_end': 30,
         'device_start': 31,
         'device_end': 42}

def get_values_as_dict(line, SLICES):
    values = {}
    key_list = {key.split("_")[0] for key in SLICES.keys()}
    for key in key_list:
       values[key] = line[SLICES[key+"_start"]:SLICES[key+"_end"]].strip()
    return values

>>> print (get_values_as_dict(line,SLICES))
{'status': 'active', 'name': 'myname', 'date': '20.06.2019', 'device': 'mydevice'}

참고 URL : https://stackoverflow.com/questions/4914008/how-to-efficiently-parse-fixed-width-files

'developer tip' 카테고리의 다른 글

Dumpbin.exe를 찾을 수 없습니다 (0)	2020.10.16
Python으로 웹 사이트에 로그인하려면 어떻게해야합니까? (0)	2020.10.16
Safari, iPhone 및 iPad에서 HTML5 비디오 태그가 작동하지 않음 (0)	2020.10.16
"실수로"결정 기억을 클릭 한 후 IntelliJ의 새 창에서 프로젝트를 엽니 다. (0)	2020.10.16
Python에서 Excel 파일 읽기 (0)	2020.10.16

현재글고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

copycodes

고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

'developer tip' 카테고리의 다른 글

'developer tip'의 다른글

티스토리툴바

고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

고정 너비 파일을 효율적으로 구문 분석하는 방법은 무엇입니까?

'developer tip' 카테고리의 다른 글

'developer tip'의 다른글

관련글

티스토리툴바