developer tip

BeautifulSoup에서 xpath를 사용할 수 있습니까?

copycodes 2020. 9. 4. 07:38
반응형

BeautifulSoup에서 xpath를 사용할 수 있습니까?


BeautifulSoup을 사용하여 URL을 긁어 내고 다음 코드가 있습니다.

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

이제 위의 코드에서 findAll태그와 관련 정보를 가져 오는 데 사용할 수 있지만 xpath를 사용하고 싶습니다. BeautifulSoup에서 xpath를 사용할 수 있습니까? 가능하다면 누구든지 더 도움이 될 수 있도록 예제 코드를 제공해 주시겠습니까?


아니요, BeautifulSoup 자체는 XPath 표현식을 지원하지 않습니다.

또 다른 라이브러리, LXML는 , 수행 지원의 XPath 1.0. 그것은이 BeautifulSoup로 호환 모드 는 노력 할게요 및 HTML에게 수프가하는 방법을 깨진 구문 분석합니다. 그러나 기본 lxml HTML 파서 는 깨진 HTML을 파싱하는 것과 똑같이 잘 수행하며 더 빠르다고 생각합니다.

문서를 lxml 트리로 구문 분석 한 후에는 .xpath()메서드를 사용하여 요소를 검색 할 수 있습니다 .

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

추가 기능이 있는 전용 lxml.html()모듈있습니다.

위의 예제 에서는 파서가 스트림에서 직접 읽도록하는 것이 응답을 큰 문자열로 먼저 읽는 것보다 더 효율적이므로 response객체를에 직접 전달했습니다 lxml. requests라이브러리 에서 동일한 작업을 수행하려면 투명 전송 압축 해제를 활성화 한 후 객체 를 설정 stream=True하고 전달 하려고합니다 .response.raw

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

관심을 가질만한 것은 CSS 선택기 지원입니다 . CSSSelector클래스는 CSS 문을 XPath 표현식으로 변환하여 검색을 td.empformbody훨씬 쉽게 만듭니다 .

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.

I can confirm that there is no XPath support within Beautiful Soup.


As others have said, BeautifulSoup doesn't have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here's a solution that works in either Python 2 or 3:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this as a reference.


BeautifulSoup has a function named findNext from current element directed childern,so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a') 

Above code can imitate the following xpath:

div[class=class_value]/div[id=id_value]

I've searched through their docs and it seems there is not xpath option. Also, as you can see here on a similar question on SO, the OP is asking for a translation from xpath to BeautifulSoup, so my conclusion would be - no, there is no xpath parsing available.


when you use lxml all simple:

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

but when use BeautifulSoup BS4 all simple too:

  • first remove "//" and "@"
  • second - add star before "="

try this magic:

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select ('a[class*="shared-components"]')

as you see, this does not support sub-tag, so i remove "/@href" part


This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it thru BeautifulSoup, search for the xpath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()

참고URL : https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup

반응형