Python Forum
use Xpath in Python :: libxml2 for a page-to-page skip-setting
Hello dear Pythonistas, good day dear experts,
good day dear Larz60+ and snippsat,


My name is Apollo and I am pretty new to Python, BS4 and all that.

I have an interesting parsing idea: the results (note: some pages of the following records) could be fetched one by one.
It is an interesting parsing job because it can teach how to do mechanized, page-by-page skipping and parsing.

see the pages:

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
https://europa.eu/youth/volunteering/organisation/50164

[Note: not every URL and id is backed by a content page, so we need an incremental n+1 setting that skips the gaps.]


Background note: I work as a teacher in the field of volunteering services, so I am interested in the data itself.

The technical part: I want to achieve this by using XPath in Python.

Some considerations on how this can be achieved, with ideas about techniques and approaches: for a full XPath implementation, libxml2 has a number of advantages for this job, I guess.

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ )
- Speed: it is really a Python wrapper around a C implementation.
- Ubiquity: the libxml2 library is pervasive and thus well tested.
- The library is still under active development, with community participation.

The downsides of working with the libxml2 approach include:

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ ): it is strict. Things like default namespace handling are easier in other libraries.
- Use of native code, which has to be distributed along with the project.
- It is not very Pythonic (note the manual freeDoc() and xpathFreeContext() calls in the sample further down).

I want to apply this on a little project. I am trying to achieve a simple path selection, so I try to stick with ElementTree (which is included in Python, cf. http://effbot.org/zone/element-xpath.htm ). If we need full spec compliance or raw speed and can cope with the distribution of native code, we can go with libxml2, which is the XML C parser and toolkit developed for the Gnome project (but usable outside of the Gnome platform).
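
To illustrate the default-namespace point on the ElementTree side, here is a minimal sketch. The XHTML snippet and the prefix "x" are made up for the example; the only thing taken as given is that ElementTree's findall() accepts a prefix-to-URI mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that declares a default namespace on the root element.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><h3>Example Org</h3></body>"
    "</html>"
)

root = ET.fromstring(XHTML)

# Without a namespace the path matches nothing, because every tag is
# stored internally as {http://www.w3.org/1999/xhtml}tagname.
plain_matches = root.findall(".//h3")

# The prefix "x" is arbitrary; it only has to map to the document's URI.
ns = {"x": "http://www.w3.org/1999/xhtml"}
headings = [h.text for h in root.findall(".//x:h3", ns)]

print(plain_matches)
print(headings)
```

So a path that silently returns nothing is often a namespace problem rather than a wrong path.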


That said, here I will try to accomplish a sample of libxml2 XPath use:

import sys

import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
# select every element node in the document
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
# the document and the context have to be freed by hand
doc.freeDoc()
ctxt.xpathFreeContext()

Sample of ElementTree XPath Use

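from xml.etree.ElementTree import ElementTree

# tst.xml must be well-formed XML (raw HTML usually is not)
mydoc = ElementTree(file='tst.xml')
# paths are relative to the root <html> element, so start at ./body
for e in mydoc.findall('./body/div[7]/div/section/div/section/div/div[1]/h3'):
    print(e.text)
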
btw:

import requests
from bs4 import BeautifulSoup
 
url = 'https://europa.eu/youth/volunteering/organisations_en#open'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
block = soup.find('div', class_="eyp-card block-is-flex")

Again, the idea behind it: the results (note: more than 6000 pages or records) could be fetched one by one.

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
...and so forth and so forth ....
Again: [note: not every URL and id is backed by a content page, so we need an incremental n+1 setting]. We can therefore go through the pages one by one and count up incrementally (n+1).
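
A minimal sketch of that n+1 loop could look like this. The names iter_existing_pages and fetch_with_requests are made up for the example, and it assumes that an id without a content page answers with a non-200 status; that would have to be checked against the real site.

```python
import time

BASE_URL = "https://europa.eu/youth/volunteering/organisation/{}"


def fetch_with_requests(url):
    """Return the page HTML, or None if the id has no content page."""
    import requests  # third-party; imported here so the loop has no hard dependency

    response = requests.get(url, timeout=10)
    # Assumption: a missing organisation page does not answer with 200.
    if response.status_code != 200:
        return None
    return response.text


def iter_existing_pages(first_id, last_id, fetch=fetch_with_requests, delay=0.0):
    """Yield (org_id, html) for every id in the range that has a page."""
    for org_id in range(first_id, last_id + 1):
        html = fetch(BASE_URL.format(org_id))
        if html is None:
            continue  # gap in the numbering: skip it and go on to n+1
        yield org_id, html
        if delay:
            time.sleep(delay)  # be polite to the server between requests
```

Usage would be something like `for org_id, html in iter_existing_pages(50160, 50164, delay=1.0): ...`. Passing the fetch function in as a parameter makes it possible to try the loop with a fake fetcher before hitting the real server.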

See the examples:
https://europa.eu/youth/volunteering/organisation/50120
Kulturlabor Stromboli
Krippgasse 11, 6060, Hall i Tirol, Austria
www.stromboli.at - +43522345111
DESCRIPTION OF ORGANISATION
(and so on)
https://europa.eu/youth/volunteering/organisation/50160
Norwegian Judo Federation
Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslow, Norway
www.judo.no - +47 21 02 98 20
This can be handled with XPath.

Full XPath:

/html/body/div[7]/div/section/div/section/div/div[1]/h3
<h3 class="eyp-project-heading underline">(name of organisation)</h3>
<p>
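
Instead of the long absolute path, the heading could probably also be selected by its class, which is less brittle if the page layout shifts. A minimal sketch, assuming the class attribute is exactly "eyp-project-heading underline" and that the input is well-formed (the sample markup below is made up; for real-world HTML a lenient parser such as lxml or BeautifulSoup would be needed instead of ElementTree):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page structure for illustration only.
SAMPLE = """<html><body>
<div><section>
<h3 class="eyp-project-heading underline">Norwegian Judo Federation</h3>
<p>Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslow, Norway</p>
</section></div>
</body></html>"""


def organisation_names(xml_text):
    """Return the text of every h3 carrying the assumed heading class."""
    root = ET.fromstring(xml_text)
    # [@attrib='value'] is part of the XPath subset ElementTree supports;
    # it compares the whole attribute string, not single class tokens.
    path = ".//h3[@class='eyp-project-heading underline']"
    return [(h3.text or "").strip() for h3 in root.findall(path)]


names = organisation_names(SAMPLE)
print(names)
```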
Conclusion: I could run the incremental parser (n, n+1, n+2, n+3 and so forth).

After fetching the pages I can parse them and store the records in CSV format or in a SQLite db.
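
Both storage options can be sketched with the standard library alone. The table name, the column names and the helper names below are made up for the example; the two records are the ones listed further up.

```python
import csv
import io
import sqlite3

# (id, name, address) tuples as they might come out of the parsing step.
records = [
    (50120, "Kulturlabor Stromboli", "Krippgasse 11, 6060, Hall i Tirol, Austria"),
    (50160, "Norwegian Judo Federation",
     "Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslow, Norway"),
]


def save_to_sqlite(conn, rows):
    """Create the (hypothetical) organisation table and upsert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS organisation "
        "(id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
    )
    # INSERT OR REPLACE keeps a re-run from failing on ids already stored.
    conn.executemany("INSERT OR REPLACE INTO organisation VALUES (?, ?, ?)", rows)
    conn.commit()


def write_csv(handle, rows):
    """Write a header line plus one line per record to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["id", "name", "address"])
    writer.writerows(rows)


conn = sqlite3.connect(":memory:")  # use a file name instead to persist the data
save_to_sqlite(conn, records)

buffer = io.StringIO()  # stands in for open("organisations.csv", "w", newline="")
write_csv(buffer, records)
```

The csv module quotes the address field automatically because it contains commas, so the file stays readable by other tools.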

What do you think about the idea? I would love to hear from you, as I am not so familiar with bs4 and am just starting with Python.
It would be great if you could give me a helping hand. I appreciate any and all help to get me a little step in a good direction.

I look forward to hearing from you.





Regards,
Apollo


Messages In This Thread
use Xpath in Python :: libxml2 for a page-to-page skip-setting - by apollo - Mar-17-2020, 06:09 PM
