use Xpath in Python :: libxml2 for a page-to-page skip-setting

apollo · (This post was last modified: Mar-17-2020, 06:09 PM by apollo.)

Hello dear Pythonistas, good day dear experts :)

..good day dear Larz60+ and snippsat

My name is Apollo and I am pretty new to Python, BS4 and all that.

I have an interesting parsing idea: the results (note: some pages of the following records) could be fetched one by one.
It is an interesting parsing job, as it can teach how to do some mechanized (page-by-page) skipping and parsing.

see the pages:

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
https://europa.eu/youth/volunteering/organisation/50164

[Note: not every URL and id is backed by a content page, so we need an incremental n+1 setting that skips the missing ones.]
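A minimal sketch of such an incremental loop, using only the standard library. It assumes that an id without a content page answers with HTTP 404, which I have not verified for this site. The names `fetch_page`, `crawl` and the `delay` parameter are made up for illustration.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "https://europa.eu/youth/volunteering/organisation/{}"


def fetch_page(org_id, timeout=10):
    """Return the HTML for one organisation id, or None if the page is missing."""
    url = BASE_URL.format(org_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # assumption: a missing id answers with 404, so skip it
        raise


def crawl(start_id, stop_id, fetch=fetch_page, delay=1.0):
    """Yield (org_id, html) for every id in [start_id, stop_id) that has a page."""
    for org_id in range(start_id, stop_id):
        html = fetch(org_id)
        if html is not None:
            yield org_id, html
        time.sleep(delay)  # pause between requests to be polite to the server
```

Something like `for org_id, html in crawl(50160, 50165): ...` would then walk the ids from the list above. The `fetch` argument is there so the loop can be tried out with a fake fetcher before hitting the real site.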

Background note: I work as a teacher in the field of volunteering services, therefore I am interested in the data itself.

The technical part: I want to achieve this by using XPath in Python.

Some considerations on how this can be achieved, with ideas about techniques and approaches: libxml2 is a full implementation and has a number of advantages for this job, I guess.

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ )
- Speed: this is really a Python wrapper around a C implementation.
- Ubiquity: the libxml2 library is pervasive and thus well tested.
- Active development and community participation.

The downsides of working with the libxml2 approach include:

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ ): it's strict. Things like default namespace handling are easier in other libraries.
- Use of native code, which has to be distributed along with the application.
- It is not very Pythonic.

I want to apply this to a little project. I am trying to achieve a simple path selection, and therefore I try to stick with ElementTree (which is included in Python, cf. http://effbot.org/zone/element-xpath.htm ). If we need full spec compliance or raw speed and can cope with the distribution of native code, we can go with libxml2, which is the XML C parser and toolkit developed for the Gnome project (but usable outside of the Gnome platform).

That said, here I will try to accomplish a sample of libxml2 XPath use:

import sys

import libxml2

# Parse the file and evaluate an XPath expression against it
doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
# The document and the XPath context have to be freed by hand
doc.freeDoc()
ctxt.xpathFreeContext()

Sample of ElementTree XPath use:

from xml.etree.ElementTree import ElementTree

# tst.xml must be well-formed XML; the path is relative to the root <html> element
mydoc = ElementTree(file='tst.xml')
for e in mydoc.findall('./body/div[7]/div/section/div/section/div/div[1]/h3'):
    # print the text inside the <h3> element
    print(e.text)

btw:

import requests
from bs4 import BeautifulSoup
 
url = 'https://europa.eu/youth/volunteering/organisations_en#open'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
block = soup.find('div', class_="eyp-card block-is-flex")

Again, the idea behind this: the results (note: more than 6000 pages or records) could be fetched one by one.

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163

...and so forth.
Again: [note - not every URL and id is backed by a content page, therefore we need an incremental n+1 setting], so we can go through the pages one by one and count up incrementally (n+1).

See the examples:

https://europa.eu/youth/volunteering/organisation/50120

Kulturlabor Stromboli
Krippgasse 11, 6060, Hall i Tirol, Austria
www.stromboli.at - +43522345111

DESCRIPTION OF ORGANISATION
(and so on)

https://europa.eu/youth/volunteering/organisation/50160

Norwegian Judo Federation
Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslo, Norway
www.judo.no - +47 21 02 98 20

This can be done with XPath.

Full XPath:

/html/body/div[7]/div/section/div/section/div/div[1]/h3

<h3 class="eyp-project-heading underline"> of organisation </h3>
<p>
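The full positional path breaks as soon as one div is added or removed on the page, so selecting the heading by its class may be more robust. Here is a small sketch with ElementTree's limited XPath support. The `SAMPLE` markup is a made-up, simplified stand-in for a real page, and `organisation_name` is just an illustrative helper name. The predicate matches the whole class attribute exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified and well-formed stand-in for one organisation page.
SAMPLE = """<html>
  <body>
    <div>
      <h3 class="eyp-project-heading underline">Kulturlabor Stromboli</h3>
      <p>Krippgasse 11, 6060, Hall i Tirol, Austria</p>
    </div>
  </body>
</html>"""


def organisation_name(markup):
    """Return the text of the first matching <h3>, or None if there is none."""
    root = ET.fromstring(markup)
    # ".//" searches all descendants; the predicate compares the full class value
    heading = root.find(".//h3[@class='eyp-project-heading underline']")
    if heading is None or heading.text is None:
        return None
    return heading.text.strip()


print(organisation_name(SAMPLE))  # prints: Kulturlabor Stromboli
```

Real pages are HTML and not guaranteed to be well-formed XML, in which case `ET.fromstring` raises a `ParseError`. A more tolerant parser such as BS4 or lxml.html is probably needed for the live site.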

Conclusion: I could run the incremental parser (n, n+1, n+2, n+3) and so forth.

After fetching the pages I can parse them and store the records in a CSV file or in an SQLite db...
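Both storage options are covered by the standard library. A rough sketch, assuming each record is a tuple of id, name, address and contact line. The field names, the table name `organisations` and the two helper functions are my own invention for illustration.

```python
import csv
import sqlite3

FIELDS = ("org_id", "name", "address", "contact")

# Two records taken from the examples in this post.
RECORDS = [
    (50120, "Kulturlabor Stromboli",
     "Krippgasse 11, 6060, Hall i Tirol, Austria",
     "www.stromboli.at - +43522345111"),
    (50160, "Norwegian Judo Federation",
     "Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslo, Norway",
     "www.judo.no - +47 21 02 98 20"),
]


def write_csv(records, csv_file):
    """Write a header row and all records to an already opened text file."""
    writer = csv.writer(csv_file)
    writer.writerow(FIELDS)
    writer.writerows(records)


def write_sqlite(records, connection):
    """Store the records; a row with an already known org_id is replaced."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS organisations ("
        "org_id INTEGER PRIMARY KEY, name TEXT, address TEXT, contact TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO organisations VALUES (?, ?, ?, ?)", records
    )
    connection.commit()
```

For CSV, open the target with `open("organisations.csv", "w", newline="", encoding="utf-8")` and pass it to `write_csv`. For SQLite, pass the result of `sqlite3.connect("organisations.db")` to `write_sqlite`. Using the id as primary key means a re-run of the crawl does not create duplicate rows.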

What do you think about the idea? I would love to hear from you, as I am not so familiar with BS4 and am just starting with Python.
It would be great if you could give me a helping hand. I appreciate any and all help to get me a little step in a good direction.

I look forward to hearing from you :)

Regards
Apollo :)
