Python Forum
use Xpath in Python :: libxml2 for a page-to-page skip-setting
#1
Hello dear Pythonistas, good day dear experts,
and good day dear Larz60+ and snippsat.


My name is Apollo and I am pretty new to Python, BS4 and all that.

I have an idea for a parsing job: the results (some pages of the following records) could be fetched one by one.
It is an interesting job because it can teach how to do mechanized, page-by-page skipping and parsing.

see the pages:

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
https://europa.eu/youth/volunteering/organisation/50164

[Note: not every URL and id is backed by a content page, so we need an incremental n+1 setting.]


Background note: I work as a teacher in the field of volunteering services, so I am interested in the data itself.

The technical part: I want to achieve this by using XPath in Python.

Some considerations on how this can be achieved. For a full implementation, libxml2 has a number of advantages for this job, I guess:

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ )
- Speed: it is really a Python wrapper around a C implementation.
- Ubiquity: the libxml2 library is pervasive and thus well tested.
- The library is still under active development, with community participation.

The downsides of the libxml2 approach include:

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ ): it is strict. Things like default namespace handling are easier in other libraries.
- Use of native code.
- It is not very Pythonic.

I want to apply this to a little project. I am trying to achieve a simple path selection, and therefore I try to stick with ElementTree (which is included in Python, cf. http://effbot.org/zone/element-xpath.htm ). If we need full spec compliance or raw speed and can cope with the distribution of native code, we can go with libxml2, the XML C parser and toolkit developed for the Gnome project (but usable outside of the Gnome platform).


That said, here is a sample of libxml2 XPath use:
import sys
import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
# libxml2 resources have to be freed by hand
doc.freeDoc()
ctxt.xpathFreeContext()
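Sample of ElementTree XPath use (tst.xml must be well-formed XML):

import xml.etree.ElementTree as ET

mydoc = ET.parse('tst.xml')
# paths are relative to the root element (<html>), so start at ./body
for e in mydoc.findall('./body/div[7]/div/section/div/section/div/div[1]/h3'):
    # e.get() returns an attribute string; the heading text is e.text
    print(e.text)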
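To make the ElementTree option a bit more concrete, here is a minimal sketch of the XPath subset that the standard-library xml.etree.ElementTree supports. The inline markup is made up for illustration (the real page may be structured differently), and ElementTree only accepts well-formed XML, so real-world HTML will usually need lxml or BeautifulSoup instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; not copied from the real site.
SAMPLE = """
<html>
  <body>
    <div>
      <h3 class="eyp-project-heading">Norwegian Judo Federation</h3>
      <p>Oslo, Norway</p>
    </div>
    <div>
      <h3 class="other">Something else</h3>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Positional path, relative to the root element (<html>).
first_heading = root.find("./body/div[1]/h3")

# Attribute predicate, matched anywhere below the root.
headings = root.findall(".//h3[@class='eyp-project-heading']")
names = [h.text for h in headings]

print(first_heading.text)
print(names)
```

Positional steps like div[1] break as soon as the page layout shifts, so the attribute predicate is probably the more robust of the two.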
btw:

import requests
from bs4 import BeautifulSoup
 
url = 'https://europa.eu/youth/volunteering/organisations_en#open'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
block = soup.find('div', class_="eyp-card block-is-flex")
Again, the idea behind it: the results (note: more than 6000 pages or records) could be fetched one by one.

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
...and so forth.
Again: not every URL and id is backed by a content page, so we step through the ids one by one, incrementing n+1 and skipping the ids that have no page.
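The n+1 stepping could be sketched roughly as below. This is only a sketch under assumptions: the function names (organisation_url, fetch_with_requests, crawl) are made up for illustration, and it assumes that an id without a content page answers with a non-200 HTTP status, which would need checking against the real site.

```python
def organisation_url(org_id):
    """Build the page URL for one organisation id."""
    return f"https://europa.eu/youth/volunteering/organisation/{org_id}"


def fetch_with_requests(url):
    """Return the page HTML, or None if there is no page (assumed: status != 200)."""
    import requests  # imported here so the loop below can be tried offline

    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return None
    return response.text


def crawl(start_id, stop_id, fetch=fetch_with_requests):
    """Yield (org_id, html) for every id in [start_id, stop_id) that has a page."""
    for org_id in range(start_id, stop_id):
        html = fetch(organisation_url(org_id))
        if html is None:
            continue  # no content page behind this id: move on to n+1
        yield org_id, html
```

A call such as `for org_id, html in crawl(50160, 50165): ...` would then hand each existing page to the parser. Passing the fetch function in as a parameter makes it possible to try the loop with a fake fetcher first, and a short pause between requests would be polite towards the server.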

See these examples:
https://europa.eu/youth/volunteering/organisation/50120
Kulturlabor Stromboli
Krippgasse 11, 6060, Hall i Tirol, Austria
www.stromboli.at - +43522345111
DESCRIPTION OF ORGANISATION
(and so on)
https://europa.eu/youth/volunteering/organisation/50160
Norwegian Judo Federation
Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslow, Norway
www.judo.no - +47 21 02 98 20
This can be done with XPath expressions.

Full XPath:

/html/body/div[7]/div/section/div/section/div/div[1]/h3
<h3 class= eyp-project-heading underline of organisation </h3>
<p>
Conclusion: I could run the incremental parser (n, n+1, n+2, n+3, and so forth).

After fetching the pages, I can parse the records and store them in CSV format or in a SQLite db.
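Both storage options are covered by the standard library. The following is a minimal sketch, assuming each parsed record ends up as a dict; the field names, the function names and the table name are invented for illustration, and the sample record reuses the Norwegian Judo Federation entry quoted above.

```python
import csv
import sqlite3

# Hypothetical field names for one parsed organisation record.
FIELDS = ["org_id", "name", "address", "contact"]


def save_csv(records, file_obj):
    """Write the records (a list of dicts) as CSV to an open text file."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def save_sqlite(records, connection):
    """Insert the records into an 'organisations' table, replacing duplicates."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS organisations ("
        "org_id INTEGER PRIMARY KEY, name TEXT, address TEXT, contact TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO organisations "
        "VALUES (:org_id, :name, :address, :contact)",
        records,
    )
    connection.commit()


records = [
    {
        "org_id": 50160,
        "name": "Norwegian Judo Federation",
        "address": "Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslow, Norway",
        "contact": "www.judo.no - +47 21 02 98 20",
    }
]
```

For real use, save_csv would get a file opened with `open("organisations.csv", "w", newline="")` and save_sqlite a connection from `sqlite3.connect("organisations.db")`. Using the id as primary key means that re-running the scraper overwrites a record instead of duplicating it.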

What do you think about the idea? I would love to hear from you, as I am not so familiar with BS4 and am just starting with Python.
It would be great if you could give me a helping hand. I appreciate any and all help to get me a step in a good direction.

I look forward to hearing from you.

Regards,
Apollo
Wordpress - super toolkits a. http://wpgear.org/ :: und b. https://github.com/miziomon/awesome-wordpress :: Awesome WordPress: A curated list of amazingly awesome WordPress resources and awesome python things https://github.com/vinta/awesome-python
#2
You are mixing things up here; you should not use libxml2 on its own.
libxml2 is built into lxml, so what you should use is lxml.

Here is a demo with lxml using XPath.
As there is a Norwegian organisation on one of the pages, I parse that.
from lxml import html
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
tree = html.fromstring(response.content)
tag_info = tree.xpath("//h5[contains(text(),'Norwegian')]")
print(tag_info[0].text)
Output:
Norwegian Judo Federation
I often use and prefer CSS selectors.
XPath and CSS selectors do the same task, so you should look into both.
Both BS and lxml can use them, or mix them with find() and findall().
So you can use BS with select() to use CSS selectors.
It's still fast, as I use lxml as the parser.
from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
print(tag_info[0].text)
Output:
Norwegian Judo Federation
Take a look at this, and if the info you need is in the same place on every page, get it to work on one page first.
Then try to iterate over a few pages as a start to see if it works.
#3
Dear snippsat,

Many thanks for your reply and your support. This is very kind, and I am lucky.

You have already seen (in the other thread: https://python-forum.io/Thread-preparing...#pid107604 ) that I am currently setting up several machines (Win 10 in the office and MX Linux at home) with Python and a development environment with everything that needs to be on board. Many thanks for this scaffolding approach to coding, and for the help. In terms of learning theory, you, snippsat (and also Larz60+ and all the friends here), really do great things, since you:

- give us mini-lessons, visible here in this thread; a walk through this forum and its sub-forums turns up many more examples of such mini-lessons, which are starting points for us to make the next steps;
- describe concepts in multiple ways (see above, with the different approaches such as lxml as well as BS4);
- incorporate practical steps and aids, with code and theoretical concepts, like the links to the tutorials;
- give us novices the time to take the first steps ourselves;
- encourage all the novices here to go ahead in little steps, which is why you offer mini-lessons.

To summarize: these are, more or less, the essential features of scaffolding that facilitate learning. Such scaffolds may be organized for "simple" skill acquisition, or they may be "dynamic and generative". One of the first features is the interaction between the learner and the expert; cf. the concepts of instructional scaffolding: https://en.wikipedia.org/wiki/Instructional_scaffolding (**)

Dear snippsat, you have seen that I am having a bit of a challenge laying out the first approach for the code in Python. Many thanks for your help; you gave me a great head start. I want to learn scripting from the very basics.

I have read some books that teach Python, such as the following:
- Eric Matthes: Python Crash Course, which is great.
- David Ascher and Mark Lutz: Learning Python
- Magnus Hetland: Practical Python, and others.
- Dan Bader: Python Tricks, with interesting online examples on the net.
- ....

Besides these books, I have watched many YouTube courses: courses by Schaefer, Mosh, Richard White (OO programming), Trevor Payne, CS Dojo, Socratica, Chris Reeves' Python web scraping tutorials, and others.

And yes, I guess that a hands-on approach is quite a good thing for the next step. It is a kind of scaffolding, and it helps me, as it is encouraging to see first results with a real-life project that makes sense. And here you, with this great forum, come into play. This is a great experience for me: a great place for idea sharing and knowledge exchange. From the perspective of a teacher, this is a really great thing.

The next steps: within the next few days I will finish the setup of the new Python installation and of the IDEs/editors (VS Code, PyCharm). Then I will work on the code examples and the next steps of the parser (scraper). You have given me great steps to overcome the hurdle.

Dear snippsat, you do a great job here. You encourage so many people and give them a starting point to make the next steps.
This is overwhelming to see.

Dear snippsat, and also dear buran and Larz60+, and all those of you who work here: you do an overwhelmingly good job.
Keep up the great work. The project here rocks!

Greetings,
Apollo
PS: I will be back next week; in the meantime I have to do some extra things for the office.


**See more at the resources of the open-source research community at the Massachusetts Institute of Technology: https://flosshub.org/biblio
with articles by Andrea Hemetsberger; De Souza, Froehlich, Dourish; Michele Morner et al.; Eric von Hippel; and others, e.g. Michele Morner and F. Lanzara: http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf , and of course the concepts of "Communities of Practice" (Lave and Wenger) ( https://en.wikipedia.org/wiki/Community_of_practice ), general learning theories, etc., and also, very interesting, the concepts of instructional scaffolding: https://en.wikipedia.org/wiki/Instructional_scaffolding