Use XPath in Python :: libxml2 for page-to-page skipping
#1
Hello dear Pythonists, good day dear experts,
good day dear Larz60+ and snippsat,

my name is Apollo and I am pretty new to Python, BS4 and all that.

I have an interesting parsing idea: the results (note: some pages of the following records) could be fetched one by one.
It is an interesting parsing job because it can teach how to do mechanized, page-by-page skipping and parsing.

see the pages:

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
https://europa.eu/youth/volunteering/organisation/50164

[Note: not every URL and id is backed by a content page, therefore we need an incremental n+1 setting.]


Background note: I work as a teacher in the field of volunteering services, therefore I am interested in the data itself.

The technical part: I want to achieve this by using XPath in Python.

Some considerations on technique and approach: libxml2 offers a full XPath implementation and, I guess, has a number of advantages for this job.

- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ )
- Speed: it is really a Python wrapper around a C implementation.
- Ubiquity: the libxml2 library is pervasive and thus well tested.
- The library is still under active development, with community participation.

The downsides of working with the libxml2 approach include:
- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ ): it is strict. Things like default namespace handling are easier in other libraries.
- Use of native code, which we must be able to distribute.
- It is not very Pythonic.

I want to apply this to a little project. I am trying to achieve simple path selection, and therefore I try to stick with ElementTree (which is included in Python, cf. http://effbot.org/zone/element-xpath.htm ). If we need full spec compliance or raw speed and can cope with the distribution of native code, we can go with libxml2, which is the XML C parser and toolkit developed for the Gnome project (but usable outside of the Gnome platform).


That said, here is a sample of libxml2 XPath use:

import sys
import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
# select every element in the document
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
# libxml2 resources have to be released by hand
doc.freeDoc()
ctxt.xpathFreeContext()
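Sample of ElementTree XPath use:

import xml.etree.ElementTree as ET

# ElementTree ships with Python; it needs well-formed XML and supports only a subset of XPath
mydoc = ET.parse('tst.xml')
# paths are relative to the root element (<html>), so the leading /html is dropped
for e in mydoc.findall('./body/div[7]/div/section/div/section/div/div[1]/h3'):
    print(e.text)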
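To make the ElementTree variant a bit more concrete, here is a minimal sketch that runs on a small inline snippet instead of tst.xml. The markup and the class value are made up for illustration (the real page layout may well differ); only the standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one organisation page (not the real markup).
SNIPPET = """
<html>
  <body>
    <div>
      <h3 class="eyp-project-heading underline">Norwegian Judo Federation</h3>
      <p>www.judo.no - +47 21 02 98 20</p>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SNIPPET)

# Positional path, written relative to the root <html> element.
heading = root.find("./body/div[1]/h3")

# Attribute predicate anywhere in the tree; this is part of ElementTree's XPath subset.
by_class = root.findall(".//h3[@class='eyp-project-heading underline']")

print(heading.text)
print(len(by_class))
```

Real-world HTML is often not well-formed XML, and then ET.fromstring raises a ParseError. That is one reason to hand live pages to lxml.html or BeautifulSoup rather than to ElementTree.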
btw:

import requests
from bs4 import BeautifulSoup
 
url = 'https://europa.eu/youth/volunteering/organisations_en#open'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
block = soup.find('div', class_="eyp-card block-is-flex")
Again, the idea behind this: the results (note: more than 6000 pages or records) could be fetched one by one.

https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
...and so forth and so forth ....
Again: [note: not every URL and id is backed by a content page, therefore we need an incremental n+1 setting]. We can therefore step through the pages one by one, counting up incrementally (n+1).
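The n+1 stepping with gaps could be sketched roughly like this. It is only a sketch under assumptions: the names fetch_html and crawl are hypothetical, and it assumes that an id without a content page answers with a non-200 status code, which would have to be checked against the real site.

```python
import time
from typing import Callable, Iterator, Optional, Tuple

BASE_URL = "https://europa.eu/youth/volunteering/organisation/{}"


def fetch_html(url: str) -> Optional[bytes]:
    """Return the page body, or None when the id has no content page."""
    import requests  # third-party; imported here so crawl() can be tried offline

    response = requests.get(url, timeout=10)
    # Assumption: a missing organisation answers with a non-200 status code.
    if response.status_code != 200:
        return None
    return response.content


def crawl(
    start: int,
    stop: int,
    fetch: Callable[[str], Optional[bytes]] = fetch_html,
    delay: float = 1.0,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (id, html) for every id in [start, stop) that has a page."""
    for org_id in range(start, stop):
        html = fetch(BASE_URL.format(org_id))
        if html is None:
            continue  # gap in the numbering: skip it and go on with n+1
        yield org_id, html
        time.sleep(delay)  # be polite to the server between requests
```

A call such as `for org_id, html in crawl(50160, 50165): ...` would then visit 50160 to 50164 and silently skip the ids that have no page. The fetch parameter is there so the loop can be tried with canned pages before it is pointed at the live site.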

see examples
https://europa.eu/youth/volunteering/organisation/50120
Kulturlabor Stromboli
Krippgasse 11, 6060, Hall i Tirol, Austria
www.stromboli.at - +43522345111
DESCRIPTION OF ORGANISATION
(and so on)
https://europa.eu/youth/volunteering/organisation/50160
Norwegian Judo Federation
Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855, Oslo, Norway
www.judo.no - +47 21 02 98 20
This can be handled with XPath.

Full XPath:

/html/body/div[7]/div/section/div/section/div/div[1]/h3
<h3 class="eyp-project-heading underline"> name of organisation </h3>
<p>
Conclusion: I could run the incremental parser (n, n+1, n+2, n+3) and so forth.

After fetching the pages I can parse them and store the records in

a CSV file or in an SQLite db...
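Both storage options are covered by the standard library. Here is a minimal sketch, assuming each parsed record is a dict; the column names (org_id, name, address, contact), the table name and the function names are hypothetical, and the sample record is the one quoted in this post.

```python
import csv
import sqlite3

FIELDS = ["org_id", "name", "address", "contact"]  # hypothetical column names

SAMPLE = [
    {
        "org_id": 50120,
        "name": "Kulturlabor Stromboli",
        "address": "Krippgasse 11, 6060, Hall i Tirol, Austria",
        "contact": "www.stromboli.at - +43522345111",
    }
]


def save_csv(records, fileobj):
    """Write the records as CSV with a header row to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def save_sqlite(records, conn):
    """Insert the records into SQLite; re-running replaces rows with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS organisation ("
        "org_id INTEGER PRIMARY KEY, name TEXT, address TEXT, contact TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO organisation "
        "VALUES (:org_id, :name, :address, :contact)",
        records,
    )
    conn.commit()
```

For a real run the file would be opened with `open("organisations.csv", "w", newline="", encoding="utf-8")` and the connection with `sqlite3.connect("organisations.db")`; both file names are only examples. CSV is easier to open in a spreadsheet, while SQLite makes it simpler to resume an interrupted crawl without duplicating rows.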

What do you think about the idea? I would love to hear from you, as I am not so familiar with bs4 and am just starting with Python.
It would be great if you could give me a helping hand. I appreciate any and all help to get me a little step in a good direction.

I look forward to hearing from you.

Regards,
Apollo
Wordpress - super toolkits a. http://wpgear.org/ :: und b. https://github.com/miziomon/awesome-wordpress :: Awesome WordPress: A curated list of amazingly awesome WordPress resources and awesome python things https://github.com/vinta/awesome-python
#2
You are mixing things up here; you should not use libxml2 on its own.
libxml2 is built into lxml, so what you should use is lxml.

Here is a demo with lxml using XPath.
As there is a Norwegian organisation on one of the pages, I parse that.
from lxml import html
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
tree = html.fromstring(response.content)
tag_info = tree.xpath("//h5[contains(text(),'Norwegian')]")
print(tag_info[0].text)
Output:
Norwegian Judo Federation
I often use and prefer CSS selectors.
XPath and CSS selectors do the same task, so you should look into both.
With both BS and lxml you can use these, or mix them with find() and findall().
So you can use BS and select() to use a CSS selector.
It's still fast, as I use lxml as the parser.
from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
print(tag_info[0].text)
Output:
Norwegian Judo Federation
Take a look at this, and if the info you need is in the same place on every page, get it working on one page first.
Then try to iterate over a few pages as a start, to see if it works.
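A rough sketch of that "one page first, then a few pages" idea, with the parsing kept separate from the downloading so it can be tried on saved or hand-written HTML. The function names are hypothetical, and the default XPath //h5 is only an assumption taken from the demo above; it has to be checked against the real pages.

```python
from lxml import html

NAME_XPATH = "//h5"  # assumption: the organisation name is in the first <h5>


def extract_name(page_source, xpath=NAME_XPATH):
    """Return the stripped text of the first match, or None if nothing matches."""
    tree = html.fromstring(page_source)
    matches = tree.xpath(xpath)
    if not matches:
        return None
    return matches[0].text_content().strip()


def extract_all(pages, xpath=NAME_XPATH):
    """Apply extract_name to a mapping of id -> HTML source."""
    return {org_id: extract_name(source, xpath) for org_id, source in pages.items()}
```

Once extract_name gives the right answer for a single page, extract_all can be fed the sources of a handful of ids. A None in the result shows right away which pages do not follow the assumed layout.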
#3
Dear snippsat,

many thanks for your reply and your support. This is very kind, and I am lucky.

You have already seen (in the other thread: https://python-forum.io/Thread-preparing...#pid107604 ) that I am currently setting up several machines (Win 10 in the office and MX Linux at home) with Python and a development environment with everything that needs to be on board. Many thanks for this scaffolding approach to coding, and for the help. In terms of learning theory, you, snippsat (and also Larz60+ and all the friends here), really do great things, since you

- give us mini-lessons, visible here in this thread; a walk through this forum and its sub-forums also turns up many, many examples of such mini-lessons, which are
- starting points for us to make the next steps; and you
- describe concepts in multiple ways (see above with the different approaches, such as lxml as well as BS4);
- incorporate practical steps and also aids, with code and theoretical concepts, like the links to the tutorials;
- give us novices the time to take the first steps, and yet
- encourage all the novices here to go ahead in little steps; that is why you offer mini-lessons.
- To summarize: these are, more or less, the essential features of scaffolding that facilitate learning.
- You organize these so-called scaffolds for "simple" skill acquisition, or they may be "dynamic and generative".
- One of the first is the interaction between the learner and the expert; cf. the concepts of instructional scaffolding: https://en.wikipedia.org/wiki/Instructional_scaffolding (**)

Dear snippsat, you have seen that I am having a bit of a challenge laying out the first approach for the code in Python. Many thanks for your help; you gave me a great head start. I want to learn scripting from the very basics.

I have read some books that teach Python, such as the following:
- Eric Matthes: Python Crash Course, which is great.
- David Ascher and Mark Lutz: Learning Python;
- Magnus Hetland: Practical Python, and others.
- Dan Bader: Python Tricks, with interesting online examples on the net.
- ....

Besides these books I have watched many YouTube courses: courses by Schaefer, Mosh, Richard White (OO programming), Trevor Payne, CS Dojo, Socratica, Chris Reeves' Python web scraping tutorials, and others.

And yes: I guess that a hands-on approach is quite a good way to take the next step. It is a kind of scaffolding, and it helps me, as it is encouraging to see first results with a real-life project that makes sense. And here you, with this great forum, come into play. This is a great experience for me: a great place for idea sharing and knowledge exchange. From the perspective of a teacher this is a really great thing and experience.

The next steps here: within the next few days I will finish the setup of the new Python installation and of the IDEs / editors (VS Code, PyCharm). Then I will work on the code examples and the next steps of the parser (scraper). You have given me great steps to overcome the hurdle.

Dear snippsat, you do a great job here. You encourage so many people and give them a starting point to make the next steps.
This is overwhelming to see.

Dear snippsat, and also dear buran and Larz60+, and all those of you who work here: you do an overwhelmingly good job.
Keep up the great work, the great project here. It rocks!!!

Greetings,
apollo
PS: I will be back next week; in the meantime I have to do some extra things for the office.


** See more in the resources of the open-source research community at the Massachusetts Institute of Technology: https://flosshub.org/biblio
with articles by Andrea Hemetsberger; De Souza, Froehlich, Dourish; Michele Morner et al.; Eric von Hippel and others, e.g. Michele Morner and F. Lanzara: http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf and of course the concepts of "Communities of Practice" (Lave and Wenger) ( https://en.wikipedia.org/wiki/Community_of_practice ), general learning theories etc., and also, very interestingly, the concepts of instructional scaffolding: https://en.wikipedia.org/wiki/Instructional_scaffolding