Hello dear Pythonistas, good day dear experts,
Good day, dear Larz60+ and snipsat,
my name is Apollo, and I am pretty new to Python, BS4, and all that.
I have some interesting parsing ideas: the results (note: the pages behind the following records) could be fetched one by one.
It is an interesting parsing job, as it can teach how to do some mechanized (page-by-page) skipping and parsing.
See the pages:
https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
https://europa.eu/youth/volunteering/organisation/50164
[note: not every URL and id is backed by a content page, therefore we need an incremental n+1 approach]
Background note: I work as a teacher in the field of volunteering services, therefore I am interested in the data itself.
The technical part: I want to achieve this with the following approach: using XPath in Python.
Some considerations regarding how this can be achieved, and ideas about technique and approaches: there is a full implementation available, and libxml2 has a number of advantages for this job, I guess:
- Compliance with the spec ( https://www.w3.org/TR/xpath/all/ )
- Speed: it is really a Python wrapper around a C implementation.
- Ubiquity: the libxml2 library is pervasive and thus well tested.
- The library is still under active development, with community participation.
- The downsides of the libxml2 approach include: compliance with the spec ( https://www.w3.org/TR/xpath/all/ ) is strict, so things like default namespace handling are easier in other libraries; and it requires native code, which is not very Pythonic.
I want to apply this to a little project. Since I am only trying to achieve a simple path selection, I will first try to stick with ElementTree (which is included in Python, cf. http://effbot.org/zone/element-xpath.htm ). If we need full spec compliance or raw speed and can cope with distributing native code, we can go with libxml2, the XML C parser and toolkit developed for the Gnome project (but usable outside of the Gnome platform).
That said, here I will try to put together a sample of libxml2 XPath use:
```python
import sys
import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
doc.freeDoc()
ctxt.xpathFreeContext()
```

And a sample of ElementTree XPath use:

```python
from xml.etree.ElementTree import ElementTree

mydoc = ElementTree(file='tst.xml')
# ElementTree supports only a subset of XPath; use a relative path
for e in mydoc.findall('.//section/div/div[1]/h3'):
    print(e.text)
```

By the way:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://europa.eu/youth/volunteering/organisations_en#open'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
block = soup.find('div', class_="eyp-card block-is-flex")
```

Again, the idea behind it: I have some interesting ideas, and the results (note: more than 6000 pages or records) could be fetched one by one.
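Building on the snippet above, here is a minimal sketch of pulling a name and detail-page link out of one listing card. Only the outer class name (`eyp-card block-is-flex`) comes from the snippet in this post; the inner markup of the sample is assumed, and the built-in parser is used so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

# Inline sample imitating one listing card; the outer class name is taken
# from the snippet above, the inner structure is an assumption.
SAMPLE = """
<div class="eyp-card block-is-flex">
  <h3><a href="/youth/volunteering/organisation/50160">Norwegian Judo Federation</a></h3>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
card = soup.find("div", class_="eyp-card block-is-flex")
link = card.find("a")
print(link.text, "->", link["href"])
```

On the real listing page one would call `soup.find_all(...)` instead of `find(...)` to walk every card.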
https://europa.eu/youth/volunteering/organisation/50160
https://europa.eu/youth/volunteering/organisation/50162
https://europa.eu/youth/volunteering/organisation/50163
... and so forth and so forth ...
Again, [note: not every URL and id is backed by a content page, therefore we need an incremental n+1 approach]; we can therefore walk the pages one by one, counting incrementally n+1.
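The incremental n, n+1, n+2 walk with skipping of missing ids could be sketched like this. This is only a sketch using the standard library; the id range and the assumption that a missing record answers with an HTTP error are mine:

```python
import urllib.request

BASE = "https://europa.eu/youth/volunteering/organisation/{}"

def org_url(org_id):
    """Build the detail-page URL for a given organisation id."""
    return BASE.format(org_id)

def fetch_page(org_id):
    """Return the page HTML, or None when the id has no content page."""
    try:
        with urllib.request.urlopen(org_url(org_id), timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None  # gap in the id sequence (or network problem): skip it

if __name__ == "__main__":
    # small demo range; a real run would cover the whole id span
    for org_id in range(50160, 50163):  # the incremental n, n+1, n+2 walk
        html = fetch_page(org_id)
        if html is None:
            continue
        print(org_id, "fetched,", len(html), "characters")
```

In practice one would also add a short `time.sleep()` between requests to stay polite to the server.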
See examples:
https://europa.eu/youth/volunteering/organisation/50120
Kulturlabor Stromboli, Krippgasse 11, 6060 Hall in Tirol, Austria, www.stromboli.at, +43522345111, DESCRIPTION OF ORGANISATION
(and so on)
https://europa.eu/youth/volunteering/organisation/50160
Norwegian Judo Federation, Ullevaal Stadion, Sognsveien 75 K, 5th floor, N-0855 Oslo, Norway, www.judo.no, +47 21 02 98 20. This can be extracted with XPath expressions.
The full XPath:
/html/body/div[7]/div/section/div/section/div/div[1]/h3
The target element looks like `<h3 class="eyp-project-heading underline">name of organisation</h3>`. Conclusion: I could run the incremental parser (n, n+1, n+2, n+3) and so forth.
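Instead of the brittle positional path above, the heading could also be selected by its class. A minimal sketch with the standard-library ElementTree on a tiny well-formed sample (the class value is from this post, the surrounding markup is assumed; for real HTML pages, which are rarely well-formed XML, one would use lxml.html or BS4 instead):

```python
from xml.etree.ElementTree import fromstring

# Tiny well-formed sample imitating the heading on a detail page.
SAMPLE = """
<html><body>
  <section>
    <h3 class="eyp-project-heading underline">Kulturlabor Stromboli</h3>
  </section>
</body></html>
"""

def org_name(page):
    """Pick the organisation name with a short, class-based path instead
    of the brittle positional /html/body/div[7]/... expression."""
    tree = fromstring(page)
    hit = tree.find(".//h3[@class='eyp-project-heading underline']")
    return hit.text.strip() if hit is not None else None

print(org_name(SAMPLE))
```

A class-based selection keeps working even when the site shifts a `div` and the positional path breaks.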
After fetching the pages, I can parse and store the records in
a CSV format or in a SQLite DB...
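The storage step could be sketched like this with the standard library. The record layout (id, name, address, website, phone) is my assumption, read off the two sample records above:

```python
import csv
import sqlite3

# Hypothetical record layout, assumed from the sample records above.
FIELDS = ["org_id", "name", "address", "website", "phone"]

def save_csv(records, path):
    """Write the scraped records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)

def save_sqlite(records, path):
    """Upsert the scraped records into a SQLite table keyed by org_id."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS organisations "
        "(org_id INTEGER PRIMARY KEY, name TEXT, address TEXT, "
        "website TEXT, phone TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO organisations "
        "VALUES (:org_id, :name, :address, :website, :phone)",
        records,
    )
    con.commit()
    con.close()

demo = [{"org_id": 50120, "name": "Kulturlabor Stromboli",
         "address": "Krippgasse 11, 6060 Hall in Tirol, Austria",
         "website": "www.stromboli.at", "phone": "+43522345111"}]
save_csv(demo, "organisations.csv")
save_sqlite(demo, "organisations.db")
```

Using `INSERT OR REPLACE` with `org_id` as the primary key means the scraper can be re-run without creating duplicate rows.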
What do you think about the idea? I would love to hear from you, as I am not so familiar with BS4 and am just starting with Python.
It would be great if you could give me a helping hand. I appreciate any and all help to get me a little step in a good direction.
I look forward to hearing from you.

regards
Apollo

Wordpress super toolkits: a. http://wpgear.org/ and b. https://github.com/miziomon/awesome-wordpress (Awesome WordPress: a curated list of amazingly awesome WordPress resources), and awesome Python things: https://github.com/vinta/awesome-python