Python Forum
Article Extraction - Wordpress
#1
Hi everyone!
Please be warned: I am a doctoral candidate who has realized that there is no way around learning Python, and I have come across numerous roadblocks that I hope you may be able to help with.

I have managed to scrape/crawl Twitter feeds of selected users, but now I am looking to extract all articles from wordpress pages (excl. images), including primarily the following:
- Title
- Article Link
- Time & Date
- Text
Optimally as an output within CSV / Excel.

I have come across the following websites:
https://indianpythonista.wordpress.com/2...iful-soup/
https://www.digitalocean.com/community/t...d-python-3
https://zach-adams.com/2015/04/python-sc...wordpress/

But I am truly struggling to get any of this code, in any of its variants, to work. (Scrapy won't install in my PyCharm, so I resorted to BeautifulSoup.)

A sample of websites I want to scrape (subsections in particular may include infinite scrolling):
1) https://electrek.co/guides/tesla/
2) https://www.teslarati.com/tag/tesla/

Would one of you be able to give me a hand adapting one of the BeautifulSoup scripts to the above 2 sample pages? I would take it from there and use it on other WordPress blogs, but I guess I need a helping hand to start!

Appreciate your time! Have a great weekend and stay safe.
#2
Take a look at part-1 and part-2.
svzekio Wrote: A sample of websites I want to scrape (subsections in particular may include infinite scrolling):
You could use Selenium for the infinite scrolling part.
Here is a start without using Selenium; you would probably have struggled to figure out this way of doing it.
The infinite scrolling part returns the whole page as JSON, so you can take the data from the dictionary and, where it contains HTML, parse that with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'electrek.co',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'accept': '*/*',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'origin': 'https://electrek.co',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://electrek.co/guides/tesla/',
    'accept-language': 'nb-NO,nb;q=0.9,no;q=0.8,nn;q=0.7,en-US;q=0.6,en;q=0.5',
}

params = (
    ('infinity', 'scrolling'),
)

# Try changing just the page number; the date may not need to change
page = 3
date = '25.05.20'
data = {
  'action': 'infinite_scroll',
  'page': page,
  'currentday': date,
  'order': 'DESC',
  'query_args[ninetofive_guides]': 'tesla',
  'query_args[error]': '',
  'query_args[m]': '',
  'query_args[p]': '0',
  'query_args[post_parent]': '',
  'query_args[subpost]': '',
  'query_args[subpost_id]': '',
  'query_args[attachment]': '',
  'query_args[attachment_id]': '0',
  'query_args[name]': '',
  'query_args[pagename]': '',
  'query_args[page_id]': '0',
  'query_args[second]': '',
  'query_args[minute]': '',
  'query_args[hour]': '',
  'query_args[day]': '0',
  'query_args[monthnum]': '0',
  'query_args[year]': '0',
  'query_args[w]': '0',
  'query_args[category_name]': '',
  'query_args[tag]': '',
  'query_args[cat]': '',
  'query_args[tag_id]': '',
  'query_args[author]': '',
  'query_args[author_name]': '',
  'query_args[feed]': '',
  'query_args[tb]': '',
  'query_args[paged]': '0',
  'query_args[meta_key]': '',
  'query_args[meta_value]': '',
  'query_args[preview]': '',
  'query_args[s]': '',
  'query_args[sentence]': '',
  'query_args[title]': '',
  'query_args[fields]': '',
  'query_args[menu_order]': '',
  'query_args[embed]': '',
  'query_args[update_post_meta_cache]': 'false',
  'query_args[update_post_term_cache]': 'false',
  'query_args[ignore_sticky_posts]': 'false',
  'query_args[suppress_filters]': 'false',
  'query_args[cache_results]': 'false',
  'query_args[lazy_load_term_meta]': 'false',
  'query_args[post_type]': '',
  'query_args[posts_per_page]': '10',
  'query_args[nopaging]': 'false',
  'query_args[comments_per_page]': '50',
  'query_args[no_found_rows]': 'false',
  'query_args[taxonomy]': 'ninetofive_guides',
  'query_args[term]': 'tesla',
  'query_args[order]': 'DESC',
  'last_post_date': '2020-06-02 16:10:35'
}

# Getting json back
response = requests.post('https://electrek.co/', headers=headers, params=params, data=data)
json_data = response.json()
post = json_data['data']['posts'][0]['post_content']
print(post)

# Using BS
print('-' * 25)
soup = BeautifulSoup(post, 'lxml')
p_tag = soup.find('p')
print(p_tag.text)
Output:
<p>Elon Musk made new comments about the new Tesla Roadster being equipped with a SpaceX package consisting of cold air thrusters. <a href="https://electrek.co/2020/05/24/tesla-roadster-spacex-package-elon-musk-james-bond/#more-134925" class="more-link"><span class="moretext" data-layer-pagetype="post" data-layer-postcategory="tesla">expand full story</span> <span class="x-animate"></span></a></p>
-------------------------
Elon Musk made new comments about the new Tesla Roadster being equipped with a SpaceX package consisting of cold air thrusters. expand full story
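To get from a single printed post to the CSV output asked for in the first post, something like the following sketch might work. It uses only the standard library, so it runs without extra packages; BeautifulSoup's text extraction would do the same job as the small parser here. Only the 'post_content' key is confirmed by the script above. The other key names ('post_title', 'guid', 'post_date') are assumptions based on standard WordPress post fields and may not be present in this site's JSON, so check them against a printed response first. The helper names are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment or "")
    parser.close()
    return " ".join("".join(parser.parts).split())


# CSV column -> JSON key. Only 'post_content' is confirmed by the script
# above; the other three keys are ASSUMED and must be checked.
FIELDS = {
    "title": "post_title",
    "link": "guid",
    "date": "post_date",
    "text": "post_content",
}


def posts_to_rows(posts):
    """Turn the list under json_data['data']['posts'] into flat dicts."""
    rows = []
    for post in posts:
        row = {name: post.get(key, "") for name, key in FIELDS.items()}
        row["text"] = html_to_text(row["text"])
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(FIELDS))
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data standing in for json_data['data']['posts'].
sample = [{
    "post_title": "Example title",
    "guid": "https://example.com/?p=1",
    "post_date": "2020-05-24 10:00:00",
    "post_content": "<p>Some <b>bold</b> text. <a href='#'>expand full story</a></p>",
}]
buffer = io.StringIO()
write_csv(posts_to_rows(sample), buffer)
print(buffer.getvalue())
```

For a real file, open it with open('posts.csv', 'w', newline='', encoding='utf-8') and pass that handle to write_csv instead of the in-memory buffer.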
#3
Thank you soooo much for your suggestion/code here! I will give it a try and come back should I hit a few more walls along the way.
Really appreciate it!

If any other suggestions come to mind, please do not hesitate to fire them across.

Wish everyone a fantastic start to the week!
Regards

(Jun-07-2020, 03:02 PM)snippsat Wrote: Take a look at part-1 and part-2.
#4
I just tested it using the code provided, but oddly enough it produced no results.

I made sure to install all the packages. Am I missing something? :(
#5
(Jun-08-2020, 11:21 AM)svzekio Wrote: I just tested it using the code provided, but oddly enough it produced no results.

I made sure to install all the packages. Am I missing something? :(
Of course I cannot say anything when you don't give any information about what you are doing.
A simple troubleshooting step is to print json_data.
So on line 89 (that is, right after json_data = response.json()) add:
print(json_data)
Now it uses only Requests, which gives back the JSON data.
Test again now; it works for me.
Here it is in another environment, not mine (a Notebook); as you can see, it works there too.
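Beyond printing json_data, it can help to look at the response before calling .json() on it, since a blocked or redirected request often comes back as HTML rather than JSON. The sketch below is one possible way to do that. inspect_response is a made-up helper name, and the 'data' / 'posts' layout is taken from the script earlier in the thread. It only needs an object with status_code, headers and text attributes, so the demo uses stand-in objects instead of a live request.

```python
import json
from types import SimpleNamespace


def inspect_response(response):
    """Summarise a response object that has status_code, headers and text."""
    report = {
        "status": response.status_code,
        "content_type": response.headers.get("content-type", ""),
        "is_json": False,
        "post_count": None,
    }
    try:
        payload = json.loads(response.text)
    except ValueError:
        # Body is not JSON: keep the start of it so it can be looked at.
        report["preview"] = response.text[:200]
        return report
    report["is_json"] = True
    if isinstance(payload, dict):
        data = payload.get("data")
        posts = data.get("posts") if isinstance(data, dict) else None
        if isinstance(posts, list):
            report["post_count"] = len(posts)
    return report


# Stand-in objects; a real call would pass the result of requests.post(...).
fake_ok = SimpleNamespace(
    status_code=200,
    headers={"content-type": "application/json"},
    text='{"data": {"posts": [{"post_content": "<p>hi</p>"}]}}',
)
fake_html = SimpleNamespace(
    status_code=403,
    headers={"content-type": "text/html"},
    text="<html>blocked</html>",
)
print(inspect_response(fake_ok))
print(inspect_response(fake_html))
```

If is_json comes back False or post_count is 0, the preview, the status code and the request's page and date values are the first things to check.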
#6
Feel free to try my Wordpress blog scraper code that I used for my own blog.
I wrote this in 2018 so I don't remember much about how it worked:
https://stevepython.wordpress.com/2018/1...og-scraper
#7
(Jul-10-2020, 10:26 AM)steve_shambles Wrote: Feel free to try my Wordpress blog scraper code that I used for my own blog.
I wrote this in 2018 so I don't remember much about how it worked:
An okay project, Steve, but it will not work for this task at all.
There is an infinite scrolling part with Ajax that returns JSON, which I have worked on to make this work.
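Since the Ajax endpoint hands back one page of posts per request, collecting everything comes down to asking for page after page until nothing more comes back. Here is a rough sketch of that loop. collect_posts and fetch_page are hypothetical names, and stopping on an empty 'posts' list is an assumption about how this endpoint behaves at the end, not something confirmed in this thread. fetch_page is passed in as a function, so the loop can be tried without touching the network.

```python
def collect_posts(fetch_page, start_page=1, max_pages=20):
    """Call fetch_page(page) for successive pages and gather all posts.

    fetch_page must return the decoded JSON dict for one page. The loop
    stops at the first page without posts (assumed end-of-data signal)
    or after max_pages requests, whichever comes first.
    """
    posts = []
    for page in range(start_page, start_page + max_pages):
        payload = fetch_page(page)
        data = payload.get("data") if isinstance(payload, dict) else None
        batch = data.get("posts") if isinstance(data, dict) else None
        if not batch:
            break
        posts.extend(batch)
    return posts


def fake_fetch(page):
    """Stand-in for the real request: two pages of made-up posts."""
    pages = {1: ["a", "b"], 2: ["c"]}
    items = pages.get(page, [])
    return {"data": {"posts": [{"post_content": f"<p>{x}</p>"} for x in items]}}


all_posts = collect_posts(fake_fetch)
print(len(all_posts))
```

A real fetch_page would wrap the requests.post call from post #2, set data['page'] to the page argument and return response.json(). A short time.sleep between requests keeps the load on the site low.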
#8
OK sorry it was no help.

(Jul-10-2020, 12:49 PM)snippsat Wrote:
An okay project, Steve, but it will not work for this task at all.
There is an infinite scrolling part with Ajax that returns JSON, which I have worked on to make this work.