Python Forum

Full Version: Article Extraction - Wordpress
Hi everyone!
Please be warned, I am a doctoral candidate who realized that there is no way around learning how to use Python, but I have come across numerous roadblocks where I hope you may be able to help.

I have managed to scrape/crawl Twitter feeds of selected users, but now I am looking to extract all articles from wordpress pages (excl. images), including primarily the following:
- Title
- Article Link
- Time & Date
- Text
Optimally as an output within CSV / Excel.

I have come across the following websites:
https://indianpythonista.wordpress.com/2...iful-soup/
https://www.digitalocean.com/community/t...d-python-3
https://zach-adams.com/2015/04/python-sc...wordpress/

But I am truly struggling to get any of these scripts, in any of their variants, to work. (Scrapy won't install in my PyCharm, so I resorted to BeautifulSoup.)

A sample of websites I want to scrape (particularly subsections may include infinite scrolling):
1) https://electrek.co/guides/tesla/
2) https://www.teslarati.com/tag/tesla/

Is there one of you out there who would be able to give a hand to adapt one of the BeautifulSoup scripts to the above two sample pages? I would take it from there and use it on any other WordPress blogs, but I guess I need a starting hand!

Appreciate your time! Have a great weekend and stay safe.
Take a look at part-1 and part-2.
svzekio Wrote:A sample of websites I want to scrape (particularly subsections may include infinite scrolling):
You could use Selenium for the infinite-scrolling part.
Here is a start without using Selenium; you would probably have struggled to figure out this way of doing it.
The infinite-scroll endpoint returns the whole page as JSON, so we can take the data from the dictionary and, if it contains HTML, parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'electrek.co',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'accept': '*/*',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'origin': 'https://electrek.co',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://electrek.co/guides/tesla/',
    'accept-language': 'nb-NO,nb;q=0.9,no;q=0.8,nn;q=0.7,en-US;q=0.6,en;q=0.5',
}

params = (
    ('infinity', 'scrolling'),
)

# Try changing just the page; the date may not need to change
page = 3
date = '25.05.20'
data = {
  'action': 'infinite_scroll',
  'page': page,
  'currentday': date,
  'order': 'DESC',
  'query_args[ninetofive_guides]': 'tesla',
  'query_args[error]': '',
  'query_args[m]': '',
  'query_args[p]': '0',
  'query_args[post_parent]': '',
  'query_args[subpost]': '',
  'query_args[subpost_id]': '',
  'query_args[attachment]': '',
  'query_args[attachment_id]': '0',
  'query_args[name]': '',
  'query_args[pagename]': '',
  'query_args[page_id]': '0',
  'query_args[second]': '',
  'query_args[minute]': '',
  'query_args[hour]': '',
  'query_args[day]': '0',
  'query_args[monthnum]': '0',
  'query_args[year]': '0',
  'query_args[w]': '0',
  'query_args[category_name]': '',
  'query_args[tag]': '',
  'query_args[cat]': '',
  'query_args[tag_id]': '',
  'query_args[author]': '',
  'query_args[author_name]': '',
  'query_args[feed]': '',
  'query_args[tb]': '',
  'query_args[paged]': '0',
  'query_args[meta_key]': '',
  'query_args[meta_value]': '',
  'query_args[preview]': '',
  'query_args[s]': '',
  'query_args[sentence]': '',
  'query_args[title]': '',
  'query_args[fields]': '',
  'query_args[menu_order]': '',
  'query_args[embed]': '',
  'query_args[update_post_meta_cache]': 'false',
  'query_args[update_post_term_cache]': 'false',
  'query_args[ignore_sticky_posts]': 'false',
  'query_args[suppress_filters]': 'false',
  'query_args[cache_results]': 'false',
  'query_args[lazy_load_term_meta]': 'false',
  'query_args[post_type]': '',
  'query_args[posts_per_page]': '10',
  'query_args[nopaging]': 'false',
  'query_args[comments_per_page]': '50',
  'query_args[no_found_rows]': 'false',
  'query_args[taxonomy]': 'ninetofive_guides',
  'query_args[term]': 'tesla',
  'query_args[order]': 'DESC',
  'last_post_date': '2020-06-02 16:10:35'
}

# Getting json back
response = requests.post('https://electrek.co/', headers=headers, params=params, data=data)
json_data = response.json()
post = json_data['data']['posts'][0]['post_content']
print(post)

# Using BS
print('-' * 25)
soup = BeautifulSoup(post, 'lxml')
p_tag = soup.find('p')
print(p_tag.text)
Output:
<p>Elon Musk made new comments about the new Tesla Roadster being equipped with a SpaceX package consisting of cold air thrusters. <a href="https://electrek.co/2020/05/24/tesla-roadster-spacex-package-elon-musk-james-bond/#more-134925" class="more-link"><span class="moretext" data-layer-pagetype="post" data-layer-postcategory="tesla">expand full story</span> <span class="x-animate"></span></a></p>
-------------------------
Elon Musk made new comments about the new Tesla Roadster being equipped with a SpaceX package consisting of cold air thrusters. expand full story
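Since the original question asked for CSV/Excel output: once the JSON is back, writing the posts out is mostly bookkeeping with the stdlib csv module. A minimal sketch; the 'post_title' and 'post_date' keys here are assumptions about the payload shape, so check your own json_data first:

```python
import csv
import io

# Hypothetical rows, shaped like entries from json_data['data']['posts'];
# 'post_title' and 'post_date' are assumed field names, not confirmed ones.
posts = [
    {
        'post_title': 'Example post',
        'post_date': '2020-06-02 16:10:35',
        'post_content': '<p>Body text here.</p>',
    },
]

# Write to an in-memory buffer; swap io.StringIO() for
# open('posts.csv', 'w', newline='') to get a real file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['post_title', 'post_date', 'post_content'])
writer.writeheader()
writer.writerows(posts)
print(buf.getvalue())
```

Excel opens such a CSV directly, so there is no need for a separate Excel writer unless you want formatting.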
Thank you soooo much for your suggestion/code here! I will give it a try and come back should I hit a few more walls along the way.
Really appreciate it!

If any other suggestions come into mind, please do not hesitate to fire them across.

Wish everyone a fantastic start to the week!
Regards

(Jun-07-2020, 03:02 PM)snippsat Wrote: [ -> ]Take a look at part-1 and part-2. [...]
I just tested the code provided, but oddly enough it produced no results.

I made sure all the packages are installed. Am I missing something? :(
(Jun-08-2020, 11:21 AM)svzekio Wrote: [ -> ]I just tested the code provided, but oddly enough it produced no results.

I made sure all the packages are installed. Am I missing something? :(
I can of course not say anything when you don't give any information about what you are doing.
Simple troubleshooting is to print json_data.
So on line 89 add:
print(json_data)
Now only using Requests, which gives back the JSON data.
Test again; it works for me.
Here it is in another environment, not my own notebook; as you see it works there too.
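To make the troubleshooting above less manual, one could wrap the dictionary access in a small helper that tolerates an empty or error response instead of raising a KeyError. This is only a sketch; the 'data'/'posts'/'post_content' keys match the structure used earlier in the thread:

```python
def first_post_content(json_data):
    """Return the first post's HTML from the Ajax response, or None if absent."""
    posts = json_data.get('data', {}).get('posts', [])
    if not posts:
        return None
    return posts[0].get('post_content')


# Behaves the same on a well-formed response and on failure shapes:
sample = {'data': {'posts': [{'post_content': '<p>hello</p>'}]}}
print(first_post_content(sample))  # -> <p>hello</p>
print(first_post_content({}))      # -> None
```

If this returns None on the real site, printing json_data (as suggested above) is the next step, since the server may be returning an error payload rather than posts.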
Feel free to try my Wordpress blog scraper code that I used for my own blog.
I wrote this in 2018 so I don't remember much about how it worked:
https://stevepython.wordpress.com/2018/1...og-scraper
(Jul-10-2020, 10:26 AM)steve_shambles Wrote: [ -> ]Feel free to try my Wordpress blog scraper code that I used for my own blog.
I wrote this in 2018 so I don't remember much about how it worked:
An okay project, Steve, but it will not work for this task at all.
There is an infinite-scrolling part with Ajax and a JSON return, which is what I had to work out to make this work.
OK sorry it was no help.

(Jul-10-2020, 12:49 PM)snippsat Wrote: [ -> ] [...]