Posts: 3
Threads: 1
Joined: Jun 2020
Hi everyone!
Please be warned: I am a doctoral candidate who realized there is no way around learning Python, but I have come across numerous roadblocks where I hope you may be able to help.
I have managed to scrape/crawl Twitter feeds of selected users, but now I am looking to extract all articles from WordPress pages (excluding images), primarily the following:
- Title
- Article Link
- Time & Date
- Text
Ideally with the output as CSV / Excel.
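For the output step, something like this is what I have in mind (just a rough sketch of writing already-collected fields to a CSV that Excel can open; the example row is obviously made up):
import csv

# Hypothetical example row; in practice these would come from the scraper
articles = [
    {
        'title': 'Example article title',
        'link': 'https://example.wordpress.com/2020/06/01/example-article/',
        'date': '2020-06-01 12:00:00',
        'text': 'Full article text goes here...',
    },
]

# Write the collected articles to a CSV file that Excel can open
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link', 'date', 'text'])
    writer.writeheader()
    writer.writerows(articles)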
I have come across the following websites:
https://indianpythonista.wordpress.com/2...iful-soup/
https://www.digitalocean.com/community/t...d-python-3
https://zach-adams.com/2015/04/python-sc...wordpress/
But I am truly struggling to get any of this code, in all of its variants, to work. (Scrapy won't install in my PyCharm, so I resorted to BeautifulSoup.)
A sample of websites I want to scrape (particular subsections may include infinite scrolling):
1) https://electrek.co/guides/tesla/
2) https://www.teslarati.com/tag/tesla/
Is there anyone out there who would be able to lend a hand and adapt one of the BeautifulSoup scripts to the above two sample pages? I would take it from there and use it on any other WordPress blogs, but I guess I need a starting hand!
Appreciate your time! Have a great weekend and stay safe.
Posts: 7,312
Threads: 123
Joined: Sep 2016
Jun-07-2020, 03:02 PM
(This post was last modified: Jun-07-2020, 03:03 PM by snippsat.)
Take a look at part-1 and part-2.
svzekio Wrote: A sample of websites I want to scrape (particular subsections may include infinite scrolling) You could use Selenium for the infinite scrolling part.
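If you do go the Selenium route, a rough sketch of the scrolling idea could look like this (assuming Chrome and chromedriver are installed; the number of scrolls and the pause length are just guesses):
import time
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on PATH
driver.get('https://electrek.co/guides/tesla/')

# Scroll a few times so the page loads more posts via infinite scrolling
last_height = driver.execute_script('return document.body.scrollHeight')
for _ in range(5):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the new posts time to load
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source  # can now be parsed with BeautifulSoup
driver.quit()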
Here is a start without using Selenium; you would probably have struggled to figure out this way of doing it.
The infinite scrolling part returns the whole page as JSON, so you can take the data from the dictionary and, where it returns HTML, parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'electrek.co',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'accept': '*/*',
'x-requested-with': 'XMLHttpRequest',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'origin': 'https://electrek.co',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://electrek.co/guides/tesla/',
'accept-language': 'nb-NO,nb;q=0.9,no;q=0.8,nn;q=0.7,en-US;q=0.6,en;q=0.5',
}
params = (
('infinity', 'scrolling'),
)
# Try changing just the page; the date may not need to change
page = 3
date = '25.05.20'
data = {
'action': 'infinite_scroll',
'page': page,
'currentday': date,
'order': 'DESC',
'query_args[ninetofive_guides]': 'tesla',
'query_args[error]': '',
'query_args[m]': '',
'query_args[p]': '0',
'query_args[post_parent]': '',
'query_args[subpost]': '',
'query_args[subpost_id]': '',
'query_args[attachment]': '',
'query_args[attachment_id]': '0',
'query_args[name]': '',
'query_args[pagename]': '',
'query_args[page_id]': '0',
'query_args[second]': '',
'query_args[minute]': '',
'query_args[hour]': '',
'query_args[day]': '0',
'query_args[monthnum]': '0',
'query_args[year]': '0',
'query_args[w]': '0',
'query_args[category_name]': '',
'query_args[tag]': '',
'query_args[cat]': '',
'query_args[tag_id]': '',
'query_args[author]': '',
'query_args[author_name]': '',
'query_args[feed]': '',
'query_args[tb]': '',
'query_args[paged]': '0',
'query_args[meta_key]': '',
'query_args[meta_value]': '',
'query_args[preview]': '',
'query_args[s]': '',
'query_args[sentence]': '',
'query_args[title]': '',
'query_args[fields]': '',
'query_args[menu_order]': '',
'query_args[embed]': '',
'query_args[update_post_meta_cache]': 'false',
'query_args[update_post_term_cache]': 'false',
'query_args[ignore_sticky_posts]': 'false',
'query_args[suppress_filters]': 'false',
'query_args[cache_results]': 'false',
'query_args[lazy_load_term_meta]': 'false',
'query_args[post_type]': '',
'query_args[posts_per_page]': '10',
'query_args[nopaging]': 'false',
'query_args[comments_per_page]': '50',
'query_args[no_found_rows]': 'false',
'query_args[taxonomy]': 'ninetofive_guides',
'query_args[term]': 'tesla',
'query_args[order]': 'DESC',
'last_post_date': '2020-06-02 16:10:35'
}
# Getting json back
response = requests.post('https://electrek.co/', headers=headers, params=params, data=data)
json_data = response.json()
post = json_data['data']['posts'][0]['post_content']
print(post)
# Using BS
print('-' * 25)
soup = BeautifulSoup(post, 'lxml')
p_tag = soup.find('p')
print(p_tag.text)
Output:
<p>Elon Musk made new comments about the new Tesla Roadster being equipped with a SpaceX package consisting of cold air thrusters. <a href="https://electrek.co/2020/05/24/tesla-roadster-spacex-package-elon-musk-james-bond/#more-134925" class="more-link"><span class="moretext" data-layer-pagetype="post" data-layer-postcategory="tesla">expand full story</span> <span class="x-animate"></span></a></p>
-------------------------
Elon Musk made new comments about the new Tesla Roadster being equipped with a SpaceX package consisting of cold air thrusters. expand full story
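To get from there to the CSV output asked for above, a rough sketch could continue from the json_data variable like this (the post_title, post_date and guid key names are my guesses at typical WordPress fields; print json_data to confirm what is really there):
import csv
from bs4 import BeautifulSoup

rows = []
for post in json_data['data']['posts']:
    soup = BeautifulSoup(post['post_content'], 'lxml')
    rows.append({
        'title': post.get('post_title', ''),   # assumed key name
        'link': post.get('guid', ''),          # assumed key name
        'date': post.get('post_date', ''),     # assumed key name
        'text': soup.get_text(' ', strip=True),
    })

# Write one row per post to a CSV file
with open('electrek_tesla.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link', 'date', 'text'])
    writer.writeheader()
    writer.writerows(rows)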
Posts: 3
Threads: 1
Joined: Jun 2020
Thank you so much for your suggestion/code here! I will give it a try and come back should I hit a few more walls along the way.
Really appreciate it!
If any other suggestions come to mind, please do not hesitate to fire them across.
Wish everyone a fantastic start to the week!
Regards
Posts: 3
Threads: 1
Joined: Jun 2020
I just tested the code provided, but oddly enough it produced no results.
I made sure all the packages are installed. Am I missing something? :(
Posts: 7,312
Threads: 123
Joined: Sep 2016
Jun-08-2020, 01:19 PM
(This post was last modified: Jun-08-2020, 01:20 PM by snippsat.)
(Jun-08-2020, 11:21 AM)svzekio Wrote: I just tested the code provided, but oddly enough it produced no results.
I made sure all the packages are installed. Am I missing something? :( I can of course not say anything when you don't give any information about what you are doing.
Simple troubleshooting is to print json_data.
So right after json_data = response.json(), add:
print(json_data)
That part only uses Requests, which gives back the JSON data.
Test again now; it works for me.
Here it is in another environment (a Notebook, not mine); as you can see, it works there too.
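For completeness, a minimal check along those lines (reusing the headers, params and data dicts from the code in the earlier post; the status-code print is just an extra sanity step I would add):
import requests

# Reuses the headers, params and data dicts from the earlier post
response = requests.post('https://electrek.co/', headers=headers, params=params, data=data)
print(response.status_code)   # expect 200 if the request went through
json_data = response.json()
print(json_data.keys())       # inspect the structure before digging into it
print(json_data['data']['posts'][0]['post_content'])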
Posts: 76
Threads: 14
Joined: Jan 2019
Feel free to try my WordPress blog scraper code that I used for my own blog.
I wrote this in 2018 so I don't remember much about how it worked:
https://stevepython.wordpress.com/2018/1...og-scraper
Posts: 7,312
Threads: 123
Joined: Sep 2016
(Jul-10-2020, 10:26 AM)steve_shambles Wrote: Feel free to try my WordPress blog scraper code that I used for my own blog.
I wrote this in 2018 so I don't remember much about how it worked: An okay project Steve, but it will not work for this task at all.
There is an infinite scrolling part with Ajax and a JSON return, which I had to work on to make this work.
Posts: 76
Threads: 14
Joined: Jan 2019
OK sorry it was no help.