Python Forum
How to web scrape this? - Printable Version




How to web scrape this? - Pedroski55 - May-27-2021

This is not really a Python question, sorry, but I don't know where to ask.

I was interested in recent posts here about web-scraping.

I often look up examples of Python on geeksforgeeks.org, they have simple, clear examples.

I am reading about the difference between propositional logic and predicate logic and geeksforgeeks.org has a short webpage about the difference between these two.

So I thought, "I'll webscrape it and save the text!", just as practice.

But, there is no .html or .php just:

Quote:https://www.geeksforgeeks.org/difference-between-propositional-logic-and-predicate-logic/#:~:text=Difference%20between%20Propositional%20Logic%20and%20Predicate%20Logic:%20,scope%20%20...%20%203%20more%20rows%20

Can this be webscraped? What language is this? What kind of webpage is this?


RE: How to web scrape this? - Larz60+ - May-27-2021

If you just want to save the page (from Firefox, assume something similar in your browser):
  • Navigate to your URL
  • From File menu (Firefox) select: Save page as
  • Give a directory location where you'd like to save.
This will create a clone of the web page including all images, etc.


RE: How to web scrape this? - Pedroski55 - May-28-2021

Thanks, but what I'm really wondering is:

what is this webpage with no document in the form of a_webpage.html or a_webpage.php


What is this in place of an html document??

#:~:text=Difference%20between%20Propositional%20Logic%20and%20Predicate%20Logic:%20,scope%20%20...%20%203%20more%20rows%20


RE: How to web scrape this? - snippsat - May-28-2021

(May-27-2021, 10:24 PM)Pedroski55 Wrote: So I thought, "I'll webscrape it and save the text!", just as practice.

But, there is no .html or .php just:
Do you see .html or .php often? It's not common to have them in a URL.
So on the web, filename extensions do not matter: the web server decides what to serve for a given path (often an .html file behind the scenes), and the browser talks to a name server (DNS) to translate the server name into an address.
Read more about this.

So you scrape it the same way, as it's just a normal URL.
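import requests
from bs4 import BeautifulSoup

# Plain URL of the page; no file extension is needed
url = 'https://www.geeksforgeeks.org/difference-between-propositional-logic-and-predicate-logic/'
response = requests.get(url)
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(response.content, 'lxml')
# Page title
print(soup.select_one('div.title').text)
# First item of a list in the article, located with a CSS selector
print(soup.select_one('#post-564612 > div.text > ol:nth-child(4) > li:nth-child(1)').text)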
Output:
Difference between Propositional Logic and Predicate Logic
If x is real, then x2 > 0
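A side note on the #:~:text=... part of the pasted link: everything from # onward is the URL fragment. It is handled by the browser (here as a scroll-to-text highlight) and is never sent to the web server, which is why the URL in the code above can simply leave it out. A minimal sketch using only the standard library, with the pasted fragment shortened for readability (the variable names are just for illustration):

```python
from urllib.parse import urldefrag, unquote

# The link from the first post, with the long fragment shortened for readability.
pasted = (
    "https://www.geeksforgeeks.org/"
    "difference-between-propositional-logic-and-predicate-logic/"
    "#:~:text=Difference%20between%20Propositional%20Logic"
)

# Split the URL into the part the server sees and the browser-only fragment.
page_url, fragment = urldefrag(pasted)

print(page_url)           # the address to pass to requests.get()
print(unquote(fragment))  # the fragment with %20 decoded back to spaces
```

So the "document" being requested is just the path before the #; the rest is a hint for the browser.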



RE: How to web scrape this? - Pedroski55 - May-28-2021

Thanks I tried it, worked great. (Don't understand the:

Quote:print(soup.select_one('#post-564612 > div.text > ol:nth-child(4) > li:nth-child(1)').text)

part, but I will look it up!)

I thought all web documents had .html or .php as a basis. My mistake!

This thread has

https://python-forum.io/thread-33795.html as its basis.


RE: How to web scrape this? - Larz60+ - May-28-2021

Snippsat (one who answered your question) has two simple tutorials that will answer your questions.
see:
web scraping part 1
web scraping part 2


RE: How to web scrape this? - snippsat - May-28-2021

(May-28-2021, 07:07 AM)Pedroski55 Wrote: Thanks I tried it, worked great. (Don't understand the:)
print(soup.select_one('#post-564612 > div.text > ol:nth-child(4) > li:nth-child(1)').text)
It's a CSS selector. You can copy it from the browser's dev tools (F12):
right-click the tag you want, then Copy ➡ Copy selector. In BS there are two ways to call a selector: .select() or .select_one().
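To illustrate the difference between .select_one() and .select(), here is a minimal sketch run against a small made-up HTML snippet. The markup and the post-1 id are assumptions for illustration only, not the real geeksforgeeks page; it uses Python's built-in "html.parser", so lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example; it only mimics the shape
# of the selector discussed above.
html = """
<div id="post-1">
  <div class="text">
    <p>Intro</p>
    <ol>
      <li>First item</li>
      <li>Second item</li>
    </ol>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching tag (or None if nothing matches).
# Here <ol> is the 2nd element child of div.text, and we want its 1st <li>.
first = soup.select_one("#post-1 > div.text > ol:nth-child(2) > li:nth-child(1)")
first_text = first.text

# select() returns a list of every matching tag.
all_items = [li.text for li in soup.select("#post-1 > div.text > ol > li")]

# A selector that matches nothing gives None, so check before using .text.
missing = soup.select_one("#post-1 > div.text > table")

print(first_text)
print(all_items)
print(missing)
```

A selector copied from the browser tends to be very specific (ids, nth-child positions), so it can stop matching if the site changes its layout; checking for None before calling .text avoids an AttributeError in that case.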


RE: How to web scrape this? - nilamo - May-28-2021

(May-28-2021, 07:07 AM)Pedroski55 Wrote: I thought all web documents had .html or .php as a basis.

File extensions on the web are mostly all imaginary. Web servers send Content-Type headers along with whatever the document's contents are, so browsers know what to do with it (parse it for html, or display it for images, or save it for pdfs, etc).
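As a rough sketch of that idea: the decision depends on the Content-Type header value, not on anything in the URL. The function name and the simplified mapping below are assumptions for illustration; real browsers apply many more rules.

```python
def handling_for(content_type: str) -> str:
    """Return a rough description of what a browser does with this Content-Type.

    Hypothetical helper: a deliberately simplified mapping, not real browser logic.
    """
    # Drop parameters such as "; charset=UTF-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "parse as HTML"
    if media_type.startswith("image/"):
        return "display as image"
    if media_type == "application/pdf":
        return "save as PDF"
    return "unknown, probably offer a download"


# Example header values, as a server might send them.
for value in ("text/html; charset=UTF-8", "image/png", "application/pdf"):
    print(value, "->", handling_for(value))
```

With requests, the header of a real response can be read from response.headers.get('Content-Type') and passed to a function like this.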