Python Forum
How to web scrape this?
#1
This is not really a Python question, sorry, but I don't know where to ask.

I was interested in recent posts here about web-scraping.

I often look up examples of Python on geeksforgeeks.org, they have simple, clear examples.

I am reading about the difference between propositional logic and predicate logic and geeksforgeeks.org has a short webpage about the difference between these two.

So I thought, "I'll webscrape it and save the text!", just as practice.

But, there is no .html or .php just:

Quote:https://www.geeksforgeeks.org/difference...%20rows%20

Can this be webscraped? What language is this? What kind of webpage is this?
Reply
#2
If you just want to save the page (from Firefox, assume something similar in your browser):
  • Navigate to Your URL
  • From File menu (Firefox) select: Save page as
  • Give a directory location where you'd like to save.
This will create a clone of the web page including all images, etc.
Reply
#3
Thanks, but what I'm really wondering is:

what is this webpage with no document in the form of a_webpage.html or a_webpage.php


What is this in place of an html document??

#:~:text=Difference%20between%20Propositional%20Logic%20and%20Predicate%20Logic:%20,scope%20%20...%20%203%20more%20rows%20
Reply
#4
(May-27-2021, 10:24 PM)Pedroski55 Wrote: So I thought, "I'll webscrape it and save the text!", just as practice.

But, there is no .html or .php just:
Do you see .html or .php often? It's not common to have them in a URL.
On the web, filename extensions do not matter:
the web server maps the address you request to whatever it serves (which may well be an .html file behind the scenes), and the browser also talks to a name server (DNS) to translate the server name.
Read more about this.

So you scrape it the same way, as it's just a normal URL.
import requests
from bs4 import BeautifulSoup

# No file extension in the URL; it is fetched like any other address
url = 'https://www.geeksforgeeks.org/difference-between-propositional-logic-and-predicate-logic/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Page title
print(soup.select_one('div.title').text)
# First list item, found with a CSS selector copied from the browser
print(soup.select_one('#post-564612 > div.text > ol:nth-child(4) > li:nth-child(1)').text)
Output:
Difference between Propositional Logic and Predicate Logic
If x is real, then x2 > 0
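On the `#:~:text=...` part asked about in #3: everything after `#` in a URL is a fragment. The browser handles it itself and it is not part of the request sent to the server, so the page content is the same with or without it. The `:~:text=` form is a "text fragment" that tells supporting browsers which text to scroll to and highlight. A small sketch with the standard library, assuming a shortened version of the fragment from #3 was attached to the URL used above (the combined URL is my assumption, not copied from the thread):

```python
from urllib.parse import urldefrag, unquote

# Assumption: the page URL from the snippet above, with a shortened version
# of the '#:~:text=' fragment quoted in post #3 appended to it.
url = ('https://www.geeksforgeeks.org/'
       'difference-between-propositional-logic-and-predicate-logic/'
       '#:~:text=Difference%20between%20Propositional%20Logic'
       '%20and%20Predicate%20Logic')

# Split the URL into the part that identifies the page and the fragment
page_url, fragment = urldefrag(url)

print(page_url)           # the address to pass to requests.get()
print(unquote(fragment))  # the fragment with %20 decoded back to spaces
```

So for scraping, the fragment can simply be dropped and the plain page URL fetched.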
Reply
#5
Thanks I tried it, worked great. (Don't understand the:

Quote:print(soup.select_one('#post-564612 > div.text > ol:nth-child(4) > li:nth-child(1)').text)

part, but I will look it up!)

I thought all web documents had .html or .php as a basis. My mistake!

This thread has

https://python-forum.io/thread-33795.html as its basis.
Reply
#6
Snippsat (one who answered your question) has two simple tutorials that will answer your questions.
see:
web scraping part 1
web scraping part 2
Reply
#7
(May-28-2021, 07:07 AM)Pedroski55 Wrote: Thanks I tried it, worked great. (Don't understand the:)
print(soup.select_one('#post-564612 > div.text > ol:nth-child(4) > li:nth-child(1)').text)
It's a CSS selector. You can copy it from the browser: open dev tools (F12),
right-click the tag you want, then Copy ➡ Copy selector. In BS there are two ways to call a selector: .select() or .select_one().
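To see the difference between the two calls without hitting a real site, here is a minimal sketch on a made-up HTML string. The markup, the id `post-1` and the class names are hypothetical, chosen only to resemble the selector above; `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration (not the real page markup)
html = '''
<div id="post-1">
  <div class="title">Example title</div>
  <div class="text">
    <ol>
      <li>first item</li>
      <li>second item</li>
    </ol>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# select_one() returns the first matching tag, or None if nothing matches
first = soup.select_one('#post-1 > div.text > ol > li:nth-child(1)')
missing = soup.select_one('div.does-not-exist')

# select() returns a list of all matching tags (empty if nothing matches)
items = soup.select('#post-1 > div.text > ol > li')

print(first.text)
print([li.text for li in items])
print(missing)
```

Since select_one() gives back None when the selector does not match, it is worth checking the result before calling .text on it, as a selector copied from the browser can stop matching when the page layout changes.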
Reply
#8
(May-28-2021, 07:07 AM)Pedroski55 Wrote: I thought all web documents had .html or .php as a basis.

File extensions on the web are mostly imaginary. Web servers send a Content-Type header along with whatever the document's contents are, so browsers know what to do with it (parse it as HTML, display it as an image, save it as a PDF, etc).
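A rough sketch of that idea, using only the standard library to parse a Content-Type value. The function name `describe`, the sample header strings and the three-way mapping are my own simplification for illustration; real browsers do a lot more than this.

```python
from email.message import EmailMessage

def describe(content_type_header):
    """Return (media type, charset, rough guess at what a browser does)."""
    # Let the email package parse the header value and its parameters
    msg = EmailMessage()
    msg['Content-Type'] = content_type_header
    media_type = msg.get_content_type()    # e.g. 'text/html'
    charset = msg.get_content_charset()    # e.g. 'utf-8', or None if absent

    # Simplified decision: look at the header, never at a file extension
    if media_type == 'text/html':
        action = 'parse as HTML'
    elif media_type.startswith('image/'):
        action = 'display as image'
    elif media_type == 'application/pdf':
        action = 'show or save as PDF'
    else:
        action = 'other'
    return media_type, charset, action

# Hypothetical header values, as a server might send them
print(describe('text/html; charset=UTF-8'))
print(describe('image/png'))
print(describe('application/pdf'))
```

With requests, the value a server actually sent can be read with response.headers.get('Content-Type') and passed to a function like this.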
Reply

