Python Forum
why I can't scrape a website?
#1
I'm trying to get all the names of restaurants matching the 'food' keyword from the following website. However, I still need to learn more before I can pick out just the names, so for now I just want to scrape the whole page first, but I'm having trouble.

My code:

import requests
from bs4 import BeautifulSoup

url = 'https://map.naver.com'
req = requests.get(url)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

The result:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br/>
</p>
</body></html>
#2
Try adding in a bogus user-agent header:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

url = 'https://map.naver.com'
req = requests.get(url, headers=headers)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)
However, since the results are rendered by JavaScript, you may need Selenium instead of requests to obtain the correct HTML.
#3
Thanks for the response, Metulburr. I really appreciate it. I was able to scrape the page.
I will look into Selenium and the tutorials you mentioned to get better.

I have one more question.

I typed 'vr' into the search box at https://map.naver.com/ but the address doesn't change. I was expecting a longer address instead of just https://map.naver.com/

So I clicked one of the names in the results, pressed Ctrl+Shift+I > Network tab, and searched for a request containing the arcade information, which I added to my code below.

My question is: how can I scrape this page? I used the header, but I got an unexpected result.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
        

url = 'https://map.naver.com/search2/local.nhn?sm=hty&searchCoord=127.1302174%3B37.4119829&isFirstSearch=true&query=vr&menu=location&mpx=02135620%3A37.4119829%2C127.1302174%3AZ11%3A0.0195464%2C0.0085920'
req = requests.get(url, headers=headers)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)
Error:
{ "error": { "code": "HMAC_AUTH_FAILED", "msg": "Hmac Authentication has failed.", "extraInfo": null } }
#4
The entire website runs on JavaScript. If you turn off JavaScript in your browser, the entire website does nothing and the search bar does nothing. You will need Selenium to do anything on that site.
#5
The first thing you should look for with sites like this is whether there is an API.
See the NAVER Open API sample code for Python on GitHub.
Only if you cannot find what you need in the API should you try to scrape, and that can be difficult or even impossible on sites like this, even with Selenium.
#6
Hi,

First off, thank you for all of your support, metulburr and snippsat.

I worked out how to scrape the page by using their APIs. I learned that I have to send certain headers, such as an approved client ID and secret, in order to use their APIs.

My next mission will be collecting the names and writing them to Excel or Notepad. I will ask you more questions!

I'm sharing a sample code for using their APIs:

import urllib.parse
import urllib.request

client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
encText = urllib.parse.quote("search term")
url = "https://openapi.naver.com/v1/search/blog?query=" + encText # JSON result
# url = "https://openapi.naver.com/v1/search/blog.xml?query=" + encText # XML result
request = urllib.request.Request(url)
request.add_header("X-Naver-Client-Id", client_id)
request.add_header("X-Naver-Client-Secret", client_secret)
response = urllib.request.urlopen(request)
rescode = response.getcode()
if rescode == 200:
    response_body = response.read()
    print(response_body.decode('utf-8'))
else:
    print("Error Code:" + str(rescode))
P.S. Could you please recommend a well-regarded Python editor? I'm using Geany and have trouble copying and pasting my program's output because it is shown in the Windows command prompt, so I have to type out all the characters by hand.
#7
(Sep-25-2019, 03:49 AM)kmkim319 Wrote: result of my code because it is shown on Windows Command
You can configure the Windows command console to allow copying text from it. It is off by default.
#8
Hi, metulburr.

Now I've learned how to write my scraped results to Excel, but I still can't figure out how to collect only the desired information about VR arcades under the 'vr' keyword.

I'd like to collect 'title', 'category', 'telephone', and 'address' for each item in a list or dictionary and print each set to Excel. Please give me any advice.

import urllib.request

client_id = "confidential"
client_secret = "confidential"
url = "https://openapi.naver.com/v1/search/local.json?query=vr&display=3&start=1"
request = urllib.request.Request(url)
request.add_header("X-Naver-Client-Id", client_id)
request.add_header("X-Naver-Client-Secret", client_secret)
response = urllib.request.urlopen(request)
rescode = response.getcode()
if rescode == 200:
    responseBody = response.read().decode('utf-8')
    print(responseBody)

    with open("./vr_arcade.csv", "w", encoding="utf-8") as file:
        file.write(responseBody)
else:
    print("Error Code:" + str(rescode))
Output:
{ "lastBuildDate": "Fri", "total": 874, "start": 1, "display": 3, "items": [
  { "title": "arcade1", "link": "", "category": "category1", "description": "", "telephone": "telephone", "address": "address", "roadAddress": "roadAddress", "mapx": "310767", "mapy": "550356" },
  { "title": "arcade2", "link": "https://link2", "category": "category2", "description": "", "telephone": "telephone", "address": "address", "roadAddress": "roadAddress", "mapx": "312508", "mapy": "552087" },
  { "title": "arcade3", "link": "arcade3.com", "category": "arcade", "description": "", "telephone": "telephone", "address": "address", "roadAddress": "roadAddress", "mapx": "x", "mapy": "y" }
] }
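One way to pull out just those four fields and write proper CSV columns is csv.DictWriter. A sketch with made-up items shaped like the output above (the real script would build `items` from `json.loads(responseBody)["items"]`; the telephone/address values here are placeholders):

```python
import csv
import io

# Made-up items mimicking the "items" list in the API response
items = [
    {"title": "arcade1", "link": "", "category": "category1",
     "telephone": "tel1", "address": "addr1", "mapx": "310767", "mapy": "550356"},
    {"title": "arcade2", "link": "https://link2", "category": "category2",
     "telephone": "tel2", "address": "addr2", "mapx": "312508", "mapy": "552087"},
]

# Keep only the desired fields for each item
wanted = ["title", "category", "telephone", "address"]
rows = [{key: item[key] for key in wanted} for item in items]

# Write one CSV row per item; Excel opens CSV files directly.
# io.StringIO is used here for illustration; open("vr_arcade.csv", "w",
# newline="", encoding="utf-8-sig") would write a file Excel reads cleanly.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=wanted)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

This writes a header row followed by one row per arcade, with only the wanted columns, instead of dumping the raw JSON into the .csv file.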
Thanks


