Why can't I scrape a website?

kmkim319 · Sep-18-2019, 03:46 PM

I'm trying to get the names of all restaurants matching the keyword 'food' from the following website. I still have more to learn before I can pick out just the names, so for now I only want to scrape the whole page, but I'm having trouble even with that.

My code:

import requests
from bs4 import BeautifulSoup

url = 'https://map.naver.com'
req = requests.get(url)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

The result:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br/>
</p>
</body></html>

***metulburr*** · (This post was last modified: Sep-18-2019, 06:03 PM by metulburr.)

Try adding a bogus header:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

url = 'https://map.naver.com'
req = requests.get(url, headers=headers)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

However, since the result is mostly JavaScript, you may need Selenium instead of requests to obtain the correct HTML.
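One rough way to tell whether a response is "mostly JavaScript" is to count its script tags and measure how much visible text it contains. The sketch below uses only the standard library. The names PageStats and looks_js_rendered and the 200-character threshold are assumptions made for illustration, not part of requests or BeautifulSoup, and the check is only a heuristic.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Counts <script> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see on the page
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html, min_visible_chars=200):
    """Heuristic: the page has scripts but almost no visible text.

    The threshold is an arbitrary assumption; tune it for the site at hand.
    """
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return stats.script_count > 0 and stats.visible_chars < min_visible_chars


# Hypothetical sample pages for illustration
static_page = "<html><body><h1>Menu</h1><p>" + "Restaurant name. " * 20 + "</p></body></html>"
js_shell = '<html><head><script src="app.js"></script></head><body><div id="app"></div></body></html>'

print(looks_js_rendered(static_page))
print(looks_js_rendered(js_shell))
```

If a check like this returns True for req.text, the data is probably loaded by scripts after the page opens, so requests alone will not see it.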

kmkim319 · (This post was last modified: Sep-20-2019, 06:07 PM by kmkim319.)

Thanks for the response, Metulburr. I really appreciate it. I was able to scrape the page.
I will look into Selenium and the tutorials you mentioned to get better.

I have one more question.

I typed 'vr' into the search box on https://map.naver.com/ but the address doesn't change. I was expecting a longer address instead of just https://map.naver.com/

So I clicked on one of the names in the results, opened the developer tools (Ctrl+Shift+I), went to the Network tab, and looked for a request that includes the arcade information. That request URL is in my code below.

My question is: how can I scrape this page? I used the header, but I got an unexpected result.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
        

url = 'https://map.naver.com/search2/local.nhn?sm=hty&searchCoord=127.1302174%3B37.4119829&isFirstSearch=true&query=vr&menu=location&mpx=02135620%3A37.4119829%2C127.1302174%3AZ11%3A0.0195464%2C0.0085920'
req = requests.get(url, headers=headers)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

Error:
{
"error": {
"code": "HMAC_AUTH_FAILED",
"msg": "Hmac Authentication has failed.",
"extraInfo": null
}
}

***metulburr*** · Sep-20-2019, 06:38 PM

The entire website runs on JavaScript. If you turn off JavaScript in your browser, the whole site does nothing, including the search bar. You will need Selenium to do anything on that site.

***snippsat*** · Sep-20-2019, 07:41 PM

The first thing you should look at with sites like this is whether there is an API.
NAVER Open API sample code Python on GitHub.
Only if you cannot find what you are looking for in the API should you try to scrape, and that can be difficult or even impossible on sites like this, even with Selenium.

kmkim319 · (This post was last modified: Sep-25-2019, 03:49 AM by kmkim319.)

Hi,

First off, thank you for all of your support, metulburr and snippsat.

I solved it by using their API instead of scraping the page. I learned that I need to send specific headers, namely an approved client ID and client secret, in order to use it.

My next mission is to collect the names and write them to Excel or Notepad. I will ask you more questions!

Here is sample code for using their API:

import urllib.parse
import urllib.request

client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
encText = urllib.parse.quote("search term")  # URL-encode the query
url = "https://openapi.naver.com/v1/search/blog?query=" + encText  # JSON result
# url = "https://openapi.naver.com/v1/search/blog.xml?query=" + encText  # XML result
request = urllib.request.Request(url)
# The API authenticates each request with these two headers
request.add_header("X-Naver-Client-Id", client_id)
request.add_header("X-Naver-Client-Secret", client_secret)
response = urllib.request.urlopen(request)
rescode = response.getcode()
if rescode == 200:
    response_body = response.read()
    print(response_body.decode('utf-8'))
else:
    # rescode is an int, so convert it before concatenating
    print("Error Code:" + str(rescode))

P.S. Could you please recommend a well-regarded Python editor? I'm using Geany, and I can't copy and paste the output of my code because it is shown in the Windows command prompt, so I have to type out all the characters by hand.

***metulburr*** · Sep-25-2019, 11:31 AM

(Sep-25-2019, 03:49 AM)kmkim319 Wrote: result of my code because it is shown on Windows Command

You can change the Windows command console settings so that you can copy text from it. That option is off by default.

kmkim319 · (This post was last modified: Sep-27-2019, 03:14 PM by kmkim319.)

Hi, metulburr.

I have now learned how to write my scraped result to a file for Excel, but I still can't figure out how to collect only the information I want about the VR arcades returned for the 'vr' keyword.

I'd like to collect 'title', 'category', 'telephone', and 'address' for each item into a list or dictionary and write each set to Excel. Please give me any advice.

import urllib.request

client_id = "confidential"
client_secret = "confidential"
url = "https://openapi.naver.com/v1/search/local.json?query=vr&display=3&start=1"
request = urllib.request.Request(url)
request.add_header("X-Naver-Client-Id", client_id)
request.add_header("X-Naver-Client-Secret", client_secret)
response = urllib.request.urlopen(request)
rescode = response.getcode()
if rescode == 200:
    responseBody = response.read().decode('utf-8')
    print(responseBody)

    # Note: this writes the raw JSON text; the .csv name does not convert it
    with open("./vr_arcade.csv", "w", encoding="utf-8") as file:
        file.write(responseBody)
else:
    # rescode is an int, so convert it before concatenating
    print("Error Code:" + str(rescode))

Output:
{
lastBuildDate: "Fri
total: 874
start: 1
display: 3
items: [
{
title: "arcade1"
link: ""
category: "category1"
description: ""
telephone: "telephone"
address: "address"
roadAddress: "roadAddress"
mapx: "310767"
mapy: "550356"

}
{
title: "arcade2"
link: "https://link2"
category: "category2"
description: ""
telephone: "telephone"
address: "address"
roadAddress: "roadAddress"
mapx: "312508"
mapy: "552087"

}
{
title: "arcade3"
link: "arcade3.com"
category: "arcade"
description: ""
telephone: "telephone"
address: "address"
roadAddress: "roadAddress"
mapx: "x"
mapy: "y"

}
]
}

Thanks
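A minimal sketch of one way to do this, assuming the response body is valid JSON with an "items" list like the output above: parse it with json, keep only the four wanted fields per item, and build CSV text with csv.DictWriter. The helper names (extract_items, rows_to_csv, save_csv) and the sample data are made up for illustration and are not part of the Naver API.

```python
import csv
import io
import json

FIELDS = ["title", "category", "telephone", "address"]


def extract_items(response_body, fields=FIELDS):
    """Return one dict per search result, keeping only the wanted fields."""
    data = json.loads(response_body)
    return [
        {name: item.get(name, "") for name in fields}
        for item in data.get("items", [])
    ]


def rows_to_csv(rows, fields=FIELDS):
    """Build CSV text with a header row followed by one line per item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def save_csv(text, path):
    # utf-8-sig writes a BOM so that Excel detects UTF-8 (useful for Korean text);
    # newline="" keeps the line endings produced by the csv module unchanged
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        f.write(text)


# Hypothetical sample shaped like the API output shown above
sample = """
{
  "total": 874, "start": 1, "display": 2,
  "items": [
    {"title": "arcade1", "link": "", "category": "category1",
     "telephone": "telephone", "address": "address", "mapx": "310767"},
    {"title": "arcade2", "link": "https://link2", "category": "category2",
     "telephone": "telephone", "address": "address", "mapx": "312508"}
  ]
}
"""

rows = extract_items(sample)
print(rows_to_csv(rows))
```

In the script above, calling extract_items(responseBody) and then save_csv(rows_to_csv(rows), "./vr_arcade.csv") would replace the plain file.write(responseBody) step, so the file holds one row per arcade instead of raw JSON.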
