why I can't scrape a website? - Printable Version

why I can't scrape a website? - kmkim319 - Sep-18-2019

I'm trying to get the names of all restaurants matching the keyword 'food' from the following website. I still need to learn more before I can pick out just the names, so for now I only want to scrape the whole page, but I'm having trouble.

My code:

import requests
from bs4 import BeautifulSoup

url = 'https://map.naver.com'
req = requests.get(url)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

The result:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br/>
</p>
</body></html>

RE: why I can't scrape a website? - metulburr - Sep-18-2019

Try adding a fake browser User-Agent header:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

url = 'https://map.naver.com'
req = requests.get(url, headers=headers)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

However, since the result is mostly JavaScript, you may need Selenium instead of requests to obtain the rendered HTML.
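
One rough way to judge whether a page is built by JavaScript is to compare how many script tags the raw HTML has against how much visible text it carries. The sketch below uses only the standard library; the name `summarize_html` and the sample markup are made up for illustration, and the heuristic itself is an assumption, not a reliable test.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts <script> tags and visible text characters outside script/style."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see on the page.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize_html(html):
    """Return script-tag and visible-character counts for an HTML string."""
    parser = _TextCounter()
    parser.feed(html)
    parser.close()
    return {"script_tags": parser.script_tags, "visible_chars": parser.visible_chars}


# Hypothetical markup shaped like a JavaScript-driven page: an empty
# container plus scripts, with no text in the HTML itself.
sample = (
    '<html><body><div id="app"></div>'
    '<script src="app.js"></script><script>var a = 1;</script>'
    "</body></html>"
)
print(summarize_html(sample))
```

In practice you would pass `req.text` to `summarize_html`. Many script tags with almost no visible text hints that the content is filled in later by the browser.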

RE: why I can't scrape a website? - kmkim319 - Sep-20-2019

Thanks for the response, metulburr. I really appreciate it. I was able to scrape the page.
I will look into Selenium and the tutorials you mentioned to get better.

I have one more question.

I typed 'vr' into the search box on https://map.naver.com/ but the address didn't change. I was expecting a longer address instead of just https://map.naver.com/

So I clicked on one of the names in the results, pressed Ctrl+Shift+I, opened the Network tab, and looked for a request containing the arcade information. That request's URL is in my code below.

My question is: how can I scrape this page? I used the header, but I got an unexpected result.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
        

url = 'https://map.naver.com/search2/local.nhn?sm=hty&searchCoord=127.1302174%3B37.4119829&isFirstSearch=true&query=vr&menu=location&mpx=02135620%3A37.4119829%2C127.1302174%3AZ11%3A0.0195464%2C0.0085920'
req = requests.get(url, headers=headers)
bs = BeautifulSoup(req.text, 'html.parser')
print(bs)

Error:
{
"error": {
"code": "HMAC_AUTH_FAILED",
"msg": "Hmac Authentication has failed.",
"extraInfo": null
}
}

RE: why I can't scrape a website? - metulburr - Sep-20-2019

The entire website runs on JavaScript. If you turn off JavaScript in your browser, the site does nothing and the search bar does nothing. You will need Selenium to do anything on that site.
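
The HMAC_AUTH_FAILED error above suggests the endpoint expects each request to carry a signature that the site's own JavaScript computes with a key, which would explain why replaying the copied URL fails. How Naver actually signs its requests is not known here. The sketch below only illustrates the general idea of HMAC signing with the standard library; the key, the message layout, and the function names are all hypothetical.

```python
import hashlib
import hmac


def sign(secret_key: bytes, message: str) -> str:
    """Return a hex HMAC-SHA256 signature of message under secret_key."""
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(secret_key: bytes, message: str, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret_key, message)
    return hmac.compare_digest(expected, signature)


# Hypothetical values, not Naver's real key or message format.
key = b"hypothetical-secret"
msg = "/search?query=vr&timestamp=1569000000"
sig = sign(key, msg)

print(verify(key, msg, sig))        # signature made with the right key
print(verify(key, msg, "0" * 64))   # made-up signature
print(verify(key, msg + "x", sig))  # message changed after signing
```

A client that does not know the key cannot produce a signature the server accepts, so a plain `requests.get` on such a URL is rejected no matter which headers it sends.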

RE: why I can't scrape a website? - snippsat - Sep-20-2019

The first thing you should look for with sites like this is whether there is an API.
See the NAVER Open API Python sample code on GitHub.
Only if you cannot find what you are looking for in the API should you try to scrape, but this can be difficult or even impossible, even with Selenium, on sites like this.

RE: why I can't scrape a website? - kmkim319 - Sep-25-2019

Hi,

First off, thank you for all of your support, metulburr and snippsat.

I solved it by using their APIs instead of scraping the page. I learned that I have to send certain headers to use their APIs, namely an approved client ID and client secret.

My next mission is to collect the names and write them to Excel or Notepad. I will ask you more questions!

I'm sharing sample code for using their APIs:

import urllib.parse
import urllib.request

client_id = "YOUR_CLIENT_ID"
client_secret = "YOUR_CLIENT_SECRET"
encText = urllib.parse.quote("search term")  # URL-encode the word to search for
url = "https://openapi.naver.com/v1/search/blog?query=" + encText  # JSON result
# url = "https://openapi.naver.com/v1/search/blog.xml?query=" + encText  # XML result
request = urllib.request.Request(url)
# The API identifies the caller through these two headers
request.add_header("X-Naver-Client-Id", client_id)
request.add_header("X-Naver-Client-Secret", client_secret)
response = urllib.request.urlopen(request)
rescode = response.getcode()
if rescode == 200:
    response_body = response.read()
    print(response_body.decode('utf-8'))
else:
    # rescode is an int, so convert it before concatenating
    print("Error Code:" + str(rescode))
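
A slightly more defensive variant of the sample above, as a sketch: it builds the query string with `urllib.parse.urlencode` and catches `urllib.error.HTTPError`, which `urlopen` raises for 4xx and 5xx responses instead of returning them. The helper names `build_request` and `fetch` are made up for illustration; the endpoint and header names are the ones from the sample.

```python
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://openapi.naver.com/v1/search/blog"


def build_request(query, client_id, client_secret):
    """Build (but do not send) a search request with the two auth headers."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(API_URL + "?" + params)
    request.add_header("X-Naver-Client-Id", client_id)
    request.add_header("X-Naver-Client-Secret", client_secret)
    return request


def fetch(request):
    """Send the request; return the decoded body, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx, so error codes are handled here.
        print("Error Code:", err.code)
        return None


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("vr arcade", "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(req.full_url)
```

Calling `fetch(req)` with real credentials would then return the response text or `None`.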

P.S. Could you please recommend a well-regarded Python editor? I'm using Geany and have trouble copying the result of my code because it is shown in the Windows Command Prompt, so I have to type out all the characters by hand.

RE: why I can't scrape a website? - metulburr - Sep-25-2019

(Sep-25-2019, 03:49 AM)kmkim319 Wrote: result of my code because it is shown on Windows Command

You can change the Windows command console settings so that text can be copied from it. This is off by default.

RE: why I can't scrape a website? - kmkim319 - Sep-27-2019

Hi, metulburr.

Now I've learned how to write my scraped result to Excel, but I still can't figure out how to collect only the desired information about VR arcades under the 'vr' keyword.

I'd like to collect 'title', 'category', 'telephone', and 'address' for each item in a list or dictionary and write each set to Excel. Please give me any advice.

import urllib.request

client_id = "confidential"
client_secret = "confidential"
url = "https://openapi.naver.com/v1/search/local.json?query=vr&display=3&start=1"
request = urllib.request.Request(url)
request.add_header("X-Naver-Client-Id", client_id)
request.add_header("X-Naver-Client-Secret", client_secret)
response = urllib.request.urlopen(request)
rescode = response.getcode()
if rescode == 200:
    responseBody = response.read().decode('utf-8')
    print(responseBody)

    # The with block closes the file; note this writes the raw JSON text, not CSV rows
    with open("./vr_arcade.csv", "w", encoding="utf-8") as file:
        file.write(responseBody)
else:
    # rescode is an int, so convert it before concatenating
    print("Error Code:" + str(rescode))

Output:
{
lastBuildDate: "Fri
total: 874
start: 1
display: 3
items: [
{
title: "arcade1"
link: ""
category: "category1"
description: ""
telephone: "telephone"
address: "address"
roadAddress: "roadAddress"
mapx: "310767"
mapy: "550356"

}
{
title: "arcade2"
link: "https://link2"
category: "category2"
description: ""
telephone: "telephone"
address: "address"
roadAddress: "roadAddress"
mapx: "312508"
mapy: "552087"

}
{
title: "arcade3"
link: "arcade3.com"
category: "arcade"
description: ""
telephone: "telephone"
address: "address"
roadAddress: "roadAddress"
mapx: "x"
mapy: "y"

}
]
}

Thanks
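
Since the API returns JSON, one way to keep only 'title', 'category', 'telephone', and 'address' is to parse the body with `json.loads` and write the chosen keys with `csv.DictWriter`. The sketch below assumes the response has the shape shown in the output above (an `items` list of objects). The sample string stands in for `responseBody`, and the helper names `extract_rows` and `write_csv` are made up for illustration.

```python
import csv
import io
import json

FIELDS = ["title", "category", "telephone", "address"]


def extract_rows(response_body):
    """Parse the JSON text and keep only the wanted keys of each item."""
    data = json.loads(response_body)
    return [
        {field: item.get(field, "") for field in FIELDS}
        for item in data.get("items", [])
    ]


def write_csv(rows, fileobj):
    """Write a header line followed by one CSV row per item."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Stand-in for responseBody, shaped like the output above (placeholder values).
sample_body = json.dumps({
    "total": 874,
    "start": 1,
    "display": 2,
    "items": [
        {"title": "arcade1", "link": "", "category": "category1",
         "telephone": "telephone", "address": "address", "mapx": "310767"},
        {"title": "arcade2", "link": "https://link2", "category": "category2",
         "telephone": "telephone", "address": "address", "mapx": "312508"},
    ],
})

rows = extract_rows(sample_body)
buffer = io.StringIO()  # in-memory file so the sketch writes nothing to disk
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass a handle such as `open("vr_arcade.csv", "w", newline="", encoding="utf-8-sig")` to `write_csv`. The `utf-8-sig` encoding is a suggestion that usually helps Excel display Korean text correctly.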