Python Forum
webscraping - failing to extract specific text from data.gov


webscraping - failing to extract specific text from data.gov - rontar - May-18-2018

I wanted to extract the number of datasets listed at 'https://catalog.data.gov/dataset#sec-organization_type'.

The relevant part of the HTML is:
<body>
...
<div class="new-results">

<!-- Snippet snippets/search_result_text.html start -->

184,298 datasets found
<!-- Snippet snippets/search_result_text.html end -->

</div>

I used this Python code:
from lxml import html
import requests
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
doc = html.fromstring(response.text)
link = doc.cssselect('div.new-results')
for i in link:
    print(i.text)
This does not print the dataset count, and I don't know where the problem is.


RE: webscraping - failing to extract specific text from data.gov - buran - May-18-2018

from lxml import html
import requests
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
doc = html.fromstring(response.text)
link = doc.cssselect('div.new-results')
# text_content() gathers all text inside the div, not just the text before its first child
print(link[0].text_content().strip())
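
For what it's worth, the loop in the question most likely prints only blank lines because lxml's `.text` holds just the text that comes before an element's first child, and the HTML comment counts as a child. The dataset count ends up as that comment's `.tail`. A minimal sketch, assuming the page markup matches the snippet quoted in the question (the `snippet` string below is a stand-in, not a live download):

```python
from lxml import html

# Stand-in for the relevant part of the page, copied from the question.
snippet = """<div class="new-results">

<!-- Snippet snippets/search_result_text.html start -->

184,298 datasets found
<!-- Snippet snippets/search_result_text.html end -->

</div>"""

div = html.fromstring(snippet)

# .text stops at the first child node, which here is the HTML comment.
text_only = div.text
# The count is stored as the tail of that first comment.
comment_tail = div[0].tail
# text_content() joins every text node inside the element.
full_text = div.text_content().strip()

print(repr(text_only))
print(repr(comment_tail))
print(full_text)
```

Printing with `repr()` makes the whitespace-only value of `.text` visible, which is easy to miss with a plain `print()`.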

Or, using BeautifulSoup with lxml as the parser:

import requests
from bs4 import BeautifulSoup
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
soup = BeautifulSoup(response.text, 'lxml')
div = soup.find('div', {'class':'new-results'})
print(div.text.strip())
Or, with a CSS selector via select():

import requests
from bs4 import BeautifulSoup
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
soup = BeautifulSoup(response.text, 'lxml')
div = soup.select('div.new-results')
print(div[0].text.strip())
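
If the count is needed as a number rather than a string, the extracted text can be parsed further. A small sketch, assuming the text keeps the form "184,298 datasets found" shown in the question; `parse_dataset_count` is a hypothetical helper name, not part of any library:

```python
import re


def parse_dataset_count(text):
    """Return the first number in text such as '184,298 datasets found'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    # Drop the thousands separators before converting to int.
    return int(match.group().replace(",", ""))


print(parse_dataset_count("184,298 datasets found"))
```

The helper raises `ValueError` when no digits are present, so a change in the page layout shows up as an error instead of a silent wrong value.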



RE: webscraping - failing to extract specific text from data.gov - rontar - May-19-2018

Thanks a lot, buran!