Python Forum

Full Version: webscraping - failing to extract specific text from data.gov
I wanted to extract the number of datasets listed on 'https://catalog.data.gov/dataset#sec-organization_type'.

The relevant part of the HTML was:
<body>
...
<div class="new-results">

<!-- Snippet snippets/search_result_text.html start -->

184,298 datasets found
<!-- Snippet snippets/search_result_text.html end -->

</div>

I used this Python code:
from lxml import html
import requests
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
doc = html.fromstring(response.text)
link = doc.cssselect('div.new-results')
for i in link:
    print(i.text)
It fails to extract the text, and I don't know where the problem is.
from lxml import html
import requests
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
doc = html.fromstring(response.text)
link = doc.cssselect('div.new-results')
# .text stops at the first child node (here an HTML comment);
# text_content() returns all the text inside the div
print(link[0].text_content().strip())
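
The difference between .text and text_content() can be checked offline against the snippet quoted in the question. This is only a minimal sketch: SNIPPET is a hand-copied stand-in for the live page, and it uses xpath() rather than cssselect() so that it needs nothing beyond lxml itself.

```python
from lxml import html

# Stand-in for the live page: the <div> quoted in the question.
SNIPPET = (
    '<div class="new-results">\n'
    '\n'
    '<!-- Snippet snippets/search_result_text.html start -->\n'
    '\n'
    '184,298 datasets found\n'
    '<!-- Snippet snippets/search_result_text.html end -->\n'
    '\n'
    '</div>'
)

doc = html.fromstring(SNIPPET)
div = doc.xpath('//div[@class="new-results"]')[0]

# .text covers only what comes before the first child node. The HTML
# comment counts as a child, so at most whitespace ends up here
# (the parser may also report None, hence the "or ''").
text_only = (div.text or '').strip()

# The count itself is stored as the tail of the first comment node.
comment_tail = (div[0].tail or '').strip()

# text_content() joins every text node in the subtree, tails included.
full_text = div.text_content().strip()

print(repr(text_only))
print(repr(comment_tail))
print(repr(full_text))
```

In lxml's tree model, text that follows a child node belongs to that child's .tail, not to the parent's .text, which is why the loop in the question printed nothing useful.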

Or, using BeautifulSoup with lxml as the parser:

import requests
from bs4 import BeautifulSoup
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
soup = BeautifulSoup(response.text, 'lxml')
div = soup.find('div', {'class': 'new-results'})
print(div.text.strip())
or, with a CSS selector:

import requests
from bs4 import BeautifulSoup
response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
soup = BeautifulSoup(response.text, 'lxml')
div = soup.select('div.new-results')
print(div[0].text.strip())
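
If the goal is the count as a number and not the whole sentence, the extracted string can be parsed afterwards. A rough sketch, assuming the text keeps the "184,298 datasets found" shape shown in the question; parse_dataset_count is a hypothetical helper name, not part of any library:

```python
import re


def parse_dataset_count(text):
    """Return the number in text like '184,298 datasets found' as an int."""
    # Digits with optional thousands separators, followed by "dataset(s) found".
    match = re.search(r'(\d[\d,]*)\s+datasets?\s+found', text)
    if match is None:
        raise ValueError(f'no dataset count found in {text!r}')
    # Drop the commas before converting to int.
    return int(match.group(1).replace(',', ''))


count = parse_dataset_count('184,298 datasets found')
print(count)
```

The same function can be called on the result of text_content() or div.text from the examples above, since surrounding whitespace does not affect the search.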
Thanks a lot Buran!