webscraping - failing to extract specific text from data.gov
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-10379.html
webscraping - failing to extract specific text from data.gov - rontar - May-18-2018

Wanted to extract how many data sets are on 'https://catalog.data.gov/dataset#sec-organization_type'.

The HTML file was:

    <body>
    ...
    <div class="new-results">
      <!-- Snippet snippets/search_result_text.html start -->
      184,298 datasets found
      <!-- Snippet snippets/search_result_text.html end -->
    </div>

I used this python code:

    from lxml import html
    import requests

    response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
    doc = html.fromstring(response.text)
    link = doc.cssselect('div.new-results')
    for i in link:
        print(i.text)

I don't know where the problem is.

RE: webscraping - failing to extract specific text from data.gov - buran - May-18-2018

    from lxml import html
    import requests

    response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
    doc = html.fromstring(response.text)
    link = doc.cssselect('div.new-results')
    print(link[0].text_content().strip())

or using BeautifulSoup and lxml as parser

    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
    soup = BeautifulSoup(response.text, 'lxml')
    div = soup.find('div', {'class':'new-results'})
    print(div.text.strip())

or

    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://catalog.data.gov/dataset#sec-organization_type')
    soup = BeautifulSoup(response.text, 'lxml')
    div = soup.select('div.new-results')
    print(div[0].text.strip())

RE: webscraping - failing to extract specific text from data.gov - rontar - May-19-2018

Thanks a lot Buran!
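
A note on why the first attempt prints nothing useful while text_content() works: in the ElementTree-style API that lxml follows, an element's .text holds only the text before its first child node. The HTML comment counts as a child node, so "184,298 datasets found" is stored as the .tail of that comment, not as the .text of the div. The sketch below reproduces this offline with the standard library's xml.etree.ElementTree standing in for lxml (an assumption made so it runs without a network call or third-party packages; it needs Python 3.8+ for insert_comments). SNIPPET is a copy of the HTML quoted in the question, and visible_text is a hypothetical helper, not part of either library.

```python
import xml.etree.ElementTree as ET

# Copy of the markup quoted in the question (well-formed, so an XML parser accepts it).
SNIPPET = """<div class="new-results">
  <!-- Snippet snippets/search_result_text.html start -->
  184,298 datasets found
  <!-- Snippet snippets/search_result_text.html end -->
</div>"""

# The stdlib parser drops comments by default; insert_comments=True keeps them
# in the tree as child nodes, which is how lxml represents them as well.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
div = ET.fromstring(SNIPPET, parser=parser)


def visible_text(element):
    """Hypothetical helper: join the element's own text with the tail of each
    direct child. It does not descend into nested elements."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


# .text stops at the first child (the comment), so it is only whitespace here.
print(repr(div.text.strip()))      # ''

# The dataset count is stored as the tail of the first comment node.
print(repr(div[0].tail.strip()))   # '184,298 datasets found'

# Collecting text plus tails recovers the visible string.
print(visible_text(div))           # 184,298 datasets found
```

With lxml itself there is no need for such a helper: text_content(), as used in the reply above, already gathers the text and tails of the whole subtree.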