Sitemap.xml and pull URLs and get response code - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: News and Discussions (https://python-forum.io/forum-31.html)
+--- Thread: Sitemap.xml and pull URLs and get response code (/thread-26168.html)
Sitemap.xml and pull URLs and get response code - motunyabgu - Apr-22-2020

What I want to do is start from a site's robots.txt (or the sitemap.xml it points to), pull out all the URLs listed there, and get the response code for each one, then save the result to CSV. I can do part of this in a different way, but I can't save to CSV or get the response code for each URL individually. Also, sitemaps generated by plugins like All in One SEO are different from a manually created sitemap. The code I use:

import requests
import pandas as pd
import xmltodict
from bs4 import BeautifulSoup

url = 'https://site.com/postsitemap.xml'
page = requests.get(url)
print('Sitemap response code: %s' % page.status_code)

# The original post never defined `raw`; parsing the fetched XML
# with xmltodict is presumably what was intended.
raw = xmltodict.parse(page.content)
data = [[r["loc"], r["lastmod"]] for r in raw["urlset"]["url"]]
print("Sitemap URL count:", len(data))
df = pd.DataFrame(data, columns=["links", "lastmod"])

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))
urls = [element.text for element in sitemap_index.find_all('loc')]

with open("sitemap.txt", "a+") as d:
    for link in sorted(urls):
        d.write(link + "\n")

with open("sitemap.txt") as f:
    print(f.read())
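A minimal sketch of what the post asks for: extract every `<loc>` URL from sitemap XML, request each URL's status code, and write the pairs to a CSV file. The sitemap URL in the usage comment is the placeholder from the post, not a real endpoint; `requests.head` is used instead of `get` purely to avoid downloading page bodies, and any request failure is recorded in the CSV rather than crashing the loop.

```python
import csv

import requests
from bs4 import BeautifulSoup


def sitemap_urls(xml_text):
    """Return every <loc> value found in sitemap XML."""
    # html.parser handles simple sitemaps without needing lxml installed.
    soup = BeautifulSoup(xml_text, "html.parser")
    return [loc.text.strip() for loc in soup.find_all("loc")]


def save_status_csv(urls, path="sitemap_status.csv"):
    """Write one (url, status) row per URL; errors are stored as text."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        for url in urls:
            try:
                # HEAD avoids downloading the body; follow redirects so a
                # 301 -> 200 chain reports the final status.
                status = requests.head(url, allow_redirects=True, timeout=10).status_code
            except requests.RequestException as exc:
                status = type(exc).__name__
            writer.writerow([url, status])


# Usage (network required; sitemap URL is a placeholder):
#   xml = requests.get("https://site.com/postsitemap.xml", timeout=10).text
#   save_status_csv(sitemap_urls(xml))
```

Note that some servers reject HEAD requests; swapping `requests.head` for `requests.get(url, stream=True, ...)` is a common fallback.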