Python Forum
Sitemap.xml and pull URLs and get response code - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: News and Discussions (https://python-forum.io/forum-31.html)
+--- Thread: Sitemap.xml and pull URLs and get response code (/thread-26168.html)



Sitemap.xml and pull URLs and get response code - motunyabgu - Apr-22-2020

What I want to do is feed in a site's sitemap.xml (the one listed in robots.txt). The script should then pull out all the URLs, get the response code for each of them, and save the results to CSV.

I can do part of this in a different way, but I can't save to CSV or get the response code for each URL individually.

Also, sitemaps generated by plugins like All in One SEO are structured differently from a manually created sitemap.
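
As far as I can tell, the difference is that plugin sitemaps such as All in One SEO's are usually sitemap index files, whose <loc> entries point to child sitemaps rather than to pages, while a hand-made sitemap is usually a flat urlset. Something like the sketch below (my own attempt, not tested against All in One SEO) might handle both shapes:

import requests
from bs4 import BeautifulSoup


def collect_urls(sitemap_url):
    """Return page URLs from either a flat urlset or a sitemap index."""
    soup = BeautifulSoup(requests.get(sitemap_url).content, 'html.parser')
    # A sitemap index nests <loc> inside <sitemap>; a normal sitemap uses <url>
    children = [s.find('loc').text for s in soup.find_all('sitemap')]
    if children:
        urls = []
        for child in children:
            urls.extend(collect_urls(child))  # recurse into each child sitemap
        return urls
    return [u.find('loc').text for u in soup.find_all('url')]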


The code I use:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://site.com/postsitemap.xml'
page = requests.get(url)
print('Sitemap response code: %s' % page.status_code)

# Parse the sitemap XML and collect the <loc> entries
sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))
urls = [element.text for element in sitemap_index.find_all('loc')]

# Build a DataFrame of links and last-modified dates from the <url> entries
data = [[u.find('loc').text, u.find('lastmod').text if u.find('lastmod') else '']
        for u in sitemap_index.find_all('url')]
print('Sitemap URL count:', len(data))
df = pd.DataFrame(data, columns=['links', 'lastmod'])

# Append the sorted URLs to a text file
with open('sitemap.txt', 'a+') as d:
    for link in sorted(urls):
        d.write(link + '\n')

with open('sitemap.txt') as f:
    print(f.read())
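
For the missing part, this is roughly what I'm aiming for (a rough sketch only; the output file name sitemap_status.csv is just a placeholder): read the saved URLs back in, request each one, and write its status code to a CSV with the csv module.

import csv
import requests


# Read the URLs collected above, request each one, and record its status code
with open('sitemap.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with open('sitemap_status.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['url', 'status_code'])
    for link in urls:
        try:
            resp = requests.get(link, timeout=10)
            writer.writerow([link, resp.status_code])
        except requests.RequestException:
            # Log unreachable URLs instead of stopping the whole run
            writer.writerow([link, 'error'])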