Need help with BeautifulSoup
Python Forum (https://python-forum.io), Web Scraping & Web Development, thread /thread-2942.html
Need help with BeautifulSoup - mor2k - Apr-20-2017

Hey! My code runs fine, but the output is not what I want. Here's my code:

    from bs4 import BeautifulSoup
    from urllib import urlopen
    import os

    os.chdir('C:/Users/yuvi/Desktop')
    html = urlopen('the url is in here. i will post it in a comment because i cant post clickable links on my first post')
    soup = BeautifulSoup(html)
    with open('LOA.txt', 'w') as f:
        for section in soup.findAll('a', {'class':'s xst'}):
            f.write('{}'.format(section) + '\n')

This code should take all the posts from the first page of the LOA forum and write them to a text file. But the file ends up with the HTML of each link, not just the text. How can I strip out all the 'HTML code' and keep only the title of each post? Thanks!

The URL for the urlopen call:

    html = urlopen('http://community.gtarcade.com/forum/2036-1.html')

RE: Need help with BeautifulSoup - snippsat - Apr-20-2017

(Apr-20-2017, 05:27 PM)mor2k Wrote: how i can take off all the 'html code' and stay only with the title of the post?
    from bs4 import BeautifulSoup

    html = '''\
    <a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
    <em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
    <a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
    <em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>'''

    soup = BeautifulSoup(html, 'html.parser')
    post = soup.find_all('a')
    # .text returns only the text inside each tag, without the markup
    for title in post:
        print(title.text.strip())

Use Requests when reading a site, rather than urllib. E.g.:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://community.gtarcade.com/forum/2036-1.html'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'html.parser')
    print(soup.find('title').text)
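Putting the question and the answer together, here is a minimal sketch (not posted in the thread) that keeps the original `{'class': 's xst'}` filter, takes only the text of each link, and writes one title per line. The helper names `extract_titles` and `save_titles` are made up for illustration, the sample markup is trimmed from the reply, and the extra `class="other"` link is a hypothetical addition to show that the filter skips unrelated links.

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the reply in this thread; the last link is a
# hypothetical non-title link added to show the class filter at work.
SAMPLE_HTML = '''\
<a class="s xst" href="thread/314240-1-1.html">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>
<a class="other" href="#">Not a thread title</a>'''


def extract_titles(html):
    """Return the stripped text of every <a class="s xst"> link in html."""
    soup = BeautifulSoup(html, 'html.parser')
    # Same filter as the original question: links whose class is "s xst".
    links = soup.find_all('a', {'class': 's xst'})
    # get_text() drops the tags and keeps only the text inside them.
    return [link.get_text().strip() for link in links]


def save_titles(titles, path):
    """Write one title per line to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as f:
        for title in titles:
            f.write(title + '\n')


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

To try it on the live page, something like `extract_titles(requests.get(url).content)` followed by `save_titles(titles, 'LOA.txt')` should work, assuming the page still marks thread links with the `s xst` class. Note that the text of the nested `<em>` tag ("Event") is part of each title, as in the reply's output.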