Python Forum

Need help with BeautifulSoup
Hey!
My code runs without errors, but the output is not what I want.
Here's my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`
import os

os.chdir('C:/Users/yuvi/Desktop')

html = urlopen('the url is in here.  i will post it in a comment because i cant post clickable links on my first post')
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

with open('LOA.txt', 'w') as f:
    # every thread link on the page carries the classes "s xst"
    for section in soup.find_all('a', {'class': 's xst'}):
        f.write('{}'.format(section) + '\n')
This code should take all the posts from the first page of the LOA forum and write them to a text file.
But this is what ends up in the file:
Output:
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;"> <em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;"> <em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>
How can I strip out all the HTML and keep only the title of each post?
Thanks!

html = urlopen('http://community.gtarcade.com/forum/2036-1.html')

(Apr-20-2017, 05:27 PM) mor2k wrote: How can I strip out all the HTML and keep only the title of each post?
from bs4 import BeautifulSoup

html ='''\
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>'''

soup = BeautifulSoup(html, 'html.parser')
posts = soup.find_all('a')
for title in posts:
    # .text returns only the text inside the tag, without the markup
    print(title.text.strip())
Output:
Event[NEW]Preview of New Version on April 20th: New Amulet Is Added!
Event[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!
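The output above still carries the "Event" label that sits in the <em> tag inside each link. If only the title itself is wanted, one possible approach is to remove the <em> child before reading the text. This is a sketch rather than part of the original answer: the function name titles_without_labels and the sample HTML are made up for illustration, and the 'a.s.xst' selector assumes the links keep the classes shown in the question.

```python
from bs4 import BeautifulSoup


def titles_without_labels(html):
    """Return the text of each <a class="s xst"> link, minus any <em> label."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    # 'a.s.xst' matches <a> tags that carry both the "s" and "xst" classes
    for link in soup.select('a.s.xst'):
        # Drop the <em>...</em> children (e.g. the "Event" label) from the tree
        for label in link.find_all('em'):
            label.decompose()
        titles.append(link.get_text().strip())
    return titles


# Hypothetical sample in the same shape as the forum markup
sample = '''\
<a class="s xst" href="thread/1-1-1.html">
<em class="youzu_none">Event</em>[NEW]Example title</a>'''

for title in titles_without_labels(sample):
    print(title)
```

With the sample above this prints "[NEW]Example title". Note that decompose() changes the parsed tree, so read anything else you need from the <em> tag before calling it.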
Use Requests rather than urllib when reading a site.
E.g.:
import requests
from bs4 import BeautifulSoup

url = 'http://community.gtarcade.com/forum/2036-1.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
Output:
News and Announcements-League of Angels Forum
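Putting the two answers together, the original task (thread titles into a text file) could look roughly like the sketch below. The helper names fetch_html, titles_from_html and write_titles are hypothetical, the 10-second timeout is an arbitrary choice, and the 'a.s.xst' selector assumes the page still marks thread links with the classes from the question.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def titles_from_html(html):
    """Return the visible text of every <a> tag with classes "s" and "xst"."""
    soup = BeautifulSoup(html, 'html.parser')
    return [link.text.strip() for link in soup.select('a.s.xst')]


def write_titles(titles, path):
    """Write one title per line to a text file."""
    with open(path, 'w', encoding='utf-8') as f:
        for title in titles:
            f.write(title + '\n')
```

It would be called as write_titles(titles_from_html(fetch_html(url)), 'LOA.txt'), with url set as in the example above. Keeping the download separate from the parsing means the parsing step can be tried on a saved HTML string without hitting the site.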