Python Forum
Need help with BeautifulSoup
#1
Hey!
My code runs fine, but the output isn't what I want.
Here's my code:
from bs4 import BeautifulSoup
from urllib import urlopen
import os

os.chdir('C:/Users/yuvi/Desktop')

html = urlopen('the url is in here.  i will post it in a comment because i cant post clickable links on my first post')
soup = BeautifulSoup(html)

with open('LOA.txt', 'w') as f:
   for section in soup.findAll('a', {'class':'s xst'}):
       f.write('{}'.format(section) + '\n')
This code should take all the posts from the first page of the LOA forum and write them to a text file.
But this is what ends up in the file:
Output:
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;"> <em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;"> <em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>
How can I strip out all the HTML and keep only the title of each post?
Thanks!

html = urlopen('http://community.gtarcade.com/forum/2036-1.html')
#2
(Apr-20-2017, 05:27 PM) mor2k wrote: How can I strip out all the HTML and keep only the title of each post?
from bs4 import BeautifulSoup

html ='''\
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>'''

soup = BeautifulSoup(html, 'html.parser')
post = soup.find_all('a')
for title in post:
   print(title.text.strip())
Output:
Event[NEW]Preview of New Version on April 20th: New Amulet Is Added!
Event[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!
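If you don't want the "Event" label in front of each title, one option is to remove the <em> tag from each link before reading its text. This is only a minimal sketch: it assumes the label always sits in an <em> inside the link, and titles_without_label is a name made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output in post #1.
SAMPLE_HTML = '''\
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>'''


def titles_without_label(html):
    """Hypothetical helper: return link texts with any <em> label removed."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    # Match the same links as post #1: <a> tags whose class attribute is "s xst".
    for link in soup.find_all('a', class_='s xst'):
        for label in link.find_all('em'):
            label.decompose()  # drop the <em> tag and its text from the tree
        titles.append(link.get_text(strip=True))
    return titles


for title in titles_without_label(SAMPLE_HTML):
    print(title)
```

With the sample markup this prints the two titles starting at [NEW] and [HOT], without the leading "Event".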
Use Requests rather than urllib to fetch a page.
For example:
import requests
from bs4 import BeautifulSoup

url = 'http://community.gtarcade.com/forum/2036-1.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)
Output:
News and Announcements-League of Angels Forum
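Putting the pieces together for the original question, something like the following should get the titles into a text file. Treat it as a rough sketch: fetch_html, extract_titles and save_titles are hypothetical helper names, and it assumes the thread links on the page still use class="s xst" as in post #1.

```python
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with Requests and return its raw bytes."""
    import requests  # third-party; imported here so the parsing helpers work without it
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.content


def extract_titles(html):
    """Return the visible text of every <a class="s xst"> link."""
    soup = BeautifulSoup(html, 'html.parser')
    return [link.get_text(strip=True)
            for link in soup.find_all('a', class_='s xst')]


def save_titles(titles, path):
    """Write one title per line to a text file."""
    with open(path, 'w', encoding='utf-8') as f:
        for title in titles:
            f.write(title + '\n')
```

Usage would be along the lines of save_titles(extract_titles(fetch_html('http://community.gtarcade.com/forum/2036-1.html')), 'LOA.txt'). Keeping the download separate from the parsing makes it easy to try extract_titles on a saved HTML string first.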