Need help with BeautifulSoup - Printable Version

Need help with BeautifulSoup - mor2k - Apr-20-2017

Hey!
My code runs without errors, but the output is not what I want.
Here's my code:

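from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was "from urllib import urlopen"
import os

os.chdir('C:/Users/yuvi/Desktop')

html = urlopen('the url is in here.  i will post it in a comment because i cant post clickable links on my first post')
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, 'html.parser')

with open('LOA.txt', 'w', encoding='utf-8') as f:
    # find_all is the current spelling of findAll.
    for section in soup.find_all('a', {'class': 's xst'}):
        # Formatting the tag itself writes its full HTML, not just its text.
        f.write('{}'.format(section) + '\n')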

This code should take all the post titles from the first page of the LOA forum and write them to a text file.
But this is what ends up in the file:

Output:
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>

How can I strip out all the HTML and keep only the title of each post?
Thanks!

html = urlopen('http://community.gtarcade.com/forum/2036-1.html')

RE: Need help with BeautifulSoup - snippsat - Apr-20-2017

(Apr-20-2017, 05:27 PM)mor2k Wrote: How can I strip out all the HTML and keep only the title of each post?

from bs4 import BeautifulSoup

html ='''\
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>'''

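soup = BeautifulSoup(html, 'html.parser')
# Collect every <a> tag, then print only its text content.
posts = soup.find_all('a')
for title in posts:
    # .text drops the markup; strip() removes the leading newline.
    print(title.text.strip())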

Output:
Event[NEW]Preview of New Version on April 20th: New Amulet Is Added!
Event[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!
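
Each line of the output above still begins with "Event", which is the text of the nested <em> tag rather than part of the title. If only the title is wanted, one possible approach is to remove the <em> element before reading the text. This is a sketch and not part of the original reply: the helper name titles_without_label is made up for illustration, and it assumes every label sits in an <em> inside the link, as in the sample markup.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question.
html = '''\
<a class="s xst" href="thread/314240-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[NEW]Preview of New Version on April 20th: New Amulet Is Added!</a>
<a class="s xst" href="thread/314233-1-1.html" onclick="atarget(this)" style="font-weight: bold;color: #EE1B2E;">
<em class="youzu_none">Event</em>[HOT]Cross-server Resource Tycoon: New Hero Celestial Blade Shows Up!</a>'''


def titles_without_label(markup):
    """Hypothetical helper: return link titles with the <em> label removed."""
    soup = BeautifulSoup(markup, 'html.parser')
    titles = []
    for link in soup.find_all('a', class_='s xst'):
        label = link.find('em')
        if label is not None:
            # Remove the <em> tag and its text from the tree.
            label.decompose()
        # strip=True trims whitespace and drops whitespace-only strings.
        titles.append(link.get_text(strip=True))
    return titles


for title in titles_without_label(html):
    print(title)
```

If the label should be kept but separated from the title, link.get_text(' ', strip=True) on the unmodified tag joins the pieces with a space instead.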

Use Requests rather than urllib when fetching a site.
E.g.:

import requests
from bs4 import BeautifulSoup

url = 'http://community.gtarcade.com/forum/2036-1.html'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
print(soup.find('title').text)

Output:
News and Announcements-League of Angels Forum
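
Putting the two pieces of this reply together, the original goal (writing the thread titles to a text file) might look roughly like the sketch below. The function names extract_titles and save_titles are hypothetical, the 10-second timeout is an arbitrary choice, and the code assumes the page still marks thread links with the class "s xst" as in the question.

```python
import requests
from bs4 import BeautifulSoup


def extract_titles(markup):
    """Return the text of every <a class="s xst"> link in the markup."""
    soup = BeautifulSoup(markup, 'html.parser')
    return [link.get_text(strip=True)
            for link in soup.find_all('a', class_='s xst')]


def save_titles(url, path):
    """Fetch url, extract the thread titles, and write one per line to path."""
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    titles = extract_titles(response.content)
    with open(path, 'w', encoding='utf-8') as f:
        for title in titles:
            f.write(title + '\n')
    return titles
```

A call such as save_titles('http://community.gtarcade.com/forum/2036-1.html', 'LOA.txt') would then replace the loop from the question. Keeping the parsing in its own function means it can be tried on a saved HTML string without touching the network.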