Python Forum
html to text problem
#1
I am very new to python and this forum. I apologize for any newbie mistakes.

I am trying to convert html to text. When I do this, it is moving everything after the first tab to a new line.

For example, instead of:

****************************
November 5, 2008
****************************

it saves:

****************************
November 5,
2008
****************************

The current code is:

    soup=BeautifulSoup(download_target.text, 'html.parser')
    f_text=soup.get_text()
    text_file = open(file_loc+"\\"+url_rename[2]+"\\"+url_rename[3]+"\\"+url_rename[1]+".txt","w")
    text_file.write(str(f_text.encode('ascii', errors='ignore')).replace("\n", "\r\n").replace("\\n", "\r\n").replace("\\t", ""))
I think the fix has to do with:

.join(f_text.splitlines())
But when I use "f_text" in the splitlines command, I get an error.

  
text_file.write(str(f_text.encode('ascii', errors='ignore'))).replace("\\n", "\r\n").replace("\n", "\r\n").replace("\\t", "\t").join(f_text.splitlines())
AttributeError: 'int' object has no attribute 'replace'

Any help is much appreciated. Thanks in advance.
#2
soup.get_text() will be a problem if you do this on a medium to large site.
Websites are not meant to be returned as plain text in one piece; the usual way is to parse only the text you need, not the whole site.
Here is a post where I use html2text (more specialized for this task) and text=True in BS.
#3
Snippsat, thanks a lot for the help.

It is not a large site or file though.

Your method removes some important formatting that I need to keep, so everything is combined and no longer on the correct line.

Any other ideas?
#4
(Apr-27-2018, 05:15 PM)Kyle Wrote:
text_file.write(str(f_text.encode('ascii', errors='ignore'))).replace("\\n", "\r\n").replace("\n", "\r\n").replace("\\t", "\t").join(f_text.splitlines())
AttributeError: 'int' object has no attribute 'replace'
Look at your parentheses. You're calling replace() on the return value of text_file.write(), which is an int (the number of characters written).
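A minimal sketch of one way to restructure that line, assuming the goal is ASCII-only text with Windows line endings. The helper name to_crlf_ascii and the sample string are made up for illustration and are not from the original code. It decodes the bytes back to str instead of calling str() on them (which gives a "b'...'" repr full of literal backslash sequences), and it builds the whole string before anything is handed to write():

```python
def to_crlf_ascii(text):
    """Drop non-ASCII characters and join the lines with CRLF."""
    # encode() returns bytes; decode() turns them back into a str.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # splitlines() splits on \n, \r\n and \r (plus a few rarer boundaries).
    return "\r\n".join(ascii_text.splitlines())


# Hypothetical stand-in for the value returned by soup.get_text().
f_text = "November 5, 2008\nSecond line \u2013 done\n"
cleaned = to_crlf_ascii(f_text)
print(repr(cleaned))
```

The result would then go to the file with text_file.write(cleaned). On Windows, opening the file with newline="" should keep Python from translating the "\r\n" a second time.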
#5
(Apr-27-2018, 07:55 PM)Kyle Wrote: Any other ideas?
I would have to look at the URL, or you would have to post the raw HTML and the output you want.

Consider whether you can just parse it the normal way.
For example, if the text you want is in a <p> tag, just parse what's inside the <p>.
from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
  <head>
    <title>HTML p Tag</title>
  </head>
  <body>
    <p>This paragraph is defined using the HTML p<br />
       A new line<br />
       Another new line<br />
    </p>
  </body>
</html>'''
soup = BeautifulSoup(html, 'lxml')
Test:
>>> p = soup.find('p')
>>> p
<p>This paragraph is defined using the HTML p<br/>
      A new line<br/>
      Another new line<br/>
</p>

>>> # .text drops the <br/> tags; the \n comes from the line breaks in the source HTML
>>> p = soup.find('p').text
>>> p
('This paragraph is defined using the HTML p\n'
 '      A new line\n'
 '      Another new line\n')

>>> print(p)
This paragraph is defined using the HTML p
      A new line
      Another new line

>>> # Can clean a little more
>>> for line in p.split('\n'):
...     print(line.lstrip())
     
This paragraph is defined using the HTML p
A new line
Another new line
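The clean-up loop above could also be wrapped in a small function so the result can be written to a file instead of printed. This is only a sketch: clean_lines is a hypothetical name, and unlike the loop it uses strip() rather than lstrip() and drops lines that end up empty, which may or may not be what you want for your data.

```python
def clean_lines(text):
    """Strip each line and drop lines that are empty after stripping."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Same string as the .text result shown above.
p = (
    "This paragraph is defined using the HTML p\n"
    "      A new line\n"
    "      Another new line\n"
)
print(clean_lines(p))
```

If blank lines carry meaning in the target file, remove the `if line` filter so they are kept.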

