Python Forum

html to text problem
I am very new to Python and this forum. I apologize for any newbie mistakes.

I am trying to convert html to text. When I do this, it is moving everything after the first tab to a new line.

For example, instead of:

****************************
November 5, 2008
****************************

it saves:

****************************
November 5,
2008
****************************

The current code is:

    soup=BeautifulSoup(download_target.text, 'html.parser')
    f_text=soup.get_text()
    text_file = open(file_loc+"\\"+url_rename[2]+"\\"+url_rename[3]+"\\"+url_rename[1]+".txt","w")
    text_file.write(str(f_text.encode('ascii', errors='ignore')).replace("\n", "\r\n").replace("\\n", "\r\n").replace("\\t", ""))
I think the fix has to do with:

.join(f_text.splitlines())
But when I use "f_text" in the splitlines command, I get an error.

  
text_file.write(str(f_text.encode('ascii', errors='ignore'))).replace("\\n", "\r\n").replace("\n", "\r\n").replace("\\t", "\t").join(f_text.splitlines())
AttributeError: 'int' object has no attribute 'replace'

Any help is much appreciated. Thanks in advance.
soup.get_text() will be a problem if you do this on a medium-to-large site.
Websites are not meant to be returned entirely as text; the usual way is to parse only the text you need (not the whole site).
Here is a post where I use html2text (more specialized for this task) and text=True in BS.
Snippsat, thanks a lot for the help.

It is not a large site or file though.

Your method removes some of the formatting that I need to keep, so everything is combined and no longer on the correct line.

Any other ideas?
(Apr-27-2018, 05:15 PM)Kyle Wrote:
text_file.write(str(f_text.encode('ascii', errors='ignore'))).replace("\\n", "\r\n").replace("\n", "\r\n").replace("\\t", "\t").join(f_text.splitlines())
AttributeError: 'int' object has no attribute 'replace'
Look at your parentheses. You're calling replace() on the return value of text_file.write(), which is an int.
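
To illustrate the point about the parentheses: in Python 3, write() returns the number of characters written, so anything chained after its closing parenthesis operates on that number rather than on the string. A minimal sketch, assuming an in-memory io.StringIO buffer in place of the real file and a made-up f_text value (neither is from the original post):

```python
import io

# Made-up sample standing in for soup.get_text(); not the poster's real data.
f_text = "November 5,\n2008\n"

buffer = io.StringIO()  # in-memory stand-in for the opened text file

# write() returns the number of characters written, which is an int.
written = buffer.write(f_text)
print(type(written).__name__)

# Chaining .replace() after write(...) therefore fails, as in the traceback.
error_message = None
try:
    buffer.write(f_text).replace("\n", "\r\n")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# Do all string processing first, then pass the finished string to write().
cleaned = f_text.replace("\n", "\r\n")
fixed = io.StringIO()
fixed.write(cleaned)
```

The same ordering applies to the real file: build the final string with replace() (and any join()) first, and call text_file.write() on the result last.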
(Apr-27-2018, 07:55 PM)Kyle Wrote: Any other ideas?
I'd have to see the URL, or you have to post the raw HTML and the output you want.

Consider whether you can just parse the normal way.
For example, if the text you want is in a <p> tag, then just parse what's inside <p>.
from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
  <head>
    <title>HTML p Tag</title>
  </head>
  <body>
    <p>This paragraph is defined using the HTML p<br />
       A new line<br />
       Another new line<br />
    </p>
  </body>
</html>'''
soup = BeautifulSoup(html, 'lxml')
Test:
>>> p = soup.find('p')
>>> p
<p>This paragraph is defined using the HTML p<br/>
      A new line<br/>
      Another new line<br/>
</p>

>>> # Using .text, the newline after each <br/> in the source is kept as \n
>>> p = soup.find('p').text
>>> p
('This paragraph is defined using the HTML p\n'
 '      A new line\n'
 '      Another new line\n')

>>> print(p)
This paragraph is defined using the HTML p
      A new line
      Another new line

>>> # Can clean a little more
>>> for line in p.split('\n'):
...     print(line.lstrip())
     
This paragraph is defined using the HTML p
A new line
Another new line
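
Tying this back to the original question, here is a sketch of writing the cleaned lines to disk without the str(f_text.encode(...)) round trip. Calling str() on a bytes object gives its b'...' repr, which is where the literal \\n and \\t sequences in the original code come from. The sample text, the temp-directory path, and the file name out.txt are assumptions for illustration, not the poster's real data:

```python
import os
import tempfile

# Sample text standing in for soup.find('p').text (hypothetical data).
text = (
    "This paragraph is defined using the HTML p\n"
    "      A new line\n"
    "      Another new line\n"
)

# Strip leading whitespace per line, dropping lines that end up empty.
lines = [line.lstrip() for line in text.split("\n") if line.strip()]

# Hypothetical output location; a temp directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

# newline="\r\n" makes Python translate each "\n" to "\r\n" on write,
# and errors="ignore" drops characters that cannot be encoded as ASCII.
with open(out_path, "w", encoding="ascii", errors="ignore", newline="\r\n") as text_file:
    text_file.write("\n".join(lines) + "\n")

# Read back in binary mode to inspect the exact bytes on disk.
with open(out_path, "rb") as check:
    raw = check.read()
print(raw)
```

Letting open() handle the encoding and line endings means the text stays a str until it is written, so no manual replace("\n", "\r\n") calls are needed.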