html to text problem

Kyle · Apr-27-2018, 05:15 PM

I am very new to Python and this forum. I apologize for any newbie mistakes.

I am trying to convert html to text. When I do this, it is moving everything after the first tab to a new line.

For example, instead of:

****************************
November 5, 2008
****************************

it saves:

****************************
November 5,
2008
****************************

The current code is:

    soup=BeautifulSoup(download_target.text, 'html.parser')
    f_text=soup.get_text()
    text_file = open(file_loc+"\\"+url_rename[2]+"\\"+url_rename[3]+"\\"+url_rename[1]+".txt","w")
    text_file.write(str(f_text.encode('ascii', errors='ignore')).replace("\n", "\r\n").replace("\\n", "\r\n").replace("\\t", ""))

I think the fix has to do with:

.join(f_text.splitlines())

But when I use "f_text" in the splitlines command, I get an error.

  
text_file.write(str(f_text.encode('ascii', errors='ignore'))).replace("\\n", "\r\n").replace("\n", "\r\n").replace("\\t", "\t").join(f_text.splitlines())

AttributeError: 'int' object has no attribute 'replace'

Any help is much appreciated. Thanks in advance.

***snippsat*** · (This post was last modified: Apr-27-2018, 06:15 PM by snippsat.)

soup.get_text() will be a problem if you do this on a medium to large site.
Websites are not meant to be returned entirely as text; the usual way is to parse only the text you need (not the whole site).
Here is a post where I use html2text (more specialized for this task) and text=True in BS.

Kyle · Apr-27-2018, 07:55 PM

Snippsat, thanks a lot for the help.

It is not a large site or file though.

Your method removes some of the formatting that is important, which I need to keep. So everything is combined and not on the correct line any longer.

Any other ideas?

**nilamo** · Apr-27-2018, 08:27 PM

(Apr-27-2018, 05:15 PM)Kyle Wrote:
text_file.write(str(f_text.encode('ascii', errors='ignore'))).replace("\\n", "\r\n").replace("\n", "\r\n").replace("\\t", "\t").join(f_text.splitlines())
AttributeError: 'int' object has no attribute 'replace'

Look at your parentheses. You're calling replace() on the return value of text_file.write(), which is an int (the number of characters written), not a string.
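A minimal sketch of one way to untangle that line: finish all the string work first, then pass the result to write(). The sample f_text and the io.StringIO stand-in for the real file are assumptions made so the snippet runs on its own. It also decodes the bytes back to str, since str() on a bytes object gives its repr ("b'...'" with literal \n sequences), which is likely why the "\\n" replacements seemed necessary.

```python
import io

# Hypothetical stand-in for the result of soup.get_text().
f_text = "November 5, 2008\n\tsome text\n"

# 1. Build the final string first. encode() returns bytes, and str() on
#    bytes gives the repr ("b'...'"), so decode back to a real str instead.
cleaned = f_text.encode('ascii', errors='ignore').decode('ascii')
cleaned = cleaned.replace("\t", "")

# 2. Then hand the finished string to write(). write() returns the number
#    of characters written (an int), so nothing can be chained after it.
text_file = io.StringIO()  # stand-in for open(..., "w")
written = text_file.write(cleaned)
```

With a real file opened in text mode ("w"), Python already translates "\n" to the platform line ending on write, so the explicit replace("\n", "\r\n") should not be needed.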

***snippsat*** · (This post was last modified: Apr-27-2018, 09:02 PM by snippsat.)

(Apr-27-2018, 07:55 PM)Kyle Wrote: Any other ideas?

I would have to look at the URL, or you would have to post the raw HTML and the output you want.

Think about whether you can just parse it the normal way.
For example, if a <p> tag holds the text you want, then just parse what's inside <p>.

from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
  <head>
    <title>HTML p Tag</title>
  </head>
  <body>
    <p>This paragraph is defined using the HTML p<br />
       A new line<br />
       Another new line<br />
    </p>
  </body>
</html>'''
soup = BeautifulSoup(html, 'lxml')

Test:

>>> p = soup.find('p')
>>> p
<p>This paragraph is defined using the HTML p<br/>
      A new line<br/>
      Another new line<br/>
</p>

>>> # With .text the br tags are dropped; the \n comes from the line breaks in the source
>>> p = soup.find('p').text
>>> p
('This paragraph is defined using the HTML p\n'
 '      A new line\n'
 '      Another new line\n')

>>> print(p)
This paragraph is defined using the HTML p
      A new line
      Another new line

>>> # Can clean a little more
>>> for line in p.split('\n'):
...     print(line.lstrip())
     
This paragraph is defined using the HTML p
A new line
Another new line
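If the cleanup needs to be reused, the loop above could be wrapped in a small helper. This is only a sketch: tidy_lines is a made-up name, and it works on the plain string that .text returns, so it needs nothing beyond the standard library. Apart from stripping the leading spaces, it also collapses runs of spaces or tabs inside a line and drops empty lines.

```python
def tidy_lines(text):
    """Collapse whitespace inside each line and drop empty lines."""
    # split() with no arguments splits on any run of whitespace,
    # so ' '.join(...) leaves single spaces and no leading/trailing ones.
    lines = (' '.join(line.split()) for line in text.splitlines())
    return '\n'.join(line for line in lines if line)


# The string that soup.find('p').text returned in the example above.
p_text = ('This paragraph is defined using the HTML p\n'
          '      A new line\n'
          '      Another new line\n')
result = tidy_lines(p_text)
print(result)
```

Note that this only cleans each line separately; it does not rejoin text that the HTML source itself breaks across two lines.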
