Python Forum

Full Version: [SOLVED] [Beautiful Soup] How to deprettify?
Hello,

I made the mistake of using soup.prettify() to save soups to files, and the files now contain extra whitespace that shows up as stray spaces when viewed in a WYSIWYG HTML editor.

The following code fails to remove that extra whitespace.

Before I write a Python script to run the files through Tidy instead, does someone know if it can be fixed with BS?

Thank you.

import glob
from pathlib import Path
from bs4 import BeautifulSoup

for file in glob.glob("*.html"):
	BASE = Path(file).stem
	OUTPUTFILE = fr"{BASE}.CONV.html" 
	
	soup = BeautifulSoup(open(file,"br"),"lxml")
	for tag in soup.find_all():
		if tag.string:
			tag.string.replace_with(' '.join(tag.string.split()))
			print(tag.string)
		else:
			print(tag.name, " no string")
			pass

	with open(OUTPUTFILE, 'w', encoding='utf-8') as outp:
		outp.write(str(soup))
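One likely reason the loop above leaves whitespace behind: tag.string is None for any tag with more than one child, so the whitespace-only strings that prettify() placed between tags are never visited. A minimal sketch of that behaviour, assuming a made-up fragment and the built-in "html.parser" (chosen here only so the example does not need lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical prettified fragment, not taken from the original files.
html = "<div>\n <p>\n  blue car\n </p>\n <p>\n  red car\n </p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# <div> has several children (whitespace strings and two <p> tags),
# so .string is None and the loop takes the "no string" branch.
div_string = soup.div.string

# <p> has exactly one child string, so .string returns it, whitespace included.
p_string = soup.p.string

# The whitespace-only strings directly inside <div> are never reached via tag.string.
skipped = [s for s in soup.div.find_all(string=True, recursive=False) if not s.strip()]

print(div_string, repr(p_string), len(skipped))
```

Iterating over soup.find_all(string=True) instead of over tags visits every string, including those whitespace-only ones.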
To show the problem.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print('-' * 25)
print(str(soup))
Output:
<body>
 <h1>
  This is a Heading
 </h1>
 <p>
  This is a paragraph
 </p>
 <p>
  blue car
 </p>
</body>
-------------------------
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>
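To see why this matters once the prettified text is saved and read back, here is a small sketch (assuming the built-in "html.parser" and a one-paragraph document made up for illustration). The line breaks and indentation that prettify() adds become part of the string when the file is parsed again:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>blue car</p>", "html.parser")

# prettify() puts the text on its own indented line
pretty = soup.prettify()

# Parsing the prettified text again keeps that whitespace inside the string
reparsed = BeautifulSoup(pretty, "html.parser")

original_text = soup.p.string
reparsed_text = reparsed.p.string

print(repr(original_text))
print(repr(reparsed_text))
```

The second repr shows leading and trailing whitespace around the text, which is what ends up as visible spaces in an editor.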
So the extra newlines from prettify() are annoying (I tried to fix that a long time ago); for now, there are the workarounds below.
The easy fix is to use an online HTML formatter, e.g. Code Beautify.
Or install Prettier, which has a command-line tool: for example, prettier --write . formats all HTML files in a folder.
G:\div_code\html_file
λ prettier --write .
h1.html 170ms
h2.html 5ms
Then the output from both BS options above will be correctly formatted HTML.
Output:
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>
Thank you.
For others' benefit, here's how to do it in Beautiful Soup:

import os
import glob
import shutil
from bs4 import BeautifulSoup

ROOT = r"c:\temp"
os.chdir(ROOT)
for file in glob.glob("*.html"):
	print("Handling ", file)

	#save original file
	ORIGFILE = fr"{file}.orig"
	#grab original times
	mtime = os.stat(file).st_mtime
	atime = os.stat(file).st_atime
	tup = (atime, mtime)
	dest = shutil.copyfile(file, ORIGFILE)
	os.utime(ORIGFILE, tup)

	#Remove all newlines (read as UTF-8 to match the encoding used when writing)
	with open(file, "r", encoding="utf-8") as f:
		dna = f.read().replace("\n", "")

	#trim each string
	soup = BeautifulSoup(dna,"lxml")
	for s in soup.find_all(string=True):
		s.replace_with(s.text.strip())
	#save soup back to file
	with open(file, 'w', encoding='utf-8') as outp:
		outp.write(str(soup))
	#Must close before updating time
	os.utime(file, tup)
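A variant of the same idea, wrapped in a function so it can be tried on a string before touching any files. This is only a sketch under a few assumptions: the name deprettify is made up, "html.parser" is used so it runs without lxml, and instead of deleting newlines it collapses each run of whitespace to a single space, so words split across lines are not glued together. Comments, doctypes and anything inside pre are left alone. Like the script above, it also removes spaces next to inline tags such as b or a, so check the result on pages with inline markup.

```python
from bs4 import BeautifulSoup, NavigableString


def deprettify(html: str, parser: str = "html.parser") -> str:
    """Collapse the whitespace that prettify() added around text (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    for s in soup.find_all(string=True):
        # Only touch plain text; skip comments, doctypes and other special strings
        if type(s) is not NavigableString:
            continue
        # Whitespace inside <pre> is significant, so leave it untouched
        if s.find_parent("pre") is not None:
            continue
        # split() drops leading/trailing whitespace and collapses inner runs
        s.replace_with(" ".join(s.split()))
    return str(soup)


sample = "<body>\n <h1>\n  This is a Heading\n </h1>\n <p>\n  blue   car\n </p>\n</body>\n"
result = deprettify(sample)
print(result)
```

Passing parser="lxml" should work as well, but lxml wraps fragments in html and body tags, so the output will differ for partial documents.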