Python Forum
How to print particular text areas from an HTML file (not site)
#1
I am new to Python and need some help printing particular text areas from an .html file. The file is at c:\temp\test.html

filename=input("type file path")
infile=open(filename,'r','utf-8')
data =infile.red(filename,'r','utf-8')
print(data)
Somehow I need to write something like this:
text = subtext("bla this is the text i wonna print bla bla", "this", "bla")
Reply
#2
My first attempt is to open this HTML file, saved at c:\temp\test.html,
and search for all the text areas starting with http and ending with .com

import re
import io
filename =input("type the file path:")
with io.open('filename','r','utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match("http","com",0,true,true) and len(line)==7:
            print(line)

got this error
Traceback (most recent call last):
File "C:/temp/testestlast.py", line 4, in <module>
with io.open('filename','r','utf-8') as f:
TypeError: an integer is required (got type str)
Reply
#3
After a quick glance at the docs for the io.open function, it appears that the third positional argument is "buffering", which requires an integer, so 'utf-8' is being passed in its place.
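To illustrate the point about the argument order, here is a minimal sketch. It assumes a throwaway file (the name demo.html and its contents are made up for the example) and shows that passing the encoding positionally raises TypeError, while passing it by keyword works:

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<p>Hello world</p>\n")

# The third positional parameter of open() is `buffering`, which must be
# an integer, so passing 'utf-8' there raises TypeError.
try:
    open(path, "r", "utf-8")
    positional_failed = False
except TypeError:
    positional_failed = True

# Passing the encoding by keyword works.
with open(path, "r", encoding="utf-8") as f:
    data = f.read()

print(positional_failed)
print(data)
```

In Python 3, io.open is an alias for the built-in open(), so the same applies to both.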
Reply
#4
First of all, post all of your code, traceback, output, etc. in their respective tags. See BBcode help for more info.

That said, you don't need to work directly with the io module; that is going too low level. Just use the built-in open() function to access the file, and use it with the with context manager, e.g.
with open('test.html') as f:
    # do something
Also, given that this is an HTML file, you may want to use a dedicated package (e.g. Beautiful Soup) to parse the content of the file. re (RegEx) is not the tool to use in this case.
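As a rough illustration of parsing instead of pattern matching, the sketch below collects link targets that start with http. Note that it uses the standard library's html.parser module rather than Beautiful Soup, a substitution made only to keep the example free of third-party installs. The class name LinkCollector and the sample markup are invented for the example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)


# Made-up markup: one absolute link and one relative link.
sample = (
    '<p>See <a href="http://example.com">this</a> and '
    '<a href="/local/page">that</a>.</p>'
)
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The parser only looks at real attributes of real tags, so text that merely happens to contain "http" elsewhere in the file is not picked up.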
Reply
#5
(Dec-07-2017, 12:18 PM)buran Wrote: First of all, post all of your code, traceback, output, etc. in their respective tags. See BBcode help for more info.

That said, you don't need to work directly with the io module; that is going too low level. Just use the built-in open() function to access the file, and use it with the with context manager, e.g.
with open('test.html') as f:
    # do something
Also, given that this is an HTML file, you may want to use a dedicated package (e.g. Beautiful Soup) to parse the content of the file. re (RegEx) is not the tool to use in this case.

I am pretty new to Python. So far I am able to print the contents of a txt file and find particular words. The question is that if the folder is html format then I can't open it at all. Perhaps I need Beautiful Soup to be able to get the output. My instructor's walkthrough is just wrong: infile = open("filename",'r') for text and infile = open("filename",'r','utf-8') for html
Reply
#6
Quote: The question is that if the folder is html format then I can't open it at all.

Folders have no format.

A text file that happens to have the .html extension is still just a text file, and is the same as any .txt file (or .py file for that matter).  You don't need to supply an encoding to open() to open an html file, unless the file is actually encoded in a special way (most of them are not, though).
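A small sketch of that point, assuming two throwaway files with invented names (page.html and page.txt) that hold the same ASCII text. Both are read back with a plain open() call and no encoding argument:

```python
import os
import tempfile

# Hypothetical files, created only for this example.
folder = tempfile.mkdtemp()
for name in ("page.html", "page.txt"):
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write("<p id='foo'>Hello world</p>")

# Read both back the same way; the extension makes no difference.
contents = {}
for name in ("page.html", "page.txt"):
    with open(os.path.join(folder, name)) as f:
        contents[name] = f.read()

print(contents["page.html"] == contents["page.txt"])
```

When no encoding is given, open() falls back to a platform-dependent default. That is fine for plain ASCII content like this, but if a file contains non-ASCII characters, passing encoding='utf-8' explicitly is the safer choice.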
Reply
#7
In my c:\temp\test.html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Title of the document</title>
  </head>
  <body>
    <p id='foo'>Hello world</p>
  </body>
</html>

from bs4 import BeautifulSoup

with open('C:/temp/test.html', encoding='utf-8') as f:
    html_file = f.read()

soup = BeautifulSoup(html_file, 'lxml')
print(soup.select('#foo')[0].text)
Output:
Hello world
Reply
#8
With your help I have managed to print the requested output using Beautiful Soup
from bs4 import BeautifulSoup
import urllib.request

import warnings
warnings.filterwarnings('ignore')

resp = urllib.request.urlopen("file:///C:/seminaria.html")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print (link['href'])
    print (link.string)
    print ("")
I am printing all URLs from seminaria.html and their names.
I was wondering if anyone has any tips on whether it is possible to print the same result without Beautiful Soup, using
def
Reply
#9
It's better to use with open() than urllib when reading from disk.
Without a parser, you can use string tools such as find() and slice out the text.
import urllib.request

resp = urllib.request.urlopen("file:///C:/Temp/test.html")
html_file = resp.read().decode('utf-8')

start = html_file.find("foo'>") + 5
end = html_file.find("</p")
print(html_file[start:end])
Output:
Hello world
def means that you should write a function for this.
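One possible shape for such a function, as a sketch only: the name text_between and its parameters are made up here, and it simply wraps the find() and slice approach so that it returns the text between the first start marker and the next end marker:

```python
def text_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None when either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    # Skip past the marker itself so it is not part of the result.
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Same markup as the test.html example, inlined to keep this self-contained.
html_file = "<body>\n    <p id='foo'>Hello world</p>\n</body>"
print(text_between(html_file, "foo'>", "</p"))  # Hello world
```

Using len(start_marker) instead of a hard-coded + 5 means the function keeps working when the marker changes, and searching for the end marker from start onward avoids matching an earlier closing tag.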
Reply
#10
:) Helpful tips so far, lads! Trying to make the function and I will post it
Reply