How to print particular text areas from an HTML file (not site)
#1
I am new to Python and need some help printing particular text areas from an .html file. The file is at c:\temp\test.html

filename=input("type file path")
infile=open(filename,'r','utf-8')
data =infile.red(filename,'r','utf-8')
print(data)
Somehow I need to write a function that works like this:
text = subtext("bla this is the text i wonna print bla bla", "this", "bla")
#2
My first attempt is to open this HTML file, saved at c:\temp\test.html,
and search for all the text areas starting with http and ending with .com

import re
import io
filename =input("type the file path:")
with io.open('filename','r','utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match("http","com",0,true,true) and len(line)==7:
            print(line)

I got this error:
Traceback (most recent call last):
File "C:/temp/testestlast.py", line 4, in <module>
with io.open('filename','r','utf-8') as f:
TypeError: an integer is required (got type str)
#3
After a quick glance at the docs for the io.open function, it appears that the third positional argument is "buffering", which requires an integer, so 'utf-8' is being passed in the wrong place.
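To illustrate the point about the argument order, here is a minimal sketch. It assumes a temporary stand-in file instead of c:\temp\test.html (the file contents are made up for the example) and shows both the failing call and one that passes the encoding by keyword:

```python
import io
import os
import tempfile

# Create a small stand-in HTML file (hypothetical content, for illustration only).
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "test.html")
with io.open(filename, "w", encoding="utf-8") as f:
    f.write("<p>Hello world</p>\n")

# Wrong: the third positional argument is `buffering`, which must be an integer.
try:
    io.open(filename, "r", "utf-8")
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Right: pass the variable (not the string 'filename') and name the encoding.
with io.open(filename, "r", encoding="utf-8") as f:
    data = f.read()
print(data)
```

Note that the code in post #2 also opens the literal string 'filename' rather than the variable filename, which would fail next even once the encoding is fixed.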
#4
First of all, post all of your code, traceback, output, etc. in their respective tags. See BBcode help for more info.

That said, you don't need to work directly with the io module; that is going too low level. Just use the built-in open() function to access the file, and use it with a context manager, e.g.
with open('test.html') as f:
    # do something
Also, given that this is an HTML file, you may want to use a dedicated package (e.g. Beautiful Soup) to parse the content of the file. re (RegEx) is not the tool to use in this case.
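As a rough sketch of the parser-based approach, the example below uses the standard library's html.parser so it runs without installing anything; Beautiful Soup would express the same idea more compactly. The LinkCollector class and the sample markup are assumptions made for illustration, not something from this thread:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for this example)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Sample markup standing in for the contents of test.html (assumed).
sample_html = """
<html><body>
  <a href="http://example.com">Example</a>
  <a href="https://example.org/page">Other</a>
  <a href="/local/path">Local</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)

# Keep only links that start with http and end with .com, as in the question.
matches = [u for u in collector.links if u.startswith("http") and u.endswith(".com")]
print(matches)
```

The filtering is done with plain string methods on values the parser has already extracted, so no regular expression has to cope with the HTML itself.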
#5
(Dec-07-2017, 12:18 PM)buran Wrote: Just use built-in open() function to access the file. [...] Also, given that this is html file, you may want to use special packages (e.g. Beautiful Soup) to parse the content of the file.

I am pretty new to Python. So far I am able to print the contents of a txt file and find particular words in it. The question is that if the folder is html format then I cant open it at all. Maybe I need Beautiful Soup to get the output. My instructor's walkthrough is just wrong: infile = open("filename",'r') for text and infile = open("filename",'r','utf-8') for html
#6
Quote: The question is that if the folder is html format then I cant open it at all.

Folders have no format.

A text file that happens to have the .html extension is still just a text file, and is the same as any .txt file (or .py file for that matter).  You don't need to supply an encoding to open() to open an html file, unless the file is actually encoded in a special way (most of them are not, though).
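A small sketch of that point, assuming two temporary files with identical (made-up) content and different extensions:

```python
import os
import tempfile

# Same text, written under two different extensions (hypothetical file names).
content = "<p id='foo'>Hello world</p>\n"

tmp_dir = tempfile.mkdtemp()
results = {}
for name in ("test.txt", "test.html"):
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    # Plain open() with no encoding works here because the content is ASCII.
    with open(path) as f:
        results[name] = f.read()

# The extension makes no difference to what open() returns.
print(results["test.txt"] == results["test.html"])
```

For files containing non-ASCII characters, passing encoding='utf-8' explicitly is the safer habit, since the default encoding of open() can depend on the platform.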
#7
Given this file at c:\temp\test.html:
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Title of the document</title>
  </head>
  <body>
    <p id='foo'>Hello world</p>
  </body>
</html>
And this script:
from bs4 import BeautifulSoup

with open('C:/temp/test.html', encoding='utf-8') as f:
    html_file = f.read()

soup = BeautifulSoup(html_file, 'lxml')
print(soup.select('#foo')[0].text)
Output:
Hello world
#8
With your help I have managed to print the requested output using Beautiful Soup:
from bs4 import BeautifulSoup
import urllib.request

import warnings
warnings.filterwarnings('ignore')

resp = urllib.request.urlopen("file:///C:/seminaria.html")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print (link['href'])
    print (link.string)
    print ("")
This prints all the URLs from seminaria.html together with their names.
I was wondering if anyone has any tips on whether it is possible to print the same result without Beautiful Soup, using def.
#9
When reading from disk, it's better to use with open() than urllib.
Without a parser, you can use string tools such as find() and slice out the text.
import urllib.request

resp = urllib.request.urlopen("file:///C:/Temp/test.html")
html_file = resp.read().decode('utf-8')

start = html_file.find("foo'>") + 5
end = html_file.find("</p")
print(html_file[start:end])
Output:
Hello world
def means that you should write a function for this.
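One possible shape for such a function, as a minimal sketch. The name subtext is borrowed from the call in post #1; returning an empty string when a marker is missing is an assumption, and unlike the snippet above it searches for the end marker only after the start marker:

```python
def subtext(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Returns an empty string if either marker is missing (an assumed convention).
    """
    start_pos = text.find(start)
    if start_pos == -1:
        return ""
    start_pos += len(start)               # skip past the start marker itself
    end_pos = text.find(end, start_pos)   # look for the end marker after the start
    if end_pos == -1:
        return ""
    return text[start_pos:end_pos]


html_file = "<p id='foo'>Hello world</p>"
print(subtext(html_file, "foo'>", "</p"))
```

Called on the example from post #1, subtext("bla this is the text i wonna print bla bla", "this", "bla") returns the text between the two markers with its surrounding spaces, so adding .strip() may be wanted.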
#10
:) Helpful tips so far, lads! I am trying to write the function and will post it.