Python Forum
How to print particular text areas fron an HTML file (not site) - Printable Version



How to print particular text areas fron an HTML file (not site) - Chris - Dec-07-2017

I am new to Python and need some help printing particular text areas from an .html file. The file is at c:\temp\test.html

filename=input("type file path")
infile=open(filename,'r','utf-8')
data =infile.red(filename,'r','utf-8')
print(data)
Somehow I need to make an equation like this:
text = subtext("bla this is the text i wonna print bla bla", "this", "bla")


RE: How to print particular text areas fron an HTML file (not site) - Chris - Dec-07-2017

My first attempt is to open this html file saved at c:\temp\test.html
and search for all the text areas starting with http and ending with .com

import re
import io
filename =input("type the file path:")
with io.open('filename','r','utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match("http","com",0,true,true) and len(line)==7:
            print(line)

I got this error:
Traceback (most recent call last):
  File "C:/temp/testestlast.py", line 4, in <module>
    with io.open('filename','r','utf-8') as f:
TypeError: an integer is required (got type str)


RE: How to print particular text areas fron an HTML file (not site) - j.crater - Dec-07-2017

After a quick glance at the docs for the io.open function, it appears that the third positional argument is "buffering", which requires an integer. Your 'utf-8' string is being passed in that position.
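To illustrate the point about the argument order, here is a minimal sketch. The temporary file and the variable names are made up for the example; the only claim is that the third positional argument of io.open is buffering, so the encoding has to be passed by keyword.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "test.html")
with io.open(path, "w", encoding="utf-8") as f:
    f.write("<p>Hello world</p>")

# Positionally, the third argument is buffering, so a string is rejected.
error_message = None
try:
    io.open(path, "r", "utf-8")
except TypeError as exc:
    error_message = str(exc)
print("TypeError:", error_message)

# Passing the encoding by keyword avoids the problem.
with io.open(path, "r", encoding="utf-8") as f:
    data = f.read()
print(data)
```

Note also that the original call passes the literal string 'filename' rather than the variable filename, so even with the encoding fixed it would look for a file that is literally named "filename".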


RE: How to print particular text areas fron an HTML file (not site) - buran - Dec-07-2017

First of all, post all of your code, traceback, output, etc. in their respective tags. See BBcode help for more info.

That said, you don't need to work directly with the io module, as this is going too low level. Just use the built-in open() function to access the file, and use it with the with context manager, e.g.
with open('test.html') as f:
    # do something
Also, given that this is an html file, you may want to use a dedicated package (e.g. Beautiful Soup) to parse the content of the file. re (RegEx) is not the tool to use in this case.
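As a rough sketch of the "use a parser, not a regex" idea, the example below collects links that start with http and end with .com. It swaps in the standard library's html.parser instead of Beautiful Soup so it runs without installing anything. The LinkCollector class and the sample markup are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up sample markup standing in for the contents of test.html.
sample_html = (
    '<a href="http://example.com">Example</a> '
    '<a href="/local/page.html">Local</a> '
    '<a href="https://example.org">Org</a>'
)

collector = LinkCollector()
collector.feed(sample_html)

# Keep only the links that start with http and end with .com.
com_links = [u for u in collector.links
             if u.startswith("http") and u.endswith(".com")]
print(com_links)
```

To run it against a real file, read the file with open() and pass the resulting string to collector.feed().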


RE: How to print particular text areas fron an HTML file (not site) - Chris - Dec-07-2017

I am pretty new to Python. So far I am able to print the contents of a txt file and find particular words in it. The question is that if the folder is html format then I can't open it at all. Maybe I need Beautiful Soup to be able to get the output. My instructor's walkthrough is just wrong: infile = open("filename",'r') for text and infile = open("filename",'r','utf-8') for html


RE: How to print particular text areas fron an HTML file (not site) - nilamo - Dec-08-2017

Quote: The question is that if the folder is html format then I can't open it at all.

Folders have no format.

A text file that happens to have the .html extension is still just a text file, and is the same as any .txt file (or .py file for that matter).  You don't need to supply an encoding to open() to open an html file, unless the file is actually encoded in a special way (most of them are not, though).
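A small sketch of that point: the same text written to a .txt file and to a .html file reads back identically through open(). The scratch folder and file names are made up for the example. The encoding is passed explicitly here only so the result does not depend on the platform default.

```python
import os
import tempfile

content = "<p id='foo'>Hello world</p>\n"
folder = tempfile.mkdtemp()  # hypothetical scratch folder

results = {}
for name in ("test.txt", "test.html"):
    path = os.path.join(folder, name)
    # Write the same text under both extensions.
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    # Read it back exactly the same way for both files.
    with open(path, encoding="utf-8") as f:
        results[name] = f.read()

# The extension makes no difference to open().
print(results["test.txt"] == results["test.html"])
```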


RE: How to print particular text areas fron an HTML file (not site) - snippsat - Dec-08-2017

In my c:\temp\test.html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Title of the document</title>
  </head>
  <body>
    <p id='foo'>Hello world</p>
  </body>
</html>
from bs4 import BeautifulSoup

with open('C:/temp/test.html', encoding='utf-8') as f:
    html_file = f.read()

soup = BeautifulSoup(html_file, 'lxml')
print(soup.select('#foo')[0].text)
Output:
Hello world



RE: How to print particular text areas fron an HTML file (not site) - Chris - Dec-09-2017

With your help I have managed to print the requested output using Beautiful Soup:
from bs4 import BeautifulSoup
import urllib.request

import warnings
warnings.filterwarnings('ignore')

resp = urllib.request.urlopen("file:///C:/seminaria.html")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print (link['href'])
    print (link.string)
    print ("")
I am printing all URLs from seminaria.html and their names.
I was wondering if anyone has any tips on whether it is possible to print the same result without using Beautiful Soup, using
def



RE: How to print particular text areas fron an HTML file (not site) - snippsat - Dec-09-2017

It's better to use with open() than urllib when reading from disk.
Without a parser you can use string tools such as find() and slice out the text.
import urllib.request

resp = urllib.request.urlopen("file:///C:/Temp/test.html")
html_file = resp.read().decode('utf-8')

start = html_file.find("foo'>") + 5
end = html_file.find("</p")
print(html_file[start:end])
Output:
Hello world
def means that you should write a function for this.
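One possible shape for such a function, as a sketch only: it wraps the same find() and slicing approach in the subtext(text, start, end) form from the first post. The behaviour when a marker is missing (returning an empty string) is an assumption, not something specified in the thread.

```python
def subtext(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    begin = text.find(start)
    if begin == -1:
        return ""  # assumption: no start marker means nothing to return
    begin += len(start)  # skip past the start marker itself
    stop = text.find(end, begin)  # search for end only after the start marker
    if stop == -1:
        return ""  # assumption: no end marker means nothing to return
    return text[begin:stop]


print(subtext("bla this is the text i wonna print bla bla", "this", "bla"))

html_file = "<p id='foo'>Hello world</p>"
print(subtext(html_file, "foo'>", "</p"))
```

Searching for the end marker from the position after the start marker matters here. A plain text.find("bla") would match the "bla" at the very beginning of the first example and produce an empty slice.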


RE: How to print particular text areas fron an HTML file (not site) - Chris - Dec-09-2017

:) Helpful tips so far, lads! I'm trying to write the function and will post it.