Posts: 5
Threads: 1
Joined: Dec 2017
I am new to Python and need some help printing particular text areas from an .html file. The file is at c:\temp\test.html
filename=input("type file path")
infile=open(filename,'r','utf-8')
data =infile.red(filename,'r','utf-8')
print(data)
Somehow I need to write an expression like this:
text = subtext("bla this is the text i wonna print bla bla", "this", "bla")
Posts: 5
Threads: 1
Joined: Dec 2017
My first attempt is to open this html file saved at c:\temp\test.html
and search for all the text areas starting with http and ending with .com
import re
import io
filename =input("type the file path:")
with io.open('filename','r','utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match("http","com",0,true,true) and len(line)==7:
            print(line)
I got this error:
Traceback (most recent call last):
  File "C:/temp/testestlast.py", line 4, in <module>
    with io.open('filename','r','utf-8') as f:
TypeError: an integer is required (got type str)
Posts: 1,150
Threads: 42
Joined: Sep 2016
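After a quick glance at the docs for the io.open function, it appears that the third positional argument is "buffering", which requires an integer. Your call passes 'utf-8' in that position, so pass the encoding by keyword instead: encoding='utf-8'. Also note that 'filename' in quotes is a literal string, not your filename variable.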
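A minimal sketch of what that fix could look like, assuming the file is UTF-8 encoded. The temporary file here is only a stand-in for test.html so the snippet runs anywhere; its path and contents are illustrative, not the original file.

```python
import os
import tempfile

# Create a small stand-in for test.html (illustrative path and contents).
path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<p id='foo'>Hello world</p>\n")

# The third positional parameter of open() is buffering (an integer),
# so the encoding is passed by keyword. Note that path is a variable,
# not the quoted string 'path'.
with open(path, "r", encoding="utf-8") as f:
    data = f.read()

print(data)
```

In the original script the same idea would apply to the filename variable returned by input().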
Posts: 8,085
Threads: 153
Joined: Sep 2016
First of all, post all of your code, traceback, output, etc. in their respective tags. See BBcode help for more info.
That said, you don't need to work directly with the io module; that is going too low-level. Just use the built-in open() function to access the file, and use it with the with context manager, e.g.
with open('test.html') as f:
    # do something
Also, given that this is an html file, you may want to use a dedicated package (e.g. Beautiful Soup) to parse the content of the file. re (RegEx) is not the tool to use in this case.
Posts: 5
Threads: 1
Joined: Dec 2017
Dec-07-2017, 12:47 PM
(This post was last modified: Dec-07-2017, 12:48 PM by Chris.)
(Dec-07-2017, 12:18 PM)buran Wrote: First of all, post all of your code, traceback, output, etc. in their respective tags. See BBcode help for more info.
That said, you don't need to work directly with the io module; that is going too low-level. Just use the built-in open() function to access the file, and use it with the with context manager, e.g.
with open('test.html') as f:
    # do something
Also, given that this is an html file, you may want to use a dedicated package (e.g. Beautiful Soup) to parse the content of the file. re (RegEx) is not the tool to use in this case.
I am pretty new to Python. So far I am able to print the contents of a txt file and find particular words. The question is that if the folder is html format then I cant open it at all. Maybe I need Beautiful Soup to get the output. My instructor's walkthrough is just wrong: infile= open("filename",'r') for text and infile = open("filename",'r'utf-8') for html
Posts: 3,458
Threads: 101
Joined: Sep 2016
Quote: The question is that if the folder is html format then I cant open it at all.
Folders have no format.
A text file that happens to have the .html extension is still just a text file, and is the same as any .txt file (or .py file for that matter). You don't need to supply an encoding to open() to open an html file, unless the file is actually encoded in a special way (most of them are not, though).
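To illustrate the point, here is a small sketch that writes the same text to a .html and a .txt file and reads both back with a plain open() call. The file names are hypothetical, and the content is plain ASCII so the platform default encoding is enough.

```python
import os
import tempfile

content = "<p id='foo'>Hello world</p>\n"
folder = tempfile.mkdtemp()

results = {}
for name in ("page.html", "page.txt"):   # hypothetical file names
    path = os.path.join(folder, name)
    with open(path, "w") as f:
        f.write(content)
    with open(path) as f:                 # no encoding argument
        results[name] = f.read()

# Both files come back as the same string; the extension plays no role.
print(results["page.html"] == results["page.txt"])
```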
Posts: 7,068
Threads: 122
Joined: Sep 2016
In my c:\temp\test.html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<p id='foo'>Hello world</p>
</body>
</html>
from bs4 import BeautifulSoup
with open('C:/temp/test.html', encoding='utf-8') as f:
    html_file = f.read()
soup = BeautifulSoup(html_file, 'lxml')
print(soup.select('#foo')[0].text)
Output: Hello world
Posts: 5
Threads: 1
Joined: Dec 2017
With your help I have managed to print the requested outcome using Beautiful Soup:
from bs4 import BeautifulSoup
import urllib.request
import warnings
warnings.filterwarnings('ignore')
resp = urllib.request.urlopen("file:///C:/seminaria.html")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print (link['href'])
    print (link.string)
    print ("")
I am printing all URLs from seminaria.html and their names.
I was wondering if anyone has any tips on whether it is possible to print the same result without using Beautiful Soup, using def.
Posts: 7,068
Threads: 122
Joined: Sep 2016
Dec-09-2017, 02:54 PM
(This post was last modified: Dec-09-2017, 02:55 PM by snippsat.)
It's better to use with open() than urllib when reading from disk.
Without a parser you can use string tools such as find() and slice out the text.
import urllib.request
resp = urllib.request.urlopen("file:///C:/Temp/test.html")
html_file = resp.read().decode('utf-8')
start = html_file.find("foo'>") + 5
end = html_file.find("</p")
print(html_file[start:end])
Output: Hello world
def means that you should write a function for this.
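As a rough sketch of such a function, the find() and slice steps can be wrapped in a def. The name text_between and its parameters are hypothetical, chosen only for illustration, and returning None when a marker is missing is an assumption about the desired behaviour.

```python
def text_between(text, start, end):
    """Return the text after the first start marker and before the next end marker.

    Returns None if either marker is missing (an assumed convention).
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)            # skip past the start marker itself
    stop = text.find(end, begin)   # search for the end marker only after the start
    if stop == -1:
        return None
    return text[begin:stop]


html = "<p id='foo'>Hello world</p>"
print(text_between(html, "foo'>", "</p"))
```

Using len(start) instead of a hard-coded offset means the same function works for other markers without recounting characters.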
Posts: 5
Threads: 1
Joined: Dec 2017
:) Helpful tips so far, lads! I'm trying to write the function and will post it.