Python Forum - Searching a text file to find words matching a pattern

Hello everyone,

I'm a beginner at Python and I'm trying to learn by testing a few different things, and now I'm stuck. I'm using this text file, ss100.txt. It's in Swedish so there is of course åäö in the file and that's a problem as well. My question is as follows:
How do I find all words with a pattern like this: hxxxgxx where x is an unknown character?

I tried this but it's not working and I ran out of ideas:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

thefile = open("ss100.txt", "r")

for line in thefile:
    if re.match("h(.*)g(.*)(.*)", line) and len(line)==7:
        print line

A small example of the text file in the link above:

Quote:
hopplös
hopplösa
hopplösare
hopplösares
hopplösas
hopplösast
hopplösaste
hopplösastes
hopplöse
hopplöses
hopplöshet
hopplösheten
hopplöshetens
hopplöshets
hopplöst

I'm looking for words with 7 characters starting with 'h' and a 'g' in position 5.
What is wrong with my code? Or is there a better way to do this?

If you are a newbie, you should learn Python 3.x instead of Python 2.x.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
thefile = open("ss100.txt", "r")
 
for line in thefile:
    line = line.strip()  # strip the trailing newline
    if re.match(r"h...g..", line) and len(line)==7:
        print(line)
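
For what it's worth, here is a small sketch of why the original attempt printed nothing. Each line read from a file still carries its trailing newline, so a seven-letter word has length 8, and `.*` matches any number of characters rather than exactly one. The sample words below are made up for illustration (they are not taken from ss100.txt), and `find_matches` is a hypothetical helper name. Using `re.fullmatch` (available from Python 3.4) makes the separate length check unnecessary, because the pattern has to cover the whole string.

```python
import re

# One dot per unknown character: 'h', three unknowns, 'g', two unknowns.
PATTERN = re.compile(r"h...g..")


def find_matches(lines):
    """Return the words that match PATTERN from start to end."""
    matches = []
    for line in lines:
        word = line.strip()  # drop the trailing newline and surrounding spaces
        if PATTERN.fullmatch(word):
            matches.append(word)
    return matches


# Made-up sample lines, each ending in a newline as they would when read from a file.
sample = ["havsgud\n", "hopplös\n", "hårdgör\n", "havsgudar\n", "hängig\n"]

print(len("havsgud\n"))      # the newline counts as a character
print(find_matches(sample))
```

Only the words that are exactly seven characters long with 'g' as the fifth character get through; longer words such as "havsgudar" are rejected because fullmatch does not allow anything after the pattern.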

(Nov-07-2017, 06:37 PM) Micael wrote: It's in Swedish so there is of course åäö in the file and that's a problem as well.

A lot has changed regarding Unicode; it was one of the biggest changes in the move to Python 3 (as mentioned by @heiner55, you should use Python 3).
In Python 3, open() has a built-in encoding parameter.
So the simple rule is to keep it UTF-8 in and out when reading a file.
In Python 3, all strings are sequences of Unicode characters; data that has not been decoded (for example, a file opened in binary mode) stays as bytes (b'hello').
Python 3 will not guess the way Python 2 does.
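
A small sketch of the str/bytes distinction, using only built-in methods. The sample word is made up for illustration; the point is that the same text produces different bytes depending on the encoding, and that decoding with the wrong one raises an error instead of guessing.

```python
# In Python 3, text is str (Unicode code points); encoded data is bytes.
word = "hårdgör"

utf8_bytes = word.encode("utf-8")      # å and ö take two bytes each in UTF-8
latin1_bytes = word.encode("latin-1")  # one byte per character in Latin-1

print(len(word), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding gives the original text back.
assert utf8_bytes.decode("utf-8") == word
assert latin1_bytes.decode("latin-1") == word

# Decoding with the wrong encoding is not guessed around: it raises.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode as UTF-8:", exc.reason)
```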

So, borrowing the code from @heiner55, it looks like this:

import re

with open('ss100.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match(r"h...g..", line) and len(line)==7:
            print(line)

There is no need for # -*- coding: utf-8 -*- in Python 3, because UTF-8 is the default source encoding.

In Python 2 it would look like this; same rule, UTF-8 in and out.
But you have to use a library, io or codecs, and # -*- coding: utf-8 -*-, because Python 2 uses ASCII as its default encoding.

# -*- coding: utf-8 -*-
import re
import io

with io.open('ss100.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match(r"h...g..", line) and len(line)==7:
            print(line)

Thank you heiner55 and snippsat for your quick replies.

I tried the code heiner55 showed me and got this error message:

Quote:
Traceback (most recent call last):
  File "test3.py", line 6, in <module>
    for line in thefile:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 343: invalid continuation byte

Any idea how to find out if it's utf-8?

-- UPDATE --
Forget the question. As I'm using Linux Mint, I tried the Kate editor and found out that the file was not encoded in UTF-8, so I saved it as UTF-8 and now it works fine.
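
For anyone who hits the same traceback without an editor at hand, here is a rough sketch of checking from Python whether a file is valid UTF-8. Byte 0xe4 is 'ä' in Latin-1 (ISO-8859-1), so using Latin-1 as the fallback is an assumption about this kind of file, not something confirmed in the thread. The helper name `read_words` and the temporary file are hypothetical and only there for illustration.

```python
import os
import tempfile


def read_words(path, fallback="latin-1"):
    """Read a word list as UTF-8, falling back to another encoding if that fails.

    The fallback encoding is an assumption. Latin-1 never fails to decode,
    so a wrong guess shows up as garbled letters rather than an error.
    """
    with open(path, "rb") as f:      # read raw bytes, no decoding yet
        data = f.read()
    try:
        text = data.decode("utf-8")  # strict: raises on invalid byte sequences
        used = "utf-8"
    except UnicodeDecodeError:
        text = data.decode(fallback)
        used = fallback
    return used, text.splitlines()


# Demo with a throwaway Latin-1 file standing in for the real word list.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("hälsa\nhårdgör\n".encode("latin-1"))

encoding, words = read_words(tmp.name)
print(encoding, words)
os.remove(tmp.name)
```

Once the encoding is known, passing it directly, as in open("ss100.txt", encoding="latin-1"), should work just as well as re-saving the file as UTF-8.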

Thanks again. The two of you saved my day.

Now I just have to understand the code really well.