Python Forum
Searching a text file to find words matching a pattern
#1


Hello everyone,

First, I'm a beginner at Python and I'm trying to learn by testing a few different things, and now I'm stuck. I'm using this text file, ss100.txt. It's in Swedish, so there are of course å, ä and ö in the file, and that's a problem as well. My question is as follows:
How do I find all words with a pattern like this: hxxxgxx where x is an unknown character?

I tried this but it's not working and I ran out of ideas:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

thefile = open("ss100.txt", "r")

for line in thefile:
    if re.match("h(.*)g(.*)(.*)", line) and len(line)==7:
        print line



A small example of the text file in the link above:
Quote:
hopplös
hopplösa
hopplösare
hopplösares
hopplösas
hopplösast
hopplösaste
hopplösastes
hopplöse
hopplöses
hopplöshet
hopplösheten
hopplöshetens
hopplöshets
hopplöst


I'm looking for words with 7 characters starting with 'h' and a 'g' in position 5.
What is wrong with my code? Or is there a better way to do this?
#2
If you are a newbie, you should learn Python 3.x instead of Python 2.x.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
thefile = open("ss100.txt", "r")
 
for line in thefile:
    line = line.strip() # strip the trailing newline
    if re.match(r"h...g..", line) and len(line)==7:
        print(line)
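
For what it's worth, here is a short sketch of why the original attempt did not find the right words. In a regex, (.*) matches any number of characters rather than exactly one, and every line read from the file still ends with a newline, so a 7-letter word has length 8. The word list and the helper name matches_pattern below are made up for illustration. re.fullmatch is used here instead of re.match plus a length check; that is a swapped-in alternative, not the code from the post above.

```python
import re

# Made-up sample lines, as they would come out of a file (trailing newline included).
lines = ["halvgud\n", "hungrig\n", "hopplös\n", "halvgudar\n"]

def matches_pattern(word):
    # fullmatch anchors at both ends, so "h...g.." means exactly 7 characters:
    # 'h', three arbitrary characters, 'g', two arbitrary characters.
    return re.fullmatch(r"h...g..", word) is not None

# The newline counts as a character, so a 7-letter word never has len(line) == 7.
print(len(lines[0]))

# Strip the newline first, then test the whole word against the pattern.
found = [line.strip() for line in lines if matches_pattern(line.strip())]
print(found)
```

With fullmatch the separate len(line)==7 test is no longer needed, because the pattern itself fixes the length.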
#3
(Nov-07-2017, 06:37 PM)Micael Wrote: It's in Swedish so there is of course åäö in the file and that's a problem as well.
A lot has changed regarding Unicode; it was one of the biggest changes in the move to Python 3 (as @heiner55 mentioned, you should use Python 3).
In Python 3, open() has a built-in encoding parameter.
So the simple rule is to keep it UTF-8 in and out when reading a file.
Inside Python 3, all strings are sequences of Unicode characters; data that has not been decoded, for example from a file opened in binary mode, is bytes (b'hello').
Python 3 will not guess the way Python 2 does.

So, borrowing the code from @heiner55, it looks like this:
import re

with open('ss100.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match(r"h...g..", line) and len(line)==7:
            print(line)
There is no need for # -*- coding: utf-8 -*- in Python 3, because UTF-8 is the default source encoding.

In Python 2 it would look like this; the same rule applies: UTF-8 in and out.
But you have to use a library, io or codecs, and # -*- coding: utf-8 -*-, because Python 2 uses ASCII as its default encoding.
# -*- coding: utf-8 -*-
import re
import io

with io.open('ss100.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if re.match(r"h...g..", line) and len(line)==7:
            print(line)
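
To make the bytes-versus-text point above concrete, here is a minimal Python 3 sketch. It encodes one word from the sample list in two ways and shows that decoding with the wrong codec raises an error instead of guessing. Latin-1 is only an assumed example of a non-UTF-8 encoding, and the variable names are illustrative.

```python
# Text in Python 3 is str (Unicode); encoded data is bytes.
word = "hoppl\u00f6s"  # "hopplös", written with an escape so the source stays ASCII

utf8_bytes = word.encode("utf-8")      # 'ö' becomes two bytes: 0xc3 0xb6
latin1_bytes = word.encode("latin-1")  # 'ö' becomes one byte: 0xf6 (assumed example encoding)

# Character count versus byte counts in each encoding.
print(len(word), len(utf8_bytes), len(latin1_bytes))

# Decoding with the encoding that produced the bytes gives the text back.
assert utf8_bytes.decode("utf-8") == word

# Decoding with the wrong encoding fails loudly rather than guessing.
failed = False
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)
```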
#4
Thank you heiner55 and snippsat for your quick replies.

I tried the code heiner55 showed me and got this error message:
Quote:
Traceback (most recent call last):
  File "test3.py", line 6, in <module>
    for line in thefile:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 343: invalid continuation byte
Any idea how to find out if it's utf-8?

-- UPDATE --
Forget the question. I'm using Linux Mint, so I opened the file in the Kate editor and found that it was not encoded in UTF-8. I saved it as UTF-8 and now it works fine.

Thanks again. The two of you saved my day. :) Now I just have to understand the code really well.
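
For anyone who hits the same UnicodeDecodeError and wants to check the encoding from Python instead of an editor, here is a minimal sketch. The helper name is_valid_utf8 is made up, and the sample bytes stand in for the file contents. The byte 0xe4 from the traceback is 'ä' in Latin-1, so Latin-1 is a plausible guess for the original file, but that is an assumption; the thread does not say which encoding it was.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Stand-ins for file contents; for a real file the bytes would come from
# something like: data = open("ss100.txt", "rb").read()
as_utf8 = "h\u00e4st\n".encode("utf-8")
as_latin1 = "h\u00e4st\n".encode("latin-1")  # contains the single byte 0xe4

print(is_valid_utf8(as_utf8))
print(is_valid_utf8(as_latin1))

# If the file really is Latin-1 (an assumption), the bytes decode with that codec,
# and open("ss100.txt", encoding="latin-1") would read it without re-saving.
print(as_latin1.decode("latin-1").strip())
```

A clean UTF-8 decode does not strictly prove that the file was written as UTF-8, but text in other encodings that contains å, ä or ö will almost always fail this check.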