Python Forum
"for loop" not indexing correctly?
#1
Wall
I am using a for loop with regular expressions to scrape a file
for ordered pairs of geographic co-ordinates. All seemed to be
working until I noticed, on a file with a single entry, that the
scrape did not work. Closer inspection then revealed that when
multiple entries were found, every other entry was being skipped.

Paring everything back to minimal code revealed that the code below
was failing to capture alternating entries in the scraped file.
import re

filename = "bugs_bunny2.txt"

f = open(filename)

for x in f:
    text_string = f.readline()

    scraped_data = re.findall("{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}", text_string)

    print(scraped_data)
The data in the bugs_bunny2.txt file is:

Source = Goniometer, Accuracy < 1  %  {LON=-78.555550}{LAT=39.111222}
Source = Goniometer, Accuracy < 7  %  {LON=-78.555551}{LAT=39.111223}
Source = Altimeter,  Accuracy <15  %  {LON=-78.456432}{LAT=38.999999}
Source = GPS,        Accuracy <  .1%  {LON=-78.555593}{LAT=39.111199}
Source = Goniometer  Accuracy < 2  %  {LON=-78.555594}{LAT=39.111190}  GPS CLOCK CORRECTED
Source = Goniometer  Accuracy < 1  %  {LON=-78.555565}{LAT=39.111191}  GPS CLOCK CORRECTED
Source = Goniometer  Accuracy <  .9%  {LON=-78.555516}{LAT=38.111065}  GPS CLOCK CORRECTED
The above file is a structured sample of the kind of file being
scraped. The longitude entries are deliberately forged so
that the last digit of each longitude increments by one,
starting at 0 and ending at 6, for a total of seven
entries that can be recovered.

When the script is run, only entries with zero or an even number
at the end of the longitude are returned. On the off
chance that the last digit was influencing the result, the last digit
of each longitude was increased by 1 so the numbers ran
from 1 through 7 instead of 0 through 6. The same lines were
displayed in the test with the last digit altered: lines
0, 2, 4 and 6.

{LON=-78.555550}{LAT=39.111222}
{LON=-78.456432}{LAT=38.999999}
{LON=-78.555594}{LAT=39.111190}
{LON=-78.555516}{LAT=38.111065}

Altered Entries

{LON=-78.555551}{LAT=39.111222}
{LON=-78.456433}{LAT=38.999999}
{LON=-78.555595}{LAT=39.111190}
{LON=-78.555517}{LAT=38.111065}
If the file being scraped has only one entry matching the regular
expression, then it is not displayed. Printing the variable
scraped_data shows an empty list, "[]".

What am I overlooking here? I see nothing that should step over
any line in the text file. At this point even straws are welcome.

Mel
#2
I tried to reproduce your output but, as I expected, got a different result. The problem is that open() gives you an iterator over the file, and the for loop is already iterating over its lines. Since an iterator visits each element only once, you lose the first, third, ... entries because you never use x and call readline() instead. The first line is stored in the loop variable x, and readline() then reads the next line. On the next pass the third line is stored in x and readline() reads the fourth line, and so on. The values in the entries have nothing to do with the output you are getting.
I suggest you try this:
import re
 
filename = "bugs_bunny2.txt"
 
f = open(filename)
 
for x in f:
    scraped_data = re.findall("{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}", x)
 
    print(scraped_data)
Also consider opening the file in a with block:
with open(filename) as f:
    for x in f:
        # rest of the code
Then the file will be closed automatically.
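The skipping can be reproduced without the original file. The following is a minimal sketch that uses io.StringIO as a stand-in for the open file (an assumption for illustration, with made-up sample lines) and records what the loop variable and readline() each receive:

```python
import io

# Stand-in for the open file: seven short lines (hypothetical sample data).
sample = "".join(f"line {n}\n" for n in range(1, 8))
f = io.StringIO(sample)

seen_by_loop = []      # what the for-loop variable x receives
seen_by_readline = []  # what f.readline() returns inside the loop

for x in f:
    seen_by_loop.append(x.strip())
    # readline() advances the same iterator, so it consumes the NEXT line
    seen_by_readline.append(f.readline().strip())

print(seen_by_loop)
print(seen_by_readline)
```

In this sketch the loop variable receives the odd-numbered lines and readline() the even-numbered ones. On the last pass readline() has nothing left and returns an empty string, which is consistent with a single-entry file producing "[]".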
#3
The problem is strange. I haven't reproduced it; instead, I took a different approach.

First of all, your regex is wrong.
{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}
The corrected version:
{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}

  1. [+-]? accepts +, - or no sign in front of the number. The ? means zero or one occurrence
  2. Quantifiers ({2} and {6}) replace the repeated \d
  3. Escaped the dot as \., otherwise it would match any character
  4. Captured longitude and latitude in groups

Use regex101 to check it.
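The difference between the two patterns can also be checked locally. This is a small sketch, not part of the original post: the "bad" and "east" strings are made-up test inputs, and both patterns are written as raw strings.

```python
import re

# The two patterns quoted in the thread, written as raw strings.
original = re.compile(r"{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}")
corrected = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")

good = "{LON=-78.555550}{LAT=39.111222}"   # entry from the sample file
bad = "{LON=-78X555550}{LAT=39Y111222}"    # made-up malformed entry
east = "{LON=+78.555550}{LAT=-39.111222}"  # made-up entry with other signs

# Both patterns accept the well-formed entry.
print(bool(original.search(good)), bool(corrected.search(good)))
# The unescaped dot lets the original accept "X" and "Y" as separators.
print(bool(original.search(bad)), bool(corrected.search(bad)))
# Only the corrected pattern tolerates an explicit + or a negative latitude.
print(bool(original.search(east)), bool(corrected.search(east)))
# The groups give longitude and latitude as separate strings.
print(corrected.search(good).groups())
```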


Instead of loading the whole data into memory, you could use an iterative solution that works line by line.
You could yield a dict or a list for each result.


import re


def parse(file, regex, *, to_float=False):
    with open(file) as fd: 
        for line in fd:
            match = regex.search(line)
            if match:
                lon, lat = match.group(1), match.group(2)
                if to_float:
                    lon, lat = float(lon), float(lat)
                yield {'lon': lon, 'lat': lat}


filename = "bugs_bunny2.txt"
pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")
for data in parse(filename, pattern):
    print(data)
Since Python 3.8, you can save one line with an assignment expression:

import re


def parse(file, regex, *, to_float=False):
    with open(file) as fd: 
        for line in fd:
            if match := regex.search(line):
                lon, lat = match.group(1), match.group(2)
                if to_float:
                    lon, lat = float(lon), float(lat)
                yield {'lon': lon, 'lat': lat}


filename = "bugs_bunny2.txt"
pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")
for data in parse(filename, pattern):
    print(data)
Output:
{'lon': '-78.555550', 'lat': '39.111222'}
{'lon': '-78.555551', 'lat': '39.111223'}
{'lon': '-78.456432', 'lat': '38.999999'}
{'lon': '-78.555593', 'lat': '39.111199'}
{'lon': '-78.555594', 'lat': '39.111190'}
{'lon': '-78.555565', 'lat': '39.111191'}
{'lon': '-78.555516', 'lat': '38.111065'}
I also tried this with files of one and zero lines, and it works as expected.
This output is without converting the str to float.


Even if the file is 100 TiB in size, you can still use this code,
because it doesn't load the whole content of the file into memory.
That is a nice side effect of an iterative solution.

Using re.findall on the whole file requires the entire content to be in memory.
For toy applications, that's OK.

With medium-sized data (fits on disk, but not in memory) you need an iterative solution.
#4
(Jan-24-2020, 02:31 PM)DeaD_EyE Wrote: Instead of looking up the whole data in memory, you could use an iterative solution: Line by line
They do process the file line by line but, as @ThiefOfTime explained, they were consuming two lines per iteration (one in the loop, the next with readline()) and processing only the second one.
The comment about the regex itself is valid, though.
#5
Thief of Time and DeadEye, thank you for the direction. As Thief of Time found, it was a double-iteration snafu. Once I removed the f.readline() and used 'x' in the loop not only as the iterator but also as the variable, all the pieces fell into place. Obviously I did not understand the capability of 'for x in f:'. Like a dummy, I assumed x was a simple numeric iterator.
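For anyone who does want a numeric counter alongside each line, enumerate() is the usual tool. A minimal sketch, assuming an io.StringIO object as a stand-in for the open file and made-up sample text:

```python
import io

# Stand-in for an open text file (hypothetical two-line sample).
f = io.StringIO("first line\nsecond line\n")

pairs = []
# enumerate() supplies the numeric index; the file iterator supplies the text
for index, line in enumerate(f):
    pairs.append((index, line.rstrip("\n")))

print(pairs)
```

Here index is the number and line is the text of the line itself, so no separate readline() call is needed.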

I have to find a book which is clear to 'me' on Python and capabilities.

Regards to all and many thanks.

Mel