"for loop" not indexing correctly?

"for loop" not indexing correctly? - melblanc - Jan-24-2020

I am using a for loop with regular expressions to scrape a file
for ordered pairs of geographic co-ordinates. All seemed to be
working until I noticed that on a file with a single entry the
scrape did not work. Closer inspection then revealed that when
multiple entries were found, every other entry was being skipped.

Paring everything back to minimal code revealed that the code below
was failing to capture alternating entries in the scraped file.

import re

filename = "bugs_bunny2.txt"

f = open(filename)

for x in f:
    text_string = f.readline()

    scraped_data = re.findall("{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}", text_string)

    print(scraped_data)

The data in the bugs_bunny2.txt file is:

Source = Goniometer, Accuracy < 1  %  {LON=-78.555550}{LAT=39.111222}
Source = Goniometer, Accuracy < 7  %  {LON=-78.555551}{LAT=39.111223}
Source = Altimeter,  Accuracy <15  %  {LON=-78.456432}{LAT=38.999999}
Source = GPS,        Accuracy <  .1%  {LON=-78.555593}{LAT=39.111199}
Source = Goniometer  Accuracy < 2  %  {LON=-78.555594}{LAT=39.111190}  GPS CLOCK CORRECTED
Source = Goniometer  Accuracy < 1  %  {LON=-78.555565}{LAT=39.111191}  GPS CLOCK CORRECTED
Source = Goniometer  Accuracy <  .9%  {LON=-78.555516}{LAT=38.111065}  GPS CLOCK CORRECTED

The above file is a structured sample of the kind of file being
scraped. The longitude entries are deliberately forged so that the
last digit increments by one, starting at 0 and ending at 6, for a
total of seven entries that can be recovered.

When the script is run, only entries with zero or an even number
at the end of the longitude are returned. On the off chance that
the last digit was influencing the result, I increased the last
digit of each longitude by 1 so the numbers ran from 1 through 7
instead of 0 through 6. The same lines were displayed with the
altered digits: lines 0, 2, 4 and 6.

{LON=-78.555550}{LAT=39.111222}
{LON=-78.456432}{LAT=38.999999}
{LON=-78.555594}{LAT=39.111190}
{LON=-78.555516}{LAT=38.111065}

Altered Entries

{LON=-78.555551}{LAT=39.111222}
{LON=-78.456433}{LAT=38.999999}
{LON=-78.555595}{LAT=39.111190}
{LON=-78.555517}{LAT=38.111065}

If the file being scraped has only one entry matching the regular
expression, it is not displayed. Printing the variable
scraped_data shows an empty pair of brackets, "[]".

What am I overlooking here? I see nothing that should step over
any line in the text file. At this point even straws are welcomed.

Mel

RE: "for loop" not indexing correctly? - ThiefOfTime - Jan-24-2020

I tried to reproduce your output but, as I expected, I got a different result. The problem is that open() gives you an iterator over the lines of the file, and by using the for loop you are already iterating over those lines. Since an iterator visits each element only once, you lose the first, third, ... entries, because you never use x and call readline() instead. The first line is stored in the loop variable x, and readline() then reads the next line. On the next pass the third line is stored in x and readline() reads the fourth line, and so on. The values of the entries have nothing to do with the output you are getting.
I suggest you try this:

import re

filename = "bugs_bunny2.txt"

f = open(filename)

for x in f:
    # x already holds the current line, so no f.readline() is needed
    scraped_data = re.findall(r"{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}", x)

    print(scraped_data)

Also try to use:

with open(filename) as f:
    for x in f:
        # rest of the code

The file will then be closed automatically.
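
The double consumption described above can be reproduced without a file on disk. This is only a minimal sketch: io.StringIO stands in for the opened file (it supports both iteration and readline()), and the four sample lines and the two list names are made up for illustration.

```python
import io

# io.StringIO stands in for the open file; the four lines are made-up sample data.
f = io.StringIO("line 0\nline 1\nline 2\nline 3\n")

seen_by_loop = []      # what the for loop stores in x
seen_by_readline = []  # what f.readline() returns inside the loop

for x in f:
    seen_by_loop.append(x.strip())
    # readline() consumes the line AFTER the one the loop just took
    seen_by_readline.append(f.readline().strip())

print(seen_by_loop)      # ['line 0', 'line 2']
print(seen_by_readline)  # ['line 1', 'line 3']
```

With a single-line file the loop takes that one line into x, readline() returns an empty string, and re.findall on an empty string gives [].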

RE: "for loop" not indexing correctly? - DeaD_EyE - Jan-24-2020

The problem is strange. I haven't reproduced it; instead, I'll take a different approach.

First of all, your regex is wrong.

{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}

The corrected version:

{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}

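[+-]? allows a plus sign, a minus sign, or no sign in front of the number. The ? means zero or one occurrence.
\d{2} and \d{6} use quantifiers instead of repeating \d.
\. is an escaped dot; an unescaped dot matches any character.
The parentheses group longitude and latitude so they can be extracted separately.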

Use regex101 to check it.
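
The effect of the unescaped dot can also be checked quickly in Python itself. A small sketch, assuming a made-up malformed entry (the names loose, strict, good and bad are only for illustration):

```python
import re

# Original pattern: the dot is unescaped, so it matches any character.
loose = re.compile(r"{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}")
# Corrected pattern from this post: escaped dot, optional sign, capture groups.
strict = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")

good = "{LON=-78.555550}{LAT=39.111222}"
bad = "{LON=-78x555550}{LAT=39y111222}"  # made-up malformed entry

print(bool(loose.search(bad)))       # True: 'x' and 'y' are accepted as "any char"
print(strict.search(bad))            # None: a literal dot is required
print(strict.search(good).groups())  # ('-78.555550', '39.111222')
```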

Instead of looking up the whole data in memory, you could use an iterative solution: Line by line
You could return a dict, or a list for each result.

import re


def parse(file, regex, *, to_float=False):
    """Yield one {'lon': ..., 'lat': ...} dict per matching line of file."""
    with open(file) as fd:
        # iterate lazily, one line at a time
        for line in fd:
            match = regex.search(line)
            if match:
                lon, lat = match.group(1), match.group(2)
                if to_float:
                    lon, lat = float(lon), float(lat)
                yield {'lon': lon, 'lat': lat}


filename = "bugs_bunny2.txt"
pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")
for data in parse(filename, pattern):
    print(data)

Since Python 3.8, you can save one line by using an assignment expression (the := operator):

import re


def parse(file, regex, *, to_float=False):
    with open(file) as fd:
        for line in fd:
            if match := regex.search(line):
                lon, lat = match.group(1), match.group(2)
                if to_float:
                    lon, lat = float(lon), float(lat)
                yield {'lon': lon, 'lat': lat}


filename = "bugs_bunny2.txt"
pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")
for data in parse(filename, pattern):
    print(data)

Output:
{'lon': '-78.555550', 'lat': '39.111222'}
{'lon': '-78.555551', 'lat': '39.111223'}
{'lon': '-78.456432', 'lat': '38.999999'}
{'lon': '-78.555593', 'lat': '39.111199'}
{'lon': '-78.555594', 'lat': '39.111190'}
{'lon': '-78.555565', 'lat': '39.111191'}
{'lon': '-78.555516', 'lat': '38.111065'}

I also tried this with files of one and zero lines, and it works as expected.
The output above is without converting the strings to float.

Even if the file is 100 TiB in size, you can still use this code,
because it doesn't load the whole content of the file into memory.
That is a nice side effect of an iterative solution.

Calling re.findall on the whole file content requires that content to be in memory.
For toy applications, that's OK.

With medium-sized data (fits on disk, but not in memory) you need an iterative solution.
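
To make the contrast concrete, here is a rough sketch of both approaches on the same text. The shortened sample lines, the io.StringIO stand-in for a real file, and the helper name iter_pairs are assumptions for illustration only.

```python
import io
import re

pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")

# Made-up, shortened sample lines; io.StringIO stands in for a real file below.
sample = (
    "Source = GPS {LON=-78.555593}{LAT=39.111199}\n"
    "no coordinates on this line\n"
    "Source = GPS {LON=-78.555594}{LAT=39.111190}\n"
)

# Whole-content approach: the entire text has to be in memory at once.
all_at_once = pattern.findall(sample)


def iter_pairs(lines):
    """Yield (lon, lat) string tuples, holding only one line at a time."""
    for line in lines:
        match = pattern.search(line)
        if match:
            yield match.groups()


line_by_line = list(iter_pairs(io.StringIO(sample)))

print(all_at_once == line_by_line)  # True
```

Both give the same pairs here; the difference is only in how much of the input has to be held in memory while searching.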

RE: "for loop" not indexing correctly? - buran - Jan-24-2020

(Jan-24-2020, 02:31 PM)DeaD_EyE Wrote: Instead of looking up the whole data in memory, you could use an iterative solution: Line by line

They do process the file line by line but, as @ThiefOfTime explained, they were consuming two lines at a time (one in the loop, the next with readline()) and processing only the second one.
The comment about the regex itself is valid, though.

RE: "for loop" not indexing correctly? - melblanc - Jan-24-2020

Thief of Time and DeadEye, thank you for the direction. As Thief of Time found, it was a double-iteration snafu. Once I removed the f.readline() and used 'x' in the loop not only as the iterator but also as the variable, all the pieces fell into place. Obviously I did not understand the capability of 'for x in f:'. Like a dummy, I assumed x was a simple numeric iterator.
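
For anyone who does want a numeric counter alongside each line, the built-in enumerate() provides one. A minimal sketch, with io.StringIO and three made-up lines standing in for the real file:

```python
import io

# io.StringIO stands in for the open file; the lines are made-up sample data.
f = io.StringIO("first\nsecond\nthird\n")

numbered = []
for i, line in enumerate(f):
    # i is the numeric index, line is the text of that line
    numbered.append((i, line.rstrip("\n")))

print(numbered)  # [(0, 'first'), (1, 'second'), (2, 'third')]
```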

I have to find a book which is clear to 'me' on Python and its capabilities.

Regards to all and many thanks.

Mel