"for loop" not indexing correctly? - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: "for loop" not indexing correctly? (/thread-23953.html) |
"for loop" not indexing correctly? - melblanc - Jan-24-2020 I am using a for loop with regular expressions to scrape a file for ordered pairs of geographic co-ordinates. All seemed to be working until I noticed on a file with a single entry that the scrape did not work. Closer inspection then revealed when multi- ples were found, every other entry was being skipped. Pairing everything back to minimal code revealed the below was failing to capture alternating entries in the scraped file. import re filename = "bugs_bunny2.txt" f = open(filename) for x in f: text_string = f.readline() scraped_data = re.findall("{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}", text_string) print(scraped_data)The data in the bugs_bunny2.txt file is: Source = Goniometer, Accuracy < 1 % {LON=-78.555550}{LAT=39.111222} Source = Goniometer, Accuracy < 7 % {LON=-78.555551}{LAT=39.111223} Source = Altimeter, Accuracy <15 % {LON=-78.456432}{LAT=38.999999} Source = GPS, Accuracy < .1% {LON=-78.555593}{LAT=39.111199} Source = Goniometer Accuracy < 2 % {LON=-78.555594}{LAT=39.111190} GPS CLOCK CORRECTED Source = Goniometer Accuracy < 1 % {LON=-78.555565}{LAT=39.111191} GPS CLOCK CORRECTED Source = Goniometer Accuracy < .9% {LON=-78.555516}{LAT=38.111065} GPS CLOCK CORRECTEDThe above file is a structured sample of what the file is like being scraped. The longitude entries are deliberately forged so that the last digit of the longitude entry increments by one starting at 0 and ending at 6 for a total of seven possible entries that can be recovered. When the script is run only entries with zero or even numbers at the end of the longitude entry are returned. On an outside chance the last digit was influencing the result the last digit in the longitude entries was changed by '1' so the numbers ran from 1 through 7 instead of 0 through 6. The same entries were displayed in the test with the last digit altered. Lines 0,2,4 & 6 were displayed. 
    {LON=-78.555550}{LAT=39.111222}
    {LON=-78.456432}{LAT=38.999999}
    {LON=-78.555594}{LAT=39.111190}
    {LON=-78.555516}{LAT=38.111065}

Altered Entries

    {LON=-78.555551}{LAT=39.111222}
    {LON=-78.456433}{LAT=38.999999}
    {LON=-78.555595}{LAT=39.111190}
    {LON=-78.555517}{LAT=38.111065}

If the file being scraped has only one entry matching the regular expression, then it is not displayed. Checking the output of the variable scraped_data will return an empty set of brackets, "[]".

What am I overlooking here? I see nothing that should step over any line in the text file. At this point even straws are welcomed.

Mel

RE: "for loop" not indexing correctly? - ThiefOfTime - Jan-24-2020

I tried to reproduce your output, but as I thought, I got a different output. The problem is that open gives you an iterator over the file. By using the for loop you are already iterating over the lines. Since iterators only visit each element once, you will lose the first, third, ... entry since you are not using the x. Instead you use the readline command. So the first line is stored in the x variable of the for loop, and readline reads the next line. Then the third line is stored in x and readline reads the fourth line, etc. The value of the entries has nothing to do with the output you are getting. I suggest you try this:

    import re
    filename = "bugs_bunny2.txt"
    f = open(filename)
    for x in f:
        scraped_data = re.findall("{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}", x)
        print(scraped_data)

Also try to use:

    with open(filename) as f:
        for x in f:
            # rest of the code

then the file will be closed automatically.

RE: "for loop" not indexing correctly? - DeaD_EyE - Jan-24-2020

The problem is strange. I haven't reproduced it; instead I went a different way. First of all, your regex is wrong.

    {LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}

The corrected version:

    {LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}
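To see what the correction changes in practice, here is a small sketch comparing the two patterns. The malformed sample string is invented for this demonstration and does not come from the original file; the variable names are likewise only illustrative. The point it tries to show is that an unescaped "." accepts any character, while "\." accepts only a literal dot.

```python
import re

# Original pattern from the question, written as a raw string here.
loose = re.compile(r"{LON=-\d\d.\d\d\d\d\d\d}{LAT=\d\d.\d\d\d\d\d\d}")
# Corrected pattern: escaped dot, optional sign, capture groups.
strict = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")

good = "{LON=-78.555550}{LAT=39.111222}"   # taken from the sample data
bad = "{LON=-78X555550}{LAT=39Y111222}"    # invented malformed entry

# The unescaped "." matches any character, so the malformed entry slips through.
loose_matches_bad = bool(loose.search(bad))
# The escaped "\." requires a real decimal point, so it is rejected.
strict_matches_bad = bool(strict.search(bad))
# The capture groups return longitude and latitude separately.
strict_groups = strict.search(good).groups()

print(loose_matches_bad, strict_matches_bad, strict_groups)
```

The original pattern does match the well-formed sample data, so the escaping issue is separate from the skipped lines discussed in this thread.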
Use regex101 to check it.

Instead of looking up the whole data in memory, you could use an iterative solution: line by line. You could return a dict or a list for each result.

    import re

    def parse(file, regex, *, to_float=False):
        with open(file) as fd:
            for line in fd:
                match = regex.search(line)
                if match:
                    lon, lat = match.group(1), match.group(2)
                    if to_float:
                        lon, lat = float(lon), float(lat)
                    yield {'lon': lon, 'lat': lat}

    filename = "bugs_bunny2.txt"
    pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")
    for data in parse(filename, pattern):
        print(data)

Since Python 3.8, you could write one line less (assignment expression):

    import re

    def parse(file, regex, *, to_float=False):
        with open(file) as fd:
            for line in fd:
                if match := regex.search(line):
                    lon, lat = match.group(1), match.group(2)
                    if to_float:
                        lon, lat = float(lon), float(lat)
                    yield {'lon': lon, 'lat': lat}

    filename = "bugs_bunny2.txt"
    pattern = re.compile(r"{LON=([+-]?\d{2}\.\d{6})}{LAT=([+-]?\d{2}\.\d{6})}")
    for data in parse(filename, pattern):
        print(data)

I tried this also with one and zero lines and it works as expected. This output is without converting the str to float.

If the file is 100 TiB big, you're still able to use this code, because it doesn't load the whole content of the file into memory. A nice side effect of an iterative solution. The use of re.findall requires the whole content to be in memory. For toy applications, it's OK. With medium data (fits on disk, but not in memory) you need an iterative solution.

RE: "for loop" not indexing correctly? - buran - Jan-24-2020

(Jan-24-2020, 02:31 PM) DeaD_EyE wrote: Instead of looking up the whole data in memory, you could use an iterative solution: Line by line

They do process the file line by line, but as @ThiefOfTime explained, they were consuming two lines at a time (one in the loop, the next with readline()) and processing just the second one. The comment about the regex itself is valid, though.

RE: "for loop" not indexing correctly? - melblanc - Jan-24-2020

Thief of Time and DeadEye, thank you for the direction. As Thief of Time found, it was a double-iteration snafu. Once I removed the f.readline() and used 'x' in the loop not only as the iterator but as the variable, all the pieces fell into place. Obviously I did not understand the capability of 'for x in f:'. Like a dummy, I assumed x was a simple numeric iterator. I have to find a book on Python and its capabilities that is clear to 'me'.

Regards to all and many thanks.

Mel
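The double iteration described in this thread can be reproduced without a file on disk. The following is a minimal sketch, assuming an in-memory io.StringIO object stands in for the opened file; the sample lines and variable names are invented for illustration. It also suggests why a single-entry file produced "[]": after the loop takes the only line, readline() has nothing left and returns an empty string.

```python
import io

sample = "line 0\nline 1\nline 2\nline 3\nline 4\n"  # invented stand-in data

# Buggy pattern: the for loop and readline() advance the same iterator,
# so x takes one line and readline() takes the following one.
f = io.StringIO(sample)
seen_buggy = []
for x in f:
    seen_buggy.append(f.readline().strip())

# Fixed pattern: use the loop variable itself; x already holds the line text.
seen_fixed = [x.strip() for x in io.StringIO(sample)]

# Single-line input: the loop consumes the only line, readline() returns "".
single = io.StringIO("only line\n")
seen_single = [single.readline() for x in single]

print(seen_buggy)   # every second line, plus "" once the input ran out
print(seen_fixed)   # all five lines
print(seen_single)  # one empty string
```

In this sketch x is never a numeric index; it is the text of each line, which is the behaviour Mel describes discovering above.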