Posts: 14
Threads: 1
Joined: Mar 2020
@buran: No, I cannot change the way the file was created.
@DeaD_EyE: Yes, the file is Dutch origin.
demjson.decode() produces a string, not a dictionary, and if I try loading that string with json.loads(), it raises an error: Expecting property name enclosed in double quotes.
Moreover, while demjson does produce a string for the first line, it raises an error for lines where the value of the key 'voornaam' contains an apostrophe (e.g. ""M'hamed"").
Removing such entries by catching the exceptions with try/except works okay.
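A minimal sketch of that skip-on-error approach, for reference. It assumes ast.literal_eval as a stand-in parser instead of demjson (my substitution, not what was actually run), and load_records is a hypothetical helper name. Skipped lines are kept in a list so they can be inspected later rather than silently lost.

```python
import ast

def load_records(lines):
    """Parse each line, collecting the ones that fail instead of stopping."""
    records, skipped = [], []
    for line in lines:
        try:
            # Drop the outer double quotes, then evaluate the dict literal.
            records.append(ast.literal_eval(line.strip()[1:-1]))
        except (ValueError, SyntaxError):
            # Lines with ""M'hamed"" style values are not valid literals.
            skipped.append(line)
    return records, skipped

# Shortened sample lines (assumed to have the same shape as the real file).
sample_lines = [
    '''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794'}"''',
    '''"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103'}"''',
]
records, skipped = load_records(sample_lines)
print(len(records), "parsed,", len(skipped), "skipped")
```

The obvious cost is that every name with an apostrophe is dropped from the result.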
Thanks for your input.
Posts: 8,090
Threads: 154
Joined: Sep 2016
Mar-09-2020, 09:31 AM
Can you show exactly what a line with an apostrophe in the name looks like? I mean the whole line.
What you show, i.e. doubled double quotes around M'hamed, will make it even more awkward to parse; it may take a regex to turn it into valid JSON.
EDIT: I see that the third line is one of those.
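One possible reading of the doubled double quotes: they look like CSV-style escaping, where a field wrapped in "..." writes an embedded " as "". If that assumption holds for this file (it is not confirmed here), the standard csv module can undo the quoting and ast.literal_eval can then read the remaining Python dict literal. This is only a sketch; parse_csv_quoted_dict is a hypothetical helper name and the sample line is shortened.

```python
import ast
import csv

def parse_csv_quoted_dict(line):
    """Undo CSV quoting on one line, then evaluate the dict literal inside."""
    # csv.reader accepts any iterable of strings: one line in, one row out.
    row = next(csv.reader([line]))
    # Assumption: the whole dict literal sits in the first (and only) field.
    return ast.literal_eval(row[0])

sample = '''"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103'}"'''
record = parse_csv_quoted_dict(sample)
print(type(record), record['voornaam'])
```

With this route no regex is needed, and names containing an apostrophe survive intact.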
Posts: 14
Threads: 1
Joined: Mar 2020
Yes, it is difficult.
Here are a few of the problematic entries:
"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103', 'n8589': '36', 'n9094': '14', 'n9599': '7', 'n0004': '8', 'p8589': '67', 'p9094': '24', 'p9599': '13', 'p0004': '15'}"
"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"
"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"
Posts: 8,090
Threads: 154
Joined: Sep 2016
Actually, is it just the voornaam value that can have an apostrophe?
Posts: 14
Threads: 1
Joined: Mar 2020
Yes, that's right. Only voornaam can have an apostrophe.
Posts: 8,090
Threads: 154
Joined: Sep 2016
Mar-09-2020, 01:56 PM
lines = [
'''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"''',
'''"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"''',
'''"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"'''
]

def parse_item(item):
    key, value = item
    key = key.strip()[1:-1]
    value = value.strip().replace('""', "'")[1:-1]
    return (key, value)

def parse_line(line):
    line = [item.split(':') for item in line[2:-2].split(',')]
    return dict((parse_item(item) for item in line))

for line in lines:
    data = parse_line(line)
    print(type(data), data)

Output:
<class 'dict'> {'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}
<class 'dict'> {'voornaam': "M'Hamed", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}
<class 'dict'> {'voornaam': "D'Angelo", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}
Of course, you can cast the values to int where that is possible and desired.
Also, you can combine both functions, but keeping them separate like this makes them easier to test.
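A rough sketch of the int cast, assuming 'voornaam' and 'geslacht' are the only text columns and everything else is a count. cast_values is a hypothetical helper, not part of the code above; values that do not look like integers are left as strings.

```python
def cast_values(record, text_keys=('voornaam', 'geslacht')):
    """Return a copy of record with numeric-looking values converted to int."""
    result = {}
    for key, value in record.items():
        if key in text_keys:
            result[key] = value  # keep names and gender as text
            continue
        try:
            result[key] = int(value)
        except ValueError:
            result[key] = value  # leave anything non-numeric untouched
    return result

parsed = {'voornaam': "M'Hamed", 'geslacht': 'M', 't8306': '46', 'n0004': '0'}
converted = cast_values(parsed)
print(converted)
```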
Posts: 14
Threads: 1
Joined: Mar 2020
Thanks a lot, buran. The code appears to deal with the problematic entries in a simple and clean manner. Obviously, it takes experience to discern quirky patterns in the data. Hopefully I will spend more time poring over such data.
I am not in a position to run the code on the larger file right now, but I will do it tomorrow and post an update. I am quite sure, however, that this code will work.
Thanks again.
Posts: 14
Threads: 1
Joined: Mar 2020
@buran:
The script works like a charm. However, while trying to understand the logic, I rewrote the loops in the parse_line() function as follows, introducing print() and input() statements:
lines = [
'''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"''',
'''"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"''',
'''"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"'''
]

def parse_item(item):
    '''Parse an item into key and value.'''
    key, value = item
    key = key.strip()[1:-1]
    value = value.strip().replace('""', "'")[1:-1]
    return (key, value)

def parse_line(line):
    '''Parse a line: slice it from position 2 (dropping the characters before 'voornaam'),
    split at ',' to create a list, then split each item of the list at ':', creating a list of lists,
    each containing two elements. Then convert it into a dictionary using the parse_item() function.
    The original code is commented out.'''
    for item in line[2:-2].split(','):
        print(item)
        line = [item.split(':')]
        print(line)
        input()
    for item in line:
        return dict(parse_item(item))
    #line = [item.split(':') for item in line[2:-2].split(',')]
    #return dict((parse_item(item) for item in line))

for line in lines:
    print(line)
    data = parse_line(line)
    print(type(data), data)
    print()
    input()

I think the rewritten code follows the original, but when I run it, it throws a ValueError as shown below:
Output: "{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"
'voornaam': 'Thomas'
[["'voornaam'", " 'Thomas'"]]
'geslacht': 'M'
[[" 'geslacht'", " 'M'"]]
't8306': '26794'
[[" 't8306'", " '26794'"]]
'n8589': '4856'
[[" 'n8589'", " '4856'"]]
'n9094': '6559'
[[" 'n9094'", " '6559'"]]
'n9599': '6412'
[[" 'n9599'", " '6412'"]]
'n0004': '5897'
[[" 'n0004'", " '5897'"]]
'p8589': '8972'
[[" 'p8589'", " '8972'"]]
'p9094': '11424'
[[" 'p9094'", " '11424'"]]
'p9599': '11760'
[[" 'p9599'", " '11760'"]]
'p0004': '11324'
[[" 'p0004'", " '11324'"]]
Traceback (most recent call last):
File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
start(fakepyfile,mainpyfile)
File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
exec(open(mainpyfile).read(), __main__.__dict__)
File "<string>", line 38, in <module>
File "<string>", line 28, in parse_line
ValueError: dictionary update sequence element #0 has length 5; 2 is required
[Program finished]
Am I forgetting something while rewriting?
Posts: 8,090
Threads: 154
Joined: Sep 2016
in slow motion
def parse_line(line):
    result = dict() # you need this because of not using list comprehension like in the original code
    for item in line[2:-2].split(','): # remove quotes and {} from line and split at comma. Iterate over items in resulting list
        item = item.split(':') # split item at : and get a 2-element list
        key, value = parse_item(item) # parse the item - remove quotes, strip leading and trailing spaces, etc. and assign to key and value names
        input(f'{key} --> {value}') # see what you've got
        result[key] = value # add element to dict
    return result # return result dict
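
As for the ValueError itself, one way to read the traceback: dict() expects an iterable of (key, value) pairs, but the rewritten code hands it a single pair. dict() then treats each string in the pair as if it were a pair, and 'p0004' has 5 characters rather than 2. A small sketch reproducing this with the last item from the output above:

```python
pair = ('p0004', '11324')  # what parse_item() returns for the last item

try:
    dict(pair)  # a single pair: its first element 'p0004' is taken as a "pair"
except ValueError as exc:
    message = str(exc)
    print(message)

# Wrapping the pair in a list gives dict() the sequence of pairs it expects.
fixed = dict([pair])
print(fixed)
```

The original one-liner worked because the generator expression yields one pair per item, so dict() receives a sequence of pairs.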
Posts: 14
Threads: 1
Joined: Mar 2020
I think I have found the cause. The rewritten parse_line() function should have been:

def parse_line(line):
    '''Parse a line: slice it from position 2 (dropping the characters before 'voornaam'),
    split at ',' to create a list, then split each item of the list at ':', creating a list of lists,
    each containing two elements. Then convert it into a dictionary using parse_item().'''
    data_dict = {}
    for item in line[2:-2].split(','):
        print(item)
        line = [item.split(':')]
        print(line)
        input()
        for item in line:
            key, value = parse_item(item)
            data_dict[key] = value
    return data_dict

This works fine.
Sorry for the earlier post, which I made without thinking it through.