problem converting string data file to dictionary

AKNL · Mar-09-2020, 09:11 AM

@buran: No, I cannot change the way the file was created.

@DeaD_EyE: Yes, the file is Dutch origin.

demjson.decode() produces a string, not a dictionary, and when I try loading that string with json.loads(), it raises an error: Expecting property name enclosed in double quotes.

Moreover, while demjson does produce a string for the first line, it raises an error for lines where the value of the key 'voornaam' contains an apostrophe (e.g. ""M'hamed"").

Removing such entries by catching exceptions using try - except works okay.
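
The skip-on-error approach can be sketched roughly as follows. This is only an illustration, not the code actually used: it swaps in the standard library's ast.literal_eval for demjson.decode(), the sample lines are shortened, and load_valid_entries is a hypothetical name.

```python
import ast

def load_valid_entries(raw_lines):
    """Parse each line into a dict; collect lines that fail instead of crashing."""
    parsed, skipped = [], []
    for raw in raw_lines:
        text = raw.strip()
        # Drop the outer pair of double quotes that wraps each line.
        if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
            text = text[1:-1]
        try:
            entry = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            # Lines with the doubled quotes around the name end up here.
            skipped.append(raw)
            continue
        if isinstance(entry, dict):
            parsed.append(entry)
        else:
            skipped.append(raw)
    return parsed, skipped

# Shortened sample lines (assumption: same quoting as the real file).
sample = [
    '''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794'}"''',
    '''"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103'}"''',
]
parsed, skipped = load_valid_entries(sample)
print(len(parsed), "parsed,", len(skipped), "skipped")
```

Keeping the skipped lines in a list makes it easy to see afterwards how many names were lost.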

Thanks for your input.

**buran** · (This post was last modified: Mar-09-2020, 09:31 AM by buran.)

Can you show exactly what a line with an apostrophe in the name looks like? I mean the whole line.
What you show, i.e. doubled double quotes around M'hamed, will make it even more awkward to parse; it may need a regex to turn it into valid JSON.

EDIT: I saw that third line is that one.
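
A side note on the doubled double quotes, offered as an assumption rather than a known fact about how the file was produced: outer double quotes with inner quotes written twice match the usual CSV quoting convention. If that holds for the whole file, the standard csv module can undo the quoting, and ast.literal_eval can then read what remains as a Python dict literal. A minimal sketch, with parse_quoted_line as a hypothetical helper and a shortened sample line:

```python
import ast
import csv

def parse_quoted_line(line):
    """Unquote one CSV-style field, then evaluate it as a Python dict literal."""
    # csv.reader accepts any iterable of strings; each string is one row.
    row = next(csv.reader([line]))
    # Assumption: the whole dict sits in a single quoted field.
    return ast.literal_eval(row[0])

line = '''"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103', 'n8589': '36'}"'''
entry = parse_quoted_line(line)
print(type(entry), entry['voornaam'])
```

Unlike splitting on ',' and ':' by hand, this would also cope with a comma or colon inside a value, should one ever appear.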

AKNL · Mar-09-2020, 09:33 AM

Yes, it is difficult. Here are a few of the problematic entries:

"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103', 'n8589': '36', 'n9094': '14', 'n9599': '7', 'n0004': '8', 'p8589': '67', 'p9094': '24', 'p9599': '13', 'p0004': '15'}"
"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"
"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"

**buran** · Mar-09-2020, 09:45 AM

Actually, is it just the voornaam value that can have an apostrophe?

AKNL · Mar-09-2020, 10:21 AM

Yes, that's right. Only voornaam can have an apostrophe.

**buran** · (This post was last modified: Mar-09-2020, 01:56 PM by buran.)

lines = [
'''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"''',
'''"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"''',
'''"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"'''
]

def parse_item(item):
    key, value = item
    key = key.strip()[1:-1]
    value = value.strip().replace('""', "'")[1:-1]
    return (key, value)

def parse_line(line):
    line = [item.split(':') for item in line[2:-2].split(',')]
    return dict((parse_item(item) for item in line))

for line in lines:
    data = parse_line(line)
    print(type(data), data)

Output:
<class 'dict'> {'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}
<class 'dict'> {'voornaam': "M'Hamed", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}
<class 'dict'> {'voornaam': "D'Angelo", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}

Of course, you can cast the values to int where possible and desired.
Also, you could combine both functions, but keeping them separate makes them easier to test.
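
Casting the numeric values could look something like this. It is a sketch only: to_int_if_possible and cast_values are hypothetical helpers that are not part of the code above, and the sample dict is shortened.

```python
def to_int_if_possible(value):
    """Return value as an int when it looks like one, else return it unchanged."""
    try:
        return int(value)
    except ValueError:
        # Names and the gender code are not numeric; keep them as strings.
        return value

def cast_values(data):
    """Build a new dict with every convertible value cast to int."""
    return {key: to_int_if_possible(value) for key, value in data.items()}

data = {'voornaam': "M'Hamed", 'geslacht': 'M', 't8306': '46', 'n8589': '21'}
converted = cast_values(data)
print(converted)
```

Applying cast_values() to the result of parse_line() leaves the parsing functions untouched, so they can still be tested on their own.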

AKNL · Mar-09-2020, 10:26 PM

Thanks a lot, buran. The code appears to deal with the problematic entries in a simple and clean manner. Obviously, it takes experience to discern quirky patterns in the data. Hopefully I will spend more time poring over such data.

I am not in a position to run the code right now on the larger file but will do it tomorrow and update. I am quite sure, however, that this code will work.

Thanks again.

AKNL · Mar-10-2020, 12:34 PM

@buran:
The script works like a charm. However, while trying to understand the logic, I rewrote the loops in the parse_line() function as follows, introducing print() and input() statements:

lines = [
'''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"''',
'''"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"''',
'''"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"'''
]

def parse_item(item):
	
	'''  parse for key, value'''
	key, value = item
	key = key.strip()[1:-1]
	value = value.strip().replace('""', "'")[1:-1]
	return (key, value)
 
def parse_line(line):
	
	'''Parses a line: slices off the first and last two characters (the outer quote and brace),
		splits at ',' creating a list. Then splits each item of the list at ':', creating a list of lists,
		each list containing two elements. Then converts it into a dictionary using the parse_item() function.
The original code is commented out'''
	for item in line[2:-2].split(','):
		print(item)
		line = [item.split(':')]
		print(line)
		input()
	for item in line:
		return dict(parse_item(item))

	#line = [item.split(':') for item in line[2:-2].split(',')]
	#return dict((parse_item(item) for item in line))
 
for line in lines:
	print(line)
	data = parse_line(line)
	print(type(data), data)
	print()
	input()

I think the rewritten code follows the original, but when I run it, it throws a ValueError as shown below:

Output:
"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"
'voornaam': 'Thomas'
[["'voornaam'", " 'Thomas'"]]

 'geslacht': 'M'
[[" 'geslacht'", " 'M'"]]

 't8306': '26794'
[[" 't8306'", " '26794'"]]

 'n8589': '4856'
[[" 'n8589'", " '4856'"]]

 'n9094': '6559'
[[" 'n9094'", " '6559'"]]

 'n9599': '6412'
[[" 'n9599'", " '6412'"]]

 'n0004': '5897'
[[" 'n0004'", " '5897'"]]

 'p8589': '8972'
[[" 'p8589'", " '8972'"]]

 'p9094': '11424'
[[" 'p9094'", " '11424'"]]

 'p9599': '11760'
[[" 'p9599'", " '11760'"]]

 'p0004': '11324'
[[" 'p0004'", " '11324'"]]

Traceback (most recent call last):
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
    start(fakepyfile,mainpyfile)
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
    exec(open(mainpyfile).read(),  __main__.__dict__)
  File "<string>", line 38, in <module>
  File "<string>", line 28, in parse_line
ValueError: dictionary update sequence element #0 has length 5; 2 is required

[Program finished]

Am I forgetting something while rewriting?

**buran** · Mar-10-2020, 12:48 PM

in slow motion

def parse_line(line):
    result = dict() # you need this because of not using list comprehension like in the original code
    for item in line[2:-2].split(','): # remove quotes and {} from line and split at comma. Iterate over items in resulting list
        item = item.split(':') # split item at : and get a 2-element list
        key, value = parse_item(item) # parse the item - remove quotes, strip leading and trailing spaces, etc. and assign to key and value names
        input(f'{key} --> {value}') # see what you've got
        result[key] = value # add element to dict
    return result # return result dict
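
For reference, the ValueError from the earlier rewrite can be reproduced in isolation. parse_item() returns a single (key, value) tuple, while dict() expects an iterable of such pairs. Given one pair, it treats each string inside the tuple as a pair of its own, so the 5-character key 'p0004' fails the length-2 check. In that rewrite, `line` was also overwritten on every pass of the first loop, so only the last item reached the second loop, and the `return` there exited on its first iteration. A small sketch (the variable names are illustrative only):

```python
pair = ('p0004', '11324')  # what parse_item() returns for the last item

raised = False
try:
    dict(pair)  # iterates over the two strings, not over key/value pairs
except ValueError as exc:
    raised = True
    print(exc)

# Wrapping the pair in a list gives dict() one proper key/value entry.
single = dict([pair])
print(single)
```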

AKNL · Mar-10-2020, 12:49 PM

I think I have found the cause. The rewritten parse_line() function should have been:

def parse_line(line):
	
	'''Parses a line: slices off the first and last two characters (the outer quote and brace),
		splits at ',' creating a list. Then splits each item of the list at ':', creating a list of lists,
		each list containing two elements. Then converts it into a dictionary using parse_item().'''
	data_dict = {}
	for item in line[2:-2].split(','):
		print(item)
		line = [item.split(':')]
		print(line)
		input()
		
		for item in line:
			key, value = parse_item(item)
			data_dict[key] = value
	return data_dict

This works fine.

Sorry for posting earlier without thinking it through.
