Python Forum
problem converting string data file to dictionary
#11
@buran: No, I cannot change the way the file was created.

@DeaD_EyE: Yes, the file is of Dutch origin.

demjson.decode() returns a string, not a dictionary, and when I try to load that string with json.loads(), it raises an error: Expecting property name enclosed in double quotes.

Moreover, although demjson does return a string for the first line, it raises an error for lines where the value of the key 'voornaam' contains an apostrophe (e.g. ""M'hamed"").

Skipping such entries by catching the exception with try/except works okay.
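
For illustration, the skip-on-error approach could look roughly like the sketch below. It uses ast.literal_eval from the standard library as a stand-in parser (an assumption; the post itself used demjson), and parse_or_skip and the shortened sample lines are hypothetical.

```python
import ast

# Shortened, hypothetical sample lines in the same shape as the data file.
sample = [
    '''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794'}"''',
    '''"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103'}"''',
]

def parse_or_skip(lines):
    """Parse each line into a dict; collect lines that fail instead of stopping."""
    parsed, skipped = [], []
    for line in lines:
        # Drop surrounding whitespace and the enclosing double quotes.
        text = line.strip().strip('"')
        try:
            parsed.append(ast.literal_eval(text))
        except (ValueError, SyntaxError):
            # Lines with doubled double quotes are not valid Python literals.
            skipped.append(line)
    return parsed, skipped

parsed, skipped = parse_or_skip(sample)
print(len(parsed), "parsed,", len(skipped), "skipped")
```

Keeping the skipped lines in a list, rather than silently dropping them, makes it easy to check afterwards how many names were lost.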

Thanks for your input.
Reply
#12
Can you show exactly what a line with an apostrophe in the name looks like? I mean the whole line.
What you show, i.e. doubled double quotes around M'hamed, makes it even more awkward to parse; it may take a regex to turn it into valid JSON.

EDIT: I saw that the third line is that one.
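
As an aside, the doubled double quotes look like CSV-style quote escaping. Assuming that is really how the file was written (not confirmed in this thread), a regex may not be needed: the standard csv module can undo the quoting and ast.literal_eval can read the result as a Python dict. This is only a sketch; read_records and the shortened sample lines are hypothetical.

```python
import ast
import csv

# Shortened, hypothetical sample lines in the same shape as the data file.
sample = [
    '''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794'}"''',
    '''"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103'}"''',
]

def read_records(lines):
    """Treat each line as a one-column CSV row holding a Python dict literal."""
    records = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip empty lines
        # csv.reader has already removed the outer quotes and turned "" into "
        records.append(ast.literal_eval(row[0]))
    return records

records = read_records(sample)
for record in records:
    print(record)
```

With a real file, the open file object could be passed to csv.reader in place of the list. This does not turn the data into JSON; it reads each line as a Python literal instead.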
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#13
Yes, it is difficult. Here are a few of the problematic entries:


"{'voornaam': ""M'hamed"", 'geslacht': 'M', 't8306': '103', 'n8589': '36', 'n9094': '14', 'n9599': '7', 'n0004': '8', 'p8589': '67', 'p9094': '24', 'p9599': '13', 'p0004': '15'}"
"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"
"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"
Reply
#14
Actually, is it just the voornaam value that can have an apostrophe?

Reply
#15
Yes, that's right. Only voornaam can have an apostrophe.
Reply
#16
lines = [
'''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"''',
'''"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"''',
'''"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"'''
]

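def parse_item(item):
    # item is a [key, value] pair produced by splitting on ':'
    key, value = item
    # drop surrounding whitespace, then the enclosing single quotes
    key = key.strip()[1:-1]
    # turn each doubled double quote into a single quote, then drop the enclosing quotes
    value = value.strip().replace('""', "'")[1:-1]
    return (key, value)

def parse_line(line):
    # line[2:-2] drops the leading "{ and the trailing }";
    # this assumes no value contains a comma or a colon
    line = [item.split(':') for item in line[2:-2].split(',')]
    return dict((parse_item(item) for item in line))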

for line in lines:
    data = parse_line(line)
    print(type(data), data)
Output:
<class 'dict'> {'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}
<class 'dict'> {'voornaam': "M'Hamed", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}
<class 'dict'> {'voornaam': "D'Angelo", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}
Of course, you can cast the values to int where possible and desired.
Also, you could combine both functions, but keeping them separate makes them easier to test.
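
A minimal sketch of the int casting, assuming that any value int() rejects (such as the name or the gender code) should simply stay a string. The helper names to_int_if_possible and cast_values are made up for this example, and the record is shortened.

```python
def to_int_if_possible(value):
    """Return int(value) when the string is a whole number, else the string itself."""
    try:
        return int(value)
    except ValueError:
        return value

def cast_values(record):
    """Apply the conversion to every value of an already parsed record."""
    return {key: to_int_if_possible(value) for key, value in record.items()}

# Shortened record in the shape produced by parse_line()
record = {'voornaam': "M'Hamed", 'geslacht': 'M', 't8306': '46', 'n0004': '0'}
converted = cast_values(record)
print(converted)
```

Doing the conversion in a separate step keeps parse_item() unchanged and easy to test on its own.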

Reply
#17
Thanks a lot, buran. The code appears to deal with the problematic entries in a simple and clean manner. Obviously, it takes experience to discern quirky patterns in the data. Hopefully I will spend more time poring over such data.

I am not in a position to run the code on the larger file right now, but I will do so tomorrow and report back. I am quite sure, however, that this code will work.

Thanks again.
Reply
#18
@buran:
The script works like a charm. However, while trying to understand the logic, I rewrote the loops in the parse_line() function as follows, introducing print() and input() statements:
lines = [
'''"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"''',
'''"{'voornaam': ""M'Hamed"", 'geslacht': 'M', 't8306': '46', 'n8589': '21', 'n9094': '8', 'n9599': '4', 'n0004': '0', 'p8589': '39', 'p9094': '14', 'p9599': '7', 'p0004': '0'}"''',
'''"{'voornaam': ""D'Angelo"", 'geslacht': 'M', 't8306': '46', 'n8589': '3', 'n9094': '5', 'n9599': '9', 'n0004': '24', 'p8589': '6', 'p9094': '9', 'p9599': '17', 'p0004': '46'}"'''
]

def parse_item(item):
	
	'''  parse for key, value'''
	key, value = item
	key = key.strip()[1:-1]
	value = value.strip().replace('""', "'")[1:-1]
	return (key, value)
 
def parse_line(line):
	
	'''parses line: takes the line from position 2 (cuts off the string up to the beginning of 'voornaam'),
		splits at ',' creating a list. Then splits each item of the list at ':', creating a list of lists,
		each list containing two elements. Then converts it into a dictionary using parse_item() function
The original code is commented out'''
	for item in line[2:-2].split(','):
		print(item)
		line = [item.split(':')]
		print(line)
		input()
	for item in line:
		return dict(parse_item(item))

	#line = [item.split(':') for item in line[2:-2].split(',')]
	#return dict((parse_item(item) for item in line))
 
for line in lines:
	print(line)
	data = parse_line(line)
	print(type(data), data)
	print()
	input()
I think the rewritten code follows the original, but when I run it, it raises a ValueError as shown below:
Output:
"{'voornaam': 'Thomas', 'geslacht': 'M', 't8306': '26794', 'n8589': '4856', 'n9094': '6559', 'n9599': '6412', 'n0004': '5897', 'p8589': '8972', 'p9094': '11424', 'p9599': '11760', 'p0004': '11324'}"
'voornaam': 'Thomas'
[["'voornaam'", " 'Thomas'"]]
'geslacht': 'M'
[[" 'geslacht'", " 'M'"]]
't8306': '26794'
[[" 't8306'", " '26794'"]]
'n8589': '4856'
[[" 'n8589'", " '4856'"]]
'n9094': '6559'
[[" 'n9094'", " '6559'"]]
'n9599': '6412'
[[" 'n9599'", " '6412'"]]
'n0004': '5897'
[[" 'n0004'", " '5897'"]]
'p8589': '8972'
[[" 'p8589'", " '8972'"]]
'p9094': '11424'
[[" 'p9094'", " '11424'"]]
'p9599': '11760'
[[" 'p9599'", " '11760'"]]
'p0004': '11324'
[[" 'p0004'", " '11324'"]]
Traceback (most recent call last):
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
    start(fakepyfile,mainpyfile)
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
    exec(open(mainpyfile).read(), __main__.__dict__)
  File "<string>", line 38, in <module>
  File "<string>", line 28, in parse_line
ValueError: dictionary update sequence element #0 has length 5; 2 is required

[Program finished]
Am I forgetting something while rewriting?
Reply
#19
In slow motion:

def parse_line(line):
    result = dict() # you need this because of not using list comprehension like in the original code
    for item in line[2:-2].split(','): # remove quotes and {} from line and split at comma. Iterate over items in resulting list
        item = item.split(':') # split item at : and get a 2-element list
        key, value = parse_item(item) # parse the item - remove quotes, strip leading and trailing spaces, etc. and assign to key and value names
        input(f'{key} --> {value}') # see what you've got
        result[key] = value # add element to dict
    return result # return result dict
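
As for the ValueError in the rewritten version: as far as I can tell, two things combine. The first loop rebinds line on every pass, so after the loop only the last pair (p0004) is left. Then dict() is called on a single (key, value) tuple, so it tries to treat the key string itself as a pair, and 'p0004' has five characters. The snippet below reproduces just that step; the variable names are only for illustration.

```python
# What parse_item() returns for the last item of the first line
pair = ('p0004', '11324')

try:
    # dict() expects an iterable of 2-element pairs; here it sees the
    # 5-character string 'p0004' as element #0 and rejects it
    dict(pair)
except ValueError as exc:
    message = str(exc)
    print(message)

# Wrapping the pair in a list gives dict() one proper (key, value) pair
result = dict([pair])
print(result)
```

That is why the version above builds the result dict up front and adds one key at a time inside the loop.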

Reply
#20
I think I have found the cause. The rewritten parse_line() function should have been:
def parse_line(line):
	
	'''parses line: takes the line from position 2 (cuts off the string up to the beginning of 'voornaam'),
		splits at ',' creating a list. Then splits each item of the list at ':', creating a list of lists,
		each list containing two elements. Then converts it into a dictionary using parse_item()'''
	data_dict = {}
	for item in line[2:-2].split(','):
		print(item)
		line = [item.split(':')]
		print(line)
		input()
		
		for item in line:
			key, value = parse_item(item)
			data_dict[key] = value
	return data_dict
This works fine.

Sorry for posting earlier without thinking it through.
Reply

