getting unique values and counting amounts
Python Forum (https://python-forum.io), Forum: General Coding Help (https://python-forum.io/forum-8.html), Thread: /thread-16483.html
getting unique values and counting amounts - Truman - Mar-01-2019

    # getting unique ngrams out of duplicates and counting how many times ngram appears
    import requests
    from bs4 import BeautifulSoup
    import re
    import string
    from collections import OrderedDict

    def cleanInput(input):
        input = re.sub('\n', " ", input)
        input = re.sub('\[[0-9]*\]', "", input)
        input = re.sub(' +', " ", input)
        input = bytes(input, "UTF-8")
        input = input.decode("ascii", "ignore")
        input = input.upper()
        cleanInput = []
        input = input.split(' ')
        for item in input:
            item = item.strip(string.punctuation)
            if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
                cleanInput.append(item)
        return cleanInput

    def getNgrams(input, n):
        input = cleanInput(input)
        output = dict()
        for i in range(len(input)-n+1):
            newNGram = " ".join(input[i:i+n])
            if newNGram in output:
                output[newNGram] += 1
            else:
                output[newNGram] = 1
        return output

    html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
    bsObj = BeautifulSoup(html.content, 'html.parser')
    content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
    ngrams = getNgrams(content, 2)
    ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
    print(ngrams)
    print("2-grams count is: "+str(len(ngrams)))

I'm working on this example of n-gram extraction in the book of web scraping. What surprised me is that the program didn't find any duplicates: for each n-gram that consists of two words the count is '1'. In the book the given output is much different. For example, is sorted first, therefore I'm wondering if I'm doing something wrong.

Could you also give me an explanation of this line:

    ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))

Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...
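One way to check the counting step without depending on the live Wikipedia page is to run it on a small fixed word list. The following is only a minimal offline sketch: `count_ngrams` and the sample sentence are hypothetical names and data, not taken from the book, and it swaps in `collections.Counter` for the manual dictionary updates in `getNgrams`.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count every run of n consecutive words (hypothetical helper)."""
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Made-up sample text in which the bigram "THE PYTHON" occurs twice.
sample = "THE PYTHON LANGUAGE AND THE PYTHON COMMUNITY".split()
counts = count_ngrams(sample, 2)

# most_common() lists the n-grams from most to least frequent.
print(counts.most_common())
```

If a repeated bigram in a known input comes out with a count above 1, the counting logic itself is working and the place to look is the text being fed into it.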
RE: n-gram - Larz60+ - Mar-02-2019
You've triggered my curiosity: what is the actual title of 'book of web scraping'?

RE: n-gram - Truman - Mar-02-2019
We've mentioned it before on a different thread - Web Scraping with Python by Ryan Mitchell.

RE: n-gram - Larz60+ - Mar-02-2019
I own it and have read parts of it; looks like I need to finish the book. Sorry I can't help with your problem, but thank you.

RE: n-gram - ichabod801 - Mar-02-2019
(Mar-01-2019, 11:28 PM)Truman Wrote: Could you also give me an explanation of this line:

I'm not familiar with the packages you are using, but I can help you with this bit. The key parameter to sorted (or list.sort()) is a function that returns the value to actually sort by. This is very often done with a lambda function. If you are not familiar with those, they are simple throw-away functions. They have one expression, and the value of that expression is the return value of the function. In this case you are sorting dictionary items, which are key/value pairs. So if t is a dictionary item, t[1] is the value. The line returns the key/value pairs from the dictionary, sorted by the values.
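To make the role of the key function concrete, here is a small sketch on a made-up dictionary. The counts and the name `value_of` are assumptions for illustration only; the `sorted` call has the same shape as the line in the question.

```python
from collections import OrderedDict

# Hypothetical counts, standing in for the real n-gram dictionary.
counts = {"PYTHON SOFTWARE": 3, "OF THE": 7, "SOFTWARE FOUNDATION": 3}

# items() yields (key, value) tuples, so t[0] is the n-gram and t[1] its count.
by_count = sorted(counts.items(), key=lambda t: t[1], reverse=True)

# The lambda is shorthand for a small named function like this one.
def value_of(t):
    return t[1]

same = sorted(counts.items(), key=value_of, reverse=True)

# OrderedDict keeps the pairs in the order they were sorted into.
ordered = OrderedDict(by_count)
print(ordered)
```

Sorting with `key=lambda t: t[0]` instead would order the pairs alphabetically by n-gram rather than by count.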
RE: n-gram - Larz60+ - Mar-02-2019
I ran your code (with added dictionary display) and got a count of 6772. The only modification I made was the addition of the display_dict function.

    # getting unique ngrams out of duplicates and counting how many times ngram appears
    import requests
    from bs4 import BeautifulSoup
    import re
    import string
    from collections import OrderedDict

    def cleanInput(input):
        input = re.sub('\n', " ", input)
        input = re.sub('\[[0-9]*\]', "", input)
        input = re.sub(' +', " ", input)
        input = bytes(input, "UTF-8")
        input = input.decode("ascii", "ignore")
        input = input.upper()
        cleanInput = []
        input = input.split(' ')
        for item in input:
            item = item.strip(string.punctuation)
            if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
                cleanInput.append(item)
        return cleanInput

    def getNgrams(input, n):
        input = cleanInput(input)
        output = dict()
        for i in range(len(input)-n+1):
            newNGram = " ".join(input[i:i+n])
            if newNGram in output:
                output[newNGram] += 1
            else:
                output[newNGram] = 1
        return output

    def display_dict(thedict):
        for key, value in thedict.items():
            if isinstance(value, dict):
                print(f'{key}:')
                display_dict(value)
            else:
                print(f' {key}: {value}')

    html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
    bsObj = BeautifulSoup(html.content, 'html.parser')
    content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
    ngrams = getNgrams(content, 2)
    ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
    display_dict(ngrams)
    # print(ngrams)
    print("2-grams count is: "+str(len(ngrams)))

Results: will add attachment [attachment=575]
Sorry for so many edits, I had a bit of trouble attaching.

RE: n-gram - Truman - Mar-02-2019
Strange, when I run your code all n-grams have value 1.
(Mar-02-2019, 02:12 AM)ichabod801 Wrote: (Mar-01-2019, 11:28 PM)Truman Wrote: Could you also give me an explanation of this line:

Does that mean that t is a key and t[1] is a value in this dictionary?

RE: n-gram - ichabod801 - Mar-02-2019
(Mar-02-2019, 10:55 PM)Truman Wrote: Does that mean that t is a key and t[1] is a value in this dictionary?

Not quite. t[0] is a key, and t[1] is the value associated with that key.

RE: n-gram - Truman - Mar-03-2019
Ok, thanks. Larz, I don't quite understand the if statement in the display_dict function.
p.s. Now I see that there is a web scraping subforum. You may want to move this thread there.

RE: n-gram - Larz60+ - Mar-03-2019
Quote: Larz, I don't quite understand if statement in display_dict function.

isinstance checks the type of value, which will be dict if it is a nested dictionary. If so, display_dict calls itself recursively for the nested portion.
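The recursive branch only matters when a value is itself a dictionary; the n-gram dictionary in this thread is flat, so there it always takes the else branch. The sketch below is a hedged variant rather than Larz60+'s exact function: `format_dict`, the `depth` parameter, and the sample data are hypothetical, and it returns the lines instead of printing them so the result is easy to inspect.

```python
def format_dict(thedict, depth=0):
    """Return display lines for a possibly nested dict (hypothetical variant)."""
    lines = []
    pad = "  " * depth  # two spaces of indentation per nesting level
    for key, value in thedict.items():
        if isinstance(value, dict):
            # Nested dictionary: emit the key, then recurse one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(format_dict(value, depth + 1))
        else:
            # Plain value: emit "key: value" on a single line.
            lines.append(f"{pad}{key}: {value}")
    return lines

# Made-up data: one nested dictionary and one plain value.
nested = {"bigrams": {"OF THE": 7, "PYTHON SOFTWARE": 3}, "total": 2}
print("\n".join(format_dict(nested)))
```

With a flat dictionary such as the n-gram counts, the isinstance test is never true and the function simply lists each pair.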