getting unique values and counting amounts

getting unique values and counting amounts - Truman - Mar-01-2019

#  getting unique ngrams out of duplicates and counting how many times ngram appears
import requests
from bs4 import BeautifulSoup
import re
import string
from collections import OrderedDict

def cleanInput(input):
	input = re.sub('\n', " ", input)
	input = re.sub(r'\[[0-9]*\]', "", input)
	input = re.sub(' +', " ", input)
	input = bytes(input, "UTF-8")
	input = input.decode("ascii", "ignore")
	input = input.upper()
	cleanInput = []
	input = input.split(' ')
	for item in input:
		item = item.strip(string.punctuation)
		if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
			cleanInput.append(item)
	return cleanInput

def getNgrams(input, n):
	input = cleanInput(input)
	output = dict()
	for i in range(len(input)-n+1):
		newNGram = " ".join(input[i:i+n])
		if newNGram in output:
			output[newNGram] += 1
		else:
			output[newNGram] = 1
	return output

html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html.content, 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)
print("2-grams count is: "+str(len(ngrams)))

I'm working on this n-gram extraction example from a web scraping book. What surprised me is that the program didn't find any duplicates: every two-word n-gram has a count of 1. The output given in the book is quite different. For example,

Output:
('of the', 38)

is sorted first, therefore I'm wondering if I'm doing something wrong.
Could you also give me an explanation of this line:

ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))

Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...

RE: n-gram - Larz60+ - Mar-02-2019

You've triggered my curiosity: what is the actual title of the 'book of web scraping'?

RE: n-gram - Truman - Mar-02-2019

We've mentioned it before on a different thread - Web Scraping with Python by Ryan Mitchell.

RE: n-gram - Larz60+ - Mar-02-2019

I own it and have read parts of it; looks like I need to finish the book.
Sorry I can't help with your problem, but thank you.

RE: n-gram - ichabod801 - Mar-02-2019

(Mar-01-2019, 11:28 PM)Truman Wrote: Could you also give me an explanation of this line:

ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...

I'm not familiar with the packages you are using, but I can help you with this bit. The key parameter to sorted (or list.sort()) is a function that returns the value to actually sort by. This is very often done as a lambda function. If you are not familiar with those, they are simple throw-away functions. They have one expression, and the value of that expression is the return value of the function. In this case you are sorting dictionary items, which are key/value pairs. So if t is a dictionary item, t[1] is the value. The sorted call returns the key/value pairs from the dictionary, sorted by the values, and reverse=True puts the largest values first.
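
As a minimal sketch of the above, here is the same sort applied to a small dictionary. The counts are made up for illustration and are not taken from the Wikipedia page.

```python
# Hypothetical n-gram counts, made up for illustration.
counts = {"OF THE": 38, "PYTHON IS": 12, "IN THE": 21}

# Without a key function, the (key, value) tuples are compared as tuples,
# so the result is ordered alphabetically by the n-gram text.
by_text = sorted(counts.items())
print(by_text)
# [('IN THE', 21), ('OF THE', 38), ('PYTHON IS', 12)]

# With key=lambda t: t[1], each tuple t is ranked by its count instead.
# reverse=True puts the largest count first.
by_count = sorted(counts.items(), key=lambda t: t[1], reverse=True)
print(by_count)
# [('OF THE', 38), ('IN THE', 21), ('PYTHON IS', 12)]
```

The key function is called once per item, and only its return value is used for ordering; the items themselves come back unchanged.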

RE: n-gram - Larz60+ - Mar-02-2019

I ran your code (with an added dictionary display) and got a count of 6772.
The only modification I made was adding the display_dict function.

#  getting unique ngrams out of duplicates and counting how many times ngram appears
import requests
from bs4 import BeautifulSoup
import re
import string
from collections import OrderedDict
 
def cleanInput(input):
    input = re.sub('\n', " ", input)
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    input = input.upper()
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput
 
def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input)-n+1):
        newNGram = " ".join(input[i:i+n])
        if newNGram in output:
            output[newNGram] += 1
        else:
            output[newNGram] = 1
    return output

def display_dict(thedict):
    for key, value in thedict.items():
        if isinstance(value, dict):
            print(f'{key}:')
            display_dict(value)
        else:
            print(f'    {key}: {value}')

html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html.content, 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
display_dict(ngrams)
# print(ngrams)
print("2-grams count is: "+str(len(ngrams)))

I will add the results as an attachment:
[attachment=575]

Sorry for so many edits; I had a bit of trouble attaching the file.

RE: n-gram - Truman - Mar-02-2019

Strange, when I run your code all n-grams have value 1.

Output:
 WIKIBOOKS RESOURCES: 1
 RESOURCES FROM: 1
 FROM WIKIVERSITY: 1
 WIKIVERSITY OFFICIAL: 1
 OFFICIAL WEBSITE: 1
 WEBSITE PYTHON: 1
 AT CURLIE: 1
 CURLIE VTEPROGRAMMING: 1
 VTEPROGRAMMING LANGUAGES: 1
 LANGUAGES COMPARISON: 1
 COMPARISON TIMELINE: 1
 TIMELINE HISTORY: 1
 HISTORY APL: 1
 APL ASSEMBLY: 1
 ASSEMBLY BASIC: 1
 BASIC COBOL: 1
 COBOL FORTRAN: 1
 FORTRAN GO: 1
 GO GROOVY: 1
 GROOVY HASKELL: 1
 HASKELL JAVA: 1
 JAVA JAVASCRIPT: 1
 JAVASCRIPT JS: 1
 JS JULIA: 1
 JULIA KOTLIN: 1
 KOTLIN LISP: 1
 LISP LUA: 1

etc.

(Mar-02-2019, 02:12 AM)ichabod801 Wrote:
(Mar-01-2019, 11:28 PM)Truman Wrote: Could you also give me an explanation of this line:

ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...

I'm not familiar with the packages you are using, but I can help you with this bit. The key parameter to sorted (or list.sort()) is a function that returns the value to actually sort by. This is very often done as a lambda function. If you are not familiar with those, they are simple throw-away functions. They have one expression, and the value of that expression is the return value of the function. In this case you are sorting dictionary items, which are key/value pairs. So if t is a dictionary item, t[1] is the value. The sorted call returns the key/value pairs from the dictionary, sorted by the values, and reverse=True puts the largest values first.

Does that mean that t is a key and t[1] is a value in this dictionary?

RE: n-gram - ichabod801 - Mar-02-2019

(Mar-02-2019, 10:55 PM)Truman Wrote: Does that mean that t is a key and t[1] is a value in this dictionary?

Not quite. t[0] is a key, and t[1] is the value associated with that key.
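
To make that concrete, here is a small sketch that pulls one item out of a dictionary and indexes it. The dictionary contents and the helper name count_of are made up for illustration.

```python
from collections import OrderedDict

# Hypothetical counts, made up for illustration.
ngrams = {"PYTHON IS": 2, "OF THE": 5}

# Each element of ngrams.items() is a (key, value) tuple.
t = next(iter(ngrams.items()))
print(t)     # ('PYTHON IS', 2)
print(t[0])  # PYTHON IS
print(t[1])  # 2

# A named function that does the same job as: lambda t: t[1]
def count_of(t):
    return t[1]

ordered = OrderedDict(sorted(ngrams.items(), key=count_of, reverse=True))
print(list(ordered.items()))  # [('OF THE', 5), ('PYTHON IS', 2)]
```

So t is the whole pair, and the lambda just picks the second element of that pair as the thing to sort on.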

RE: n-gram - Truman - Mar-03-2019

Ok, thanks.

Larz, I don't quite understand the if statement in the display_dict function.

p.s. Now I see that there is a web scraping subforum. You may want to move this thread there.

RE: n-gram - Larz60+ - Mar-03-2019

Quote: Larz, I don't quite understand the if statement in the display_dict function.

isinstance checks the type of value, which will be dict if it is a nested dictionary. If so, display_dict calls itself recursively on the nested portion.
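
Here is a short sketch of that behaviour, using the display_dict function from the earlier post on a made-up nested dictionary. The names nested and is_nested and the data are hypothetical, for illustration only.

```python
def display_dict(thedict):
    for key, value in thedict.items():
        if isinstance(value, dict):
            # value is itself a dictionary: print its key as a heading,
            # then recurse into it.
            print(f'{key}:')
            display_dict(value)
        else:
            # value is a plain (non-dict) object: print it next to its key.
            print(f'    {key}: {value}')

# Hypothetical data: one nested dictionary and one plain value.
nested = {"bigrams": {"OF THE": 38, "IN THE": 21}, "total": 2}

# Which values would take the recursive branch?
is_nested = {key: isinstance(value, dict) for key, value in nested.items()}
print(is_nested)  # {'bigrams': True, 'total': False}

display_dict(nested)
# bigrams:
#     OF THE: 38
#     IN THE: 21
#     total: 2
```

The n-gram dictionary in this thread is flat, with integer counts as values, so isinstance(value, dict) is always False there and only the else branch runs.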