Posts: 404
Threads: 94
Joined: Dec 2017
Mar-01-2019, 11:28 PM
(This post was last modified: Mar-05-2019, 06:20 AM by Yoriz.)
# getting unique ngrams out of duplicates and counting how many times ngram appears
import requests
from bs4 import BeautifulSoup
import re
import string
from collections import OrderedDict
def cleanInput(input):
    input = re.sub('\n', " ", input)
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    input = input.upper()
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input)-n+1):
        newNGram = " ".join(input[i:i+n])
        if newNGram in output:
            output[newNGram] += 1
        else:
            output[newNGram] = 1
    return output

html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html.content, 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)
print("2-grams count is: "+str(len(ngrams)))
I'm working on this n-gram extraction example from the book on web scraping. What surprised me is that the program didn't find any duplicates: every two-word n-gram has a count of 1. The output given in the book is quite different. For example,
Output: ('of the', 38)
is sorted first, so I'm wondering if I'm doing something wrong.
Could you also give me an explanation of this line:
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...
Posts: 12,024
Threads: 484
Joined: Sep 2016
You've piqued my curiosity. What is the actual title of the book on web scraping?
Posts: 404
Threads: 94
Joined: Dec 2017
We've mentioned it before on a different thread - Web Scraping with Python by Ryan Mitchell.
Posts: 12,024
Threads: 484
Joined: Sep 2016
I own it and have read parts of it; looks like I need to finish the book.
Sorry I can't help with your problem, but thank you.
Posts: 4,220
Threads: 97
Joined: Sep 2016
(Mar-01-2019, 11:28 PM)Truman Wrote: Could you also give me an explanation of this line:
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...
I'm not familiar with the packages you are using, but I can help you with this bit. The key parameter to sorted (or list.sort()) is a function that returns the value to actually sort by; it is called once for each element. This is very often done as a lambda function. If you are not familiar with those, they are simple throw-away functions. They have one expression, and the value of that expression is the return value of the function. In this case you are sorting dictionary items, which are key/value pairs. So if t is a dictionary item, t[1] is the value. The sorted call returns the key/value pairs from the dictionary, sorted by the values.
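To make the key function concrete, here is a minimal sketch. The `counts` dictionary is made-up sample data, not output from the script in this thread.

```python
from collections import OrderedDict

# Hypothetical sample data standing in for the n-gram counts.
counts = {"of the": 38, "python is": 5, "in the": 21}

# Each item is a (key, value) tuple. The lambda returns the value (index 1),
# so sorted() orders the pairs by count; reverse=True puts the largest first.
by_count = sorted(counts.items(), key=lambda t: t[1], reverse=True)
print(by_count)

# OrderedDict keeps the pairs in the order they were passed in.
ordered = OrderedDict(by_count)
print(list(ordered.keys()))
```

Without the key argument, sorted would compare the tuples themselves, which orders them by the n-gram text first rather than by the count.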
Posts: 12,024
Threads: 484
Joined: Sep 2016
Mar-02-2019, 02:27 AM
(This post was last modified: Mar-02-2019, 02:27 AM by Larz60+.)
I ran your code (with an added dictionary display) and got a count of 6772.
The only modification I made was adding the display_dict function.
# getting unique ngrams out of duplicates and counting how many times ngram appears
import requests
from bs4 import BeautifulSoup
import re
import string
from collections import OrderedDict
def cleanInput(input):
    input = re.sub('\n', " ", input)
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    input = input.upper()
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input)-n+1):
        newNGram = " ".join(input[i:i+n])
        if newNGram in output:
            output[newNGram] += 1
        else:
            output[newNGram] = 1
    return output

def display_dict(thedict):
    for key, value in thedict.items():
        if isinstance(value, dict):
            print(f'{key}:')
            display_dict(value)
        else:
            print(f' {key}: {value}')

html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html.content, 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
display_dict(ngrams)
# print(ngrams)
print("2-grams count is: "+str(len(ngrams)))
I will add the results as an attachment.
ngrams.txt (Size: 140.46 KB / Downloads: 348)
Sorry for so many edits; attaching the file gave me a bit of trouble.
Posts: 404
Threads: 94
Joined: Dec 2017
Mar-02-2019, 10:55 PM
(This post was last modified: Mar-02-2019, 10:57 PM by Truman.)
Strange, when I run your code all n-grams have value 1.
Output:
WIKIBOOKS RESOURCES: 1
RESOURCES FROM: 1
FROM WIKIVERSITY: 1
WIKIVERSITY OFFICIAL: 1
OFFICIAL WEBSITE: 1
WEBSITE PYTHON: 1
AT CURLIE: 1
CURLIE VTEPROGRAMMING: 1
VTEPROGRAMMING LANGUAGES: 1
LANGUAGES COMPARISON: 1
COMPARISON TIMELINE: 1
TIMELINE HISTORY: 1
HISTORY APL: 1
APL ASSEMBLY: 1
ASSEMBLY BASIC: 1
BASIC COBOL: 1
COBOL FORTRAN: 1
FORTRAN GO: 1
GO GROOVY: 1
GROOVY HASKELL: 1
HASKELL JAVA: 1
JAVA JAVASCRIPT: 1
JAVASCRIPT JS: 1
JS JULIA: 1
JULIA KOTLIN: 1
KOTLIN LISP: 1
LISP LUA: 1
etc.
(Mar-02-2019, 02:12 AM)ichabod801 Wrote: (Mar-01-2019, 11:28 PM)Truman Wrote: Could you also give me an explanation of this line:
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for...
I'm not familiar with the packages you are using, but I can help you with this bit. The key parameter to sorted (or list.sort()) is a function that returns the value to actually sort by; it is called once for each element. This is very often done as a lambda function. If you are not familiar with those, they are simple throw-away functions. They have one expression, and the value of that expression is the return value of the function. In this case you are sorting dictionary items, which are key/value pairs. So if t is a dictionary item, t[1] is the value. The sorted call returns the key/value pairs from the dictionary, sorted by the values.
Does that mean that t is a key and t[1] is a value in this dictionary?
Posts: 4,220
Threads: 97
Joined: Sep 2016
(Mar-02-2019, 10:55 PM)Truman Wrote: Does that mean that t is a key and t[1] is a value in this dictionary?
Not quite. t[0] is a key, and t[1] is the value associated with that key.
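A small sketch of what a single item looks like, using a made-up one-entry dictionary (the name `first` is only for illustration):

```python
# Hypothetical one-entry dictionary, just to show the shape of an item.
ngrams = {"OF THE": 38}

# items() yields (key, value) tuples; take the first one.
first = next(iter(ngrams.items()))

print(type(first).__name__)  # the item is a tuple
print(first[0])              # the key
print(first[1])              # the value stored under that key
```

So t itself is the whole pair, and indexing picks out one half of it.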
Posts: 404
Threads: 94
Joined: Dec 2017
Mar-03-2019, 01:18 AM
(This post was last modified: Mar-03-2019, 01:18 AM by Truman.)
Ok, thanks.
Larz, I don't quite understand the if statement in the display_dict function.
P.S. Now I see that there is a web scraping subforum. You may want to move this thread there.
Posts: 12,024
Threads: 484
Joined: Sep 2016
Quote: Larz, I don't quite understand the if statement in the display_dict function.
isinstance checks the type of value, which will be dict if it is a nested dictionary. If so,
display_dict calls itself recursively for the nested portion.
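As a rough illustration of both branches, here is the same function run on made-up data that contains one nested dictionary. The `sample` dictionary is hypothetical; in this thread's ngrams dictionary every value is an integer, so only the else branch runs there.

```python
def display_dict(thedict):
    for key, value in thedict.items():
        if isinstance(value, dict):
            # Nested dictionary: print its key as a heading, then recurse.
            print(f'{key}:')
            display_dict(value)
        else:
            # Plain value: print the key/value pair directly.
            print(f' {key}: {value}')

# Hypothetical data with one nested dictionary.
sample = {"OF THE": 38, "nested": {"IN THE": 21}}
display_dict(sample)
```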