Python Forum
getting unique values and counting amounts
#1
#  getting unique ngrams out of duplicates and counting how many times ngram appears
import requests
from bs4 import BeautifulSoup
import re
import string
from collections import OrderedDict

def cleanInput(input):
	input = re.sub('\n', " ", input)
	input = re.sub(r'\[[0-9]*\]', "", input)  # raw string avoids invalid-escape warnings
	input = re.sub(' +', " ", input)
	input = bytes(input, "UTF-8")
	input = input.decode("ascii", "ignore")
	input = input.upper()
	cleanInput = []
	input = input.split(' ')
	for item in input:
		item = item.strip(string.punctuation)
		if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
			cleanInput.append(item)
	return cleanInput

def getNgrams(input, n):
	input = cleanInput(input)
	output = dict()
	for i in range(len(input)-n+1):
		newNGram = " ".join(input[i:i+n])
		if newNGram in output:
			output[newNGram] += 1
		else:
			output[newNGram] = 1
	return output

html = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html.content, 'html.parser')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)
print("2-grams count is: "+str(len(ngrams)))
I'm working through this n-gram extraction example from a book on web scraping. What surprised me is that the program didn't find any duplicates: every two-word n-gram has a count of 1. The output given in the book is quite different. For example,
Output:
('of the', 38)
is sorted first, so I'm wondering whether I'm doing something wrong.
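One way to narrow this down is to run the counting step on a short, fixed string whose duplicates are known in advance, so the web page and the cleaning code are out of the picture. The sketch below is only an illustration: `count_ngrams` and the sample sentence are made-up names and data, and it uses `collections.Counter` in place of the hand-written dictionary in the program above.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count every run of n consecutive words in the list `words`."""
    # Join each window of n words into one string and let Counter tally them.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Hypothetical sample text; "THE CAT" occurs twice on purpose.
sample = "the cat sat on the mat and the cat slept".upper().split()
counts = count_ngrams(sample, 2)

print(counts.most_common(1))  # [('THE CAT', 2)]
print(len(counts))            # 8 distinct 2-grams out of 9 windows
```

If the same test passes with `getNgrams`, the counting logic is probably fine and the difference more likely comes from the page content or from how the printed output is being read.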
Could you also give me an explanation of this line:
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
Particularly this chunk: key=lambda t: t[1]
I don't understand what it is for.
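To make the question concrete, here is a small sketch of what that `key` argument appears to do, using a hand-made dictionary instead of the scraped one. The counts are invented for illustration and `second_item` is a hypothetical helper name. `ngrams.items()` yields `(ngram, count)` tuples, so `t[1]` picks out the count, and `sorted()` compares the elements by that value.

```python
from collections import OrderedDict

# Hypothetical counts, made up for illustration only.
counts = {"PYTHON IS": 1, "OF THE": 3, "IN THE": 2}

# .items() yields (ngram, count) tuples: t[0] is the n-gram, t[1] its count.
pairs = list(counts.items())

# key= tells sorted() which value to compare for each element;
# reverse=True puts the largest count first.
by_count = sorted(pairs, key=lambda t: t[1], reverse=True)
print(by_count)  # [('OF THE', 3), ('IN THE', 2), ('PYTHON IS', 1)]

# The lambda is shorthand for a small named function like this one.
def second_item(t):
    return t[1]

same_result = sorted(pairs, key=second_item, reverse=True)

# OrderedDict keeps the sorted order when the pairs are turned back into a mapping.
ordered = OrderedDict(by_count)
print(next(iter(ordered)))  # OF THE
```

Without `key`, the tuples would be compared by their first element, so the result would be alphabetical by n-gram rather than ordered by count.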


Messages In This Thread
getting unique values and counting amounts - by Truman - Mar-01-2019, 11:28 PM
RE: n-gram - by Larz60+ - Mar-02-2019, 12:02 AM
RE: n-gram - by Truman - Mar-02-2019, 12:40 AM
RE: n-gram - by Larz60+ - Mar-02-2019, 01:25 AM
RE: n-gram - by ichabod801 - Mar-02-2019, 02:12 AM
RE: n-gram - by Larz60+ - Mar-02-2019, 02:27 AM
RE: n-gram - by Truman - Mar-02-2019, 10:55 PM
RE: n-gram - by ichabod801 - Mar-02-2019, 11:00 PM
RE: n-gram - by Truman - Mar-03-2019, 01:18 AM
RE: n-gram - by Larz60+ - Mar-03-2019, 01:31 AM
RE: n-gram - by Truman - Mar-03-2019, 01:43 AM
RE: n-gram - by Larz60+ - Mar-03-2019, 01:50 AM
RE: n-gram - by Truman - Mar-03-2019, 01:54 AM
RE: n-gram - by Larz60+ - Mar-03-2019, 01:59 AM
RE: n-gram - by Truman - Mar-05-2019, 12:17 AM
RE: n-gram - by Larz60+ - Mar-05-2019, 01:23 AM
