Python Forum
getting unique values and counting amounts
#11
Thank you. Still, I don't have a completely clear picture. Why would it be a nested dictionary?
Reply
#12
Almost every dictionary that I create is nested. This is a generic dictionary function that I wrote for my own use, and I wanted it to be able to display any dictionary, nested or not.
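For readers without the earlier posts at hand, a routine of this kind can be sketched as below. This is only an illustration: the name `format_dict` and the indentation scheme are assumptions, not the function posted earlier in the thread. It recurses into values that are themselves dictionaries, so a flat dictionary is simply the case where the recursion never triggers.

```python
def format_dict(data, indent=0):
    """Return a list of lines showing a dictionary, nested or flat.

    Hypothetical helper for illustration; not the routine from the thread.
    """
    lines = []
    pad = " " * indent
    for key, value in data.items():
        if isinstance(value, dict):
            # Nested dictionary: show the key, then recurse one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(format_dict(value, indent + 4))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


flat = {"apple": 3, "pear": 1}
nested = {"site": {"home": "/", "docs": {"api": "/docs/api"}}}

print("\n".join(format_dict(flat)))
print("\n".join(format_dict(nested)))
```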
Reply
#13
So basically we don't really have a nested dictionary in this case, right?
Reply
#14
Correct.
I use this routine a lot when I'm scraping new sites because I usually build a dictionary which is actually a partial sitemap of the site I'm scraping. It makes it easier to find my target areas.
Reply
#15
I'm trying to do something similar here:
#! python3
# n-gram for an inauguration speech

import requests
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
	input = re.sub('\n', " ", input).lower()
	input = re.sub(r'\[[0-9]*\]', "", input)
	input = re.sub(' +', " ", input)
	input = bytes(input, "UTF-8")
	input = input.decode("ascii", "ignore")
	cleanInput = []
	input = input.split(' ')
	for item in input:
		item = item.strip(string.punctuation)
		if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
			cleanInput.append(item)
	return cleanInput 

def getNgrams(input, n):
		input = cleanInput(input)
		output = {}
		for i in range(len(input)-n+1):
			ngramTemp = " ".join(input[i:i+n])
			if ngramTemp not in output:
				output[ngramTemp] = 0
			else:
				output[ngramTemp] += 1
		return output 
		
content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse=True)
print(sortedNGrams) 
but get this:
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\speech.py", line 35, in <module>
    content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
AttributeError: 'Response' object has no attribute 'read'
When I remove read(), I keep getting different sorts of errors. If it would help, I can post them here.
Any idea how to fix this line?
Reply
#16
This line:
content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
I would do this in several steps so that you can check the status of the request.
Replace it with:
response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text
Also, I don't know what your ultimate goal is, but this looks like something you might want to analyze with NLTK. At least read a bit about what it's capable of doing.
See: https://www.nltk.org/
You can install it and play with it a bit.


For example, to get n-grams from your content, the code would look something like:

import nltk, re, string, collections
from nltk.util import ngrams

tokenized = content.split()
esBigrams = ngrams(tokenized, 2)
That will get all of the bigrams.
Now you can go a step further.

Get the frequency of each bigram in our corpus:
esBigramFreq = collections.Counter(esBigrams)
Get the 10 most common bigrams in the text:
esBigramFreq.most_common(10)
And lots more.
Here's the tutorial for above code: https://www.kaggle.com/rtatman/tutorial-getting-n-grams
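If NLTK is not installed, the same idea can be sketched with the standard library alone. The snippet below is an illustration rather than part of the tutorial: the sample sentence and the helper name `bigrams` are made up, and `zip` stands in for `nltk.util.ngrams`.

```python
import collections


def bigrams(tokens):
    """Pair each token with its successor (hypothetical stand-in for ngrams(tokens, 2))."""
    return zip(tokens, tokens[1:])


# Made-up sample text; in the thread this would be the downloaded speech.
content = "the people and the government of the people"
tokenized = content.split()

bigram_freq = collections.Counter(bigrams(tokenized))

# most_common(n) returns (bigram, count) pairs, highest count first.
print(bigram_freq.most_common(3))
```

Each key in the counter is a tuple of two words, which is also the shape that `ngrams(tokenized, 2)` yields.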
Reply
#17
Using only your first 3 lines of code in place of my previous line that raised the error gives n-grams with values of 0. Not sure why that is.
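One thing that would produce zeros, assuming the `getNgrams` function from post #15 is still being called: it stores 0 the first time it sees an n-gram and only adds 1 on later sightings. Every count therefore comes out one too low, and any n-gram that occurs exactly once is reported as 0. A minimal sketch of a corrected counter follows; the name `count_ngrams` and the sample words are illustrative, and it expects an already cleaned list of words.

```python
def count_ngrams(words, n):
    """Count n-grams in a list of words, starting each count at 1."""
    output = {}
    for i in range(len(words) - n + 1):
        ngram = " ".join(words[i:i + n])
        # dict.get returns 0 for an unseen n-gram, so the first sighting becomes 1.
        output[ngram] = output.get(ngram, 0) + 1
    return output


words = "we the people of the people".split()
counts = count_ngrams(words, 2)
print(counts)
```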

Kaggle is a data science place. Can we consider web scraping a part of data science then?

Will definitely take a look at NLTK stuff, although I don't think that I should spend too much time on this for now...

And this code:
import nltk, re, string, collections
from nltk.util import ngrams
import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
	content = response.text
	
tokenized = content.split()
esBigrams = ngrams(tokenized, 2)
gives
Output:
A**** told C**** that E**** knew B**** was a double agent. CENSORED gave the secret documents to CENSORED. Phone number found: 515-555-4444 555 (415) Batman Batmobile mobile 555-4444 Batwowowoman Batwowowowoman <_sre.SRE_Match object; span=(0, 6), match='HaHaHa'> None >>>
The mystery continues... I wonder what would happen if I added the real functions cleanInput and getNgrams. I will work on that tomorrow after I study the NLTK documentation.
Reply
#18
NLTK is great, but can be a bit tricky at first. I use the O'Reilly book 'Natural Language Processing with Python' as a guide, but I'm sure there are better examples on the web now, as I purchased the book back in 2013 and it was published in 2009. If you wish, I'll take a look and see what's available now, and if someone else is reading this post and knows where to look, that would also help.

found this snippet:
import nltk
from nltk.util import ngrams

def word_grams(words, min=1, max=4):
    # Note: range(min, max) stops before max, so the defaults give 1-, 2- and 3-grams.
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

print(word_grams('one two three four'.split(' ')))
Reply
#19
Thank you, no need to look for anything newer as I'm not planning to focus that much on this field... at least not for now.
Will check the snippet.
Reply
#20
import nltk, re, string, collections
from nltk.util import ngrams     # function for making ngrams

with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
	text = file.read()
print(text[0:1000])
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\ngramss.py", line 7, in <module>
    with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
OSError: [Errno 22] Invalid argument: 'http://pythonscraping.com/files/inaugurationSpeech.txt'
>>>
Any idea why this argument is invalid?
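The likely reason: the built-in `open()` only understands paths on the local file system, so a string starting with `http://` is rejected as an invalid file name. A URL has to be fetched over the network first, for example with `requests.get(...)` as in post #16, or with the standard library's `urllib.request.urlopen`. The sketch below uses the latter; the helper name `read_url` is an assumption, and the `latin-1` encoding is simply carried over from the post above.

```python
import urllib.request


def read_url(url, encoding="latin-1"):
    """Fetch a URL and return its body decoded as text (hypothetical helper)."""
    # urlopen returns a file-like response object; read() gives raw bytes.
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


# Usage (needs network access, so it is left commented out here):
# text = read_url("http://pythonscraping.com/files/inaugurationSpeech.txt")
# print(text[0:1000])
```

Once the text is in a string, the rest of the n-gram code can work on it exactly as it would on the contents of a local file.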
Reply

