Python Forum
getting unique values and counting amounts - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: getting unique values and counting amounts (/thread-16483.html)

Pages: 1 2 3


RE: n-gram - Truman - Mar-03-2019

Thank you. I still don't have a completely clear picture. Why would it be a nested dictionary?


RE: n-gram - Larz60+ - Mar-03-2019

Almost every dictionary that I create is nested. This is a generic dictionary function that I wrote for my own use, and I wanted it to be able to display any dictionary, nested or not.
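The routine itself isn't shown in this part of the thread, but a minimal sketch of that kind of recursive display helper might look something like this (display_dict is a hypothetical name, not the actual function):

def display_dict(d, indent=0):
    # hypothetical sketch, not the actual routine from the thread
    # walk the dictionary, descending into any values that are themselves dictionaries
    for key, value in d.items():
        if isinstance(value, dict):
            print(f"{' ' * indent}{key}:")
            display_dict(value, indent + 4)
        else:
            print(f"{' ' * indent}{key}: {value}")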


RE: n-gram - Truman - Mar-03-2019

So basically we don't really have a nested dictionary in this case, right?


RE: n-gram - Larz60+ - Mar-03-2019

Correct.
I use this routine a lot when I'm scraping new sites, because I usually build a dictionary which is actually a partial sitemap of the site I'm scraping. It makes it easier to find my target areas.
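For example (purely illustrative values, not a real site), such a partial-sitemap dictionary might be shaped like this:

# illustrative only; the keys mirror the sections of whatever site is being scraped
site_map = {
    'https://example.com': {
        'forum': {
            'general-coding-help': ['thread-1.html', 'thread-2.html'],
            'news': ['thread-3.html'],
        },
        'files': ['inaugurationSpeech.txt'],
    },
}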


RE: n-gram - Truman - Mar-05-2019

I'm trying to do something similar here:
#! python3
# n-gram for an inauguration speech

import requests
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        else:
            output[ngramTemp] += 1
    return output
		
content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse=True)
print(sortedNGrams) 
but get this:
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\speech.py", line 35, in <module>
    content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
AttributeError: 'Response' object has no attribute 'read'
When I remove read() I keep getting different sorts of errors; if it would help, I can post them here.
Any idea how to fix this line?


RE: n-gram - Larz60+ - Mar-05-2019

this line:
content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
I would do this in several steps so that you can check the status of the request.
Replace it with:
response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text
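Or, if you'd rather get an exception than check the status code yourself, requests can raise one for you (a sketch; the rest of your script, including getNgrams, stays the same):

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
response.raise_for_status()     # raises requests.HTTPError on a 4xx/5xx response
content = response.text         # the decoded text of the speech
ngrams = getNgrams(content, 2)  # your existing function, unchanged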
Also, I don't know what your ultimate goal is, but this looks like something you might want to analyze with NLTK; at least read a bit about what it's capable of doing.
See: https://www.nltk.org/
You can install it and play with it a bit.
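Installing it should just be a matter of running pip install nltk from the command line.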


For example, to get n-grams from your content, the code would look something like:

import nltk, re, string, collections
from nltk.util import ngrams

tokenized = content.split()
esBigrams = ngrams(tokenized, 2)
That will get all of the ngrams.
Now you can go a step further:

# get the frequency of each bigram in our corpus
esBigramFreq = collections.Counter(esBigrams)
# get the 10 most common n-grams in the text
esBigramFreq.most_common(10)
and lots more
Here's the tutorial for above code: https://www.kaggle.com/rtatman/tutorial-getting-n-grams
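If you want the output in the same shape as the sortedNGrams list from your script, a small sketch (most_common returns the bigrams as tuples of words, so they're joined back into strings here):

for bigram, count in esBigramFreq.most_common(10):
    print(' '.join(bigram), count)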


RE: getting unique values and counting amounts - Truman - Mar-06-2019

Using only your first 3 lines of code instead of my previous code that raised the error gives n-grams with values of 0. Not sure why that is.
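Possibly it's the counting in getNgrams above: the count only gets incremented in the else branch, so an n-gram that occurs once never moves past 0. A sketch of a version that counts every occurrence with collections.Counter (assuming the same cleanInput as above):

def getNgrams(input, n):
    input = cleanInput(input)  # assumes the cleanInput function posted earlier
    # count every n-gram occurrence, including the first one
    return collections.Counter(" ".join(input[i:i+n]) for i in range(len(input)-n+1))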

Kaggle is a data science site. Can we consider web scraping a part of data science, then?

Will definitely take a look at NLTK stuff, although I don't think that I should spend too much time on this for now...

And this code:
import nltk, re, string, collections
from nltk.util import ngrams
import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
	content = response.text
	
tokenized = content.split()
esBigrams = ngrams(tokenized, 2)
gives
Output:
A**** told C**** that E**** knew B**** was a double agent. CENSORED gave the secret documents to CENSORED. Phone number found: 515-555-4444 555 (415) Batman Batmobile mobile 555-4444 Batwowowoman Batwowowowoman <_sre.SRE_Match object; span=(0, 6), match='HaHaHa'> None >>>
The mystery continues... what would happen if I added the real cleanInput and getNgrams functions? I'll work on that tomorrow, after I study the nltk documentation.


RE: getting unique values and counting amounts - Larz60+ - Mar-06-2019

NLTK is great, but it can be a bit tricky at first. I use the O'Reilly book 'Natural Language Processing with Python' as a guide, but I'm sure there are better examples on the web now, as I purchased the book back in 2013 and it was published in 2009. If you wish, I'll take a look and see what's available now; and if someone else reading this post knows where to look, that would also help.

found this snippet:
import nltk
from nltk.util import ngrams

def word_grams(words, min=1, max=4):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

print(word_grams('one two three four'.split(' ')))
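With max=4 the loop over n runs for n = 1, 2, 3, so for that input it should return the unigrams, bigrams, and trigrams as strings: ['one', 'two', 'three', 'four', 'one two', 'two three', 'three four', 'one two three', 'two three four'].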



RE: getting unique values and counting amounts - Truman - Mar-06-2019

Thank you, no need to look for anything newer, as I'm not planning to focus that much on this field... at least not for now.
Will check the snippet.


RE: getting unique values and counting amounts - Truman - Mar-07-2019

import nltk, re, string, collections
from nltk.util import ngrams     # function for making ngrams

with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
	text = file.read()
print(text[0:1000])
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\ngramss.py", line 7, in <module>
    with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
OSError: [Errno 22] Invalid argument: 'http://pythonscraping.com/files/inaugurationSpeech.txt'
Any idea why this argument is invalid?
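Maybe open() only accepts local file paths rather than URLs? If so, something like this (an untested sketch, reusing requests as earlier in the thread) might work instead:

import requests

# fetch the speech over HTTP instead of trying to open() a URL
response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
response.raise_for_status()
text = response.text
print(text[0:1000])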