Python Forum

getting unique values and counting amounts
Thank you. Still, I don't have a completely clear picture. Why would it be a nested dictionary?
Almost every dictionary that I create is nested. This is a generic dictionary function that I wrote for my own use, and I wanted it to be able to display any dictionary, nested or not.
So basically we don't really have a nested dictionary in this case, right?
Correct.
I use this routine a lot when I'm scraping new sites because I usually build a dictionary which is actually a partial sitemap of
the site I'm scraping. It makes it easier to find my target areas.
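For anyone following along, a minimal sketch of that kind of recursive display routine might look something like this (display_dict is just a made-up name for illustration, not the actual function being discussed):

def display_dict(d, indent=0):
    # Recursively print a dictionary; works whether values are nested dicts or plain values
    for key, value in d.items():
        if isinstance(value, dict):
            print('    ' * indent + str(key) + ':')
            display_dict(value, indent + 1)
        else:
            print('    ' * indent + str(key) + ': ' + str(value))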
I'm trying to do something similar here:
#! python3
# n-gram for an inauguration speech

import requests
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
	input = re.sub('\n', " ", input).lower()
	input = re.sub('\[[0-9]*\]', "", input)
	input = re.sub(' +', " ", input)
	input = bytes(input, "UTF-8")
	input = input.decode("ascii", "ignore")
	cleanInput = []
	input = input.split(' ')
	for item in input:
		item = item.strip(string.punctuation)
		if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
			cleanInput.append(item)
	return cleanInput 

def getNgrams(input, n):
	input = cleanInput(input)
	output = {}
	for i in range(len(input)-n+1):
		ngramTemp = " ".join(input[i:i+n])
		if ngramTemp not in output:
			output[ngramTemp] = 0
		else:
			output[ngramTemp] += 1
	return output
		
content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse=True)
print(sortedNGrams) 
but get this:
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\speech.py", line 35, in <module>
    content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
AttributeError: 'Response' object has no attribute 'read'
When I remove read() I keep getting different sorts of errors. If it would help, I can post them here.
Any idea how to fix this line?
this line:
content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
I would do this in several steps so that you can check the status of the request.
replace with:
response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text
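A variation on the same idea, if you'd rather fail loudly than check the code yourself (just a sketch using requests' raise_for_status(), which is not part of the suggestion above):

import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
response.raise_for_status()   # raises an HTTPError if the server didn't return a 2xx status
content = response.text       # requests decodes the body to a str for you
print(content[:200])          # quick sanity check that the speech actually came back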
Also, I don't know what your ultimate goal is, but this looks like something you might want to analyze with NLTK; at least read a bit about what it's capable of doing.
See: https://www.nltk.org/
You can install it and play with it a bit.


For example, to get n-grams using your content, the code would look something like:

import nltk, re, string, collections
from nltk.util import ngrams

tokenized = content.split()
esBigrams = ngrams(tokenized, 2)
That will get all of the ngrams.
Now you can go a step further:

Get the frequency of each bigram in our corpus:
esBigramFreq = collections.Counter(esBigrams)
Get the 10 most common n-grams in the text:
esBigramFreq.most_common(10)
and lots more
Here's the tutorial for above code: https://www.kaggle.com/rtatman/tutorial-getting-n-grams
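Putting those steps together into one self-contained sketch (assuming the same URL and variable names used earlier in the thread):

import collections
import requests
from nltk.util import ngrams

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
content = response.text

tokenized = content.split()                     # crude whitespace tokenization
esBigrams = ngrams(tokenized, 2)                # lazy iterator of 2-tuples
esBigramFreq = collections.Counter(esBigrams)   # frequency of each bigram
print(esBigramFreq.most_common(10))             # the ten most common bigrams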
Using only your first 3 lines of code instead of my previous code that raised the error gives n-grams with values of 0. Not sure why that is.

Kaggle is a data science site. Can we consider web scraping a part of data science then?

Will definitely take a look at NLTK stuff, although I don't think that I should spend too much time on this for now...

And this code:
import nltk, re, string, collections
from nltk.util import ngrams
import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
	content = response.text
	
tokenized = content.split()
esBigrams = ngrams(tokenized, 2)
gives
Output:
A**** told C**** that E**** knew B**** was a double agent. CENSORED gave the secret documents to CENSORED. Phone number found: 515-555-4444 555 (415) Batman Batmobile mobile 555-4444 Batwowowoman Batwowowowoman <_sre.SRE_Match object; span=(0, 6), match='HaHaHa'> None >>>
The mystery continues... What would happen if I added some real cleanInput and getNgrams functions? I'll work on that tomorrow after I study the nltk documentation.
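One side note on the snippet above: nltk.util.ngrams() returns a lazy generator, so printing esBigrams directly only shows the generator object; you have to consume it (with list(), a loop, or a Counter) to see actual bigrams. A tiny illustration with made-up input:

from nltk.util import ngrams

tokenized = "call me Ishmael some years ago".split()
esBigrams = ngrams(tokenized, 2)
print(list(esBigrams)[:3])   # [('call', 'me'), ('me', 'Ishmael'), ('Ishmael', 'some')]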
NLTK is great, but it can be a bit tricky at first. I use the O'Reilly book 'Natural Language Processing with Python' as a guide, but I'm sure there are better examples on the web now, as I purchased the book back in 2013 and it was published in 2009. If you wish, I'll take a look and see what's available now; and if someone else is reading this post and knows where to look, that would also help.

found this snippet:
import nltk
from nltk.util import ngrams

def word_grams(words, min=1, max=4):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

print(word_grams('one two three four'.split(' ')))
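For reference, that print should show every 1-, 2- and 3-gram of the sample sentence (range(min, max) stops before max=4, so 4-grams are skipped):
Output:
['one', 'two', 'three', 'four', 'one two', 'two three', 'three four', 'one two three', 'two three four']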
Thank you, no need to look for anything newer as I'm not planning to focus that much on this field... at least not for now.
Will check the snippet.
import nltk, re, string, collections
from nltk.util import ngrams     # function for making ngrams

with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
	text = file.read()
print(text[0:1000])
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\ngramss.py", line 7, in <module>
    with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
OSError: [Errno 22] Invalid argument: 'http://pythonscraping.com/files/inaugurationSpeech.txt'
Any idea why this argument is invalid?
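In case it helps: open() only accepts local file paths, not URLs, which is why Windows reports the argument as invalid. A sketch that fetches the file over HTTP instead, reusing requests as earlier in the thread and keeping the latin-1 decoding:

import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
response.encoding = 'latin-1'   # match the encoding used in the open() call above
text = response.text
print(text[0:1000])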