Python Forum
Identifying items in a csv file that also appear in a Text extract
#11
(Sep-21-2016, 09:11 PM)Jaynorth Wrote: Counter() just returns Counter() in the console when the script is run. country_codes is the name of the csv file that is read into the script, and codes is just a variable I used to hold the relevant column of that file: country_codes['English short name lower case']

I am not matching on the Alpha-2 or Alpha-3 columns in the csv file, which hold the 2- and 3-letter representations of the country, like "CAN" XD

But, why use any sort of Counter() function at all?  len() would do the exact same thing, wouldn't it?

>>> text = '''
... Once upon a time, there was the great country of Mexico.  Then there... blah blah blah'''
>>> [word for word in text.split()]
['Once', 'upon', 'a', 'time,', 'there', 'was', 'the', 'great', 'country', 'of', 'Mexico.', 'Then', 'there...', 'blah', 'blah', 'blah']

>>> import re
>>> [word for word in text.split() if re.sub(r'\W', '', word) in codes]
['Mexico.']

>>> len([word for word in text.split() if re.sub(r'\W', '', word) in codes])
1
#12
(Sep-21-2016, 09:21 PM)nilamo Wrote:
But, why use any sort of Counter() function at all?  len() would do the exact same thing, wouldn't it?

I used Pandas to extract the text so it is a dataframe and not a string so I cannot use .split() or can I?
#13
(Sep-21-2016, 09:12 PM)nilamo Wrote:
(Sep-21-2016, 09:05 PM)wavic Wrote:
from collections import Counter

counter = Counter(iterable)

print(counter['item'])
How does this syntax highlighting work?

Use the Python syntax highlighter, not the generic code one (they're still working out the plugins).
Also, wouldn't your code just always give "0"?
>>> from collections import Counter
>>> cnt = Counter('Green eggs and spam')
>>> cnt['g']
2
>>> cnt['gg']
0
>>> cnt['eggs']
0

It gets the iterable from a CSV file. So if a row is "one,two,three,one,two,three", counter.keys() will contain 'one', 'two' and 'three', as it is supposed to. The csv module splits each row into a list.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
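For anyone following along, here is a minimal sketch of the point above: csv.reader hands back each row as a list of strings, so Counter tallies whole fields rather than single characters. The one-row sample is made up for illustration; the real thread reads a country-codes file instead.

```python
import csv
import io
from collections import Counter

# Hypothetical one-row CSV standing in for the real file.
sample = io.StringIO("one,two,three,one,two,three\n")

counter = Counter()
for row in csv.reader(sample):
    # csv.reader yields each row as a list of strings,
    # so whole fields are counted, not characters.
    counter.update(row)

print(counter["one"])      # 2
print(sorted(counter))     # ['one', 'three', 'two']
print(counter["missing"])  # 0, absent keys count as zero
```

This is also why Counter('Green eggs and spam') gives 0 for 'eggs': a bare string is iterated character by character.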
#14
Quote:But, why use any sort of Counter() function at all?  len() would do the exact same thing, wouldn't it?
Because Counter is a better solution and more readable than a list comprehension with a regex inside.

Quote:I used Pandas to extract the text so it is a dataframe and not a string so I cannot use .split() or can I?
You can use ordinary Python with data that comes from Pandas.
Split the text if you want whole words.
>>> from collections import Counter
>>> cnt = Counter('Green eggs and spam spam spam spam'.split())
>>> print(cnt)
Counter({'spam': 4, 'eggs': 1, 'Green': 1, 'and': 1})
>>> print(cnt.most_common(1))
[('spam', 4)]
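A small sketch of how the two suggestions differ, in case it helps: len() on the filtered list gives one overall total, while Counter keeps a separate tally per name. The codes set and the sample text below are hypothetical stand-ins, not the data from the thread.

```python
import re
from collections import Counter

# Hypothetical stand-ins: in the thread, codes comes from the CSV column
# country_codes['English short name lower case'].
codes = {"Mexico", "Canada", "France"}
text = "Mexico borders Belize. Canada is far from Mexico, and so is France."

# Strip non-word characters from each token, then keep only known names.
cleaned = (re.sub(r"\W", "", token) for token in text.split())
matches = [word for word in cleaned if word in codes]

total = len(matches)         # how many matches overall
per_name = Counter(matches)  # how many matches for each name

print(total)                    # 4
print(per_name["Mexico"])       # 2
print(per_name.most_common(1))  # [('Mexico', 2)]
```

So len() is enough if only the overall number matters; Counter earns its keep once you want to know which names appear and how often.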
#15
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
#16
(Sep-21-2016, 09:45 PM)wavic Wrote:
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]

This gives the following error: AttributeError: 'Series' object has no attribute 'split'
#17
(Sep-21-2016, 10:11 PM)Jaynorth Wrote:
This gives the following error: AttributeError: 'Series' object has no attribute 'split'

Hmm! That snippet just splits ordinary text and removes the punctuation, so you get only the words. No Pandas here.

In [1]: import string

In [2]: text = "This gives the following error: AttributeError: 'Series' object has no attribute 'split'"

In [3]: [word.strip(string.punctuation) for word in text.split()]
Out[3]: 
['This',
 'gives',
 'the',
 'following',
 'error',
 'AttributeError',
 'Series',
 'object',
 'has',
 'no',
 'attribute',
 'split']
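One difference from the earlier re.sub(r'\W', '', word) approach may be worth a quick sketch: str.strip only trims characters from the two ends of a token, while the regex removes every non-word character, including ones in the middle. The sample tokens are made up for illustration.

```python
import re
import string

# Hypothetical tokens as they might come out of text.split().
quoted = "'don't!'"
country = "Guinea-Bissau,"

# strip() trims punctuation from both ends only.
print(quoted.strip(string.punctuation))   # don't
print(country.strip(string.punctuation))  # Guinea-Bissau

# re.sub(r"\W", "") drops every non-word character, inner ones too.
print(re.sub(r"\W", "", quoted))   # dont
print(re.sub(r"\W", "", country))  # GuineaBissau
```

For names that contain a hyphen or apostrophe, the strip version keeps the spelling that a lookup list is more likely to contain.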

#18
(Sep-21-2016, 10:32 PM)wavic Wrote:
Hmm! That snippet just splits ordinary text and removes the punctuation, so you get only the words. No Pandas here.

I just checked the Pandas documentation: to split a Pandas Series, use text.str.split(). This converts it to an object, but now I get TypeError: unhashable type: 'list' when I check it against the codes variable from the country_codes csv, because it is a list.
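One possible reading of that error, offered as a sketch rather than a diagnosis: after text.str.split() each entry is a list of words, and a whole list cannot be hashed, so testing it for membership fails. The plain lists below only imitate the split rows; rows and codes are hypothetical names and values, and no Pandas is involved.

```python
import string
from collections import Counter

# Hypothetical stand-ins: rows imitates what text.str.split() would hold
# (one list of words per row); codes imitates the country-name column.
rows = [["I", "flew", "to", "Mexico."], ["Then", "Canada,", "then", "Mexico"]]
codes = {"Mexico", "Canada"}

# A whole list cannot be looked up in a set: lists are unhashable.
try:
    rows[0] in codes
except TypeError as exc:
    error_message = str(exc)

# Flatten to individual words, clean each one, then test membership.
words = (w.strip(string.punctuation) for row in rows for w in row)
found = Counter(w for w in words if w in codes)

print(error_message)
print(found)
```

With a real Series, Series.explode() may be one way to get the same flattening before the membership test, though that is an assumption about the data rather than something confirmed in the thread.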