Python Forum
[Discussion] Re. Substring counting
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Discussion] Re. Substring counting
#1
This would normally go in homework discussions, but as the user has already solved their problem there is no reason to not post publicly.

Original poster was asking how to find the number of times one string occurred as a substring in another string.
https://python-forum.io/Thread-Substring-Counting

Extend this to instead give an algorithm that finds the number of times one string occurs as a subsequence in another string.

Whereas a substring must be contiguous, a subsequence need not be, though it must still be in the correct order.

For example if we count occurrences of the subsequence "an" in "banana" we get 3.
Comprised of the following: "banana", "banana", "banana"

Algorithm should run in O(mn) time where m is the size of the string we are looking for and n is the size of the string we are looking in.

def subsequence_count(needle, haystack):
    """Your code here"""
    pass

print(subsequence_count("an", "banana")) # Should print 3
print(subsequence_count("an", "trans-Panamanian banana")) # Should print 25
Reply
#2
Here is a version
import unittest as ut
import re

def subsequence_count(needle, haystack):
    if len(needle) == 1:
        return haystack.count(needle)
    elif len(needle) > 1:
        a, b = needle[0], needle[1:]
        result = 0
        for mo in re.finditer(a, haystack):
            result += subsequence_count(b, haystack[mo.end():])
        return result
    else:
        raise ValueError(('Non empty string expected, got', needle))




class TestSubsequence(ut.TestCase):
    def test_banana(self):
        n = subsequence_count('an', 'banana')
        self.assertEqual(n, 3)
        
    def test_panama(self):
        n = subsequence_count('an', 'trans-Panamanian banana')
        self.assertEqual(n, 25)

if __name__ == '__main__':
    ut.main()
Reply
#3
That's pretty neat. Your recursive approach hurts my head.
I have trouble analyzing the runtime of recursive algorithms like this.

I want to say that it is O(n^m) but I'm not positive.
I base this on the idea that a run of size m,n takes:
T(m,n) = O(n) * T(m-1, n-1)
Not particularly confident in that though.

I did add a particularly pathological test however, and yours doesn't seem to manage it in a reasonable time
Output:
... ---------------------------------------------------------------------- Ran 3 tests in 346.731s
My approach went the dynamic programming route.

T[i, j] is the number of times needle[:i+1] occurs as a subsequence of haystack[:j+1] where T is our dynamic programming table.
This leads to the recursive problem definition:
T[i,j] = T[i,j-1] + T[i-1,j-1] if needle[i] == haystack[j] else T[i,j-1]

The algorithm ends up fairly straight forward from there just filling in the table row by row:
The pathological test is added at the bottom.
Mine results in:
Output:
... ---------------------------------------------------------------------- Ran 3 tests in 0.003s OK
Reply
#4
Of course, dynamic programming is the correct solution. I improved it to use only a one dimensional array. This is a nice generalization of Pascal's triangle. Actually, Pascal's triangle implements the case where the alphabet has only one letter!
import unittest as ut

def subsequence_count(needle, haystack):
    T = [0 for x in needle]
    pairs = [(i, needle[i]) for i in range(len(needle)-1, -1, -1)]
    for source in haystack:
        for i, search in pairs:
            if source == search:
                T[i] += T[i-1] if i > 0 else 1
    return T[-1]


class TestSubsequence(ut.TestCase):
    def test_banana(self):
        n = subsequence_count('an', 'banana')
        self.assertEqual(n, 3)
        
    def test_panama(self):
        n = subsequence_count('an', 'trans-Panamanian banana')
        self.assertEqual(n, 25)

    def test_nnn(self):
        n = subsequence_count('nnn', 'nnnnn')
        self.assertEqual(n, 10)
        
    def test_assesses(self):
        needle = "assessesassessesas"
        haystack = "assessesassessesassessesassessesassessesassesses"
        n = subsequence_count(needle, haystack)
        self.assertEqual(n, 7963870)

if __name__ == '__main__':
    ut.main()
I get
.  AutoExecution
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK
Reply
#5
Nice. I have seen problems like "infinite use coin-changeing," use a similar kind of idea to reduce to a 1d-array, but couldn't think of a way to do that here.
Reply
#6
def substr_count(substring, text):
    return len(text.split(substring)) -1
This should work too.

sub_count('an', 'banana')
2

sub_count("an", "trans-Panamanian banana")
6

sub_count("an", "antrans-Panamanian banana")
7

sub_count("an", "antrans-Panamanian bananaan")
8
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#7
(Jan-22-2018, 11:38 AM)wavic Wrote: This should work too.
hmmm it doesn't pass the tests... Confused
Reply
#8
Which tests?
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#9
Did you read the top post?
In this version we are looking for subsequences, not substrings.
Quote:Extend this to instead give an algorithm that finds the number of times one string occurs as a subsequence in another string.

Whereas a substring must be contiguous, a subsequence need not be, though it must still be in the correct order.

For example if we count occurrences of the subsequence "an" in "banana" we get 3.
Comprised of the following: "banana", "banana", "banana"
I gave a couple sample input/outputs in the OP as well and Grib put them in unittests for convenience.

This is the template with tests:
Reply
#10
I see it now. Hm...
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Finding how many times substring is in a string using re module ranbarr 4 2,896 May-21-2021, 06:14 PM
Last Post: nilamo
  Substring in a string Propoganda 1 2,198 Dec-01-2019, 08:45 AM
Last Post: perfringo
  Find a substring in a dictonary; value substring of key aapurdel 2 6,992 May-31-2019, 06:14 PM
Last Post: ichabod801

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020