Python Forum
[Discussion] Re. Substring counting - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Homework (https://python-forum.io/forum-9.html)
+--- Thread: [Discussion] Re. Substring counting (/thread-7711.html)



Re. Substring counting - Mekire - Jan-22-2018

This would normally go in homework discussions, but as the user has already solved their problem there is no reason to not post publicly.

Original poster was asking how to find the number of times one string occurred as a substring in another string.
https://python-forum.io/Thread-Substring-Counting

Extend this to instead give an algorithm that finds the number of times one string occurs as a subsequence in another string.

Whereas a substring must be contiguous, a subsequence need not be, though it must still be in the correct order.

For example if we count occurrences of the subsequence "an" in "banana" we get 3.
Comprised of the following: "banana", "banana", "banana"

Algorithm should run in O(mn) time where m is the size of the string we are looking for and n is the size of the string we are looking in.

def subsequence_count(needle, haystack):
    """Your code here"""
    pass

print(subsequence_count("an", "banana")) # Should print 3
print(subsequence_count("an", "trans-Panamanian banana")) # Should print 25



RE: [Discussion] Re. Substring counting - Gribouillis - Jan-22-2018

Here is a version
import unittest as ut
import re

def subsequence_count(needle, haystack):
    if len(needle) == 1:
        return haystack.count(needle)
    elif len(needle) > 1:
        a, b = needle[0], needle[1:]
        result = 0
        for mo in re.finditer(a, haystack):
            result += subsequence_count(b, haystack[mo.end():])
        return result
    else:
        raise ValueError(('Non empty string expected, got', needle))




class TestSubsequence(ut.TestCase):
    def test_banana(self):
        n = subsequence_count('an', 'banana')
        self.assertEqual(n, 3)
        
    def test_panama(self):
        n = subsequence_count('an', 'trans-Panamanian banana')
        self.assertEqual(n, 25)

if __name__ == '__main__':
    ut.main()



RE: [Discussion] Re. Substring counting - Mekire - Jan-22-2018

That's pretty neat. Your recursive approach hurts my head.
I have trouble analyzing the runtime of recursive algorithms like this.

I want to say that it is O(n^m) but I'm not positive.
I base this on the idea that a run of size m,n takes:
T(m,n) = O(n) * T(m-1, n-1)
Not particularly confident in that though.

I did add a particularly pathological test however, and yours doesn't seem to manage it in a reasonable time
Output:
... ---------------------------------------------------------------------- Ran 3 tests in 346.731s
My approach went the dynamic programming route.

T[i, j] is the number of times needle[:i+1] occurs as a subsequence of haystack[:j+1] where T is our dynamic programming table.
This leads to the recursive problem definition:
T[i,j] = T[i,j-1] + T[i-1,j-1] if needle[i] == haystack[j] else T[i,j-1]

The algorithm ends up fairly straight forward from there just filling in the table row by row:
The pathological test is added at the bottom.
Mine results in:
Output:
... ---------------------------------------------------------------------- Ran 3 tests in 0.003s OK



RE: [Discussion] Re. Substring counting - Gribouillis - Jan-22-2018

Of course, dynamic programming is the correct solution. I improved it to use only a one dimensional array. This is a nice generalization of Pascal's triangle. Actually, Pascal's triangle implements the case where the alphabet has only one letter!
import unittest as ut

def subsequence_count(needle, haystack):
    T = [0 for x in needle]
    pairs = [(i, needle[i]) for i in range(len(needle)-1, -1, -1)]
    for source in haystack:
        for i, search in pairs:
            if source == search:
                T[i] += T[i-1] if i > 0 else 1
    return T[-1]


class TestSubsequence(ut.TestCase):
    def test_banana(self):
        n = subsequence_count('an', 'banana')
        self.assertEqual(n, 3)
        
    def test_panama(self):
        n = subsequence_count('an', 'trans-Panamanian banana')
        self.assertEqual(n, 25)

    def test_nnn(self):
        n = subsequence_count('nnn', 'nnnnn')
        self.assertEqual(n, 10)
        
    def test_assesses(self):
        needle = "assessesassessesas"
        haystack = "assessesassessesassessesassessesassessesassesses"
        n = subsequence_count(needle, haystack)
        self.assertEqual(n, 7963870)

if __name__ == '__main__':
    ut.main()
I get
.  AutoExecution
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK



RE: [Discussion] Re. Substring counting - Mekire - Jan-22-2018

Nice. I have seen problems like "infinite use coin-changeing," use a similar kind of idea to reduce to a 1d-array, but couldn't think of a way to do that here.


RE: [Discussion] Re. Substring counting - wavic - Jan-22-2018

def substr_count(substring, text):
    return len(text.split(substring)) -1
This should work too.

sub_count('an', 'banana')
2

sub_count("an", "trans-Panamanian banana")
6

sub_count("an", "antrans-Panamanian banana")
7

sub_count("an", "antrans-Panamanian bananaan")
8



RE: [Discussion] Re. Substring counting - Gribouillis - Jan-22-2018

(Jan-22-2018, 11:38 AM)wavic Wrote: This should work too.
hmmm it doesn't pass the tests... Confused


RE: [Discussion] Re. Substring counting - wavic - Jan-22-2018

Which tests?


RE: [Discussion] Re. Substring counting - Mekire - Jan-22-2018

Did you read the top post?
In this version we are looking for subsequences, not substrings.
Quote:Extend this to instead give an algorithm that finds the number of times one string occurs as a subsequence in another string.

Whereas a substring must be contiguous, a subsequence need not be, though it must still be in the correct order.

For example if we count occurrences of the subsequence "an" in "banana" we get 3.
Comprised of the following: "banana", "banana", "banana"
I gave a couple sample input/outputs in the OP as well and Grib put them in unittests for convenience.

This is the template with tests:



RE: [Discussion] Re. Substring counting - wavic - Jan-22-2018

I see it now. Hm...