Python Forum
Algorithm for extracting comments from Python source code
#1
I have never worked with Python before, but I now have a task: there are several Python projects, and I need to extract the comments from their source code.

Comments in Python are written either with # or as string literals that are not used anywhere. The # case is clear, but strings are not so simple, because the strings that are used (for example, assigned to variables or used in expressions) must be distinguished from the unused ones.

After several experiments in an online compiler, I came up with the following algorithm.

1. IF there are no characters (except whitespace) before the opening quotes (on the line of code where these quotes are)
2. AND IF there are no characters (except whitespace) after the closing quotes (on the line where these quotes are)
3. AND IF the line is not between parentheses ()
4. AND IF the previous line of code does not end with \
then this is a comment string.

Example:
a = "This is NOT a comment!  "

b = (a 
    + 
    
    """ This is NOT a comment! """
    )

c = a + \
    """ This is NOT a comment!! """ 

'''
And this is already 
 a comment
'''
Please tell me whether this algorithm is correct. Does it need to be adjusted in some way?
#2
Have you tried opening a module and asking for help?
Output:
>>> import interactiveconsole
>>> dir()
['__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'interactiveconsole']
>>> help(interactiveconsole)
Help on module interactiveconsole:

NAME
    interactiveconsole

CLASSES
    builtins.object
        FileCacher
    code.InteractiveConsole(code.InteractiveInterpreter)
        Shell

    class FileCacher(builtins.object)
     |  Cache the stdout text so we can analyze it before returning it
     |
     |  Methods defined here:
     |
     |  __init__(self)
     |      Initialize self. See help(type(self)) for accurate signature.
     |
     |  flush(self)
     |
     |  reset(self)
     |
     |  write(self, line)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class Shell(code.InteractiveConsole)
     |  Wrapper around Python that can filter input/output to the shell
     |
     |  Method resolution order:
     |      Shell
     |      code.InteractiveConsole
     |      code.InteractiveInterpreter
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  __init__(self)
     |      Constructor.
     |
     |      The optional locals argument will be passed to the
     |      InteractiveInterpreter base class.
     |
     |      The optional filename argument should specify the (file)name
     |      of the input stream; it will show up in tracebacks.
     |
     |  get_output(self)
     |
     |  push(self, line)
     |      Push a line to the interpreter.
     |
     |      The line should not have a trailing newline; it may have
     |      internal newlines. The line is appended to a buffer and the
     |      interpreter's runsource() method is called with the
     |      concatenated contents of the buffer as source. If this
     |      indicates that the command was executed or invalid, the buffer
     |      is reset; otherwise, the command is incomplete, and the buffer
     |      is left as it was after the line was appended. The return
     |      value is 1 if more input is required, 0 if the line was dealt
     |      with in some way (this is the same as runsource()).
     |
     |  return_output(self)
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from code.InteractiveConsole:
     |
     |  interact(self, banner=None, exitmsg=None)
     |      Closely emulate the interactive Python console.
     |
     |      The optional banner argument specifies the banner to print
     |      before the first interaction; by default it prints a banner
     |      similar to the one printed by the real Python interpreter,
     |      followed by the current class name in parentheses (so as not
     |      to confuse this with the real interpreter -- since it's so
     |      close!).
     |
     |      The optional exitmsg argument specifies the exit message
     |      printed when exiting. Pass the empty string to suppress
     |      printing an exit message. If exitmsg is not given or None,
     |      a default message is printed.
     |
     |  raw_input(self, prompt='')
     |      Write a prompt and read a line.
     |
     |      The returned line does not include the trailing newline.
     |      When the user enters the EOF key sequence, EOFError is raised.
     |
     |      The base implementation uses the built-in function
     |      input(); a subclass may replace this with a different
     |      implementation.
     |
     |  resetbuffer(self)
     |      Reset the input buffer.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from code.InteractiveInterpreter:
     |
     |  runcode(self, code)
     |      Execute a code object.
     |
     |      When an exception occurs, self.showtraceback() is called to
     |      display a traceback. All exceptions are caught except
     |      SystemExit, which is reraised.
     |
     |      A note about KeyboardInterrupt: this exception may occur
     |      elsewhere in this code, and may not always be caught. The
     |      caller should be prepared to deal with it.
     |
     |  runsource(self, source, filename='<input>', symbol='single')
     |      Compile and run some source in the interpreter.
     |
     |      Arguments are as for compile_command().
     |
     |      One of several things can happen:
     |
     |      1) The input is incorrect; compile_command() raised an
     |      exception (SyntaxError or OverflowError). A syntax traceback
     |      will be printed by calling the showsyntaxerror() method.
     |
     |      2) The input is incomplete, and more input is required;
     |      compile_command() returned None. Nothing happens.
     |
     |      3) The input is complete; compile_command() returned a code
     |      object. The code is executed by calling self.runcode() (which
     |      also handles run-time exceptions, except for SystemExit).
     |
     |      The return value is True in case 2, False in the other cases (unless
     |      an exception is raised). The return value can be used to
     |      decide whether to use sys.ps1 or sys.ps2 to prompt the next
     |      line.
     |
     |  showsyntaxerror(self, filename=None)
     |      Display the syntax error that just occurred.
     |
     |      This doesn't display a stack trace because there isn't one.
     |
     |      If a filename is given, it is stuffed in the exception instead
     |      of what was there before (because Python's parser always uses
     |      "<string>" when reading from a string).
     |
     |      The output is written by self.write(), below.
     |
     |  showtraceback(self)
     |      Display the exception that just occurred.
     |
     |      We remove the first stack item because it is our own code.
     |
     |      The output is written by self.write(), below.
     |
     |  write(self, data)
     |      Write a string.
     |
     |      The base implementation writes to sys.stderr; a subclass may
     |      replace this with a different implementation.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from code.InteractiveInterpreter:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

FILE
    ...\interactiveconsole.py
If you just want to collect all the comments in a file, I suggest you look for a tool that does that instead of writing your own.
#3
@deanhystad

Thank you for your suggestion, but that is not an option for me. I need to do it myself, so I need the algorithm.
#4
(Feb-28-2024, 06:59 PM)Pavel1982 Wrote: I need to do it myself, so I need the algorithm.
I would use the Python parsing tools that are available in the standard library, namely the tokenize() function and the ast module.
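For reference, a minimal sketch of that standard-library approach might look like the following. The helper names extract_hash_comments and extract_bare_strings and the SAMPLE text are illustrative assumptions, not part of any library. Note that a string statement found this way can also be a docstring, which Python does keep at run time, so a real tool would probably want to treat those separately.

```python
import ast
import io
import tokenize


def extract_hash_comments(source):
    """Return (line, text) for every # comment reported by the tokenizer."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.start[0], tok.string)
            for tok in tokens if tok.type == tokenize.COMMENT]


def extract_bare_strings(source):
    """Return (line, value) for string literals that form a whole statement."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Expr)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            found.append((node.lineno, node.value.value))
    return sorted(found)


# Hypothetical sample, loosely based on the example in the first post.
SAMPLE = "\n".join([
    'a = "This is NOT a comment!"  # trailing note',
    'b = (a',
    '    +',
    '    """ This is NOT a comment! """',
    '    )',
    "'''",
    "And this is already",
    " a comment",
    "'''",
    "",
])

if __name__ == "__main__":
    print(extract_hash_comments(SAMPLE))
    print(extract_bare_strings(SAMPLE))
```

The tokenizer handles the # comments, and the ast pass only reports strings that stand alone as a statement, so strings inside brackets or after a line continuation are never reported by it.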
« We can solve any problem by introducing an extra level of indirection »
#5
@Gribouillis Unfortunately, I have to do it in C++, as our company works only with C++.
#6
Officially, only the # ... comments exist in Python, but you can't prevent people from abusing the syntax by cluttering the code with strings. This can be done in many strange ways, for example:
((
'''Here is a new strange
comment included in parenthesis'''),

"""It really does nothing useful, so
it's probably a comment, but it is impossible to guess"""
)
Also note that there are other paired delimiters in Python, namely { } and [ ].
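As a rough illustration of why such cases are hard to catch with line-based rules, here is a small sketch that uses the ast module to flag any statement that builds nothing but string constants and then discards them. The function names and the SAMPLE text are hypothetical, and whether such a statement is "really" a comment remains a guess.

```python
import ast


def only_string_constants(node):
    """True if node is a str constant or a tuple/list/set made solely of them."""
    if isinstance(node, ast.Constant):
        return isinstance(node.value, str)
    if isinstance(node, (ast.Tuple, ast.List, ast.Set)):
        return bool(node.elts) and all(only_string_constants(e) for e in node.elts)
    return False


def unused_string_statements(source):
    """Return the starting lines of statements that only build and discard strings."""
    tree = ast.parse(source)
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.Expr) and only_string_constants(node.value)]


# The parenthesized example from this post, rebuilt as a string for testing.
SAMPLE = "\n".join([
    "((",
    "'''Here is a new strange",
    "comment included in parenthesis'''),",
    "",
    '"""It really does nothing useful, so',
    'it\'s probably a comment, but it is impossible to guess"""',
    ")",
])

if __name__ == "__main__":
    print(unused_string_statements(SAMPLE))
```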
#7
@Gribouillis

The example you provided is in parentheses, so it will not be considered a comment. Of course, people can come up with all sorts of strange variations, but these cases are very rare and I can ignore them. I need a general algorithm for the common case.

Thank you for your suggestion. I edited rule 3 to cover such cases:
my_dict = {
    "key1": "not comment",
    "key2": 
        "Not comment"
}

my_list = [
    "Not comment",
    "Not comment"
]
1. IF there are no characters (except whitespace) before the opening quotes (on the line of code where these quotes are)
2. AND IF there are no characters (except whitespace) after the closing quotes (on the line where these quotes are)
3. AND IF the line is not between ( ) or [ ] or { }
4. AND IF the previous line of code does not end with \
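The four rules above can be sketched as a simple character scanner. This is only one possible reading of the rules, written in Python for brevity; it uses nothing but index arithmetic, so it should translate to C++ fairly directly. All names here (find_comment_strings, _string_end, _is_standalone, SAMPLE) are made up for illustration.

```python
def _string_end(source, pos, quote):
    """Return the index just past the closing quote, or len(source) if unterminated."""
    while pos < len(source):
        if source[pos] == "\\":
            pos += 2                      # skip the escaped character
        elif source.startswith(quote, pos):
            return pos + len(quote)
        else:
            pos += 1
    return len(source)


def _is_standalone(source, start, end):
    """Check rules 1, 2 and 4 for the string occupying source[start:end]."""
    line_start = source.rfind("\n", 0, start) + 1
    line_end = source.find("\n", end)
    if line_end == -1:
        line_end = len(source)
    before = source[line_start:start]
    after = source[end:line_end]
    prev_line = source[:max(line_start - 1, 0)].rsplit("\n", 1)[-1]
    return (before.strip() == ""                        # rule 1
            and after.strip() == ""                     # rule 2
            and not prev_line.rstrip().endswith("\\"))  # rule 4


def find_comment_strings(source):
    """Return (line, text) for every string literal that satisfies the four rules."""
    results = []
    depth = 0                             # nesting level of (), [] and {}
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch == "#":                     # real comment: jump to the end of the line
            end = source.find("\n", pos)
            pos = len(source) if end == -1 else end
        elif ch in "([{":
            depth += 1
            pos += 1
        elif ch in ")]}":
            depth = max(depth - 1, 0)
            pos += 1
        elif ch in "\"'":
            triple = source[pos:pos + 3]
            quote = triple if triple in ('"""', "'''") else ch
            end = _string_end(source, pos + len(quote), quote)
            if depth == 0 and _is_standalone(source, pos, end):   # rule 3 is depth == 0
                results.append((source.count("\n", 0, pos) + 1, source[pos:end]))
            pos = end
        else:
            pos += 1
    return results


# Sample assembled from the examples posted in this thread.
SAMPLE = "\n".join([
    'a = "This is NOT a comment!  "',
    '',
    'b = (a',
    '    +',
    '',
    '    """ This is NOT a comment! """',
    '    )',
    '',
    'c = a + \\',
    '    """ This is NOT a comment!! """',
    '',
    'my_dict = {',
    '    "key1": "not comment",',
    '    "key2":',
    '        "Not comment"',
    '}',
    '',
    "'''",
    "And this is already",
    " a comment",
    "'''",
])

if __name__ == "__main__":
    for line, text in find_comment_strings(SAMPLE):
        print(line, repr(text))
```

Taken literally, the rules have a few gaps that this sketch shares: a prefixed string such as r'''...''' fails rule 1 because of the prefix character, a string followed by a # comment on the same line fails rule 2, and docstrings are reported even though Python stores them in __doc__.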