Finding Comments in Source Code Using Regular Expressions


Many text editors have advanced find (and replace) features. When I’m programming, I like to use an editor with regular expression search and replace. This feature allows one to find text based on complex patterns rather than just literals. On occasion I want to examine each of the comments in my source code and either edit or remove them. I found that it was difficult to write a regular expression that would find C style comments (the comments that start with /* and end with */) because my text editor does not implement the “non-greedy matching” feature of regular expressions.

First Try

When first attempting this problem, most people consider the regular expression:

/\*.*\*/

This seems the natural way to do it. /\* finds the start of the comment (note that the literal * needs to be escaped because * has a special meaning in regular expressions), .* finds any number of any character, and \*/ finds the end of the comment.

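The first problem with this approach is that .* does not match new lines, so a comment that spans several lines is never found. In the example below, only the second comment matches: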

/* First comment 
 first comment—line two*/
/* Second comment */
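
To see this outside an editor, here is a minimal sketch that assumes Python's re module as a stand-in for the editor's regular expression engine (the variable names are only illustrative). By default, . in Python also stops at a new line.

```python
import re

# Two comments: the first spans two lines, the second sits on one line.
source = "/* First comment \n first comment, line two*/\n/* Second comment */\n"

first_try = re.compile(r"/\*.*\*/")

# . does not cross a line break, so the two-line comment is missed.
matches = first_try.findall(source)
print(matches)
```

Only the single-line comment is reported.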

Second Try

This can be overcome easily by replacing the . with [^] (in some regular expression packages) or more generally with (.|[\r\n]):

/\*(.|[\r\n])*\*/

This reveals a second, more serious problem: the expression matches too much. Regular expressions are greedy; they take in as much as they can. Consider the case in which your file has two comments. This regular expression will match them both, along with anything in between:

start_code();
/* First comment */
more_code(); 
/* Second comment */
end_code();
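
The over-matching is easy to reproduce. The sketch below again assumes Python's re module in place of the editor's engine; the sample text is the one above.

```python
import re

source = (
    "start_code();\n"
    "/* First comment */\n"
    "more_code();\n"
    "/* Second comment */\n"
    "end_code();\n"
)

second_try = re.compile(r"/\*(.|[\r\n])*\*/")

# The greedy * runs to the last */ in the file, swallowing more_code().
match = second_try.search(source)
print(repr(match.group(0)))
```

The single match starts at the first /* and ends at the last */, so the code between the two comments is included.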

Third Try

To fix this, the regular expression must accept less. We cannot accept just any character with a . and instead need to limit the types of characters that can appear inside the comment:

/\*([^*]|[\r\n])*\*/

This simplistic approach doesn’t accept any comments with a * in them.

/*
 * Common multi-line comment style.
 */
/* Second comment */
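
A short check of this limitation, assuming Python's re module as the engine (names are illustrative):

```python
import re

source = (
    "/*\n"
    " * Common multi-line comment style.\n"
    " */\n"
    "/* Second comment */\n"
)

third_try = re.compile(r"/\*([^*]|[\r\n])*\*/")

# The leading * on the second line stops the first comment from matching.
matches = [m.group(0) for m in third_try.finditer(source)]
print(matches)
```

The first comment is skipped entirely because of the * that begins its second line.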

Fourth Try

This is where it gets tricky. How do we accept a * without accepting the * that is part of the end comment? The solution is to still accept any character that is not *, but also accept a * and anything that follows it provided that it isn’t followed by a /:

/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/

This works better but again accepts too much in some cases. Because each * is consumed together with the character that follows it, a run of * is swallowed two at a time, and the second * of a pair may be the very * that is supposed to end the comment.

start_code();
/****
 * Common multi-line comment style.
 ****/
more_code(); 
/*
 * Another common multi-line comment style.
 */
end_code();
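
The following sketch, assuming Python's re module as the engine, shows the fourth try running straight through the end of the first comment:

```python
import re

source = (
    "start_code();\n"
    "/****\n"
    " * Common multi-line comment style.\n"
    " ****/\n"
    "more_code();\n"
    "/*\n"
    " * Another common multi-line comment style.\n"
    " */\n"
    "end_code();\n"
)

fourth_try = re.compile(r"/\*([^*]|[\r\n]|(\*([^/]|[\r\n])))*\*/")

# The four closing stars are eaten in pairs, so the / after them is treated
# as ordinary comment text and the match continues into the next comment.
matches = [m.group(0) for m in fourth_try.finditer(source)]
print(len(matches))
print("more_code();" in matches[0])
```

Both comments, and the code between them, come back as one match.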

Fifth Try

What we tried before will work if we accept any number of * followed by anything other than a * or a /:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*/

Now the regular expression does not accept enough again. It’s working better than ever, but it still leaves one case. It does not accept comments that end in multiple *.

/****
 * Common multi-line comment style.
 ****/
/****
 * Another common multi-line comment style.
 */
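
Here is the same kind of check for the fifth try, again assuming Python's re module as the engine:

```python
import re

source = (
    "/****\n"
    " * Common multi-line comment style.\n"
    " ****/\n"
    "/****\n"
    " * Another common multi-line comment style.\n"
    " */\n"
)

fifth_try = re.compile(r"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*/")

# A comment closed by ****/ cannot match: \*+ may not be followed by /,
# and the final \*/ allows exactly one star.
matches = [m.group(0) for m in fifth_try.finditer(source)]
print(matches)
```

Only the comment that ends in a single * is found.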

Solution

Now we just need to modify the comment end to allow any number of *:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

We now have a regular expression that we can paste into text editors that support regular expressions. Finding our comments is a matter of pressing the find button. You might be able to simplify this expression somewhat for your particular editor. For example, in some regular expression implementations a negated character class such as [^*] already matches \r and \n, so every [\r\n] can be removed from the expression.
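
As a sanity check, the sketch below runs the final expression over the sample that broke the fourth try. It assumes Python's re module as the engine; the names are illustrative.

```python
import re

source = (
    "start_code();\n"
    "/****\n"
    " * Common multi-line comment style.\n"
    " ****/\n"
    "more_code();\n"
    "/*\n"
    " * Another common multi-line comment style.\n"
    " */\n"
    "end_code();\n"
)

solution = re.compile(r"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/")

# Each comment is matched on its own; the code between them is left alone.
matches = [m.group(0) for m in solution.finditer(source)]
for comment in matches:
    print(repr(comment))
```

Each comment is found separately, whether it ends in one * or several.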

This is easy to augment so that it will also find // style comments:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
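
One use for the combined expression is stripping comments from a string. This is only a sketch, assuming Python's re module, and it shares the limitations described under Caveats (it knows nothing about string literals).

```python
import re

source = (
    "int x = 1; // set x\n"
    "/* block\n"
    " * comment */\n"
    "int y = 2;\n"
)

both_styles = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

# Collect every comment, then delete them all from the text.
found = [m.group(0) for m in both_styles.finditer(source)]
stripped = both_styles.sub("", source)
print(found)
print(repr(stripped))
```

Note that the white space around the removed comments is left in place.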

Tool: nedit
Expression: (/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
Usage: Ctrl+F to find, put in expression, check the Regular Expression check box.
Notes: [^] does not include new line.

Tool: grep
Expression: (/\*([^*]|(\*+[^*/]))*\*+/)|(//.*)
Usage: grep -E "(/\*([^*]|(\*+[^*/]))*\*+/)|(//.*)" <files>
Notes: Does not support multi-line comments; will print out each line that completely contains a comment.

Tool: perl
Expression: /((?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*))/
Usage: perl -e "$/=undef;print<>=~/((?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:\/\/.*))/g;" < <file>
Notes: Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. Each / must be escaped because it delimits the expression. $/=undef; is used so that the file is not matched line by line like grep.

Tool: Java
Expression: "(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)"
Usage: System.out.println(sourcecode.replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)",""));
Notes: Prints out the contents of the string sourcecode with the comments removed. The (?: notation must be used for non-capturing parentheses. Each \ must be escaped in a Java String.

An Easier Method

Non-greedy Matching

Most regular expression packages support non-greedy matching. A non-greedy quantifier matches as little as possible and only takes in another character when the rest of the pattern cannot match otherwise. We can modify our second try to use the non-greedy matcher *? instead of the greedy matcher *. With this new tool, the middle of our comment stops growing as soon as the end of the comment can be matched:

/\*(.|[\r\n])*?\*/
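
A sketch of the non-greedy version, assuming Python's re module as the engine. The second pattern uses re.DOTALL, a Python-specific flag that lets . match new lines; other engines have their own equivalent or none at all.

```python
import re

source = (
    "start_code();\n"
    "/****\n"
    " * Common multi-line comment style.\n"
    " ****/\n"
    "more_code();\n"
    "/*\n"
    " * Another common multi-line comment style.\n"
    " */\n"
    "end_code();\n"
)

# The expression from the article, unchanged.
lazy = re.compile(r"/\*(.|[\r\n])*?\*/")
lazy_matches = [m.group(0) for m in lazy.finditer(source)]

# Python-only shorthand: with DOTALL, . already matches \r and \n.
dotall = re.compile(r"/\*.*?\*/", re.DOTALL)
dotall_matches = dotall.findall(source)

print(lazy_matches == dotall_matches)
print(len(lazy_matches))
```

Both patterns stop at the first */ they reach, so each comment is matched separately.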

Tool: nedit
Expression: /\*(.|[\r\n])*?\*/
Usage: Ctrl+F to find, put in expression, check the Regular Expression check box.
Notes: [^] does not include new line.

Tool: grep
Expression: /\*.*?\*/
Usage: grep -P '/\*.*?\*/' <file>
Notes: Non-greedy quantifiers are a Perl extension that grep -E does not implement, so this needs -P (GNU grep built with PCRE support). Does not support multi-line comments; will print out each line that completely contains a comment.

Tool: perl
Expression: /\*(?:.|[\r\n])*?\*/
Usage: perl -0777ne 'print m!/\*(?:.|[\r\n])*?\*/!g;' <file>
Notes: Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. / does not have to be escaped because ! delimits the expression. -0777 is used to enable slurp mode and -n enables automatic reading.

Tool: Java
Expression: "/\\*(?:.|[\\n\\r])*?\\*/"
Usage: System.out.println(sourcecode.replaceAll("/\\*(?:.|[\\n\\r])*?\\*/",""));
Notes: Prints out the contents of the string sourcecode with the comments removed. The (?: notation must be used for non-capturing parentheses. Each \ must be escaped in a Java String.

Caveats

Comments Inside Other Elements

Although our regular expression describes C-style comments very well, there are still problems when something appears to be a comment but is actually part of a larger element.

someString = "An example comment: /* example */";

// The comment around this code has been commented out.
// /*
some_code();
// */

The solution to this is to write regular expressions that describe each of the possible larger elements, find these as well, decide what type of element each is, and discard the ones that are not comments. There are tools called lexers or tokenizers that can help with this task. A lexer accepts regular expressions as input, scans a stream, picks out tokens that match the regular expressions, and classifies the token based on which expression it matched. The greedy property of regular expressions is used to ensure the longest match. Although writing a full lexer for C is beyond the scope of this document, those interested should look at lexer generators such as Flex and JFlex.
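
The idea can be sketched in a few lines. The following is not a C lexer, only a toy that assumes Python's re module and recognizes four kinds of element: string literals, character literals, block comments and line comments. The names TOKEN and find_comments are made up for this illustration. It ignores line continuations, preprocessor directives and everything else a real lexer must handle.

```python
import re

# One alternative per element type; the group name records which one matched.
TOKEN = re.compile(
    r'(?P<string>"(?:\\.|[^"\\\n])*")'       # "..." with backslash escapes
    r"|(?P<char>'(?:\\.|[^'\\\n])*')"        # '...' with backslash escapes
    r"|(?P<block>/\*(?:[^*]|\*+[^*/])*\*+/)" # /* ... */
    r"|(?P<line>//[^\n]*)"                   # // to end of line
)

def find_comments(source):
    """Scan left to right and keep only the elements classified as comments."""
    return [
        m.group(0)
        for m in TOKEN.finditer(source)
        if m.lastgroup in ("block", "line")
    ]

source = (
    'someString = "An example comment: /* example */";\n'
    "// /*\n"
    "some_code();\n"
    "// */\n"
    "/* real */\n"
)

comments = find_comments(source)
print(comments)
```

Because the string literal is matched as a whole before the scanner ever looks inside it, the /* example */ within it is never reported, and the /* hidden inside the first line comment does not start a block comment.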


6 thoughts on “Finding Comments in Source Code Using Regular Expressions”

  • Yandot

    Hi Stephen. Good work! very useful. However it seems the perl expressions (both the solution one and the non-greedy version) do not render properly the end of lines. For example with this code: const uint8_t* cmd_descriptor_block,/**<[in] */ int cmd_descriptor_block_size, /**[in] */ uint8_t* storage_for_ctx, /**<[in] */ int storage_max_size, /**<[in] */ uint8_t expected_data_xfer_len, /**<[in] */ struct context** command_struct, /**<[out] It uses storage_for_ctx memory. */ enum types* i_type /**<[out] */ I get this: /**<[in] *//**[in] *//**<[in] *//**<[in] *//**<[in] *//**<[out] It uses storage_for_ctx memory. *//**<[out] */ Cheers!

  • Trevor Sundberg

    I’ve stumbled across this a few times, and I always find it incredibly useful. This is one of those cases where writing out the DFA that the regular expression builds is actually much easier to understand and to handle the edge cases (like multiple stars at the end). Thank you!

  • Roman

    Thanks a lot for this post! I could not go on the first try. Now it works fine with r'(\/\*(.|[\r\n])*?\*\/)' in Python.

  • Jim Fennell

    This is an excellent article, explaining how you arrived at the end-result regex extremely well. I’ve been trying to build something that will remove comments from JavaScript. The one thing I have found so far that this solution doesn’t completely handle is related to the part about finding // style comments. While simply adding “|(//.*)” to the end of the regex does pick up comments that start with “//” and proceed to the end of the line, it is too aggressive in that it also picks up places where “//” appears inside a string, such as “src=’http://www.mysrc.com’ or “newUrl = ‘//’ + $(HTTP_HOST) + $(REQUEST_PATH), etc. I suspect it would be an equal challenge to find ONLY those occurrences of “//” that DON’T appear inside a quoted string… The battle rages on! Thanks for getting me this far! If I get a solution, I’ll post in the comments. If anyone else gets one, PLEASE do the same!

    • Stephen Post author

      As I state in the article, you need a lexer. Using regular expression search and replace isn’t going to cut it because String literals can contain the characters that start comments without actually starting a comment.

  • G

    Although we all know the famous Brendan Eich quote: “when I was told to “make it look like Java”, I made it look like C!” (abolishing common regurgitated ‘religion’), it should be noted that ES262 (umbrella-term for all its dialects: javascript (not JavaScript, the (SUN, now Oracle) trademark licensed to Mozilla’s engine named JavaScript (example: JavaScript 1.5 is Mozilla’s engine implementing ES3))), comments are only visually on the surface the same as C-style comments!

    The difference?

    Well in C-style, there is just Line Comment (// …[excluding EOL]) and Block Comment (/* */) and most C Lexers replace both with (the equivalent of) one space-char.

    In javascript however, we have ASI (automatic semi-colon insertion).. and to accommodate that, there are 3 types of comments.

    The Line Comment (//…[excluding EOL]) (which could be replaced with a space or nothing as it excludes that line’s EOL) while the Block Comment then has TWO types: the SINGLE LINE BLOCK comment (replaceable/equivalent to space-char but NO EOL!) and the MULTI-LINE BLOCK comment (replaceable/equivalent to EOL!)

    Replacing a multi-line block comment with a whitespace (and not an EOL) WILL screw up ASI, aka break working tested code! Should one want to find comments with the intention of ‘removing’ them, they should pay attention to this rule above. It’s trivial to do so: just check if there is at least one EOL in a block-comment!

    More interesting is the problem that the exercise of finding comments in javascript is really centered around correctly identifying regular expression literals (and string/template-literals)! Yet the vast majority of people concentrate stubbornly on matching the comments (after all, that’s what they’re after and they seem so … ‘regular’…).

    Here’s the thing to grasp: just because the tokens themselves (like a comment-token) usually can be described by a regular expression, doesn’t necessarily mean that one can describe the source-text (in which these tokens appear) in a regular way!

    One can not uniformly say that ‘it’ can or can not be done with regex, after all, there are many different types (and versions) of regular expression engines implemented on different languages/hosts/platforms etc. And some programmers assume PCRE by default when talking regex, etc.. For example, a regex engine with a (limited) stack can potentially solve the matching double/semi colon problem (string literals containing odd/even number of escaped quotes) and some other lexer-like problems. BUT, that’s not simply going to solve the lexer-state problem!

    SADLY javascript source code can not be lexed by a lexer/tokenizer alone, it needs help from a parser (or be a hybrid of lexer-parser all-in-one) to know the lexical context in the source! Without it, one can not know if slash (/) means division or regular expression literal.

    In the past ~15 years, a lot of people copied the Crockford JSMin errors and believed one could just backtrack to the previous token, but that’s obviously wrong, very wrong! Imagine (simplified to meaningless examples, just crank them up in your mind to something realistic and more importantly a lot more expressive/difficult):

    (a+b)/42/i   //division right?
    if(a+b)/42/i.exec(str); //Or.. a regex? And yes, the exec has functional 'side effects'

    what about:

    ({ valueOf: function(){return 100} }/42/i) 

    vs

    {console.log('a');} /42/i 

    And ponder about the different meanings such code gets should ASI kick in! Good luck regex engine WITH state..

    If only regex literals were delimited with an @ or an # (like: @expression@flags or #expression#flags) then javascript source could be lexable with your most basic naive state machine, roughly fifteen lines of code could (pre-)lex the source (even raw character stream) into: regex-literal, string literal, template literal all the same routine (end-char = unescaped open-char) and 2 routines for the comments. In most cases/environments this simple state machine would outperform a regex implementing the same (and being ultimately a state machine itself). Everything not matched (the chars in-between) are then already free of regex/string/template literals and comments (and for example a naive old school syntax highlighter could then trivially simple match numerical/boolean/null literals and keywords). That simple and it would be guaranteed 100% reliable. And everyone would have learned to code such a similar snippet in the first week of programming class. After all, it’s nothing but a loop and a switch.. But.. we live in the real world. Besides the fact that @ and # are (by now) in serious proposal to be used as decorator(@) and private fields(#), that ship had already sailed once javascript was in actual use on the web.

    I hope this explains WHY it’s so hard/unrealistic/impossible (depending on regex engine) and that the problem lies much deeper than a simple regex engine matching quote problem; it’s the javascript grammar itself.

    Humoristically, regex IS the problem (the regex literal’s delimiter char to be precise). More fun: since it is already such a problem.. it was decided that in ES>3 one can have un-escaped slasheS in regex characterclasses as well.. thus the following is a perfectly valid regex: /[///** ]/g.exec('*/')

    And then we haven’t even taken all the extensions in the latest ES262 editions (>ES5.1) like arrow functions (that appear to require look-ahead in the lexer/parser as well) etc..

    PS: Extra note, lexer generators commonly accept their token-definitions as regular expressions (explained above), that however does not necessarily mean they actually use regular expressions (at least not all) in the resulting lexer code they generate.