We found that 5% of the 666 open-source Python repositories on GitHub that we checked had the following three comma-related bugs caused by typos:
Too few commas
A comma accidentally omitted between strings in a list, tuple, or set, resulting in unwanted string concatenation.
Implicit string concatenation caused by a typo can change the behaviour of the application. For example:
def is_positive(word):
    words = (
        'yes',
        'correct',
        'affirmative'
        'agreed',
    )
    return word in words

is_positive('agreed') is True  # evaluates to False
is_positive('agreed') evaluates to False because a typo dropped the comma after 'affirmative', so 'affirmative' and 'agreed' were implicitly concatenated into 'affirmativeagreed'.
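A minimal sketch of what the tuple actually contains after the typo. It reuses the words tuple from the example and only inspects it:

```python
words = (
    'yes',
    'correct',
    'affirmative'  # comma missing here
    'agreed',
)

print(len(words))         # 3, not 4
print(words[2])           # affirmativeagreed
print('agreed' in words)  # False
```

Python raises no error or warning here, so the bug only surfaces when a lookup quietly fails.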
A comma accidentally omitted from a one-element tuple, making it a str instead of a tuple.
As far as the Python parser is concerned, the parentheses are optional for non-empty tuples. According to the documentation: "it is actually the comma which makes a tuple, not the parentheses. The parentheses are optional, except in the empty tuple case."
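A short illustration of how this can go wrong. The names roles_typo and roles_fixed are hypothetical, chosen only for this sketch:

```python
roles_typo = ('admin')    # parentheses only: this is the str 'admin'
roles_fixed = ('admin',)  # trailing comma: a one-element tuple

print(type(roles_typo).__name__)   # str
print(type(roles_fixed).__name__)  # tuple

# Membership on a str is a substring test, so the typo can pass silently.
print('min' in roles_typo)   # True
print('min' in roles_fixed)  # False
```

Because the in operator works on both str and tuple, the mistake raises no exception. It just gives different answers.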
Too many commas
A comma accidentally added to the end of a value, turning it into a tuple.
The same rule from the documentation applies here: the comma makes the tuple, not the parentheses, so a stray trailing comma is enough to turn a value into a tuple.
value = (1) # evaluates to int
value = 1, # evaluates to tuple with one element
value = (1,) # evaluates to tuple with one element
value = 1, 2 # evaluates to tuple with two elements
If a comma is accidentally appended to value = 1, expect a TypeError when the code later attempts integer operations on a variable that is now a tuple.
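A small sketch of that failure mode. The raised flag is a hypothetical name used only to record what happened. Note that not every operation fails: multiplication is also defined for tuples, so it succeeds with a surprising result instead of raising.

```python
value = 1,  # stray trailing comma: value is the tuple (1,)

try:
    value + 1  # tuple + int is not supported
    raised = False
except TypeError:
    raised = True

print(raised)     # True
print(value * 2)  # (1, 1): tuple repetition, not integer multiplication
```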
Detecting the bugs
We did not go through the repositories line by line. Instead we ran the repositories through our suite of static analysis checks, which we run on AWS Lambda, taking a grand total of 90 seconds.
The main difficulty was reducing false positives. Syntactically there is no difference between a missing comma in a list and implicit string concatenation done on purpose. Indeed, when we first wrote the “probably missing a comma” checker, we found that 95% of the problems were actually false positives. There are perfectly cromulent reasons a developer would do implicit string concatenation spanning multiple lines:
- Code like SQL or HTML
- User agent strings
- Path-like values such as file paths and URLs
- Encoded text like JSON and CSV file contents
- Very long messages
- SHA hashes
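To make the false-positive problem concrete, here is a rough sketch of a naive detector built on the standard library tokenize module. It is not the checker described in this post: the function name find_implicit_concatenations and both sample snippets are assumptions made for illustration. It flags any string literal that directly follows another string literal on an earlier line inside brackets. It ignores f-strings, which newer Python versions tokenize differently.

```python
import io
import tokenize

IGNORED = {tokenize.NL, tokenize.COMMENT}
OPENERS = {'(', '[', '{'}
CLOSERS = {')', ']', '}'}


def find_implicit_concatenations(source):
    """Return line numbers of string literals that directly follow
    another string literal from an earlier line, inside brackets."""
    hits = []
    depth = 0
    previous = None
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type in IGNORED:
            continue  # line breaks and comments inside brackets do not separate strings
        if token.type == tokenize.OP:
            if token.string in OPENERS:
                depth += 1
            elif token.string in CLOSERS:
                depth -= 1
        if (
            token.type == tokenize.STRING
            and previous is not None
            and previous.type == tokenize.STRING
            and depth > 0
            and token.start[0] > previous.end[0]
        ):
            hits.append(token.start[0])
        previous = token
    return hits


TYPO = """words = (
    'yes',
    'correct',
    'affirmative'
    'agreed',
)
"""

INTENTIONAL = """query = (
    'SELECT id '
    'FROM users'
)
"""

print(find_implicit_concatenations(TYPO))         # [5]
print(find_implicit_concatenations(INTENTIONAL))  # [3]
```

Both snippets are flagged, even though only the first is a bug. A token-level check like this cannot tell them apart on its own, which is why allowances for the kinds of data listed above are needed.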
After checking the 666 codebases we understood the key drivers of false positives and added allowances for implicit string concatenation in those types of data. With these allowances in place, the false positive rate went down to a negligible, non-annoying level. At this point we w