5% of 666 Python repos had comma typo bugs (inc V8, TensorFlow and PyTorch) by rikatee


We found that 5% of the 666 Python open source GitHub repositories we checked had the following three comma-related bugs caused by typos:

Too few commas
Accidentally missed comma from string in list/tuple/set, resulting in unwanted string concatenation.

Implicit string concatenation that resulted from a typo can change the behaviour of the application. Take for example:

def is_positive(word):
    words = (
        'yes',
        'correct',
        'affirmative'  # missing comma: concatenated with the next string
        'agreed',
    )
    return word in words

is_positive('agreed') is True  # evaluates to False

is_positive('agreed') evaluates to False because a typo dropped the comma after 'affirmative', so 'affirmative' and 'agreed' were implicitly concatenated into 'affirmativeagreed'.

Accidentally missed comma from 1 element tuple, making it a str instead of a tuple.

As far as the Python parser is concerned, the parentheses are optional for non-empty tuples. According to the documentation, it is actually the comma which makes a tuple, not the parentheses; the parentheses are optional, except in the empty tuple case. A single string wrapped in parentheses with no trailing comma is therefore just a str.
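A minimal sketch of how this bug tends to bite in practice. The variable names here are hypothetical and chosen only for illustration; the point is that a membership test against a str silently becomes a substring test.

```python
# Hypothetical names, for illustration only.
allowed_hosts = ('localhost')   # missing comma: this is a str, not a tuple
fixed_hosts = ('localhost',)    # trailing comma: a one-element tuple

print(type(allowed_hosts).__name__)  # str
print(type(fixed_hosts).__name__)    # tuple

# "in" on a str is a substring test, so a partial match slips through.
print('local' in allowed_hosts)      # True
print('local' in fixed_hosts)        # False
```

Because both forms support "in", nothing raises an error; the code simply gives a different answer than intended.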

Too many commas
A typo that accidentally adds a comma to the end of a value, turning it into a tuple.

value = (1) # evaluates to int
value = 1, # evaluates to tuple with one element
value = (1,) # evaluates to tuple with one element
value = 1, 2 # evaluates to tuple with two elements

After accidentally suffixing a comma to value = 1, expect a TypeError when the code later attempts integer operations on a variable that is actually a tuple.
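A small sketch of that failure mode, assuming the stray comma sits on a plain assignment. The variable names are hypothetical.

```python
value = 1,  # stray trailing comma: value is the tuple (1,), not the int 1

error = None
try:
    total = value + 1  # tuple + int is not supported
except TypeError as exc:
    error = exc
    print(f"TypeError: {exc}")
```

The error surfaces where the value is used, not where the comma was typed, which is part of what makes this typo slow to track down.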

Detecting the bugs

We did not go through the repositories line by line. Instead, we ran the repositories through our suite of static analysis checks, which we run on AWS Lambda, taking a grand total of 90 seconds.

The main difficulty was reducing false positives. Syntactically there is no difference between a missing comma in a list and implicit string concatenation done on purpose. Indeed, when we first wrote the “probably missing a comma” checker we found that 95% of the problems were actually false positives: there are perfectly cromulent reasons a developer would do implicit string concatenation spanning multiple lines:

Code like SQL or HTML
User agent strings
Path-like values such as file paths and URLs
Encoded text like JSON and CSV file contents
Very long messages
SHA hashes
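To make the detection problem concrete, here is a minimal sketch of a naive "probably missing a comma" check built on the standard library tokenize module. This is not the checker described in this article; the function name and the sample source are assumptions for illustration. It flags any string literal that directly follows another string literal on an earlier line inside brackets, so it would also flag every intentional case in the list above.

```python
import io
import tokenize

OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}


def find_implicit_concatenations(source):
    """Return the line numbers of string literals that directly follow
    another string literal from an earlier line, inside brackets."""
    hits = []
    depth = 0    # current bracket nesting depth
    prev = None  # last significant token seen
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            continue  # layout tokens do not separate two literals
        if tok.type == tokenize.OP:
            if tok.string in OPENERS:
                depth += 1
            elif tok.string in CLOSERS:
                depth -= 1
        if (
            tok.type == tokenize.STRING
            and prev is not None
            and prev.type == tokenize.STRING
            and depth > 0
            and tok.start[0] > prev.end[0]
        ):
            hits.append(tok.start[0])
        prev = tok
    return hits


# Hypothetical sample mirroring the is_positive example.
SAMPLE = """words = (
    'yes',
    'correct',
    'affirmative'
    'agreed',
)
"""

print(find_implicit_concatenations(SAMPLE))  # [5]
```

The sketch works on tokens rather than the syntax tree because the parser merges adjacent string literals into a single constant, so the missing comma is no longer visible after parsing. Note that on Python 3.12 and later, f-strings are tokenized differently and would not be caught by this check.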

After checking the 666 codebases we understood the key drivers of false positives and added allowances for implicit string concatenation in those types of data. With these allowances in place, the false positive rate went down to a negligible, non-annoying level. At this point we w
