Validation and Regex Compatibility Issues in python-string-utils
Summary
Several validation functions appear to reject inputs that are commonly accepted in standard formats. Property-based and targeted tests reveal at least three distinct issues:
- Scientific notation support in is_number/is_decimal is incomplete (rejects uppercase E and signed exponents such as e-3)
- URL port parsing is overly restrictive (rejects single-digit ports)
- Pangram detection is case-sensitive (fails uppercase-only pangrams)
Affected Code
- Scientific notation regex: string_utils/_regex.py:7
- URL regex port segment: string_utils/_regex.py:14
- Pangram implementation: string_utils/validation.py:510-514
- Number validation: string_utils/validation.py:135-138
- Decimal validation: string_utils/validation.py:159-172
Environment
- OS: Windows
- Python: 3.10
- Command: python run_added_tests.py
Expected Behavior
is_number and is_decimal should accept scientific notation using both lowercase and uppercase E, and should allow signed exponents (e.g., 1e-3, 1E+3, 1.5e-3).
is_url should accept 1–5 digit port numbers (standard range 0–65535), e.g., http://localhost:8.
is_pangram should be case-insensitive (uppercase-only pangrams should pass).
Actual Behavior
- Scientific notation:
  is_number('1E3') → False
  is_number('1e-3') → False
  is_decimal('1.5e-3') → False
- URL port:
  is_url('http://localhost:8') → False
  is_url('http://127.0.0.1:7') → False
- Pangram:
  is_pangram('THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG') → False
Code References
- Scientific notation regex, currently at string_utils/_regex.py:7:
  NUMBER_RE = re.compile(r'^([+\-]?)((\d+)(\.\d+)?(e\d+)?|\.\d+)$')
  - Limitations:
    - Only lowercase e is accepted
    - The exponent requires digits only (e\d+), with no optional sign
- URL port segment, string_utils/_regex.py:14:
  (:\d{2,})? requires at least two digits for the port
- Pangram, string_utils/validation.py:510-514:
  - Compares the set of input characters to string.ascii_lowercase without case normalization
Suggested Fixes
is_number / is_decimal:
- Update NUMBER_RE to accept both e and E, and an optional sign before the exponent digits, e.g. with a pattern segment like [eE][+\-]?\d+
- Ensure downstream checks (is_integer, is_decimal) keep consistent semantics when scientific notation is used.
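
The exponent change can be checked in isolation with a short script. This is a minimal sketch, not a patch: CURRENT_NUMBER_RE is the pattern quoted in this report, while PROPOSED_NUMBER_RE is a hypothetical name for a variant in which only the exponent group is changed. How is_integer and is_decimal should classify values such as 1e-3 is left open here.

```python
import re

# Pattern quoted in this report (string_utils/_regex.py:7).
CURRENT_NUMBER_RE = re.compile(r'^([+\-]?)((\d+)(\.\d+)?(e\d+)?|\.\d+)$')

# Hypothetical replacement: only the exponent group differs,
# accepting e or E followed by an optional sign and digits.
PROPOSED_NUMBER_RE = re.compile(r'^([+\-]?)((\d+)(\.\d+)?([eE][+\-]?\d+)?|\.\d+)$')

samples = ['1e3', '1E3', '1e-3', '1E+3', '1.5e-3', '1e', 'e3']
for s in samples:
    current = CURRENT_NUMBER_RE.match(s) is not None
    proposed = PROPOSED_NUMBER_RE.match(s) is not None
    print(f'{s!r}: current={current} proposed={proposed}')
```

Inputs that are malformed under both patterns, such as '1e' or 'e3', should remain rejected, which is worth covering in any regression test added alongside the fix.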
is_url:
- Relax the port segment to (:\d{1,5})? and optionally validate the numeric range (0–65535) outside the regex if desired.
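
A rough sketch of the port change follows. It tests the port segment on its own, since the rest of the URL pattern is not reproduced in this report. The names CURRENT_PORT_RE, PROPOSED_PORT_RE and port_in_range are hypothetical and do not exist in the library.

```python
import re

# Port segment only, anchored so it can be tested in isolation.
CURRENT_PORT_RE = re.compile(r'^(:\d{2,})?$')
PROPOSED_PORT_RE = re.compile(r'^(:\d{1,5})?$')


def port_in_range(port_segment: str) -> bool:
    """Hypothetical helper: validate a ':NNNN' segment against 0-65535."""
    if not port_segment:
        return True  # no explicit port given
    if PROPOSED_PORT_RE.match(port_segment) is None:
        return False  # wrong shape, e.g. ':' alone or more than 5 digits
    return 0 <= int(port_segment[1:]) <= 65535
```

Keeping the range check outside the regex keeps the pattern short. A five-digit value such as :99999 matches the relaxed pattern but is rejected by the helper.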
is_pangram:
- Normalize input to lowercase (or use a case-insensitive comparison) before computing set inclusion.
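
The normalization could look roughly like the following. It is a standalone sketch under the assumption that only ASCII letters matter. is_pangram_case_insensitive is a hypothetical name, and the library's own input checks are omitted.

```python
import string


def is_pangram_case_insensitive(text: str) -> bool:
    """Hypothetical variant: True if text contains every ASCII letter, ignoring case."""
    # Lowercase first so 'A' and 'a' count as the same letter.
    return set(string.ascii_lowercase).issubset(text.lower())


print(is_pangram_case_insensitive('THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG'))
print(is_pangram_case_insensitive('hello world'))
```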