Bug report
In short
I found a bug in the t-string tokeniser by fuzzing it, specifically in set_ftstring_expr (from Parser/lexer/lexer.c).
The following reproducer consistently raises a MemoryError, which is not the expected error (I would expect a TokenError).
import tokenize
import io
list(tokenize.tokenize(io.BytesIO(b't"{!\n!x').readline))
results in the following:
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    list(tokenize.tokenize(io.BytesIO(b't"{!\n!x').readline))
    ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bits/python315/Lib/tokenize.py", line 499, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/home/bits/python315/Lib/tokenize.py", line 634, in _generate_tokens_from_c_tokenizer
    for info in it:
                ^^
MemoryError
Longer version
set_ftstring_expr, in Parser/lexer/lexer.c, extracts the source text of the expression inside {...} so it can be attached as metadata to the token.
The function tracks two fields on the tokenizer mode struct:
- last_expr_size is set when { is seen, with value strlen(tok->cur) (bytes from tok->cur to the end of the buffer)
- last_expr_end is set when !, }, or : is seen, with value strlen(tok->start) (bytes from the delimiter to the end of the buffer)
The intended expression length is last_expr_size - last_expr_end (the number of bytes between the two positions).
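For intuition, here is a minimal standalone sketch (plain C, not the actual lexer code; the variable names merely mirror the struct fields) showing how two suffix lengths measured against the same buffer encode the span between them:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *line = "t\"{a+b!r}\"";
    const char *cur   = strchr(line, '{') + 1;  /* like tok->cur just after '{' */
    const char *start = strchr(line, '!');      /* like tok->start at the '!' */

    size_t last_expr_size = strlen(cur);    /* bytes from after '{' to end of buffer */
    size_t last_expr_end  = strlen(start);  /* bytes from the '!' to end of buffer */

    /* The expression is the prefix of cur that the two suffixes do not
       share: size - end bytes, here "a+b". */
    printf("expr_len = %zu -> \"%.*s\"\n",
           last_expr_size - last_expr_end,
           (int)(last_expr_size - last_expr_end), cur);
    return 0;
}

The subtraction is only a valid span length while both values are measured against the same buffer.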
However, this arithmetic breaks when two ! appear across two lines, as in the reproducer above.
The sequence of events (LLM-analysed):
- { on line 1: _PyLexer_update_ftstring_expr(tok, '{') sets last_expr_size = strlen(tok->cur) (bytes remaining on line 1 after {) and last_expr_end = -1.
- First ! on line 1: _PyLexer_update_ftstring_expr(tok, '!') sets last_expr_end = strlen(tok->start) (a smaller value from the same line 1 buffer). set_ftstring_expr runs, computes last_expr_size - last_expr_end > 0, and stores the result in token->metadata. Crucially, last_expr_end is now ≥ 0.
- Newline: _PyLexer_update_ftstring_expr(tok, 0) would normally append the next line's content and grow last_expr_size, keeping the measurements in sync. But the case 0 branch has a guard: it skips the append when last_expr_end >= 0. Because the first ! already set last_expr_end, the append is skipped and last_expr_size is locked at its small line-1 value.
- Second ! on line 2: _PyLexer_update_ftstring_expr(tok, '!') sets last_expr_end = strlen(tok->start), measured in the new line 2 buffer. If line 2 has more content after ! than line 1 had after {, this new last_expr_end > last_expr_size. A new token struct is active (the previous one was emitted), so token->metadata == NULL and set_ftstring_expr runs the full computation -- producing a negative Py_ssize_t.
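The desync is easy to demonstrate in isolation. Below is a contrived standalone sketch (plain C, not CPython code) that, like the buggy path, measures the size against the line-1 buffer but the end against the line-2 buffer:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

int main(void)
{
    const char *line1 = "t\"{!\n";  /* little content after the '{' */
    const char *line2 = "!abc";     /* more content after the second '!' */

    /* '{' seen on line 1: size = bytes remaining after '{' (here 2: "!\n") */
    ptrdiff_t last_expr_size = (ptrdiff_t)strlen(strchr(line1, '{') + 1);

    /* The first '!' makes last_expr_end >= 0, so the case-0 newline branch
       skips the append that would have grown last_expr_size for line 2. */

    /* Second '!' on line 2: end measured in the *new* buffer (here 4: "!abc") */
    ptrdiff_t last_expr_end = (ptrdiff_t)strlen(strchr(line2, '!'));

    printf("expr_len = %td\n", last_expr_size - last_expr_end);  /* -2 */
    return 0;
}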
That negative value is then used in three places:
- PyMem_Malloc((last_expr_size - last_expr_end + 1) * sizeof(char)): the size argument is cast to size_t, so -N+1 becomes huge.
- PyUnicode_DecodeUTF8(buf, last_expr_size - last_expr_end, NULL): the length argument is Py_ssize_t, but unicodeobject.c immediately checks if (size > PY_SSIZE_T_MAX) after casting -- a negative value cast to size_t is huge and trips the overflow guard, raising PyErr_NoMemory.
- Loop bounds (for i < ...; while i < ...): a negative bound means the loops never execute, so the comment-stripping pass is silently skipped for inputs that reach the hash_detected branch.
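The following standalone snippet (plain C, no CPython headers; ptrdiff_t stands in for Py_ssize_t) illustrates the first and third behaviours directly, and the same wrap-around underlies the second:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    ptrdiff_t expr_len = -2;  /* stand-in for last_expr_size - last_expr_end */

    /* Allocation site: the size parameter is size_t, so -2 + 1 wraps
       to an enormous request. */
    printf("alloc size: %zu\n", (size_t)(expr_len + 1));

    /* Loop bounds: i < expr_len is false from the first iteration,
       so a comment-stripping loop like this never runs. */
    for (ptrdiff_t i = 0; i < expr_len; i++) {
        puts("never reached");
    }
    return 0;
}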
Proposed fix
Compute expr_len once at the top of set_ftstring_expr and return -1 immediately if it is negative. Returning -1 signals a tokenizer error, which surfaces to Python callers as TokenError -- the correct outcome for malformed source. All five downstream uses of the subtraction are replaced with expr_len.
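In sketch form (hypothetical signature and names, not the actual patch), the guard looks something like this:

#include <stddef.h>

/* Hypothetical stand-in for the early guard added to set_ftstring_expr. */
static int
set_ftstring_expr_guard(ptrdiff_t last_expr_size, ptrdiff_t last_expr_end)
{
    ptrdiff_t expr_len = last_expr_size - last_expr_end;
    if (expr_len < 0) {
        /* Malformed source: signal a tokenizer error, which surfaces
           to Python callers as TokenError. */
        return -1;
    }
    /* ... downstream allocation, decoding, and loop bounds all read
       expr_len instead of repeating the subtraction ... */
    return 0;
}

int main(void)
{
    /* A line-2 end larger than the line-1 size (as in the reproducer)
       is rejected up front. */
    return set_ftstring_expr_guard(2, 4) == -1 ? 0 : 1;
}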
The fix (with a regression test) is already implemented on my branch.
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux