t-string tokeniser reports MemoryError on invalid input #149183

@KowalskiThomas

Description

Bug report

In short

I found a bug in the t-string tokeniser by fuzzing it, specifically in set_ftstring_expr (from Parser/lexer/lexer.c).

The following reproducer consistently reports a MemoryError, which isn't the expected error (I would expect a TokenError):

import tokenize
import io

list(tokenize.tokenize(io.BytesIO(b't"{!\n!x').readline))

results in the following:

Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    list(tokenize.tokenize(io.BytesIO(b't"{!\n!x').readline))
    ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bits/python315/Lib/tokenize.py", line 499, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/home/bits/python315/Lib/tokenize.py", line 634, in _generate_tokens_from_c_tokenizer
    for info in it:
                ^^
MemoryError

Longer version

set_ftstring_expr, in Parser/lexer/lexer.c, extracts the source text of the expression inside {...} so it can be attached as metadata to the token.

The function tracks two fields on the tokenizer mode struct:

  • last_expr_size is set when { is seen, with value strlen(tok->cur) (bytes from tok->cur to end of buffer)
  • last_expr_end is set when !, }, or : is seen, with value strlen(tok->start) (bytes from the delimiter to end of buffer)

The intended expression length is last_expr_size - last_expr_end (the number of bytes between the two positions).
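For the well-formed single-line case, the bookkeeping can be mimicked in Python, modelling C's strlen(p) at an offset into the buffer as len(buf) - pos (strlen_from is a hypothetical helper, not part of the tokenizer):

```python
def strlen_from(buf: bytes, pos: int) -> int:
    """Mimic C strlen(p) for a pointer p at byte offset pos into buf."""
    return len(buf) - pos

# Single-line t-string body: t"{expr!r}"
line = b'expr!r}"'  # buffer contents after the opening {

last_expr_size = strlen_from(line, 0)                 # measured when { is seen
last_expr_end = strlen_from(line, line.index(b'!'))   # measured when ! is seen

# Bytes between { and ! -- the expression text "expr"
expr_len = last_expr_size - last_expr_end
print(expr_len)  # 4 == len(b"expr")
```

As long as both measurements come from the same buffer, the subtraction is non-negative and gives the expression length.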

However, this bookkeeping breaks when two ! appear across two lines, for example:

t"{expr!conv1
n!conv2

The sequence of events (LLM-analysed):

  1. { on line 1: _PyLexer_update_ftstring_expr(tok, '{') sets last_expr_size = strlen(tok->cur) (bytes remaining on line 1 after {) and last_expr_end = -1.

  2. First ! on line 1: _PyLexer_update_ftstring_expr(tok, '!') sets last_expr_end = strlen(tok->start) (a smaller value from the same line 1 buffer). set_ftstring_expr runs, computes last_expr_size - last_expr_end > 0, stores result in token->metadata. Crucially, last_expr_end is now ≥ 0.

  3. Newline: _PyLexer_update_ftstring_expr(tok, 0) would normally append the next line's content and grow last_expr_size, keeping the measurements in sync. But the case 0 branch has a guard: it skips the append when last_expr_end >= 0. Because the first ! already set last_expr_end, the append is skipped and last_expr_size is locked at its small line-1 value.

  4. Second ! on line 2: _PyLexer_update_ftstring_expr(tok, '!') sets last_expr_end = strlen(tok->start) measured in the new line 2 buffer. If line 2 has more content after ! than line 1 had after {, this new last_expr_end > last_expr_size. A new token struct is active (the previous one was emitted), so token->metadata == NULL and set_ftstring_expr runs the full computation -- producing a negative Py_ssize_t.
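Under the same strlen-as-offset model, the four steps above can be replayed in Python to show the subtraction going negative (the buffer contents here are illustrative, not taken from the real tokenizer):

```python
def strlen_from(buf: bytes, pos: int) -> int:
    """Mimic C strlen(p) for a pointer p at byte offset pos into buf."""
    return len(buf) - pos

line1 = b'e!c1\n'           # remainder of line 1 after the opening {
line2 = b'n!conv2longer\n'  # the next physical line

# Step 1: { seen -> size measured against the line 1 buffer
last_expr_size = strlen_from(line1, 0)                  # 5
# Step 2: first ! seen -> end measured in the same buffer; now >= 0
last_expr_end = strlen_from(line1, line1.index(b'!'))   # 4
# Step 3: newline -- the append that would grow last_expr_size is
# skipped because last_expr_end >= 0, so the size stays locked at 5.
# Step 4: second ! seen -> end re-measured in the line 2 buffer
last_expr_end = strlen_from(line2, line2.index(b'!'))   # 13

expr_len = last_expr_size - last_expr_end
print(expr_len)  # -8: line 2 has more bytes after ! than line 1 had after {
```

The two measurements now come from different buffers, so their difference no longer means anything.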

That negative value is then used in three places:

  1. PyMem_Malloc((last_expr_size - last_expr_end + 1) * sizeof(char)) -- the argument is cast to size_t, so -N + 1 becomes a huge allocation request.
  2. PyUnicode_DecodeUTF8(buf, last_expr_size - last_expr_end, NULL) -- the length argument is Py_ssize_t, but unicodeobject.c immediately checks if (size > PY_SSIZE_T_MAX) after casting; a negative value cast to size_t is huge and trips the overflow guard, raising PyErr_NoMemory.
  3. Loop bounds (for i < ...; while i < ...) -- a negative bound means the loops never execute, so the comment-stripping pass is silently skipped for inputs that reach the hash_detected branch.
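The effect of the cast in the first two uses can be checked from Python with ctypes; C converts a negative signed value to size_t modulo 2**bits, where bits is the platform's size_t width:

```python
import ctypes

expr_len = -8  # a negative length, as in the scenario above
bits = 8 * ctypes.sizeof(ctypes.c_size_t)

# Implicit signed -> size_t conversion wraps modulo 2**bits
as_size_t = ctypes.c_size_t(expr_len).value
assert as_size_t == 2**bits + expr_len  # e.g. 2**64 - 8 on a 64-bit build
print(hex(as_size_t))
```

This is why the failure surfaces as a MemoryError rather than a crash: the wrapped value is far larger than any plausible allocation.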

Proposed fix

Compute expr_len once at the top of set_ftstring_expr and return -1 immediately if it is negative. Returning -1 signals a tokenizer error, which surfaces to Python callers as TokenError -- the correct outcome for malformed source. All five downstream uses of the subtraction are replaced with expr_len.
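In Python terms, the guard amounts to the following (a sketch of the fix's logic, not the actual C patch; checked_expr_len is a hypothetical name):

```python
def checked_expr_len(last_expr_size: int, last_expr_end: int) -> int:
    """Compute the expression length once, rejecting inconsistent state."""
    expr_len = last_expr_size - last_expr_end
    if expr_len < 0:
        # Tokenizer error: surfaces to Python callers as TokenError
        return -1
    return expr_len

assert checked_expr_len(8, 4) == 4    # well-formed single-line case
assert checked_expr_len(5, 13) == -1  # desynchronised multi-line case
```

With the length validated up front, the malloc size, decode length, and loop bounds all read the same checked value.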

The fix (with a regression test) is already implemented on my branch.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

Metadata


    Labels

    interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-parser, type-bug (an unexpected behavior, bug, or error)
