
Models compared: GPT-5.1 Thinking vs GPT-5 Thinking
Tooling: Built-in python sandbox (single execution)
Severity: High (blocks an entire class of workflows that rely on the Python tool)
Status: Reproducible on demand
Summary
Running the same prompt with the same attached files produces opposite outcomes depending on the model. On GPT-5 Thinking the python tool executes and returns the expected audit results. On GPT-5.1 Thinking the python tool fails at a system level before user code begins, so there is no Markdown output, no JSON object, and no file reads. The failure repeats within the same chat and across fresh chats.
Environment
- ChatGPT Web, desktop browser (Windows 10 LTSC 2019, Chrome 142)
- Region: Spain
- Model A: GPT-5.1 Thinking → fails
- Model B: GPT-5 Thinking → succeeds
- Same account, same session style, same files, same prompt
Files attached to the chat
A small Windows automation project: text-only, UTF-8, a mix of .ps1, .cmd, .md, .xml, and .gitattributes files, for example:
CreatePrimaryAdmin.ps1, BootstrapLocalAdmin.ps1, SetupComplete.cmd, PreOOBE.cmd, Autounattend.xml, .gitattributes, README.md, AGENTS.md, DECISIONS.md, SECURITY.md, CONTRIBUTING.md, INTERACTION_CONTRACT.md, BACKGROUND.md, LICENSE
The exact file set is not critical. The failure reproduces as long as several text files are attached.
Minimal prompt used
The prompt asks the model to call python.exec exactly once and run a safe, defensive audit script (a sketch appears after this list). The code:
- enumerates files in /mnt/data
- reads each file with a robust multi-codec decode
- computes line counts and EOL style
- prints a human-readable Markdown report and then prints a JSON object
- swallows exceptions into report["errors"]
No network. No writes. Pure read-only. Single tool call.
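For reference, this is a minimal sketch of that audit script, reconstructed from the description above; the codec fallback order, report field names, and Markdown layout are my assumptions, not the exact prompt body:

```python
# Minimal sketch of the read-only audit (reconstruction, not the exact prompt body).
import json
from pathlib import Path

DATA_DIR = Path("/mnt/data")
CODECS = ("utf-8-sig", "utf-8", "cp1252", "latin-1")  # assumed fallback order

report = {"files": [], "errors": []}

def decode_best_effort(raw: bytes) -> str:
    """Try several codecs in order; last resort uses replacement characters."""
    for codec in CODECS:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

def eol_style(text: str) -> str:
    """Classify line endings as CRLF, LF, mixed, or none."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf
    if crlf and lf:
        return "mixed"
    return "CRLF" if crlf else ("LF" if lf else "none")

for path in sorted(DATA_DIR.iterdir()):
    if not path.is_file():
        continue
    try:
        text = decode_best_effort(path.read_bytes())
        lines = text.count("\n") + (1 if text and not text.endswith("\n") else 0)
        report["files"].append({
            "name": path.name,
            "bytes": path.stat().st_size,
            "lines": lines,
            "eol": eol_style(text),
        })
    except Exception as exc:  # soft-fail: record the error instead of raising
        report["errors"].append({"name": path.name, "error": repr(exc)})

# Human-readable Markdown block first, then the machine-readable JSON object.
print("## File audit")
for entry in report["files"]:
    print(f"- {entry['name']}: {entry['lines']} lines, {entry['eol']}, {entry['bytes']} bytes")
print()
print(json.dumps(report, indent=2))
```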
Note: For triage, an even smaller repro also fails on GPT-5.1 Thinking, for example a one-liner print("ok") in a single python.exec call inside a fresh chat with the same files attached.
Steps to reproduce
- Start a new ChatGPT chat.
- Attach the project files listed above.
- Select GPT-5.1 Thinking.
- Send the prompt that performs one python.exec run with the safe audit.
- Observe the result.
- Repeat in a second fresh chat with GPT-5 Thinking and the same prompt and files.
Expected behavior
- The python tool should start the sandbox, run the user code, and print two blocks (illustrated below):
  - a concise Markdown audit
  - a JSON object with file-level details and any soft errors
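To make "two blocks" concrete, a successful run is expected to print output shaped roughly like this; the file names and numbers are illustrative placeholders, not results from an actual run:

```text
## File audit
- README.md: 40 lines, LF, 1500 bytes
- Autounattend.xml: 200 lines, CRLF, 9000 bytes

{
  "files": [
    {"name": "README.md", "bytes": 1500, "lines": 40, "eol": "LF"},
    ...
  ],
  "errors": []
}
```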
Actual behavior
- On GPT-5.1 Thinking, the python tool fails before any user code executes.
  - No Markdown is printed.
  - The JSON object is never created.
  - No file under /mnt/data is actually read for that run.
  - UI shows “message stream error” bubbles.
- On GPT-5 Thinking, the same prompt and files succeed and return the expected Markdown and JSON.
Repro frequency
- Always in my tests on GPT-5.1 Thinking, including:
  - a clean, non-project chat with only the attached files and the prompt
  - repeated runs inside the same chat
  - repeated runs across brand new chats
Impact
- Blocks audits, data inspection, quick ETL, and any workflow that depends on the Python tool while using GPT-5.1 Thinking.
- Users get a silent failure or a “message stream error,” with no actionable diagnostics.
Evidence (screenshots to attach)
- python_exec fail translation.jpg: shows a GPT-5.1 Thinking chat where the assistant explicitly reports that the python tool hit a system-level error before user code started, so no Markdown or JSON exists.
- 5 works arrow.jpg: shows GPT-5 Thinking with the same files and prompt. The audit runs and returns a structured summary.
- 5.1 fail arrow.jpg: shows GPT-5.1 Thinking with the same files and prompt. Two red “message stream error” bubbles appear. No output is produced.
Notes and hypotheses
- The behavior suggests a sandbox or tool-bridge failure specific to GPT-5.1 Thinking, not user code.
- File sizes are small. No long-running compute. No stdout flooding.
- The prompt enforces a single python.exec call, so it is not a multi-run limit.
- The exact same script and files work on GPT-5 Thinking, which strongly isolates the issue to the 5.1 tool integration.
Temporary workaround
- Switch the chat to GPT-5 Thinking for tasks that need the Python tool.
- Keep the same prompt and attachments. Execution succeeds there.
What would help from the team
- Confirm whether GPT-5.1 Thinking uses a different Python tool gateway than GPT-5 Thinking.
- A server-side trace for the failing chat showing the tool session creation and the crash reason.
- If limits differ for 5.1, please document them in the UI or tool docs.
