How an AI Chatbot Turned Old Web Bugs into a Security Wake-Up Call

AI chatbots are quickly becoming the default interface for customer support. They’re easy to deploy, impressive in demos, and promise real operational savings. But the 2025 security findings involving the customer-facing chatbot used by Eurostar show how quickly things can go wrong when AI is layered on top of familiar web architectures without rethinking trust boundaries.

The issues, reported by Pen Test Partners, weren’t exotic model failures or cases of “AI going rogue.” They were ordinary web and API security mistakes—simply amplified by a large language model sitting in the middle.

This write-up explains what the chatbot did, what broke, how the exploit paths worked (in a sanitized form), and what engineers should take away from it.

What the chatbot was — and wasn’t

Eurostar’s chatbot was designed to answer general customer questions using a large language model (LLM). At the time of testing:

It did not have access to customer accounts
It was not connected to booking or payment systems
It could not retrieve personal or transactional data

This limited scope is important. It prevented a data breach. But it also encouraged a dangerous assumption: that because the bot was “read-only,” its security posture didn’t matter as much.

The architectural pattern behind the failure

At a high level, the chatbot followed a common LLM integration pattern:

A browser-based chat UI collects user messages
Messages are sent to a backend API along with conversation state
The backend builds a prompt consisting of:
- a system prompt
- conversation history
- the latest user message
The full prompt is sent to the LLM
The model’s response is rendered directly in the UI

The failure occurred at the trust boundary between the browser and the backend.

The core mistake: trusting conversation history

The backend validated only the most recent user message before sending the entire conversation history to the LLM. Earlier messages were assumed to be safe because they had already passed through the UI.

That assumption was wrong.

Conversation state—including message content and identifiers—was partially controlled by the client and insufficiently revalidated server-side. This created a classic time-of-check versus time-of-use flaw that has existed in web applications for decades.

Once that boundary was broken, the rest of the exploit chain followed naturally.

Exploit paths (sanitized)

1. Guardrail bypass via conversation manipulation

Because only the last message was validated, earlier conversation turns could be altered after initial checks. Injected instructions placed into historical messages were included in the prompt context sent to the LLM.

From the model’s point of view, these instructions appeared to be legitimate prior conversation. The final user message could remain harmless, allowing the request to pass policy checks.

Result: content and behavior guardrails were bypassed without triggering filters.
Engineering lesson: guardrails must apply to the entire prompt, not individual turns. Conversation history must be reconstructed and validated server-side.

2. Prompt injection and system prompt disclosure

Once injected instructions were processed, the chatbot could be steered into discussing its own internal rules and behavior. The system prompt—included verbatim in the context window—could be paraphrased or partially revealed.

This was not a model bug. LLMs do not understand secrecy. If instructions are in the context, they are potentially retrievable.

Result: internal configuration details were exposed, reducing the effort required for further manipulation.
Engineering lesson: treat system prompts as sensitive configuration, not secrets. Do not rely on the model to keep them hidden.

3. HTML injection and self-XSS

The frontend rendered chatbot responses as formatted HTML without strict sanitization. The model could be induced to return browser-interpreted markup, which the UI rendered directly.

This enabled self-XSS—code execution in the user’s own browser session—typically requiring social engineering, but still a real security risk.

Result: phishing vectors and erosion of user trust in a customer-facing interface.
Engineering lesson: AI output is untrusted input. Encode by default, sanitize aggressively, and avoid rendering raw HTML unless absolutely necessary.

4. Weak validation of conversation and message identifiers

Conversation and message identifiers were accepted with minimal validation. Arbitrary or simple values were not consistently rejected.

On their own, these flaws were low impact. Combined with the issues above, they made conversation manipulation easier and more reliable.

Engineering lesson: identifiers must be server-generated, non-guessable, and bound to authenticated context—AI or not.

Why no customer data was exposed

Despite the seriousness of the issues:

No customer booking data was accessed
No authenticated sessions were compromised
No internal systems were reached

The chatbot’s lack of privileged access prevented escalation. This was a matter of scope containment and good fortune—not strong defensive depth. Had the bot been connected to backend systems, the same flaws could have had far more serious consequences.

Disclosure and remediation

Pen Test Partners reported the issues through Eurostar’s vulnerability disclosure process in mid-2025. Communication difficulties followed, reportedly linked to transitions in vulnerability management. After public attention, the vulnerabilities were patched and the findings published.

While the disclosure process itself became contentious, the technical findings were not seriously disputed.

Why this case matters beyond Eurostar

This incident is not a warning about AI being inherently unsafe.

It’s a reminder that AI systems magnify existing security mistakes.

Every issue demonstrated here—trusting client state, partial validation, unsafe rendering—predates LLMs by many years. The difference is that LLMs eagerly consume context and follow instructions, making these mistakes easier to exploit and easier to chain together.

A simple rule for teams shipping chatbots

If your chatbot:

Trusts client-supplied conversation history
Validates only the most recent message
Renders AI output as HTML
Embeds sensitive rules directly in prompts

…you should assume you’re exposed to the same class of vulnerabilities.

Final takeaway

The Eurostar chatbot incident wasn’t about AI hallucinations or runaway models. It was about what happens when old security assumptions meet a system that processes language flexibly and obediently.

AI doesn’t replace security fundamentals.
It punishes you harder for ignoring them.

The technology is new.
The mistakes are not.