CVE-2026-0847 — NLTK Path Traversal Vulnerability Leading to Arbitrary File Read

CVE ID: CVE-2026-0847
Product: Natural Language Toolkit (NLTK)
Component: CorpusReader classes
Vulnerability Type: Path Traversal → Arbitrary File Read
CWE: CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)
CVSS v3.1 Score: 8.6 (High)
Attack Vector: Network
Attack Complexity: Low
Privileges Required: None
User Interaction: None
Confidentiality Impact: High
Integrity Impact: None
Availability Impact: None
Exploitability: High if user-controlled file paths are processed
Exploit Availability: No confirmed public exploit tool yet, but exploitation is straightforward and proof-of-concept demonstrations are possible
Affected Versions: NLTK versions up to and including 3.9.2
Published: March 2026

Overview

A high-severity security flaw tracked as CVE-2026-0847 affects the Python Natural Language Toolkit (NLTK) library. The issue occurs due to improper validation of file paths within multiple corpus reader classes used for loading text datasets.

In vulnerable implementations, file paths passed to certain NLTK components are not sufficiently sanitized before being handed to the underlying filesystem operations. As a result, specially crafted input containing directory traversal sequences such as ../ can escape the intended directory structure and cause any file readable by the application process to be opened.

This vulnerability primarily impacts applications that expose NLTK functionality through APIs, web interfaces, automated data processing pipelines, or machine learning platforms where file names or corpus paths may be influenced by external input.

If exploited successfully, attackers may obtain sensitive system files including configuration files, credentials, API keys, environment variables, or other private information stored on the server.

Affected Components

The vulnerability is associated with several NLTK corpus reader classes that handle file loading operations.

Affected components include:

WordListCorpusReader
TaggedCorpusReader
BracketParseCorpusReader

These classes are designed to read datasets from specified directories. However, insufficient path validation allows user-controlled paths to bypass directory restrictions.

Vulnerability Details

In the vulnerable implementation, file paths received by corpus reader classes are processed without verifying whether they remain inside the expected corpus directory.

When the application constructs the file path, traversal sequences may not be filtered or normalized correctly.

For example, if an application expects to load a dataset file located in a corpus directory, the path may be built using code similar to the following:

corpus_root + "/" + filename

If the filename parameter contains traversal characters, the resulting path may escape the intended directory.

Example malicious input:

../../../../etc/passwd

Instead of loading a legitimate dataset file, the application may open a sensitive system file. If the application then returns the reader's output (for example, a token list or an error message that echoes file content) in its response, the attacker obtains the contents of that file.
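
A minimal sketch of why the concatenation pattern shown above is unsafe, using only the standard library. The corpus root /app/corpora is borrowed from the proof of concept in this advisory, and the helper name naive_join is hypothetical; this illustrates the general weakness and is not NLTK's actual code.

```python
import posixpath


def naive_join(corpus_root: str, filename: str) -> str:
    """Build a path by plain concatenation, as in the vulnerable pattern."""
    return corpus_root + "/" + filename


# Hypothetical corpus root and attacker-supplied file name.
joined = naive_join("/app/corpora", "../../../../etc/passwd")

# normpath collapses the ".." segments and shows where the path really points.
resolved = posixpath.normpath(joined)

print(joined)    # /app/corpora/../../../../etc/passwd
print(resolved)  # /etc/passwd
```

The joined string still begins with the corpus root, which is why a check on the unresolved string gives no protection.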

This weakness exists because path canonicalization or validation is not enforced before the file access operation is performed.

Technical Root Cause

The vulnerability is caused by a combination of design and validation issues:

File paths are accepted without strict validation.
Directory traversal characters are not filtered.
Canonical path resolution is not enforced.
File operations rely directly on user-supplied paths.

Because of these factors, an attacker may manipulate the file path to escape the designated corpus directory.

Attack Scenario

A realistic exploitation scenario may occur as follows:

A web application offers NLP services such as text analysis or corpus processing.
The application allows users to select or specify dataset files.
The backend passes the provided path directly to NLTK corpus reader functions.
The attacker supplies a malicious path containing directory traversal sequences.
The application reads arbitrary files from the server filesystem.

Possible sensitive targets include:

/etc/passwd
/etc/shadow (readable only if the application process runs as root)
/home/user/.ssh/id_rsa
/application/.env
/config/settings.yaml

Once accessed, these files may expose credentials or sensitive configuration information.

Impact

The primary security impact of this vulnerability is unauthorized disclosure of sensitive information.

Potential consequences include:

Exposure of system configuration files
Leakage of application secrets
Disclosure of API keys and authentication tokens
Exposure of SSH private keys
Discovery of database credentials

Although the vulnerability itself enables file read access, attackers may use the obtained information to perform additional attacks such as privilege escalation, lateral movement, or remote system compromise.

Proof of Concept (Educational)

The following example illustrates how a vulnerable implementation could be exploited.

from nltk.corpus.reader import WordListCorpusReader

# The second argument (the file ID) carries the traversal sequence
corpus = WordListCorpusReader("/app/corpora", "../../../../etc/passwd")
data = corpus.words()
print(data)

If the application exposes the file name parameter to user input, the above payload may cause the server to read the /etc/passwd file.

This proof-of-concept is provided for educational and defensive testing purposes only.

Example Exploitation Payloads

Attackers may attempt various directory traversal patterns to bypass input filters.

Basic traversal payloads:

../../../../etc/passwd
../../../etc/shadow
../../../../home/app/.env

URL encoded payloads:

..%2F..%2F..%2F..%2Fetc%2Fpasswd

Double encoded traversal:

..%252F..%252F..%252Fetc%252Fpasswd

Windows based traversal:

..\..\..\..\Windows\System32\drivers\etc\hosts

These payloads attempt to move outside the application directory to access sensitive files.
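
The encoded variants above matter because a filter that inspects the raw string, or decodes it only once, can miss them. The sketch below decodes repeatedly before checking for ".." path segments. The function names fully_decode and has_traversal are hypothetical, and the round limit of 5 is an arbitrary assumption.

```python
from urllib.parse import unquote


def fully_decode(value: str, max_rounds: int = 5) -> str:
    """Percent-decode repeatedly until the value stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


def has_traversal(value: str) -> bool:
    """Return True if any path segment equals '..' after decoding."""
    decoded = fully_decode(value).replace("\\", "/")
    return ".." in decoded.split("/")


payloads = [
    "../../../../etc/passwd",
    "..%2F..%2F..%2F..%2Fetc%2Fpasswd",
    "..%252F..%252F..%252Fetc%252Fpasswd",
    "..\\..\\..\\..\\Windows\\System32\\drivers\\etc\\hosts",
]
for payload in payloads:
    print(has_traversal(payload), payload)
```

A check like this is useful for logging and early rejection, but it should complement, not replace, resolving the final path and confirming it stays inside the corpus directory.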

MITRE ATT&CK Mapping

Initial Access

T1190 – Exploit Public-Facing Application

Discovery

T1083 – File and Directory Discovery

Credential Access

T1552 – Unsecured Credentials

Collection

T1005 – Data from Local System

Exfiltration

T1041 – Exfiltration Over Command and Control Channel

Indicators of Exploitation

Security teams should monitor for the following indicators:

Repeated usage of ../ or ..\ patterns in requests
Access attempts to sensitive files such as /etc/passwd
Requests referencing .env, .ssh, or configuration files
File access outside expected application directories

Example suspicious request:

GET /api/corpus?file=../../../../etc/passwd

Detection

Detection should focus on identifying suspicious path patterns and abnormal file access behavior.

Monitoring should be implemented at multiple layers including application logs, system logs, and web server logs.
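
As a rough illustration of log-based detection, the sketch below scans access-log lines for raw, URL-encoded, and double-encoded traversal sequences and counts hits per client. The log layout (client IP as the first space-separated field), the sample lines, and the function name count_traversal_by_ip are assumptions made for this example.

```python
import re
from collections import Counter

# Matches ../ or ..\ in raw, single-encoded, or double-encoded form.
TRAVERSAL = re.compile(
    r"(\.\.|%(25)?2e%(25)?2e)(/|\\|%(25)?2f|%(25)?5c)", re.IGNORECASE
)


def count_traversal_by_ip(log_lines):
    """Count matching lines per client IP (assumed to be the first field)."""
    hits = Counter()
    for line in log_lines:
        if TRAVERSAL.search(line):
            hits[line.split(" ", 1)[0]] += 1
    return hits


# Hypothetical access-log lines using documentation IP ranges.
sample = [
    '203.0.113.7 - - "GET /api/corpus?file=../../../../etc/passwd HTTP/1.1" 200',
    '203.0.113.7 - - "GET /api/corpus?file=..%252F..%252Fetc%252Fpasswd HTTP/1.1" 200',
    '198.51.100.4 - - "GET /api/corpus?file=words.txt HTTP/1.1" 200',
]
hits = count_traversal_by_ip(sample)
print(hits)
```

Grouping by source address helps separate a single malformed request from repeated probing by one client.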

Log Sources

Effective detection may rely on the following log sources:

Application logs
Python runtime logs
API request logs
Web server access logs
File integrity monitoring logs
Linux audit logs (auditd)
Container runtime logs
Kubernetes pod logs
Cloud workload logs

Detection Rules

Splunk

index=web OR index=app
| search "../" OR "..\\" OR "%2e%2e%2f" OR "%2e%2e%5c"
| stats count by src_ip, uri, user_agent
| sort -count

index=web_logs
| search "etc/passwd" OR ".env" OR ".ssh"
| stats count by src_ip, uri

Elastic (KQL)

url.path : "*../*" OR url.path : "*..\\*"

http.request.body.content : "*../*" OR http.request.body.content : "*%2e%2e%2f*"

Microsoft Sentinel (KQL)

CommonSecurityLog
| where RequestURL contains "../"
   or RequestURL contains "..\\"
   or RequestURL contains "%2e%2e%2f"
| summarize count() by SourceIP, RequestURL

Wazuh

rule.id: directory_traversal
and
(full_log contains "../" or full_log contains "..\\")

Suricata IDS Rule

alert http any any -> any any (msg:"Possible Path Traversal Attempt"; content:"../"; http_uri; sid:900001; rev:1;)

Mitigation

Immediate mitigation should include strict validation of file paths and elimination of direct user input usage in corpus file loading operations.

Recommended defensive practices include:

Enforcing allow-listed directories for file access
Rejecting paths containing traversal sequences
Canonicalizing file paths before file operations
Running applications with least-privilege permissions
Isolating NLP processing environments in containers
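
One way to remove direct user input from file loading is to let clients choose from fixed identifiers that the server maps to file names it controls. The sketch below assumes a hypothetical ALLOWED_DATASETS table and helper dataset_filename; the identifiers and file names are illustrative only.

```python
# Hypothetical mapping from public dataset identifiers to fixed file names.
ALLOWED_DATASETS = {
    "english-words": "en/words.txt",
    "stopwords": "en/stopwords.txt",
}


def dataset_filename(dataset_id: str) -> str:
    """Translate a client-supplied identifier into a server-chosen file name."""
    try:
        return ALLOWED_DATASETS[dataset_id]
    except KeyError:
        raise ValueError(f"Unknown dataset: {dataset_id!r}") from None


print(dataset_filename("stopwords"))
```

Because the client never supplies a path, a traversal payload is simply an unknown identifier and is rejected before any filesystem call.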

Secure path validation example:

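import os

# Resolve the requested name against the corpus root (follows symlinks and "..")
allowed_directory = os.path.realpath(allowed_directory)
real_path = os.path.realpath(os.path.join(allowed_directory, user_input))
# Compare whole path components; a plain startswith() would accept "/app/corpora_evil"
if os.path.commonpath([allowed_directory, real_path]) != allowed_directory:
    raise ValueError("Invalid file path")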
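
The check above can be wrapped in a small helper and exercised against the payloads listed earlier. This is a sketch under stated assumptions: the name resolve_inside is hypothetical, a temporary directory stands in for the corpus root, and the component-wise comparison via os.path.commonpath is one possible way to implement the containment test.

```python
import os
import tempfile


def resolve_inside(root: str, name: str) -> str:
    """Return the real path of name under root, or raise ValueError if it escapes."""
    real_root = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(real_root, name))
    # commonpath compares whole path components, so a sibling directory that
    # merely shares a name prefix with the root is not treated as inside it.
    if os.path.commonpath([real_root, candidate]) != real_root:
        raise ValueError(f"Path escapes corpus root: {name!r}")
    return candidate


with tempfile.TemporaryDirectory() as root:
    print(resolve_inside(root, "words.txt"))
    for bad in ("../../../../etc/passwd", "/etc/passwd"):
        try:
            resolve_inside(root, bad)
        except ValueError as exc:
            print("rejected:", exc)
```

Only the returned path should be passed on to file-reading code, so that the value that was validated is the value that is opened.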

Remediation

The recommended remediation is to upgrade the NLTK library to the latest version once the official fix becomes available.

Security teams should track the upstream repository for updates and apply patches as soon as they are released.
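
Until a fixed release is confirmed, teams can at least flag installations that fall in the affected range stated in this advisory (up to and including 3.9.2). The sketch below is a simplified comparison that looks only at the leading numeric part of a version string; the names AFFECTED_MAX, parse_version, and is_affected are hypothetical.

```python
import re

AFFECTED_MAX = (3, 9, 2)  # "up to and including 3.9.2" per this advisory


def parse_version(text: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '3.9.2' -> (3, 9, 2)."""
    match = re.match(r"\d+(\.\d+)*", text)
    if not match:
        raise ValueError(f"Unrecognized version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_affected(version_text: str) -> bool:
    """True if the version falls in the affected range stated in the advisory."""
    return parse_version(version_text) <= AFFECTED_MAX


for candidate in ("3.8.1", "3.9.2", "3.9.3"):
    print(candidate, is_affected(candidate))
```

In a deployed environment the installed version string can be obtained with importlib.metadata.version("nltk") and passed to is_affected. Pre-release and local version suffixes are ignored by this simplified parser.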

Official Upgrade / Patch Information

NLTK Project Security Updates
https://www.nltk.org/news.html

Security Recommendations

Organizations using NLTK in production systems should implement the following controls:

Restrict user-controlled file paths
Validate all filesystem operations
Monitor logs for traversal patterns
Apply dependency security scanning
Keep NLP libraries updated

Applications providing NLP services through public APIs should be reviewed carefully, as they are more likely to expose this vulnerability.
