CVE-2026-0847 — NLTK Path Traversal Vulnerability Leading to Arbitrary File Read
CVE ID: CVE-2026-0847
Product: Natural Language Toolkit (NLTK)
Component: CorpusReader classes
Vulnerability Type: Path Traversal → Arbitrary File Read
CWE: CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)
CVSS v3.1 Score: 8.6 (High)
Attack Vector: Network
Attack Complexity: Low
Privileges Required: None
User Interaction: None
Confidentiality Impact: High
Integrity Impact: None
Availability Impact: None
Exploitability: High if user-controlled file paths are processed
Exploit Availability: No confirmed public exploit tool yet, but exploitation is straightforward and proof-of-concept demonstrations are possible
Affected Versions: NLTK versions up to and including 3.9.2
Published: March 2026
Overview
A high-severity security flaw tracked as CVE-2026-0847 affects the Python Natural Language Toolkit (NLTK) library. The issue occurs due to improper validation of file paths within multiple corpus reader classes used for loading text datasets.
In vulnerable implementations, file paths passed to certain NLTK components are not sufficiently sanitized before being processed by the underlying filesystem operations. Because of this weakness, specially crafted input containing directory traversal sequences such as ../ can escape the intended directory structure and cause arbitrary files on the host system to be read.
This vulnerability primarily impacts applications that expose NLTK functionality through APIs, web interfaces, automated data processing pipelines, or machine learning platforms where file names or corpus paths may be influenced by external input.
If exploited successfully, attackers may obtain sensitive system files including configuration files, credentials, API keys, environment variables, or other private information stored on the server.
Affected Components
The vulnerability is associated with several NLTK corpus reader classes that handle file loading operations.
Affected components include:
- WordListCorpusReader
- TaggedCorpusReader
- BracketParseCorpusReader
These classes are designed to read datasets from specified directories. However, insufficient path validation allows user-controlled paths to bypass directory restrictions.
Vulnerability Details
In the vulnerable implementation, file paths received by corpus reader classes are processed without verifying whether they remain inside the expected corpus directory.
When the application constructs the file path, traversal sequences may not be filtered or normalized correctly.
For example, if an application expects to load a dataset file located in a corpus directory, the path may be built using code similar to the following:
corpus_root + "/" + filename
If the filename parameter contains traversal characters, the resulting path may escape the intended directory.
Example malicious input:
../../../../etc/passwd
Instead of loading a legitimate dataset file, the application may open a sensitive system file. If the application returns the parsed corpus data in its response, the attacker receives the contents of that file.
This weakness exists because path canonicalization or validation is not enforced before the file access operation is performed.
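The effect of the concatenation described above can be reproduced with the standard library alone. The following is a minimal sketch, not NLTK's actual code: the directory /app/corpora and the variable names are illustrative assumptions, and posixpath.normpath stands in for the way the operating system collapses ".." segments (symlinks aside).

```python
import posixpath

corpus_root = "/app/corpora"            # hypothetical corpus directory
filename = "../../../../etc/passwd"     # attacker-influenced value

# Naive construction, mirroring: corpus_root + "/" + filename
naive_path = corpus_root + "/" + filename

# normpath collapses the ".." segments, showing where the path really points.
resolved = posixpath.normpath(naive_path)

print(naive_path)
print(resolved)
```

The resolved path no longer starts with the corpus root, which is exactly the condition a containment check needs to test for.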
Technical Root Cause
The vulnerability is caused by a combination of design and validation issues:
- File paths are accepted without strict validation.
- Directory traversal characters are not filtered.
- Canonical path resolution is not enforced.
- File operations rely directly on user-supplied paths.
Because of these factors, an attacker may manipulate the file path to escape the designated corpus directory.
Attack Scenario
A realistic exploitation scenario may occur as follows:
- A web application offers NLP services such as text analysis or corpus processing.
- The application allows users to select or specify dataset files.
- The backend passes the provided path directly to NLTK corpus reader functions.
- The attacker supplies a malicious path containing directory traversal sequences.
- The application reads arbitrary files from the server filesystem.
Possible sensitive targets include:
/etc/passwd
/etc/shadow
/home/user/.ssh/id_rsa
/application/.env
/config/settings.yaml
Once accessed, these files may expose credentials or sensitive configuration information.
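The scenario above can be simulated safely without a web server or NLTK. The sketch below builds a throwaway directory tree and uses a hypothetical backend helper, load_wordlist_naive, that joins user input to the corpus root with no validation. The helper, the file names, and the fake key are all assumptions made for illustration.

```python
import os
import tempfile

def load_wordlist_naive(corpus_root, filename):
    """Hypothetical backend helper: joins user input to the corpus root
    without any validation, then returns the non-empty lines."""
    path = os.path.join(corpus_root, filename)
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

# Throwaway tree: a corpus folder plus a "secret" file outside of it.
base = tempfile.mkdtemp()
corpus_root = os.path.join(base, "corpora")
os.makedirs(corpus_root)
with open(os.path.join(corpus_root, "words.txt"), "w", encoding="utf-8") as fh:
    fh.write("alpha\nbeta\n")
with open(os.path.join(base, "secret.env"), "w", encoding="utf-8") as fh:
    fh.write("API_KEY=example-not-a-real-key\n")

legit = load_wordlist_naive(corpus_root, "words.txt")       # intended use
leaked = load_wordlist_naive(corpus_root, "../secret.env")  # traversal
print(legit)
print(leaked)
```

The second call returns the contents of a file that was never part of the corpus, which is the disclosure path the steps above describe.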
Impact
The primary security impact of this vulnerability is unauthorized disclosure of sensitive information.
Potential consequences include:
- Exposure of system configuration files
- Leakage of application secrets
- Disclosure of API keys and authentication tokens
- Exposure of SSH private keys
- Discovery of database credentials
Although the vulnerability itself enables file read access, attackers may use the obtained information to perform additional attacks such as privilege escalation, lateral movement, or remote system compromise.
Proof of Concept (Educational)
The following example illustrates how a vulnerable implementation could be exploited.
from nltk.corpus.reader import WordListCorpusReader
corpus = WordListCorpusReader("/app/corpora", "../../../../etc/passwd")
data = corpus.words()
print(data)
If the application exposes the file name parameter to user input, the above payload may cause the server to read the /etc/passwd file.
This proof-of-concept is provided for educational and defensive testing purposes only.
Example Exploitation Payloads
Attackers may attempt various directory traversal patterns to bypass input filters.
Basic traversal payloads:
../../../../etc/passwd
../../../etc/shadow
../../../../home/app/.env
URL encoded payloads:
..%2F..%2F..%2F..%2Fetc%2Fpasswd
Double encoded traversal:
..%252F..%252F..%252Fetc%252Fpasswd
Windows based traversal:
..\..\..\..\Windows\System32\drivers\etc\hosts
These payloads attempt to move outside the application directory to access sensitive files.
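Because the encoded variants above all reduce to the same ".." segments, a filter that inspects only the raw string is easy to bypass. The sketch below shows one possible normalization step: decode percent-encoding repeatedly, unify separators, then check for a ".." path segment. The function names and the three-round decoding limit are assumptions for illustration, not part of NLTK or any framework.

```python
from urllib.parse import unquote

def normalize_candidate(value, max_rounds=3):
    """Percent-decode repeatedly (to catch double encoding) and convert
    backslashes to forward slashes so one check covers every variant."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value.replace("\\", "/")

def looks_like_traversal(value):
    """Return True if any path segment is exactly '..' after normalization."""
    return ".." in normalize_candidate(value).split("/")

payloads = [
    "../../../../etc/passwd",
    "..%2F..%2F..%2F..%2Fetc%2Fpasswd",
    "..%252F..%252F..%252Fetc%252Fpasswd",
    "..\\..\\..\\..\\Windows\\System32\\drivers\\etc\\hosts",
]
for payload in payloads:
    print(payload, looks_like_traversal(payload))
```

Checking whole segments rather than the substring ".." avoids rejecting harmless names such as my..file.txt. A check like this is best treated as a first filter in front of a canonical-path containment check, not a replacement for it.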
MITRE ATT&CK Mapping
Initial Access
T1190 – Exploit Public-Facing Application
Discovery
T1083 – File and Directory Discovery
Credential Access
T1552 – Unsecured Credentials
Collection
T1005 – Data from Local System
Exfiltration
T1041 – Exfiltration Over Command and Control Channel
Indicators of Exploitation
Security teams should monitor for the following indicators:
- Repeated usage of ../ or ..\ patterns in requests
- Access attempts to sensitive files such as /etc/passwd
- Requests referencing .env, .ssh, or configuration files
- File access outside expected application directories
Example suspicious request:
GET /api/corpus?file=../../../../etc/passwd
Detection
Detection should focus on identifying suspicious path patterns and abnormal file access behavior.
Monitoring should be implemented at multiple layers including application logs, system logs, and web server logs.
Log Sources
Effective detection may rely on the following log sources:
Application logs
Python runtime logs
API request logs
Web server access logs
File integrity monitoring logs
Linux audit logs (auditd)
Container runtime logs
Kubernetes pod logs
Cloud workload logs
Detection Rules
Splunk
index=web OR index=app
| search "../" OR "..\\" OR "%2e%2e%2f" OR "%2e%2e%5c"
| stats count by src_ip, uri, user_agent
| sort -count

index=web_logs
| search "etc/passwd" OR ".env" OR ".ssh"
| stats count by src_ip, uri
Elastic (KQL)
url.path : "*../*" OR url.path : "*..\\*"
http.request.body.content : "*../*" OR http.request.body.content : "*%2e%2e%2f*"
Microsoft Sentinel (KQL)
CommonSecurityLog
| where RequestURL contains "../"
or RequestURL contains "..\\"
or RequestURL contains "%2e%2e%2f"
| summarize count() by SourceIP, RequestURL
Wazuh
rule.id: directory_traversal
and
(full_log contains "../" or full_log contains "..\\")
Suricata IDS Rule
alert http any any -> any any (msg:"Possible Path Traversal Attempt"; content:"../"; http_uri; sid:900001; rev:1;)
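Where no SIEM is available, the logic of the queries above can be approximated with a short script. The sketch below counts traversal-looking requests per source IP. It assumes each log line begins with the client IP followed by a space, as in the common access log format; the sample lines are made up for illustration and use documentation-range addresses.

```python
import re
from collections import Counter

# Matches ../ and ..\ in plain, percent-encoded, and double-encoded form.
TRAVERSAL_RE = re.compile(
    r"(\.|%(25)?2e){2}(/|\\|%(25)?2f|%(25)?5c)", re.IGNORECASE
)

def count_traversal_hits(log_lines):
    """Count suspicious requests per source IP (first field of each line)."""
    hits = Counter()
    for line in log_lines:
        if TRAVERSAL_RE.search(line):
            hits[line.split(" ", 1)[0]] += 1
    return hits

# Illustrative sample lines, not real log data.
sample = [
    '203.0.113.5 - - "GET /api/corpus?file=../../../../etc/passwd HTTP/1.1" 200',
    '203.0.113.5 - - "GET /api/corpus?file=..%2F..%2Fetc%2Fshadow HTTP/1.1" 200',
    '198.51.100.7 - - "GET /api/corpus?file=words.txt HTTP/1.1" 200',
]
print(count_traversal_hits(sample).most_common())
```

Like the SIEM rules, this pattern match will produce false positives on legitimate relative URLs, so hits are better used for triage than for automatic blocking.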
Mitigation
Immediate mitigation should include strict validation of file paths. Applications should also stop passing user input directly to corpus file loading operations.
Recommended defensive practices include:
- Enforcing allow-listed directories for file access
- Rejecting paths containing traversal sequences
- Canonicalizing file paths before file operations
- Running applications with least-privilege permissions
- Isolating NLP processing environments in containers
Secure path validation example:
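import os
allowed_root = os.path.realpath(allowed_directory)
real_path = os.path.realpath(os.path.join(allowed_root, user_input))
# commonpath compares whole path components, so "/app/corpora_evil" cannot pass as "/app/corpora"
if os.path.commonpath([allowed_root, real_path]) != allowed_root:
    raise ValueError("Invalid file path")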
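The canonicalization and allow-listing practices listed above can be wrapped in a single reusable helper. The following is a minimal sketch using pathlib; resolve_within is a hypothetical name and is not part of NLTK. It resolves the candidate path first and then confirms, component by component, that the result still sits under the corpus root.

```python
from pathlib import Path
import tempfile

def resolve_within(root, user_path):
    """Return the resolved path if it stays inside root, else raise.
    Hypothetical helper; not part of NLTK."""
    root = Path(root).resolve()
    candidate = (root / user_path).resolve()
    # relative_to raises ValueError when candidate is not under root.
    # It compares path components, so "<root>_backup" is rejected too.
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes corpus root: {user_path!r}") from None
    return candidate

root = tempfile.mkdtemp()
print(resolve_within(root, "words.txt"))
try:
    resolve_within(root, "../../etc/passwd")
except ValueError as exc:
    print("rejected:", exc)
```

Only the path returned by the helper should be handed to the corpus reader or to open(). Absolute inputs such as /etc/passwd are rejected as well, because joining an absolute path discards the root before the containment check runs.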
Remediation
The recommended remediation is to upgrade the NLTK library to the latest version once the official fix becomes available.
Security teams should track the upstream repository for updates and apply patches as soon as they are released.
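Until a fixed release is installed, a deployment or CI check can flag environments that still run an affected version. The sketch below relies only on the range stated in this advisory (up to and including 3.9.2). The function names are illustrative, and the version parsing is deliberately simplified: pre-release and local suffixes are ignored.

```python
import re

LAST_AFFECTED = (3, 9, 2)  # from this advisory: "up to and including 3.9.2"

def parse_version(text):
    """Extract the leading numeric segments, e.g. '3.9.2' -> (3, 9, 2)."""
    match = re.match(r"\d+(\.\d+)*", text.strip())
    if not match:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))

def is_affected(version_text):
    """True if the version is at or below the last affected release."""
    parts = parse_version(version_text)
    # Pad short versions so that '3.9' compares as (3, 9, 0).
    padded = parts + (0,) * (len(LAST_AFFECTED) - len(parts))
    return padded <= LAST_AFFECTED

for candidate in ("3.8.1", "3.9.2", "3.9.3"):
    print(candidate, is_affected(candidate))
```

In a live environment the installed version string can be obtained with importlib.metadata.version("nltk") and passed to is_affected.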
Official Upgrade / Patch Information
NLTK Project Security Updates
https://www.nltk.org/news.html
Security Recommendations
Organizations using NLTK in production systems should implement the following controls:
- Restrict user-controlled file paths
- Validate all filesystem operations
- Monitor logs for traversal patterns
- Apply dependency security scanning
- Keep NLP libraries updated
Applications providing NLP services through public APIs should be reviewed carefully, as they are more likely to expose this vulnerability.
