Security: Hardcoded Tika Server URL without Authentication #2275
The `tika_extractor` function in `tika.py` sends PDF content to a hardcoded internal URL (`http://tika:9998/tika`) without any authentication mechanism. This could allow unauthorized access to the Tika server if the network is compromised, and the lack of TLS means data is transmitted in plaintext. Additionally, the `requests.put` call does not verify SSL certificates (though not applicable here due to HTTP), and there is no timeout validation or retry logic for network failures beyond the basic timeout.
Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>
Greptile Summary
This PR makes the hardcoded Tika server URL configurable by replacing the bare string literal with
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py | Replaces hardcoded TIKA_URL with os.environ.get("TIKA_URL", ...) at module level; URL is still evaluated once at import time, no scheme validation is added, and no tests cover the new configurable path. |
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Env as Environment / Config
participant Module as tika.py (import)
participant Caller as tika_extractor()
participant Tika as Tika Server
Env->>Module: os.environ.get("TIKA_URL", default) evaluated once at import
Note over Module: TIKA_URL frozen as module-level constant
Caller->>Tika: "requests.put(TIKA_URL, data=pdf_stream, timeout=120)"
Tika-->>Caller: 200 OK + extracted text (str)
Note over Caller: Returns response.text (str), not pd.DataFrame as annotated
Comments Outside Diff (1)
- nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py, line 95
No URL scheme validation on env var value
`TIKA_URL` now accepts arbitrary values from the environment, but there is no check that the provided value is an `http://` or `https://` URL before it is used in the `requests.put` call. An operator misconfiguration supplying a `file://` or other unexpected scheme would result in an unintended request or a confusing `requests` error rather than an actionable validation failure at startup. Per the project's "Input Validation at Boundaries" standard, URLs sourced from external configuration should be validated before use.
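One way the scheme check described in this comment might look is sketched below. The helper name `validate_tika_url` and the allowed-scheme set are assumptions made for illustration and are not part of the PR; the sketch only shows the idea of failing early with an actionable message.

```python
from urllib.parse import urlparse

# Schemes accepted for the Tika endpoint (assumption: only plain HTTP(S) is wanted).
_ALLOWED_SCHEMES = {"http", "https"}


def validate_tika_url(url: str) -> str:
    """Return url unchanged if it is an http(s) URL with a host, else raise ValueError.

    Hypothetical helper; the name does not appear in tika.py.
    """
    parsed = urlparse(url)
    if parsed.scheme not in _ALLOWED_SCHEMES:
        raise ValueError(
            f"TIKA_URL must use http or https, got scheme {parsed.scheme!r}"
        )
    if not parsed.netloc:
        raise ValueError("TIKA_URL must include a host")
    return url
```

Calling such a helper once where the URL is read would surface a misconfigured `file://` value as a `ValueError` rather than as a late `requests` error.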
Reviews (1): Last reviewed commit: "fix(security): hardcoded tika server url..."
| TIKA_URL = os.environ.get("TIKA_URL", "http://tika:9998/tika") |
| def tika_extractor( |
`TIKA_URL` is resolved once at module import time. If `TIKA_URL` is set in the environment *after* the module is imported (a common pattern in test suites or dynamic container environments), the change will be silently ignored and the stale value will be used for every subsequent call. Moving the lookup inside the function (or using `functools.lru_cache` for efficiency) makes the value reflect the environment at the time of each call.
Suggested change:
| Before | After |
|---|---|
| `TIKA_URL = os.environ.get("TIKA_URL", "http://tika:9998/tika")` | `_DEFAULT_TIKA_URL = "http://tika:9998/tika"` |
| `def tika_extractor(` | `def tika_extractor(` |
| import requests |
| - TIKA_URL = "http://tika:9998/tika" |
| + TIKA_URL = os.environ.get("TIKA_URL", "http://tika:9998/tika") |
No test coverage for configurable URL behaviour
The `test-coverage-new-code` project rule requires that new or modified business logic includes corresponding unit tests. This change introduces a runtime-configurable code path (env var override of `TIKA_URL`) with no accompanying tests. A test verifying that `os.environ["TIKA_URL"]` is picked up correctly (and that the default is used when unset) would guard against regressions where the lookup is accidentally removed or overridden.
Rule Used: New functionality must include corresponding unit ... (source)
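The two cases named in this comment (override picked up, default used when unset) could be covered roughly as follows. This is a sketch only: `get_tika_url` is a hypothetical stand-in defined inline so the example runs on its own, whereas a real test would import the lookup from `tika.py`.

```python
import os
import unittest
from unittest import mock

_DEFAULT_TIKA_URL = "http://tika:9998/tika"


def get_tika_url() -> str:
    """Hypothetical stand-in for the call-time lookup under test."""
    return os.environ.get("TIKA_URL", _DEFAULT_TIKA_URL)


class TikaUrlConfigTest(unittest.TestCase):
    def test_env_override_is_used(self):
        # patch.dict restores os.environ when the block exits.
        with mock.patch.dict(os.environ, {"TIKA_URL": "http://localhost:9998/tika"}):
            self.assertEqual(get_tika_url(), "http://localhost:9998/tika")

    def test_default_used_when_unset(self):
        # clear=True empties the environment for the duration of the block.
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(get_tika_url(), _DEFAULT_TIKA_URL)
```

If the module-level `TIKA_URL` constant were kept instead, a test would likely need to reload the module after patching the environment, which is part of why the call-time lookup is easier to test.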
Summary
Security: Hardcoded Tika Server URL without Authentication
Problem
Severity: Medium | File: nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py:L48
The `tika_extractor` function in `tika.py` sends PDF content to a hardcoded internal URL (`http://tika:9998/tika`) without any authentication mechanism. This could allow unauthorized access to the Tika server if the network is compromised, and the lack of TLS means data is transmitted in plaintext. Additionally, the `requests.put` call does not verify SSL certificates (though not applicable here due to HTTP), and there is no timeout validation or retry logic for network failures beyond the basic timeout.
Solution
Make the Tika URL configurable via environment variable or configuration file, add authentication headers if required, and consider using HTTPS with proper certificate verification. Validate the URL scheme to prevent SSRF if user input is ever involved.
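The authentication and certificate-verification parts of this proposal are not implemented in the PR. A rough sketch of how they might be wired in is shown below, assuming an authenticating proxy sits in front of Tika. The environment variable names `TIKA_AUTH_TOKEN` and `TIKA_CA_BUNDLE`, the bearer-token scheme, and the helper name are all assumptions made for illustration; only the 120-second timeout comes from the existing call.

```python
import os
from typing import Any, Dict, Mapping


def build_tika_request_kwargs(environ: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Assemble keyword arguments for the requests.put call (hypothetical helper)."""
    # Keep the existing 120 s timeout; verify=True is the requests default
    # and only takes effect for https:// URLs.
    kwargs: Dict[str, Any] = {"timeout": 120, "verify": True}

    token = environ.get("TIKA_AUTH_TOKEN")  # hypothetical variable name
    if token:
        kwargs["headers"] = {"Authorization": f"Bearer {token}"}

    ca_bundle = environ.get("TIKA_CA_BUNDLE")  # hypothetical variable name
    if ca_bundle:
        # requests accepts a path to a CA bundle as the verify argument.
        kwargs["verify"] = ca_bundle

    return kwargs


# Intended use inside the extractor (not executed here):
#   requests.put(url, data=pdf_stream, **build_tika_request_kwargs())
```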
Changes
nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py (modified)