Skip to content

Security: Hardcoded Tika Server URL without Authentication#2275

Open
tomaioo wants to merge 1 commit into
NVIDIA:mainfrom
tomaioo:fix/security/hardcoded-tika-server-url-without-authen
Open

Security: Hardcoded Tika Server URL without Authentication#2275
tomaioo wants to merge 1 commit into
NVIDIA:mainfrom
tomaioo:fix/security/hardcoded-tika-server-url-without-authen

Conversation

@tomaioo

@tomaioo tomaioo commented Jun 26, 2026

Copy link
Copy Markdown

Summary

Security: Hardcoded Tika Server URL without Authentication

Problem

Severity: Medium | File: nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py:L48

The tika_extractor function in tika.py sends PDF content to a hardcoded internal URL (http://tika:9998/tika) without any authentication mechanism. This could allow unauthorized access to the Tika server if the network is compromised, and the lack of TLS means data is transmitted in plaintext. Additionally, the requests.put call does not verify SSL certificates (though not applicable here due to HTTP), and there is no timeout validation or retry logic for network failures beyond the basic timeout.

Solution

Make the Tika URL configurable via environment variable or configuration file, add authentication headers if required, and consider using HTTPS with proper certificate verification. Validate the URL scheme to prevent SSRF if user input is ever involved.

Changes

  • nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py (modified)

The `tika_extractor` function in `tika.py` sends PDF content to a hardcoded internal URL (`http://tika:9998/tika`) without any authentication mechanism. This could allow unauthorized access to the Tika server if the network is compromised, and the lack of TLS means data is transmitted in plaintext. Additionally, the `requests.put` call does not verify SSL certificates (though not applicable here due to HTTP), and there is no timeout validation or retry logic for network failures beyond the basic timeout.

Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>
@tomaioo tomaioo requested review from a team as code owners June 26, 2026 18:24
@tomaioo tomaioo requested a review from charlesbluca June 26, 2026 18:24
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR makes the hardcoded Tika server URL configurable by replacing the bare string literal with os.environ.get("TIKA_URL", "http://tika:9998/tika"), addressing a legitimate configuration-as-code concern for operators who deploy the Tika sidecar at a non-default address.

  • The URL is still resolved once at module import time (a module-level constant), so changes to TIKA_URL after import — common in test suites — are silently ignored; moving the lookup inside the function or switching to a lazy accessor would fix this.
  • No URL scheme validation is applied to the env var value, meaning a misconfigured file:// or other unexpected scheme would reach requests.put without a clear error.
  • No unit tests are added for the new configurable code path, which the project's test-coverage-new-code rule requires.

Confidence Score: 4/5

The change is small and moves in the right direction — externalising a previously hardcoded URL — but leaves a subtle import-time evaluation trap and adds no test coverage.

The one-line change is low-risk in production where env vars are stable before the process starts, but the module-level evaluation means any test that sets TIKA_URL after import will silently use the stale default, making the new code path effectively untestable without a workaround. No scheme validation and no tests compound this.

nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py — the only changed file; the import-time evaluation and missing tests warrant a second look before merge.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py Replaces hardcoded TIKA_URL with os.environ.get("TIKA_URL", ...) at module level; URL is still evaluated once at import time, no scheme validation is added, and no tests cover the new configurable path.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Env as Environment / Config
    participant Module as tika.py (import)
    participant Caller as tika_extractor()
    participant Tika as Tika Server

    Env->>Module: os.environ.get("TIKA_URL", default) evaluated once at import
    Note over Module: TIKA_URL frozen as module-level constant

    Caller->>Tika: "requests.put(TIKA_URL, data=pdf_stream, timeout=120)"
    Tika-->>Caller: 200 OK + extracted text (str)
    Note over Caller: Returns response.text (str), not pd.DataFrame as annotated
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Env as Environment / Config
    participant Module as tika.py (import)
    participant Caller as tika_extractor()
    participant Tika as Tika Server

    Env->>Module: os.environ.get("TIKA_URL", default) evaluated once at import
    Note over Module: TIKA_URL frozen as module-level constant

    Caller->>Tika: "requests.put(TIKA_URL, data=pdf_stream, timeout=120)"
    Tika-->>Caller: 200 OK + extracted text (str)
    Note over Caller: Returns response.text (str), not pd.DataFrame as annotated
Loading

Comments Outside Diff (1)

  1. nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py, line 95 (link)

    P2 No URL scheme validation on env var value
    TIKA_URL now accepts arbitrary values from the environment, but there is no check that the provided value is an http:// or https:// URL before it is used in the requests.put call. An operator misconfiguration supplying a file:// or other unexpected scheme would result in an unintended request or a confusing requests error rather than an actionable validation failure at startup. Per the project's "Input Validation at Boundaries" standard, URLs sourced from external configuration should be validated before use.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py
    Line: 95
    
    Comment:
    **No URL scheme validation on env var value**
    `TIKA_URL` now accepts arbitrary values from the environment, but there is no check that the provided value is an `http://` or `https://` URL before it is used in the `requests.put` call. An operator misconfiguration supplying a `file://` or other unexpected scheme would result in an unintended request or a confusing `requests` error rather than an actionable validation failure at startup. Per the project's "Input Validation at Boundaries" standard, URLs sourced from external configuration should be validated before use.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py:25-28
`TIKA_URL` is resolved once at module import time. If `TIKA_URL` is set in the environment *after* the module is imported (a common pattern in test suites or dynamic container environments), the change will be silently ignored and the stale value will be used for every subsequent call. Moving the lookup inside the function (or using `functools.lru_cache` for efficiency) makes the value reflect the environment at the time of each call.

```suggestion
_DEFAULT_TIKA_URL = "http://tika:9998/tika"

def tika_extractor(
```

### Issue 2 of 3
nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py:95
**No URL scheme validation on env var value**
`TIKA_URL` now accepts arbitrary values from the environment, but there is no check that the provided value is an `http://` or `https://` URL before it is used in the `requests.put` call. An operator misconfiguration supplying a `file://` or other unexpected scheme would result in an unintended request or a confusing `requests` error rather than an actionable validation failure at startup. Per the project's "Input Validation at Boundaries" standard, URLs sourced from external configuration should be validated before use.

### Issue 3 of 3
nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py:25
**No test coverage for configurable URL behaviour**
The `test-coverage-new-code` project rule requires that new or modified business logic includes corresponding unit tests. This change introduces a runtime-configurable code path (env var override of `TIKA_URL`) with no accompanying tests. A test verifying that `os.environ["TIKA_URL"]` is picked up correctly (and that the default is used when unset) would guard against regressions where the lookup is accidentally removed or overridden.

Reviews (1): Last reviewed commit: "fix(security): hardcoded tika server url..." | Re-trigger Greptile

Comment on lines +25 to 28
TIKA_URL = os.environ.get("TIKA_URL", "http://tika:9998/tika")


def tika_extractor(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TIKA_URL is resolved once at module import time. If TIKA_URL is set in the environment after the module is imported (a common pattern in test suites or dynamic container environments), the change will be silently ignored and the stale value will be used for every subsequent call. Moving the lookup inside the function (or using functools.lru_cache for efficiency) makes the value reflect the environment at the time of each call.

Suggested change
TIKA_URL = os.environ.get("TIKA_URL", "http://tika:9998/tika")
def tika_extractor(
_DEFAULT_TIKA_URL = "http://tika:9998/tika"
def tika_extractor(
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py
Line: 25-28

Comment:
`TIKA_URL` is resolved once at module import time. If `TIKA_URL` is set in the environment *after* the module is imported (a common pattern in test suites or dynamic container environments), the change will be silently ignored and the stale value will be used for every subsequent call. Moving the lookup inside the function (or using `functools.lru_cache` for efficiency) makes the value reflect the environment at the time of each call.

```suggestion
_DEFAULT_TIKA_URL = "http://tika:9998/tika"

def tika_extractor(
```

How can I resolve this? If you propose a fix, please make it concise.

import requests

TIKA_URL = "http://tika:9998/tika"
TIKA_URL = os.environ.get("TIKA_URL", "http://tika:9998/tika")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No test coverage for configurable URL behaviour
The test-coverage-new-code project rule requires that new or modified business logic includes corresponding unit tests. This change introduces a runtime-configurable code path (env var override of TIKA_URL) with no accompanying tests. A test verifying that os.environ["TIKA_URL"] is picked up correctly (and that the default is used when unset) would guard against regressions where the lookup is accidentally removed or overridden.

Rule Used: New functionality must include corresponding unit ... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/common/api/internal/extract/pdf/engines/tika.py
Line: 25

Comment:
**No test coverage for configurable URL behaviour**
The `test-coverage-new-code` project rule requires that new or modified business logic includes corresponding unit tests. This change introduces a runtime-configurable code path (env var override of `TIKA_URL`) with no accompanying tests. A test verifying that `os.environ["TIKA_URL"]` is picked up correctly (and that the default is used when unset) would guard against regressions where the lookup is accidentally removed or overridden.

**Rule Used:** New functionality must include corresponding unit ... ([source](.greptile))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant