Skip to content

Computer vision initial services#389

Open
javiermtorres wants to merge 19 commits into
mainfrom
378-initial-computer-vision
Open

Computer vision initial services#389
javiermtorres wants to merge 19 commits into
mainfrom
378-initial-computer-vision

Conversation

@javiermtorres
Copy link
Copy Markdown
Contributor

Preliminary implementation for computer vision tasks (object detection, image segmentation and image classification).

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@besaleli besaleli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts!!

}

message ImageInput {
bytes image = 1;
Copy link
Copy Markdown
Member

@besaleli besaleli Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: having 3 different ImageInput feels like a code smell. How would you feel about either/any combination of:

  • just throwing the images directly in ImageClassificationRequest and doing something like repeated bytes inputs = 1;. This would give us parity with repeated string inputs = 1; in text services
  • Putting it in a separate proto or reusing it
  • Condensing all of the protos into one file (would this be so horrible... this may make it easier if anyone is working with Kafka)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now taking image classification as "gold standard" while I work at the impl, and I'll probably retrofit the other two categories from there. So I'll take these comments into consideration when I revisit the proto defs for them 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It just felt easier to transcribe the hf pipelines in separate files at the start)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(separate proto + imports ftw, imho)

}

message ObjectDetectionResponse {
repeated ImageBoundingBoxes boxes_batch = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would prefer this just be named boxes, but not a hill I need to die on

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that for each image, we have boxes, but then for a bunch of input images we have groups of boxes.... so I ran out of imagination :-/ Any suggestions here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to "box" and "boxes", although the grammatical number agreement gets a bit weird :-/

@javiermtorres javiermtorres requested a review from besaleli May 11, 2026 07:28
Comment thread encoderfile-runtime/src/main.rs Outdated

macro_rules! run_cli {
($model_type:ident, $cli:expr, $config:expr, $session:expr, $tokenizer:expr, $model_config:expr) => {{
($model_type:ident, $cli:expr, $config:expr, $session:expr, $input_state:expr, $task_state:expr) => {{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge in latest changes, we changed encoderfile-runtime/main.rs in the gpu update and model type dispatch is now happening elsewhere

Comment thread encoderfile/src/runtime/state.rs Outdated
pub tokenizer: TokenizerService,
pub model_config: ModelConfig,
pub per_model_input_state: T::InputState,
pub per_task_state: T::TaskState,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming here unclear. what does per_model and per_task mean? why can it just not be model_input_state and task_state?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the "per_" would convey the meaning that this is an input state / task state (a config really) that depends on what input / task are used in your model. But I guess we can get rid of the per_ thing if it causes confusion.
At some point the state and config concepts need to be separated, but not today :)

Comment thread encoderfile/src/common/model_type.rs Outdated
@@ -1,5 +1,9 @@
macro_rules! model_type {
[ $( $x:ident ),* $(,)? ] => {
pub trait ModelTypeSpec: Send + Sync + Clone + std::fmt::Debug + 'static {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we moving this inside the macro?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was a leftover from some previous restructuring. I'll take it back out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back out.

Comment thread encoderfile/src/dev_utils/mod.rs Outdated
use std::{fs::File, io::BufReader};

const EMBEDDING_DIR: &str = "../models/embedding";
// CHECK sentence embedding????
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embedding and sentence embedding can use the same model

&[AssetKind::Transform]
}
}
impl AssetPolicySpec for crate::common::model_type::$model_type {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are moving all of the logic out of the trait, what is the point in having the trait? we might as well just make it a function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite legit.... I guess we could just have a fn required_assets(spec: impl InputType + TaskType). I doubt we could gain something by moving this to type space 😕

Also, the new InputType and TaskType traits (or vals in a fun) should be enough to compose the list of assets, instead of matching the tuples, right?


fn inference(&self, request: impl Into<Self::Input>) -> Result<Self::Output, ApiError> {
let request = request.into();
let rescale_factor = 0.00392156862745098 as f32;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make these preprocessing steps into lua bindings. with a Preprocess function that's extracted from the transform

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I wanted to make it work first, but totally agreed. Let's see if we can get something closer to what one would expect in an hf pipeline.
BTW I'm not considering b/w (1 channel) images, for example, right now.


let tensor = Tensor(data.into_dyn());

let result = func
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slay

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revisit this, I think I just copied whatever it was already there without much consideration 😂
I guess we can do the same for text logits in this case, but it will get more complicated for object detection and image segmentation. But shape checks will fix everything 😉

@@ -0,0 +1,316 @@
# Multipart OpenAPI Service Example
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angpt looping you in here. we should add this into the docs when we release new version

Comment thread encoderfile-runtime/src/main.rs Outdated
Comment thread encoderfile/src/services/image_classification.rs Outdated
@javiermtorres javiermtorres force-pushed the 378-initial-computer-vision branch from 46c597c to 3065642 Compare May 18, 2026 10:11
@javiermtorres javiermtorres requested a review from besaleli May 22, 2026 13:45
@@ -0,0 +1,21 @@
syntax = "proto3";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one day we're gonna have to merge all of these so they're actually usable by someone streaming data into kafka for example 😩

Comment thread encoderfile/src/builder/image_preprocessor.rs Outdated
}

pub const DEFAULT_VERSION: &str = "0.1.0";
pub const DEFAULT_VERSION: &str = "0.2.0";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're gonna change this, we need backwards compatibility logic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum. The runtime won't need it because the payload is bundled with it anyway, but it matters for the builder. However, both are going to be present for every version, so one just needs to get the builder from the same tag as the runtime. That's how we can get away with backwards compatibility, but maybe it's not too user friendly :-/

.unwrap();

// TODO make parallel???
// TODO _maybe
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish these comments were less cryptic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, erm, totally right 😅😅😅😅😅
Ok, so, in the case of text processing, the preprocessing is not even comparable to inference, and we can do both sequentially. However, for image processing, we could think about overlapping image preprocessing and image inference, choosing a proper batch number instead of just putting everything in one batch. I'd still assume inference is going to be larger anyway, so the savings may not be too big (specially if everything is done via CPU, haw); otoh, this helps introducing queues for some streaming mode of operation later on.

I hope this makes it clearer. I don't think we need to start implementing any of this now, though, but I'll make like a less cryptic comment here 🤣🙈

Comment thread encoderfile/src/services/image_classification.rs Outdated
ImageClassificationTransform::new(DEFAULT_LIBS.to_vec(), Some(postprocess_code))
.expect("Failed to create engine");

let num_channels = self.model_input_state.config.num_channels as usize;
Copy link
Copy Markdown
Member

@besaleli besaleli Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly, we are now making users create an entirely separate config to provide values that we're deciding they need for postprocessing. I was hoping that users could define these in the lua transform and have more control over the exact steps they want to use. This is the point of having the lua transform—so that users don't get config fatigue

Seeing now that preprocessor_config.json is an asset for hf image classification models. Might be helpful to include a comment. Have we checked different models to see if the schema gets crazy? They don't like to give us authoritation JSON schemas :')

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry!!! Yes, apparently preprocessor_config.json is a thing in image models. Yes, as expected, it is largely undocumented (I'm just browsing the pipeline code...) and sometimes it overlaps with config.json. Ha. Hahaha.
I have indeed checked a couple of models or three. The entries I use here seem consistently used anywhere else with the same format (and we'll deal with overlapping entries later on), but there are some (like channels) that would have a greater impact in our code if we really want to handle them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me that we need to instruct the user to copy yet another config file, or maybe make simple shell/PowerShell scripts 🤔

-F "files=@/path/to/image2.jpg"
```

### Request Body (multipart/form-data)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha. Wait until it really works 😋 But yes, multipart seemed ok here (and maybe for stdin?)


impl Commands {
pub async fn execute<'a, R: Read + Seek>(
pub async fn execute<'loader, R: Read + Seek>(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this rename

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might change a couple of other lifetime refs so I really understand what they are referring to....

}

#[test]
fn test_image_classification_model() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placeholder, sorry. I wanted to ask for a review before working on integration level tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants