Hi,
I have multiple models that I want to deploy.
I created two resource pools, one for CPU and one for GPU.
I deployed two models to the CPU resource pool, and both work well.
But when I try to create an endpoint and attach it to the GPU resource pool, the deployment fails.
I tried two different models, and neither works.
The same models do work if I deploy them with dedicated resources that include a GPU.
Here's the message I got by mail:
Error Messages: Model server exited unexpectedly.
When I click to create the endpoint, it keeps loading for a few minutes and then the error appears.
I found this error in Logs Explorer:
(1) NOT_FOUND: Error executing an HTTP request: HTTP response code 404 with body '<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: caip-tenant-fc9d9d0b-17f4-4284-9823-401faaf96ac0/5044324052747943936-processed/tfeieOptimizedModel/20230812093241/1/variables/variables.data-00000-of-00001</Details></Error>
when reading gs://caip-tenant-fc9d9d0b-17f4-4284-9823-401faaf96ac0/5044324052747943936-processed/tfeieOptimizedModel/20230812093241/1/variables/variables.data-00000-of-00001
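To make it easier to see which object the model server could not find, here is a minimal sketch that splits the `gs://` URI from the log line above into its bucket and object name. The helper name `split_gs_uri` is my own (hypothetical), and it assumes the log line contains a single `gs://` URI with no spaces in it.

```python
import re

# Hypothetical helper (not part of any Google Cloud library): extract the
# bucket and object name from the first gs:// URI found in a log line.
GS_URI = re.compile(r"gs://(?P<bucket>[^/\s]+)/(?P<name>\S+)")


def split_gs_uri(text):
    """Return (bucket, object_name) for the first gs:// URI in text."""
    match = GS_URI.search(text)
    if match is None:
        raise ValueError("no gs:// URI found in text")
    return match.group("bucket"), match.group("name")


# Log line copied from the error above.
log_line = (
    "when reading gs://caip-tenant-fc9d9d0b-17f4-4284-9823-401faaf96ac0/"
    "5044324052747943936-processed/tfeieOptimizedModel/20230812093241/1/"
    "variables/variables.data-00000-of-00001"
)

bucket, name = split_gs_uri(log_line)
print("bucket:", bucket)
print("object:", name)
```

The bucket name starts with `caip-tenant-`, so the missing object is not in one of my own buckets, and the object path contains `tfeieOptimizedModel`.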
I also tried importing the model without the "Tensorflow optimize runtime" option, and I got this error instead:
P_REQUIRES failed at xla_ops.cc:296 : UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?"