Summary
The user is setting up integration with Ray and successfully creating a new RayCluster but faces issues connecting to an existing cluster due to the worker_node_config
argument, which leads to the creation of a new cluster instead. They are contemplating submitting an issue through the Ray Python SDK, and another user offers assistance if an issue is filed. The user finds it illogical that there is no handling for an empty worker_node_config
list to prevent the creation of a new cluster.
geert.pingen
Added it to an issue https://github.com/flyteorg/flyte/issues/5877|here, thanks for the fast response <@U072ZEKG7V0> <@U04H6UUE78B>!
geert.pingen
Morning :wave: sure thing
david.espejo
this is odd, I don't seem to find logic to handle an empty worker_node_config
list and avoiding creating a new cluster.
<@U05E3N35EEL>
Please report this on an issue to track it
[flyte-bug}
sovietaced
Interesting. I’m using Ray myself and will likely need to fix some issues with the plugin so I might be able to take a look at this if you file an issue.
geert.pingen
I guess we can just submit through the Ray Python SDK directly
geert.pingen
It looks like the worker_node_config
has been required since the https://github.com/flyteorg/flytekit/blame/v1.9.1/plugins/flytekit-ray/flytekitplugins/ray/task.py#L33|initial commit? Not sure how the docs example has ever worked.
geert.pingen
Hi :wave: I’m setting up the integration with Ray, and it seems to work nicely when creating a fresh RayCluster (using @task(task_config=RayJobConfig(worker_node_config=[WorkerNodeConfig(…)]))
). I can see the cluster starting, the job getting scheduled and distributed, and completing successfully.
I’m having trouble with using an existing RayCluster (in the same cluster) though. What is the correct approach for that?
From the docs https://docs.flyte.org/en/latest/flytesnacks/examples/ray_plugin/index.html#submit-a-ray-job-to-existing-cluster|here I read that I should be able to use @task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>"))
.
However when trying that it seems worker_node_config
is a required argument. I tried using an empty list instead:
container_image=...,
task_config=RayJobConfig(
worker_node_config=[], # No need to create a Ray cluster but argument is required, maybe just setting to empty list helps?
address="kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265",
runtime_env=...
),
)```
But then it tries to start a new RayCluster instead of using the existing one found at `address`:
```❯ k get <http://rayclusters.ray.io|rayclusters.ray.io> -A
NAMESPACE NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
<flyte-project-<flyte-domain> ahvfr924w8k2vgvf97wp-n0-0-raycluster-crb9z 100m 500Mi 0 ready 2m25s
kuberay kuberay-cluster 1 1 2 3G 0 ready 3h37m
...```