FlyteFile Creation Issue with Azure Blob

Summary

The user is looking for a solution to create a FlyteFile using an Azure Storage account blob SAS URL, which Flyte misinterprets as a directory. They request workarounds, clarification on feature availability, and examples, mentioning a specific pull request. The user suspects the issue stems from two bugs in Flytekit (fsspec) rather than a lack of support. They suggest involving Yee, who has a local solution but is cautious about its broader implications. The user believes that using workload identity and setting remoteData to signedUrls:false might help resolve the issue. They note they are not in production and have a limited use case but have successfully implemented workload identity with Azure blob storage for common Flyte workflows.

Status: resolved
Source: #flyte-on-azure

      srale

      9/24/2024

      Thanks for the info :) I will take a look into it and get back to you if this doesn't work for our use case

      chris.grass

      9/24/2024

      <@U06V6CQTKL6> the long SAS has been an open issue for a long time because of some library complexities. have you tried <https://azure.github.io/azure-workload-identity/docs/introduction.html|workload identity>? feel free to reach out if you have any questions about implementation

      srale

      9/17/2024

      Hi all <#C05315T4K5K|> :slightly_smiling_face: We want to use an Azure Storage account blob SAS URL to create a FlyteFile. The problem is that FlyteFile maps the whole file path plus the SAS query string in the URL as the file name. This means that Flyte sees the URL as a directory and not a file. Is there a workaround for this, or is this feature missing? Thank you in advance

      kumare

      9/18/2024

      <@UNR3C6Y4T>

      chris.grass

      9/18/2024

      we aren't running in prod and have a limited use case, but we have workload identity + azure blob store working for common flyte workflows

      chris.grass

      9/18/2024

      using workload identity and "remoteData is configured to set signedUrls:false" should be enough to bypass the issue

      david.espejo

      9/18/2024

      right <@U05QG8SE2LA> Is this also a limitation even if <@U06V6CQTKL6> used Workload Identity instead of SAS tokens for storage account access?

      chris.grass

      9/18/2024

      We might want to pull Yee into the conversation since he was looking at the python fixes. iirc, he had a local solution but was concerned about its implications for other use cases

      chris.grass

      9/18/2024

      as mentioned in the flyte golang pr, i don't think the lack of support for that endpoint is the fundamental problem here. i think the two flytekit (fsspec) bugs are actually to blame for the behavior <@U06V6CQTKL6> is seeing. https://github.com/flyteorg/flyte/issues/4701 https://github.com/flyteorg/flyte/issues/4700

      chris.grass

      9/18/2024

      sorry, i have been out of the flyte ecosystem for a while now. expected to get back in this week or next, so this is good timing. give me a little while to catch up though

      david.espejo

      9/18/2024

      Not sure, for now I'm deferring to <@U05QG8SE2LA> to validate what's missing to merge

      srale

      9/18/2024

      Yes, it would :slightly_smiling_face: When do you think this could be available?

      david.espejo

      9/17/2024

      <@U06V6CQTKL6> would <https://github.com/flyteorg/flyte/pull/4629|this PR> cover what you intend to do?

      srale

      9/17/2024

      It seems that it interprets the query from '/tmp/flytecdmj730u/local_flytekit/ec7c7207da04cd009680ed0636b3277e/a.txt?sp=r&st=2024-09-17T14:10:24Z&se=2024-11-01T23:10:24Z&spr=https&sv=2022-11-02&sr=b&sig=A2WINhWtCSfJNJ8sdqodQJrxKoNjz%2FGfmHUln1VlDf4%3D', so just ?sp=r&st=2024-09-17T14:10:24Z&se=2024-11-01T23:10:24Z&spr=https&sv=2022-11-02&sr=b&sig=A2WINhWtCSfJNJ8sdqodQJrxKoNjz%2FGfmHUln1VlDf4%3D as part of the file name (perhaps because it doesn't end with the file extension) and flyte assumes it's a directory
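To illustrate the behavior described in this message, here is a minimal Python sketch using only the standard library. It is not flytekit's actual code: the account and container names and the `naive_name` and `blob_name` helpers are hypothetical, and the signature is a placeholder. It shows how taking the last segment of the raw URL keeps the SAS query string in the file name, while splitting the URL first recovers the blob name and its extension.

```python
import posixpath
from urllib.parse import urlsplit


def naive_name(url: str) -> str:
    """Last '/'-separated segment of the raw URL, query string included."""
    return posixpath.basename(url)


def blob_name(url: str) -> str:
    """Last segment of the URL path only; the SAS query string is ignored."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical SAS URL with a placeholder signature.
url = (
    "https://myaccount.blob.core.windows.net/mycontainer/a.txt"
    "?sp=r&sr=b&sig=PLACEHOLDER%2Fabc%3D"
)

# The naive name no longer ends in ".txt", so extension checks fail.
print(naive_name(url))
print(posixpath.splitext(naive_name(url))[1])

# Parsing the URL first gives the expected name and extension.
print(blob_name(url))
print(posixpath.splitext(blob_name(url))[1])
```

Whether flytekit or fsspec takes a path like the naive one internally is not confirmed here; the sketch only demonstrates why a name that ends in a query string no longer looks like a file with an extension.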

      kumare

      9/17/2024

      Now I get it - yes, if you are using http (signed url) it will use the http protocol, and that cannot download directories, only files. But weirdly the path looks like a dir. It seems we should default to assuming a file and proceeding

      srale

      9/17/2024

      Running a simple workflow like this:

      from flytekit import task, workflow, Resources
      from flytekit.types.file import FlyteFile
      import os

      @task(requests=Resources(cpu="1", mem="1Gi"), limits=Resources(cpu="2", mem="1Gi"))
      def normal_task(sas: str) -> str:
          new_sas = FlyteFile.from_source(sas)
          with open(new_sas, "r") as f:
              text = f.read()
          return text

      @workflow
      def wf(sas: str) -> str:
          normal_output = normal_task(sas=sas)
          return normal_output
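One possible workaround for the task in this message, sketched under assumptions and not verified against Flyte: download the blob over HTTPS inside the task and derive the local file name from the URL path only. `download_sas_blob` and its `opener` parameter are hypothetical names introduced for illustration; only standard-library calls are used.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def download_sas_blob(sas_url, dest_dir, opener=urllib.request.urlopen):
    """Download the blob behind a SAS URL into dest_dir and return the local Path.

    `opener` is any callable that takes a URL and returns a readable,
    context-managed binary stream. It defaults to urllib.request.urlopen.
    """
    # Name the local file after the URL path only, never the query string.
    name = PurePosixPath(unquote(urlsplit(sas_url).path)).name
    if not name:
        raise ValueError("SAS URL has no blob name in its path")
    local_path = Path(dest_dir) / name
    # The full URL, query string included, is still what authorizes the request.
    with opener(sas_url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path
```

Inside a task, the returned path could be opened directly, or wrapped as `FlyteFile(path=str(local_path))` if a FlyteFile is needed downstream. This sidesteps the URL handling rather than fixing it, so it is a stopgap at best.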
      
      kumare

      9/17/2024

      We would need examples and pointers - I do not follow

      kumare

      9/17/2024

      Sorry what is sas?