Flyte enables you to build & deploy data & ML pipelines, hassle-free. The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks. Explore and Join the Flyte Community!

Cache Utilization in Flyte Pipelines

Summary

The user is seeking advice on utilizing cache in a pipeline that runs locally and then in the cloud or on another machine. They note that Flyte uses diskcache with SQLite, which is not ideal for simultaneous access across multiple machines. The user suggests potential solutions, including placing the SQLite database on an NFS share, implementing PostgreSQL support for diskcache, or replacing diskcache with a caching solution that supports PostgreSQL or another database. They are looking for better approaches to address this issue.

Status: resolved
Source: #ask-the-community

      aleksei.grachev.tech

      10/21/2024

      Thank you, Haytham

      habuelfutuh

      10/18/2024

      If you really want to, the cache service has an API. You can replace diskcache with a version that uses gRPC to record results in the remote cache (the files will need to be pushed remotely, though), and then runs in the cloud will automatically leverage the cache.
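
A minimal sketch of the idea above, assuming a pluggable cache backend. Every name here (`CacheBackend`, `InMemoryRemoteCache`, `cache_key`, `run_cached`) is hypothetical and is not part of Flyte or flytekit. The in-memory class only stands in for a client that would make gRPC calls to the remote cache service, and the step of pushing output files to remote storage is left out.

```python
import hashlib
import json
import pickle
from typing import Any, Callable, Dict, Optional, Protocol


class CacheBackend(Protocol):
    """Hypothetical interface a diskcache replacement would need to satisfy."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, value: bytes) -> None: ...


class InMemoryRemoteCache:
    """Stand-in for a remote cache client; a real one would issue gRPC calls."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = value


def cache_key(task_name: str, cache_version: str, inputs: Dict[str, Any]) -> str:
    # Assumed key scheme: hash of task name, cache version, and JSON-encoded inputs.
    payload = json.dumps(
        {"task": task_name, "version": cache_version, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(
    backend: CacheBackend,
    task_name: str,
    cache_version: str,
    fn: Callable[..., Any],
    **inputs: Any,
) -> Any:
    key = cache_key(task_name, cache_version, inputs)
    hit = backend.get(key)
    if hit is not None:
        # Cache hit: reuse the recorded result, whichever machine wrote it.
        return pickle.loads(hit)
    result = fn(**inputs)
    backend.put(key, pickle.dumps(result))
    return result
```

Because both the local run and the cloud run would talk to the same backend, a result recorded locally becomes a hit on the other machine, provided the task name, cache version, and inputs match.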

      aleksei.grachev.tech

      10/18/2024

      Hello, everyone. I'm curious if anyone has encountered a similar use case:

      1. I run a pipeline locally, which fills the cache.
      2. I then run the same pipeline in the cloud or on another machine, expecting it to utilize the cache from step 1.

      From what I understand, Flyte uses diskcache, which relies on SQLite. However, SQLite isn't ideal for simultaneous access from multiple machines and requires additional workarounds. Here are some thoughts I've come up with:

      1. Place the SQLite database on an NFS share (not recommended by SQLite developers, but it might work).
      2. Implement PostgreSQL support for diskcache.
      3. Replace diskcache with a caching solution that supports PostgreSQL or another database.

      None of these solutions are perfect, so I would appreciate any suggestions for a better approach. Thank you!
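
To make the last option more concrete, here is a rough sketch of a key-value cache stored in a SQL table. It is an illustration only: `SqlKeyValueCache` and the table layout are hypothetical, and `sqlite3` is used purely so the example runs on its own. A PostgreSQL version would need a different driver, `%s` placeholders instead of `?`, and `INSERT ... ON CONFLICT` in place of SQLite's `INSERT OR REPLACE`.

```python
import sqlite3
from typing import Optional


class SqlKeyValueCache:
    """Hypothetical cache that keeps serialized results in one SQL table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS task_cache ("
            "cache_key TEXT PRIMARY KEY, payload BLOB NOT NULL)"
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[bytes]:
        row = self._conn.execute(
            "SELECT payload FROM task_cache WHERE cache_key = ?", (key,)
        ).fetchone()
        return None if row is None else bytes(row[0])

    def put(self, key: str, value: bytes) -> None:
        # SQLite-specific upsert; overwrites any existing entry for the key.
        self._conn.execute(
            "INSERT OR REPLACE INTO task_cache (cache_key, payload) VALUES (?, ?)",
            (key, value),
        )
        self._conn.commit()


# Usage: an in-memory database stands in for a shared database server.
cache = SqlKeyValueCache(sqlite3.connect(":memory:"))
cache.put("example-key", b"serialized result")
```

With a server-based database behind the same two methods, concurrent access from several machines would be handled by the database rather than by file locking on a shared SQLite file.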