Summary
The user reports that memory usage of the flyte-binary pod grows from 500 MB to 2 GB under heavy load, ending in an OOM kill. They are unsure whether this is caused by internal caching, dataCatalog, or all workflows being held in memory. The user requests a feature to delete completed workflows from memory and suggests increasing resources to 4 CPUs and over 4 GB of RAM, questioning whether 4 GB is sufficient for about 10,000 daily pipeline triggers. They plan to investigate how cache control affects memory scaling and emphasize the importance of monitoring, mentioning specific configuration parameters and performance optimization resources. The user suspects caching issues or a memory leak, noting that memory usage can rise to 8 GB before an OOM kill occurs, despite no workloads failing. Under production workload the pod no longer crashes every few hours, but it still crashes about every 2 days, with memory around 8 GB. They are advised to consider moving to flyte-core for additional scaling mechanisms, and they mention a temporary solution while planning to build a POC of flyte-core. Additionally, Flyte 1.13.2 is noted to contain a fix for the memory issue in flyte-binary, and an update is proposed for Monday.
divyank.agarwal
ohh my bad. I thought the fix got released in 1.13.1
dubovikov.kirill
<@U06MQ3WEUBS> could we plan an update on Monday?
curupa
<@U06MQ3WEUBS>, Flyte 1.13.2 is out (https://github.com/flyteorg/flyte/releases/tag/v1.13.2) and that contains a fix for this memory issue you're seeing in flyte-binary.
divyank.agarwal
ok thanks. We have a temporary solution, and we will start building a POC of flyte-core
david.espejo
with the number of executions and what you expect (around 10k triggers a day), I'd say you should consider moving to flyte-core, as there are additional mechanisms available to scale out
divyank.agarwal
Let me know if any other metric is needed.
divyank.agarwal
I can get that.
david.espejo
<@U06MQ3WEUBS> do you have metrics on resource usage from the Pod? how many executions?
divyank.agarwal
<@U0265RTUJ5B> added an observation above.. might be useful.
divyank.agarwal
Memory here is around 8 GB.
divyank.agarwal
an update on this problem: under our production workload it is not crashing every few hours. However, it is still crashing every 2 days.
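
For a rough sense of scale, the figures quoted in this thread (around 10k triggers a day, roughly 500 MB at the start, around 8 GB at the time of the OOM kill, a crash about every 2 days) can be turned into approximate rates. This is only a back-of-envelope sketch: it assumes memory grows linearly between restarts and that the 500 MB and 8 GB figures describe the same deployment, neither of which the thread confirms. The function names are illustrative and not part of Flyte.

```python
# Back-of-envelope rates from figures quoted in the thread.
# ASSUMPTION (not confirmed in the thread): memory grows linearly between
# restarts, and the 500 MB baseline and 8 GB peak refer to the same pod.

TRIGGERS_PER_DAY = 10_000     # "around 10k triggers a day"
BASELINE_MB = 500             # reported starting memory usage
PEAK_MB = 8 * 1024            # "around 8 GB" at the time of the OOM kill
DAYS_BETWEEN_CRASHES = 2      # "crashing every 2 days"


def triggers_per_minute(per_day: int) -> float:
    """Average trigger rate, assuming triggers are spread evenly over the day."""
    return per_day / (24 * 60)


def growth_mb_per_hour(baseline_mb: float, peak_mb: float, days: float) -> float:
    """Average memory growth per hour between a restart and the next OOM kill."""
    return (peak_mb - baseline_mb) / (days * 24)


def growth_kb_per_trigger(baseline_mb: float, peak_mb: float,
                          per_day: int, days: float) -> float:
    """Average memory retained per trigger, if all growth is attributed to triggers."""
    total_triggers = per_day * days
    return (peak_mb - baseline_mb) * 1024 / total_triggers


if __name__ == "__main__":
    print("triggers/min:", triggers_per_minute(TRIGGERS_PER_DAY))
    print("growth MB/h:", growth_mb_per_hour(BASELINE_MB, PEAK_MB, DAYS_BETWEEN_CRASHES))
    print("growth KB/trigger:", growth_kb_per_trigger(
        BASELINE_MB, PEAK_MB, TRIGGERS_PER_DAY, DAYS_BETWEEN_CRASHES))
```

If the real growth curve from the pod metrics is not roughly linear (for example, it plateaus and then jumps), these averages say little, and the actual time series is the better guide.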