[FEATURE] Dynamic Engine Pool Scaling #7050

Open
3 of 4 tasks
wangzhigang1999 opened this issue Apr 28, 2025 · 1 comment
@wangzhigang1999
Contributor

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

This feature adds dynamic resizing (scaling up and down) to Kyuubi's Engine Pool mechanism. It introduces four pieces: a pluggable PoolScalingStrategy, a coordinating EnginePoolManager, an EnginePoolAccessor interface for interacting with specific pool implementations, and PoolMetrics for monitoring. Together they let the size of an Engine Pool adjust automatically based on predefined policies, such as time-based rules and, potentially in the future, load-based strategies.

Motivation

Current Situation:

Kyuubi’s Engine Pool pre-starts and manages a set of engine instances for specific users or groups to reduce session creation latency. Currently, the size of each pool is statically configured via kyuubi.engine.pool.size, requiring administrators to manually set and adjust this size based on expected loads.

Problems:

  1. Resource Waste: A statically large pool during low-load periods leaves many engines idle, consuming CPU, memory, and possibly licensing resources unnecessarily.
  2. Performance Bottlenecks and User Experience Degradation: During peak traffic, a statically small pool may be insufficient to handle concurrency, causing session requests to queue or even time out, hurting throughput and user experience.
  3. Operational Complexity: Manual monitoring and resizing based on experience is inefficient, reactive, error-prone, and ill-suited for complex or rapidly changing load patterns.
  4. Inadequate Cloud-Native Adaptability: Static sizing cannot leverage cloud elasticity effectively, limiting dynamic resource allocation per actual demand—contrary to cloud-native principles.

Real-World Scenario:
We have a big data cluster that submits jobs via Kyuubi and has day-night resource elasticity (e.g., 8–10 TB of memory resident during the day, expanded to roughly 1.5× at night). Engine pools are used for certain heavy users, with two static sizing options:

  • Option 1 (2 engines, 48GB RAM each): Under nighttime peak, the few engines become bottlenecks, prone to GC pauses; often an engine hangs on GC causing jobs to be delayed by 1-2 hours and severely increasing night-time operational burden.
  • Option 2 (4 engines, 32GB RAM each): Can barely handle nighttime but engines remain occupied during daytime due to ongoing sessions and Spark Dynamic Allocation executor caching, wasting resources and affecting other non-Kyuubi jobs.

This illustrates the limitations of static pools facing dynamic resources and varying loads, underscoring the urgent need for automated elastic scaling.

Goals and Benefits:

  • Improve Resource Efficiency: Automatically shrink pool size during low load to free idle resources and cut costs.
  • Enhance System Resilience: Expand pool size proactively in high load to promptly respond to user demand, ensuring service performance and availability.
  • Increase Adaptability: Enable Kyuubi Engine Pools to automatically adapt to periodic or bursty workload fluctuations.
  • Simplify Operations: Reduce manual intervention and management complexity with automated scaling.
  • Better Cloud-Native Support: Leverage cloud platform elasticity for on-demand resource allocation.

Describe the solution

We introduce a set of new components and configurations to enable dynamic resizing of Engine Pools. The core architecture revolves around instantiating an EnginePoolManager for each pool (or sub-pool) requiring dynamic scaling. This manager periodically runs (as per scaling.interval), computes the target size through a configurable PoolScalingStrategy, and interacts with the concrete pool implementation via an EnginePoolAccessor to carry out scale-up or graceful scale-down operations. A cooldown period is enforced to stabilize scaling, and detailed metrics are exposed through PoolMetrics.
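
The periodic run and the cooldown described above could be wired up roughly as follows. This is a minimal sketch, not the proposed implementation: `CooldownGate` and `ScalingScheduler` are hypothetical names, and the interval and cooldown are passed as plain millisecond values because the proposal does not yet fix the configuration keys.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: tells the manager whether a new scaling action is allowed yet. */
final class CooldownGate {
    private final long cooldownMillis;
    // Time of the last applied resize; Long.MIN_VALUE means "never scaled".
    private final AtomicLong lastScalingMillis = new AtomicLong(Long.MIN_VALUE);

    CooldownGate(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /** True while fewer than cooldownMillis have passed since the last resize. */
    boolean inCooldown(long nowMillis) {
        long last = lastScalingMillis.get();
        return last != Long.MIN_VALUE && nowMillis - last < cooldownMillis;
    }

    /** Called after a resize has been requested. */
    void markScaled(long nowMillis) {
        lastScalingMillis.set(nowMillis);
    }
}

/** Hypothetical helper: runs one scaling check per interval on a daemon thread. */
final class ScalingScheduler {
    static ScheduledExecutorService start(Runnable check, long intervalMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "engine-pool-scaling");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleWithFixedDelay(() -> {
            try {
                check.run();
            } catch (RuntimeException e) {
                // An uncaught exception would suppress all later runs of a
                // scheduled task, so failures are caught here (and would be
                // logged and counted in a real manager).
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

A fixed delay (rather than a fixed rate) is used in this sketch so that a slow check never overlaps with or immediately follows the next one; whether that is the right choice for Kyuubi is open for discussion.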

Core Components

  1. EnginePoolManager

    • Responsibilities: Manages the dynamic scaling lifecycle for a single Engine Pool identified by poolIdentifier (e.g., user/group key). It runs periodic scaling checks on a scheduled executor. On each check it retrieves the current pool size and optional metrics via EnginePoolAccessor, calculates the desired size via PoolScalingStrategy, skips scaling if the cooldown period has not elapsed, and otherwise triggers a resize when the target differs from the current size. It records scaling events, target and actual sizes, latencies, and errors in PoolMetrics and writes them to the server log.
    • Lifecycle: Tied to the Engine Pool instance in the Kyuubi server, created and started on server/pool startup, and gracefully stopped on shutdown or pool destruction.
  2. PoolScalingStrategy (Pluggable Interface)

    • Responsibilities: Defines the core logic for computing the target pool size. Must be stateless or serializable if needed for configuration distribution. Receives a PoolContext with pool ID, current time, current size, min/max bounds, and optional load/performance metrics collected from the pool. Returns a desired target size, ideally within min/max bounds (final bounds enforcement is done by EnginePoolManager).
    • Allows users to implement and plug in custom scaling algorithms.
  3. EnginePoolAccessor (Interface to the Pool Implementation)

    • Responsibilities: Abstracts interaction with the concrete Engine Pool implementations (EnginePool, UserGroupAwareEnginePool, etc.). Provides:
      • Precise retrieval of the current effective pool size (excluding starting or pending-removal engines).
      • Execution of resize commands: asynchronous scale-up (for example, creating new engines) and graceful scale-down.
      • Collection of internal metrics useful to scaling decisions (active sessions, pending sessions, idle engines, pending removal counts, etc.).
  4. PoolMetrics Interface

    • Responsibilities: Defines APIs to report and monitor dynamic scaling activities such as current and target pool sizes, scaling events (scale-ups and downs), scaling latencies, and errors.
    • A default implementation will integrate with Kyuubi’s existing MetricsSystem to register gauges, counters, timers, etc., with appropriate labels to distinguish pools.
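
The PoolScalingStrategy contract and a time-based rule could look roughly like the sketch below. Only the names `PoolScalingStrategy`, `PoolContext`, and `calculateTargetSize` come from this proposal; the field layout of `PoolContext`, the `TimeWindowStrategy` class, and the day/night sizes are assumptions for illustration, and the optional load metrics are left out for brevity.

```java
import java.time.LocalTime;

/** Assumed shape of the context passed to a strategy (load metrics omitted). */
final class PoolContext {
    final String poolId;
    final LocalTime now;
    final int currentSize;
    final int minSize;
    final int maxSize;

    PoolContext(String poolId, LocalTime now, int currentSize, int minSize, int maxSize) {
        this.poolId = poolId;
        this.now = now;
        this.currentSize = currentSize;
        this.minSize = minSize;
        this.maxSize = maxSize;
    }
}

/** Pluggable strategy: returns the desired size; the manager clamps it to the bounds. */
interface PoolScalingStrategy {
    int calculateTargetSize(PoolContext context);
}

/**
 * Hypothetical time-based rule: use nightSize inside [nightStart, nightEnd)
 * and daySize otherwise. The window may wrap past midnight (e.g. 20:00-08:00).
 */
final class TimeWindowStrategy implements PoolScalingStrategy {
    private final LocalTime nightStart;
    private final LocalTime nightEnd;
    private final int daySize;
    private final int nightSize;

    TimeWindowStrategy(LocalTime nightStart, LocalTime nightEnd, int daySize, int nightSize) {
        this.nightStart = nightStart;
        this.nightEnd = nightEnd;
        this.daySize = daySize;
        this.nightSize = nightSize;
    }

    @Override
    public int calculateTargetSize(PoolContext context) {
        return inNightWindow(context.now) ? nightSize : daySize;
    }

    private boolean inNightWindow(LocalTime t) {
        if (nightStart.isBefore(nightEnd)) {
            // Window within a single day: start <= t < end.
            return !t.isBefore(nightStart) && t.isBefore(nightEnd);
        }
        // Window wraps midnight: t >= start or t < end.
        return !t.isBefore(nightStart) || t.isBefore(nightEnd);
    }
}
```

For the scenario in the motivation, `new TimeWindowStrategy(LocalTime.of(20, 0), LocalTime.of(8, 0), 2, 4)` would ask for 4 engines overnight and 2 during the day. The strategy holds no mutable state, which matches the requirement that strategies be stateless.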
sequenceDiagram
    title Dynamic Scaling Check Sequence

    participant S as Scheduler (in Manager)
    participant EPM as EnginePoolManager
    participant EPA as EnginePoolAccessor
    participant PSS as PoolScalingStrategy
    participant PM as PoolMetrics

    S ->>+ EPM: Trigger Scaling Check (Every Interval)
    EPM ->> EPM: Check Cooldown Period
    opt Cooldown Active
        EPM -->> S: Skip Check (In Cooldown)
    end
    EPM ->>+ EPA: getCurrentSize()
    EPA -->>- EPM: currentSize
    EPM ->> PM: recordPoolSize(currentSize)
    EPM ->>+ EPA: collectMetrics()
    EPA -->>- EPM: poolMetricsMap
    EPM ->> EPM: Create PoolContext(currentSize, poolMetricsMap, ...)
    EPM ->>+ PSS: calculateTargetSize(context)
    PSS -->>- EPM: targetSizeRaw
    EPM ->> EPM: Clamp targetSize = max(minSize, min(maxSize, targetSizeRaw))
    EPM ->> PM: recordTargetPoolSize(targetSize)

    alt targetSize != currentSize
        EPM ->>+ EPA: resize(targetSize)
        Note right of EPA: Initiates async scale-up or<br/>graceful scale-down
        EPA -->>- EPM: Resize Requested (returns)
        EPM ->> EPM: Update lastScalingTimestamp
        EPM ->> PM: recordScalingEvent(currentSize, targetSize)
    else targetSize == currentSize
        EPM ->> EPM: Log "No scaling needed"
    end

    EPM ->> PM: recordScalingLatency(...)
    EPM -->>- S: Check Complete
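
One pass of the sequence above could be sketched as follows, under stated assumptions. The method names `getCurrentSize`, `collectMetrics`, `resize`, `recordPoolSize`, `recordTargetPoolSize`, and `recordScalingEvent` are taken from the diagram; their signatures, the `TargetSizeFn` stand-in for the strategy, and the `InMemoryPool` and `CountingMetrics` fakes are hypothetical. Latency recording is omitted because its arguments are not specified yet.

```java
import java.util.Collections;
import java.util.Map;

interface EnginePoolAccessor {
    int getCurrentSize();
    Map<String, Long> collectMetrics();
    void resize(int targetSize);
}

interface PoolMetrics {
    void recordPoolSize(int size);
    void recordTargetPoolSize(int size);
    void recordScalingEvent(int fromSize, int toSize);
}

/** Stand-in for the strategy, reduced to current size plus pool metrics. */
interface TargetSizeFn {
    int calculateTargetSize(int currentSize, Map<String, Long> metrics);
}

/** Hypothetical single scaling check, following the diagram step by step. */
final class ScalingCheck {
    private final EnginePoolAccessor accessor;
    private final TargetSizeFn strategy;
    private final PoolMetrics metrics;
    private final int minSize;
    private final int maxSize;
    private final long cooldownMillis;
    private long lastScalingMillis = Long.MIN_VALUE; // MIN_VALUE = never scaled

    ScalingCheck(EnginePoolAccessor accessor, TargetSizeFn strategy, PoolMetrics metrics,
                 int minSize, int maxSize, long cooldownMillis) {
        this.accessor = accessor;
        this.strategy = strategy;
        this.metrics = metrics;
        this.minSize = minSize;
        this.maxSize = maxSize;
        this.cooldownMillis = cooldownMillis;
    }

    /** Returns true if a resize was requested during this check. */
    synchronized boolean run(long nowMillis) {
        if (lastScalingMillis != Long.MIN_VALUE
                && nowMillis - lastScalingMillis < cooldownMillis) {
            return false; // still in cooldown: skip the whole check
        }
        int current = accessor.getCurrentSize();
        metrics.recordPoolSize(current);
        int raw = strategy.calculateTargetSize(current, accessor.collectMetrics());
        int target = Math.max(minSize, Math.min(maxSize, raw)); // enforce bounds
        metrics.recordTargetPoolSize(target);
        if (target == current) {
            return false; // no scaling needed
        }
        accessor.resize(target);
        lastScalingMillis = nowMillis;
        metrics.recordScalingEvent(current, target);
        return true;
    }
}

/** Test fake: resizes instantly, whereas a real pool would do so asynchronously. */
final class InMemoryPool implements EnginePoolAccessor {
    private int size;
    InMemoryPool(int size) { this.size = size; }
    public int getCurrentSize() { return size; }
    public Map<String, Long> collectMetrics() { return Collections.emptyMap(); }
    public void resize(int targetSize) { size = targetSize; }
}

/** Test fake: counts scaling events and remembers the last target. */
final class CountingMetrics implements PoolMetrics {
    int scalingEvents;
    int lastTarget;
    public void recordPoolSize(int size) { }
    public void recordTargetPoolSize(int size) { lastTarget = size; }
    public void recordScalingEvent(int fromSize, int toSize) { scalingEvents++; }
}
```

In this sketch the cooldown check returns early, so no size or target is recorded while the cooldown is active. The diagram leaves that detail open, and it may be preferable to keep recording the current size on every tick.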

Additional context

This is an initial proposal aiming to address the dynamic scaling capabilities of the Engine Pool in Kyuubi. The design and implementation details are still open for discussion. I sincerely welcome feedback, suggestions, and any improvements from the community to help refine and make this feature more robust and aligned with real-world needs. Looking forward to collaborating with everyone!

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
@wangzhigang1999
Contributor Author

I noticed that PR #5662 implements similar functionality. Perhaps I could build upon it and improve it further?
