[SPARK-51711][ML][PYTHON][CONNECT] Propagates the active remote spark session to new threads to fix CrossValidator #50507

xi-db · 2025-04-03T11:09:03Z

What changes were proposed in this pull request?

In SparkML with Spark Connect, _parallelFitTasks fails during CrossValidator fitting because the active remote Spark session is not propagated to the new threads.

Before this PR, the following code fails at the line cvModel = cv.fit(data):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame([
    (Vectors.dense(1.0, 2.0), 0),
    (Vectors.dense(2.0, 3.0), 1),
    (Vectors.dense(1.5, 2.5), 0),
    (Vectors.dense(3.0, 4.0), 1),
    (Vectors.dense(1.1, 2.1), 0),
    (Vectors.dense(2.5, 3.5), 1),
], ["features", "label"])

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = BinaryClassificationEvaluator(labelCol="label")
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2])
             .addGrid(rf.numTrees, [5, 10])
             .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)

cvModel = cv.fit(data)

bestModel = cvModel.bestModel
print(f"Best maxDepth: {bestModel.getMaxDepth()}")
print(f"Best maxBins: {bestModel.getMaxBins()}")
print(f"Best numTrees: {bestModel.getNumTrees}")

It fails because the active remote Spark session is not set on the thread that runs the fit task:

File ~/spark/python/pyspark/ml/util.py:250, in try_remote_call.<locals>.wrapped(self, name, *args)
    247 from pyspark.ml.connect.serialize import serialize, deserialize
    249 session = SparkSession.getActiveSession()
--> 250 assert session is not None
    251 assert isinstance(self._java_obj, str)
    252 methods, obj_ref = _extract_id_methods(self._java_obj)

AssertionError:

With this fix, the above code snippet works correctly.
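The failure mode can be reproduced without Spark. The sketch below is a simplified illustration, not the PySpark implementation: it assumes the active session lives in a thread-local slot, and the names get_active_session, set_active_session, propagate_session, and fit_task are hypothetical stand-ins. It shows that a new thread sees no active session unless the session is captured on the parent thread and re-set inside the child.

```python
import threading

# Hypothetical stand-in for a per-thread "active session" slot.
_active = threading.local()


def get_active_session():
    """Return the session active on the calling thread, or None."""
    return getattr(_active, "session", None)


def set_active_session(session):
    """Mark `session` as active on the calling thread only."""
    _active.session = session


def propagate_session(session, func):
    """Wrap `func` so it runs with `session` active on whichever thread calls it."""
    def wrapped(*args, **kwargs):
        previous = get_active_session()
        set_active_session(session)
        try:
            return func(*args, **kwargs)
        finally:
            # Restore whatever was active before, so pooled threads stay clean.
            set_active_session(previous)
    return wrapped


def fit_task():
    """Mimics a fit task that needs an active session; returns what it sees."""
    return get_active_session()


def run_in_thread(task):
    """Run `task` on a fresh thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(task()))
    worker.start()
    worker.join()
    return result[0]


set_active_session("session-A")

# Unwrapped: the new thread has its own empty thread-local state.
without_fix = run_in_thread(fit_task)

# Wrapped: the session is captured here, on the parent thread, and re-set in the child.
with_fix = run_in_thread(propagate_session(get_active_session(), fit_task))
```

In this sketch without_fix is None, which corresponds to the failing assert in the traceback, while with_fix is the parent thread's session.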

Why are the changes needed?

It fixes a bug where CrossValidator fitting fails under Spark Connect because the threads running the fit tasks have no active session.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test.

Was this patch authored or co-authored using generative AI tooling?

No.

xi-db · 2025-04-03T11:21:48Z

Hi @zhengruifeng, could you review this PR?

It fixes a bug in SparkML via SparkConnect. The bug is reproducible with the code example in the PR description.

Thanks!

WeichenXu123

LGTM

WeichenXu123

LGTM

zhengruifeng · 2025-04-07T02:21:10Z

python/pyspark/ml/connect/tuning.py

@@ -434,7 +434,7 @@ def _fit(self, dataset: Union[pd.DataFrame, DataFrame]) -> "CrossValidatorModel"

            tasks = _parallelFitTasks(est, train, eva, validation, epm)
            if not is_remote():
-                tasks = list(map(inheritable_thread_target, tasks))
+                tasks = list(map(inheritable_thread_target(dataset.sparkSession), tasks))


Is this necessary?
It is under the if not is_remote() branch.
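The changed line calls the helper with a session and then maps the result over the tasks, which implies that the call returns a wrapper for each task. A minimal sketch of that calling pattern follows. It is an assumption about the shape of the API, not the code in python/pyspark/util.py; FakeSession and inheritable_target_sketch are hypothetical names, and the thread-local slot stands in for whatever state the real helper sets.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


class FakeSession:
    """Hypothetical stand-in for a Spark session object."""

    def __init__(self, name):
        self.name = name


def inheritable_target_sketch(arg):
    """Accept either a callable (wrap it directly) or a session (return a decorator)."""
    if isinstance(arg, FakeSession):
        session = arg

        def decorator(func):
            def wrapped(*args, **kwargs):
                # Make the captured session visible on the worker thread.
                _local.session = session
                return func(*args, **kwargs)
            return wrapped

        return decorator

    func = arg

    def passthrough(*args, **kwargs):
        # No session supplied: the task sees whatever the worker thread has.
        return func(*args, **kwargs)

    return passthrough


def make_task(index):
    """Build a task that reports which session its thread sees."""
    return lambda: (index, getattr(_local, "session", None))


tasks = [make_task(i) for i in range(3)]
session = FakeSession("remote")

# Form used in the diff above: call with a session, then map over the tasks.
with_session = list(map(inheritable_target_sketch(session), tasks))
with ThreadPoolExecutor(max_workers=2) as pool:
    seen_with = [(i, s.name) for i, s in pool.map(lambda t: t(), with_session)]

# Original form: map the helper directly over the tasks.
plain = list(map(inheritable_target_sketch, tasks))
with ThreadPoolExecutor(max_workers=2) as pool:
    seen_plain = list(pool.map(lambda t: t(), plain))
```

Under these assumptions, tasks wrapped with a session all see it, while tasks wrapped the original way see none on a fresh worker thread.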

python/pyspark/util.py

zhengruifeng · 2025-04-07T02:36:54Z

sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLCache.scala

@@ -86,7 +86,7 @@ private[connect] class MLCache extends Logging {

 private[connect] object MLCache {
  // The maximum number of distinct items in the cache.
-  private val MAX_CACHED_ITEMS = 100


Is this related?
If not, I think we should file a separate PR to increase the value.

Propagates the active remote spark session to new threads to fix para…

6771c4f

…llelFitTasks

github-actions bot added ML CORE PYTHON CONNECT labels Apr 3, 2025

xi-db changed the title ~~Propagates the active remote spark session to new threads to fix parallelFitTasks~~ [SPARK-51711] Propagates the active remote spark session to new threads to fix parallelFitTasks Apr 3, 2025

xi-db changed the title ~~[SPARK-51711] Propagates the active remote spark session to new threads to fix parallelFitTasks~~ [SPARK-51711] Propagates the active remote spark session to new threads to fix CrossValidator Apr 3, 2025

xi-db changed the title ~~[SPARK-51711] Propagates the active remote spark session to new threads to fix CrossValidator~~ [SPARK-51711][ML][PYTHON][CONNECT] Propagates the active remote spark session to new threads to fix CrossValidator Apr 3, 2025

Fix lint issue

a2b02fe

WeichenXu123 approved these changes Apr 3, 2025

xi-db added 2 commits April 3, 2025 13:31

Fix lint issue

3e3ad4e

Move test to correct suite

fe8992b

github-actions bot removed the CONNECT label Apr 3, 2025

Update ml/connect/tuning.py as well

64ee91c

github-actions bot added the CONNECT label Apr 3, 2025

WeichenXu123 approved these changes Apr 3, 2025

Update ml/connect/tuning.py as well

3c31378

HyukjinKwon approved these changes Apr 4, 2025

xi-db added 2 commits April 4, 2025 08:51

Fix lint issue

b8183d6

Increase MAX_CACHED_ITEMS

a218d38

github-actions bot added the SQL label Apr 4, 2025

Merge branch 'master' into fix-parallelFitTasks

a9fc41e

zhengruifeng reviewed Apr 7, 2025

python/pyspark/util.py (outdated, resolved)

zhengruifeng reviewed Apr 7, 2025

xi-db added 2 commits April 8, 2025 12:04

Revert "Increase MAX_CACHED_ITEMS"

e90d670

This reverts commit a218d38.

Merge branch 'master' into fix-parallelFitTasks

50984e0

github-actions bot removed the SQL label Apr 8, 2025

Rename SCS to RemoteSparkSession

69a6869

Co-authored-by: Ruifeng Zheng <[email protected]>

xi-db added 4 commits April 8, 2025 14:04

Rename SCS to RemoteSparkSession

63c9f9c

Fix tests

5f2e375

Fix tests

44f0b7f

Fix tests

ec82013
