
Key not found "ADLSGen2" when using to_spark_dataframe #1503

Open

Description

@malthe

I'm creating a dataset directly using a URL (relying on identity-based access):

from azureml.core import Dataset

dataset = Dataset.Tabular.from_parquet_files("https://<account>.dfs.core.windows.net/<path>")

(This prompts my browser to start a login process.)

While dataset.to_pandas_dataframe() works fine, dataset.to_spark_dataframe() fails with the following Java traceback:

: java.util.NoSuchElementException: key not found: ADLSGen2
	at scala.collection.MapLike.default(MapLike.scala:235)
	at scala.collection.MapLike.default$(MapLike.scala:234)
	at scala.collection.AbstractMap.default(Map.scala:63)
	at scala.collection.MapLike.apply(MapLike.scala:144)
	at scala.collection.MapLike.apply$(MapLike.scala:143)
	at scala.collection.AbstractMap.apply(Map.scala:63)
	at com.microsoft.dprep.io.StreamInfoFileSystem$.toFileSystemPath(StreamInfoFileSystem.scala:68)
	at com.microsoft.dprep.execution.Storage$.expandHdfsPath(Storage.scala:37)
	at com.microsoft.dprep.execution.executors.GetFilesExecutor$.$anonfun$getFiles$1(GetFilesExecutor.scala:18)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at com.microsoft.dprep.execution.executors.GetFilesExecutor$.getFiles(GetFilesExecutor.scala:12)
	at com.microsoft.dprep.execution.LariatDataset$.getFiles(LariatDataset.scala:32)
	at com.microsoft.dprep.execution.PySparkExecutor.getFiles(PySparkExecutor.scala:225)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)

This is using "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-62-25d40cff-SNAPSHOT" and PySpark 3.1.2.
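
For completeness, the session is configured roughly like this (a sketch; the snapshot repository URL is my assumption about where that build is resolved from):

from pyspark.sql import SparkSession

# Rough sketch of the session setup; the repository for the snapshot build
# is an assumption and may differ in other environments.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-62-25d40cff-SNAPSHOT",
    )
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)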

What might cause this error?

The Java code is called from a generated Python module, which shows where the "ADLSGen2" key comes from:

# ...
lds0 = jex.getFiles(
    [{"searchPattern": "https://<account>.dfs.core.windows.net/<path>",
      "handler": "ADLSGen2",
      "arguments": {"credential": ""}
    }],
    secrets
)
# ...
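
My reading of the traceback is that StreamInfoFileSystem.toFileSystemPath does a plain map lookup on that handler name and no entry is registered for "ADLSGen2". A minimal sketch of that failure mode (Python for illustration; the handler map contents here are hypothetical, the real code is Scala):

# Hypothetical sketch of the lookup that appears to fail: the handler name
# taken from the stream info is not a key in the registered handler map.
registered_handlers = {"Local": "file://", "AzureBlob": "wasbs://"}  # hypothetical entries

handler = "ADLSGen2"
prefix = registered_handlers[handler]  # KeyError, mirroring
                                       # java.util.NoSuchElementException: key not found: ADLSGen2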

Metadata

Assignees: No one assigned

Labels: ADO (issue is documented on MSFT ADO for internal tracking), Data4ML, product-question
