I am trying to run a DLT job that uses GraphFrames, which is included in the standard ML image. It works on my job compute instances, but I'm running into problems when I try to use it in a DLT job. Here are my overrides for the standard job compute policy:
{
  "spark_version": {
    "type": "unlimited",
    "defaultValue": "auto:latest-lts-ml"
  },
  "cluster_type": {
    "type": "allowlist",
    "defaultValue": "all-purpose",
    "values": [
      "all-purpose",
      "job",
      "dlt"
    ]
  }
}
However, when I run the DLT job, I get the following error:
ModuleNotFoundError: No module named 'graphframes',None,Map(),Map(),List(),List(),Map())
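For anyone trying to reproduce this, here is a minimal sketch of a check that can be run in a notebook cell or at the top of the pipeline source to see whether the module is visible to that Python environment. `module_available` is a hypothetical helper name of my own, not part of any Databricks or GraphFrames API, and the check only covers the Python package, not the underlying JVM jar.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found on the
    current Python path, without actually importing it."""
    return importlib.util.find_spec(name) is not None


# Report whether the graphframes Python package is visible in this environment.
print("graphframes importable:", module_available("graphframes"))
```

Comparing the result on job compute and inside the DLT pipeline should at least show whether the two environments differ in what is installed.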
As far as I know, GraphFrames is not pip-installable. The primary installation instructions use Maven coordinates, since the Python package relies on the underlying Java/Scala library.
Will DLT pipelines support GraphFrames?
Related but unresolved question.