Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

FranPérez
by New Contributor III
  • 18127 Views
  • 9 replies
  • 6 kudos

set PYTHONPATH when executing workflows

I set up a workflow using 2 tasks. Just for demo purposes, I'm using an interactive cluster for running the workflow.
{
  "task_key": "prepare",
  "spark_python_task": {
    "python_file": "file...

Latest Reply
kenmyers-8451
Contributor II
  • 6 kudos

Just checking in again: has a way to do this turned up in the last few years? As Fran mentioned, `sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")` is not a great "fix" for the reasons already mentioned. I've found that you can do `pip ins...
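One pattern worth noting here (a sketch only, not necessarily what the truncated reply above goes on to describe) is an editable pip install of the repo, so imports resolve without sys.path hacks. This assumes the repo root carries a setup.py or pyproject.toml, which the thread doesn't confirm:

%pip install -e /Workspace/Repos/devops/mlhub-mlops-dev  # repo path from the thread; packaging files are assumed
# from mlhub_mlops import some_module  # hypothetical import once the install is in place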

8 More Replies
Ligaya
by New Contributor II
  • 60604 Views
  • 12 replies
  • 2 kudos

ValueError: not enough values to unpack (expected 2, got 1)

Code: Writer.jdbc_writer("Economy", economy, conf=CONF.MSSQL.to_dict(), modified_by=JOB_ID['Economy'])
The problem arises when I try to run the code in the specified Databricks notebook: an error of "ValueError: not enough values to unpack (expected 2, ...

Latest Reply
johantylor
New Contributor II
  • 2 kudos

You can also avoid this error by making the split logic more adaptable. Before separating the table name, first verify whether a schema is actually present. If it isn’t, you can set a default schema or simply handle it as one value. This approach is ...
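A minimal sketch of that adaptable split (the function name and default schema below are illustrative, assuming the ValueError comes from unpacking a table name that may or may not carry a schema):

def split_table_name(qualified_name, default_schema="dbo"):
    # "schema.table" unpacks to two values; a bare "table" would otherwise
    # raise "not enough values to unpack (expected 2, got 1)"
    parts = qualified_name.split(".", 1)
    if len(parts) == 2:
        schema, table = parts
    else:
        schema, table = default_schema, parts[0]
    return schema, table

split_table_name("dbo.Economy")  # ("dbo", "Economy")
split_table_name("Economy")      # ("dbo", "Economy") instead of a ValueError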

11 More Replies
Adig
by New Contributor III
  • 9589 Views
  • 6 replies
  • 17 kudos

Generate a group id for similar duplicate values of a dataframe column.

Input DataFrame:
KeyName | KeyCompare | Source
PapasMrtemis | PapasMrtemis | S1
PapasMrtemis | Pappas, Mrtemis | S1
Pappas, Mrtemis | PapasMrtemis | S2
Pappas, Mrtemis | Pappas, Mrtemis | S2
Mich...

Latest Reply
rafaelpoyiadzi
New Contributor II
  • 17 kudos

Hey. We’ve run into similar deduplication problems before. If the name differences are pretty minor (punctuation, spacing, small typos), fuzzy string matching can usually get you most of the way there. That kind of similarity-based clustering works f...
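For the punctuation/spacing cases, a minimal normalization-based sketch (column names follow the input above; true fuzzy scoring, e.g. with rapidfuzz, would be the next step up and is not shown here):

from pyspark.sql import functions as F, Window

# Lowercase and strip everything non-alphanumeric, then give each distinct
# normalized key the same group id (the global window funnels rows through
# a single partition, fine for modest data sizes)
norm = F.regexp_replace(F.lower(F.col("KeyCompare")), r"[^a-z0-9]", "")
df_grouped = (
    df.withColumn("norm_key", norm)
      .withColumn("group_id", F.dense_rank().over(Window.orderBy("norm_key")))
)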

5 More Replies
confused_dev
by New Contributor II
  • 45078 Views
  • 8 replies
  • 7 kudos

Python mocking dbutils in unittests

I am trying to write some unit tests using pytest, but I am coming across the problem of how to mock my dbutils method when dbutils isn't being defined in my notebook. Is there a way to do this so that I can unit test individual functions that are uti...

Latest Reply
kenmyers-8451
Contributor II
  • 7 kudos

If this helps anyone, here is how we do this: we rely on databricks_test for injecting dbutils into the notebooks that we're testing (which is a 3rd-party package, mind you, that hasn't been updated in a while but still works). And in our notebooks we put...
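If a dependency on databricks_test is unwanted, plain unittest.mock also covers the common case, provided functions take dbutils as a parameter rather than reading it from the notebook's globals (the function under test here is hypothetical):

from unittest.mock import MagicMock

def read_env(dbutils):
    # hypothetical function under test; it only touches dbutils.widgets
    return dbutils.widgets.get("env")

def test_read_env():
    dbutils = MagicMock()
    dbutils.widgets.get.return_value = "dev"
    assert read_env(dbutils) == "dev"
    dbutils.widgets.get.assert_called_once_with("env")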

7 More Replies
Hubert-Dudek
by Databricks MVP
  • 28927 Views
  • 14 replies
  • 12 kudos

Resolved! dbutils or other magic way to get notebook name or cell title inside notebook cell

Not sure it exists, but maybe there is some trick to get these directly from Python code: NotebookName, CellTitle. I'm just working on some logger script shared between notebooks, and it could make my life a bit easier.

Latest Reply
rtullis
New Contributor II
  • 12 kudos

I got the solution to work in terms of printing the notebook that I was running; however, what if you have notebook A that calls a function that prints the notebook name, and you run notebook B that %runs notebook A? I get notebook B's name when...
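For reference, the context-based lookup usually given in threads like this looks roughly like the sketch below — with the caveat the reply above hits: %run inlines notebook A into the caller, so the context path is the top-level notebook's (B's), not A's:

# read the current notebook path from the Databricks notebook context
path = (dbutils.notebook.entry_point.getDbutils()
        .notebook().getContext().notebookPath().get())
notebook_name = path.rsplit("/", 1)[-1]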

13 More Replies
Nandini
by New Contributor II
  • 18457 Views
  • 12 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in pyspark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

Latest Reply
Etyr
Contributor II
  • 7 kudos

If you have a Spark session, you can use Spark's hidden file system:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...
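Continuing that idea, a driver-side copy through the Hadoop FileSystem API might look like the sketch below; the paths are placeholders, and FileUtil.copy's False flag means the source is kept:

fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
Path = spark._jvm.org.apache.hadoop.fs.Path
FileUtil = spark._jvm.org.apache.hadoop.fs.FileUtil

src, dst = "dbfs:/tmp/src/file.csv", "dbfs:/tmp/dst/file.csv"  # placeholder paths
FileUtil.copy(fs, Path(src), fs, Path(dst), False, spark._jsc.hadoopConfiguration())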

11 More Replies
Roy
by New Contributor II
  • 75363 Views
  • 6 replies
  • 0 kudos

Resolved! dbutils.notebook.exit() executing from except in try/except block even if there is no error.

I am using Python notebooks as part of a concurrently running workflow with Databricks Runtime 6.1. Within the notebooks I am using try/except blocks to return an error message to the main concurrent notebook if a section of code fails. However I h...

Latest Reply
tonyliken
New Contributor II
  • 0 kudos

Because dbutils.notebook.exit() is an 'Exception', it will always trigger the except Exception as e: part of the code. We can use this to our advantage to solve the problem by adding an 'if else' to the except block.
query = "SELECT 'a' as Colum...
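A sketch of the same fix taken one step further: keep dbutils.notebook.exit() out of the try block entirely, so the except branch only ever sees real failures (run_queries is a hypothetical stand-in for the notebook's work):

try:
    run_queries()            # hypothetical workload
    status = "success"
except Exception as e:
    status = f"failed: {e}"

# exit() unwinds the notebook by raising internally, so calling it after
# the try/except keeps it from being swallowed as an "error"
dbutils.notebook.exit(status)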

5 More Replies
Data_Analytics1
by Contributor III
  • 38003 Views
  • 10 replies
  • 10 kudos

Failure starting repl. How do I resolve this error? I got it in a running job.

Failure starting repl. Try detaching and re-attaching the notebook.
java.lang.Exception: Python repl did not start in 30 seconds.
at com.databricks.backend.daemon.driver.IpykernelUtils$.startIpyKernel(JupyterDriverLocal.scala:1442)
at com.databricks.b...

Latest Reply
PabloCSD
Valued Contributor II
  • 10 kudos

I have had this problem many times. Today I made a copy of the cluster and it got "de-saturated"; this could help someone in the future.

9 More Replies
JonHMDavis
by New Contributor II
  • 7518 Views
  • 5 replies
  • 2 kudos

Graphframes not importing on Databricks 9.1 LTS ML

Is Graphframes for Python meant to be installed by default on Databricks 9.1 LTS ML? Previously I was running the attached Python command on 7.3 LTS ML with no issue; however, now I am getting "no module named graphframes" when trying to import the pa...

Latest Reply
malz
Databricks Partner
  • 2 kudos

Hi @MuthuLakshmi, per the documentation, graphframes comes preinstalled in the Databricks Runtime for Machine Learning, but when trying to import the Python module of graphframes I get a module-not-found error: from graphframes i...
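If the module really is absent on your runtime, a notebook-scoped install is one thing to try — with the caveat that the PyPI package is only the Python wrapper, and the matching GraphFrames Spark JAR must also be on the cluster (ML runtimes normally ship it):

%pip install graphframes
from graphframes import GraphFrame  # should resolve once the wrapper is installed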

4 More Replies
maranBH
by New Contributor III
  • 30798 Views
  • 5 replies
  • 11 kudos

Resolved! How to import a function to another notebook using Repos without %run?

Hi all, I was reading the Repos documentation: https://docs.databricks.com/repos.html#migrate-from-run-commands. It explains that one advantage of Repos is that it is no longer necessary to use the %run magic command to make functions available in one notebook to ...

Latest Reply
JakubSkibicki
Contributor
  • 11 kudos

Due to new functionality in Runtime 16.0 regarding autoload, I came across this thread. I performed a practical test and it works, though I had some problems at first. As in the solution, the key was that the definitions are placed in a file.py, not a notebook.
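A minimal sketch of that pattern, with illustrative names (inside Repos, the repo root is already on sys.path, so the import just works):

# helpers.py — a regular .py file in the repo, not a notebook:
#     def add_one(x):
#         return x + 1
#
# in the consuming notebook:
from helpers import add_one

add_one(41)  # 42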

4 More Replies
kidexp
by New Contributor II
  • 30511 Views
  • 7 replies
  • 2 kudos

Resolved! How to install python package on spark cluster

Hi, how can I install Python packages on a Spark cluster? Locally I can use pip install, but I want to use some external packages that are not installed on the Spark cluster. Thanks for any suggestions.

Latest Reply
Mikejerere
New Contributor II
  • 2 kudos

If --py-files doesn’t work, try this shorter method:
Create a Conda environment: install your packages.
conda create -n myenv python=3.x
conda activate myenv
pip install your-package
Package and submit: use conda-pack and spark-submit with --archives.
cond...
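On Databricks itself, a notebook-scoped install is usually the shortest route of all; a one-line sketch with a placeholder package name:

%pip install some-package  # placeholder name; the install is scoped to this notebook's Python session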

6 More Replies
tanjil
by New Contributor III
  • 21857 Views
  • 7 replies
  • 6 kudos

Resolved! Downloading sharepoint lists using python

Hello, I am trying to download lists from SharePoint into a pandas dataframe. However, I cannot get any information successfully. I have attempted many solutions mentioned on Stack Overflow. Below is one of those attempts: # https://pypi.org/project/sha...

Latest Reply
tanjil
New Contributor III
  • 6 kudos

Hello, I got the code to work by using the office365 library instead.
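For anyone following up, a sketch of what that typically looks like with the Office365-REST-Python-Client package — the site URL, credentials, and list title are placeholders, and the exact call chain can vary by package version:

import pandas as pd
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

ctx = ClientContext("https://tenant.sharepoint.com/sites/mysite").with_credentials(
    UserCredential("user@tenant.com", "password"))  # placeholder credentials
items = ctx.web.lists.get_by_title("My List").items.get().execute_query()
df = pd.DataFrame([item.properties for item in items])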

6 More Replies
Constantine
by Contributor III
  • 19728 Views
  • 3 replies
  • 7 kudos

Resolved! collect_list by preserving order based on another variable - Spark SQL

I am using a Databricks SQL notebook to run these queries. I have a Python UDF like:
%python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType, DateType

def get_sell_price(sale_prices):
    return sale_...

Latest Reply
villi77
New Contributor II
  • 7 kudos

I had a similar situation where I was trying to order the days of the week from Monday to Sunday. I saw solutions that use Python but wanted to do it all in SQL. My original attempt was to use: CONCAT_WS(',', COLLECT_LIST(DISTINCT t.LOAD_ORIG_...
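The trick most answers converge on is to collect structs of (sort key, value), sort the array, then project the value back out. A PySpark sketch of that idea with placeholder column names (the same builtins — array_sort, transform — are available in pure SQL too):

from pyspark.sql import functions as F

ordered = (
    df.groupBy("id")
      .agg(F.array_sort(F.collect_list(F.struct("sale_date", "sale_price"))).alias("pairs"))
      # array_sort orders by the struct's first field, so values come out date-ordered
      .withColumn("prices_in_order", F.expr("transform(pairs, x -> x.sale_price)"))
)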

2 More Replies
AsfandQ
by New Contributor III
  • 23437 Views
  • 7 replies
  • 6 kudos

Resolved! Delta tables: Cannot set default column mapping mode to "name" in Python for delta tables

Hello, I am trying to write Delta files for some CSV data. When I do
csv_dataframe.write.format("delta").save("/path/to/table.delta")
I get: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Having look...

Latest Reply
Personal1
New Contributor II
  • 6 kudos

I still get the error whichever method I try. The column names with spaces throw [DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES] Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
df1.write.format("delta") \
    .mo...
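If enabling column mapping keeps failing, renaming the columns before the write sidesteps the error entirely; a minimal sketch (the underscore replacement is a choice, and the save path is a placeholder):

import re

# replace every character Delta rejects ( ,;{}()\n\t=) with an underscore
clean = df1.toDF(*[re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df1.columns])
clean.write.format("delta").mode("overwrite").save("/path/to/table")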

6 More Replies
del1000
by New Contributor III
  • 23009 Views
  • 8 replies
  • 3 kudos

Resolved! Is it possible to pass a job's parameters through to variables?

Scenario: I tried to run notebook_primary as a job with the same parameter map. This notebook is the orchestrator for notebooks_sec_1, notebooks_sec_2, notebooks_sec_3, and so on. I run them with the dbutils.notebook.run(path, timeout, arguments) function. So ho...

Latest Reply
nnalla
New Contributor II
  • 3 kudos

I am using getCurrentBindings(), but it returns an empty dictionary even though I passed parameters. I am running it in a scheduled workflow job.
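If getCurrentBindings() comes back empty in a scheduled job, reading the declared job parameters through widgets and forwarding them explicitly is the more dependable route; a sketch with placeholder parameter names:

# names must match the job's defined parameters — placeholders here
param_names = ["run_date", "env"]
params = {name: dbutils.widgets.get(name) for name in param_names}

# forward the same map to each child notebook
dbutils.notebook.run("notebooks_sec_1", 600, params)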

7 More Replies