Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ancil
by Contributor II
  • 16530 Views
  • 11 replies
  • 1 kudos

Can anyone please suggest how we can effectively loop through a PySpark DataFrame?

Scenario: I have a DataFrame with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with the data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your DataFrame is small, as you said, only about 1000 rows, you may consider using pandas. Thanks.
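A minimal sketch of that pandas-based approach, assuming the DataFrame is named df and has hypothetical columns file_path and result:

# Collecting to pandas is acceptable here because the DataFrame is small (~1000 rows).
# "file_path" and "result" are assumed column names; adjust them to the real schema.
pdf = df.select("file_path", "result").toPandas()

for row in pdf.itertuples(index=False):
    # Write each row's result data to its own target file.
    with open(row.file_path, "w") as f:
        f.write(str(row.result))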

10 More Replies
pabloaus
by New Contributor III
  • 5678 Views
  • 2 replies
  • 4 kudos

Resolved! How to read a SQL file from a Repo into a string

I am trying to read a SQL file in the repo into a string. I have tried: with open("/Workspace/Repos/xx@***.com//file.sql", "r") as queryFile: queryText = queryFile.read() And I get the following error: [Errno 1] Operation not permitted: '/Workspace/Repos/***@*...

Latest Reply
Senthil1
Contributor
  • 4 kudos

I checked on my Unity Catalog-enabled cluster; I am able to read and display the file in Repos.
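A minimal sketch of reading a Repo file into a string and running it; the path is illustrative, and it assumes the file contains a single SQL statement:

# Illustrative path; replace with the real file location under /Workspace/Repos.
sql_path = "/Workspace/Repos/user@example.com/my-repo/file.sql"

with open(sql_path, "r") as query_file:
    query_text = query_file.read()

# Assumes the file contains a single SQL statement.
result_df = spark.sql(query_text)
result_df.show()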

1 More Replies
shamly
by New Contributor III
  • 5782 Views
  • 9 replies
  • 2 kudos

Resolved! Need to remove a double-dagger delimiter from a CSV using Databricks

My CSV data looks like this: ‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡ I tried this code: dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡,").csv(f"/mnt/data/path/datafile.csv") But I...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

Hi @shamly pt, I took a somewhat different approach, since I guess no one can be sure of the encoding of the data you showed. Sample data I took: ‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡ ‡‡1‡‡,‡‡121212‡‡,‡‡R‡‡,‡‡1.0A‡‡,‡‡NA‡‡...
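A minimal sketch of one way to strip the ‡‡ quoting, assuming the sample layout above; the path and column list are illustrative:

from pyspark.sql import functions as F

# Read the raw lines, remove the double-dagger characters, then split on commas.
raw = spark.read.text("/mnt/data/path/datafile.csv")
cleaned = raw.select(F.regexp_replace("value", "‡", "").alias("line"))

cols = ["companyId", "empId", "regionId", "companyVersion", "Question"]
parts = F.split("line", ",")
parsed = cleaned.select(*[parts.getItem(i).alias(c) for i, c in enumerate(cols)])

# Drop the header line, which still contains the literal column names.
parsed = parsed.filter(F.col("companyId") != "companyId")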

8 More Replies
Andrei_Radulesc
by Contributor III
  • 2503 Views
  • 2 replies
  • 0 kudos

Terraform can set ALL_PRIVILEGES and USE_CATALOG on catalogs for 'account users', but not SELECT or USE_SCHEMA

Only the GUI seems to allow SELECT and USE_SCHEMA permissions for 'account users' on catalogs. Terraform gives me an error. Here is my Terraform config: resource "databricks_grants" "staging" { provider = databricks.workspace catalog = databricks_catalog....

Latest Reply
Pat
Honored Contributor III
  • 0 kudos

Hi @Andrei Radulescu-Banu, which version of the provider are you using? I checked the GitHub repo and it should work: https://github.com/databricks/terraform-provider-databricks/blob/d65ef3518074a48e079080d94e1ab33a80bf7e0f/catalog/resource_grants.go#L1...

1 More Replies
tom_shaffner
by New Contributor III
  • 12013 Views
  • 6 replies
  • 8 kudos

Resolved! Is there some form of enablement required to use Delta Live Tables (DLT)?

I'm trying to use Delta Live Tables, but even if I import the example notebooks I get `ModuleNotFoundError: No module named 'dlt'`. If I try to install it via pip, it attempts to install a deep learning framework of some sort. I checked ...

Latest Reply
Insight6
New Contributor II
  • 8 kudos

Here's the solution I came up with... Replace `import dlt` at the top of your first cell with the following: try: import dlt # When run in a pipeline, this package will exist (no way to import it here) except ImportError: class dlt...
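A slightly fuller sketch of that fallback pattern; the stub class name is illustrative and only covers the @dlt.table decorator:

try:
    import dlt  # Importable only when the notebook runs inside a DLT pipeline.
except ImportError:
    class _DltStub:
        # Minimal stand-in so the notebook can still be edited and run interactively.
        def table(self, *args, **kwargs):
            # Support both @dlt.table and @dlt.table(name=...) usage.
            if len(args) == 1 and callable(args[0]) and not kwargs:
                return args[0]
            def _passthrough(fn):
                return fn
            return _passthrough

    dlt = _DltStub()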

5 More Replies
dineshg
by New Contributor III
  • 3826 Views
  • 3 replies
  • 6 kudos

Resolved! pyspark - execute dynamically framed action statement stored in string variable

I need to execute a union statement which is framed dynamically and stored in a string variable. I framed the union statement, but I am stuck executing it. Does anyone know how to execute a union statement stored in a string variable? I'm using p...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Dineshkumar Gopalakrishnan, Python's exec() function can be used to execute a Python statement, which in your case could be the PySpark union statement. Refer to the sample code snippet below for reference: df1 = spark.sparkContext.parallelize([(1, 2...
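A small sketch of the exec() approach with illustrative DataFrames; passing an explicit namespace keeps the dynamically created name easy to retrieve:

df1 = spark.createDataFrame([(1, "a")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b")], ["id", "val"])

# The union statement arrives as a plain string, e.g. built from a loop over table names.
union_stmt = "result_df = df1.union(df2)"

namespace = {"df1": df1, "df2": df2}
exec(union_stmt, namespace)          # Executes the statement inside the namespace dict.
namespace["result_df"].show()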

2 More Replies
BearInTheWoods
by New Contributor III
  • 2699 Views
  • 1 reply
  • 4 kudos

Importing Azure SQL data into Databricks

Hi, I am looking at building a data warehouse using Databricks. Most of the data will come from Azure SQL, and we now have Azure SQL CDC enabled to capture changes. I would also like to import this without paying for additional connectors like Fi...

Latest Reply
ravinchi
New Contributor III
  • 4 kudos

Hi @Bear Woods! Were you able to create DLT tables using the CDC feature from sources like SQL tables? I'm in a similar situation: you need to leverage the apply_changes() function and the create_streaming_live_table() function, but it requires an intermediate...
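A minimal sketch of that apply_changes() flow; the table, key and sequence column names are assumptions, and create_streaming_table is the newer name for the create_streaming_live_table function mentioned above:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="customers_cdc_raw")
def customers_cdc_raw():
    # Intermediate table holding the CDC feed landed from Azure SQL.
    return spark.readStream.table("landing.customers_changes")

dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_raw",
    keys=["customer_id"],                    # business key column(s)
    sequence_by=F.col("change_timestamp"),   # ordering column from the CDC feed
)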

g96g
by New Contributor III
  • 6438 Views
  • 8 replies
  • 0 kudos

Resolved! ADF pipeline fails when passing a parameter to Databricks

I have a project where I have to read data from NetSuite using its API. The Databricks notebook runs perfectly when I manually insert the table names I want to read from the source. I have a dataset (CSV) file in ADF with all the table names that I need to r...

Latest Reply
mcwir
Contributor
  • 0 kudos

Have you tried debugging the JSON payload of the ADF trigger? Maybe it conveys the table names incorrectly.
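For reference, a minimal sketch of how the notebook side usually picks up an ADF base parameter; the widget name table_name is an assumption and must match the parameter key set in the ADF Notebook activity:

# Declare the widget with a default so the notebook also runs interactively.
dbutils.widgets.text("table_name", "")
table_name = dbutils.widgets.get("table_name")

if not table_name:
    raise ValueError("No table name was passed from the ADF pipeline")

print(f"Fetching NetSuite table: {table_name}")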

7 More Replies
Ramabadran
by New Contributor II
  • 11604 Views
  • 3 replies
  • 4 kudos

java.lang.NoClassDefFoundError: scala/Product$class

Hi, I am getting a "java.lang.NoClassDefFoundError: scala/Product$class" error while using Deequ version 1.0.5. Please suggest a fix for this problem or any workaround. Error: Py4JJavaError Traceback (most recent call last) <command-2625366351750561> in...

Latest Reply
mcwir
Contributor
  • 4 kudos

It seems like a Maven issue.

2 More Replies
tanin
by Contributor
  • 2846 Views
  • 4 replies
  • 7 kudos

Does anybody feel that unit tests on Dataset are slow (much slower than RDD)? This is in Scala.

I profiled it, and it seems the slowness comes from Spark planning, especially for more complex jobs (e.g. 100+ joins). Is there a way to speed it up (e.g. by disabling certain optimizations)?

Latest Reply
mcwir
Contributor
  • 7 kudos

I had a similar feeling recently.

3 More Replies
Merchiv
by New Contributor III
  • 3867 Views
  • 3 replies
  • 1 kudos

Resolved! How to use a UUID in a SQL MERGE INTO statement

I have a MERGE INTO statement that I use to update existing entries or create new entries in a dimension table based on a natural business key. When creating new entries, I would also like to create a unique UUID for each entry that I can use to crossr...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

You might want to look into an identity column, which is now possible in Delta Lake: https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html
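A minimal sketch of the identity-column route, with illustrative table and column names and an assumed "updates" source view; the GENERATED ALWAYS AS IDENTITY column is filled in automatically for rows inserted by the MERGE, so it is left out of the INSERT column list:

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
        business_key STRING,
        name STRING
    ) USING DELTA
""")

spark.sql("""
    MERGE INTO dim_customer AS t
    USING updates AS s
    ON t.business_key = s.business_key
    WHEN MATCHED THEN UPDATE SET t.name = s.name
    WHEN NOT MATCHED THEN INSERT (business_key, name) VALUES (s.business_key, s.name)
""")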

2 More Replies
KVNARK
by Honored Contributor II
  • 1762 Views
  • 3 replies
  • 11 kudos

Is there any limit on the number of SQL queries in the Databricks SQL workspace?

Is there any limit on the number of SQL queries in the Databricks SQL workspace?

Latest Reply
Rajeev_Basu
Contributor III
  • 11 kudos

The default limit is documented as 1000, though I have never verified this.

2 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 1797 Views
  • 2 replies
  • 9 kudos

Kafka integration with Databricks

Hi all, I want to integrate Kafka with Databricks. If anyone can share any docs or code, it would help me a lot. Thanks in advance.

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 9 kudos

This is the code that I am using to read from Kafka: inputDF = (spark .readStream .format("kafka") .option("kafka.bootstrap.servers", host) .option("kafka.ssl.endpoint.identification.algorithm", "https") .option("kafka.sasl.mechanism", "PLAIN") .option("ka...
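A fuller sketch of that readStream pattern; the broker host, credentials and topic name are placeholders, and the shaded JAAS class name is the one typically used on Databricks clusters:

input_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-host:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
        'required username="API_KEY" password="API_SECRET";',
    )
    .option("subscribe", "my_topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as bytes; cast to strings before downstream parsing.
messages = input_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")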

1 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.
