Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Taha_Hussain
by Databricks Employee
  • 4074 Views
  • 1 reply
  • 5 kudos

Ask your technical questions at Databricks Office Hours! November 16 - 8:00 AM - 9:00 AM PT: Register Here. November 30 - 11:00 AM - 12:00 PM PT: Regist...

Ask your technical questions at Databricks Office Hours! November 16 - 8:00 AM - 9:00 AM PT: Register Here. November 30 - 11:00 AM - 12:00 PM PT: Register Here. Databricks Office Hours connects you directly with experts to answer all your Databricks quest...

Latest Reply
Taha_Hussain
Databricks Employee
  • 5 kudos

Q&A Recap from 11/30 Office Hours. Q: What is the downside of using z-ordering and auto optimize? It seems like there could be a tradeoff with writing small files (whereas it is good at reading a larger file), is that true? A: By default, Delta Lake on ...

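The recap above touches on the z-ordering / auto optimize tradeoff. As a minimal sketch (table and column names are hypothetical placeholders, not from the thread), this is what running OPTIMIZE with Z-ordering and enabling auto optimize per table looks like:

```python
# Compact a Delta table and co-locate rows by a commonly filtered column.
# `events` and `event_date` are hypothetical placeholders.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")

# Auto optimize is enabled per table; the tradeoff the recap mentions is
# extra work at write time in exchange for fewer small files to read later.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```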
Ancil
by Contributor II
  • 23664 Views
  • 11 replies
  • 1 kudos

Can anyone please suggest how we can effectively loop through a PySpark DataFrame?

Scenario: I have a dataframe with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your dataframe is small, as you said, only about 1000 rows, you may consider using pandas. Thanks.

10 More Replies
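A minimal sketch of the pandas approach the reply suggests, assuming hypothetical column names `file_path` and `result`. For roughly 1000 rows, collecting to the driver and writing one file per row is usually acceptable:

```python
# Collect the small DataFrame to the driver as pandas, then write row by row.
pdf = df.select("file_path", "result").toPandas()

for row in pdf.itertuples(index=False):
    # dbutils.fs.put writes a string to a path; the third argument overwrites.
    dbutils.fs.put(row.file_path, row.result, True)
```

For much larger DataFrames, a distributed alternative is df.foreachPartition with a writer function, which avoids pulling everything to the driver.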
pabloaus
by New Contributor III
  • 7489 Views
  • 2 replies
  • 4 kudos

Resolved! How to read a SQL file from a Repo into a string

I am trying to read a SQL file in the repo into a string. I have tried: with open("/Workspace/Repos/xx@***.com//file.sql", "r") as queryFile: queryText = queryFile.read() And I get the following error: [Errno 1] Operation not permitted: '/Workspace/Repos/***@*...

Latest Reply
Senthil1
Databricks Partner
  • 4 kudos

I checked on my Unity Catalog-enabled cluster; I am able to access the Repos file to read and display it.

1 More Replies
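A sketch of the working pattern, assuming a hypothetical repo path. As the reply notes, reading workspace files with the built-in open() depends on the cluster; it worked on a Unity Catalog-enabled cluster:

```python
# Hypothetical path; replace with your actual Repos path.
repo_sql_path = "/Workspace/Repos/user@example.com/my-repo/file.sql"

with open(repo_sql_path, "r") as query_file:
    query_text = query_file.read()

df = spark.sql(query_text)  # run the loaded SQL against the current session
```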
shamly
by New Contributor III
  • 9825 Views
  • 9 replies
  • 2 kudos

Resolved! Need to remove double-dagger delimiter from a CSV using Databricks

My CSV data looks like this: ‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡ I tried this code: dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡,").csv(f"/mnt/data/path/datafile.csv") But I...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

Hi @shamly pt, I took a somewhat different approach, since I guess no one can be sure of the encoding of the data you showed. Sample data I took: ‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡ ‡‡1‡‡,‡‡121212‡‡,‡‡R‡‡,‡‡1.0A‡‡,‡‡NA‡‡...

8 More Replies
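The accepted answer is truncated above; one possible approach (a sketch under the assumption that the ‡‡ marks are pure quoting noise, not the thread's verbatim solution) is to read the file as plain text, strip the daggers, then split on commas:

```python
from pyspark.sql.functions import col, regexp_replace, split

# Read each line as raw text (single `value` column), then strip the ‡ marks.
raw = spark.read.text("/mnt/data/path/datafile.csv")
cleaned = raw.select(regexp_replace(col("value"), "‡", "").alias("line"))

# Split into columns; header names taken from the sample data in the post.
header = ["companyId", "empId", "regionId", "companyVersion", "Question"]
parts = cleaned.select(split(col("line"), ",").alias("cols"))
df = parts.select(*[col("cols")[i].alias(name) for i, name in enumerate(header)])
df = df.filter(col("companyId") != "companyId")  # drop the header line
```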
Andrei_Radulesc
by Contributor III
  • 4278 Views
  • 2 replies
  • 0 kudos

Terraform can set ALL_PRIVILEGES and USE_CATALOG on catalogs for 'account users', but not SELECT or USE_SCHEMA

Only the GUI seems to allow SELECT and USE_SCHEMA 'account users' permissions on catalogs; Terraform gives me an error. Here is my Terraform config: resource "databricks_grants" "staging" { provider = databricks.workspace catalog = databricks_catalog....

Latest Reply
Pat
Esteemed Contributor
  • 0 kudos

Hi @Andrei Radulescu-Banu, which version of the provider are you using? I checked the GitHub repo; it should work: https://github.com/databricks/terraform-provider-databricks/blob/d65ef3518074a48e079080d94e1ab33a80bf7e0f/catalog/resource_grants.go#L1...

1 More Replies
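If a given provider version rejects those privileges, one workaround (not from the thread; a hedged sketch with a hypothetical catalog name) is to issue the grants directly as Unity Catalog SQL and let Terraform manage the rest:

```python
# Grant catalog-level privileges to the `account users` group via SQL.
# `staging` is a hypothetical catalog name.
for privilege in ("USE CATALOG", "USE SCHEMA", "SELECT"):
    spark.sql(f"GRANT {privilege} ON CATALOG staging TO `account users`")
```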
tom_shaffner
by New Contributor III
  • 16546 Views
  • 6 replies
  • 8 kudos

Resolved! Is there some form of enablement required to use Delta Live Tables (DLT)?

I'm trying to use Delta Live Tables, but if I import even the example notebooks I get a warning saying `ModuleNotFoundError: No module named 'dlt'`. If I try to install it via pip, it attempts to install a deep learning framework of some sort. I checked ...

Latest Reply
Insight6
New Contributor II
  • 8 kudos

Here's the solution I came up with... Replace `import dlt` at the top of your first cell with the following: try: import dlt # When run in a pipeline, this package will exist (no way to import it here) except ImportError: class dlt...

5 More Replies
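The reply's snippet is cut off, so here is a hedged reconstruction of the try/except stub pattern it describes; the stub body is an assumption, not the poster's verbatim code:

```python
try:
    import dlt  # only importable when the notebook runs inside a DLT pipeline
except ImportError:
    class dlt:  # minimal no-op stand-in so the notebook parses interactively
        @staticmethod
        def table(*args, **kwargs):
            # supports both @dlt.table and @dlt.table(name=..., comment=...)
            if args and callable(args[0]):
                return args[0]
            def decorator(func):
                return func
            return decorator
```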
dineshg
by New Contributor III
  • 5684 Views
  • 3 replies
  • 6 kudos

Resolved! pyspark - execute dynamically framed action statement stored in string variable

I need to execute a union statement which is framed dynamically and stored in a string variable. I framed the union statement, but I'm stuck executing it. Does anyone know how to execute a union statement stored in a string variable? I'm using p...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Dineshkumar Gopalakrishnan, Python's exec() function can be used to execute a Python statement, which in your case could be a PySpark union statement. Refer to the sample code snippet below for reference: df1 = spark.sparkContext.parallelize([(1, 2...

2 More Replies
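A runnable sketch of the exec() approach from the reply, plus a loop-free alternative; the DataFrames and statement string are hypothetical:

```python
from functools import reduce

df1 = spark.createDataFrame([(1, 2)], ["a", "b"])
df2 = spark.createDataFrame([(3, 4)], ["a", "b"])

# The dynamically framed statement, executed with exec(); it binds `result`
# into the current namespace.
stmt = "result = df1.union(df2)"
exec(stmt)

# Alternative without exec(): fold a list of DataFrames with reduce().
result = reduce(lambda left, right: left.union(right), [df1, df2])
```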
BearInTheWoods
by New Contributor III
  • 4256 Views
  • 1 reply
  • 4 kudos

Importing Azure SQL data into Databricks

Hi, I am looking at building a data warehouse using Databricks. Most of the data will be coming from Azure SQL, and we now have Azure SQL CDC enabled to capture changes. Also, I would like to import this without paying for additional connectors like Fi...

Latest Reply
ravinchi
New Contributor III
  • 4 kudos

@Bear Woods Hi! Were you able to create DLT tables using the CDC feature from sources like SQL tables? I'm in a similar situation: you need to leverage the apply_changes function and the create_streaming_live_table() function, but it required intermediate...

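A minimal sketch of how the two DLT calls the reply names fit together, assuming a hypothetical CDC feed table, key column, and sequencing column:

```python
import dlt
from pyspark.sql.functions import col

@dlt.view
def customers_cdc():
    # Hypothetical streaming source carrying the Azure SQL CDC rows.
    return spark.readStream.table("raw.customers_cdc")

dlt.create_streaming_live_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc",
    keys=["customer_id"],                 # hypothetical business key
    sequence_by=col("change_timestamp"),  # ordering column from the CDC feed
)
```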
g96g
by New Contributor III
  • 10345 Views
  • 8 replies
  • 0 kudos

Resolved! ADF pipeline fails when passing the parameter to databricks

I have a project where I have to read data from NETSUITE using an API. The Databricks notebook runs perfectly when I manually insert the table names I want to read from the source. I have a dataset (CSV) file in ADF with all the table names that I need to r...

Latest Reply
mcwir
Contributor
  • 0 kudos

Have you tried debugging the JSON payload of the ADF trigger? Maybe it conveys the table names incorrectly.

7 More Replies
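On the notebook side, ADF base parameters arrive as widgets; a minimal sketch of reading (and defaulting) one, with a hypothetical parameter name:

```python
# Declare the widget so interactive runs have a default; ADF overrides it.
dbutils.widgets.text("table_name", "")
table_name = dbutils.widgets.get("table_name")

if not table_name:
    raise ValueError("table_name was not supplied by the ADF trigger")
```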
Ramabadran
by New Contributor II
  • 18595 Views
  • 3 replies
  • 4 kudos

java.lang.NoClassDefFoundError: scala/Product$class

Hi, I am getting a "java.lang.NoClassDefFoundError: scala/Product$class" error while using Deequ version 1.0.5. Please suggest a fix for this problem or any workaround. Error: Py4JJavaError Traceback (most recent call last) <command-2625366351750561> in...

Latest Reply
mcwir
Contributor
  • 4 kudos

It seems like a Maven issue.

2 More Replies
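For context, `scala/Product$class` errors typically mean a jar built for Scala 2.11 (as Deequ 1.0.x is) is running on a Scala 2.12 runtime. A quick diagnostic sketch; the `_jvm` access path below is an internal py4j detail and may vary by runtime:

```python
# Print the cluster's Scala version to confirm a 2.11 vs 2.12 mismatch.
scala_version = spark.sparkContext._jvm.scala.util.Properties.versionString()
print(scala_version)  # e.g. "version 2.12.x" on recent Databricks runtimes
```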
tanin
by Contributor
  • 4898 Views
  • 4 replies
  • 7 kudos

Does anybody find unit tests on Dataset slow (much slower than RDD)? This is in Scala.

I profiled it, and it seems the slowness comes from Spark planning, especially for a more complex job (e.g. 100+ joins). Is there a way to speed it up (e.g. by disabling certain optimizations)?

Latest Reply
mcwir
Contributor
  • 7 kudos

I had a similar feeling recently.

3 More Replies
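Not from the thread, but a common way to trim per-test overhead is a stripped-down session config; a hedged sketch with starting-point values:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .config("spark.ui.enabled", "false")            # skip starting the web UI
    .config("spark.sql.shuffle.partitions", "1")    # tiny test data, tiny shuffles
    .config("spark.sql.adaptive.enabled", "false")  # less re-planning per query
    .getOrCreate()
)
```

For jobs with 100+ joins, checkpointing an intermediate DataFrame (df.checkpoint()) can also bound planning time by cutting the lineage the optimizer has to analyze.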
Merchiv
by New Contributor III
  • 6782 Views
  • 3 replies
  • 1 kudos

Resolved! How to use uuid in SQL merge into statement

I have a MERGE INTO statement that I use to update existing entries or create new entries in a dimension table based on a natural business key. When creating new entries, I would like to also create a unique UUID for that entry that I can use to crossr...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

You might want to look into an identity column, which is now possible in Delta Lake: https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html

2 More Replies
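A sketch of one way to do this with Spark SQL's uuid() function in the NOT MATCHED branch; table and column names are hypothetical. The reply's identity-column suggestion is the alternative if a numeric surrogate key suffices:

```python
spark.sql("""
    MERGE INTO dim_customer AS t
    USING updates AS s
      ON t.business_key = s.business_key
    WHEN MATCHED THEN
      UPDATE SET t.name = s.name
    WHEN NOT MATCHED THEN
      INSERT (id, business_key, name)
      VALUES (uuid(), s.business_key, s.name)
""")
```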
KVNARK
by Honored Contributor II
  • 2572 Views
  • 3 replies
  • 11 kudos

Is there any limitation on the number of SQL queries in the Databricks SQL workspace?

Is there any limitation on the number of SQL queries in the Databricks SQL workspace?

Latest Reply
Rajeev_Basu
Databricks Partner
  • 11 kudos

The default limit is documented as 1000, though I have never verified this.

2 More Replies