Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

DanSiegel
by New Contributor
  • 716 Views
  • 0 replies
  • 0 kudos

Access an external table from another workspace

What's the best way to add an external table so another cluster/workspace can access an existing external table on S3? I need to redeploy my workspace into a new VPC, so I am not expecting any collisions of the warehouses. Is it as simple as adding ...

CalvinCalvert_
by New Contributor
  • 629 Views
  • 0 replies
  • 0 kudos

How does FSCK work and does it have any negative effects on subsequent notebook executions?

In my environment, there are 3 groups of notebooks that run on their own schedules, however they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, various notebooks from each of the 3 groups fail wit...

MohitAnchlia
by New Contributor II
  • 1017 Views
  • 0 replies
  • 1 kudos

Change AWS storage setting and account

I am seeing super weird behaviour in Databricks. We initially configured the following:
1. Account X in Account Console -> AWS Account: arn:aws:iam::X:role/databricks-s3
2. We set up databricks-s3 as the S3 bucket in Account Console -> AWS Storage
3. W...

TrinaDe
by New Contributor II
  • 3842 Views
  • 1 replies
  • 1 kudos

How can we join two PySpark DataFrames side by side (without using a join, equivalent to pd.concat() in pandas)? I am trying to combine two extremely large DataFrames, each on the order of 50 million rows.

My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2: The code I have tried is the following: If I print the top 5 rows of new_df2, it gives the output as expected but I cannot pri...

Latest Reply
TrinaDe
New Contributor II
  • 1 kudos

The code in a more legible format:

AnandNair
by New Contributor
  • 750 Views
  • 0 replies
  • 0 kudos

Load an explicit schema from an external metadata.csv file or a json file for reading csv's into dataframe

Hi, I have a metadata CSV file which contains column names and datatypes, such as Colm1: INT, Colm2: String. I can also get the same in a JSON format as shown: I can store this on ADLS. How can I convert this into a schema like "Myschema" that I can ...

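One way to do this is to translate the metadata into a DDL schema string, which spark.read.schema() accepts directly. A sketch assuming the JSON metadata maps column names to type names as in the post's example (the type map below is an assumption and covers only a few types):

```python
import json

# Hypothetical metadata, mirroring the Colm1: INT / Colm2: String example.
metadata_json = '{"Colm1": "INT", "Colm2": "String"}'

def metadata_to_ddl(metadata: dict) -> str:
    # Map the metadata's type names onto Spark SQL type names.
    type_map = {"int": "INT", "string": "STRING", "double": "DOUBLE", "date": "DATE"}
    cols = [f"{name} {type_map[dtype.lower()]}" for name, dtype in metadata.items()]
    return ", ".join(cols)

ddl = metadata_to_ddl(json.loads(metadata_json))
# ddl == "Colm1 INT, Colm2 STRING"
```

The resulting string can be passed straight to the reader, e.g. spark.read.schema(ddl).csv("..."), so the schema file never has to be hand-converted into a StructType.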
Devaraj
by New Contributor
  • 3190 Views
  • 0 replies
  • 0 kudos

Not able to fetch data from Simba Spark JDBC Driver

We are getting below error when we tried to set the date in preparedstatement using Simba Spark Jdbc Driver. Exception: Query execution failed: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.h...

twotwoiscute
by New Contributor
  • 1486 Views
  • 0 replies
  • 0 kudos

PySpark pandas_udf slower than single thread

I used @pandas_udf to write a function to speed up the process (parsing XML files) and then compared its speed with a single-threaded version. Surprisingly, using @pandas_udf is two times slower than the single-threaded code. And the number of XML files I need to p...

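A pandas UDF only pays off when the per-batch work is large relative to the Arrow serialization overhead and the DataFrame has enough partitions to keep all cores busy; per-row Python work such as XML parsing is not vectorized, so the overhead can dominate. A sketch of the batch-function shape (column and element names are hypothetical):

```python
import xml.etree.ElementTree as ET

import pandas as pd

def extract_title(batch: pd.Series) -> pd.Series:
    # Plain pandas batch function: parse each XML document in the batch
    # and pull out the text of its <title> element.
    return batch.map(lambda s: ET.fromstring(s).findtext("title"))

# On a cluster the same function can be wrapped so Spark ships whole
# Arrow batches to each worker (hypothetical column name):
#
# from pyspark.sql.functions import pandas_udf
# extract_title_udf = pandas_udf(extract_title, returnType="string")
# df = df.repartition(64)  # too few partitions is a common cause of no speedup
# df = df.withColumn("title", extract_title_udf("xml_payload"))
```

Keeping the batch function importable and testable on plain pandas, as above, also makes it easy to measure how much of the runtime is the parsing itself versus Spark overhead.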
User16776430979
by New Contributor III
  • 1190 Views
  • 0 replies
  • 1 kudos

Repos file size limit - Is it possible to clone a specific branch into Repos?

We refactored our codebase into another branch of our existing repo and consolidated the files so that they should be usable within the Databricks Repos size/file limitations. However, even though the new branch is smaller, I am still getting an err...

User16752239289
by Valued Contributor
  • 1503 Views
  • 1 replies
  • 1 kudos

Resolved! Failed to add S3 init script in job cluster

I use the below payload to submit my job, which includes an init script saved on S3. The instance profile and init script worked on an interactive cluster, but when I move to a job cluster the init script cannot be configured. { "new_cluster": { "spar...

Latest Reply
User16752239289
Valued Contributor
  • 1 kudos

It is because the region field is missing. For an init script saved in S3, the region field is required. The init script section should look like this: "init_scripts": [ { "s3": { "destination": "s3://<my bucket>...

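For reference, a complete hypothetical init_scripts block with the region field included (bucket, script path, and region are placeholders, not values from the original thread):

```json
"init_scripts": [
  {
    "s3": {
      "destination": "s3://my-bucket/scripts/install-libs.sh",
      "region": "us-west-2"
    }
  }
]
```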
User16790091296
by Contributor II
  • 2672 Views
  • 1 replies
  • 0 kudos

Notebook path can't be in DBFS?

Some of us are working with IDEs and trying to deploy notebook (.py) files to DBFS. The problem I have noticed is that when configuring jobs, those paths are not recognized. notebook_path: If I use this: dbfs:/artifacts/client-state-vector/0.0.0/bootstrap...

Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

The issue is that the Python file is saved under DBFS, not as a workspace notebook. When you give /artifacts/client-state-vector/0.0.0/bootstrap.py, the workspace will search for the notebook (a Python file in this case) under the folder that is under Workspace t...

User16826994223
by Honored Contributor III
  • 929 Views
  • 1 replies
  • 0 kudos

Is it possible that only a particular cluster have only access to a s3 bucket or folder in s3

Hi, I want to set up a cluster and give access to it to some users only; only those users on that particular cluster should have access to read from and write to the bucket. That particular bucket is not mounted on the workspace. Is th...

Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

Yes, you can set up an instance profile that can access the S3 bucket and then grant only certain users the privilege to use that instance profile. For more details, you can check here

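The instance-profile approach can be paired with cluster-level permissions so that only the intended group may attach to the instance-profile-backed cluster. A hypothetical request body for the Permissions API (PATCH /api/2.0/permissions/clusters/&lt;cluster-id&gt;; the group name is a placeholder):

```json
{
  "access_control_list": [
    {
      "group_name": "s3-bucket-users",
      "permission_level": "CAN_ATTACH_TO"
    }
  ]
}
```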
