Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

alonisser
by Contributor
  • 2354 Views
  • 2 replies
  • 1 kudos

Resolved! Accessing confluent schema registry from databricks with scala fails with 401 (just for scala, not python, just in databricks)

Note, I've tested with the same connection variables:
  • locally with Scala - works (via the same prod schema registry)
  • in the cluster with Python - works
  • in the cluster with Scala - fails with 401 auth error
def setupSchemaRegistry(schemaRegistryUrl: String...

Latest Reply
alonisser
Contributor
  • 1 kudos

Found the issue: it's the uber package mangling some dependency resolution, which I fixed. Another issue is that currently you can't use the 6.* branch of the Confluent schema registry client in Databricks, because the Avro version is different from the one su...

1 More Replies
kjoth
by Contributor II
  • 17321 Views
  • 5 replies
  • 5 kudos

Resolved! Databricks default python libraries list & version

We are using Databricks. How do we know which libraries are installed by default in Databricks, and which versions? I have run pip list, but couldn't find pyspark in the returned list.

Latest Reply
jose_gonzalez
Databricks Employee
  • 5 kudos

Hi @karthick J​, if you would like to see all the libraries installed in your cluster and their versions, I recommend checking the "Environment" tab. There you will be able to find all the libraries installed in your cluster. Please follow t...
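For reference, a minimal programmatic complement to the Environment tab, run from a notebook cell (assumes a recent Databricks runtime; pyspark ships with the runtime itself rather than via pip, which may be why it does not show up in pip list):

import importlib.metadata
# List every installed Python package with its version.
for dist in sorted(importlib.metadata.distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(dist.metadata["Name"], dist.version)
# pyspark is provided by the runtime, so check it directly:
import pyspark
print("pyspark", pyspark.__version__)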

4 More Replies
Erik
by Valued Contributor III
  • 5121 Views
  • 6 replies
  • 7 kudos

Databricks query performance when filtering on a column correlated to the partition-column

(This is a copy of a question I asked on stackoverflow here, but maybe this community is a better fit for the question.) Setting: Delta Lake, Databricks SQL compute used by Power BI. I am wondering about the following scenario: We have a column `timest...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 7 kudos

In the query I would first filter by date (generated from the timestamp we want to query) and then by the exact timestamp, so the query benefits from partitioning.
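A minimal sketch of that pattern in PySpark, assuming a Delta table partitioned by a `date` column derived from `timestamp` (table and column names are illustrative):

from pyspark.sql import functions as F
ts = "2021-11-01 12:34:56"
df = (spark.table("events")
      .filter(F.col("date") == F.to_date(F.lit(ts)))            # partition pruning happens here
      .filter(F.col("timestamp") == F.to_timestamp(F.lit(ts)))) # exact match within the partition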

5 More Replies
BradCliQ
by New Contributor II
  • 2783 Views
  • 2 replies
  • 2 kudos

Resolved! Clean up of residual AWS resources when deleting a DB workspace

When deleting a workspace from the Databricks Accounts Console, I noticed the AWS resources (VPC, NAT, etc.) are not removed. Should they be? And if not, is there a clean/simple way of cleaning up the residual AWS resources?

Latest Reply
BradCliQ
New Contributor II
  • 2 kudos

Thank you Prabakar - that's what I figured but didn't know if there was documentation on resource cleanup. I'll just go through and find everything the CF stack created and remove them. Regards, Brad

1 More Replies
omsas
by New Contributor
  • 2558 Views
  • 2 replies
  • 0 kudos

How to add Columns for Automatic Fill on Pandas Python

1. I have data x; I would like to create a new column with the condition that the values are 1, 2 or 3.
2. The name of the column is SHIFT, where this SHIFT column will be filled automatically if the TIME_CREATED column meets the conditions.
3. The conditi...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

You can do something like this in pandas. Note there could be a more performant way to do this too.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4]})
df.head()
> a
> 0 1
> 1 2
> 2 3
> 3 4
conditions = [(df['a'] <=2...
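The reply above is truncated in this listing; a complete, runnable version of the same np.select pattern (the thresholds are illustrative, mirroring the 1/2/3 SHIFT values from the question):

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# np.select picks the first matching condition, row by row.
conditions = [df['a'] <= 2, df['a'] == 3, df['a'] > 3]
choices = [1, 2, 3]
df['SHIFT'] = np.select(conditions, choices, default=0)
print(df)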

1 More Replies
SQLArchitect
by New Contributor
  • 1480 Views
  • 1 reply
  • 1 kudos

Writing Records Failing Constraint Requirements to Separate Table when using Delta Live Tables

Are there any plans / capabilities in place or approaches people are using for writing (logging) records failing constraint requirements to separate tables when using Delta Live Tables? Also, are there any plans / capabilities in place or approaches ...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 1 kudos

According to the language reference documentation, I do not believe quarantining records is possible right now out of the box. But there are a few workarounds under the current functionality. Create a second table with the inverse of the expectations...
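A minimal sketch of that "second table with the inverse of the expectations" workaround in a Delta Live Tables Python notebook (the rule, table names, and source are all placeholders):

import dlt
from pyspark.sql import functions as F
RULE = "id IS NOT NULL"  # illustrative constraint
@dlt.table
@dlt.expect_or_drop("valid_id", RULE)
def clean_records():
    return spark.table("raw_records")  # hypothetical source table
@dlt.table
def quarantined_records():
    # Inverse of the expectation: keep only the rows that fail the rule.
    return spark.table("raw_records").filter(F.expr(f"NOT ({RULE})"))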

Sandesh87
by New Contributor III
  • 3079 Views
  • 1 reply
  • 0 kudos

dbutils.secrets.get- NoSuchElementException: None.get

The below code executes a 'get' API method to retrieve objects from S3 and write to the data lake. The problem arises when I use dbutils.secrets.get to get the keys required to establish the connection to S3: my_dataframe.rdd.foreachPartition(partition ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Howdy @Sandesh Puligundla​ - Thank you for your question. Thank you for your patience. I'd like to give this a bit longer to see how the community responds. Hang tight!
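For reference, the usual cause of NoSuchElementException in this situation is that dbutils is only available on the driver, not on executors. A minimal sketch of the common workaround, resolving the secrets on the driver so plain string values are captured by the closure (scope and key names are placeholders):

access_key = dbutils.secrets.get(scope="my-scope", key="s3-access-key")
secret_key = dbutils.secrets.get(scope="my-scope", key="s3-secret-key")
def handle_partition(rows):
    for row in rows:
        pass  # use access_key / secret_key here to talk to S3
my_dataframe.rdd.foreachPartition(handle_partition)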

Andyfcx
by New Contributor
  • 2525 Views
  • 2 replies
  • 2 kudos

Resolved! Is it possible to clone a private repository and use it in databricks Repos?

As the title says, I need to clone code from my private git repo and use it in my notebook. I do something like:
def cmd(command, cwd=None):
    process = subprocess.Popen(command.split(), stdout=subprocess.PIPE, cwd=cwd)
    output, error = process.communicate(...

Latest Reply
Prabakar
Databricks Employee
  • 2 kudos

Hi @Andy Huang​, yes, you can do it if it's accessible from Databricks. Please refer to: https://docs.databricks.com/repos.html#repos-for-git-integration
Databricks does not support private Git servers, such as Git servers behind a VPN.
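If Repos integration is not an option, one sketch along the lines of the question's subprocess approach is to clone with a personal access token fetched from a secret scope rather than hard-coded (all names below are placeholders, not a recommended pattern for production):

import subprocess
token = dbutils.secrets.get(scope="git", key="pat")
url = f"https://{token}@github.com/my-org/my-private-repo.git"
subprocess.run(["git", "clone", url, "/tmp/my-private-repo"], check=True)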

1 More Replies
Personal1
by New Contributor II
  • 3831 Views
  • 2 replies
  • 2 kudos

Resolved! Understanding Partitions in Spark Local Mode

I have a few fundamental questions in Spark 3 while running a simple Spark app on my local Mac machine (with 6 cores in total). Please help. local[*] runs my Spark application in local mode with all the cores present on my Mac, correct? It also means tha...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

That is a lot of questions in one topic. Let's give it a try:
[1] this all depends on the values of the concerning parameters and the program you run (think joins, unions, repartition etc.)
[2] spark.default.parallelism is by default the number of cores *...
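A minimal sketch for verifying these defaults yourself in local mode (values shown in comments are the usual defaults, not guarantees):

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("partitions-demo").getOrCreate()
sc = spark.sparkContext
print(sc.defaultParallelism)                           # typically the number of local cores
print(sc.parallelize(range(100)).getNumPartitions())   # follows defaultParallelism
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200 unless overridden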

1 More Replies
philip
by New Contributor
  • 6560 Views
  • 2 replies
  • 2 kudos

Resolved! current date as default in a widget while scheduling the notebook

I have a scheduled notebook. Can I keep the current date as the default in a widget whenever the notebook runs? I also need the flexibility to change the widget value to any other date for the ad hoc runs that I do.

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

So building on the answer of Hubert:
from datetime import date
date_for_widget = date.today()
So if you use date_for_widget as your default value, you are there. And of course you can fill this date_for_widget variable with anything you want. You can even fetch...
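Putting that together with a widget, a minimal sketch (the widget name is a placeholder; the default is evaluated at run time, so a scheduled run picks up the current date automatically while an ad hoc run can override it):

from datetime import date
dbutils.widgets.text("run_date", str(date.today()))
run_date = dbutils.widgets.get("run_date")
print(run_date)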

1 More Replies
MudassarA
by New Contributor II
  • 14692 Views
  • 4 replies
  • 1 kudos

Resolved! How to fix TypeError: __init__() got an unexpected keyword argument 'max_iter'?

# Create the model using sklearn (don't worry about the parameters for now):
model = SGDRegressor(loss='squared_loss', verbose=0, eta0=0.0003, max_iter=3000)
# Train/fit the model to the train-part of the dataset:
model.fit(X_train, y_train)
ERROR: Typ...

Latest Reply
Fantomas_nl
New Contributor II
  • 1 kudos

Replacing max_iter with n_iter resolves the error. Thanks! It is a bit unusual to expect errors like this with this type of solution from Microsoft. As if it could not be prevented.
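For context, which keyword works depends on the installed scikit-learn version: very old releases (before 0.19) used n_iter, while modern releases use max_iter. A minimal sketch to check before choosing:

import sklearn
from sklearn.linear_model import SGDRegressor
print(sklearn.__version__)
# On scikit-learn >= 0.19 this is the correct spelling:
model = SGDRegressor(verbose=0, eta0=0.0003, max_iter=3000)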

3 More Replies
Artem_Y
by Databricks Employee
  • 2290 Views
  • 1 reply
  • 2 kudos

Show all distinct values per column in dataframe

Problem Statement: I want to see all the distinct values per column for my entire table, but a SQL query with a collect_set() on every column is not dynamic and too long to write. Use this code to show th...
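The original code snippet is truncated in this listing; a minimal sketch of one way to express the idea in PySpark (the table name is a placeholder):

from pyspark.sql import functions as F
df = spark.table("my_table")
# Build one collect_set aggregation per column dynamically instead of writing them out by hand.
distinct_per_col = df.agg(*[F.collect_set(c).alias(c) for c in df.columns])
display(distinct_per_col)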

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Artem Yevtushenko​ - This is great! Thank you for sharing!

aimas
by New Contributor III
  • 7612 Views
  • 8 replies
  • 5 kudos

Resolved! error creating tables using UI

Hi, I try to create a table using the UI, but I keep getting the error "error creating table <table name> create a cluster first" even when I have a cluster already running. What is the problem?

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

Be sure that a cluster is selected (the arrow next to the database) and that at least the Default database is there.

7 More Replies
Orianh
by Valued Contributor II
  • 24221 Views
  • 11 replies
  • 10 kudos

Resolved! Read JSON files from the s3 bucket

Hello guys, I'm trying to read JSON files from the S3 bucket, but no matter what I try I get "Query returned no result", or, if I don't specify the schema, "unable to infer a schema". I tried to mount the S3 bucket; that still doesn't work. Here is some code th...

Latest Reply
Prabakar
Databricks Employee
  • 10 kudos

Please refer to the doc that helps you to read JSON. If you are getting this error, the problem is likely with the JSON schema, so please validate it. As a test, create a simple JSON file (you can get one on the internet), upload it to your S3 bucket, and ...
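A minimal sketch of reading JSON from S3 with an explicit schema, which sidesteps "unable to infer a schema", plus the multiLine option for pretty-printed files (field names and the S3 path are placeholders):

from pyspark.sql.types import StructType, StructField, StringType, LongType
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
df = (spark.read
      .schema(schema)
      .option("multiLine", "true")
      .json("s3://my-bucket/path/"))
df.show()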

10 More Replies
Data_Bricks1
by New Contributor III
  • 3731 Views
  • 7 replies
  • 0 kudos

Data from 10 BLOB containers and multiple hierarchical folders (every-day and every-hour folders) in each container to a Delta Lake table in Parquet format - incremental loading of the latest data only, inserts, no updates

I am able to load data for a single container by hard coding, but not able to load from multiple containers. I used a for loop, but the data frame ends up with only the last container's last folder records. One more issue: I have to flatten the data, and when I ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

For sure the function (def) should be declared outside the loop; move it after importing the libraries. The logic is a bit complicated, so you need to debug it using display(Flatten_df2) (or .show()) and validate the JSON after each iteration (using break or sleep etc.).
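A minimal sketch of that structure, with the function declared once and each container's frame collected and unioned rather than overwritten on every iteration (container names, the path, and the source format are placeholders; the flatten body stands in for the poster's logic):

from functools import reduce
def flatten(df):
    # the poster's flattening logic would go here
    return df
frames = []
for container in ["container1", "container2"]:
    raw = spark.read.json(f"wasbs://{container}@myaccount.blob.core.windows.net/")
    frames.append(flatten(raw))
# Union every container's frame instead of keeping only the last iteration's result.
all_data = reduce(lambda a, b: a.unionByName(b), frames)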

6 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group