Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

manugarri
by New Contributor II
  • 15966 Views
  • 10 replies
  • 1 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names against an internal database of company names. The client list can fit in memory (it's about 10k elements) but the internal dataset is on HDFS and we use Spark ...

Latest Reply
Sonal
New Contributor II
  • 1 kudos

You can use Zingg, a Spark-based open source tool for this: https://github.com/zinggAI/zingg

9 More Replies
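
For a quick baseline without an external library, the broadcast-join pattern with Spark's built-in levenshtein function also works at this scale. A minimal PySpark sketch; the table and column names are assumptions:

    from pyspark.sql import functions as F

    # Hypothetical inputs: ~10k client names (small enough to broadcast)
    # and a large internal table with a company_name column.
    client_df = spark.createDataFrame([("Acme Corp",), ("Globex Inc",)], ["client_name"])
    internal_df = spark.table("internal.company_names")

    # Cross-join against the broadcast client list, score each pair by
    # edit distance, and keep only close matches.
    matches = (
        internal_df
        .crossJoin(F.broadcast(client_df))
        .withColumn("dist", F.levenshtein("company_name", "client_name"))
        .filter(F.col("dist") <= 3)  # tolerance threshold; tune for your data
    )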
Sam
by New Contributor III
  • 1603 Views
  • 0 replies
  • 0 kudos

Can Admins enable Table Download on Sample but not on Full Dataset?

Is it possible to allow table download on a sampled dataset but not the full dataset? In the configuration settings it seems like you have to allow both? Notwithstanding the fact that people could loop through the sample download, it seems like a prud...

saniafatimi
by New Contributor II
  • 3181 Views
  • 1 reply
  • 1 kudos

Need guidance on migrating Power BI reports to Databricks

Hi all, I want to import an existing database/tables (say AdventureWorks) into Databricks. After importing the tables, I want to develop reports on top of them. I need guidance on this. Can someone give me resources that could help me in doing things end to en...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 1 kudos

@saniafatimi There are several different ways to do this and it's really going to depend on what your current need is. You could, for example, load the data into Databricks Delta Lake and use the Databricks Power BI connector to query the data fr...

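
To make that first step concrete, a hedged sketch of landing one AdventureWorks table in Delta Lake so the Power BI connector can query it; the host, credentials, and table names are placeholders:

    # Pull one source table over JDBC.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<host>:1433;databaseName=AdventureWorks")
          .option("dbtable", "SalesLT.Customer")
          .option("user", "<user>")
          .option("password", "<password>")
          .load())

    # Persist it as a Delta table for Power BI to query via the connector.
    df.write.format("delta").mode("overwrite").saveAsTable("adventureworks.customer")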
User16830818524
by New Contributor II
  • 2245 Views
  • 3 replies
  • 0 kudos

Resolved! Libraries in Databricks Runtimes

Is it possible to easily determine which libraries and which versions are included in a specific DBR version?

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hello. My name is Piper and I'm one of the community moderators. One of the team members sent this information to me. This should be the correct path to check the libraries installed with DBRs: https://docs.databricks.com/release-notes/runtime/8.3ml.html?_...

2 More Replies
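
The release notes are the authoritative list per DBR version. To see what is actually present on a running cluster, one option is to enumerate the Python environment from a notebook (a sketch; %pip list in a cell gives a similar view):

    # Print every installed Python package and its version on the cluster.
    import importlib.metadata

    for dist in sorted(importlib.metadata.distributions(),
                       key=lambda d: d.metadata["Name"].lower()):
        print(dist.metadata["Name"], dist.version)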
Rodrigo_Brandet
by New Contributor
  • 4504 Views
  • 3 replies
  • 4 kudos

Resolved! Upload CSV files on Databricks by code (not UI)

Hello everyone. I have a process on Databricks where I need to upload a CSV file manually every day. I would like to know if there is a way to import this data (as pandas in Python, for example) without needing to upload this file manually every day util...

Latest Reply
-werners-
Esteemed Contributor III
  • 4 kudos

Auto Loader is indeed a valid option, or use some kind of ETL tool which fetches the file and puts it somewhere on your cloud provider, such as Azure Data Factory or AWS Glue.

2 More Replies
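
As a starting point for the Auto Loader suggestion, a minimal sketch that incrementally picks up new CSV files from a landing folder; the paths and table name are assumptions:

    # Auto Loader tracks which files it has already ingested.
    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", "/mnt/landing/_schema")
          .option("header", "true")
          .load("/mnt/landing/daily_csv/"))

    (df.writeStream
       .option("checkpointLocation", "/mnt/landing/_checkpoint")
       .trigger(once=True)  # behaves like a scheduled batch run
       .toTable("staging.daily_csv"))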
Zen
by New Contributor III
  • 5013 Views
  • 2 replies
  • 3 kudos

Resolved! ssh onto Cluster as root

Hello, I'm following the instructions here: https://docs.databricks.com/clusters/configure.html?_ga=2.17611385.1712747127.1631209439-1615211488.1629573963#ssh-access-to-clusters to ssh onto the driver node, and it's working perfectly when I ssh on as `...

Latest Reply
cconnell
Contributor II
  • 3 kudos

I am 99% sure that logging into a Databricks node as root will not be allowed.

1 More Replies
Anonymous
by Not applicable
  • 1921 Views
  • 2 replies
  • 0 kudos

Resolved! What are the advantages of using Delta if I am using MLflow? How is Delta useful for DS/ML use cases?

I am already using MLflow. What benefit would Delta provide me, since I am not really working on data engineering workloads?

Latest Reply
Sebastian
Contributor
  • 0 kudos

The most important aspect is that your experiment can track the version of the data table, so during audits you will be able to trace back why a specific prediction was made.

1 More Replies
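
To make the audit-trail point concrete, a sketch that pins the Delta table version used for training and records it on the MLflow run; the table name and version are hypothetical:

    import mlflow

    table, version = "features.training_set", 12  # hypothetical table and version

    # Delta time travel: read the table exactly as it was at that version.
    train_df = spark.read.option("versionAsOf", version).table(table)

    with mlflow.start_run():
        mlflow.log_param("training_table", table)
        mlflow.log_param("delta_version", version)
        # ... train and log the model as usual ...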
brickster_2018
by Databricks Employee
  • 3128 Views
  • 2 replies
  • 3 kudos

Resolved! What is the best file format for a temporary table?

As part of my ETL process, I create intermediate/staging temporary tables. These tables are read at a later point in the ETL and finally cleaned up. Should I use Delta? Using Delta creates the overhead of running optimize jobs, which would de...

Latest Reply
Sebastian
Contributor
  • 3 kudos

Agreed. Intermediate Delta tables help since they bring reliability to the pipeline.

1 More Replies
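
A sketch of that staging pattern, with assumed schema and table names: write the intermediate result as Delta for reliable downstream reads, then drop it when the run finishes:

    # stage_df stands in for an intermediate result computed earlier in the ETL.
    (stage_df.write.format("delta")
        .mode("overwrite")
        .saveAsTable("etl_staging.orders_enriched"))

    # Later steps read the staged table...
    enriched = spark.table("etl_staging.orders_enriched")

    # ...and the pipeline cleans it up at the end.
    spark.sql("DROP TABLE IF EXISTS etl_staging.orders_enriched")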
Nyarish
by Contributor
  • 1052 Views
  • 0 replies
  • 0 kudos

How to connect Neo4j Aura to Databricks: connection error

I get this error: org.neo4j.driver.exceptions.SecurityException: Failed to establish secured connection with the server. I have tried to read through the documentation and tried the solution suggested, but I can't seem to crack this problem. Kindly help. ...

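
No answer was posted here, but one common cause of this SecurityException with Aura is using a plain bolt:// or neo4j:// URL instead of the encrypted neo4j+s:// scheme. A hedged sketch with the Neo4j Spark Connector installed on the cluster; the URL, credentials, and label are placeholders:

    # Read nodes from Aura; note the encrypted neo4j+s:// scheme in the URL.
    df = (spark.read.format("org.neo4j.spark.DataSource")
          .option("url", "neo4j+s://<dbid>.databases.neo4j.io")
          .option("authentication.basic.username", "neo4j")
          .option("authentication.basic.password", "<password>")
          .option("labels", "Person")  # hypothetical node label
          .load())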
Zircoz
by New Contributor II
  • 14254 Views
  • 2 replies
  • 6 kudos

Resolved! Can we access variables created in Python from Scala code or notebooks?

If I have a dict created in Python in a Scala notebook (using the magic word, of course): %python d1 = {1: "a", 2: "b", 3: "c"}. Can I access this d1 in Scala? I tried the following and it returns d1 not found: %scala println(d1)

Latest Reply
cpm1
New Contributor II
  • 6 kudos

Martin is correct. We can only access external files and objects. In most of our cases, we just use temporary views to pass data between R and Python: https://docs.databricks.com/notebooks/notebooks-use.html#mix-languages

1 More Replies
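
A minimal sketch of that temp-view approach for the original Python-to-Scala question; the view and column names are arbitrary:

    # %python cell: put the dict into a DataFrame and register a temp view.
    d1 = {1: "a", 2: "b", 3: "c"}
    spark.createDataFrame(list(d1.items()), ["key", "value"]) \
         .createOrReplaceTempView("d1_view")

    # %scala cell: the view is visible through the shared SparkSession, e.g.
    #   val d1 = spark.table("d1_view").collect()
    #     .map(r => r.getLong(0) -> r.getString(1)).toMap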
Anonymous
by Not applicable
  • 3382 Views
  • 1 reply
  • 2 kudos

Are there any costs or quotas associated with the Databricks managed Hive metastore?

When using the default Hive metastore that is managed within the Databricks control plane, are there any associated costs? I.e., if I switched to an external metastore, would I expect to see any reduction in my Databricks cost (ignoring total costs)? Do ...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 2 kudos

There are no costs associated with using the Databricks-managed Hive metastore directly. Databricks pricing is based on compute consumption, not on data storage or access. The only real cost would be the compute used to access the data. I would not expe...

Techmate
by New Contributor
  • 1554 Views
  • 1 reply
  • 0 kudos

Populating an array of date tuples in Scala

Hi friends, I am trying to pass a list of date ranges that needs to be in the below format: val predicates = Array("2021-05-16" -> "2021-05-17", "2021-05-18" -> "2021-05-19", "2021-05-20" -> "2021-05-21"). I am then using map to create a range of conditions that...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

So basically this can be done by generating two lists which are then zipped. One list contains the first dates of the tuples, which in your case are two days apart. The other list contains the second dates of the tuples, also two days apart. Now we need a function ...

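
The same zip-based construction, sketched in Python for illustration; the range bounds come from the example in the question:

    from datetime import date, timedelta

    # Start dates two days apart, each paired with the following day.
    start, n_ranges = date(2021, 5, 16), 3
    starts = [start + timedelta(days=2 * i) for i in range(n_ranges)]
    ends = [s + timedelta(days=1) for s in starts]

    predicates = [(s.isoformat(), e.isoformat()) for s, e in zip(starts, ends)]
    # [('2021-05-16', '2021-05-17'), ('2021-05-18', '2021-05-19'),
    #  ('2021-05-20', '2021-05-21')]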
dlevy
by New Contributor II
  • 1517 Views
  • 1 reply
  • 1 kudos
Latest Reply
gbrueckl
Contributor II
  • 1 kudos

I think this was added in Databricks Runtime 8.2: https://docs.databricks.com/release-notes/runtime/8.2.html

alphaRomeo
by New Contributor
  • 4737 Views
  • 2 replies
  • 0 kudos

Resolved! Databricks with MySQL data source?

I have an existing data pipeline which looks like this: a small MySQL data source (around 250 GB) whose data passes through Debezium/Kafka and a custom data redactor to Glue ETL jobs, and finally lands on Redshift. But the scale of the data is too sm...

Latest Reply
Dan_Z
Databricks Employee
  • 0 kudos

There is a lot in this question, so generally speaking I suggest you reach out to the sales team at Databricks. You can talk to a solutions architect who can get into more detail. Here are my general thoughts, having seen a lot of customer architectures: Generally,...

1 More Replies
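
On the mechanics, Databricks can read a MySQL source directly over JDBC. A hedged sketch; host, database, table, and credentials are placeholders:

    # Read one MySQL table into a DataFrame over JDBC.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://<host>:3306/<database>")
          .option("dbtable", "orders")  # hypothetical table
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "com.mysql.cj.jdbc.Driver")
          .load())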
