Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

William_Scardua
by Valued Contributor
  • 2951 Views
  • 2 replies
  • 2 kudos

Resolved! Error/Exception when reading a websocket with readStream

Hi guys, how are you? Can you help me? Here is my situation: when I try to read a websocket with readStream, I receive an unknown error, java.net.UnknownHostException. This is my code: wssocket = spark\ .readStream\ .forma...

Latest Reply
Deepak_Bhutada
Contributor III
  • 2 kudos

It will definitely create a streaming object, so don't go by the wssocket.isStreaming = True piece. The streaming object is created without any issue because of lazy evaluation. Now, coming to the issue: please put the IP directly; sometimes the sla...
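A java.net.UnknownHostException generally means the host name in the connection options could not be resolved from the machine running the code. The sketch below is one way to check resolution before starting the stream, using only Python's standard library. The helper name `resolve_host` and the host name in the usage comment are hypothetical and do not come from the original thread.

```python
import socket


def resolve_host(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when lookup fails.
        return None


# Hypothetical usage on the driver:
#   ip = resolve_host("my-websocket-host.example")
#   if ip is None, the name does not resolve from this machine, and passing
#   the IP address directly (as suggested in the reply) may be worth trying.
```

If the helper returns None on the driver, the problem is name resolution rather than the streaming code itself.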

  • 2 kudos
1 More Replies
jay_kum
by New Contributor III
  • 1726 Views
  • 2 replies
  • 0 kudos

Resolved! Unable to execute Self Learning Path codes

I am unable to execute the code examples given in the learning path. I understand it could be due to an access issue. How do I change the working directory to the User folder for creating/uploading/reading/writing, etc.? By default everything is on the driver node. Even...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@jay.kum​ - Fantastic! Thanks for letting us know.

  • 0 kudos
1 More Replies
saipujari_spark
by Valued Contributor
  • 6704 Views
  • 1 replies
  • 3 kudos

Resolved! How to restrict the number of tasks per executor?

In general, Spark executes one task per core. If we want to restrict the number of tasks submitted to an executor so that each task gets more memory, how can we achieve that?

Latest Reply
saipujari_spark
Valued Contributor
  • 3 kudos

We can use a config called "spark.task.cpus". This specifies the number of cores to allocate for each task. The default value is 1. If we specify, say, 2, it means fewer tasks will be assigned to the executor.
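To make the effect concrete, here is a minimal sketch of the arithmetic: the number of tasks an executor can run at once is its core count divided by `spark.task.cpus`, rounded down. The function name `task_slots_per_executor` and the core counts are illustrative assumptions, not part of the original answer.

```python
def task_slots_per_executor(executor_cores, task_cpus=1):
    """Number of tasks one executor can run concurrently.

    executor_cores corresponds to spark.executor.cores and
    task_cpus to spark.task.cpus (default 1).
    """
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("core counts must be positive")
    return executor_cores // task_cpus


# With 8 cores per executor (an assumed value):
print(task_slots_per_executor(8, 1))  # 8 concurrent tasks
print(task_slots_per_executor(8, 2))  # 4 concurrent tasks, so more memory per task

# The setting itself is normally supplied when the cluster or session is
# created, for example:
#   SparkSession.builder.config("spark.task.cpus", "2")
```

Halving the number of concurrent tasks roughly doubles the executor memory available to each one, at the cost of leaving cores idle if the tasks are single-threaded.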

  • 3 kudos
Geeya
by New Contributor II
  • 1465 Views
  • 1 replies
  • 0 kudos

After several iterations of filter and union, the data is bigger than spark.driver.maxResultSize

The process for me to build the model is:
1. filter dataset and split into two datasets
2. fit model based on two datasets
3. union two datasets
4. repeat steps 1-3
The problem is that after several iterations, the model fitting time becomes dramatically longer, and the...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

I assume that you are using PySpark to train a model? It sounds like you are collecting data on the driver and likely need to increase the size. Can you share any code?

  • 0 kudos
jacek
by New Contributor II
  • 4215 Views
  • 4 replies
  • 1 kudos

Is there an option to have cell titles in notebook view 'Table of contents' ? If not - could you add one?

I like cell titles more than separate %md cells. Having cell titles in the table of contents seems like quite a simple feature. Is it possible? If not, could you add one?

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@jacek​ - I wanted to pop in and give you a status update. The team is aware of your request. I can't make any promises on when something may change, but we appreciate your idea and bringing this to our attention.

  • 1 kudos
3 More Replies
gbrueckl
by Contributor II
  • 9659 Views
  • 2 replies
  • 4 kudos

Resolved! dbutils.notebook.run with multiselect parameter

I have a notebook which has a parameter defined as dbutils.widgets.multiselect("my_param", "ALL", ["ALL", "A", "B", "C"]) and I would like to pass this parameter when calling the notebook via dbutils.notebook.run(). However, I tried passing it as an pyth...

Latest Reply
gbrueckl
Contributor II
  • 4 kudos

You are right, this actually works fine. I just realized I had two multiselect parameters in my tests, and only changing one of them still resulted in the same error message for the second one. I ended up writing a function that parses whatever comes in...
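The parsing function mentioned in the reply is not shown in the thread. The following is a hypothetical sketch of what such a helper could look like, assuming a multiselect value may arrive either as a Python list or as one comma-separated string (notebook arguments are passed as strings). The names `parse_multiselect` and `to_widget_argument` are invented for illustration.

```python
def parse_multiselect(value):
    """Normalize a multiselect value to a list of clean strings.

    Accepts None, a comma-separated string such as "A,B",
    or any iterable of values such as ["A", "B"].
    """
    if value is None:
        return []
    if isinstance(value, str):
        items = value.split(",")
    else:
        items = list(value)
    cleaned = [str(item).strip() for item in items]
    return [item for item in cleaned if item]


def to_widget_argument(values):
    """Build the single string form to pass as a notebook argument."""
    return ",".join(parse_multiselect(values))


# Hypothetical usage when calling another notebook (path and timeout are
# placeholders):
#   dbutils.notebook.run("child", 600, {"my_param": to_widget_argument(["A", "B"])})
```

Normalizing in one place means the called notebook does not need to care which form the caller used.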

  • 4 kudos
1 More Replies
tarente
by New Contributor III
  • 1270 Views
  • 2 replies
  • 3 kudos

Resolved! How to create a csv using a Scala notebook that has " in some columns?

In a project we use Azure Databricks to create csv files to be loaded in ThoughtSpot. Below is a sample of the code I use to write the file: val fileRepartition = 1 val fileFormat = "csv" val fileSaveMode = "overwrite" var fileOptions = Map ( ...

Latest Reply
tarente
New Contributor III
  • 3 kudos

Hi Shan, thanks for the link. I now know more options for creating different csv files. I have not yet solved the problem, but that is related to the destination application (ThoughtSpot) not being able to load the data in the csv file correctly. Rega...
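For readers unfamiliar with how embedded double quotes are usually represented in CSV, here is a small standalone illustration using Python's standard `csv` module rather than the Spark writer from the thread. It shows the common convention of wrapping the field in quotes and doubling each inner quote. Whether ThoughtSpot expects this convention is not stated in the thread, and the sample rows are made up.

```python
import csv
import io


def to_csv(rows):
    """Serialize rows to CSV text, doubling any embedded quote characters."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # write " inside a field as ""
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


text = to_csv([["id", "comment"], [1, 'She said "hi"']])
print(text)

# Reading the text back recovers the original field value.
rows_back = list(csv.reader(io.StringIO(text)))
```

The Spark CSV writer exposes `quote` and `escape` options that control the same behavior, so the target application's expected convention determines which values to set.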

  • 3 kudos
1 More Replies
potluri
by New Contributor II
  • 2839 Views
  • 1 replies
  • 1 kudos

Resolved! Cluster frequently crashing

The cluster keeps crashing, prompting me to use a different cluster or restart the cluster. The same code previously worked fine.

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

Hi @potluri, what kind of cluster are you using? Is it an interactive cluster or a job cluster? What is the error message you are getting? The following KB article could help you find the cause and the solution to your problem. Please check the ...

  • 1 kudos
ArindamHalder
by New Contributor II
  • 1900 Views
  • 1 replies
  • 3 kudos

Resolved! Is there any performance result available for DeltaLake?

Specifically, for writing and reading streaming data to HDFS, S3, etc. For an IoT-specific scenario, how does it perform on time-series transactional data? Can we consider a Delta table a time-series table?

Latest Reply
mathan_pillai
Valued Contributor
  • 3 kudos

Hi @Arindam Halder, Delta Lake is more performant than a regular Parquet table. Please check the link below for some stats on the performance: https://docs.azuredatabricks.net/_static/notebooks/delta/optimize-python.html Yes, you can use it for time series...

  • 3 kudos
Ougagagoubu
by New Contributor
  • 1031 Views
  • 0 replies
  • 0 kudos

FileBug in DBFS? Cannot remove file (table) nor create it in Apache Spark (TM) SQL for Data Analysts Coursera course from Unit 6.2 onwards.

Hello, as the title already suggests, I'm not able to remove a file via the shell (%sh rm -f "path") nor continue the notebook from 6.2 onwards (6.3 etc.) inside Databricks. I'm using the Databricks Community edition. While the error message is clear: "...

hoopla
by New Contributor II
  • 6195 Views
  • 2 replies
  • 1 kudos

Unable to copy multiple files from file:/tmp to dbfs:/tmp

I am downloading multiple files by web scraping, and by default they are stored in /tmp. I can copy a single file by providing the filename and path: %fs cp file:/tmp/2020-12-14_listings.csv.gz dbfs:/tmp but when I try to copy multiple files I get an ...

Latest Reply
hoopla
New Contributor II
  • 1 kudos

Thanks Deepak. This is what I suspected. Hopefully the wildcard feature might be available in the future. Thanks
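Since the copy command does not expand wildcards, one possible workaround is to list the local directory, filter the names in Python, and copy the matches one at a time. The sketch below assumes this approach; `copy_matching` is a hypothetical helper, and the copy and list functions are passed in so the logic can be tried outside a notebook.

```python
import fnmatch
import os


def copy_matching(src_dir, pattern, dst_dir, copy_fn, list_fn=os.listdir):
    """Copy every file in src_dir whose name matches pattern, one by one.

    copy_fn(source, target) performs a single-file copy, for example
    dbutils.fs.cp inside a Databricks notebook.
    Returns the list of file names that were copied.
    """
    src_dir = src_dir.rstrip("/")
    dst_dir = dst_dir.rstrip("/")
    copied = []
    for name in sorted(list_fn(src_dir)):
        if fnmatch.fnmatch(name, pattern):
            copy_fn(f"file:{src_dir}/{name}", f"{dst_dir}/{name}")
            copied.append(name)
    return copied


# Hypothetical usage in a notebook, where dbutils is available:
#   copy_matching("/tmp", "*_listings.csv.gz", "dbfs:/tmp", dbutils.fs.cp)
```

Listing with `os.listdir` works here because the downloaded files live on the driver's local disk under /tmp.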

  • 1 kudos
1 More Replies
sachinmkp1
by New Contributor II
  • 41579 Views
  • 2 replies
  • 1 kudos

Resolved! org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 69 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

I set spark.conf.set("spark.driver.maxResultSize", "20g") in the notebook, and spark.conf.get("spark.driver.maxResultSize") returns 20g, as expected. I did not set it at the cluster level, and I am still getting 4g while executing the Spark job. Why? because of th...

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

Hi @sachinmkp1@gmail.com, you need to add this Spark configuration at your cluster level, not at the notebook level. When you add it to the cluster level it will apply the settings properly. For more details on this issue, please check our knowledge...
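As a rough illustration of a cluster-level setting, the fragment below builds the `spark_conf` part of a cluster definition as it might be sent to the Clusters API. Only the `spark_conf` key and the 20g value relate to the thread; a real cluster definition needs further fields that are omitted here. In the cluster UI, the same setting goes in the Spark config box as one line: `spark.driver.maxResultSize 20g`.

```python
import json

# Fragment of a cluster definition; other required fields are left out.
cluster_spec_fragment = {
    "spark_conf": {
        # Applied when the driver starts, unlike a notebook-level spark.conf.set.
        "spark.driver.maxResultSize": "20g",
    }
}

print(json.dumps(cluster_spec_fragment, indent=2))
```

The cluster has to be restarted for a changed cluster-level Spark config to take effect.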

  • 1 kudos
1 More Replies
SivakrishnaSunk
by New Contributor II
  • 1587 Views
  • 1 replies
  • 2 kudos

Resolved! Azure synapse writing data with Databricks using polybase error

We are writing data from Delta tables to Azure Synapse using Azure Databricks. While loading the data into the Synapse staging table, we get the error "HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: abf...

Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

Hi SivakrishnaSunkara, you will find more information and examples at the following link: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics Please follow the steps from the authentication section to avoid thi...

  • 2 kudos
User16826992724
by New Contributor III
  • 2556 Views
  • 1 replies
  • 2 kudos
Latest Reply
User16826992724
New Contributor III
  • 2 kudos

Just like B-tree indices in the traditional EDW world, Z-order indexing can be used on high-cardinality columns such as primary key columns and on high-cardinality joins such as fact and dimension table joins. Z-order indexes can be created only on the ...
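To give a feel for what "Z-order" refers to, the sketch below computes a classic Z-order (Morton) value by interleaving the bits of two integer column values, so that rows close in both columns tend to sort near each other. This illustrates the general idea only and is not a description of Delta Lake's internal implementation; the function name and sample points are made up.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one sort key.

    Bits of x land in even positions and bits of y in odd positions.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting points by their Z value keeps nearby (x, y) pairs close together.
points = [(3, 3), (0, 1), (2, 0), (1, 1), (0, 0), (1, 0)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)
```

Because nearby values end up in the same files, a query filtering on either column can skip more files using min/max statistics than with a sort on a single column.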

  • 2 kudos
User16826992724
by New Contributor III
  • 1077 Views
  • 1 replies
  • 4 kudos
Latest Reply
User16826992724
New Contributor III
  • 4 kudos

There are various methods, such as using uuid, monotonically_increasing_id(), row_number() OVER (ORDER BY NULL) AS SK, or the md5() or sha() hashing functions. A detailed discussion of the various options and their pros and cons can be found in this youtu...
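The trade-offs between these approaches are easier to see side by side. Below is a plain-Python sketch of three analogous strategies (random UUIDs, deterministic hashes of the business key, and sequential numbering) using only the standard library. These are not the Spark functions themselves, and the helper names and the `||` separator are assumptions made for illustration.

```python
import hashlib
import uuid


def random_key():
    """Random UUID: unique without coordination, but not reproducible."""
    return str(uuid.uuid4())


def hash_key(*business_key_parts):
    """MD5 of the business key: the same input always gives the same key."""
    joined = "||".join(str(part) for part in business_key_parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def sequential_keys(rows, start=1):
    """Dense 1, 2, 3... numbering, similar in spirit to row_number()."""
    return [(start + i, row) for i, row in enumerate(rows)]


print(random_key())
print(hash_key("customer", 42))
print(sequential_keys(["a", "b", "c"]))
```

Hash keys can be recomputed independently in different jobs, random UUIDs cannot, and sequential numbering needs a single ordered pass over the data.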

  • 4 kudos
