Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Jreco
by Contributor
  • 8338 Views
  • 13 replies
  • 3 kudos

Event Hub streaming: improving the processing rate

Hi all, I'm working with Event Hubs and Databricks to process and enrich data in real time. Doing a "simple" test, I'm getting some weird values (input rate vs. processing rate) and I think I'm losing data: as you can see, there is a peak with 5k record...

Latest Reply
jose_gonzalez
Moderator
  • 3 kudos

Hi @Jhonatan Reyes, how many Event Hubs partitions are you reading from? Your micro-batch takes a few milliseconds to complete, which I think is a good time, but I would like to understand better what you are trying to improve here. Also, in this case ...
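For context, with the Azure Event Hubs connector the partition count bounds read parallelism, and the batch size per trigger can be capped explicitly so input rate and processing rate can be compared on equal footing. A minimal sketch, assuming the azure-eventhubs-spark connector is installed and using a hypothetical secret scope:

# Hypothetical secret scope/key holding the Event Hubs connection string.
connection_string = dbutils.secrets.get(scope="my-scope", key="eh-connection")

eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
    # Cap the number of events pulled per micro-batch.
    "maxEventsPerTrigger": 5000,
}

stream = (
    spark.readStream
         .format("eventhubs")
         .options(**eh_conf)
         .load()
)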

12 More Replies
Data_Bricks1
by New Contributor III
  • 3360 Views
  • 7 replies
  • 0 kudos

Data from 10 blob containers and multiple hierarchical folders (daily and hourly folders) in each container to a Delta Lake table in Parquet format - incremental loading of the latest data only (inserts, no updates)

I am able to load data for a single container by hard coding, but not able to load from multiple containers. I used a for loop, but the data frame loads only the last container's last folder records. One more issue is that I have to flatten the data, when I ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

For sure the function (def) should be declared outside the loop; move it to just after the library imports. The logic is a bit complicated, so you need to debug it by calling display(Flatten_df2) (or .show()) and validating the JSON after each iteration (using break or sleep, etc.), along the lines of the sketch below.
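Roughly what that restructuring can look like, as a sketch only: the container names, storage account, paths, and flatten body below are placeholders standing in for the original code.

def flatten(df):
    # Placeholder for the real flattening logic (explodes/selects on the nested JSON).
    return df

containers = [f"container{i}" for i in range(1, 11)]  # hypothetical container names

for c in containers:
    # Wildcards cover the day/hour folder hierarchy inside each container.
    raw = spark.read.json(f"wasbs://{c}@mystorageacct.blob.core.windows.net/*/*/")
    flat = flatten(raw)
    # Append on every iteration instead of overwriting, so earlier containers are kept.
    flat.write.format("delta").mode("append").saveAsTable("bronze_events")
    display(flat)  # or flat.show(), to validate each iteration as suggested above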

6 More Replies
BorislavBlagoev
by Valued Contributor III
  • 2332 Views
  • 4 replies
  • 7 kudos

Resolved! Visualization of Structured Streaming in job.

Does Databricks have a feature or a good pattern for visualizing data from Structured Streaming? Something like display in the notebook.

Latest Reply
BorislavBlagoev
Valued Contributor III
  • 7 kudos

I didn't know about that. Thanks!
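For anyone finding this thread later: in a Databricks notebook, display() accepts a streaming DataFrame directly and renders a continuously refreshing table or chart. A minimal sketch using the built-in rate test source (the source and rate are assumptions, just for illustration):

stream_df = (
    spark.readStream
         .format("rate")               # test source that emits rows on a schedule
         .option("rowsPerSecond", 5)
         .load()
)

# Renders a live, auto-updating visualization in the notebook.
display(stream_df)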

3 More Replies
FMendez
by New Contributor III
  • 12600 Views
  • 3 replies
  • 6 kudos

Resolved! How can you mount an Azure Data Lake (gen2) using abfss and Shared Key?

I wanted to mount ADLS Gen2 on Databricks and take advantage of the abfss driver, which should be better for large analytical workloads (is that even true in the context of DB?). Setting up OAuth is a bit of a pain, so I wanted to take the simpler approac...

Latest Reply
User16753724663
Valued Contributor
  • 6 kudos

Hi @Fernando Mendez, the document below will help you mount ADLS Gen2 using abfss: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.html Could you please check if this helps?
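If OAuth really is off the table, direct access with the storage account key (rather than a mount) is the simpler path; abfss mounts are generally documented with OAuth/service principals. A sketch, where the account name, container, and secret scope are assumptions:

# Hypothetical storage account, container, and secret scope/key names.
account_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")

spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    account_key,
)

df = spark.read.parquet(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/path/to/data"
)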

2 More Replies
fymaterials_199
by New Contributor II
  • 1031 Views
  • 1 replies
  • 0 kudos

PySpark intermediate DataFrame consumes too much memory

I have PySpark code running on my local Mac, which has 6 cores and 16 GB of RAM. I run it in PyCharm to do a first test: spark = ( SparkSession.builder.appName("loc") .master("local[2]") .config("spark.driver.bindAddress", "localhost") .config("...
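For reference, a local-mode session along those lines usually also pins the driver memory so intermediate DataFrames have room before spilling; a sketch, where the 8g figure is an assumption for a 16 GB machine:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("loc")
    .master("local[2]")
    .config("spark.driver.bindAddress", "localhost")
    .config("spark.driver.memory", "8g")  # assumed value, leaves headroom locally
    .getOrCreate()
)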

Latest Reply
fymaterials_199
New Contributor II
  • 0 kudos

Here is my input file:
EID,EffectiveTime,OrderHistory,dummy_col,Period_Start_Date
11,2019-04-19T02:50:42.6918667Z,"[{'Codes': [{'CodeSystem': 'sys_1', 'Code': '1-2'}], 'EffectiveDateTime': '2019-04-18T23:48:00Z', 'ComponentResults': [{'Codes': [{'CodeSy...

User16826994223
by Honored Contributor III
  • 1729 Views
  • 1 replies
  • 1 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

# providing a starting version
spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 0) \
  .table("myDeltaTable")

# providing a starting timestamp
spark.readStream.format("delta") \
  .option("readCh...

User16826994223
by Honored Contributor III
  • 1375 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

You must explicitly enable the change data feed option using one of the following methods: New table: set the table property delta.enableChangeDataFeed = true in the CREATE TABLE command. CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIE...
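For completeness, the other documented options, expressed here through PySpark (property names are from the Delta change data feed documentation):

# Existing table: enable the change data feed on a table that already exists.
spark.sql("""
    ALTER TABLE student
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# All new tables: set the session default so future CREATE TABLE commands inherit it.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true"
)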

User16826994223
by Honored Contributor III
  • 1019 Views
  • 1 replies
  • 0 kudos

Spark is reading data from the source even though I am persisting the data

Hi all, I am reading data, caching it, and then performing a count action to get the data into memory, but in the DAG I still see that it reads the data from the source every time.

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

It looks like the Spark memory is not sufficient to cache all the data, so it always reads from the source.
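One way to confirm this is to check the Storage tab in the Spark UI for the cached fraction, and to persist with a level that spills to disk so partitions that do not fit in memory are re-read from local disk rather than recomputed from the source. A sketch, with a hypothetical table name:

from pyspark import StorageLevel

df = spark.table("my_source_table")  # hypothetical source

# MEMORY_AND_DISK keeps partitions that don't fit in memory on local disk.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # action that materializes the cache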

User16826987838
by Contributor
  • 1066 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

Yes, in your write stream you can save it as a table in the Delta format without a problem. In DBR 8, the default table format is Delta. See this code; please note that the "..." is supplied to show that additional options may be required: df.writeSt...
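A fuller version of that pattern, as a sketch: the checkpoint path and table name are assumptions, and toTable requires Spark 3.1+ (DBR 8+).

(
    df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/tmp/checkpoints/my_table")  # assumed path
      .toTable("my_table")
)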

User16826992666
by Valued Contributor
  • 1239 Views
  • 1 replies
  • 1 kudos

Resolved! Does Databricks integrate with Immuta?

My company uses Immuta for data governance. Will Databricks be able to fit into our existing security patterns?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 1 kudos

Yes, check out the Immuta web page on the Databricks integration: https://www.immuta.com/integrations/databricks

brickster_2018
by Esteemed Contributor
  • 1575 Views
  • 1 replies
  • 0 kudos

Resolved! Why do I see data loss with Structured streaming jobs?

I have a Spark Structured Streaming job reading data from Kafka and loading it into a Delta table. I apply some transformations and aggregations to the streaming data before writing to the Delta table.

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

The typical reason for data loss in a Structured Streaming application is an incorrect value set for the watermark. Watermarking is done to ensure the application does not keep state for a long period; however, it should be ensured ...
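To make that concrete: the watermark delay bounds how late an event may arrive and still be included, so an overly aggressive value silently drops late records and looks like data loss downstream. A sketch with assumed column names, thresholds, and broker address:

from pyspark.sql import functions as F

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
         .option("subscribe", "events")
         .load()
)

parsed = events.selectExpr("CAST(value AS STRING) AS body", "timestamp")

agg = (
    parsed
    # Events arriving more than 30 minutes late are dropped from the aggregation;
    # too small a delay here shows up as "missing" data in the Delta table.
    .withWatermark("timestamp", "30 minutes")
    .groupBy(F.window("timestamp", "5 minutes"))
    .count()
)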

User16826994223
by Honored Contributor III
  • 843 Views
  • 1 replies
  • 1 kudos

Does Databricks have a data processing agreement?

Does Databricks have a data processing agreement?

Latest Reply
User16826994223
Honored Contributor III
  • 1 kudos

Databricks offers a standalone data processing agreement that contains our contractual commitments with respect to applicable data protection and privacy law and is designed to comply with certain data protection laws. If your company determines that you require ter...

User16826994223
by Honored Contributor III
  • 969 Views
  • 1 replies
  • 0 kudos

How do we manage data recency in Databricks?

I want to know how Databricks maintains data recency.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

When using Delta tables in Databricks, you have the advantage of the Delta cache, which accelerates data reads by creating copies of remote files in the nodes' local storage using a fast intermediate data format. At the beginning of each query, Delta tables au...
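The cache is controlled per cluster; a minimal sketch of enabling it and pre-warming a table (the config key and CACHE SELECT command are from the Databricks disk-cache docs; the table name is hypothetical):

# Enable the Delta/disk cache for this cluster.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optionally pre-load a hot table into the cache (Databricks-specific SQL).
spark.sql("CACHE SELECT * FROM sales_gold")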

brickster_2018
by Esteemed Contributor
  • 1863 Views
  • 1 replies
  • 0 kudos

Resolved! Z-order or Partitioning? Which is better for Data skipping?

For Delta tables, which of Z-ordering and partitioning is the recommended technique for efficient data skipping?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

Partition pruning is the most efficient way to ensure data skipping. However, choosing the right column for partitioning is very important. It's common to see that choosing the wrong column for partitioning causes a large number of small-file problems ...
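Side by side, the two techniques look roughly like this (a sketch via spark.sql; table and column names are assumptions):

# Partitioning: chosen at write time, best for low-cardinality columns such as a date.
(
    spark.table("events_raw")                # hypothetical source table
         .write.format("delta")
         .partitionBy("event_date")
         .saveAsTable("events")
)

# Z-ordering: applied afterwards with OPTIMIZE, best for high-cardinality filter columns.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")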
