Data Engineering

Forum Posts

Quan
by New Contributor III
  • 10195 Views
  • 9 replies
  • 6 kudos

Resolved! How to properly load Unicode (UTF-8) characters from a table over a JDBC connection using the Simba Spark Driver

Hello all, I'm trying to pull table data from Databricks tables that contain foreign-language characters in UTF-8 into an ETL tool using a JDBC connection. I'm using the latest Simba Spark JDBC driver available from the Databricks website. The issue i...

Latest Reply
Anonymous
Not applicable

Can you try setting UseUnicodeSqlCharacterTypes=1 in the driver, and also make sure 'file.encoding' is set to UTF-8 in the JVM, and see if the issue still persists?
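For anyone wanting to verify those two settings outside their ETL tool, a minimal sketch from Python using the jaydebeapi package might look like the following. The host, httpPath, token, jar path, and driver class name are placeholders and assumptions, not values from this thread:

```
# Sketch only: test UseUnicodeSqlCharacterTypes=1 outside the ETL tool.
# Host, httpPath, token, jar path, and driver class are placeholders;
# the driver class name can differ between Simba driver versions.
import jaydebeapi

url = ("jdbc:spark://<workspace-host>:443/default;transportMode=http;"
       "ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>;"
       "UseUnicodeSqlCharacterTypes=1")

conn = jaydebeapi.connect("com.simba.spark.jdbc.Driver", url,
                          jars="/path/to/SparkJDBC42.jar")
cur = conn.cursor()
cur.execute("SELECT * FROM my_unicode_table LIMIT 10")
print(cur.fetchall())
conn.close()

# For a JVM-based ETL tool itself, also launch the JVM with
# -Dfile.encoding=UTF-8, as suggested above.
```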

Abhendu
by New Contributor II
  • 857 Views
  • 3 replies
  • 0 kudos

Resolved! CI/CD in Databricks

Hi Team, I was wondering if there is a document or step-by-step process to promote code through CI/CD across the various environments of a code repository (Git/GitHub/Bitbucket/GitLab) with DBx support? [Without involving the code repository's merging capability of the ...

Latest Reply
Anonymous
Not applicable

Please refer to this related thread on CI/CD in Databricks: https://community.databricks.com/s/question/0D53f00001GHVhMCAX/what-are-some-best-practices-for-cicd

Kaniz
by Community Manager
  • 723 Views
  • 1 replies
  • 0 kudos
Latest Reply
Kaniz
Community Manager

The differences are as follows:
  • Pig operates on the client side of a cluster, whereas Hive operates on the server side.
  • Pig uses the Pig Latin language, whereas Hive uses the HiveQL language.
  • Pig is a procedural data-flow language, whereas Hive is a ...

Kaniz
by Community Manager
  • 702 Views
  • 1 replies
  • 0 kudos
Latest Reply
Kaniz
Community Manager

To export all collections:
```
mongodump -d database_name -o directory_to_store_dumps
```
To restore them:
```
mongorestore -d database_name directory_backup_where_mongodb_tobe_restored
```

JD2
by Contributor
  • 2241 Views
  • 6 replies
  • 4 kudos

Resolved! Auto Loader for Shape File

Hello: As you can see from the link below, Auto Loader supports 7 file formats. I am dealing with GeoSpatial shapefiles and I want to know if Auto Loader can support shapefiles. Any help on this is greatly appreciated. Thanks. https://docs.microsoft.com/...

Latest Reply
-werners-
Esteemed Contributor III

You could try to use the binary file type. The disadvantage of this is that the content of the shapefiles will be put into a single column, which might not be what you want. If you absolutely want to use Auto Loader, maybe some thinking outside the b...
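A minimal sketch of that binary-file route, assuming illustrative mount paths and a Delta target (none of these names come from the thread):

```
# Sketch: ingest shapefiles with Auto Loader as raw binary files.
# All paths here are illustrative placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")
      .option("pathGlobFilter", "*.shp")
      .load("/mnt/landing/shapefiles/"))

# Each row holds path, modificationTime, length, and the raw bytes in the
# `content` column, which downstream code would need to parse.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/shapefiles/")
   .start("/mnt/bronze/shapefiles_raw/"))
```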

Karankaran_alan
by New Contributor
  • 667 Views
  • 2 replies
  • 0 kudos

Cluster not getting created, timing out

Hello - I've been using the Databricks notebook (for PySpark or Scala/Spark development), and recently have had issues where cluster creation takes a long time, often timing out. Any ideas on how to resolve this?

Latest Reply
jose_gonzalez
Moderator

Hi Karankaran.alang, what is the error message you are getting? Did you get this error while creating/starting a Community Edition (CE) cluster? Sometimes these errors are intermittent and go away after a few retries. Thank you.

RajaLakshmanan
by New Contributor
  • 1992 Views
  • 3 replies
  • 1 kudos

Resolved! Spark StreamingQuery not processing all data from source directory

Hi, I have set up a streaming process that consumes files from an HDFS staging directory and writes them to a target location. The input directory continuously receives files from another process. Let's say the file producer produces 5 million records and sends them to the HDFS sta...

Latest Reply
User16763506586
Contributor

If it helps, you can try running a left-anti join on the source and sink to identify missing records, and then check whether each missing record matches the schema provided.
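A sketch of that check, with made-up paths and formats and a hypothetical key column record_id:

```
# Sketch: find records present in the source but missing from the sink.
# The paths, file formats, and `record_id` key column are assumptions.
source_df = spark.read.parquet("/staging/input/")
sink_df = spark.read.parquet("/target/output/")

missing = source_df.join(sink_df, on="record_id", how="left_anti")
print(missing.count())
missing.show(truncate=False)
```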

User15787040559
by New Contributor III
  • 1439 Views
  • 2 replies
  • 1 kudos

How can I get Databricks notebooks to stop cutting off the explain plans?

(Since Spark 3.0) Dataset.queryExecution.debug.toFile will dump the full plan to a file, without concatenating the output as a fully materialized Java string in memory.

Latest Reply
dazfuller
Contributor III

Notebooks really aren't the best method of viewing large files. Two methods you could employ are:
  • Save the file to DBFS and then use the Databricks CLI to download the file.
  • Use the web terminal.
With the web terminal option you can do something like "cat my_lar...
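A sketch of the first option from PySpark, using an internal handle to the Java QueryExecution object (paths are illustrative, and the _jdf accessor is an internal API that may change between versions):

```
# Sketch: dump the full query plan to DBFS so it can be fetched outside
# the notebook. Paths are illustrative; _jdf is an internal PySpark handle.
df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
plan_text = df._jdf.queryExecution().toString()  # full plan as a string
dbutils.fs.put("dbfs:/tmp/full_plan.txt", plan_text, True)

# Then, from a local shell, e.g.:
#   databricks fs cp dbfs:/tmp/full_plan.txt .
```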

Kaniz
by Community Manager
  • 3039 Views
  • 2 replies
  • 1 kudos
Latest Reply
dazfuller
Contributor III

If you want to read line-by-line in Python:
```
with open('/path/to/file', 'r') as f:
    for line in f:
        print(line)
```
If you want to read the entire file into a list of lines:
```
with open('/path/to/file', 'r') as f:
    data = f.readlines()
```
Or if you w...

YuvSaha
by New Contributor
  • 556 Views
  • 1 replies
  • 0 kudos

Auto Loader for Shape Files?

Hello, As you can see from the link below, Auto Loader supports 7 file formats. I am dealing with GeoSpatial shapefiles and I want to know if Auto Loader can support shapefiles. Any help on this is greatly appreciated. avro: Avro file, binaryFile: Binary f...

Latest Reply
dbkent
New Contributor III

Hi @Yuv Saha​, currently shapefiles are not a supported file type when using Auto Loader. Would you be willing to share more about your use case? I am the Product Manager responsible for Geospatial at Databricks, and I need help from customers like ...

Kaniz
by Community Manager
  • 1647 Views
  • 2 replies
  • 1 kudos
Latest Reply
dazfuller
Contributor III

The basic process is this (assuming you're using a Python virtual environment and have activated it):
```
(.venv) C:\source\project> python -m pip install pyspark
(.venv) C:\source\project> python

>>> from pyspark.sql import SparkSession
>>> spark = SparkSe...
```
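The truncated line presumably continues into the usual builder chain; a minimal completion under that assumption:

```
# Minimal local SparkSession; this builder chain is our completion of the
# truncated reply, and the master/app name values are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-test")
         .getOrCreate())
spark.range(5).show()
```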

Kaniz
by Community Manager
  • 8004 Views
  • 2 replies
  • 1 kudos
Latest Reply
Ryan_Chynoweth
Honored Contributor III

Assuming that the S3 bucket is mounted in the workspace, you can provide a file path. If you want to write a PySpark DataFrame, you can do something like the following: df.write.format('json').save('/path/to/file_name.json'). You could also use the json py...
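A runnable sketch of the DataFrame route, assuming an illustrative mount point:

```
# Sketch: write a small DataFrame as JSON to a mounted S3 path.
# The mount point /mnt/my-bucket is an assumption, not from the thread.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.format("json").mode("overwrite").save("/mnt/my-bucket/out/file_name.json")
```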

DouglasLinder
by New Contributor III
  • 4109 Views
  • 5 replies
  • 1 kudos

Is it possible to pass configuration to a job on high concurrency cluster?

On a regular cluster, you can use: ```spark.sparkContext._jsc.hadoopConfiguration().set(key, value)``` These values are then available on the executors via the Hadoop configuration. However, on a high-concurrency cluster, attempting to do so results ...

Latest Reply
Ryan_Chynoweth
Honored Contributor III

I am not sure why you are getting that error on a high-concurrency cluster, as I am able to set the configuration as you show above. Can you try the following code instead? sc._jsc.hadoopConfiguration().set(key, value)
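A sketch of that call with a concrete (illustrative) Hadoop key:

```
# Sketch: set a Hadoop configuration value on the shared context.
# The key/value pair here is illustrative, not from the thread.
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.maximum", "100")
# Executors share this Hadoop configuration, so the value is visible there.
```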
