Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

James1100
by New Contributor II
  • 859 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks connect to GCS

Hi, I would like to ask if anyone knows how to connect to GCS, basically to read a CSV file from a GCS bucket. I have no issue connecting to Data Lake. Thank you so much in advance.

Latest Reply
Vartika
Moderator
  • 2 kudos

Hi @James C, just checking in. If @Kaniz Fatma's answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information? We'd love to hear from you. Cheers!

  • 2 kudos
1 More Replies
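
For the question above about reading a CSV file from a GCS bucket, here is a minimal sketch. It assumes the code runs on a Databricks cluster whose service account can already read the bucket, and that `spark` is the notebook's SparkSession. The helper names `gcs_uri` and `read_csv_from_gcs` are hypothetical, not part of any Databricks API.

```python
def gcs_uri(bucket: str, path: str) -> str:
    """Build a gs:// URI from a bucket name and an object path."""
    return f"gs://{bucket.strip('/')}/{path.lstrip('/')}"


def read_csv_from_gcs(spark, bucket: str, path: str, header: bool = True):
    """Read a CSV object from GCS into a Spark DataFrame.

    Assumes the cluster's service account can already read the bucket.
    """
    return (
        spark.read.format("csv")
        .option("header", str(header).lower())  # first line holds column names
        .option("inferSchema", "true")          # let Spark guess column types
        .load(gcs_uri(bucket, path))
    )
```

In a notebook this might be called as `df = read_csv_from_gcs(spark, "<bucket-name>", "path/to/file.csv")`.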
Pbarbosa154
by New Contributor III
  • 717 Views
  • 2 replies
  • 0 kudos

What is the best way to ingest GCS data into Databricks and apply Anomaly Detection Model?

I recently started exploring the field of Data Engineering and came across some difficulties. I have a bucket in GCS with millions of parquet files and I want to create an Anomaly Detection model with them. I was trying to ingest that data into Datab...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Pedro Barbosa: It seems like you are running out of memory when trying to convert the PySpark dataframe to an H2O frame. One possible approach to solve this issue is to partition the PySpark dataframe before converting it to an H2O frame. You can us...

  • 0 kudos
1 More Replies
explorer
by New Contributor III
  • 3072 Views
  • 6 replies
  • 3 kudos

Getting error while loading parquet data into Postgres (using spark-postgres library) ClassNotFoundException: Failed to find data source: postgres. Please find packages at http://spark.apache.org/third-party-projects.html Caused by: ClassNotFoundException

Hi Fellas - I'm trying to load parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk uploading data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...

Latest Reply
explorer
New Contributor III
  • 3 kudos

Hi @Kaniz Fatma, @Daniel Sahal - A few updates from my side. After much trial and error, psycopg2 worked out in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB memory, 4 cores) and a driver with 32 GB memory and 4 cores, with Run...

  • 3 kudos
5 More Replies
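
The reply above does not show how psycopg2 was used, so the following is only a sketch of one common pattern: each Spark partition is serialized to an in-memory CSV buffer and bulk-loaded with Postgres `COPY` through `cursor.copy_expert`. The `dsn` and `table` arguments and both helper names are hypothetical, and psycopg2 is assumed to be installed on every worker.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV buffer."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)  # None is written as an empty field
    buf.seek(0)
    return buf


def copy_partition(rows, dsn: str, table: str):
    """Bulk-load one partition of rows into Postgres with COPY.

    `dsn` and `table` are hypothetical placeholders supplied by the caller.
    """
    import psycopg2  # imported here so each Spark worker loads it locally
    from psycopg2 import sql

    # Quote the table name safely instead of formatting it into the string.
    stmt = sql.SQL("COPY {} FROM STDIN WITH (FORMAT csv)").format(
        sql.Identifier(table)
    )
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.copy_expert(stmt, rows_to_csv_buffer(rows))
        conn.commit()
    finally:
        conn.close()
```

From a DataFrame read from the parquet files, this could be driven with `df.rdd.foreachPartition(lambda part: copy_partition(part, dsn, "target_table"))`. Buffering a whole partition in memory is a simplification; very large partitions may need to be repartitioned or streamed in chunks.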
shrutis23
by New Contributor III
  • 2517 Views
  • 5 replies
  • 4 kudos

How to use Delta Live Tables with Google Cloud Storage

Hi Team, I have been working on a POC exploring Delta Live Tables with a GCS location. I have some doubts: how do we access the GCS bucket? We have a connection established using a Databricks service account. In a normal cluster creation, we go to the cluster page...

Latest Reply
Senthil1
Contributor
  • 4 kudos

Kindly mount the DBFS location to GCS cloud storage; see: Mounting cloud object storage on Databricks | Databricks on Google Cloud

  • 4 kudos
4 More Replies
MBV3
by New Contributor III
  • 1066 Views
  • 1 reply
  • 2 kudos

Delete a file from GCS folder

What is the best way to delete files from a GCP bucket inside a Spark job?

Latest Reply
Unforgiven
Valued Contributor III
  • 2 kudos

@M Baig, yes, you just need to create a service account for Databricks and then assign the Storage Admin role on the bucket. After that you can mount GCS the standard way:
bucket_name = "<bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount("gs://%s" % bucket_na...

  • 2 kudos
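
For the delete itself, one option is to call `dbutils.fs.rm` on the bucket path once the service account has permission to delete objects. A minimal sketch, assuming it runs on the driver of a Databricks notebook or job where `dbutils` is available; `delete_gcs_path` is a hypothetical helper name.

```python
def delete_gcs_path(dbutils, bucket: str, path: str, recurse: bool = False) -> bool:
    """Delete a file (or a whole folder when recurse=True) from a GCS bucket.

    Returns whatever dbutils.fs.rm returns for the target path.
    """
    target = f"gs://{bucket}/{path.lstrip('/')}"
    return dbutils.fs.rm(target, recurse)
```

For example, `delete_gcs_path(dbutils, "<bucket-name>", "tmp/old_output", recurse=True)` would remove a folder and everything under it, so it is worth double-checking the path before running it.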
rajib76
by New Contributor II
  • 1555 Views
  • 1 reply
  • 2 kudos

Resolved! DBFS with Google Cloud Storage (GCS)

Does DBFS support GCS?

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

Yes, you just need to create a service account for Databricks and then assign the Storage Admin role on the bucket. After that you can mount GCS the standard way:
bucket_name = "<bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount("gs://%s" % bucket_name, "/m...

  • 2 kudos
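
Since the quoted snippet is cut off, here is a hedged sketch of what a repeatable mount step could look like. The `/mnt/<mount-name>` mount point is an assumption (the original path is truncated), and `mount_gcs_bucket` is a hypothetical helper; it relies only on `dbutils.fs.mounts()` and `dbutils.fs.mount()`.

```python
def mount_gcs_bucket(dbutils, bucket_name: str, mount_name: str) -> str:
    """Mount a GCS bucket under DBFS unless that mount point already exists.

    The /mnt/<mount_name> location is an assumed convention, not taken
    from the truncated snippet in the reply.
    """
    mount_point = f"/mnt/{mount_name}"
    already_mounted = any(
        m.mountPoint == mount_point for m in dbutils.fs.mounts()
    )
    if not already_mounted:
        dbutils.fs.mount(f"gs://{bucket_name}", mount_point)
    return mount_point
```

Checking the existing mounts first lets the same notebook cell be rerun without failing on a mount point that is already in use.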
Srikanth_Gupta_
by Valued Contributor
  • 881 Views
  • 1 reply
  • 0 kudos

Resolved! Does the size of optimized files after running OPTIMIZE vary between cloud providers (S3, Blob, and GCS)?

Are there any other parameters to consider when running OPTIMIZE, depending on the cloud vendor?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

OPTIMIZE is not dependent on the cloud provider whatsoever. It will produce the same results regardless of the underlying storage. It is idempotent, meaning that if it is run twice on the same dataset, the second execution has no effect.

  • 0 kudos