<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: How to create a dataframe with the files from S3 bucket in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27814#M19662</link>
    <description>&lt;P&gt;I have already checked this, but I still can't see any data.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df = spark.read.text("mnt/S3_Connection/Details.csv")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After running this read, the dataframe still shows no data.&lt;/P&gt;</description>
    <pubDate>Thu, 19 Sep 2019 07:43:15 GMT</pubDate>
    <dc:creator>akj2784</dc:creator>
    <dc:date>2019-09-19T07:43:15Z</dc:date>
    <item>
      <title>How to create a dataframe with the files from S3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27812#M19660</link>
      <description>&lt;P&gt;I have connected my S3 bucket to Databricks using the following commands:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import urllib
import urllib.parse

ACCESS_KEY = "Test"
SECRET_KEY = "Test"
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "Test"
MOUNT_NAME = "S3_Connection_details"
dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)&lt;/CODE&gt;&lt;/PRE&gt;
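[Editor's note] Why the secret key goes through urllib.parse.quote with an empty "safe" string can be sketched locally; the key below is a made-up placeholder, not a real credential:

```python
from urllib.parse import quote

# Hypothetical secret containing characters that are special in URLs.
SECRET_KEY = "abc/def+ghi"

# safe="" (the empty second argument in the post above) forces every
# reserved character, including "/", to be percent-encoded, so the key
# cannot be mistaken for part of the s3n:// URL path.
ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")
print(ENCODED_SECRET_KEY)  # abc%2Fdef%2Bghi
```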
&lt;P&gt;Now when I run the command below, I get the list of CSV files present in the bucket.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;display(dbutils.fs.ls("/mnt/S3_Connection"))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If there are 10 files, I want to create 10 different tables in PostgreSQL after reading the CSV files. I don't need any transformation. Is that feasible?&lt;/P&gt;
&lt;P&gt;First of all, how do I create a dataframe from one of the CSV files? I would appreciate help with the syntax.&lt;/P&gt;
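[Editor's note] A minimal sketch of the loop the question describes. The mount path comes from the post; the JDBC URL, credentials, and helper names are hypothetical placeholders, and the Spark calls themselves need a cluster with the PostgreSQL JDBC driver attached:

```python
import os
import re

def table_name_for(path):
    """Derive a PostgreSQL-friendly table name from a CSV file path."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"[^a-z0-9_]+", "_", stem.lower()).strip("_")

def copy_csv_to_postgres(spark, csv_path, jdbc_url, props):
    # Read the CSV with a header row and inferred column types, then
    # write it unchanged (no transformation) to a table named after the file.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.jdbc(url=jdbc_url, table=table_name_for(csv_path),
                  mode="overwrite", properties=props)

# Hypothetical usage on Databricks (jdbc_url and props are placeholders):
# props = {"user": "...", "password": "...", "driver": "org.postgresql.Driver"}
# for f in dbutils.fs.ls("/mnt/S3_Connection"):
#     if f.path.endswith(".csv"):
#         copy_csv_to_postgres(spark, f.path, "jdbc:postgresql://host:5432/db", props)
```

One table per file then falls out of the listing loop, with no intermediate step needed.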
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Akash&lt;/P&gt; 
</description>
      <pubDate>Thu, 19 Sep 2019 07:05:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27812#M19660</guid>
      <dc:creator>akj2784</dc:creator>
      <dc:date>2019-09-19T07:05:10Z</dc:date>
    </item>
    <item>
      <title>Re: How to create a dataframe with the files from S3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27813#M19661</link>
      <description>&lt;P&gt;Hi @akj2784,&lt;/P&gt;&lt;P&gt;Please go through the Databricks documentation on working with files in S3:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs" target="_blank"&gt;https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-s3-buckets-with-dbfs&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 07:13:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27813#M19661</guid>
      <dc:creator>shyam_9</dc:creator>
      <dc:date>2019-09-19T07:13:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to create a dataframe with the files from S3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27814#M19662</link>
      <description>&lt;P&gt;I have already checked this, but I still can't see any data.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df = spark.read.text("mnt/S3_Connection/Details.csv")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After running this read, the dataframe still shows no data.&lt;/P&gt;
</description>
      <pubDate>Thu, 19 Sep 2019 07:43:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27814#M19662</guid>
      <dc:creator>akj2784</dc:creator>
      <dc:date>2019-09-19T07:43:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to create a dataframe with the files from S3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27815#M19663</link>
      <description>&lt;P&gt;Try reading with one of the methods below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read.text("/mnt/%s/...." % MOUNT_NAME)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;or&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = sc.textFile("s3a://%s:%s@%s/.../..." % (ACCESS_KEY, ENCODED_SECRET_KEY, BUCKET_NAME))&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 19 Sep 2019 07:53:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27815#M19663</guid>
      <dc:creator>shyam_9</dc:creator>
      <dc:date>2019-09-19T07:53:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to create a dataframe with the files from S3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27816#M19664</link>
      <description>&lt;P&gt;I am able to create the dataframe, but when I run df.head() I see only the column names. I want to see the data as well.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 08:15:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27816#M19664</guid>
      <dc:creator>akj2784</dc:creator>
      <dc:date>2019-09-19T08:15:59Z</dc:date>
    </item>
    <item>
      <title>Re: How to create a dataframe with the files from S3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27817#M19665</link>
      <description>&lt;P&gt;Please take a look at the documentation: df.head() returns only the first row by default, but you can pass an integer n to return the first n rows: &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.head" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.head&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Also look at other display methods, such as df.show() or the Databricks-specific display(df) function.&lt;/P&gt;
</description>
      <pubDate>Thu, 19 Sep 2019 15:14:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-create-a-dataframe-with-the-files-from-s3-bucket/m-p/27817#M19665</guid>
      <dc:creator>lee</dc:creator>
      <dc:date>2019-09-19T15:14:55Z</dc:date>
    </item>
  </channel>
</rss>

