<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Parsing from PDF to a Structured Table  | Looking for best practies in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parsing-from-pdf-to-a-structured-table-looking-for-best-practies/m-p/132148#M49370</link>
    <description>&lt;P&gt;&lt;STRONG&gt;Use Case:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Converting unstructured data from PDF to structured format before sending to Salesforce&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Ask:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Best practices to structure my table better before sending to a system like salesforce&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Output in structured format looks like:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20016i822AE97B1C64054A/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;&lt;STRONG&gt;My code:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Extract Tables from PDF&lt;/LI&gt;&lt;LI&gt;Unify Schema and Save Delta Table&lt;/LI&gt;&lt;LI&gt;Auto Verification&lt;/LI&gt;&lt;LI&gt;Access via Delta Sharing&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;import pdfplumber&lt;BR /&gt;import re&lt;BR /&gt;from pyspark.sql import Row, SparkSession&lt;BR /&gt;from delta_sharing import SharingClient&lt;BR /&gt;import fsspec&lt;/P&gt;&lt;P&gt;# --- Configuration ---&lt;BR /&gt;pdf_path = "/Volumes/workspace/default/pdf/ntt docomo japan.pdf"&lt;BR /&gt;target_table = "workspace.default.structured_pdf_table"&lt;/P&gt;&lt;P&gt;# --- Delta Sharing token from Databricks secret ---&lt;BR /&gt;token = dbutils.secrets.get(scope="delta_sharing_scope", key="delta_token")&lt;BR /&gt;print("Delta Sharing token retrieved successfully.")&lt;/P&gt;&lt;P&gt;# --- Helper: Normalize headers ---&lt;BR /&gt;def normalize_header(header, idx):&lt;BR /&gt;if not header or not isinstance(header, str) or header.strip() == "":&lt;BR /&gt;return f"col{idx+1}"&lt;BR /&gt;clean = re.sub(r'[^0-9a-zA-Z_]', '', header.strip().lower().replace(" ", "_"))&lt;BR /&gt;return clean if clean else f"col{idx+1}"&lt;/P&gt;&lt;P&gt;# --- Step 1: Extract PDF tables + fallback text ---&lt;BR /&gt;structured_data = []&lt;BR /&gt;try:&lt;BR /&gt;with pdfplumber.open(pdf_path) as pdf:&lt;BR /&gt;print(f"Opened PDF '{pdf_path}' with {len(pdf.pages)} page(s).")&lt;BR /&gt;for page_num, page in enumerate(pdf.pages, start=1):&lt;BR /&gt;tables = page.extract_tables()&lt;BR /&gt;if tables:&lt;BR /&gt;print(f"Table(s) found on page {page_num}")&lt;BR /&gt;for table in tables:&lt;BR /&gt;headers = table[0]&lt;BR /&gt;rows = table[1:]&lt;BR /&gt;clean_headers = [normalize_header(h, i) for i, h in enumerate(headers)]&lt;BR /&gt;for row in rows:&lt;BR /&gt;row_dict = {clean_headers[i]: row[i] if i &amp;lt; len(row) else None&lt;BR /&gt;for i in range(len(clean_headers))}&lt;BR /&gt;structured_data.append(Row(**row_dict))&lt;BR /&gt;else:&lt;BR /&gt;print(f"No tables found on page {page_num}, using fallback text parsing...")&lt;BR /&gt;page_text = page.extract_text()&lt;BR /&gt;if page_text:&lt;BR /&gt;lines = page_text.split("\n")&lt;BR /&gt;for line in lines:&lt;BR /&gt;cols = re.split(r"\s{2,}", line.strip())&lt;BR /&gt;row_dict = {f"col{i+1}": cols[i] if i &amp;lt; len(cols) else None&lt;BR /&gt;for i in range(6)}&lt;BR /&gt;row_dict["raw_line"] = line&lt;BR /&gt;structured_data.append(Row(**row_dict))&lt;BR /&gt;except Exception as e:&lt;BR /&gt;print(f"Error during PDF parsing: {e}")&lt;/P&gt;&lt;P&gt;# --- Step 2: Unify schema + create Delta table ---&lt;BR /&gt;if structured_data:&lt;BR /&gt;try:&lt;BR /&gt;# Determine all possible columns&lt;BR /&gt;all_cols = set()&lt;BR /&gt;for r in structured_data:&lt;BR /&gt;all_cols.update(r.asDict().keys())&lt;BR /&gt;all_cols = list(all_cols)&lt;/P&gt;&lt;P&gt;# Ensure all rows have all columns&lt;BR /&gt;uniform_rows = []&lt;BR /&gt;for r in structured_data:&lt;BR /&gt;d = r.asDict()&lt;BR /&gt;for c in all_cols:&lt;BR /&gt;if c not in d:&lt;BR /&gt;d[c] = None&lt;BR /&gt;uniform_rows.append(Row(**d))&lt;/P&gt;&lt;P&gt;# Create DataFrame&lt;BR /&gt;spark = SparkSession.builder.getOrCreate()&lt;BR /&gt;df = spark.createDataFrame(uniform_rows)&lt;/P&gt;&lt;P&gt;# Force all columns to string to avoid type inference issues&lt;BR /&gt;df = df.selectExpr(*[f"CAST({c} AS STRING) as {c}" for c in all_cols])&lt;/P&gt;&lt;P&gt;# Preview before saving&lt;BR /&gt;print("\n&lt;span class="lia-unicode-emoji" title=":magnifying_glass_tilted_right:"&gt;🔎&lt;/span&gt; Preview of structured data before saving:")&lt;BR /&gt;display(df.limit(20))&lt;/P&gt;&lt;P&gt;# Save Delta table&lt;BR /&gt;df.write.mode("overwrite").saveAsTable(target_table)&lt;BR /&gt;print(f"\n Structured data saved to Delta table: {target_table}")&lt;/P&gt;&lt;P&gt;except Exception as e:&lt;BR /&gt;print(f"Error while saving Delta table: {e}")&lt;BR /&gt;else:&lt;BR /&gt;print(" No structured data extracted from the PDF.")&lt;/P&gt;&lt;P&gt;# --- Step 3: Auto Verification ---&lt;BR /&gt;try:&lt;BR /&gt;print("\n&lt;span class="lia-unicode-emoji" title=":magnifying_glass_tilted_right:"&gt;🔎&lt;/span&gt; Step 3a: Check if table exists in workspace.default")&lt;BR /&gt;tables = spark.catalog.listTables("workspace.default")&lt;BR /&gt;table_names = [t.name for t in tables]&lt;BR /&gt;if "structured_pdf_table" in table_names:&lt;BR /&gt;print("Table exists in workspace.default")&lt;BR /&gt;else:&lt;BR /&gt;print("Table NOT found in workspace.default")&lt;/P&gt;&lt;P&gt;df = spark.table(target_table)&lt;BR /&gt;print("\nStep 3b: Preview first 20 rows of the Delta table")&lt;BR /&gt;display(df.limit(20))&lt;/P&gt;&lt;P&gt;print("\nStep 3c: Delta table schema")&lt;BR /&gt;df.printSchema()&lt;/P&gt;&lt;P&gt;print("\nStep 3d: Total number of rows in the table")&lt;BR /&gt;print(df.count())&lt;/P&gt;&lt;P&gt;except Exception as e:&lt;BR /&gt;print(f"Error during table verification: {e}")&lt;/P&gt;&lt;P&gt;# --- Step 4: Access via Delta Sharing ---&lt;BR /&gt;try:&lt;BR /&gt;profile_url = f"&lt;A href="https://dbc-5ec65784-4b66.cloud.databricks.com/shares/pdf_share/profile.json?token={token" target="_blank" rel="noopener"&gt;https://dbc-5ec65784-4b66.cloud.databricks.com/shares/pdf_share/profile.json?token={token&lt;/A&gt;}"&lt;/P&gt;&lt;P&gt;# Use fsspec to handle token safely&lt;BR /&gt;fs = fsspec.filesystem("https", expand=True)&lt;BR /&gt;with fs.open(profile_url) as f:&lt;BR /&gt;client = SharingClient(f)&lt;BR /&gt;shared_table = client.load_table(target_table)&lt;BR /&gt;shared_df = shared_table.to_pandas()&lt;BR /&gt;print("\nStep 4: Preview Delta Shared table (first 20 rows)")&lt;BR /&gt;display(shared_df.head(20))&lt;/P&gt;&lt;P&gt;except Exception as e:&lt;BR /&gt;print(f"Could not access table via Delta Sharing: {e}")&lt;/P&gt;</description>
    <pubDate>Tue, 16 Sep 2025 17:39:25 GMT</pubDate>
    <dc:creator>ManojkMohan</dc:creator>
    <dc:date>2025-09-16T17:39:25Z</dc:date>
    <item>
      <title>Parsing from PDF to a Structured Table  | Looking for best practies</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-from-pdf-to-a-structured-table-looking-for-best-practies/m-p/132148#M49370</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Use Case:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Converting unstructured data from PDF to structured format before sending to Salesforce&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Ask:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Best practices to structure my table better before sending to a system like salesforce&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Output in structured format looks like:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20016i822AE97B1C64054A/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;&lt;STRONG&gt;My code:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Extract Tables from PDF&lt;/LI&gt;&lt;LI&gt;Unify Schema and Save Delta Table&lt;/LI&gt;&lt;LI&gt;Auto Verification&lt;/LI&gt;&lt;LI&gt;Access via Delta Sharing&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;import pdfplumber&lt;BR /&gt;import re&lt;BR /&gt;from pyspark.sql import Row, SparkSession&lt;BR /&gt;from delta_sharing import SharingClient&lt;BR /&gt;import fsspec&lt;/P&gt;&lt;P&gt;# --- Configuration ---&lt;BR /&gt;pdf_path = "/Volumes/workspace/default/pdf/ntt docomo japan.pdf"&lt;BR /&gt;target_table = "workspace.default.structured_pdf_table"&lt;/P&gt;&lt;P&gt;# --- Delta Sharing token from Databricks secret ---&lt;BR /&gt;token = dbutils.secrets.get(scope="delta_sharing_scope", key="delta_token")&lt;BR /&gt;print("Delta Sharing token retrieved successfully.")&lt;/P&gt;&lt;P&gt;# --- Helper: Normalize headers ---&lt;BR /&gt;def normalize_header(header, idx):&lt;BR /&gt;if not header or not isinstance(header, str) or header.strip() == "":&lt;BR /&gt;return f"col{idx+1}"&lt;BR /&gt;clean = re.sub(r'[^0-9a-zA-Z_]', '', header.strip().lower().replace(" ", "_"))&lt;BR /&gt;return clean if clean else f"col{idx+1}"&lt;/P&gt;&lt;P&gt;# --- Step 1: Extract PDF tables + fallback text ---&lt;BR /&gt;structured_data = []&lt;BR /&gt;try:&lt;BR /&gt;with pdfplumber.open(pdf_path) as pdf:&lt;BR /&gt;print(f"Opened PDF '{pdf_path}' with {len(pdf.pages)} page(s).")&lt;BR /&gt;for page_num, page in enumerate(pdf.pages, start=1):&lt;BR /&gt;tables = page.extract_tables()&lt;BR /&gt;if tables:&lt;BR /&gt;print(f"Table(s) found on page {page_num}")&lt;BR /&gt;for table in tables:&lt;BR /&gt;headers = table[0]&lt;BR /&gt;rows = table[1:]&lt;BR /&gt;clean_headers = [normalize_header(h, i) for i, h in enumerate(headers)]&lt;BR /&gt;for row in rows:&lt;BR /&gt;row_dict = {clean_headers[i]: row[i] if i &amp;lt; len(row) else None&lt;BR /&gt;for i in range(len(clean_headers))}&lt;BR /&gt;structured_data.append(Row(**row_dict))&lt;BR /&gt;else:&lt;BR /&gt;print(f"No tables found on page {page_num}, using fallback text parsing...")&lt;BR /&gt;page_text = page.extract_text()&lt;BR /&gt;if page_text:&lt;BR /&gt;lines = page_text.split("\n")&lt;BR /&gt;for line in lines:&lt;BR /&gt;cols = re.split(r"\s{2,}", line.strip())&lt;BR /&gt;row_dict = {f"col{i+1}": cols[i] if i &amp;lt; len(cols) else None&lt;BR /&gt;for i in range(6)}&lt;BR /&gt;row_dict["raw_line"] = line&lt;BR /&gt;structured_data.append(Row(**row_dict))&lt;BR /&gt;except Exception as e:&lt;BR /&gt;print(f"Error during PDF parsing: {e}")&lt;/P&gt;&lt;P&gt;# --- Step 2: Unify schema + create Delta table ---&lt;BR /&gt;if structured_data:&lt;BR /&gt;try:&lt;BR /&gt;# Determine all possible columns&lt;BR /&gt;all_cols = set()&lt;BR /&gt;for r in structured_data:&lt;BR /&gt;all_cols.update(r.asDict().keys())&lt;BR /&gt;all_cols = list(all_cols)&lt;/P&gt;&lt;P&gt;# Ensure all rows have all columns&lt;BR /&gt;uniform_rows = []&lt;BR /&gt;for r in structured_data:&lt;BR /&gt;d = r.asDict()&lt;BR /&gt;for c in all_cols:&lt;BR /&gt;if c not in d:&lt;BR /&gt;d[c] = None&lt;BR /&gt;uniform_rows.append(Row(**d))&lt;/P&gt;&lt;P&gt;# Create DataFrame&lt;BR /&gt;spark = SparkSession.builder.getOrCreate()&lt;BR /&gt;df = spark.createDataFrame(uniform_rows)&lt;/P&gt;&lt;P&gt;# Force all columns to string to avoid type inference issues&lt;BR /&gt;df = df.selectExpr(*[f"CAST({c} AS STRING) as {c}" for c in all_cols])&lt;/P&gt;&lt;P&gt;# Preview before saving&lt;BR /&gt;print("\n&lt;span class="lia-unicode-emoji" title=":magnifying_glass_tilted_right:"&gt;🔎&lt;/span&gt; Preview of structured data before saving:")&lt;BR /&gt;display(df.limit(20))&lt;/P&gt;&lt;P&gt;# Save Delta table&lt;BR /&gt;df.write.mode("overwrite").saveAsTable(target_table)&lt;BR /&gt;print(f"\n Structured data saved to Delta table: {target_table}")&lt;/P&gt;&lt;P&gt;except Exception as e:&lt;BR /&gt;print(f"Error while saving Delta table: {e}")&lt;BR /&gt;else:&lt;BR /&gt;print(" No structured data extracted from the PDF.")&lt;/P&gt;&lt;P&gt;# --- Step 3: Auto Verification ---&lt;BR /&gt;try:&lt;BR /&gt;print("\n&lt;span class="lia-unicode-emoji" title=":magnifying_glass_tilted_right:"&gt;🔎&lt;/span&gt; Step 3a: Check if table exists in workspace.default")&lt;BR /&gt;tables = spark.catalog.listTables("workspace.default")&lt;BR /&gt;table_names = [t.name for t in tables]&lt;BR /&gt;if "structured_pdf_table" in table_names:&lt;BR /&gt;print("Table exists in workspace.default")&lt;BR /&gt;else:&lt;BR /&gt;print("Table NOT found in workspace.default")&lt;/P&gt;&lt;P&gt;df = spark.table(target_table)&lt;BR /&gt;print("\nStep 3b: Preview first 20 rows of the Delta table")&lt;BR /&gt;display(df.limit(20))&lt;/P&gt;&lt;P&gt;print("\nStep 3c: Delta table schema")&lt;BR /&gt;df.printSchema()&lt;/P&gt;&lt;P&gt;print("\nStep 3d: Total number of rows in the table")&lt;BR /&gt;print(df.count())&lt;/P&gt;&lt;P&gt;except Exception as e:&lt;BR /&gt;print(f"Error during table verification: {e}")&lt;/P&gt;&lt;P&gt;# --- Step 4: Access via Delta Sharing ---&lt;BR /&gt;try:&lt;BR /&gt;profile_url = f"&lt;A href="https://dbc-5ec65784-4b66.cloud.databricks.com/shares/pdf_share/profile.json?token={token" target="_blank" rel="noopener"&gt;https://dbc-5ec65784-4b66.cloud.databricks.com/shares/pdf_share/profile.json?token={token&lt;/A&gt;}"&lt;/P&gt;&lt;P&gt;# Use fsspec to handle token safely&lt;BR /&gt;fs = fsspec.filesystem("https", expand=True)&lt;BR /&gt;with fs.open(profile_url) as f:&lt;BR /&gt;client = SharingClient(f)&lt;BR /&gt;shared_table = client.load_table(target_table)&lt;BR /&gt;shared_df = shared_table.to_pandas()&lt;BR /&gt;print("\nStep 4: Preview Delta Shared table (first 20 rows)")&lt;BR /&gt;display(shared_df.head(20))&lt;/P&gt;&lt;P&gt;except Exception as e:&lt;BR /&gt;print(f"Could not access table via Delta Sharing: {e}")&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 17:39:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-from-pdf-to-a-structured-table-looking-for-best-practies/m-p/132148#M49370</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-16T17:39:25Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing from PDF to a Structured Table  | Looking for best practies</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-from-pdf-to-a-structured-table-looking-for-best-practies/m-p/132153#M49373</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp; My advice for parsing PDFs:&lt;BR /&gt;&lt;BR /&gt;1. Will your project have PDFs that are all the same in terms of formatting? i.e. invoices of a particular type where things like addresses and values might change but their position on the page is mostly the same. If yes, you could use standard OCR/CV python packages to parse the documents and use Regex to parse out the relevant pieces of information. You'd also be able to trust this pipeline is pretty robust as the invoices will be the same format. You'll just need to lean into Regex a little more and ensure you've built Regular Expressions that are dynamic.&lt;BR /&gt;&lt;BR /&gt;2. If all the PDFs are pretty much&amp;nbsp;&lt;STRONG&gt;exactly&lt;/STRONG&gt; the same (formatting-wise), you could also consider looking into Image Templating. It's probably a combination of OCR and CV. I'm sure there's some projects out there to help with this part. You'd supply the template, almost like training a model, it'd then strip this info out on subsequent PDFs that match.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;3. Of course, GenAI will massively help with PDFs. Imagine the sheer volume of PDFs it's looked through. So you could use a model to help tabulate the data. Especially if all the PDFs are in different formats with loads of different info. You really would need GenAI for a largescale PDF solution if you value time.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;4. Have you tried the new ai_parse_document function in Databricks?&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document&lt;/A&gt;&amp;nbsp;I think this would be worth exploring.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 18:34:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-from-pdf-to-a-structured-table-looking-for-best-practies/m-p/132153#M49373</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-16T18:34:31Z</dc:date>
    </item>
  </channel>
</rss>

