{"id":69913,"date":"2025-03-09T09:57:28","date_gmt":"2025-03-09T04:27:28","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=69913"},"modified":"2025-03-19T15:37:15","modified_gmt":"2025-03-19T10:07:15","slug":"snowflake-document-ai-unlocking-insights-from-unstructured-data","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/snowflake-document-ai-unlocking-insights-from-unstructured-data\/","title":{"rendered":"Snowflake Document AI : Unlocking Insights from Unstructured Data"},"content":{"rendered":"<p>Fun fact! Around 80%-90% of the world&#8217;s data is unstructured. I was shocked when I read this fact. The unstructured data contains images, emails, PDF files social media posts, and other formats. Even though it is widely present 70% of data is not being used to drive insights and get analytics.<\/p>\n<p>As a Data Engineer, you must have developed many data pipelines using different tools for structured or semi-structured data. But have you ever built a single pipeline for highly unstructured data like PDFs, Emails, data captured from images, etc? The answer probably might be NO. It&#8217;s obvious that extracting data from unstructured data is really tough.<\/p>\n<p>In the Gen AI and ML era, It is no longer a tedious job to do. Thanks to the latest feature from Snowflake, It&#8217;s Document AI. This is a game-changing innovation, a powerful feature built to extract and analyze data from unstructured data.<\/p>\n<h3>Document AI: Intelligent Document Processing<\/h3>\n<p><a href=\"https:\/\/www.tothenew.com\/data-analytics\/snowflake-cloud-services\">Snowflake<\/a> <strong>Document AI<\/strong> is the latest feature that comes under AI &amp; ML. It is an AI &amp; ML-powered feature that extracts data from unstructured documents such as PDFs, scanned images, and emails. It integrates with other Snowflake features smoothly, which enables users to analyze extracted information without any other tools or ETL process.<\/p>\n<h2>How Does Snowflake Document AI Extract Data?<\/h2>\n<div id=\"attachment_69927\" style=\"width: 635px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-69927\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-69927 size-large\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction-1024x498.webp\" alt=\"extraction info from the document\" width=\"625\" height=\"304\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction-1024x498.webp 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction-300x146.webp 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction-768x373.webp 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction-1536x747.webp 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction-624x303.webp 624w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/extraction.webp 1600w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><p id=\"caption-attachment-69927\" class=\"wp-caption-text\">extraction info from the document<\/p><\/div>\n<p>The process of extracting data from unstructured documents involves several key steps:<\/p>\n<ol>\n<li><strong>Document upload to stage<\/strong>\n<ul>\n<li>Users upload unstructured documents (PDFs, scanned images, or text-based files) into Snowflake&#8217;s internal or external stage.<\/li>\n<li>Snowflake preprocesses it to check for readability and stores metadata.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Information Extraction<\/strong>\n<ul>\n<li><strong>Optical Character Recognition (OCR): <\/strong>It converts PDF files and images to natural English readable text.<\/li>\n<li><strong>Named Entity Recognition (NER): <\/strong>It identifies context-based important tags like dates, names, addresses, invoice numbers, etc.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Data Structure &amp; Storage<\/strong>\n<ul>\n<li>The extracted information is converted into structured JSON formats.<\/li>\n<li>The structured data is loaded into a snowflake table for querying, analytics, and reporting.<\/li>\n<\/ul>\n<\/li>\n<li>Querying &amp; Analytics\n<ul>\n<li>Once the data is loaded to tables, users can perform SQL queries, ML modeling, and advanced analytics.<\/li>\n<li>It enables organizations to derive hidden insights from unstructured data.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h3>What\u2019s the Technology Behind Snowflake Document AI?<\/h3>\n<p>Snowflake Document AI leverages a combination of advanced <strong>AI\/ML models<\/strong> and <strong>proprietary algorithms<\/strong> to extract and structure data efficiently. Here are some of the key technologies behind the scenes:<\/p>\n<ol>\n<li><strong>Optical Character Recognition (OCR)<\/strong>\n<ul>\n<li>OCR technology helps convert images, scanned text, and PDFs into digital text.<\/li>\n<li>Snowflake uses deep-learning-based OCR models to improve accuracy in handwritten and printed text recognition.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Natural Language Processing (NLP)<\/strong>\n<ul>\n<li>NLP models identify and categorize key entities, sections, and tables within a document.<\/li>\n<li>Advanced language models (similar to Transformer-based AI models like BERT or GPT) enable context-aware text extraction.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Named Entity Recognition (NER)<\/strong>\n<ul>\n<li>Categories and identifies important context-based data tags such as Names, dates, addresses, invoice numbers, legal terms, etc.<\/li>\n<li>It can also highly classify personal and sensitive data and tag it for business-critical applications.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Machine Learning-Based Pattern Recognition<\/strong>\n<ul>\n<li>Uses ML algorithms to detect tables, forms, and structured data patterns within documents.<\/li>\n<li>It ensures that extracted data has its contextual meaning and relationships.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Snowflake\u2019s Native AI Integration<\/strong>\n<ul>\n<li>Snowflake Document AI is automatically integrated into Snowflake\u2019s ecosystem, It means users can perform SQL-based queries on extracted data without needing any additional ETL processes.<\/li>\n<li>The processed data is directly available for BI tools, dashboards, and machine learning applications.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>Oh&#8230; A lot of theory. Now it&#8217;s time to make your hands dirty. Head towards now Snowflake Snowsight interface. We will go step by step to extract the data from PDF files specifically internet bills that are uploaded to an internal stage in Snowflake. We&#8217;ll then insert the extracted information into a Snowflake table. Let&#8217;s get started!<\/p>\n<h4>Step 1:\u00a0Create the database and the schema.<\/h4>\n<pre>-- Create the database \r\nCREATE DATABASE TESTDB; \r\n-- Create the schema inside the database \r\nCREATE SCHEMA TESTDB.TESTSCHEMA;\r\n\r\nUSE database testdb;\r\nUSE schema testschema;<\/pre>\n<h4>Step 2: Create an internal stage.<\/h4>\n<pre>CREATE OR REPLACE STAGE airtel_stage\r\nDIRECTORY = (ENABLE = TRUE)\r\nENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');<\/pre>\n<h4>Step 3: Create a stream on stage.<\/h4>\n<pre>CREATE STREAM airtel_stream ON STAGE airtel_stage;\r\nSHOW STREAMS;\r\nALTER STAGE airtel_stage REFRESH;<\/pre>\n<h4>Step 4: Create a table.<\/h4>\n<pre>-- The metadata about the pdf file will be loaded to the first four columns and the information \r\n-- extracted will be loaded to the json_content column in JSON format which we need to parse JSON later.\r\n\r\nCREATE OR REPLACE TABLE airtel_bill (\r\nfile_name VARCHAR,\r\nfile_url VARCHAR,\r\nfile_size VARIANT,\r\nlast_modified VARCHAR,\r\njson_content VARCHAR\r\n);<\/pre>\n<h4>Step 5: Create a Document AI model<\/h4>\n<div id=\"attachment_69928\" style=\"width: 635px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-69928\" decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-69928\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model-1024x439.png\" alt=\"Document AI Model\" width=\"625\" height=\"268\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model-1024x439.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model-300x129.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model-768x329.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model-1536x658.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model-624x267.png 624w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/document-AI-Model.png 1904w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><p id=\"caption-attachment-69928\" class=\"wp-caption-text\">Document AI Model<\/p><\/div>\n<ul>\n<li>Go to the AI &amp; ML menu from the left panel and select Document AI.<\/li>\n<li>Click on the &#8220;<strong>+ Build&#8221;<\/strong> button top right corner to create a new model.<\/li>\n<li>Provide the name of the model, database name, schema name, and description.<\/li>\n<li>Click on the newly created model.<\/li>\n<li>Go to the <strong>Document<\/strong> section and add a few test PDF files to train the model.<\/li>\n<li>Go to the <strong>Values<\/strong> section and click on the <strong>Manage Values<\/strong> button.<\/li>\n<li>Click on the &#8220;<strong>+ value&#8221;<\/strong> button to add a new Value. Provide a key name and question to extract the information.\n<ul>\n<li>Example 1: The key name is <strong>&#8220;invoice_no&#8221;<\/strong> and the question would be <strong>&#8220;Extract the invoice number from this document&#8221;.<\/strong><\/li>\n<li>Example 2: The key name is <strong>&#8220;bill_period&#8221;<\/strong> and the question would be <strong>&#8220;What is the billing period&#8221;.<\/strong><\/li>\n<\/ul>\n<\/li>\n<li>You can add as many values as you need. It will try to extract the information from the document and If the extracted value is correct click on the <strong>&#8220;OK&#8221;<\/strong> button.<\/li>\n<\/ul>\n<div id=\"attachment_69929\" style=\"width: 1914px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-69929\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-69929 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values.png\" alt=\"Manage Values\" width=\"1904\" height=\"585\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values.png 1904w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values-300x92.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values-1024x315.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values-768x236.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values-1536x472.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/manage-values-624x192.png 624w\" sizes=\"(max-width: 1904px) 100vw, 1904px\" \/><p id=\"caption-attachment-69929\" class=\"wp-caption-text\">Manage Values<\/p><\/div>\n<ul>\n<li>Now run this model on all of your test files and check whether it is extracting correct values or not. If not you have to change the format of your question so that it will extract correct information.<\/li>\n<\/ul>\n<div id=\"attachment_69931\" style=\"width: 1925px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-69931\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-69931 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1.png\" alt=\"Extracting Info\" width=\"1915\" height=\"819\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1.png 1915w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1-300x128.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1-1024x438.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1-768x328.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1-1536x657.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/Extracting-info-1-624x267.png 624w\" sizes=\"(max-width: 1915px) 100vw, 1915px\" \/><p id=\"caption-attachment-69931\" class=\"wp-caption-text\">Extracting Info<\/p><\/div>\n<ul>\n<li>For testing try to extract the information from the files that have different formats because it will make your model efficient for any document.<\/li>\n<li>Publish the document. It will show the &#8220;<strong>latest version&#8221;<\/strong> of your model and <strong>&#8220;extract a query section&#8221;<\/strong> to get the values from the document.<\/li>\n<\/ul>\n<h4>Step 6: Create a task.<\/h4>\n<pre>-- Create a new task which will run interval of 1 minute and process the file uploaded to internal stage and loads the data to table airtel_bill.\r\n-- &lt;model_name&gt;!PREDICT() is the syntax to use to get the extracted data from a uploaded file.\r\n\r\nCREATE OR REPLACE TASK load_new_file_data\r\nWAREHOUSE = XSMALL\r\nSCHEDULE = '1 minute'\r\nCOMMENT = 'Process new files in the stage and insert data into the airtel_bill table.'\r\nWHEN SYSTEM$STREAM_HAS_DATA('airtel_stream')\r\nAS\r\nINSERT INTO airtel_bill (\r\nSELECT\r\nRELATIVE_PATH AS file_name,\r\nfile_url AS file_url,\r\nsize AS file_size,\r\nlast_modified,\r\nAIRTEL_BILL!PREDICT(GET_PRESIGNED_URL('@airtel_stage', RELATIVE_PATH), 3) AS json_content\r\nFROM airtel_stream\r\nWHERE METADATA$ACTION = 'INSERT'\r\n);\r\n\r\nALTER TASK load_new_file_data RESUME;<\/pre>\n<h4>Step 7: Show the extracted data by parsing the JSON-type column<\/h4>\n<pre>-- Use of LATERAL FLATTEN function to parse the JSON Object\r\nSELECT \r\nfile_name, \r\nfile_url, \r\nfile_size, \r\nlast_modified, \r\nb.value:value::STRING AS bill_date,\r\np.value:value::STRING AS bill_period,\r\ni.value:value::STRING AS invoice_no,\r\nn.value:value::STRING AS name,\r\nt.value:value::STRING AS total_bill,\r\no.value as ocr_score\r\nFROM \r\nairtel_bill,\r\nLATERAL FLATTEN(INPUT =&gt; PARSE_JSON(json_content):bill_date) b,\r\nLATERAL FLATTEN(INPUT =&gt; PARSE_JSON(json_content):bill_period) p,\r\nLATERAL FLATTEN(INPUT =&gt; PARSE_JSON(json_content):invoice_no) i,\r\nLATERAL FLATTEN(INPUT =&gt; PARSE_JSON(json_content):name) n,\r\nLATERAL FLATTEN(INPUT =&gt; PARSE_JSON(json_content):total_bill) t,\r\nLATERAL FLATTEN(INPUT =&gt; PARSE_JSON(json_content):__documentMetadata) o;<\/pre>\n<div id=\"attachment_69932\" style=\"width: 1928px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-69932\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-69932 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded.png\" alt=\"Info Loaded into table\" width=\"1918\" height=\"682\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded.png 1918w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded-300x107.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded-1024x364.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded-768x273.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded-1536x546.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2025\/02\/info-loaded-624x222.png 624w\" sizes=\"(max-width: 1918px) 100vw, 1918px\" \/><p id=\"caption-attachment-69932\" class=\"wp-caption-text\">Info Loaded into the table<\/p><\/div>\n<h2>Conclusion<\/h2>\n<p>Unstructured data has long been a missed opportunity for analytics. However, with the power of AI and ML, <a href=\"https:\/\/www.tothenew.com\/data-analytics\/snowflake-cloud-services\">Snowflake<\/a> Document AI is transforming the way businesses extract and utilize information from PDFs, images, and other unstructured sources. By leveraging OCR, NLP, and ML-based data structuring, Snowflake Document AI ensures that enterprises unlock hidden value, drive efficiency, and enhance decision-making. If your organization deals with large volumes of unstructured data, it\u2019s time to explore how Snowflake Document AI can revolutionize your data strategy.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Fun fact! Around 80%-90% of the world&#8217;s data is unstructured. I was shocked when I read this fact. The unstructured data contains images, emails, PDF files social media posts, and other formats. Even though it is widely present 70% of data is not being used to drive insights and get analytics. As a Data Engineer, [&hellip;]<\/p>\n","protected":false},"author":1694,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":147},"categories":[6194],"tags":[4782,6697,7076,5236,7077],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/69913"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1694"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=69913"}],"version-history":[{"count":9,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/69913\/revisions"}],"predecessor-version":[{"id":70655,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/69913\/revisions\/70655"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=69913"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=69913"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=69913"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}