`ai_extract` function

Applies to: Databricks SQL, Databricks Runtime

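Important

This feature is in Public Preview and is HIPAA compliant.

During the preview:

The underlying language model can handle several languages, but this AI function is tuned for English.
See Features with limited regional availability for AI Functions region availability.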

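The ai_extract() function extracts structured data from text and documents according to a schema you provide. You can use simple field names for basic extraction, or define complex schemas with nested objects, arrays, type validation, and field descriptions for business documents such as invoices, contracts, and financial filings.

The function accepts text or VARIANT output from other AI functions, such as ai_parse_document, which enables composable workflows for end-to-end document processing.

For a visual UI to validate and iterate on ai_extract results, see Information Extraction.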

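Requirements

Apache 2.0 license

The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses.

Databricks recommends reviewing these licenses to ensure compliance with any applicable terms. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the model (and the list of applicable licenses provided on this page).

The model that powers this function is made available using Model Serving Foundation Model APIs. See Applicable model terms for information about which models are available on Databricks and the licenses and policies that govern their use.

If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the models and update the documentation.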

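This function is available only in some regions; see AI function availability.
This function is not available on Azure Databricks SQL Classic.
See the Databricks SQL pricing page.
In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that run as a task in a Databricks workflow.
Batch inference workloads require Databricks Runtime 15.4 ML LTS for improved performance.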

Syntax

Databricks recommends using version 2 of this function because it supports nested field extraction and descriptions.

Version 2.1 (recommended)

ai_extract(
    content VARIANT | STRING,
    schema STRING,
    [options MAP<STRING, STRING>]
) RETURNS VARIANT

Version 2

ai_extract(
    content VARIANT | STRING,
    schema STRING,
    [options MAP<STRING, STRING>]
) RETURNS VARIANT

Version 1

ai_extract(
    content STRING,
    labels ARRAY<STRING>,
    [options MAP<STRING, STRING>]
) RETURNS STRUCT

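Arguments

Version 2.1 (recommended)

content: A VARIANT or STRING expression. Accepts either:
- Raw text as a STRING
- A VARIANT produced by another AI function (such as ai_parse_document)
schema: A STRING literal that defines the JSON schema for extraction. The schema can be:
- Simple schema: a JSON array of field names (assumed to be strings)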
```
"[\"vendor_name\", \"invoice_id\", \"total_amount\"]"
```
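- Advanced schema: a JSON object with type information, descriptions, and nested structures
  - Supports string, integer, number, boolean, and enum types. Type validation is performed, and invalid values result in an error. Maximum of 500 enum values.
  - Supports nested objects using "type": "object" with "properties"
  - Supports arrays of primitives or objects using "type": "array" with "items"
  - Optional "description" field for each property to guide extraction quality
options: An optional MAP<STRING, STRING> containing configuration options:
- version: Version switch to support migration ("2.1", "2.0", "1.0"). The default is based on the input type.
- instructions: Global instructions describing the task and domain, used to improve extraction quality. Must be less than 20,000 characters.
- enableCitations: When true, the output for each field in the extraction schema includes a list of zero or more citations that indicate where in the document the output was extracted from.
- enableConfidenceScores: When true, the output for each field in the extraction schema includes a confidence score between 0 and 1 that indicates the model's confidence in that value. The appropriate confidence threshold depends on the specific use case; choose a cutoff that matches your tolerance for risk and error.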
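
Because schema and instructions are plain strings, they can be assembled programmatically before being passed to the function. The following Python sketch builds an advanced schema as a JSON string and checks the two limits stated in this list (500 enum values, instructions under 20,000 characters). The helper names `enum_field`, `build_schema`, and `check_instructions` are hypothetical and are not part of any Databricks API; the enum shape (`"type": "enum"` with `"labels"`) is copied from the examples on this page.

```python
import json

MAX_ENUM_VALUES = 500            # documented limit on enum values
MAX_INSTRUCTION_CHARS = 20_000   # instructions must be shorter than this


def enum_field(labels, description=None):
    """Build an enum property in the shape used by the examples on this page."""
    if len(labels) > MAX_ENUM_VALUES:
        raise ValueError(f"enum has {len(labels)} values, limit is {MAX_ENUM_VALUES}")
    field = {"type": "enum", "labels": list(labels)}
    if description:
        field["description"] = description
    return field


def build_schema(properties):
    """Serialize a dict of property definitions to the STRING form of schema."""
    return json.dumps(properties)


def check_instructions(text):
    """Return the text unchanged if it is under the documented length limit."""
    if len(text) >= MAX_INSTRUCTION_CHARS:
        raise ValueError("instructions must be less than 20,000 characters")
    return text


schema = build_schema({
    "invoice_id": {"type": "string", "description": "Unique invoice identifier"},
    "total_amount": {"type": "number"},
    "currency": enum_field(["USD", "EUR", "GBP"], "Currency code"),
})
instructions = check_instructions("These are vendor invoices.")
print(schema)
```

Checking limits on the client side only catches obvious mistakes early; the function itself remains the authority on what it accepts.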

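Version 2

content: A VARIANT or STRING expression. Accepts either:
- Raw text as a STRING
- A VARIANT produced by another AI function (such as ai_parse_document)
schema: A STRING literal that defines the JSON schema for extraction. The schema can be:
- Simple schema: a JSON array of field names (assumed to be strings)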
```
"[\"vendor_name\", \"invoice_id\", \"total_amount\"]"
```
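- Advanced schema: a JSON object with type information, descriptions, and nested structures
  - Supports string, integer, number, boolean, and enum types. Type validation is performed, and invalid values result in an error. Maximum of 500 enum values.
  - Supports nested objects using "type": "object" with "properties"
  - Supports arrays of primitives or objects using "type": "array" with "items"
  - Optional "description" field for each property to guide extraction quality
options: An optional MAP<STRING, STRING> containing configuration options:
- version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is based on the input type, but falls back to "1.0".
- instructions: Global instructions describing the task and domain, used to improve extraction quality. Must be less than 20,000 characters.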

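Version 1

content: A STRING expression containing the raw text.
labels: An ARRAY<STRING> literal. Each element is a type of entity to extract.
options: An optional MAP<STRING, STRING> containing configuration options:
- version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is based on the input type, but falls back to "1.0".

Returns

Version 2.1 (recommended)

Returns a VARIANT containing: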

{
  "response": {...},       // Extracted data matching the provided schema. Each leaf is returned as a Field object (see below).
  "error_message": null,   // null on success, or error message on failure
  "metadata": { ... }      // Metadata about the response, including unfurled citation ids.
}

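The response field contains the structured data extracted according to the schema:

Field names and types match the schema definition.
The structure of the schema is preserved in the response: nested objects and arrays keep their original shape. Each "scalar" field in the extraction schema has an output object with the following fields:
- value: The extracted value, typed according to the schema. Null if the field cannot be extracted.
- citation_ids: Present only when enableCitations is true. An array of IDs that index into metadata.citations.
- confidence_score: Present only when enableConfidenceScores is true. A float between 0 and 1.
Type validation is enforced for integer, number, boolean, and enum types.
If content is NULL, the result is NULL.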
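
Because every scalar leaf is wrapped in an object, downstream code often wants the bare values back. The following Python sketch shows one way to do that, assuming the result has already been loaded into Python dicts and lists and that no schema property is itself named "value" (which would make a nested object look like a leaf). `is_field_object` and `plain_values` are hypothetical helper names.

```python
def is_field_object(node):
    """Assumption: a leaf Field object is a dict that carries a "value" key."""
    return isinstance(node, dict) and "value" in node


def plain_values(node):
    """Recursively replace each Field object with its bare value."""
    if is_field_object(node):
        return node["value"]
    if isinstance(node, dict):
        return {key: plain_values(child) for key, child in node.items()}
    if isinstance(node, list):
        return [plain_values(child) for child in node]
    return node


response = {
    "invoice_header": {"invoice_id": {"value": "12345"}},
    "line_items": [
        {"description": {"value": "Widget A"}, "quantity": {"value": 10}},
    ],
    "payment_terms": {"value": None},
}
flat = plain_values(response)
print(flat)
```

The flattened form drops citation_ids and confidence_score, so keep the original response if those are needed later.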

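The metadata field contains metadata about the response. When enableCitations is true, the metadata field contains details for each citation ID referenced in the response field, which trace the extracted values back to locations in the input.

Depending on the input type, citations take one of two forms:

For raw text (STRING) input, citations are text spans in the original input. Each object in metadata.citations contains:
- id: An integer that matches an entry in a field's citation_ids.
- start: The 0-based character offset in the input string.
- stop: The 0-based, exclusive character offset in the input string.
For PDF documents and images (when ai_extract is used downstream of ai_parse_document), citations are bounding boxes in the original input. Each metadata.citations object contains:
- id: An integer that matches an entry in a field's citation_ids.
- bbox: An array of {coord, page_id} objects, with the same shape as element.bbox in the ai_parse_document output. coord holds the pixel coordinates on the page image as [x0, y0, x1, y1]; page_id is the 0-based page index.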
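
For STRING input, a span citation can be turned back into text by slicing the input from start up to, but not including, stop. The Python sketch below uses the sample values from the span-citation example on this page. `cited_text` is a hypothetical helper, and it assumes that Python string indices line up with the function's character offsets; that holds for this ASCII sample but is not confirmed here for other text.

```python
def cited_text(source, field, citations):
    """Return the substrings of source that back one extracted field."""
    by_id = {citation["id"]: citation for citation in citations}
    return [
        source[by_id[cid]["start"]:by_id[cid]["stop"]]  # stop is exclusive
        for cid in field.get("citation_ids", [])
    ]


source = "Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15"
citations = [
    {"id": 0, "start": 0, "stop": 29},
    {"id": 1, "start": 29, "stop": 60},
]
vendor = {"citation_ids": [0], "value": "Acme Corp"}
print(cited_text(source, vendor, citations))  # ['Invoice #12345 from Acme Corp']
```

A field without citation_ids yields an empty list, which matches the "zero or more citations" wording above.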

Version 2

Returns a VARIANT containing:

{
  "response": { ... },   // Extracted data matching the provided schema
  "error_message": null          // null on success, or error message on failure
}

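The response field contains the structured data extracted according to the schema:

Field names and types match the schema definition.
Nested objects and arrays are preserved in the structure.
Fields may be null if not found.
Type validation is enforced for integer, number, boolean, and enum types.

If content is NULL, the result is NULL.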
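
Client code that consumes these results usually needs to separate the three cases described above: a NULL result, a failure reported in error_message, and a successful response. A minimal Python sketch, assuming the VARIANT has already been serialized to JSON text before it reaches this code; `ExtractionError` and `response_or_raise` are hypothetical names.

```python
import json


class ExtractionError(RuntimeError):
    """Raised when the result carries a non-null error_message."""


def response_or_raise(raw):
    """Parse one ai_extract result (JSON text) and return its response part.

    Returns None when raw is None, mirroring the NULL-in, NULL-out rule.
    """
    if raw is None:
        return None
    result = json.loads(raw)
    if result.get("error_message") is not None:
        raise ExtractionError(result["error_message"])
    return result["response"]


ok = '{"response": {"invoice_id": "12345"}, "error_message": null}'
print(response_or_raise(ok))
```

Raising on error_message keeps failed rows from being mistaken for rows where every field was simply not found.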

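Version 1

Returns a STRUCT where each field corresponds to an entity type specified in labels. Each field contains a string representing the extracted entity. If the function finds more than one candidate for any entity type, it returns only one.

Examples

Version 2.1 (recommended)

Simple schema - field names only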

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '["invoice_id", "vendor_name", "total_amount", "invoice_date"]',
    options => map('version', '2.1')
  );
 {
   "response": {
     "invoice_id":   {"value": "12345"},
     "vendor_name":  {"value": "Acme Corp"},
     "total_amount": {"value": "1250.00"},
     "invoice_date": {"value": "2024-01-15"}
   },
   "error_message": null
 }

Advanced schema - with types and descriptions

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '{
      "invoice_id": {"type": "string", "description": "Unique invoice identifier"},
      "vendor_name": {"type": "string", "description": "Legal business name"},
      "total_amount": {"type": "number", "description": "Total invoice amount"},
      "invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
    }',
    options => map('version', '2.1')
  );
 {
   "response": {
     "invoice_id":   {"value": "12345"},
     "vendor_name":  {"value": "Acme Corp"},
     "total_amount": {"value": 1250.00},
     "invoice_date": {"value": "2024-01-15"}
   },
   "error_message": null
 }

Nested objects and arrays

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp
     Line 1: Widget A, qty 10, $50.00 each
     Line 2: Widget B, qty 5, $100.00 each
     Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
    '{
      "invoice_header": {
        "type": "object",
        "properties": {
          "invoice_id": {"type": "string"},
          "vendor_name": {"type": "string"}
        }
      },
      "line_items": {
        "type": "array",
        "description": "List of invoiced products",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer"},
            "unit_price": {"type": "number"}
          }
        }
      },
      "totals": {
        "type": "object",
        "properties": {
          "subtotal": {"type": "number"},
          "tax_amount": {"type": "number"},
          "total_amount": {"type": "number"}
        }
      }
    }',
    options => map('version', '2.1')
  );
 {
   "response": {
     "invoice_header": {
       "invoice_id":  {"value": "12345"},
       "vendor_name": {"value": "Acme Corp"}
     },
     "line_items": [
       {"description": {"value": "Widget A"}, "quantity": {"value": 10}, "unit_price": {"value": 50.00}},
       {"description": {"value": "Widget B"}, "quantity": {"value": 5},  "unit_price": {"value": 100.00}}
     ],
     "totals": {
       "subtotal":     {"value": 1000.00},
       "tax_amount":   {"value": 80.00},
       "total_amount": {"value": 1080.00}
     }
   },
   "error_message": null
 }
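
Typed numeric fields make it possible to sanity-check an extraction arithmetically. The Python sketch below recomputes the subtotal from the line items in the output above and compares it with the extracted totals. This is an illustrative downstream check, not something ai_extract performs; `check_invoice_math` and the 0.01 tolerance are assumptions.

```python
import math


def check_invoice_math(response, tolerance=0.01):
    """Compare line-item arithmetic with the extracted totals."""
    computed = sum(
        item["quantity"]["value"] * item["unit_price"]["value"]
        for item in response["line_items"]
    )
    totals = response["totals"]
    subtotal = totals["subtotal"]["value"]
    tax = totals["tax_amount"]["value"]
    total = totals["total_amount"]["value"]
    return {
        "subtotal_matches": math.isclose(computed, subtotal, abs_tol=tolerance),
        "total_matches": math.isclose(subtotal + tax, total, abs_tol=tolerance),
    }


response = {
    "line_items": [
        {"quantity": {"value": 10}, "unit_price": {"value": 50.00}},
        {"quantity": {"value": 5}, "unit_price": {"value": 100.00}},
    ],
    "totals": {
        "subtotal": {"value": 1000.00},
        "tax_amount": {"value": 80.00},
        "total_amount": {"value": 1080.00},
    },
}
print(check_invoice_math(response))
```

Here 10 × 50.00 + 5 × 100.00 = 1000.00 and 1000.00 + 80.00 = 1080.00, so both checks pass for the sample output.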

Composability with `ai_parse_document`

> WITH parsed_docs AS (
    SELECT
      path,
      ai_parse_document(
        content,
        MAP('version', '2.0')
      ) AS parsed_content
    FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
  )
  SELECT
    path,
    ai_extract(
      parsed_content,
      '["invoice_id", "vendor_name", "total_amount"]',
      MAP('version', '2.1', 'instructions', 'These are vendor invoices.')
    ) AS invoice_data
  FROM parsed_docs;

Using enums

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
    '{
      "invoice_id": {"type": "string"},
      "vendor_name": {"type": "string"},
      "total_amount": {"type": "number"},
      "currency": {
        "type": "enum",
        "labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
        "description": "Currency code"
      },
      "payment_terms": {"type": "string"}
    }',
    options => map('version', '2.1')
  );
 {
   "response": {
     "invoice_id":     {"value": "12345"},
     "vendor_name":    {"value": "Acme Corp"},
     "total_amount":   {"value": 1250.00},
     "currency":       {"value": "USD"},
     "payment_terms":  {"value": null}
   },
   "error_message": null
 }

Citations (STRING input, span citations)

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '{
      "invoice_id": {"type": "string", "description": "Unique invoice identifier"},
      "vendor_name": {"type": "string", "description": "Legal business name"},
      "total_amount": {"type": "number", "description": "Total invoice amount"},
      "invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
    }',
   options => map(
     'version', '2.1',
     'enableCitations', 'true'
   )
  );
 {
   "response": {
     "invoice_id": {"citation_ids": [0], "value": "12345"},
     "vendor_name": {"citation_ids": [0], "value": "Acme Corp"},
     "total_amount": {"citation_ids": [1], "value": 1250.00},
     "invoice_date": {"citation_ids": [1], "value": "2024-01-15"}
   },
   "metadata": {
     "chunk_type": "span",
     "citations": [
       {"id": 0, "start": 0, "stop": 29},
       {"id": 1, "start": 29, "stop": 60}
     ]
   },
   "error_message": null
 }

Citations (VARIANT from ai_parse_document, bbox citations)

> WITH parsed AS (
    SELECT ai_parse_document(
             content,
             map('imageOutputPath', '/Volumes/main/default/parsed_images/')  -- necessary for rendering bboxes
           ) AS doc
    FROM READ_FILES('/Volumes/main/default/invoices/invoice.pdf', format => 'binaryFile')
  )
  SELECT ai_extract(
    doc,
    '{"invoice_id":{"type":"string"}, "total_amount":{"type":"number"}}',
    options => map('version','2.1','enableCitations','true')
  ) AS extracted
  FROM parsed;
{
  "response": {
    "invoice_id":   {"citation_ids": [0], "value": "12345"},
    "total_amount": {"citation_ids": [1], "value": 1250.00}
  },
  "metadata": {
    "chunk_type": "bbox",
    "citations": [
      {"id": 0, "bbox": [{"coord": [120, 80,  240, 110], "page_id": 0}]},
      {"id": 1, "bbox": [{"coord": [400, 500, 560, 530], "page_id": 0}]}
    ],
    "pages": [{"id": 0, "image_uri": "/Volumes/main/default/parsed_images/6077ca79...f8efdb2ed05.jpg"}]
  },
  "error_message": null
}
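
To draw or inspect a bbox citation, each [x0, y0, x1, y1] coordinate can be converted into a page number plus an origin, width, and height. The Python sketch below does this for the sample metadata above. `cited_boxes` is a hypothetical helper, and the x/y/width/height output shape is an assumption chosen for illustration.

```python
def cited_boxes(field, citations):
    """Return one rectangle per bounding box that backs an extracted field."""
    by_id = {citation["id"]: citation for citation in citations}
    boxes = []
    for cid in field.get("citation_ids", []):
        for box in by_id[cid]["bbox"]:
            x0, y0, x1, y1 = box["coord"]  # pixel coordinates on the page image
            boxes.append({
                "page_id": box["page_id"],  # 0-based page index
                "x": x0,
                "y": y0,
                "width": x1 - x0,
                "height": y1 - y0,
            })
    return boxes


citations = [
    {"id": 0, "bbox": [{"coord": [120, 80, 240, 110], "page_id": 0}]},
    {"id": 1, "bbox": [{"coord": [400, 500, 560, 530], "page_id": 0}]},
]
invoice_id = {"citation_ids": [0], "value": "12345"}
print(cited_boxes(invoice_id, citations))
```

For the invoice_id field this yields a single rectangle on page 0 that is 120 pixels wide and 30 pixels high.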

Confidence scores


> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '{
      "invoice_id": {"type": "string", "description": "Unique invoice identifier"},
      "vendor_name": {"type": "string", "description": "Legal business name"},
      "total_amount": {"type": "number", "description": "Total invoice amount"},
      "invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
    }',
   options => map(
    'version', '2.1',
    'enableConfidenceScores', 'true'
   )
  );
{
  "response": {
    "invoice_id": {"confidence_score": 0.95, "value": "12345"},
    "vendor_name": {"confidence_score": 0.62, "value": "Acme Corp"},
    "total_amount": {"confidence_score": 1.0, "value": 1250.00},
    "invoice_date": {"confidence_score": 0.99, "value": "2024-01-15"}
  },
  "error_message": null
}
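
Confidence scores are mainly useful for routing uncertain values to a reviewer. The Python sketch below walks a response and lists every field whose score falls below a cutoff. `low_confidence` is a hypothetical helper, the dotted path format is an assumption, and the 0.9 cutoff is an arbitrary illustration rather than a recommended value.

```python
def low_confidence(node, threshold, path=""):
    """Yield (path, score) for every Field object scoring below threshold."""
    if isinstance(node, dict) and "confidence_score" in node:
        if node["confidence_score"] < threshold:
            yield path, node["confidence_score"]
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from low_confidence(child, threshold, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from low_confidence(child, threshold, f"{path}[{index}]")


response = {
    "invoice_id": {"confidence_score": 0.95, "value": "12345"},
    "vendor_name": {"confidence_score": 0.62, "value": "Acme Corp"},
    "total_amount": {"confidence_score": 1.0, "value": 1250.00},
    "invoice_date": {"confidence_score": 0.99, "value": "2024-01-15"},
}
print(list(low_confidence(response, 0.9)))  # [('vendor_name', 0.62)]
```

With the sample scores above, only vendor_name (0.62) falls below 0.9 and would be flagged.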

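Notebook example

The following notebook provides a visual debugging interface for analyzing the citation output of the ai_extract function. It demonstrates how to render citation metadata as substring snippets (STRING input) or bounding-box overlays (VARIANT input), and how to join ai_extract citations back to ai_parse_document elements in SQL so that low-confidence extractions can be flagged for manual review.

Citation rendering notebook

Get notebook

Version 2

Simple schema - field names only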

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '["invoice_id", "vendor_name", "total_amount", "invoice_date"]'
  );
 {
   "response": {
     "invoice_id": "12345",
     "vendor_name": "Acme Corp",
     "total_amount": "1250.00",
     "invoice_date": "2024-01-15"
   },
   "error_message": null
 }

Advanced schema - with types and descriptions

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '{
      "invoice_id": {"type": "string", "description": "Unique invoice identifier"},
      "vendor_name": {"type": "string", "description": "Legal business name"},
      "total_amount": {"type": "number", "description": "Total invoice amount"},
      "invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
    }'
  );
 {
   "response": {
     "invoice_id": "12345",
     "vendor_name": "Acme Corp",
     "total_amount": 1250.00,
     "invoice_date": "2024-01-15"
   },
   "error_message": null
 }

Nested objects and arrays

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp
     Line 1: Widget A, qty 10, $50.00 each
     Line 2: Widget B, qty 5, $100.00 each
     Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
    '{
      "invoice_header": {
        "type": "object",
        "properties": {
          "invoice_id": {"type": "string"},
          "vendor_name": {"type": "string"}
        }
      },
      "line_items": {
        "type": "array",
        "description": "List of invoiced products",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer"},
            "unit_price": {"type": "number"}
          }
        }
      },
      "totals": {
        "type": "object",
        "properties": {
          "subtotal": {"type": "number"},
          "tax_amount": {"type": "number"},
          "total_amount": {"type": "number"}
        }
      }
    }'
  );
 {
   "response": {
     "invoice_header": {
       "invoice_id": "12345",
       "vendor_name": "Acme Corp"
     },
     "line_items": [
       {"description": "Widget A", "quantity": 10, "unit_price": 50.00},
       {"description": "Widget B", "quantity": 5, "unit_price": 100.00}
     ],
     "totals": {
       "subtotal": 1000.00,
       "tax_amount": 80.00,
       "total_amount": 1080.00
     }
   },
   "error_message": null
 }

Composability with ai_parse_document

> WITH parsed_docs AS (
    SELECT
      path,
      ai_parse_document(
        content,
        MAP('version', '2.0')
      ) AS parsed_content
    FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
  )
  SELECT
    path,
    ai_extract(
      parsed_content,
      '["invoice_id", "vendor_name", "total_amount"]',
      MAP('instructions', 'These are vendor invoices.')
    ) AS invoice_data
  FROM parsed_docs;

Using enums

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
    '{
      "invoice_id": {"type": "string"},
      "vendor_name": {"type": "string"},
      "total_amount": {"type": "number"},
      "currency": {
        "type": "enum",
        "labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
        "description": "Currency code"
      },
      "payment_terms": {"type": "string"}
    }'
  );
 {
   "response": {
     "invoice_id": "12345",
     "vendor_name": "Acme Corp",
     "total_amount": 1250.00,
     "currency": "USD",
     "payment_terms": null
   },
   "error_message": null
 }

Version 1

> SELECT ai_extract(
    'John Doe lives in New York and works for Acme Corp.',
    array('person', 'location', 'organization')
  );
 {"person": "John Doe", "location": "New York", "organization": "Acme Corp."}

> SELECT ai_extract(
    'Send an email to jane.doe@example.com about the meeting at 10am.',
    array('email', 'time')
  );
 {"email": "jane.doe@example.com", "time": "10am"}

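Limitations

Version 2.1 (recommended)

This function is not available on Azure Databricks SQL Classic.
This function cannot be used with views.
The schema supports a maximum of 128 fields.
Field names can contain up to 150 characters.
The schema supports up to seven levels of nesting for nested fields.
Enum fields support up to 500 values.
Type validation is enforced for integer, number, boolean, and enum types. If a value does not match the specified type, the function returns an error.
The maximum context size is 128,000 tokens.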
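
The schema limits above can be checked before a query is submitted. The Python sketch below is one possible pre-flight check. `schema_problems` is a hypothetical helper, and how the service counts fields and nesting levels is not specified on this page: the sketch assumes every named property at every level counts as a field and that top-level properties sit at level 1.

```python
import json

MAX_FIELDS = 128
MAX_NAME_CHARS = 150
MAX_DEPTH = 7
MAX_ENUM_VALUES = 500


def schema_problems(schema_text):
    """Return a list of limit violations found in a schema (empty if none)."""
    schema = json.loads(schema_text)
    if isinstance(schema, list):  # simple schema: a JSON array of field names
        schema = {name: {"type": "string"} for name in schema}
    problems = []
    count = 0

    def walk(properties, depth):
        nonlocal count
        if depth > MAX_DEPTH:
            problems.append(f"nesting depth {depth} exceeds {MAX_DEPTH}")
            return
        for name, spec in properties.items():
            count += 1
            if len(name) > MAX_NAME_CHARS:
                problems.append(f"field name longer than {MAX_NAME_CHARS} characters: {name[:20]}...")
            if spec.get("type") == "enum" and len(spec.get("labels", [])) > MAX_ENUM_VALUES:
                problems.append(f"enum {name} has more than {MAX_ENUM_VALUES} values")
            # Descend into nested objects and into arrays of objects.
            nested = spec.get("items", spec) if spec.get("type") == "array" else spec
            if isinstance(nested, dict) and "properties" in nested:
                walk(nested["properties"], depth + 1)

    walk(schema, 1)
    if count > MAX_FIELDS:
        problems.append(f"{count} fields exceed the limit of {MAX_FIELDS}")
    return problems


print(schema_problems('["invoice_id", "vendor_name", "total_amount"]'))  # []
```

The token limit on context size is not covered, because counting tokens depends on the model's tokenizer.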

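Version 2

This function is not available on Azure Databricks SQL Classic.
This function cannot be used with views.
The schema supports a maximum of 128 fields.
Field names can contain up to 150 characters.
The schema supports up to seven levels of nesting for nested fields.
Enum fields support up to 500 values.
Type validation is enforced for integer, number, boolean, and enum types. If a value does not match the specified type, the function returns an error.
The maximum context size is 128,000 tokens.

Version 1

This function is not available on Azure Databricks SQL Classic.
This function cannot be used with views.
If multiple candidates for an entity type are found in the content, only one value is returned.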

Last updated on 2026-05-03
