Canadian Government Data RAG Roadmap

This document outlines the implementation strategy for indexing all Canadian government data (federal, provincial, municipal) for LLM Retrieval-Augmented Generation (RAG). It builds upon the findings in [[Canadian Government Data Indexing]].

1. Open-Source Tools & Libraries

CKAN API Harvesting

Most Canadian open data portals run on CKAN. The optimal tools for harvesting are:

  • Python: ckanapi (CLI and Python library) is the standard for scripting and interacting with the CKAN Action API. For scheduled, large-scale ETL pipelines, ckan-harvesters (a modern standalone approach) or the classic ckanext-harvest framework should be used.
  • JavaScript: ckan-client-js (by Datopian) is a comprehensive SDK for interacting with CKAN instances via JS frontend or Node.
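
As a rough illustration of the Action API that these clients wrap, the sketch below builds request URLs and unwraps the standard CKAN response envelope using only the Python standard library. The helper names (action_url, unwrap, call_action) and the example.org host are illustrative assumptions and are not part of ckanapi.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def action_url(base_url: str, action: str, **params: str) -> str:
    """Build a CKAN Action API (v3) URL such as .../api/3/action/package_show?id=x."""
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    return f"{url}?{urlencode(params)}" if params else url


def unwrap(payload: dict):
    """Return the 'result' field of a CKAN response, or raise if success is false."""
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action failed: {payload.get('error')}")
    return payload["result"]


def call_action(base_url: str, action: str, **params: str):
    """Perform a GET request against a live portal (requires network access)."""
    with urlopen(action_url(base_url, action, **params), timeout=30) as resp:
        return unwrap(json.load(resp))


# Hypothetical host, shown only to illustrate the URL shape.
print(action_url("https://example.org", "package_show", id="some-dataset"))
```

In practice ckanapi handles the same envelope and adds error types and authentication, so a hand-rolled client like this is mainly useful for understanding what is sent over the wire.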

PDF-to-Markdown at Scale

Extracting policy data from legacy unstructured PDFs requires robust layout parsing:

  • Unstructured.io: Best for enterprise RAG pipelines. Its “Auto” strategy uses Vision-Language Models (VLMs) and OCR to extract elements (titles, tables, text) and provides deep metadata for chunking. Tables are preserved structurally.
  • Marker: Best for high-fidelity layout preservation (developed by Vik Paruchuri and maintained by Datalab). Highly accurate for scientific/technical documents, table reconstruction, and mathematics. Capable of ~25 pages/sec on an H100.
  • MinerU: Strong open-source alternative for complex layouts and multi-column formats, offering native support for diverse file types (PPTX, XLSX).

Hybrid RAG Integration

To handle massive scale and hierarchical sectorization, a Hybrid RAG architecture (combining semantic search and structured relationships) is required:

  • Orchestration: LlamaIndex is recommended due to its robust Property Graph Index abstraction, which natively blends graph traversal with vector similarity. LangChain (specifically Neo4jGraph via graph QA chains) is a viable alternative for complex agentic workflows.
  • Vector Database: Qdrant or Milvus for massive-scale semantic similarity search of embedded markdown chunks.
  • Knowledge Graph: Neo4j as the core entity storage to enforce the strict sectorization hierarchy (e.g., Policy -> Agency -> Sector).
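
To make the sectorization hierarchy concrete, here is a minimal in-memory sketch of the Policy -> Agency -> Sector chain using plain parent pointers. In the roadmap this structure would live in Neo4j; the node names and the lineage helper below are invented for illustration.

```python
# Each node points to its parent: Policy -> Agency -> Sector.
# Node names are hypothetical examples, not real records.
PARENT = {
    "policy:rental-protection": "agency:housing-ministry",
    "agency:housing-ministry": "sector:housing",
    "sector:housing": None,
}


def lineage(node, parent=PARENT):
    """Walk parent pointers from a node up to the root sector."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


print(" -> ".join(lineage("policy:rental-protection")))
```

Requiring every policy node to resolve to exactly one agency and one sector is what allows retrieval results to be filtered or attributed by their position in the hierarchy.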

The ingestion layer must prioritize API integration. Below are the direct links to the official API documentation for major jurisdictions:

3. Data Volume Estimation

The total volume of Government of Canada web and document assets spans the low hundreds of terabytes for publicly accessible archives, scaling towards the petabyte range when including raw geospatial data and internal records.

  • Web Archival Assets: The Government of Canada Web Archive (GCWA) currently holds over 120 terabytes of data (comprising over 3.1 billion web objects). Federal government data collected since 2005 accounts for more than 55 terabytes. The COVID-19 collection alone is over 20 terabytes.
  • Active Open Data: The federal portal hosts over 47,000 active datasets. While structured CSVs are small, the roughly 10,000 geospatial datasets are massive.
  • Conclusion: To index the active policy and text-based archives, the system must be provisioned to handle 100-300 TB of raw unstructured and structured data.
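
A quick back-of-the-envelope calculation shows why this volume shapes the ingestion design. The sketch assumes decimal terabytes (10^12 bytes) and a sustained 1 Gbit/s link; both figures are illustrative assumptions rather than measurements.

```python
def transfer_days(terabytes: float, gbit_per_s: float = 1.0) -> float:
    """Days needed to move `terabytes` (decimal TB) at a sustained link speed."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbit_per_s * 1e9)
    return seconds / 86_400


for tb in (100, 300):
    print(f"{tb} TB at 1 Gbit/s: about {transfer_days(tb):.1f} days")
```

Under these assumptions the 100-300 TB range corresponds to roughly 9 to 28 days of continuous transfer, before any parsing or embedding work begins.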

4. Multi-Stage Implementation Strategy

Based on the core mandate of sustainability and API-first ingestion, the roadmap is divided into four stages:

Stage 1: MVP (Metadata & Structured Data)

  • Goal: Establish the ingestion pipeline and build the structural Knowledge Graph.
  • Action: Deploy ckanapi scripts to hit the package_list and package_show endpoints across the 8 major portals listed above.
  • Outcome: A localized metadata index and a baseline Neo4j graph mapping the government hierarchy (Federal/Provincial/Municipal -> Departments -> Datasets).
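
The mapping from harvested metadata to the baseline graph can be sketched as follows. The name and organization fields are standard parts of a CKAN package dictionary; the relation labels (HAS_DEPARTMENT, PUBLISHES), the fallback organization name, and the sample records are assumptions made for this example.

```python
def package_to_edges(jurisdiction: str, package: dict) -> list:
    """Turn one package_show result into (parent, relation, child) graph edges."""
    org = (package.get("organization") or {}).get("name", "unknown-organization")
    return [
        (jurisdiction, "HAS_DEPARTMENT", org),
        (org, "PUBLISHES", package["name"]),
    ]


def build_graph(jurisdiction: str, packages: list) -> set:
    """Collect deduplicated edges for all packages harvested from one portal."""
    edges = set()
    for package in packages:
        edges.update(package_to_edges(jurisdiction, package))
    return edges


# Hypothetical sample records shaped like package_show results.
sample = [
    {"name": "ds-1", "organization": {"name": "dept-a"}},
    {"name": "ds-2", "organization": {"name": "dept-a"}},
]
for edge in sorted(build_graph("Federal", sample)):
    print(edge)
```

Keeping the edges as plain tuples makes the output easy to load into Neo4j later, for instance as one MERGE statement per edge.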

Stage 2: Unstructured Data Pipeline (PDF/HTML to Markdown)

  • Goal: Extract semantic meaning from raw documents.
  • Action: Implement an ETL pipeline using Unstructured.io (or Marker for complex layouts). Download unstructured assets (PDFs, legacy HTML) referenced in the CKAN metadata, convert them to Semantic Markdown, and split them into localized chunks preserving structural metadata (headers, tables).
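
The chunking step can be illustrated with a small header-aware splitter. This is a simplified stand-in for the chunkers that Unstructured.io or LlamaIndex provide, not their actual API; the function name chunk_markdown and the output dictionary shape are assumptions.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str) -> list:
    """Split Markdown at headings; each chunk records its full heading path."""
    chunks, path, body = [], [], []

    def flush():
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Act\nIntro text.\n## Section 1\nBody one.\n## Section 2\n| a | b |\n|---|---|\n"
for chunk in chunk_markdown(doc):
    print(chunk["headers"], "->", repr(chunk["text"]))
```

Because tables stay inside the chunk of their enclosing section, each chunk carries both its content and the heading trail needed to place it in context. A production splitter would also need to skip lines starting with # inside fenced code blocks, which this sketch does not handle.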

Stage 3: Hybrid RAG Core (Vector + Graph)

  • Goal: Enable high-performance, hierarchical LLM querying.
  • Action:
    1. Embed the Markdown chunks using a suitable embedding model and ingest them into Qdrant.
    2. Use LlamaIndex to orchestrate a Hybrid Retriever: map the vector chunks to their corresponding nodes in the Neo4j Knowledge Graph.
    3. Expose an API endpoint through which the LLM can answer queries such as “What are the latest BC housing policies?” using both vector similarity (to find relevant content) and graph traversal (to ensure the source is the BC Ministry of Housing).
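
The retrieval logic in these steps can be sketched without any external services. The toy example below stands in for Qdrant (a list of vectors) and Neo4j (a chunk-to-source lookup); the function names, the two-dimensional vectors, and the source identifiers are all hypothetical, and none of this reflects the LlamaIndex API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, chunks, source_of, allowed_source, top_k=3):
    """Rank chunks by similarity, keeping only those whose graph source matches."""
    candidates = [c for c in chunks if source_of.get(c["id"]) == allowed_source]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["vector"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_k]]


# Toy data: c1 is the closest match but comes from a different source.
chunks = [
    {"id": "c1", "vector": [1.0, 0.0]},
    {"id": "c2", "vector": [0.9, 0.1]},
    {"id": "c3", "vector": [0.0, 1.0]},
]
source_of = {"c1": "on-housing", "c2": "bc-housing", "c3": "bc-housing"}
print(hybrid_retrieve([1.0, 0.0], chunks, source_of, "bc-housing"))
```

The graph constraint is applied before ranking, so a highly similar chunk from the wrong jurisdiction never reaches the LLM. A real deployment would more likely push this filter into the vector database as a payload filter than apply it in application code.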

Stage 4: Full Scale & Sustainable Automation

  • Goal: Keep the massive index up-to-date without violating government rate limits.
  • Action: Pivot from batch ingestion to incremental updates. Use the CKAN recently_changed_packages_activity_list action (or package_search sorted by metadata_modified) together with standard sitemap.xml monitoring to detect changes. Track document hashes so that unchanged files in the 100+ TB data lake are never reprocessed.
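
Hash tracking for incremental updates can be sketched as below, assuming SHA-256 over the raw document bytes and an in-memory dictionary as the hash store. The function names and the example URL are hypothetical; a production system would persist the store in a database.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable SHA-256 fingerprint of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_processing(url: str, data: bytes, seen: dict) -> bool:
    """Return True and record the hash if the document is new or has changed."""
    digest = content_hash(data)
    if seen.get(url) == digest:
        return False
    seen[url] = digest
    return True


seen = {}
url = "https://example.org/policy.pdf"  # hypothetical document URL
print(needs_processing(url, b"version 1", seen))  # first sight
print(needs_processing(url, b"version 1", seen))  # unchanged
print(needs_processing(url, b"version 2", seen))  # content changed
```

Keying the store by URL detects changes to a known document. Also indexing by digest would catch the same file published under several URLs.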

Gardener’s Summary

For ML development targeting regulatory tech or public policy automation, this roadmap addresses the core bottleneck of government data: scale and structure. By leveraging Hybrid RAG (Vector + Graph) and favoring native CKAN APIs over brute-force scraping, you build a sustainable system. The semantic sectorization provides the “moat”: an LLM that understands not just what a policy says, but exactly where it sits within the Canadian bureaucratic hierarchy.

Related concepts: [[Retrieval-Augmented Generation (RAG)]], [[Vector Databases]], [[Canadian Government Data Indexing]].