Introducing Harbour

THE PROACTIVE AI-NATIVE CATALOG FOR MODERN LAKEHOUSES

Data context is scattered across wikis, dashboards, Slack threads, and tribal knowledge. Harbour is the central control plane where all data context lives—so every AI agent, every application, and every team operates from the same source of truth. Connect your data agents and they get rich, consistent context in seconds.

See how it works →

AI-Native

Built for autonomous agents, not just dashboards

Interoperable

Spark, Trino, PyIceberg, Databricks—one catalog

Proactive

Detects issues and recommends fixes before they break

Scalable

Horizontally scales for ingest-heavy and analytical workloads

INTELLIGENCE LAYER

One place for all your data context

Every table creation, schema change, and query pattern automatically enriches a living context graph. No manual lineage work. No external tools. The catalog itself captures relationships, detects patterns, and serves context to every connected agent and application.

PageRank: 0.42

Rec: Add partition on order_date

Agent connected: 3 active

AI Agent

Namespace

Table

Column

PII

Relationship

WHY HARBOUR

Your AI agents are flying blind without centralized context

Every team is building data agents and AI-native applications. But these tools are only as smart as the context they can access. When context is fragmented across dozens of systems, agents make inconsistent, uninformed decisions. Harbour centralizes it all.

Without Harbour

Context is scattered across data catalogs, BI tools, documentation wikis, and Slack threads. Every agent and application builds its own incomplete picture. Nothing is consistent.

With Harbour

One canonical context graph with every relationship, quality signal, and semantic enrichment. Connect an agent and it instantly understands your entire data landscape. One source of truth for all.

Without Harbour

Every data tool maintains its own metadata copy. Lineage in one system, quality metrics in another, documentation in a third. Nothing stays in sync. Agents get stale, contradictory context.

With Harbour

Single source of truth. Lineage, quality, PII classification, importance scores, and recommendations—all in one place, always consistent, always current. Every connected tool sees the same picture.

Without Harbour

PII hides in column names like "col_7" or "user_info". Compliance requires months of manual data classification. Agents have no idea which data is sensitive.

With Harbour

Automatic PII detection classifies columns as email, phone, SSN, or credit card. Domain tagging organizes tables by business context. Agents know what's sensitive before they touch it.

Without Harbour

Connecting Spark, Trino, and Databricks to the same catalog requires custom glue code, format converters, and constant maintenance. Each engine sees a different version of the truth.

With Harbour

Full Iceberg REST Catalog spec compliance. Any Iceberg-compatible engine connects natively. One catalog, every engine, zero glue code. Context flows consistently to all.

Without Harbour

AI agents hit opaque APIs, get cryptic errors, and have no context about data relationships, quality, or importance. They're guessing. Badly.

With Harbour

Rich semantic endpoints give agents full context: table importance scores, related tables, data quality signals, and actionable recommendations. Connect in seconds, not weeks.

ARCHITECTURE

Harbour is a layered control plane that separates intelligence from storage and security from compute. Every layer is independently scalable. Context flows from the graph to every connected agent and engine through rich semantic APIs.

Full Iceberg REST Catalog spec with 45+ endpoints across namespaces, tables, views, and commits
In-memory context graph with PageRank, persisted to PostgreSQL for durability across restarts
Semantic REST endpoints that expose context, relationships, and recommendations to any connected agent
Credential vending with scoped temporary credentials for S3, GCS, and ADLS

Traditional REST Catalog

SPARK / TRINO / PYICEBERG

↓

REST CATALOGReturns table location only

↓

NO INTELLIGENCE

No context, no recommendations, no PII detection

↓

S3 / GCS / ADLS

e6 Harbour

AI AGENTS / SPARK / TRINO / DATABRICKS

↓

CONTEXT CONTROL PLANEMetadata + context + scoped credentials in one call

↓

CONTEXT GRAPH + AI ENGINEPageRank, recommendations, PII, lineage, quality

↓

S3 / GCS / ADLS / MinIO

RECOMMENDATION ENGINE

Proactive recommendations, not reactive firefighting

The recommendation engine continuously analyzes metadata patterns across your entire catalog. It detects issues, surfaces optimization opportunities, and provides actionable guidance to every connected agent and team—before anything breaks.

SNAPSHOT MANAGEMENT

Automatic lifecycle optimization

Detects tables with runaway snapshot counts, recommends expiry policies, and suggests metadata compaction. Catches write-heavy patterns before they become production incidents.

SCHEMA & ACCESS INTELLIGENCE

Workload-aware optimization

Identifies unpartitioned tables under heavy scan load, detects missing sort orders, spots idle tables, and recommends join optimizations based on context graph patterns.

HIGH PRIORITY

SNAPSHOT_EXPIRY

Detected > 100 snapshots

Table has accumulated excessive snapshots indicating missing lifecycle policies. Configure expiry to prevent metadata bloat and query planning slowdowns.

SCAN_OPTIMIZATION

High scan-to-query ratio

Queries are scanning significantly more data than necessary. Add partition pruning or sort order on frequently filtered columns to reduce I/O.

AI INSIGHTS

PARTITION_SUGGESTION

Based on context graph

Filter columns detected in repeated query patterns. Partitioning on these columns would reduce scan volume across all connected engines and agents.

JOIN_OPTIMIZATION

Frequent join patterns

Context graph detected frequent join patterns between tables. Co-locating or pre-computing these joins could significantly reduce latency for connected applications.

INTEROPERABLE

One catalog. Every engine. Consistent context.

Connect any Iceberg-compatible engine and it gets the same rich context. No custom integrations. No format converters. Just plug in and go.

Query Engine

Spark

Full read/write support

v3.5Schema EvolutionTime Travel

Query Engine

Trino

Cross-engine reads & JOINs

v459Partition TransformsSchema Evolution

Python SDK

PyIceberg

Native Python data access

v0.11FiltersProjections

Unified Platform

Databricks

Native REST catalog connectivity

Runtime 14+Unity Compatible

REST Spec

45+

API Endpoints

NamespacesTablesViewsCommits

Performance

<5ms

Average API latency

p95 < 14msCaffeine Cache

Explore the REST API → View engine compatibility →

USE CASES

One control plane for every data workflow

Whether you're building ETL pipelines, deploying AI agents, or running ad-hoc analytics—Harbour is where all your data context lives. Connect any tool and it gets the full picture.

Connect AI Agents

Give autonomous agents rich context about your entire data landscape. Semantic endpoints expose importance, relationships, PII classifications, and quality signals. Agents connect in seconds and make informed decisions instead of guessing.

Multi-Engine Analytics

Run Spark for ETL, Trino for ad-hoc queries, PyIceberg for notebooks, and Databricks for ML—all pointing to one catalog. Schema evolution and time travel work consistently across every engine.

Centralize Data Context

Stop scattering context across tools. Lineage, quality, PII, documentation—it all lives in Harbour. Every team, every tool, every agent reads from the same source of truth. Context stays consistent.

Data Governance

Automatic PII detection, domain classification, RBAC with namespace-level grants, and a complete audit trail. Meet GDPR, CCPA, and SOC 2 requirements without bolting on external tools.

Platform Engineering

Deploy on Kubernetes with health probes, Prometheus metrics, and structured logging. Maintenance policies automate snapshot expiry and compaction. Self-service for data teams, full control for platform teams.

Real-Time Lakehouse

Credential vending with scoped temporary credentials for every table operation. Optimistic concurrency control. Sub-millisecond metadata cache for high-throughput streaming workloads.

DEVELOPER EXPERIENCE

Deploy in minutes. Connect everything.

Harbour runs on Kubernetes, connects to any Iceberg engine, and ships with observability built in. Your AI agents connect through semantic REST endpoints. No agents to install, no sidecars to manage, no vendor lock-in.

Runs with your data stack

Plugs into your existing infrastructure. No migration required.

View the API reference →

Storage

S3, GCS, ADLS, MinIO. Credential vending for scoped access.

Compute

Spark, Trino, PyIceberg, Databricks. Any Iceberg engine.

Database

PostgreSQL with Flyway migrations. H2 for development.

Orchestration

Kubernetes-native. Docker Compose for local development.

AI Agents

Semantic REST APIs. Connect any agent in seconds.

Context Graph with PageRank

Every table, column, and relationship mapped in a knowledge graph. PageRank scores surface the most important tables automatically. Connected agents query the graph directly.

Credential Vending

Temporary, scoped cloud credentials issued with every table load. No long-lived keys, no ambient permissions. S3, GCS, and ADLS supported out of the box.

Enterprise Security

OAuth2, API keys, RBAC with namespace-level grants, multi-tenancy, and a complete audit trail. Three security modes: disabled, permissive, enforced.

Resilience Built In

Circuit breaker, rate limiting, idempotency keys, and optimistic concurrency control with pessimistic fallback. Production-grade from day one.

Maintenance Policies

Snapshot expiry, metadata compaction, orphan file cleanup. Hierarchical resolution from table to namespace to catalog. Scheduled execution with dry-run mode.

Horizontally Scalable

Deploy multiple replicas behind a load balancer for ingest-heavy and analytical workloads. Stateless architecture with shared PostgreSQL means you scale out by adding pods. Built for Kubernetes autoscaling.