Reading External Data in SQL
Hyper provides three options to refer to external data in SQL:
External data can be copied into a Hyper table with the COPY FROM SQL command.
External data can be read directly in a SQL query using the set returning function external. In this case, no Hyper table is involved, so such a query can even be used if no database is attached to the current session.
External data can be exposed as if it was a table using the CREATE EXTERNAL TABLE SQL command. It can then subsequently be queried using the name of the external table. Again, no Hyper table is involved; querying an external table will instead result in the data being read from the external source directly.
Usage examples of the individual alternatives can be found in the documentations of the respective statements. The choice between these ways to access external data depends on the use case.
COPY
vs. INSERT
+ external
COPY
will create a copy of the data in a Hyper table. It is
functionally equivalent to an INSERT command that
reads from the external
function. For example, the following two
statements have the same effect, assuming that 'products.parquet'
has the same columns as the existing table products
:
COPY products FROM 'products.parquet';
INSERT INTO products (SELECT * FROM external('products.parquet'))
The two statements differ in that the COPY
statement will
automatically select only the columns that are in the target table,
while the insert statement needs to name the columns to insert
explicitly, in case the external file has more columns. Also, for
formats that do not carry schema information (such as CSV), COPY
will
assume that the file has the schema of the table, while external
will
require the schema to be given explicitly.
In general, using INSERT
with external
is more flexible, as it
allows to transform and filter the data before it is inserted into the
table. For example, the following query will only insert products with a
price greater than 100 and it will transform the product name to upper
case:
INSERT INTO products (
SELECT upper(name), price
FROM external('products.parquet')
WHERE price > 100)
external
can also be used in CREATE TABLE AS
statements to concisely
create and fill a new Hyper table. For example, this will create and
fill a products table:
CREATE TABLE products AS (SELECT * FROM external('products.parquet'))
Creating External Tables vs. Using Ad-hoc Queries
The set returning function external
and the
CREATE TEMPORARY EXTERNAL TABLE
statement can both be used to query
external data without inserting it into a Hyper table.
Using external
has the advantage that it enables to write a fully
self-contained query that does not rely on external tables being set up
before. CREATE TEMPORARY EXTERNAL TABLE
on the other hand enables
reading external data as if it was a Hyper table. It therefore enables
SQL queries that can either operate on a Hyper table or on an external
table without changing their syntax. Also, these queries contains less
visual noise.
CREATE TEMPORARY EXTERNAL TABLE
will infer the schema of the source on
creation. If the source file then gets replaced by a file with a
different schema, subsequent queries will fail. In contrast, external
will re-infer the schema whenever the query containing it is executed.
Creating an external table is also a hint to Hyper that the external data might be accessed multiple times. Therefore, Hyper might cache more statistics or schema information for the data, so subsequent queries might be faster. This however is by no means guaranteed.