COMMAND NAME: gpload

Runs a load job as defined in a YAML formatted control file.


*****************************************************
SYNOPSIS
*****************************************************

gpload -f <control_file> [-l <log_file>] [-h <hostname>] [-p <port>]
[-U <username>] [-d <database>] [-W] [--gpfdist_timeout <seconds>] 
[--no_auto_trans] [[-v | -V] [-q]] [-D]

gpload -? 

gpload --version


*****************************************************
PREREQUISITES
*****************************************************

The client machine where gpload is executed must have the following: 

* Python 2.6.2 or later, pygresql (the Python interface to PostgreSQL), 
  and pyyaml. Note that Python and the required Python libraries are 
  included with the Apache Cloudberry server installation, so if you have 
  Apache Cloudberry installed on the machine where gpload is running, you 
  do not need a separate Python installation. 

  Note: Cloudberry Loaders for Windows supports only Python 2.5 
  (available from www.python.org). 

* The gpfdist parallel file distribution program installed and in your 
  $PATH. This program is located in $GPHOME/bin of your Apache Cloudberry 
  server installation. 

* Network access to and from all hosts in your Apache Cloudberry array 
 (coordinator and segments). 

* Network access to and from the hosts where the data to be loaded 
  resides (ETL servers). 

  
*****************************************************
DESCRIPTION
*****************************************************

gpload is a data loading utility that acts as an interface to Cloudberry 
Databases external table parallel loading feature. Using a load 
specification defined in a YAML formatted control file, gpload executes 
a load by invoking the Cloudberry parallel file server (gpfdist), 
creating an external table definition based on the source data defined, 
and executing an INSERT, UPDATE or MERGE operation to load the source 
data into the target table in the database. 

The operation, including any SQL commands specified in the SQL 
collection of the YAML control file, are performed as a single 
transaction to prevent inconsistent data when performing multiple, 
simultaneous load operations on a target table. 

 
*****************************************************
OPTIONS
*****************************************************

-f <control_file> 

 Required. A YAML file that contains the load specification details. See 
 following section "Control File Format". 

 
--gpfdist_timeout <seconds> 

 Sets the timeout for the gpfdist parallel file distribution program to 
 send a response. Enter a value from 0 to 30 seconds (entering "0" to 
 disables timeouts). Note that you might need to increase this value when 
 operating on high-traffic networks. 

-l <log_file>

 Specifies where to write the log file. Defaults to 
 ~/gpAdminLogs/gpload_YYYYMMDD. See Also: LOG FILE FORMAT section. 


--no_auto_trans 

 Specify --no_auto_trans to disable processing the load operation as a 
 single transaction if you are performing a single load operation on the 
 target table. 

 By default, gpload processes each load operation as a single transaction 
 to prevent inconsistent data when performing multiple, simultaneous 
 operations on a target table. 

 
-q (no screen output) 

 Run in quiet mode. Command output is not displayed on the screen, but is 
 still written to the log file. 

 
-D (debug mode) 

 Check for error conditions, but do not execute the load. 

 
-v (verbose mode) 

 Show verbose output of the load steps as they are executed. 


-V (very verbose mode) 

 Shows very verbose output. 

 
-? (show help) 

 Show help, then exit. 

 
--version 

Show the version of this utility, then exit. 

*********************
CONNECTION OPTIONS
*********************

-d <database> 

 The database to load into. If not specified, reads from the load control 
 file, the environment variable $PGDATABASE or defaults to the current 
 system user name. 


-h <hostname>

 Specifies the host name of the machine on which the Cloudberry coordinator 
 database server is running. If not specified, reads from the load 
 control file, the environment variable $PGHOST or defaults to localhost. 


-p <port> 

 Specifies the TCP port on which the Cloudberry coordinator database server is 
 listening for connections. If not specified, reads from the load control 
 file, the environment variable $PGPORT or defaults to 5432. 


-U <username> 

 The database role name to connect as. If not specified, reads from the 
 load control file, the environment variable $PGUSER or defaults to the 
 current system user name. 


-W (force password prompt) 

 Force a password prompt. If not specified, reads the password from the 
 environment variable $PGPASSWORD or from a password file specified by 
 $PGPASSFILE or in ~/.pgpass. If these are not set, then gpload will 
 prompt for a password even if -W is not supplied. 


*****************************************************
CONTROL FILE FORMAT
*****************************************************

The gpload control file uses the YAML 1.1 document format and then 
implements its own schema for defining the various steps of a Cloudberry 
Database load operation. The control file must be a valid YAML document. 

The gpload program processes the control file document in order and uses 
indentation (spaces) to determine the document hierarchy and the 
relationships of the sections to one another. The use of white space is 
significant. White space should not be used simply for formatting 
purposes, and tabs should not be used at all. 

The basic structure of a load control file is: 

---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <coordinator_hostname>
PORT: <coordinator_port>
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - <hostname_or_ip>
         PORT: <http_port>
       | PORT_RANGE: [<start_port_range>, <end_port_range>]
         FILE: 
           - </path/to/input_file>
         SSL: true | false
         CERTIFICATES_PATH: </path/to/certificates>
    - COLUMNS:
           - <field_name>: <data_type>
    - TRANSFORM: '<transformation>' 
    - TRANSFORM_CONFIG: '<configuration-file-path>' 
    - MAX_LINE_LENGTH: <integer> 
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NULL_AS: '<null_string>'
    - FORCE_NOT_NULL: true | false
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - ERROR_TABLE: <schema>.<table_name>
    - LOG_ERRORS: true | false
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    
   EXTERNAL:
    - SCHEMA: <schema> | '%'
   OUTPUT:
    - TABLE: <schema>.<table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
           - <target_column_name>
    - UPDATE_COLUMNS:
           - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
            <target_column_name>: <source_column_name> | '<expression>'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"


*****************************************************
CONTROL FILE SCHEMA ELEMENT DESCRIPTIONS
*****************************************************


VERSION - Optional. The version of the gpload control file 
          schema. The current version is 1.0.0.1. 

DATABASE - Optional. Specifies which database in Cloudberry to 
           connect to. If not specified, defaults to $PGDATABASE 
           if set or the current system user name. You can also 
           specify the database on the command line using the -d option. 

USER - Optional. Specifies which database role to use to connect. 
       If not specified, defaults to the current user or $PGUSER if set. 
       You can also specify the database role on the command line using 
       the -U option. 

HOST - Optional. Specifies Cloudberry coordinator host name. If not specified, 
       defaults to localhost or $PGHOST if set. You can also specify the 
       coordinator host name on the command line using the -h option.

PORT - Optional. Specifies Cloudberry coordinator port. If not specified, defaults
       to 5432 or $PGPORT if set. You can also specify the coordinator port on 
       the command line using the -p option.

GPLOAD - Required. Begins the load specification section. A GPLOAD specification
         must have an INPUT and an OUTPUT section defined.

INPUT - Required. Defines the location and the format of the input data to 
        be loaded. gpload will start one or more instances of the gpfdist file
        distribution program on the current host and create the required external
        table definition(s) in Apache Cloudberry that point to the source data. 
        Note that the host from which you run gpload must be accessible over the 
        network by all Cloudberry hosts (coordinator and segments).

SOURCE - Required. The SOURCE block of an INPUT specification defines the 
         location of a source file. An INPUT section can have more than one 
         SOURCE block defined. Each SOURCE block defined corresponds to one 
         instance of the gpfdist file distribution program that will be started 
         on the local machine. Each SOURCE block defined must have a FILE 
         specification.
         
         For more information about using the gpfdist parallel file server 
         and single and multiple gpfdist instances, see the "Apache Cloudberry 
         Database Administrator Guide."

LOCAL_HOSTNAME - Optional. Specifies the host name or IP address of the local 
                 machine on which gpload is running. If this machine is configured
                 with multiple network interface cards (NICs), you can specify the 
                 host name or IP of each individual NIC to allow network traffic 
                 to use all NICs simultaneously. The default is to use the local 
                 machines primary host name or IP only. 

PORT - Optional. Specifies the specific port number that the gpfdist file 
       distribution program should use. You can also supply a PORT_RANGE to 
       select an available port from the specified range. If both PORT and 
       PORT_RANGE are defined, then PORT takes precedence. If neither PORT or 
       PORT_RANGE are defined, the default is to select an available port between 
       8000 and 9000. 
       
       If multiple host names are declared in LOCAL_HOSTNAME, this port number
       is used for all hosts. This configuration is desired if you want to use 
       all NICs to load the same file or set of files in a given directory location.

PORT_RANGE - Optional. Can be used instead of PORT to supply a range of port 
             numbers from which gpload can choose an available port for this 
             instance of the gpfdist file distribution program.

FILE - Required. Specifies the location of a file, named pipe, or directory location 
       on the local file system that contains data to be loaded. You can declare 
       more than one file so long as the data is of the same format in all files 
       specified.

       If the files are compressed using gzip or bzip2 (have a .gz or .bz2 file 
       extension), the files will be uncompressed automatically (provided that 
       gunzip or bunzip2 is in your path).

       When specifying which source files to load, you can use the wildcard 
       character (*) or other C-style pattern matching to denote multiple files. The 
       files specified are assumed to be relative to the current directory from which 
       gpload is executed (or you can declare an absolute path).

SSL - Optional. Specifies usage of SSL encryption. If SSL is set to true, gpload 
      starts the gpfdist server with the --ssl option and uses the gpfdists 
      protocol.

CERTIFICATES_PATH - Required when SSL is true; cannot be specified when SSL is 
                    false or unspecified. The location specified in 
                    CERTIFICATES_PATH must contain the following files:
                    *	The server certificate file, server.crt
                    *	The server private key file, server.key
                    *	The trusted certificate authorities, root.crt

                    The root directory (/) cannot be specified as 
                    CERTIFICATES_PATH.

COLUMNS - Optional. Specifies the schema of the source data file(s) in the 
          format of  <field_name>: <data_type>. The DELIMITER character 
          in the source file is what separates two data value fields (columns). 
          A row is determined by a line feed character (0x0a).
          
          If the input COLUMNS are not specified, then the schema of the output
          TABLE is implied, meaning that the source data must have the same 
          column order, number of columns, and data format as the target table.
          
          The default source-to-target mapping is based on a match of column names
          as defined in this section and the column names in the target TABLE. 
          This default mapping can be overridden using the MAPPING section.

TRANSFORM - Optional. Specifies the name of the input XML transformation passed to 
            gpload. For more information about XML transformations, see the 
            "Apache Cloudberry Database Administrator Guide."

TRANSFORM_CONFIG - Optional. Specifies the location of the XML transformation 
                   configuration file that is specified in the TRANSFORM 
                   parameter, above.

MAX_LINE_LENGTH - Optional. An integer that specifies the maximum length of 
                  a line in the XML transformation data passed to gpload.

FORMAT - Optional. Specifies the format of the source data file(s) - either plain
         text (TEXT) or comma separated values (CSV) format. Defaults to TEXT 
         if not specified. For more information about the format of the source
         data, see the "Apache Cloudberry Database Administrator Guide."

DELIMITER - Optional. Specifies a single ASCII character that separates columns
            within each row (line) of data. The default is a tab character in TEXT
            mode, a comma in CSV mode. You can also specify a non-printable ASCII
            character or a non-printable unicode character, for example: "\x1B" or
            "\u001B". The escape string syntax, E'<character-code>', is also
            supported for non-printable characters. The ASCII or unicode character
            must be enclosed in single quotes. For example: E'\x1B' or E'\u001B'. 

NEWLINE - Optional. Specifies the newline used in your data files – LF
          (Line feed, 0x0A), CR (Carriage return, 0x0D), or CRLF
          (Carriage return plus line feed, 0x0D 0x0A). If not
          specified, a Apache Cloudberry segment will detect the
          newline type by looking at the first row of data it receives
          and using the first newline type encountered.

ESCAPE - Specifies the single character that is used for C escape sequences
        (such as \n,\t,\100, and so on) and for escaping data characters 
        that might otherwise be taken as row or column delimiters. Make sure 
        to choose an escape character that is not used anywhere in your actual
        column data. The default escape character is a \ (backslash) for 
        text-formatted files and a " (double quote) for csv-formatted files, 
        however it is possible to specify another character to represent an 
        escape. It is also possible to disable escaping in text-formatted 
        files by specifying the value 'OFF' as the escape value. This is very
        useful for data such as text-formatted web log data that has many 
        embedded backslashes that are not intended to be escapes. 

NULL_AS - Optional. Specifies the string that represents a null value. 
          The default is \N (backslash-N) in TEXT mode, and an empty value 
          with no quotations in CSV mode. You might prefer an empty string 
          even in TEXT mode for cases where you do not want to distinguish 
          nulls from empty strings. Any source data item that matches this 
          string will be considered a null value. 

FORCE_NOT_NULL - Optional. In CSV mode, processes each specified column as 
                 though it were quoted and hence not a NULL value. For the 
                 default null string in CSV mode (nothing between two 
                 delimiters), this causes missing values to be evaluated as 
                 zero-length strings. 

QUOTE - Required when FORMAT is CSV. Specifies the quotation character for 
        CSV mode. The default is double-quote ("). 

HEADER - Optional. Specifies that the first line in the data file(s) is a 
         header row (contains the names of the columns) and should not be 
         included as data to be loaded. If using multiple data source files, 
         all files must have a header row. The default is to assume that 
         the input files do not have a header row. 

ENCODING - Optional. Character set encoding of the source data. Specify 
           a string constant (such as 'SQL_ASCII'), an integer encoding 
           number, or 'DEFAULT' to use the default client encoding. If 
           not specified, the default client encoding is used. For 
           information about supported character sets, see the 
           "Apache Cloudberry Reference Guide."

ERROR_LIMIT - Optional. Enables single row error isolation mode for this
              load operation. When enabled, input rows that have format errors
              will be discarded provided that the error limit count is not 
              reached on any Cloudberry segment instance during input processing. 
              If the error limit is not reached, all good rows will be loaded 
              and any error rows will either be discarded or logged to the table 
              specified in ERROR_TABLE. The default is to abort the load operation
              on the first error encountered. Note that single row error isolation
              only applies to data rows with format errors; for example, extra
              or missing attributes, attributes of a wrong data type, or invalid
              client encoding sequences. Constraint errors, such as primary key 
              violations, will still cause the load operation to abort if 
              encountered. For information about handling load errors, see 
              the "Apache Cloudberry Database Administrator Guide." 

ERROR_TABLE - Deprecated, LOG_ERRORS is encouraged instead.
              Optional when ERROR_LIMIT is declared. Specifies an error
              table where rows with formatting errors will be logged when
              running in single row error isolation mode. You can then examine
              this error table to see error rows that were not loaded (if any).
              If the error_table specified already exists, it will be used. 
              If it does not exist, it will be automatically generated.
              For more information about handling load errors, see the
              "Apache Cloudberry Database Administrator Guide."

LOG_ERRORS  - Optional when ERROR_LIMIT is declared. If true(default false),
              gpload would create an internal error table where rows with
              formatting errors will be logged when running in single row error
              isolation mode. You can then examine this error table by using
              GPDB built-in function gp_read_error_log() to see error rows that
              were not loaded (if any). For more information about handling load
              errors, see the "Apache Cloudberry Database Administrator Guide."
              NOTE: LOG_ERRORS is not allowed to use together with ERROR_TABLE.

EXTERNAL - Optional. Defines the schema of the external table database 
           objects created by gpload. The default is to use the Cloudberry 
           Database search_path. 

SCHEMA - Required when EXTERNAL is declared. The name of the schema of 
         the external table. If the schema does not exist, an error is returned. 

         If % (percent character) is specified, the schema of the table name 
         specified by TABLE in the OUTPUT section is used. If the table name 
         does not specify a schema, the default schema is used.

OUTPUT - Required. Defines the target table and final data column values 
         that are to be loaded into the database. 

TABLE - Required. The name of the target table to load into. 

MODE - Optional. Defaults to INSERT if not specified. There are 
       three available load modes: 

       INSERT - Loads data into the target table using the following 
       method: INSERT INTO target_table SELECT * FROM input_data; 

       UPDATE - Updates the UPDATE_COLUMNS of the target table 
       where the rows have MATCH_COLUMNS attribute values equal 
       to those of the input data, and the optional UPDATE_CONDITION is true. 

       MERGE - Inserts new rows and updates the UPDATE_COLUMNS of 
       existing rows where MATCH_COLUMNS attribute values are equal 
       to those of the input data, and the optional UPDATE_CONDITION 
       is true. New rows are identified when the MATCH_COLUMNS value 
       in the source data does not have a corresponding value in the 
       existing data of the target table. In those cases, the entire 
       row from the source file is inserted, not only the MATCH and 
       UPDATE columns. If there are multiple new MATCH_COLUMNS values 
       that are the same, only one new row for that value will be inserted. 
       Use UPDATE_CONDITION to filter out the rows to discard. 

MATCH_COLUMNS - Required if MODE is UPDATE or MERGE. Specifies the column(s)
                to use as the join condition for the update. The attribute 
                value in the specified target column(s) must be equal to that
                of the corresponding source data column(s) in order for the
                row to be updated in the target table. 

UPDATE_COLUMNS - Required if MODE is UPDATE or MERGE. Specifies the 
                 column(s) to update for the rows that meet the 
                 MATCH_COLUMNS criteria and the optional UPDATE_CONDITION. 

UPDATE_CONDITION - Optional. Specifies a Boolean condition (similar to 
                   what you would declare in a WHERE clause) that must 
                   be met in order for a row in the target table to be 
                   updated (or inserted in the case of a MERGE).

MAPPING - Optional. If a mapping is specified, it overrides the default 
          source-to-target column mapping. The default source-to-target 
          mapping is based on a match of column names as defined in the 
          source COLUMNS section and the column names of the target TABLE.
          A mapping is specified as either:
            <target_column_name>: <source_column_name>
          or
            <target_column_name>: '<expression>'
            
          Where expression is any expression that you would specify in the 
          SELECT list of a query, such as a constant value, a column reference,
          an operator invocation, a function call, and so on.

PRELOAD - Optional. Specifies operations to run prior to the load operation. 
          Right now the only preload operation is TRUNCATE.

TRUNCATE - Optional. If set to true, gpload will remove all rows in 
           the target table prior to loading it.

REUSE_TABLES - Optional. If set to true, gpload will not drop the external table 
               objects and staging table objects it creates. These objects will 
               be reused for future load operations that use the same load 
               specifications. This improves performance of trickle loads 
               (ongoing small loads to the same target table).

SQL - Optional. Defines SQL commands to run before and/or after the load 
      operation. You can specify multiple BEFORE and/or AFTER commands. List 
      commands in the order of desired execution.

BEFORE - Optional. An SQL command to run before the load operation starts. 
         Enclose commands in quotes.

AFTER - Optional. An SQL command to run after the load operation completes. 
        Enclose commands in quotes.


*****************************************************
NOTES
*****************************************************
If your database object names were created using a double-quoted 
identifier (delimited identifier), you must specify the delimited name 
within single quotes in the gpload control file. For example, if you 
create a table as follows: 

     CREATE TABLE "MyTable" ("MyColumn" text);

Your YAML-formatted gpload control file would refer to the above 
table and column names as follows:

        - COLUMNS:
          - '"MyColumn"': text
    
    OUTPUT:
        - TABLE: public.'"MyTable"'



*****************************************************
LOG FILE FORMAT
*****************************************************

Log files output by gpload have the following format:

<timestamp>|<level>|<message>

Where <timestamp> takes the form: YYYY-MM-DD HH:MM:SS, 
<level> is one of DEBUG, LOG, INFO, ERROR, 
and <message> is a normal text message.

Some INFO messages that may be of interest 
in the log files are (where # corresponds 
to the actual number of seconds, units of data, 
or failed rows):

INFO|running time: #.## seconds
INFO|transferred #.# kB of #.# kB.
INFO|gpload succeeded
INFO|gpload succeeded with warnings
INFO|gpload failed
INFO|1 bad row
INFO|# bad rows


*****************************************************
EXAMPLES
*****************************************************

Run a load job as defined in my_load.yml:

  gpload -f my_load.yml

Example load control file:

---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: cdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE: 
           - /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - descr: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - LOG_ERRORS: True
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
   - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
   - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"


*****************************************************
SEE ALSO
*****************************************************

gpfdist, CREATE EXTERNAL TABLE 

See the "Apache Cloudberry Reference Guide" for information 
about CREATE EXTERNAL TABLE.
