Operations reference

Running, monitoring, and troubleshooting an installed /data/iserver/ deployment. Assumes you've completed INSTALL.md. Everything in this doc is run as the unprivileged user that owns the install (maestro in our convention) after sourcing ~/.env_integrator.

Bare-metal vs container. Every command below is the bare-metal form, run from inside /data/iserver/. The canonical 192.168.1.201 deployment runs the same install layout inside the eltmaestro docker container — wrap each command with docker exec eltmaestro <cmd> to run it there. The paths, env vars, and mstart-* / mstop-* admin tools are identical inside and outside the container. See BACKUP.md for the host-vs-container map.

Start / stop / restart

mstart-* and mstop-* are the canonical admin commands. They're on PATH via ~/.env_integrator and live in /data/iserver/server/webserver/admin/.

| Command | Effect |
| --- | --- |
| mstart-all | Start Postgres, meta-service, engine, Spark service, and any auxiliary services in the right order. |
| mstop-all | Reverse order stop. |
| mrestart-all | Stop + start everything. |
| mstart-domain / mstop-domain / mrestart-domain | Just the Spring Boot meta-service (the SOAP "domain"). Use when iterating on the meta-service jar without bouncing Postgres. |
| mstart-database / mstop-database | Just Postgres. |
| mstart-spark-service / mstop-spark-service | Optional long-running Spark driver (ELTM_SPARK_DRIVER) that hosts org.maestro.meta.service.StartMetaService via spark-submit --master local[2]. Used for streaming / online-query paths. The standalone Spark master + worker are started separately by mstart-all. |
| mstart-gpfdist | Greenplum gpfdist external-table reader (optional). |
| mdeploy-service | Re-deploy the meta-service from server/engine/MaestroMetaDataProviderService.jar. |

PID files for each service live under /data/iserver/transient/socket/.
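
A minimal sketch for spotting stale PID files in that directory. It assumes each regular file holds a single numeric PID on its first line; the file names and contents are not documented here, so treat that as an assumption. check_pid_files is a hypothetical helper, not one of the shipped admin tools.

```shell
# Report, for every regular file in a PID directory, whether the recorded
# process is still alive. ASSUMPTION: the first line of each file is a PID.
check_pid_files() {
    dir=${1:-/data/iserver/transient/socket}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip sockets, directories, empty glob
        pid=$(head -n 1 "$f" | tr -d '[:space:]')
        case $pid in
            ''|*[!0-9]*)
                echo "$(basename "$f"): not a PID file"
                continue
                ;;
        esac
        # kill -0 sends no signal; it only tests that the process exists
        # and that the current user may signal it.
        if kill -0 "$pid" 2>/dev/null; then
            echo "$(basename "$f"): running (pid $pid)"
        else
            echo "$(basename "$f"): stale (pid $pid)"
        fi
    done
}

# Usage: check_pid_files            (defaults to /data/iserver/transient/socket)
```

A "stale" result can also mean the process belongs to another user, since kill -0 fails when signalling is not permitted.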

Watching the stack

show-mprocess              # list running BATCH WORKFLOWS (java -jar sqlm.jar --batch ...).
                           # Format: PID|<workflow>|<started>|<duration>. Does NOT list
                           # the meta-service, Postgres, Spark, or other long-running
                           # services -- use the ps/ss snippets below for those.
kill-mprocess <pid>        # kill a workflow by its show-mprocess PID (exact match).
show-webserver-log         # tail -500 of the meta-service log
                           # ($MAESTRO_CONSOLE_LOG_DIR/meta-service.log).
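
Since show-mprocess prints pipe-delimited rows, the PID for a named workflow can be pulled out with awk. A small sketch, assuming the PID|<workflow>|<started>|<duration> format described above holds for every row; mprocess_pids is a hypothetical helper name.

```shell
# Print the PID(s) of workflows whose name matches exactly.
# Reads show-mprocess output (PID|<workflow>|<started>|<duration>) on stdin.
mprocess_pids() {
    awk -F'|' -v wf="$1" '$2 == wf { print $1 }'
}

# Usage:
#   show-mprocess | mprocess_pids <workflow>
#   kill-mprocess "$(show-mprocess | mprocess_pids <workflow>)"   # if exactly one match
```

The exact-match comparison mirrors kill-mprocess, which also matches the PID exactly, so a workflow name that is a prefix of another is not picked up by accident.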

Live logs:

tail -F /data/iserver/data/log/console/*.log    # service stdout/stderr
tail -F /data/iserver/data/log/load/*.log       # data-load worker output
tail -F /data/iserver/server/pgsql/log/*.log    # PostgreSQL server log

Health check (WSDL endpoint should return HTTP 200):

curl -sk -o /dev/null -w "%{http_code}\n" \
    https://localhost:8181/MaestroMetaDataProviderService/MaestroMetaService?wsdl
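
After a restart the endpoint can take a while to answer, so a polling wrapper around the same curl call may be handy. This is a sketch only: wait_for_wsdl is a hypothetical helper, and the retry count and delay defaults are illustrative, not project settings.

```shell
# Poll the WSDL endpoint until it answers HTTP 200 or the attempts run out.
# Arguments (all optional): URL, number of tries, seconds between tries.
wait_for_wsdl() {
    url=${1:-https://localhost:8181/MaestroMetaDataProviderService/MaestroMetaService?wsdl}
    tries=${2:-30}
    delay=${3:-2}
    i=1
    code=
    while [ "$i" -le "$tries" ]; do
        # -k skips certificate verification, as in the one-shot check above;
        # -w prints only the HTTP status code.
        code=$(curl -sk -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) || true
        if [ "$code" = "200" ]; then
            echo "meta-service healthy after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "meta-service not healthy after $tries attempts (last code: ${code:-none})" >&2
    return 1
}

# Usage: mrestart-domain && wait_for_wsdl
```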

Install layout reference

/data/iserver/
├── server/
│   ├── engine/
│   │   ├── MaestroMetaDataProviderService.jar    # Spring Boot fat-jar (SOAP meta-service)
│   │   ├── sqlm.jar                              # engine CLI
│   │   └── lib/                                  # engine deps
│   ├── pgsql/
│   │   ├── data/                                 # PG data dir (PRESERVED across upgrades)
│   │   └── log/                                  # PG server logs
│   ├── pgsql.factory/                            # initdb.sql + metadata + migrations (REPLACED on upgrade)
│   ├── utils/                                    # show-mprocess, kill-mprocess, ADMIN_* SOAP utils
│   ├── webserver/admin/                          # mstart-*, mstop-* (on PATH)
│   └── keystore.jks                              # TLS cert (PRESERVED across upgrades; SAN matches host)
├── library/
│   ├── JDBC/                                     # all bundled JDBC drivers (REPLACED on upgrade)
│   └── jdbc_list.json                            # manifest used by fetch-jdbc-drivers.sh
├── tools/
│   ├── spark/                                    # Spark 4.1.1 (downloaded by install.sh)
│   └── hadoop/                                   # Hadoop 3.4.3 (downloaded by install.sh)
├── data/
│   ├── log/{load,console}/                       # runtime logs
│   └── workflow/                                 # deployed jobs (PRESERVED across upgrades)
├── transient/
│   ├── socket/                                   # PID files
│   └── staging/                                  # transient staging area
├── bin/                                          # install.sh, update-server.sh, lib/ (REPLACED on upgrade)
├── env_integrator                                # template (the live copy is ~/.env_integrator)
└── VERSION                                       # built version + git SHA + build host + timestamp

Container ports of interest (per source/Jenkinsfile's deploy block — host:container):

| Host port | Container port | Service |
| --- | --- | --- |
| 8181 | 8181 | Meta-service HTTPS (WSDL endpoint, Spring Boot + Jetty 12) |
| 9080 | 8080 | Spark UI |
| 5432 | 5432 | PostgreSQL (audit DB + Marquez DB) |
| 7077 | 7077 | Spark master |
| 10000 | 10000 | Hive Thrift (Spark HiveServer2) |
| 3443 | 3443 | Audit-analytics HTTPS |
| 3080 | 3080 | Audit-analytics HTTP |
| 5050 | 5000 | Marquez API (host port remapped because the docker registry owns :5000) |
| 3000 | 3000 | Marquez Web UI |

The meta-service itself only binds 8181 (HTTPS only — there is no HTTP→HTTPS redirect). The other ports are bound by supervisord-managed sidecars (Spark, audit-analytics, Marquez) and by postgres directly.

Common admin tasks

Reset a user's password

mset-password-md5 <USER_ID_UPPERCASE> '<new-password>'
USER_ID in t_user is stored uppercased — INTEGRATOR, not integrator.

Apply pending migrations manually

Normally update-server.sh runs them as part of the upgrade. If you need to apply them mid-life:

mstop-domain
. ~/.env_integrator
psql -h localhost -U maestro -d sqlmaestro -f /data/iserver/server/pgsql.factory/migrations/<file>.sql
mstart-domain
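
When several files need applying, a wrapper that stops at the first failure avoids leaving the schema half-migrated without noticing. A sketch under assumptions: apply_migrations is a hypothetical helper, you pass the pending files explicitly in the order you want them applied, and the connection flags are copied from the command above. ON_ERROR_STOP is a standard psql variable.

```shell
# Apply the given migration files in argument order; stop at the first failure.
apply_migrations() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing migration file: $f" >&2
            return 1
        fi
        echo "applying $f"
        # ON_ERROR_STOP=1 makes psql exit non-zero on the first SQL error
        # instead of carrying on with the rest of the file.
        if ! psql -v ON_ERROR_STOP=1 -h localhost -U maestro -d sqlmaestro -f "$f"; then
            echo "migration failed: $f (later files were not applied)" >&2
            return 1
        fi
    done
}

# Usage (between mstop-domain and mstart-domain):
#   apply_migrations /data/iserver/server/pgsql.factory/migrations/<file>.sql
```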

Re-fetch JDBC drivers (e.g. after editing jdbc_list.json)

/data/iserver/bin/fetch-jdbc-drivers.sh \
    /data/iserver/library/jdbc_list.json \
    /data/iserver/library/JDBC

(Dockerized .201 deploy: library/JDBC/ lives inside the image at /opt/iserver/library/JDBC/ and is populated at image-build time by the Dockerfile's fetch-jdbc-drivers.sh invocation. To pick up new drivers you must rebuild + redeploy the image — hot-swap only covers sqlm.jar and does NOT update this directory.)

Reseed an empty t_mpp_system / t_jdbc_metadata (one-off, older deployments)

Deployments built before commit 3064076d5 (2026-05-27) silently end up with these two seed dictionaries empty because the initdb.sql COPY block for t_mpp_system declared 3 columns but the data carried 2 fields, and platform_license_string was NOT NULL with no default. Symptom: the WPF connection editor's "platform default" dropdown is unpopulated. The audit DB on a fresh post-fix deploy populates both correctly; an EXISTING deploy needs the one-time manual reseed below.

Detect:

docker exec eltmaestro mpsql -c "select 't_mpp_system' as tbl, count(*) from t_mpp_system
                                  union all select 't_jdbc_metadata', count(*) from t_jdbc_metadata;"
If either count is 0, reseed by re-running the canonical COPY blocks from source/root-engine-install/db/initdb.sql (search for ^COPY t_mpp_system and ^COPY t_jdbc_metadata). One-shot recipe:

# Run from the repo root on a host that can reach the audit DB
docker exec -i eltmaestro mpsql -d sqlmaestro <<'SQL'
BEGIN;
TRUNCATE t_mpp_system, t_jdbc_metadata;
\copy t_mpp_system (platform_name, platform_class) FROM PROGRAM 'awk ''/^COPY t_mpp_system/,/^\\\\\\.$/'' source/root-engine-install/db/initdb.sql | sed ''1d;$d'''
\copy t_jdbc_metadata (...) FROM PROGRAM '...'   -- mirror for the t_jdbc_metadata COPY block
COMMIT;
SQL

In practice the 192.168.1.201 reseed on 2026-05-27 was done by SSH'ing to the host, copying the relevant COPY block out of the deployed initdb.sql (inside the container at /data/iserver/server/pgsql.factory/initdb.sql), and \copy-ing it via mpsql. Either approach works; the in-container variant avoids shipping the SQL out and back in.

Edit integrator config from the WPF client

Admin-only path that avoids SSH for routine knob changes (hostnames, ports, $STAGE_COMMAND, env_integrator operator variables):

  1. WPF → Administration → Maestro Server → Integrator Config tab.
  2. Reload — fetches the live ~/.env_integrator plus every file under $MAESTRO_ENGINE_HOME/server/pgsql/metadata/integrators/<vendor>/. Each file appears as a separate row in the list (collapsed by default — just the file path + chevron + a ● modified marker that lights up when you edit).
  3. (Optional) Type into the Filter box above the list to narrow ~357 rows by file-path substring (case-insensitive). Example: typing clickhouse shows just the ~20 clickhouse rows; system.cfg shows all 39 vendors' system.cfg files. Clear the box to see all rows again. Edits hidden by a filter are not lost — Save still picks them up.
  4. Click a row to expand → multi-line Consolas editor for that one file. Edit in place. Files you don't touch are left alone (the server's applyJson treats absent keys as "leave alone").
  5. Show Edits — local-only sanity check; status bar reports the dirty count + first 5 file paths so you can confirm you're about to save what you think.
  6. Save — server writes only the rows you actually edited. Writes to the live tree first (immediate effect for engine batches), then mirrors to factory (pgsql.factory/) so update-server.sh won't revert your edits. Pre-existing files are renamed to <file>.bak.yyyyMMddHHmm (one shared timestamp per Save operation, so find -name '*.bak.<stamp>' lists the whole change set for rollback). A typical single-file edit reports written: 2 (1 live + 1 factory mirror) — way less noise than the older "rewrite everything" flow.

Caveats:

- env_integrator changes do not take effect until the meta-service is restarted (mrestart-domain on bare-metal, docker restart eltmaestro for the container deployment). Integrator *.cfg / SQL templates / launcher.sh are read per-batch and take effect on the next workflow run.
- The Save response summarises {written, skipped, backupSuffix}; skipped entries name the path + reason (factory dir missing, illegal filename, no existing file to overwrite for non-system.cfg names).
- Rollback for a bad save: SSH to the service host and run for f in $(docker exec eltmaestro find /data/iserver -name '*.bak.<stamp>'); do docker exec eltmaestro mv "$f" "${f%.bak.<stamp>}"; done, replacing <stamp> with the suffix the WPF reported. (Bare-metal: drop the docker exec eltmaestro wrapper.)
- The action is audit-logged in meta-service.log as setIntegratorConfigJson (greppable).
- Backup files (<name>.bak.yyyyMMddHHmm) live alongside the originals in each vendor dir but are filtered out of subsequent Reload responses (fix in 2a07573e0) so the JSON tree doesn't bloat over repeated Save cycles. Backups stay on disk until an operator cleans them; docker exec eltmaestro find /data/iserver -name '*.bak.*' -mtime +30 -delete is the rough housekeeping pattern (or skip the docker exec wrapper bare-metal).
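
The rollback caveat above can be sketched for bare-metal as a small function that copes with spaces in paths and reports each file it restores. restore_backups is a hypothetical helper; the only facts it relies on are the <file>.bak.<stamp> naming and the shared timestamp per Save.

```shell
# Move every <file>.bak.<stamp> under a root directory back over <file>.
# $1 = directory to search (e.g. /data/iserver), $2 = backup timestamp suffix.
restore_backups() {
    root=$1
    stamp=$2
    find "$root" -type f -name "*.bak.$stamp" | while IFS= read -r bak; do
        orig=${bak%.bak.$stamp}          # strip the ".bak.<stamp>" suffix
        mv -- "$bak" "$orig" && echo "restored $orig"
    done
}

# Usage: restore_backups /data/iserver <stamp>
```

Because Save writes both the live tree and the factory mirror, searching from /data/iserver restores both copies in one pass. Backups from other Save operations carry a different suffix and are left untouched.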

Connect to ClickHouse without password prompts

clickhouse-client ships in the image (apt lts channel — tracks the same LTS minor as the clickhouse/clickhouse-server:24.8 image the operator runs). To avoid the password prompt on every invocation:

One-time, server-side (host 192.168.1.201, in /home/jenkins/dev-dockers/clickhouse/users.d/maestro_etl.xml):

- Set <password>welcome123</password> (plaintext) for the maestro_etl user.
- Remove any <ssh_keys> or <password_sha256_hex> blocks if present, to keep the file simple.
- ClickHouse hot-reloads users.d/ automatically; no restart is needed.

Bootstrap once on the host (then survives every destructive deploy via a bind-mount in the Jenkinsfile Deploy stage — /root/.clickhouse-client/config.xml inside the container is mounted read-only from /home/jenkins/maestro-ch-client/config.xml on the host):

# On 192.168.1.201, as the jenkins user:
sudo install -d -o jenkins -g jenkins -m 0700 /home/jenkins/maestro-ch-client

cat > /home/jenkins/maestro-ch-client/config.xml <<'XML'
<?xml version="1.0"?>
<config>
  <host>192.168.1.201</host>
  <port>9100</port>
  <user>maestro_etl</user>
  <password>welcome123</password>
</config>
XML

chmod 600 /home/jenkins/maestro-ch-client/config.xml

(Replace welcome123 with the password you actually use. Native TCP on host :9100 — see reference_clickhouse_host.md for why it's remapped from 9000. For TLS use <port>9440</port> + <secure>true</secure>; for HTTP use <port>8123</port>. The Jenkinsfile Deploy stage's destructive block refuses to run if this file is missing — see CLICKHOUSE-AUTH.md § 2.2 for the precondition.)

Verify (single one-liner, no docker exec -it):

docker exec eltmaestro clickhouse-client --query "SELECT version(), currentUser()"
# expected (no prompt):
# 24.8.x.x   maestro_etl

Full design context, common-error mapping, SHA256/SSH-key hardening notes: CLICKHOUSE-AUTH.md.

Graceful shutdown of the whole stack

For maintenance windows, host moves, or planned power events. Brings down all four containers (eltmaestro + clickhouse + sql1 + registry) in shutdown-safe order so PG checkpoints, HDFS flushes its fsimage, Marquez drains pending writes, ClickHouse flushes its WAL, and SQL Server flushes its log — none of which finish inside docker stop's default 10s.

Canonical script: source/root-engine-install/bin/eltm-shutdown.sh. Runs on the docker host, not inside any container.

# Containers only (host stays up). Aborts if any ELTMaestro batches are running:
sudo /path/to/eltm-shutdown.sh

# Force container stop even with active batches (the docker stop --time 60 on
# eltmaestro propagates SIGTERM through supervisord, which forwards SIGINT to
# any running engine JVM — batches get a chance to land mid-step, then die):
sudo /path/to/eltm-shutdown.sh --force

# Containers AND host poweroff. SSH session drops when the host goes down:
sudo /path/to/eltm-shutdown.sh --poweroff

# Print actions without executing (use this first if you've never run it):
/path/to/eltm-shutdown.sh --dry-run

What it does:

| Step | Action | Why |
| --- | --- | --- |
| 1 | docker exec eltmaestro show-mprocess | Detect in-flight batches. Aborts with exit 3 unless --force. |
| 2 | docker stop --time 60 eltmaestro | Supervisord forwards SIGTERM to its 11 subprocesses in reverse-priority order. PG checkpoint (stopsignal=INT per supervisord.conf), HDFS fsimage flush, Marquez API + Web drain, meta-service shut. 60s is empirically enough on the .201 stack. |
| 2 | docker stop --time 15 clickhouse | WAL flush. |
| 2 | docker stop --time 15 sql1 | SQL Server log flush. |
| 2 | docker stop --time 5 registry | Stateless cache; trivially fast. |
| 3 | docker ps | Verify nothing's still running. Exit 4 if anything is. |
| 4 (optional) | sudo systemctl poweroff | Host poweroff, only with --poweroff. Needs NOPASSWD: ALL for the invoking user or interactive sudo. |
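
For orientation, the stop order and timeouts from step 2 can be sketched as a function that only prints the commands. This is not the contents of eltm-shutdown.sh, which remains the canonical script and adds the batch check and verification steps; stop_plan is a hypothetical name.

```shell
# Print the container stops in shutdown order with their timeouts.
stop_plan() {
    # container:timeout pairs, in the order the containers are stopped
    for spec in eltmaestro:60 clickhouse:15 sql1:15 registry:5; do
        name=${spec%%:*}     # text before the colon
        secs=${spec##*:}     # text after the colon
        echo "docker stop --time $secs $name"
    done
}

# Review the plan:   stop_plan
# Execute it:        stop_plan | sh -e     # -e aborts on the first failed stop
```

Printing first and piping to a shell second keeps the review step and the execution step identical, which is the same idea as the script's --dry-run flag.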

From a Windows shell with the SSH helper configured: Invoke-EltmaestroShutdown [-Force] [-Poweroff] [-DryRun] scp's the script + execs it via the same SSH path as Invoke-EltmaestroHotswapViaSsh.

Coming back up after the shutdown: on boot the docker daemon restarts containers whose restart policy is always, but ours all use unless-stopped, so after an explicit docker stop they stay stopped until you run docker start eltmaestro clickhouse sql1 registry (or docker start $(docker ps -aq)). Supervisord brings the eltmaestro internals up in priority order on container start.

What NOT to do:

- docker kill instead of docker stop: SIGKILL leaves no chance for the PG checkpoint, HDFS flush, or Marquez commit, and risks WAL corruption.
- sudo poweroff before stopping containers: the docker daemon gets only ~10s by default before the host force-stops it.
- docker compose down: not applicable. These four containers are docker-run-managed; no compose file orchestrates them.

VM full backup (Proxmox)

The whole 192.168.1.201 VM (eltmaestro + clickhouse + sql1 + registry + jenkins + every named volume) gets backed up via Proxmox vzdump --mode stop — a single command that gracefully shuts the VM down, snapshots the disk, and starts it back up.

One-off (before a risky change):

ssh root@<proxmox-host>
vzdump $VMID --mode stop --storage <storage-id> --compress zstd

Recurring weekly job — paste into /etc/pve/jobs.cfg on the Proxmox host:

vzdump: eltmaestro-vm-weekly
        enabled 1
        schedule sun 02:00
        vmid <VMID>
        storage <storage-id>
        mode stop
        compress zstd
        prune-backups keep-last=4,keep-monthly=2
        mailnotification failure

Downtime is 15–30 min during the run — Jenkins / eltmaestro / ClickHouse / SQL Server all unreachable. Schedule outside business hours.

Full procedure (QEMU Guest Agent setup, backup-storage registration with pvesm add, restore commands, restore-drill recipe, verification one-liners): BACKUP.md.

Jenkins controller daily backup

Between the weekly vzdumps, the Jenkins controller's $JENKINS_HOME is snapshotted off-host every night via rsync with hardlinked rotation — no downtime, incremental, separate rrsync-restricted receiver. Use this for fast restore of config drift, accidentally-deleted jobs, or credential rotation rollback without waiting for the next weekly VM snapshot.

Full procedure (apt-install rsync on both ends, rrsync forced-command setup, --link-dest rotation, BKHOST current symlink + retention cron, restore-in-place + single-job + drill recipes): JENKINS-BACKUP.md.

Inspect what's running

# Inside the container (or bare-metal):
ps -ef | grep -E '[j]ava.*maestro|[p]ostgres'
# Host-side listener check (matches the Jenkinsfile port table):
ss -lntp | grep -E ':(8181|9080|5432|7077|10000|3443|3080|5050|3000)'
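
To see which expected ports are not listening, rather than which are, the ss output can be compared against the port table. A sketch, assuming the usual ss -lnt layout where the local address:port is the fourth column; missing_listeners is a hypothetical helper.

```shell
# Read `ss -lnt` output on stdin and print each expected port with no listener.
missing_listeners() {
    # Take the text after the last ':' in column 4 (handles [::]:5432 too).
    listening=$(awk 'NR > 1 { n = split($4, a, ":"); print a[n] }')
    for port in 8181 9080 5432 7077 10000 3443 3080 5050 3000; do
        # -x matches the whole line, so 81810 does not count as 8181.
        echo "$listening" | grep -qx "$port" || echo "$port"
    done
}

# Usage: ss -lnt | missing_listeners
```

Empty output means every port in the table has a listener. Inside the container the host-remapped ports (9080, 5050) will show up as missing, because the services bind 8080 and 5000 there.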

Troubleshooting

install.sh fails with 'jq' not on PATH

Most likely you are running an ad-hoc publish.sh invocation on the build host, not install.sh on a target host; install.sh itself does not need jq. If the error does come from inside install.sh, it is raised by fetch-jdbc-drivers.sh: install jq (sudo dnf install jq / sudo apt install jq) and re-run.

install.sh preflight: "JDK 21 not found"

The preflight looks under /usr/lib/jvm/. If you installed JDK 21 to a non-standard path, set JAVA_HOME in the environment before running install.sh and the script will respect it.

WSDL returns HTTP 503 / connection refused

Service didn't fully start. Check in order:

  1. Is the meta-service process running? show-mprocess lists only batch workflows, so use the ps snippet under "Inspect what's running" instead.
  2. tail -n 200 /data/iserver/data/log/console/meta-service.log and look for stack traces. Most common: PG didn't come up first, or keystore.jks is missing or has the wrong permissions.
  3. Postgres up? pg_isready -h localhost -p 5432
  4. Keystore readable by the maestro user? ls -l /data/iserver/server/keystore.jks (must be 600 or 644, owned by the install user)

HTTPS works on localhost but not from another host

install.sh generates the keystore with SAN=DNS:$(hostname -f),DNS:localhost,IP:$(hostname -I),IP:127.0.0.1. If the host's FQDN or IP changed after install, regenerate:

mstop-domain
rm /data/iserver/server/keystore.jks
/data/iserver/bin/install.sh     # idempotent; only the keystore step will run
mstart-domain
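
Before regenerating, it can help to confirm what the running service actually presents. A sketch using the openssl CLI (the -ext option needs OpenSSL 1.1.1 or newer); show_served_san is a hypothetical helper and is not part of the install.

```shell
# Print the Subject Alternative Name entries of the certificate served on host:port.
show_served_san() {
    hostport=${1:-localhost:8181}
    # s_client prints the server certificate; closing stdin makes it exit
    # after the handshake. x509 -ext then extracts only the SAN extension.
    openssl s_client -connect "$hostport" </dev/null 2>/dev/null \
        | openssl x509 -noout -ext subjectAltName
}

# Usage: show_served_san localhost:8181
# Compare the DNS: and IP Address: entries with `hostname -f` and `hostname -I`.
```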

Postgres won't start: "database files are incompatible with server"

The PG version installed on the host doesn't match the data dir's catalog version. The bundled initdb.sql requires PG 16; if the host has PG 15 or PG 17 installed, point PGBIN in ~/.env_integrator.paths at the right /usr/{pgsql-16,lib/postgresql/16}/bin/ directory.
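
The mismatch can be confirmed before editing PGBIN by comparing the PG_VERSION file that PostgreSQL writes into every data directory with the version of the candidate binaries. A sketch; pg_version_check is a hypothetical helper, and the bin directory you pass is whichever one you intend to point PGBIN at.

```shell
# Compare a data directory's major version with a postgres binary's.
# $1 = data directory, $2 = directory containing the postgres binary.
pg_version_check() {
    datadir=$1
    pgbin=$2
    if [ ! -r "$datadir/PG_VERSION" ]; then
        echo "no PG_VERSION in $datadir" >&2
        return 2
    fi
    data_major=$(cat "$datadir/PG_VERSION")
    # "postgres (PostgreSQL) 16.4" -> "16"
    bin_major=$("$pgbin/postgres" --version | sed -E 's/^[^0-9]*([0-9]+).*/\1/')
    if [ "$data_major" = "$bin_major" ]; then
        echo "match: data dir and binaries are both PG $data_major"
    else
        echo "mismatch: data dir is PG $data_major, $pgbin is PG $bin_major"
        return 1
    fi
}

# Usage: pg_version_check /data/iserver/server/pgsql/data /usr/pgsql-16/bin
```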

"Address already in use" on 8181 or 5432

Another service is binding the port. Run ss -lntp | grep :8181 to identify it; usually it is a leftover meta-service or a system Postgres. Run mstop-all, then mstart-all; if the system Postgres is interfering, sudo systemctl disable --now postgresql (the bundled PG runs from /data/iserver/server/pgsql/).

WPF Administration → Server Log Refresh shows nothing

Two-link chain: WPF's ADMIN_SHOW_WEBSERVER_LOG SOAP op invokes show-webserver-log on the service host; the script runs tail -500 $MAESTRO_CONSOLE_LOG_DIR/meta-service.log. If the textbox is empty:

  1. Reader present? Run which show-webserver-log from inside the service process's environment. It lives in MAESTRO_UTILS_DIR (= …/server/utils/), which env_integrator appends to PATH. If missing, the service was launched without sourcing env_integrator.
  2. Writer and reader agree? Verify $MAESTRO_CONSOLE_LOG_DIR/meta-service.log is the file Spring Boot is actually writing to. On bare-metal, mstart-domain redirects to that exact path. On Docker, start-meta-service.sh exports LOGGING_FILE_NAME=$MAESTRO_CONSOLE_LOG_DIR/meta-service.log (08e2f3093; older images hard-coded /opt/iserver/data/log/meta-service.log, missing the console/ subdir, which made the reader exit 1 silently).
  3. File exists and non-empty? docker exec eltmaestro ls -la /opt/iserver/data/log/console/meta-service.log (Docker) or ls -la $MAESTRO_CONSOLE_LOG_DIR/meta-service.log (bare-metal). If the file is missing, the launcher didn't set LOGGING_FILE_NAME or the directory wasn't created. MaestroMeta.ADMIN_SHOW_WEBSERVER_LOG now surfaces stderr to the WPF caller on empty stdout (08e2f3093), so future drift of the same shape shows the diagnostic in the textbox itself.

Upgrade left services down

update-server.sh runs mstop-all then mstart-all. If the start phase failed, services stay down. Look at:

- tail -n 500 /data/iserver/data/log/console/*.log
- cat /data/iserver/VERSION to confirm which release is actually in place
- The update-server.sh console output, which logs each step

If the new release is unhealthy, see UPGRADE.md "Rollback" for the recovery path.