1. Introduction

This page describes the steps used in DM-24396 to ingest the DESC-DC2 tables into Qserv. 

Joanne Bogart provided three TSV files containing the DESC DC2 (run12p_v4) "ccdvisit", "object", and "forcedsource" tables. The data files and their schemas are stored at /datasets/gapon/desc/ , a shared filesystem available on all machines.

  1. ccdvisit is a "regular" table and not spatial. The data are not partitioned, and the entire table is loaded to every worker.
  2. object is a "director" table. The data are spatially partitioned into chunks and distributed across the workers.
  3. forcedsource is a "child" table of object, with the director key objectid. Overlaps are not stored for child tables.

The steps largely follow Igor Gaponenko's live demo "Test ingest of a subset of one tract of the HSC Object catalog", with a few updates for newer versions of the system. This is one example of ingesting as of today, not the only way (e.g. one can use multiple super-transactions or a different sequence); the system also evolves, so details may change. Some approaches here will not scale, but they are fine for this dataset because the data volume is small.

If you want to follow along with the example, make sure you have access to the lsst* machines and can read the files mentioned above.

Currently, the steps needing the docker container (that is, the catalog-loading steps) are done on lsst-qserv-master01; the others can be done on lsst-dev* or lsst-verify-worker* machines.

The Qserv instance lsp-int is deployed at NCSA and usually has a version newer than the released version. 

2. Make all configuration files

Using the given schemas, manually make the configuration files for sph-partition and for Qserv table creation. The resulting files are at /datasets/gapon/desc/configs/

For sph-partition, note that the partitioning overlap is 0 for the forcedsource table, so there are no overlap files for forcedsource.
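
For illustration only, here is roughly how the object partitioning parameters (340 stripes, 3 sub-stripes, 0.01667 overlap, positions ra/decl, director key objectid, matching Step 5) could be given directly on the sph-partition command line instead of through a config file. The exact option names are an assumption here and should be checked against sph-partition --help.

sph-partition --verbose \
  --part.num-stripes=340 \
  --part.num-sub-stripes=3 \
  --part.overlap=0.01667 \
  --part.pos="ra, decl" \
  --id=objectid \
  --out.dir=/datasets/gapon/desc/object \
  --in=/datasets/gapon/desc/dpdd_object_qserv_data.txt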


For Qserv tables, examples below:

{"database":"desc_dc2_run12p_v4",
"table":"object",
"auth_key":"",
"is_partitioned":1,
"chunk_id_key":"chunkId",
"sub_chunk_id_key":"subChunkId",
"is_director":1,
"director_key":"objectid",
"latitude_key":"decl",
"longitude_key":"ra",
"schema":[
{"name": "objectid", "type": "bigint NOT NULL"},
{"name": "parentobjectid", "type": "bigint NOT NULL"},
{"name": "ra", "type": "double"},
{"name": "decl", "type": "double"},
...


{"database":"desc_dc2_run12p_v4",
"table":"forcedsource",
"auth_key":"",
"is_partitioned":1,
"chunk_id_key":"chunkId",
"sub_chunk_id_key":"subChunkId",
"is_director":0,
"director_key":"objectid",
"latitude_key":"decl",
"longitude_key":"ra",
"schema":[
{"name": "objectid", "type": "bigint NOT NULL"},
{"name": "ccdvisitid", "type": "bigint NOT NULL"},
{"name": "ra", "type": "double"},
{"name": "decl", "type": "double"},
...


{"database":"desc_dc2_run12p_v4",
"table":"ccdvisit",
"auth_key":"",
"is_partitioned":0,
"is_director":0,
"schema":[
{"name": "ccdvisitid", "type": "bigint NOT NULL"},
{"name": "visitid", "type": "int"},
...



3. Partition the CSV files

sph-partition --verbose -c /datasets/gapon/desc/configs/object.cfg \
  --out.dir /datasets/gapon/desc/object \
  --in=/datasets/gapon/desc/dpdd_object_qserv_data.txt \
  >& log-partition-object

sph-partition --verbose -c /datasets/gapon/desc/configs/forcedsource.cfg \
  --out.dir /datasets/gapon/desc/forcedsource \
  --in=/datasets/gapon/desc/forcedsource_qserv_data.txt \
  >& log-partition-forcedsource

These use the configuration files made in the last step, partition the input data, and write the chunk files to /datasets/gapon/desc/object and /datasets/gapon/desc/forcedsource  on the shared filesystem.

The names of the chunk and overlap files follow the format chunk_12345.txt or chunk_12345_overlap.txt, where 12345 is the chunk number (chunk ID).

The sph-partition command comes from the https://github.com/lsst/partition package. There are multiple ways to use it: I cloned the package to disk, built it, set it up as a local eups package with the shared stack in /software/lsstsw/stack/ , and partitioned these small files directly on the lsst-dev* machines. Alternatively, one can use the partition tools from the Qserv docker images.
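
For reference, a rough sketch of the clone-and-build route is below; the loadLSST.bash entry point and the scons invocation are assumptions about the shared-stack setup and may need adapting.

source /software/lsstsw/stack/loadLSST.bash    # assumed entry point of the shared stack
setup lsst_distrib                             # pull in dependencies from the shared stack
git clone https://github.com/lsst/partition
cd partition
setup -k -r .                                  # add the local clone on top of the stack setup
scons -j 4                                     # build; the sph-partition binary ends up under bin/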

4. Get the unique chunk numbers

As this dataset is small, I can "ls" the chunk file folders to get a list of chunk files, then a list of the 140 unique chunk numbers. 
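
A minimal sketch of that, relying only on the chunk-file naming convention from Step 3; the output file unique_chunk_numbers is what the Step 9 script reads.

ls /datasets/gapon/desc/object /datasets/gapon/desc/forcedsource \
  | sed -n 's/^chunk_\([0-9]\+\)\(_overlap\)\?\.txt$/\1/p' \
  | sort -n | uniq > unique_chunk_numbers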

Checking around, I found that a few chunks exist in object but not in forcedsource: chunks 78965, 74213, 77604. In a previous version of Qserv we needed to create empty chunk files for these missing chunks, but the current implementation handles this scenario.
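
That comparison can be done quickly on the two chunk directories, for example:

# Chunks present in object but missing from forcedsource.
comm -23 <(ls /datasets/gapon/desc/object | sed -n 's/^chunk_\([0-9]\+\)\.txt$/\1/p' | sort) \
         <(ls /datasets/gapon/desc/forcedsource | sed -n 's/^chunk_\([0-9]\+\)\.txt$/\1/p' | sort)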

5. Create a database

Use the Qserv REST web service to create a new database. The parameters must match the values in the partitioning configurations (Step 2).

echo '{"database":"desc_dc2_run12p_v4","num_stripes":340,"num_sub_stripes":3,"overlap":0.01667,"auth_key":""}' | \
curl 'http://lsst-qserv-master01:25080/ingest/v1/database' \
  -X POST \
  -H "Content-Type: application/json" \
  -d @-

A JSON object containing "success":1 is returned upon a successful operation via the web service; the same applies to the following steps that use the web service.

6. Configure table "object"

The configuration file at /datasets/gapon/desc/configs/desc_dc2_object.json was made in Step 2. Take a look at the file.

cat /datasets/gapon/desc/configs/desc_dc2_object.json | \
curl 'http://lsst-qserv-master01:25080/ingest/v1/table' \
  -X POST \
  -H "Content-Type: application/json" \
  -d @-

7. Configure table "forcedsource"

The configuration file at /datasets/gapon/desc/configs/desc_dc2_forcedsource.json was made in Step 2. Take a look at the file.

cat /datasets/gapon/desc/configs/desc_dc2_forcedsource.json | \
curl 'http://lsst-qserv-master01:25080/ingest/v1/table' \
  -X POST \
  -H "Content-Type: application/json" \
  -d @-

8. Start a super-transaction

echo '{"database":"desc_dc2_run12p_v4","auth_key":""}' | \
    curl 'http://lsst-qserv-master01:25080/ingest/v1/trans' \
      -X POST \
      -H "Content-Type: application/json" \
      -d @-

A super-transaction ID will be returned; that number is needed in many subsequent steps. In the examples below the super-transaction ID is 102. 

9. Allocate chunks

Conceptually, two (related) things happen here: (1) the chunks need to be allocated/registered before loading, and (2) we need to know which worker each chunk is to be loaded to.

I used the script /datasets/gapon/desc/ingest/allocate_chunks.py (shown below). It reads the unique chunk numbers compiled in Step 4, and I piped its output to a file named "chunk2worker", which then holds the chunk-to-worker mapping for all 140 chunks.

import requests
import sys

url = 'http://lsst-qserv-master01:25080/ingest/v1/chunk'
transactionId = 102   # super-transaction ID from Step 8
database = 'desc_dc2_run12p_v4'

# Read the unique chunk numbers compiled in Step 4, ask the ingest service to
# allocate each chunk, and print the resulting chunk-to-worker mapping.
with open('unique_chunk_numbers') as f:
    for chunkStr in f:
        chunk = int(chunkStr)
        response = requests.post(url, json={"transaction_id": transactionId, "chunk": chunk, "auth_key": ""})
        responseJson = response.json()
        if not responseJson['success']:
            print("failed for chunk: %d, error: %s" % (chunk, responseJson['error']))
            sys.exit(1)
        else:
            host = responseJson['location']['host']
            port = responseJson['location']['port']
            print("%d %s %d" % (chunk, host, port))


10. Load data of tables "object" and "forcedsource"

10.1. Make data-loading commands

qserv-replica-file-ingest is used for the loading; I ran it from the docker container qserv/replica:tools on lsst-qserv-master01.

Each loading command looks like this:

docker run --rm -t --network host -u 1000:1000 \
  -v /etc/passwd:/etc/passwd:ro \
  -v /datasets/gapon/desc/object/:/datasets/gapon/desc/object/:ro \
  --name qserv-replica-file-ingest-1 \
  qserv/replica:tools \
  qserv-replica-file-ingest --auth-key=AUTHKEY FILE lsst-qserv-db22 25002 102 object P /datasets/gapon/desc/object/chunk_79653_overlap.txt


In the above example 102 is the super-transaction ID. The chunk number of the chunk file, the qserv-db worker hostname, and the port have to match accordingly; that is why the chunk-to-worker mapping was stored in a file in the last step. For this small dataset the lookup is quick (but this approach does not scale).

I used the script /datasets/gapon/desc/ingest/worker_scripts_generator.py to make the loading command for each chunk or chunk-overlap file. For example, /datasets/gapon/desc/object/ contains 279 *.txt files, so I got 279 docker commands, one file per command.
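
A rough sketch of what such a generator could look like (this is not the actual worker_scripts_generator.py; it only illustrates combining the chunk2worker mapping from Step 9 with the ingest command shown above):

# For each chunk, look up its worker in chunk2worker and emit one docker
# command per existing chunk / overlap file of table "object".
while read -r chunk host port; do
  for f in /datasets/gapon/desc/object/chunk_${chunk}.txt \
           /datasets/gapon/desc/object/chunk_${chunk}_overlap.txt; do
    [ -e "$f" ] || continue
    echo "docker run --rm -t --network host -u 1000:1000" \
         "-v /etc/passwd:/etc/passwd:ro" \
         "-v /datasets/gapon/desc/object/:/datasets/gapon/desc/object/:ro" \
         "qserv/replica:tools" \
         "qserv-replica-file-ingest --auth-key=AUTHKEY FILE $host $port 102 object P $f"
  done
done < chunk2worker > load_commands_object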

There is also a batch mode for the loading that I wasn't using here. 

10.2. Load chunks of table "object"

Doing all 279 commands directly on qserv-master took around 15 minutes. 

Alternatively, we can use the 30 qserv workers to do the loading. First I used split -n l/30 --numeric-suffixes=1 <file_with_all_279_commands> <split_file_prefix> to split the 279 commands into 30 files. Then each qserv worker runs one set of loading commands; for example,

for i in {01..30}; do \
    echo $i;
    ssh -n qserv-db$i "source /PATH/TO/FILE/load_$i >& /tmp/ingest_$i.log" & \
done

Those split files live on a shared file system all workers can access, so I could do the above.

The object table loading took a minute or so. Then I checked that all went fine by confirming nothing was written to the log files:

for i in {01..30}; do \
    echo $i;
    ssh -n qserv-db$i "cat /tmp/ingest_$i.log" ;
done

10.3. Load chunks of table "forcedsource"

Same as loading the object chunks.

11. Configure table "ccdvisit"

Same as Step 6 or 7.  This step may also be done earlier.

The configuration file at /datasets/gapon/desc/configs/desc_dc2_ccdvisit.json was made in Step 2. Take a look at the file.

cat /datasets/gapon/desc/configs/desc_dc2_ccdvisit.json | \
    curl 'http://lsst-qserv-master01:25080/ingest/v1/table' \
      -X POST \
      -H "Content-Type: application/json" \
      -d @-

12. Load data of table "ccdvisit"

An example loading command is:

docker run --rm -t --network host -u 1000:1000 \
  -v /etc/passwd:/etc/passwd:ro \
  -v /datasets/gapon/desc/:/datasets/gapon/desc/:ro \
  --name qserv-replica-file-ingest-1 \
  qserv/replica:tools \
  qserv-replica-file-ingest --auth-key=AUTHKEY FILE lsst-qserv-db01 25002 102 ccdvisit R /datasets/gapon/desc/ccdvisit_data.csv

Do this for all workers. 
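
A minimal sketch of such a loop, assuming the same 30 worker hosts used above (lsst-qserv-db01 ... lsst-qserv-db30) and that the shared data path is readable from the machine running docker:

for i in {01..30}; do
  docker run --rm -t --network host -u 1000:1000 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /datasets/gapon/desc/:/datasets/gapon/desc/:ro \
    --name qserv-replica-file-ingest-$i \
    qserv/replica:tools \
    qserv-replica-file-ingest --auth-key=AUTHKEY FILE lsst-qserv-db$i 25002 102 ccdvisit R /datasets/gapon/desc/ccdvisit_data.csv
done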

13. Commit super-transaction

curl 'http://localhost:25080/ingest/v1/trans/102?abort=0&build-secondary-index=1' \
    -X PUT  -H "Content-Type: application/json" -d '{"auth_key":""}' 

14. Publish the database

curl 'http://localhost:25080/ingest/v1/database/desc_dc2_run12p_v4' \
    -X PUT  -H "Content-Type: application/json" -d '{"auth_key":""}'


15. Test the catalog

Run some simple queries to sanity-check the ingested catalogs. It may take a while after publishing before the queries work.
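
For example, assuming mysql-client access to the Qserv frontend (the host, port, and user below are assumptions, not something set up in this walkthrough):

mysql -h lsst-qserv-master01 -P 4040 -u qsmaster \
  -e "SELECT COUNT(*) FROM desc_dc2_run12p_v4.object"
mysql -h lsst-qserv-master01 -P 4040 -u qsmaster \
  -e "SELECT objectid, ra, decl FROM desc_dc2_run12p_v4.object LIMIT 5"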
