Time

Item

Who

Notes

DM-503-1 testing plan

Gregory Dubois-Felsmann

We still need to complete the milestone testing for the F17 "load and serve the WISE single-epoch data" effort.
The data were loaded just at the end of the cycle, but the PDAC team were then diverted by the LSP workshop and the holidays; we are just returning to wiring up the data into the Portal Aspect and completing the testing now.
The goals of the testing are:
- At DB/DAX level: verify that the expected data were loaded and are accessible; capture performance numbers for a set of representative queries.
- At Portal level: verify that the data are accessible.
- This round of testing does not cover UX improvements - the next "user testing" cycle will be after S18.
Initial attempts to work with the data at the beginning of this week ran into a lot of problems and so it hasn't been possible to complete the connections with the Portal.
We want to do these tests ASAP in order to allow the migration of PDAC to a Kubernetes deployment to advance.

Qserv / DAX status re: WISE data

Fritz Mueller

Data were ingested at the end of the last cycle. Discovered problems with the behavior/stability/performance of the version of Qserv and dbserv that were deployed on the PDAC systems.
A major change of the Qserv code base (breaking change to xrootd APIs) was deployed just before the holiday, and downstream consequences are still being shaken out.
Error handling in dbserv needed corrections.
Qserv is now running properly again and was able to perform a full table scan test on the WISE data.
Kian-Tat Lim: are there any lessons-learned from this for testing packages for dbserv/Qserv that might exercise the code paths that triggered the problems? Fritz Mueller: not really: most/all of the problems required running against a very large dataset. PDAC is ahead of the KPM testing at the moment, so the problems didn't show up in those tests.
Kenny Lo, via Fritz Mueller: still needs to do some final tests on the updated dbserv. Awaiting end of today's maintenance window.

PDAC Portal status

Unknown User (xiuqin)

Portal code should be ready to handle the now-more-stable dbserv/Qserv stack for the WISE single-epoch source tables.
The milestone test affects only catalog queries; there is no new image data.
Development news: now able to display HiPS images natively in Firefly.

Summary of the Meltdown/Spectre exploits

Donald Petravick

Gregory, Michelle,

I am not able to attend the PDAC meeting, but can make a statement about what the LSST project at NCSA knows/understands/would ask for.

My understanding is that programs have to execute on a machine in order to exploit these bugs. There are many machines which are 1) run by trusted administrators and 2) run only trusted code (after making a few things explicit, e.g. JavaScript in a browser on such a machine). I can see that a set of operational controls can be derived such that the machines can be run unpatched, if a patch degrades the system to a great extent. I also understand that Linux systems can boot into patched or unpatched mode.

I would say that any machine used by a community where these sorts of operational controls are absent would likely need patching, in the absence of a similar story about non-technical controls.

I also expect that smart engineers are working on this, and that a way will be found to make future generations of patches more efficient.

I understand (but only from hallway conversations) that NCSA is not pursuing a one-size-fits-all solution, and that the criteria above are valid input to the thinking about the patch at NCSA.

Best,
Don

Current status of Meltdown/Spectre patching and other infrastructure issues

Unknown User (pdomagala)

As of early this morning, patching and BIOS updates on the verify-worker nodes are underway.
- About 6 nodes are being difficult. Unsure if the team has wrestled them into submission yet.
- BIOS updates are required in order to address the Spectre vulnerability.
- The full set of patches needed for the Dell hosts is available.
  - Once completed, externally-facing systems in the LSST-dev and verification cluster environments should need no further patching.
- We do not have the full set of patches needed for Lenovo. They withdrew their initial BIOS updates and announced a new ETA of 12 Feb 2018.
  - This affects the PDAC cluster, which has entirely Lenovo hardware.
The same updates on lsst-dev01 started at 08:30.
Not patching non-user-login systems, such as GPFS servers, at this time. Still looking into performance implications. NCSA testing of client-side impacts so far shows sub-1% impacts. Asking users to look out for impacts.
Impacts are more likely on I/O heavy servers. Therefore we should pay special attention to any performance changes on the Qserv machines.
There are exploits circulating, so NCSA would like to apply the patches ASAP.
We will attempt to complete the milestone tests before next Thursday's (18 Jan 2018) maintenance window, so that patching can proceed. Expect a check-in on the Slack #dm-pdac channel on Wednesday morning to verify status and confirm that it's OK to proceed with the patching.

Kubernetes status

Unknown User (aloftus) ?

Kubernetes installation done on a single server: qserv-test01; see DM-12847 and its children for future updates.
Loi Ly will resume work on the IPAC Kubernetes cluster setup next week; IPAC has a cluster of 4 VMs ready, but Kubernetes installation was delayed by urgent IRSA public release work (DM-12950).
Fritz Mueller: Steve Pietrowicz has been able to get multicast working with the Weave network overlay.
Let's all stay in close touch on versions, networking configuration, etc.

AOB

Simon Krughoff: see IHS-695; there is an issue with the scheduling of Slurm jobs in the hours immediately preceding the maintenance window. Unknown User (mbutler): it's a reasonable request to improve this behavior, but it may take a bit of work to implement. Conclusion: Simon Krughoff will create a more specific ticket to capture the request.
- Post-meeting: IHS-699 has been created for this request.

Action items