Time | Item | Who | Notes
DM-503-1 testing plan
  • We still need to complete the milestone testing for the F17 "load and serve the WISE single-epoch data" effort.
  • The data were loaded just at the end of the cycle, but the PDAC team were then diverted by the LSP workshop and the holidays; we are just returning to wiring up the data into the Portal Aspect and completing the testing now.
  • The goals of the testing are:
    • At DB/DAX level: verify that the expected data were loaded and are accessible; capture performance numbers for a set of representative queries.
    • At Portal level: verify that the data are accessible.
    • This round of testing does not cover UX improvements - the next "user testing" cycle will be after S18.
  • Initial attempts to work with the data at the beginning of this week ran into a lot of problems and so it hasn't been possible to complete the connections with the Portal.
  • We want to do these tests ASAP in order to allow the migration of PDAC to a Kubernetes deployment to advance.
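
The "capture performance numbers for a set of representative queries" goal above could be approached with a small timing harness like the one below. This is only a minimal sketch under stated assumptions: `time_queries`, its `run_query` argument, and the query names are hypothetical and are not part of dbserv, Qserv, or the Portal. `run_query` stands in for whatever client call actually submits SQL to the service under test.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_queries(run_query: Callable[[str], object],
                 queries: Dict[str, str],
                 repeats: int = 3) -> Dict[str, Dict[str, float]]:
    """Run each named query `repeats` times and summarise wall-clock time.

    `run_query` is a hypothetical stand-in for the real client call;
    its return value is discarded, only the elapsed time is recorded.
    """
    results: Dict[str, Dict[str, float]] = {}
    for name, sql in queries.items():
        timings: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            timings.append(time.perf_counter() - start)
        results[name] = {
            "min_s": min(timings),
            "median_s": statistics.median(timings),
            "max_s": max(timings),
        }
    return results
```

Reporting the minimum, median, and maximum rather than a single number makes it easier to see whether a slow run was a one-off (for example a cold cache) or typical behavior.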
 Qserv / DAX status re: WISE data
  • Data were ingested at the end of the last cycle. Discovered problems with the behavior/stability/performance of the version of Qserv and dbserv that were deployed on the PDAC systems.
  • A major change of the Qserv code base (breaking change to xrootd APIs) was deployed just before the holiday, and downstream consequences are still being shaken out.
  • Error handling in dbserv needed corrections.
  • Qserv is now running properly again and was able to perform a full table scan test on the WISE data.
  • Kian-Tat Lim: are there any lessons-learned from this for testing packages for dbserv/Qserv that might exercise the code paths that triggered the problems? Fritz Mueller: not really: most/all of the problems required running against a very large dataset. PDAC is ahead of the KPM testing at the moment, so the problems didn't show up in those tests.
  • Kenny Lo, via Fritz Mueller: still needs to do some final tests on the updated dbserv. Awaiting end of today's maintenance window.

PDAC Portal status
  • Portal code should be ready to handle the now-more-stable dbserv/Qserv stack for the WISE single-epoch source tables.
  • The milestone test affects only catalog queries; there is no new image data.
  • Development news: now able to display HiPS images natively in Firefly.

Summary of the Meltdown/Spectre exploits (Donald Petravick)

Gregory, Michelle,

I am not able to attend the PDAC meeting, but can make a statement about what the LSST project at NCSA knows/understands/would ask for.

My understanding is that programs have to execute on a machine in order to exploit these bugs. There are many machines which 1) are run by trusted administrators and 2) run only trusted code (after making a few things explicit, e.g. JavaScript in a browser on such a machine). I can see that a set of operational controls can be derived such that the machines can be run unpatched, if a patch degrades the system to a great extent. I also understand that Linux systems can boot into patched or unpatched mode.

I would say that any machine used by a community where these sorts of operational controls are absent would likely need patching, in the absence of a similar story about non-technical controls.

I also expect that smart engineers are working on this, and that a way will be found to make future generations of patches more efficient.

I understand (but only from hallway conversations) that NCSA is not pursuing a one-size-fits-all solution, and that the criteria above are valid input to the thinking about the patch at NCSA.

Best
Don


Current status of Meltdown/Spectre patching and other infrastructure issues
  • As of early this morning, patching and BIOS updates on the verify-worker nodes are underway.
    • ~6 nodes are being difficult. Unsure if the team has wrestled them into submission yet.
    • BIOS updates are required in order to address the Spectre vulnerability.
    • The full set of patches needed for the Dell hosts is available.
      • Once completed, externally-facing systems in the LSST-dev and verification cluster environments should need no further patching.
    • We do not have the full set of patches needed for Lenovo. They withdrew their initial BIOS updates and announced a new ETA of 12 Feb 2018.
      • This affects the PDAC cluster, which has entirely Lenovo hardware.
  • Same updates on lsst-dev01 started at 08:30

  • Not patching non-user-login systems, such as GPFS servers, at this time. Still looking into performance implications. NCSA testing of client-side impacts so far shows sub-1% impacts. Asking users to look out for impacts.
  • Impacts are more likely on I/O heavy servers. Therefore we should pay special attention to any performance changes on the Qserv machines.
  • There are exploits circulating, so NCSA would like to apply the patches ASAP.
  • We will attempt to complete the milestone tests before next Thursday's (  ) maintenance window, so that patching can proceed. Expect a check-in on the Slack #dm-pdac channel on Wednesday morning to verify status and confirm that it's OK to proceed with the patching.
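
For anyone watching for performance changes on the Qserv machines after patching, the comparison against the "sub-1%" figure above amounts to a relative-change calculation. The sketch below is illustrative only: the function names are hypothetical, and the 1% default threshold is an assumption borrowed from the client-side figure quoted above, not an agreed acceptance criterion.

```python
def relative_change_percent(baseline_s: float, patched_s: float) -> float:
    """Percentage change in elapsed time after patching.

    Positive values mean the patched run was slower than the baseline.
    """
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (patched_s - baseline_s) / baseline_s * 100.0


def exceeds_threshold(baseline_s: float, patched_s: float,
                      threshold_percent: float = 1.0) -> bool:
    """True if the slowdown is larger than `threshold_percent`.

    The 1.0 default is an assumption for illustration, not a project rule.
    """
    return relative_change_percent(baseline_s, patched_s) > threshold_percent
```

Timings for the same query before and after the maintenance window would be passed in as `baseline_s` and `patched_s`; a single pair of runs is noisy, so comparing medians over several runs is likely to be more informative.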

Kubernetes status
  • Kubernetes installation done on a single server: qserv-test01; see DM-12847 and its children for future updates.
  • Loi Ly will resume work on the IPAC Kubernetes cluster setup next week; IPAC has a cluster of 4 VMs ready, but Kubernetes installation was delayed by urgent IRSA public release work. See DM-12950.
  • Fritz Mueller: Steve Pietrowicz has been able to get multicast working with the Weave network overlay.
  • Let's all stay in close touch on versions, networking configuration, etc.

AOB
  • Simon Krughoff: see IHS-695 - there is an issue with the scheduling of Slurm jobs in the hours immediately preceding the maintenance window. Unknown User (mbutler): it's a reasonable request to improve this behavior, but it may take a bit of work to implement. Conclusion: Simon Krughoff will create a more specific ticket to capture the request.
    • Post-meeting: IHS-699 has been created for this request.

Action items

  •