We still need to complete the milestone testing for the F17 "load and serve the WISE single-epoch data" effort.
The data were loaded just at the end of the cycle, but the PDAC team were then diverted by the LSP workshop and the holidays; we are just returning to wiring up the data into the Portal Aspect and completing the testing now.
The goals of the testing are:
At DB/DAX level: verify that the expected data were loaded and are accessible; capture performance numbers for a set of representative queries.
At Portal level: verify that the data are accessible.
This round of testing does not cover UX improvements - the next "user testing" cycle will be after S18.
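The "capture performance numbers for a set of representative queries" goal could be approached with a small timing harness along these lines. This is only a sketch: `time_queries`, the `run_query` callable, and the example query labels and table names are hypothetical and do not reflect the actual DAX or Qserv client interface used in PDAC.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Tuple


def time_queries(
    run_query: Callable[[str], object],
    queries: Iterable[Tuple[str, str]],
    repeats: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Run each (label, sql) pair `repeats` times; summarize wall-clock seconds."""
    results: Dict[str, Dict[str, float]] = {}
    for label, sql in queries:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)  # hypothetical: whatever client call submits the query
            samples.append(time.perf_counter() - start)
        results[label] = {
            "min": min(samples),
            "median": statistics.median(samples),
            "max": max(samples),
        }
    return results


if __name__ == "__main__":
    # Stand-in for a real client; replace with the actual query submission call.
    def fake_run_query(sql: str) -> list:
        time.sleep(0.01)
        return []

    # Hypothetical query set; the table and column names are placeholders.
    representative = [
        ("row_count", "SELECT COUNT(*) FROM example_table"),
        ("point_lookup", "SELECT * FROM example_table WHERE id = 1"),
    ]
    for label, stats in time_queries(fake_run_query, representative).items():
        print(f"{label}: median {stats['median']:.3f} s "
              f"(min {stats['min']:.3f}, max {stats['max']:.3f})")
```

Reporting the minimum and maximum alongside the median makes it easier to separate caching effects from steady-state performance when the same queries are rerun after a system change.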
Initial attempts to work with the data at the beginning of this week ran into a lot of problems and so it hasn't been possible to complete the connections with the Portal.
We want to do these tests ASAP in order to allow the migration of PDAC to a Kubernetes deployment to advance.
Data were ingested at the end of the last cycle. Problems were then discovered with the behavior, stability, and performance of the versions of Qserv and dbserv that were deployed on the PDAC systems.
A major change of the Qserv code base (breaking change to xrootd APIs) was deployed just before the holiday, and downstream consequences are still being shaken out.
Error handling in dbserv needed corrections.
Qserv is now running properly again and was able to perform a full table scan test on the WISE data.
Kian-Tat Lim: are there any lessons-learned from this for testing packages for dbserv/Qserv that might exercise the code paths that triggered the problems? Fritz Mueller: not really: most/all of the problems required running against a very large dataset. PDAC is ahead of the KPM testing at the moment, so the problems didn't show up in those tests.
Kenny Lo, via Fritz Mueller: still needs to do some final tests on the updated dbserv. Awaiting end of today's maintenance window.
I am not able to attend the PDAC meeting, but can make a statement about what the LSST project at NCSA knows/understands/would ask for.
My understanding is that programs have to execute on a machine in order to exploit these bugs. There are many machines that are 1) run by trusted administrators and 2) run only trusted code (after making a few things explicit, e.g. JavaScript in a browser on such a machine). I can see that a set of operational controls could be derived such that these machines can be run unpatched if a patch degrades the system to a great extent. I also understand that a Linux system can boot into patched or unpatched mode.
I would say that any machine used by a community where these sorts of operational controls are absent would likely need patching, in the absence of a similar story about non-technical controls.
I also expect that smart engineers are working on this, and that a way will be found to make future generations of patches more efficient.
I understand (but only from hallway conversations) that NCSA is not pursuing a one-size-fits-all solution, and that the criteria above are valid input to the thinking about the patch at NCSA.
Best, Don
Current status of Meltdown/Spectre patching and other infrastructure issues
As of early this morning, patching and BIOS updates on the verify-worker nodes are underway.
~ 6 nodes are being difficult. Unsure if the team has wrestled them into submission yet.
BIOS updates are required in order to address the Spectre vulnerability.
The full set of patches needed for the Dell hosts is available.
Once completed, externally-facing systems in the LSST-dev and verification cluster environments should need no further patching.
We do not have the full set of patches needed for Lenovo. They withdrew their initial BIOS updates and announced a new ETA of 12 Feb 2018.
This affects the PDAC cluster, which has entirely Lenovo hardware.
Same updates on lsst-dev01 started at 08:30
Not patching non-user-login systems, such as GPFS servers, at this time. Still looking into performance implications. NCSA testing of client-side impacts so far shows sub-1% impacts. Asking users to look out for impacts.
Impacts are more likely on I/O heavy servers. Therefore we should pay special attention to any performance changes on the Qserv machines.
There are exploits circulating, so NCSA would like to apply the patches ASAP.
We will attempt to complete the milestone tests before next Thursday's ( ) maintenance window, so that patching can proceed. Expect a check-in on the Slack #dm-pdac channel on Wednesday morning to verify status and confirm that it's OK to proceed with the patching.
- There is an issue with the scheduling of Slurm jobs in the hours immediately preceding the maintenance window. Michelle Butler: it's a reasonable request to improve this behavior, but it may take a bit of work to implement. Conclusion: Simon Krughoff will create a more specific ticket to capture the request.