Currently have

  • 6 dbdev machines, each: 8 cores, 16 GB RAM, 2x1 TB local storage, sudo

Better to have (in place of dbdev)

Qserv and webserv dev

  • Why needed? Software dev
  • What is needed: 8-12 VMs, each: 4 cores, 8 GB RAM, ~50 GB local storage, sudo, access to ncsa /nfs (a provisioning sketch for a VM of this size appears after this list)
    • assumption: one machine per developer, occasionally we need more than 1 VM per developer
  • When needed: would be nice to have soon
  • Typical usage: daily interactive use during working hours, which sometimes extend to around midnight
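
  A minimal provisioning sketch for one development VM of this size, assuming the NCSA OpenStack mentioned in the comments below and the openstacksdk Python client; the cloud, image, flavor, network, and keypair names are placeholders rather than actual NCSA resources:

    import openstack

    # Connect using a named cloud from clouds.yaml; "ncsa" is a placeholder name.
    conn = openstack.connect(cloud="ncsa")

    # Image, flavor, network, and keypair names are assumptions, not real NCSA
    # resources -- substitute whatever the site actually provides.
    image = conn.compute.find_image("centos-7")
    flavor = conn.compute.find_flavor("qserv.dev")      # ~4 vCPU / 8 GB RAM / 50 GB disk
    network = conn.network.find_network("lsst-dev-net")

    server = conn.compute.create_server(
        name="qserv-dev-01",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
        key_name="qserv-dev-key",
    )
    server = conn.compute.wait_for_server(server)
    print(server.name, server.status)

  The same approach would cover the other VM classes on this page by swapping in a different flavor.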

Qserv on-demand integration testing

  • Why needed? To validate new code
  • What is needed: 3 VMs per developer, each: 2 cores, 4 GB RAM, ~20 GB local storage, sudo, access to ncsa /nfs
  • When needed:
    • immediately (as of July 2015)
  • Typical usage: running Qserv distributed integration tests to validate new code. Runs just for a few minutes each time when triggered
  • Longer term (in ~1 year), it would be nice to have a few more VMs per developer (say 8)

Qserv continuous integration testing

  • Why needed? To surface hard-to-predict problems, and to test things in different configurations under different loads
  • What is needed: ~40 VMs, each 2 cores, 4 GB RAM, ~20 GB local storage, sudo, access to ncsa /nfs
  • When needed: in ~1 year
  • Typical usage: running tests 24x7. 

Qserv specialized tests

  • Why needed? For specialized testing, for example: debugging more complex problems involving race conditions, fault tolerance, threading, hard-to-reproduce issues, etc.
  • What is needed? Configuration will vary depending on what we are testing. In some cases 100 VMs, each: 1 core, 1 GB RAM, 10 GB local storage; in other cases 10 VMs, each: 16 cores, 16 GB RAM, 1 TB storage.
  • When needed? In ~1 year.
  • Typical usage: used most of the time during the requested time period
  • Note, this is not needed ad-hoc; it can be planned and scheduled a few days or weeks in advance

Qserv large scale tests

  • Why needed? To test qserv at scale
  • What is needed? 25, 50, 100, 150 VMs. Each: 4 cores, 8 GB RAM, at least 100 GB local storage
  • When needed? Up to ~100 VMs in ~1 year; more later, perhaps with more cores
  • Typical usage: used most of the time during the requested time period
  • Note, this is not needed ad-hoc, can be planned and scheduled 1-2 months in advance

Qserv KPIs

  • We are planning to run tests to demonstrate we can meet planned KPIs. This is captured in DLP-645. The bottom line:

    Cycle   DR1 Catalog [%]
    S15     10
    S16     20
    S17     30
    S18     50
    S19     75
    S20     100

Qserv as a service

  • Why needed? For SUI. To gain experience with running continuously.
  • What is needed? 
    • ~1/2 year time frame: 4 VMs, each: 8 cores, 16 GB RAM, 2 TB local storage, sudo
      • to serve DC_W13_Stripe82 or equivalent
    • ~2 year time frame: 8-16 VMs, each: 8 cores, 16 GB RAM, 2 TB local storage, sudo 
      • could go to much higher numbers (even like 50-100 VMs) if we have funding and if SUI would really find it useful.
  • When needed? Soon
  • Typical usage: 24x7 Qserv service for SUI

Webserv as a service

  • Why needed? For SUI. To gain experience with running continuously.
  • What is needed? 1 VM: 4 cores, 8 GB RAM, small local storage, access to ncsa /nfs
  • When needed: ASAP
  • Typical usage: running webserv and DataCat as a service, continuously
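
Near-term capacity summary

  To put the near-term requests above in one place, here is a rough capacity tally in Python. The VM counts are assumptions: the upper ends of the requested ranges, with the on-demand CI line assuming roughly 6 developers at 3 VMs each.

    # Rough near-term capacity summary, using numbers from the lists above.
    # VM counts are assumptions: upper ends of the requested ranges, and the
    # on-demand CI line assumes roughly 6 developers (3 VMs each).
    requests = {
        # name: (vm_count, cores_per_vm, ram_gb_per_vm, disk_gb_per_vm)
        "dev":             (12, 4,  8,   50),
        "on-demand CI":    (18, 2,  4,   20),
        "qserv service":   ( 4, 8, 16, 2000),
        "webserv service": ( 1, 4,  8,   20),
    }

    total_vms = total_cores = total_ram = total_disk = 0
    for name, (n, cores, ram, disk) in requests.items():
        print(f"{name:16s} {n:3d} VMs  {n*cores:4d} cores  "
              f"{n*ram:5d} GB RAM  {n*disk:6d} GB disk")
        total_vms += n
        total_cores += n * cores
        total_ram += n * ram
        total_disk += n * disk

    print(f"{'total':16s} {total_vms:3d} VMs  {total_cores:4d} cores  "
          f"{total_ram:5d} GB RAM  {total_disk:6d} GB disk")

  The longer-term items (continuous CI, specialized tests, large scale tests) would come on top of this, but they can be scheduled rather than held permanently.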

General comments

  • We never tested MySQL / Qserv with an object store, so we need a regular file system. Theoretically we could try using /nfs for some lightweight things, but given we always need I/O, that might impact the speed of development.
  • We need to be able to put the lsst stack on these VMs and talk to github, so having access to the external network is useful. If that is hard, then we need at least one head node that can talk to both the external network and the VMs, with at least 100 GB of storage.
  • The VMs need to connect to each other (standard sockets); a minimal connectivity check is sketched below.
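
  A minimal sketch of the kind of VM-to-VM connectivity check implied by the last bullet; the host names and ports are placeholders, the point is only that plain TCP between the VMs has to work:

    import socket

    # Host names and ports are placeholders.
    PEERS = [("qserv-czar", 4040), ("qserv-worker-01", 1094), ("qserv-worker-02", 1094)]

    def can_connect(host, port, timeout=3.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host, port in PEERS:
        print(f"{host}:{port}", "ok" if can_connect(host, port) else "UNREACHABLE")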

14 Comments

  1. You're OK sharing your interactive development machines with CI tests in the near term? I thought you'd want more like 10 at ~50% time (but much less than 100% CPU for that time) and another 4-6 at ~10% time (but at near-100% CPU for that time).

  2. Yes,  I think it will be ok to share with CI tests as long as we get some sort of priority, e.g. having to wait for a VM when someone wants to do the coding would be pretty bad. The machines can be pretty much all for CI-tests during non working hours, especially in the short term when we are not planning to run any continuous integration tests.  (mind you though, our working hours typically extend till 12:30 or 1 am)

    I thought you'd want more like 10 at ~50% time (but much less than 100% CPU for that time)

    that is what I tried to convey through "12 VMs, daily interactive use".

    I am worried about reducing CPU too much on each VM because fast compilation does speed up development

    and another 4-6 at ~10% time (but at near-100% CPU for that time).

    that is what I tried to convey through "periodic integration tests on demand"

    I'll tweak the page.

  3. I think the main thing is that it's better not to think about "reusing/sharing machines" for anything.  Each VM should be for a single purpose.  You reuse by oversubscribing VMs to real hardware (if they're underloaded) or by pausing them and restarting them.  Just specify the resources you need for each purpose and let the admins/provisioning take care of the rest.

  4. Unknown User (danielw)

    I think I'd rather not "share" the VMs in the sense of having a particular VM house both one developer's interactive dev environment, and stuff for CI testing. I'd rather have around one VM per dev for interactive use. Occasionally, a dev will want to spin up another VM or two for testing, but these are extra, and can almost certainly be paused/snapshotted-to-disk except during active use. Now these interactive VMs can share physical hardware with each other–most devs are not compiling/testing at the same moment, so the interference should be small.

    The CI (or multi-node test) machines would be idle most(?) of the time. Having them be triggered by merge-to-master would be fine. Ideally, they could also track particular branches, and re-build/re-test upon each push to those branches. E.g., each dev typically works on a couple of branches, and it would be nice for the testers to track those branches. These CI VMs can be paused when idle, as long as we can find a way for their results to be available/viewable when paused, and for them to automatically resume/spin up when needed. (A rough sketch of this branch-tracking flow appears at the end of this comment.)

    I think the hypervisor for the boxes on which these VMs run (the dbdev pool?) can oversubscribe the cores in general. I think we can manage interference from each other socially instead of having technical controls.
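
    A minimal sketch of the branch-tracking resume-and-test flow described above, assuming a simple poller; the repository URL, branch names, CI VM name, and test command are placeholders, and the pause/unpause calls assume the OpenStack CLI mentioned later in this thread:

      import subprocess
      import time

      # All names below are placeholders: the repo URL, the branches, the CI VM
      # name, and the test command run on the VM are illustrative only.
      REPO = "https://github.com/lsst/qserv.git"
      TRACKED_BRANCHES = ["master", "tickets/DM-XXXX"]
      CI_VM = "qserv-ci-01"

      def head_sha(branch):
          # Ask the remote for the branch tip without cloning anything.
          out = subprocess.run(["git", "ls-remote", REPO, branch],
                               capture_output=True, text=True, check=True).stdout
          return out.split()[0] if out else None

      last_seen = {b: None for b in TRACKED_BRANCHES}
      while True:
          for branch in TRACKED_BRANCHES:
              sha = head_sha(branch)
              if sha and sha != last_seen[branch]:
                  last_seen[branch] = sha
                  # Resume the paused CI VM, run the tests, pause it again.
                  subprocess.run(["openstack", "server", "unpause", CI_VM], check=True)
                  time.sleep(30)  # give the guest a moment after unpausing
                  subprocess.run(["ssh", CI_VM, "run-qserv-integration-tests", branch])
                  subprocess.run(["openstack", "server", "pause", CI_VM], check=True)
          time.sleep(300)  # poll every five minutes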

     

  5. I never meant to imply "sharing" in the sense of sharing one VM for different purposes. I only meant it in the way that our VMs would share underlying hardware infrastructure with other non-db things, and as a result we might not have enough resources to run our VMs. (Note the "non-DB CI tests" words I used.)

  6. So one thing I'd like to look at in the very short term is who will bring up the VMs and who will have root on them, and to understand what, if any, policy implications there are at NCSA. Can people come to the 1:30 meeting with their thoughts?

  7. I'd have thought it'd be the same as on the dbdev cluster: NCSA gives us a standard box with the officially supported Linux OS, we have root, and we configure things the way that is appropriate for qserv tests or development. If we see any low-level issues (like disk corruption, etc.) we file a ticket through lsst-admin and NCSA helps us.

  8. Unknown User (bglick)

    Regarding root access... having sudo access to root only fits well on servers that don't have full access to LSST's shared NFS filesystems.  So, we need to clarify one or both of the following requirements:

    1. Do you really need full root access on development hosts?  Or would providing a list of specific commands you could do via sudo suffice?
    2. Do you really need both read and write access to all LSST NFS mounts?  If we could limit NFS access, then that could lessen the security implications of root access.  Explain how you expect to use NFS on the development nodes.

     

    1. Unknown User (danielw)

      My thoughts:

      1. I don't think we need full root access, but I think we'll need a lot eventually: install/remove/update packages, add/remove users (e.g. for having separate accounts for the different qserv processes to run under), perhaps attaching debuggers to processes of other users (e.g., mysql user, zookeeper user, etc.). Still, in most cases, we won't need too many commands, so if you start us off with add/remove packages and tweaking kernel parameters(?), I think we are okay with bugging an admin (so long as you don't mind multiple requests/day until we figure out which exact commands we need).
      2. Personally, I think we are okay with read access, and maybe only access to directories that normal LSST users can read, but I don't think NFS access control is configurable beyond "mount this path with read or write". I think the others on the team should decide this one–they are more involved in loading/copying data.

       

    2. Bill, root access is very useful for development. For example, we run into missing dependencies and such from time to time, and having to go through "please install this package", "oh, and this", "no, not this version, the latest one", etc. would just hamper development.

      Read-only access to nfs is good enough.

  9. Unknown User (danielw)

    Also, note that for actual performance benchmarking, we probably want to be on real disk–qserv is architected specifically to optimize I/O performance on locally-attached disk and (more or less) to keep nearly all disk spindles in the cluster constantly reading at their top sustained read bandwidth in order to meet our query performance requirements. That being said, most development and testing does not require local disk–it's okay for the I/O to be slow. The I/O optimization code can be developed and tested on one (or at most a few) physical nodes. I guess this means I'm asking for one non-virtualized node for the team with at least two physical spindles accessed independently (non-RAID), but we actually have a few such machines here at SLAC, so we don't need them at NCSA (unless the SLAC ones disappear).

  10. Unknown User (danielw) - I think that you would be better served by bare metal provisioning rather than running via a VM for performance testing. It is possible to pass through the block devices via virtio or to map the PCI device into the VM, but this is fairly complicated to set up (EC2 does it, but I don't know how to do it with openstack nova unless cinder is running on all hypervisors) and it still won't have the same latency characteristics as bare metal. OpenStack's ironic baremetal provisioner is now reasonably mature and at least HP and Rackspace are both selling it. E.g. http://www.rackspace.com/cloud/servers/onmetal

  11. Unknown User (bglick) I realize this is the page for DB requirements, but SQRE needs to both be able to upload custom images and fully control provisioning. We can't use fog/terraform/vagrant/etc. without root.

  12. OK. We are on the verge of shaking down an NCSA OpenStack – I'm not sure if Jacek knows that – and we are now (finally) on the verge of getting a hardware amendment.