Video recordings of the associated "seminar" walking through the various spreadsheets, including the points below, are attached as Recording Part 1 and Recording Part 2.

Flowdown

  • Science requirements provide common inputs (LSE-81, LSE-82)
  • Storage and bandwidth (LDM-141, LDM-139)
  • Compute and memory (LDM-138, LDM-140)
  • Network is basically ignored
  • Costs (LDM-144, LDM-143)
  • All spreadsheets are designed to be linked through single tabs: the SciReq tab comes from the science requirements, and the Output tab of each sizing spreadsheet becomes an Input tab in LDM-144

Science Requirements:

  • Science Estimates
    • Stars based on Milky Way model; interpolation in magnitude is nonlinear
    • US (universal survey) stars are calculated so that GP (galactic plane) stars can be derived (all stars minus US stars), because the sizing assumes MultiFit is not run on the GP stars
    • Galaxies: note that single visit calculated differently from Objects
    • All star/galaxy models are calculated for 30K sq deg and need to be normalized to the actual survey area of 25K sq deg (see the sketch below)
    • False positive fraction for transients is huge
    • Assume false positives only get DiaForcedSources going back one month, not into the future
    • Assume real transients get DiaForcedSources for one month into the future, until they fade
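
A minimal sketch (Python; the counts and function names are placeholders, not LDM values) of the star-count bookkeeping above: model counts for 30K sq deg are scaled to the 25K sq deg survey area, and GP stars are derived as all stars minus US stars.

```python
MODEL_AREA_SQ_DEG = 30_000   # area the star/galaxy models are computed for
SURVEY_AREA_SQ_DEG = 25_000  # actual survey area used for sizing

def normalize_to_survey(model_count: float) -> float:
    """Scale a count from the 30K sq deg model area to the survey area."""
    return model_count * SURVEY_AREA_SQ_DEG / MODEL_AREA_SQ_DEG

# Hypothetical model outputs, not the real spreadsheet numbers.
all_stars_model = 20e9   # all stars over the model area
us_stars_model = 12e9    # universal-survey stars over the model area

all_stars = normalize_to_survey(all_stars_model)
us_stars = normalize_to_survey(us_stars_model)
gp_stars = all_stars - us_stars   # galactic-plane stars: sized with no MultiFit

print(f"US stars: {us_stars:.2e}, GP stars (no MultiFit): {gp_stars:.2e}")
```
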
  • Camera Specifications
    • Pixels per image includes overscan
    • Bits per pixel should probably be here, not elsewhere
  • Survey/Cadence Specifications
    • Added margin for visits
    • Epochs distributed across filters according to original cadence
  • EFD Specifications
    • Mostly Large File Annex (expected to be spectra)
    • But spectra will now not come through the LFA, so this is likely oversized
  • Network Requirements
    • Used to size disk and (unused) network bandwidths
    • Note that templates and coadds are assumed to be transferred during generation, and catalogs after generation but not after all QA
  • Image Storage Requirements
    • Raw images not updated for latest Commissioning Plan
    • Master calibration images not updated for latest Calibration Plan
    • Note that the template is only from the last DR, and is furthermore only the last year's
    • Unclear why 4 releases of Master Calibration Images are kept; likely to make retrieval of PVIs for DRn-2 and DR1 more efficient
  • DRP Specifications
    • Contains information about master calibration generation (monthly), co-add retention, and template generation (annual)
    • Sequencing is assumed to allow parallelism and overlap of execution, but major chunks of the overall 9-month processing are required to execute in 18 weeks; this affects bandwidth and compute
  • CPP Specifications
    • Calibration DB likely too small, additional non-relational data may make it substantially larger
  • User Image Access Specifications
    • CSI (calibrated science image) = PVI (processed visit image)
  • User Catalog Query Specifications
    • Shared scans have a target scan length at full load; they can execute faster under a lighter load
    • Multiple scans occur at the same time, so disk bandwidth needs to be allocated for all of them (see the sketch below)
    • Query result size is bogus: derived from image result size
    • Low volume query areas are in square degrees
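
A sketch of the shared-scan bandwidth allocation noted above, assuming each concurrent scan must sweep its table within its target scan length, so the required disk bandwidths simply add. Table names, sizes, and scan lengths are placeholders, not LDM-141 values.

```python
# Hypothetical concurrent shared scans: (table, size in TB, target scan length in hours).
concurrent_scans = [
    ("Object", 40.0, 1.0),
    ("Source", 400.0, 12.0),
    ("ForcedSource", 900.0, 24.0),
]

def scan_bandwidth_gb_s(size_tb: float, hours: float) -> float:
    """Disk bandwidth (GB/s) needed to sweep a table once within the target time."""
    return size_tb * 1000.0 / (hours * 3600.0)

# The scans run at the same time, so their bandwidth requirements add.
total = sum(scan_bandwidth_gb_s(size, hrs) for _, size, hrs in concurrent_scans)
for name, size, hrs in concurrent_scans:
    print(f"{name:>12}: {scan_bandwidth_gb_s(size, hrs):6.2f} GB/s")
print(f"{'total':>12}: {total:6.2f} GB/s to allocate")
```
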
  • L3 Processing Specifications
    • 10% of AP + DRP TFLOPS allocated to L3 Service Level 1
    • 2% of AP + DRP TFLOPS allocated to L3 Service Level 2
    • Then 16% of all L3 allocated to Chile, 84% to US for unknown reasons
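
The L3 allocation arithmetic above, written out as a sketch; the AP+DRP total is a placeholder.

```python
ap_plus_drp_tflops = 1000.0   # placeholder, not the actual sized value

l3_level1 = 0.10 * ap_plus_drp_tflops   # L3 Service Level 1
l3_level2 = 0.02 * ap_plus_drp_tflops   # L3 Service Level 2
l3_total = l3_level1 + l3_level2

l3_chile = 0.16 * l3_total              # split between sites
l3_us = 0.84 * l3_total

print(f"L3 total: {l3_total:.0f} TFLOPS (Chile {l3_chile:.0f}, US {l3_us:.0f})")
```
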
  • EPO Specifications
    • Number of streams not used
    • Time to read used for disk I/O
  • Common Constants and Derived Values
    • KB/MB/GB/TB/PB are really KiB/MiB/GiB/TiB/PiB
    • Images and visits per night are overestimates: 2.75M visits in 2984 nights is about 922/night (see the sketch below)
    • Images are 16 and 32 bit
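
Two of the constants checks above, reproduced as a sketch: the binary reading of the unit prefixes and the average visits per night implied by the survey totals.

```python
# The spreadsheet's KB/MB/GB/TB/PB are really binary units.
KB, MB, GB, TB, PB = 2**10, 2**20, 2**30, 2**40, 2**50
print(f"1 'TB' in the spreadsheet = {TB} bytes")

# Average visits per night implied by the survey totals.
total_visits = 2.75e6
survey_nights = 2984
print(f"{total_visits / survey_nights:.0f} visits/night on average")  # ~922
```
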

Storage Sizing:

  • output_io
    • L3 community images are based on all other storage on disk plus per-user storage (more than 10%)
    • MSS doesn't take into account DR for L3 or other databases
    • The L3 database in Chile is just 10% of the L2 catalogs, not the other DBs (and doesn't use the proper input value)
  • otherInput
    • Still has 16 bit raw images
    • Times are used to estimate disk bandwidths
    • Database replicas (for FT, not DR) are here
    • Reserve space for columns added before production, plus additional per DR
    • Takes CPU to decompress
    • Shared scans reduce seeks/IOPs
  • imgData
    • One set of templates is kept on disk, not one per year, then thrown away
    • Templates and coadds are big, but PVIs are huge
    • Attempted to break productions down into I/Os, estimate time and size
    • Note that master calibrations for building PVIs are bigger than raw
  • dbL1
    • Every update to a DiaObject from a DiaSource or DiaForcedSource results in a new row (see the sketch below)
    • Main L1DB for user query is assumed to be compressible
    • "Up-to-date Catalog" is live L1DB
    • Additional data on disk for Alert Production, not necessarily in DB
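
A sketch of the DiaObject growth rule above: since every DiaSource or DiaForcedSource update appends a new DiaObject row, the row count scales with the sum of those counts rather than with the number of distinct objects. All counts and the row size are placeholders.

```python
# Placeholder counts and row size, not LDM-141 values.
n_dia_sources = 5.0e10
n_dia_forced_sources = 3.0e11
dia_object_row_bytes = 1500

# Each update appends a new DiaObject row, so rows ~ sources + forced sources.
n_dia_object_rows = n_dia_sources + n_dia_forced_sources
size_tib = n_dia_object_rows * dia_object_row_bytes / 2**40

print(f"DiaObject rows: {n_dia_object_rows:.2e} (~{size_tib:.0f} TiB before compression)")
```
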
  • dbL2
    • Note reduced partitions for DiaObject, unclear why
    • "Use SSD" here is not used; it is on otherInput
  • network
    • Archive to Base transfer of DR products not included because expecting to do on disk
  • rowSize
    • Row sizes include 25% overhead for unknowns before production, plus 3% growth during production (see the sketch below)
    • Alert size is here
    • APMean is aperture profile (surface brightness in annulus)
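
A sketch of the row-size padding above. The note does not say whether the 3% growth is per data release or total; the sketch assumes it compounds per DR (consistent with the per-DR reserve on otherInput), and the base row size is a placeholder.

```python
def effective_row_bytes(base_bytes: float, data_release: int) -> float:
    """Row size with the 25% pre-production reserve and 3% growth per DR (assumed)."""
    return base_bytes * 1.25 * 1.03**data_release

base = 1000.0   # placeholder base row size in bytes
for dr in (1, 5, 10):
    print(f"DR{dr}: {effective_row_bytes(base, dr):.0f} bytes/row")
```
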
  • smallTables
    • Provenance and QA are added here
  • qservMaster
    • Models cores, memory, disk, network needed for czars
  • forEPO
    • Used to estimate EPO, now obsolete

Compute Sizing

  • Overall
    • Computing rates (TFLOPS): compute the total TFLOPs of work and model the seconds available to do it
    • TFLOPs are back-calculated from Tcyc (= seconds * clock speed) and FLOPs per cycle based on TOP500 LINPACK
    • Things done in parallel add TFLOPS, in series use MAX()
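
A sketch of that bookkeeping: per-step work in teracycles (Tcyc = run-time seconds × clock speed) is converted to TFLOPs with an assumed FLOPs-per-cycle efficiency in the TOP500 LINPACK style, and the resulting rates add for steps run in parallel while MAX() is taken for steps run in series. The clock speed, efficiency, step times, and wall-clock window are all illustrative.

```python
CLOCK_HZ = 2.5e9        # assumed clock speed
FLOPS_PER_CYCLE = 4.0   # assumed LINPACK-style efficiency

def tflops_needed(work_seconds: float, wall_clock_seconds: float) -> float:
    """Sustained TFLOPS to finish a step's work within the available wall clock."""
    tcyc = work_seconds * CLOCK_HZ / 1e12    # teracycles of work
    tflop = tcyc * FLOPS_PER_CYCLE           # total TFLOPs of work
    return tflop / wall_clock_seconds        # required rate

WINDOW = 18 * 7 * 86400                      # 18-week processing window, in seconds

# Illustrative per-step totals of single-core seconds of work.
coadd = tflops_needed(4.0e9, WINDOW)
diffim = tflops_needed(2.0e9, WINDOW)
multifit = tflops_needed(9.0e9, WINDOW)

parallel_total = coadd + diffim + multifit   # run concurrently: rates add
serial_total = max(coadd, diffim, multifit)  # run back-to-back: take the MAX()
print(f"parallel: {parallel_total:.2f} TFLOPS, serial: {serial_total:.2f} TFLOPS")
```
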
  • otherInput_2
    • Algorithmic margin added here
    • Adjustments from prototype to production PhotoCal
    • Multifit Object percentage not actually used anymore
    • Provision to reduce compute for a split DRP, including DR1/2, but not including global steps like Astro/PhotoCal and SDQA
  • Calibrated Science Image
    • Break down into steps, determine Tcyc per step, add
  • NightMOPS
  • Alert
    • Note high-cost fake object insertion (retained as margin)
    • Alert Generation is assumed to use all available CPUs for a certain time
  • DR Worksheet
    • Per-Source multifit timing here is dominant; uses optimization factor of 6
  • Data Release
    • PVI generation assumed to be in series with rest, but Coadd, Diffim, ObjChar/MultiFit, SDQA all run in parallel
    • MultiFit not run on galactic-plane stars
  • MOPS
    • Noise factor is cubed because three DiaSources are used per tracklet (see the sketch below)
    • Bottom section is MOPS for DRP
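
A sketch of the cubed noise factor noted above: a tracklet is built from three DiaSources, so if false positives inflate each DiaSource count by a common factor, the candidate combinations grow roughly as that factor cubed. The factor value is a placeholder.

```python
noise_factor = 1.4            # placeholder inflation of each DiaSource count
DIA_SOURCES_PER_TRACKLET = 3

# Candidate tracklets scale with the product of the inflated per-DiaSource
# counts, i.e. the noise factor raised to the third power.
tracklet_inflation = noise_factor ** DIA_SOURCES_PER_TRACKLET
print(f"tracklet linking work inflated by ~{tracklet_inflation:.2f}x")
```
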
  • Calibration Products
  • On-Demand
    • Regeneration of PVIs (not coadds or diffims)
    • Assumes that calibrations don't need to be redone since they can be looked up
  • L3 Community
    • 10% of AP+DRP TFLOPS
  • Cutout Service
    • Primarily expected to be cutouts of coadds
  • EPO Service
  • Memory
    • Used to be quadratic in multifit variables, now linear
    • Dominated now by coaddition

Cost Model:

  • InputCompute, Storage, LongHaulNetwork
    • Inputs are all in Survey Years, by DR
  • InputTechPredictions
    • 20xx years are intended to be fiscal year of acquisition but can be adjusted arbitrarily
  • InputTechPredictionsDiskDrives
    • Assumes the price per drive decreases with time until the drive is no longer readily available, but the discounting doesn't make up for the growth in density, so the lowest $/TB is always the newest reference drive
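
A sketch of that pricing behaviour, with made-up drives, prices, and discount rate: an existing reference drive's price is discounted each year, but because newer reference drives grow in capacity faster than the old ones are discounted, the lowest $/TB in any year comes from the newest drive.

```python
ANNUAL_PRICE_DISCOUNT = 0.10   # placeholder yearly price drop for an existing drive

# Hypothetical reference drives: (introduction year, capacity in TB, introduction price in $).
reference_drives = [
    (2020, 12, 400.0),
    (2022, 20, 420.0),
    (2024, 32, 450.0),
]

def price_in_year(intro_year: int, intro_price: float, year: int) -> float:
    """Discounted drive price, assuming the drive is still readily available."""
    return intro_price * (1 - ANNUAL_PRICE_DISCOUNT) ** (year - intro_year)

for year in (2024, 2025, 2026):
    dollars_per_tb = [
        (price_in_year(y0, p0, year) / cap, y0)
        for y0, cap, p0 in reference_drives
        if y0 <= year
    ]
    best, intro = min(dollars_per_tb)
    print(f"{year}: lowest $/TB = {best:.1f} (drive introduced {intro})")
```
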
  • InputOther
    • ComCam and LSSTCam commissioning are projected back from Survey Year 1 = DR2
    • Note that DR2 has enough capacity to process one year of data in six months
    • Base is one year earlier than Archive
    • Each site split into "CTR" and "DAC"
    • L1 archivers/forwarders lumped under "DMCS" in CTR additional support
    • Headcount for workstations not updated for Operations Plan
    • Note "Percentage of FY2022 costs that are Construction": it says it applies to limited quantities but is actually applied in Summary to everything
  • ComputeRequirements, Compute
    • Requirements are summed over CTR and DAC, which makes it hard to determine how much a given science function costs
    • Add enough each year for additional compute capacity plus replacement; heterogeneous "fleet"
    • Take into account all requirements (TFLOPS, aggregate disk bandwidth)
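
A sketch of the yearly purchase decision, under assumed per-node specs: buy enough nodes to cover the growth in requirements plus the capacity being retired, sizing against whichever requirement (TFLOPS or aggregate disk bandwidth) dominates. Buying different node models each year is what produces the heterogeneous fleet.

```python
import math

# Placeholder specs for the node model purchased this year.
NODE_TFLOPS = 2.0
NODE_DISK_GB_S = 3.0

def nodes_to_buy(added_tflops: float, added_disk_gb_s: float,
                 retired_tflops: float, retired_disk_gb_s: float) -> int:
    """Nodes needed to cover requirement growth plus replacement of retired capacity,
    sized against whichever requirement dominates."""
    need_tflops = added_tflops + retired_tflops
    need_gb_s = added_disk_gb_s + retired_disk_gb_s
    return max(math.ceil(need_tflops / NODE_TFLOPS),
               math.ceil(need_gb_s / NODE_DISK_GB_S))

print(nodes_to_buy(added_tflops=120.0, added_disk_gb_s=250.0,
                   retired_tflops=40.0, retired_disk_gb_s=60.0))   # 104 nodes
```
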
  • Memory
    • Model computes memory needed based on memory/core and cores on the floor, then computes added memory and DIMMs
    • Doesn't take into account allocation of DIMMs to cores, nor DIMMs to nodes
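
A sketch of that memory model with placeholder values: required memory is the per-core target times the cores on the floor, and the shortfall is converted to whole DIMMs without mapping them to particular cores or nodes.

```python
import math

MEM_PER_CORE_GB = 4.0   # placeholder memory/core target
DIMM_SIZE_GB = 32.0     # placeholder DIMM capacity

def added_dimms(cores_on_floor: int, installed_memory_gb: float) -> int:
    """Whole DIMMs to add so total memory reaches the per-core target.
    (Allocation of DIMMs to cores or nodes is not modeled.)"""
    needed_gb = cores_on_floor * MEM_PER_CORE_GB
    shortfall_gb = max(0.0, needed_gb - installed_memory_gb)
    return math.ceil(shortfall_gb / DIMM_SIZE_GB)

print(added_dimms(cores_on_floor=20_000, installed_memory_gb=60_000.0))   # 625 DIMMs
```
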
  • StorageRequirements, Storage
    • Controller-based storage for most data including images, local storage for Qserv-hosted databases -- but also uses local storage for Alert Prod DB, DRP DB, and L1DB in CTR
    • Capacity and bandwidth requirements; bandwidth is dominant in some years
  • TapeRequirements, Tape
    • Note tape price depreciation in Tape column O
  • NetworkRequirements, Network, NetworkDetails
    • NetworkRequirements basically unused; NetworkDetails used to determine $ for Network
  • Floorspace
    • Rack counts for DB storage are 0 because they are assumed to be part of DB nodes
    • Compute nodes in the CTR are assumed to be compute nodes; in the DAC they are assumed to be DB nodes
    • Racks go up as space grows, then down as density increases
  • PowerAndCooling
  • Workstations
  • Shipping
    • Note tape library is one year off (uses Archive instead of Base)
  • LongHaulNetwork
  • SiteRollup
  • Summary
    • Contains Development and Integration cluster provisions (locked in as of v140 for some reason) as well as Commissioning Cluster, HQ operations center, DAQ test stand, visualization -- all as arbitrary budgets
    • Construction is total cost up to and including Survey Year 1
    • Operations is averaged over Survey Years
    • Cost is partitioned between Construction and Operations in FY2022
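
A sketch of the construction/operations roll-up with placeholder yearly costs: construction is the total through Survey Year 1 and operations is the average over the survey years; in the real workbook the Survey Year 1 (FY2022) overlap is resolved by the construction-percentage split noted above, which the sketch ignores.

```python
# Placeholder cost per year (arbitrary units), keyed by survey year;
# zero and negative keys are pre-survey (construction-era) years.
cost_by_survey_year = {-3: 2.0, -2: 3.5, -1: 5.0, 0: 6.0,
                       1: 7.0, 2: 6.5, 3: 6.0, 4: 6.0, 5: 6.2}

# Construction: everything up to and including Survey Year 1.
construction = sum(c for yr, c in cost_by_survey_year.items() if yr <= 1)

# Operations: average over the survey years.
survey_costs = [c for yr, c in cost_by_survey_year.items() if yr >= 1]
operations_avg = sum(survey_costs) / len(survey_costs)

print(f"Construction total: {construction:.1f}, Operations average/yr: {operations_avg:.1f}")
```
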
  • WBS
    • Note Key Technical Metrics at bottom
    • Calculating "what if":
      • Zero out things that are irrelevant
      • Look at difference in total cost on WBS tab (used to have a "this workbook minus previous baseline" column)