    Randomize metrics sample intervals · 057eb824
    Yorick Peterse authored
    Sampling data at a fixed interval means we can potentially miss data
    from events occurring between sampling intervals. For example, say we
    sample data every 15 seconds but Unicorn workers get killed after 10
    seconds. In that case it's possible to miss interesting data because
    the sampler never gets a chance to actually submit its data.
    
    To work around this (at least for the most part) the sampling interval
    is randomized as follows:
    
    1. Take the user-specified sampling interval (15 seconds by default)
    2. Divide it by 2 (referred to as "half" below)
    3. Generate a range (using a step of 0.1) from -"half" to "half"
    4. Every time the sampler goes to sleep we'll grab the user-provided
       interval and add a randomly chosen "adjustment" to it, while making
       sure we don't pick the same value twice in a row.
    
    For a specified interval of 15 seconds this means the actual intervals
    can be anywhere between 7.5 and 22.5 seconds, and the same interval is
    never used twice in a row.
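
    As a rough illustration, a minimal Ruby sketch of the scheme described
    above could look like the following (the RandomizedInterval class and
    the sample_metrics call are hypothetical names, not the actual
    implementation):

        class RandomizedInterval
          def initialize(interval = 15)
            @interval = interval
            half = interval / 2.0

            # All possible adjustments: -half to +half in steps of 0.1.
            @adjustments = (-half).step(half, 0.1).to_a
            @last_adjustment = nil
          end

          # Returns the next sleep interval: the base interval plus a
          # randomly chosen adjustment, never reusing the previous one.
          def next_interval
            adjustment = @adjustments.sample
            adjustment = @adjustments.sample while adjustment == @last_adjustment

            @last_adjustment = adjustment

            @interval + adjustment
          end
        end

        # Hypothetical usage in a sampler loop:
        #
        #   interval = RandomizedInterval.new(15)
        #
        #   loop do
        #     sample_metrics # collect and submit the metrics
        #     sleep(interval.next_interval)
        #   end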
    
    The rationale behind this change is that on dev.gitlab.org I'm sometimes
    seeing certain Gitlab::Git/Rugged objects being retained, but only for a
    few minutes every 24 hours. Knowing GitLab's code and how much memory it
    uses/leaks, I suspect we're missing data because workers get terminated
    before the sampler can write its data to InfluxDB.