Scripting & automation · Datadog · Software Engineer Intern · 2023

A Slack bot that pinged the on-call when a dashboard went stale

Shared by J. Alvarez · ex-Datadog SWE Intern

The sharer presented this exact project in their Datadog interview and went on to receive a return offer.

4 real follow-ups from the actual loop · 1 hard · ~12 min

How they told it

My team kept finding out too late that a monitoring dashboard had stopped updating. I wrote a small bot that noticed first and nudged the right person.

During my internship my team owned a set of internal dashboards, and a few times a data pipeline behind one would silently stall. Nobody noticed until a manager asked why a number looked frozen. My mentor half-jokingly said someone should just check them every morning. I asked if I could automate it instead. I wrote a Python script that hit the API for each dashboard and compared the last-updated timestamp against a staleness threshold. If a dashboard was past that cutoff, the script posted to our team Slack channel and @-mentioned whoever was on-call that week, pulling the rotation from our PagerDuty schedule. The tricky part wasn't the happy path, it was not being annoying: early on it double-pinged because a slow refresh looked stale for a minute, so I added a grace window and a 'don't repeat within 4 hours' rule. I ran it as a cron job on a shared box. It caught two real stalls in my last month, which we fixed same-day instead of days later. I kept it dead simple on purpose so my mentor could actually maintain it after I left.
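
The alerting decision in this telling (staleness threshold, grace window, and a repeat limit) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the sharer's actual script: the function name `should_alert`, the one-hour threshold, and the five-minute grace window are hypothetical. Only the 4-hour repeat rule comes from the telling. Fetching timestamps from the dashboard API, looking up the on-call in PagerDuty, and posting to Slack are deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed values for illustration; the telling does not state them.
STALE_AFTER = timedelta(hours=1)      # hypothetical staleness threshold
GRACE_WINDOW = timedelta(minutes=5)   # hypothetical allowance for slow refreshes
# From the telling: "don't repeat within 4 hours".
REPEAT_AFTER = timedelta(hours=4)


def should_alert(
    last_updated: datetime,
    now: datetime,
    last_alerted: Optional[datetime] = None,
) -> bool:
    """Return True if a dashboard is stale enough to ping the on-call."""
    age = now - last_updated
    # A refresh that is only slightly late falls inside the grace window.
    if age <= STALE_AFTER + GRACE_WINDOW:
        return False
    # Suppress repeats if the same dashboard was already flagged recently.
    if last_alerted is not None and now - last_alerted < REPEAT_AFTER:
        return False
    return True


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_alert(now - timedelta(hours=2), now))
```

Keeping the decision a pure function of timestamps means it can be unit-tested without touching Slack or PagerDuty, which fits the stated goal of keeping the bot simple enough for someone else to maintain.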

What they actually got asked

What happened when the bot itself failed or the box went down?

hard

How did you know a dashboard was actually stale versus just slow to refresh?

medium

Why a bot at all — why not just check them manually?

easy

Who else used it, and was it still running after you left?

medium