- SQL Chips and Chunks Newsletter
- Posts
- Week 35 - SQL Chips and Chunks Newsletter
Week 35 - SQL Chips and Chunks Newsletter
SQL Activity Analytics: Mastering Log-in Anomaly Detection
1. SQL Chip: Understanding Window Functions
A Window Function performs a calculation across a set of table rows that are somehow related to the current row. Think of it as a moving window of data. Unlike aggregate functions (like SUM() or AVG() with GROUP BY), a Window Function does not collapse the rows into a single output row. It returns an aggregated value for each row, allowing you to compare individual rows to a group (the "window") without losing the detail of the original data.
2. SQL Chunk: Calculating Historical Average Logins
We need to establish a baseline for normal activity. We'll use the AVG() Window Function, partitioned by user_id, to find the average daily login attempts for each user over a 30-day period.
SELECT
log_date,
user_id,
login_attempts,
-- Calculate the user's average login attempts over the last 30 days
AVG(login_attempts) OVER (
PARTITION BY user_id
ORDER BY log_date
ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS historical_avg_30d
FROM
user_daily_logins
ORDER BY
user_id, log_date DESC;
3. SQL Challenge: Identifying the Anomaly
Now we leverage the historical average to detect unusual spikes. The concept here is calculating the difference between today's login attempts and the historical average, then flagging any row where this difference exceeds a certain threshold (e.g., 50% more than the average).
WITH DailyActivityComparison AS (
-- CTE 1: Calculate the historical average
SELECT
log_date,
user_id,
login_attempts,
AVG(login_attempts) OVER (
PARTITION BY user_id
ORDER BY log_date
ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS historical_avg_30d
FROM
user_daily_logins
),
AnomalyStatusCalculation AS (
-- CTE 2: Calculate the difference and anomaly status
SELECT
*,
(login_attempts - historical_avg_30d) AS attempt_difference,
CASE
WHEN login_attempts > (historical_avg_30d * 1.5) THEN 'ANOMALY_FLAG'
ELSE 'NORMAL'
END AS anomaly_status
FROM
DailyActivityComparison
)
SELECT
* -- Select all columns from the second CTE
FROM
AnomalyStatusCalculation
WHERE
anomaly_status = 'ANOMALY_FLAG' -- Valid since anomaly_status is now a pre-calculated column
ORDER BY
log_date DESC, attempt_difference DESC;
4. SQL Mistake: Misusing GROUP BY for Comparison
A common mistake when trying to compare an individual daily value to an average is attempting to use AVG() with a GROUP BY clause on the user_id and expecting the individual dates to still be present.
Incorrect Approach: If you used GROUP BY user_id, you would get one average value for the entire table per user, and you would lose the ability to see the specific log_date and compare that day's activity against the average, as the individual dates are collapsed.
The Fix: Use the Window Function syntax (AVG() OVER (PARTITION BY...)) to calculate the average without collapsing the rows. This is why Window Functions are essential for this type of row-by-row comparison.
5. SQL in Practice: Telecommunications
In the telecommunications industry, detecting anomalies in data access is critical for security and compliance. A sudden spike in failed login attempts to a sensitive customer database, especially from an account that normally logs in only once a day, suggests a brute-force attack or a compromised employee credential.
The anomaly detection query helps security teams:
Stop data breaches by locking down compromised accounts faster.
Maintain regulatory compliance by auditing unusual access patterns.
SELECT
u.login_time,
u.account_id,
u.login_status,
-- Rank attempts by time for each account on that day (identifying rapid-fire attempts)
ROW_NUMBER() OVER (
PARTITION BY DATE(u.login_time), u.account_id
ORDER BY u.login_time
) AS attempt_rank
FROM
telecom_access_logs AS u
WHERE
u.login_status = 'FAILED'
-- Look for anomalous attempts that happened outside of standard 8am-6pm (08:00:00 - 18:00:00)
AND TIME(u.login_time) NOT BETWEEN '08:00:00' AND '18:00:00'
ORDER BY
u.login_time DESC
LIMIT 50;6. SQL Resource: Window Functions Tutorial
A natural next step after mastering GROUP BY and conditional counting is exploring Window Functions. These functions allow you to perform aggregate-like calculations without collapsing the rows, which is incredibly useful for calculating running totals, moving averages, or seeing how the current record's status compares to the overall set.
Name: Window Functions Tutorial
Hyperlink: https://www.thoughtspot.com/sql-tutorial/sql-window-functions
Description: This guide breaks down the
OVER()clause and functions likeROW_NUMBER(),RANK(), andSUM() OVER (PARTITION BY...). This knowledge is essential for any advanced data analyst role and will allow you to track more complex metrics like daily progress trends or rank departments by their migration speed without using a subquery.