Week 23 - SQL Chips and Chunks Newsletter

SQL Insights: The Power of SQL String Functions

1. SQL Chip: Cleaning and Standardizing Text Data

SQL isn't just for numbers and dates; it's also incredibly powerful for working with text. SQL String Functions are special tools that let you manipulate, clean, and analyze text data. This is essential because text data from different sources is often messy, with inconsistent formatting, extra spaces, or mixed capitalization. Functions like TRIM(), UPPER(), and LENGTH() help you prepare this data, making it consistent and ready for analysis or reporting.

2. SQL Chunk: Analyzing Cleaned Product Review Data

Let's imagine you have a table of product reviews with a review_text column that might be a bit messy. Here's how you can use three key string functions to standardize and measure your data.

SELECT
    product_id,
    TRIM(review_text) AS cleaned_review, -- Removes leading/trailing spaces
    UPPER(TRIM(review_text)) AS normalized_review, -- Converts the cleaned text to all uppercase
    LENGTH(TRIM(review_text)) AS review_length_characters -- Measures the length of the cleaned text
FROM
    product_reviews
WHERE
    LENGTH(TRIM(review_text)) > 50 -- Filters for reviews longer than 50 characters
ORDER BY
    LENGTH(TRIM(review_text)) DESC
LIMIT 20;

This query first cleans up any extra spaces with TRIM(), then uses UPPER() to standardize the case, and finally uses LENGTH() to measure the final clean string. It then filters for longer reviews and orders the results to show you the longest ones first.

3. SQL Challenge: Categorizing Reviews by Length and Content

Now, let's combine these concepts with a CASE statement. What if you wanted to categorize your reviews into "short," "medium," and "long" and also check for the presence of a specific keyword, all while keeping the data clean?

SELECT
    product_id,
    COUNT(review_id) AS total_reviews,
    CASE
        WHEN LENGTH(TRIM(review_text)) < 50 THEN 'Short Review'
        WHEN LENGTH(TRIM(review_text)) BETWEEN 50 AND 150 THEN 'Medium Review'
        ELSE 'Long Review'
    END AS review_length_category
FROM
    product_reviews
WHERE
    UPPER(review_text) LIKE '%GREAT%' -- Checks for a keyword regardless of case
GROUP BY
    1, 3
ORDER BY
    total_reviews DESC;

This query first uses UPPER() to make the search case-insensitive, looking for the word 'GREAT'. Then, using a CASE statement and LENGTH(), it assigns each matching review to a category based on its length. Finally, it groups the results to give you a count of how many 'GREAT' reviews fall into each category.

4. SQL Mistake: Overlooking Case-Sensitivity in String Comparisons

A frequent mistake is not accounting for case-sensitivity when searching for a specific string. By default, many databases perform case-sensitive comparisons. This means 'apple' does not equal 'Apple' or 'APPLE'. This can lead to your WHERE clause not finding any records, even if the word is present in the data.

Incorrect Example (Case-sensitive WHERE clause):

SELECT review_text
FROM product_reviews
WHERE review_text LIKE '%good%'; -- This will miss 'Good' or 'GOOD'

Correct Example (Case-insensitive using UPPER() or LOWER()):

SELECT review_text
FROM product_reviews
WHERE UPPER(review_text) LIKE '%GOOD%'; -- Correctly finds 'good', 'Good', and 'GOOD'

To avoid this, a simple and effective practice is to convert both the column data and your search string to the same case (either all uppercase or all lowercase) before comparing them.

5. SQL in Practice: Analyzing Social Media Mentions

In social media analysis, a data professional often needs to clean raw user comments before analyzing them for brand sentiment or content. Imagine you work for a brand that sponsors a major sporting event. You want to analyze Twitter mentions to see if people are talking about your brand and what they're saying. You'd clean up the raw tweet text before performing any analysis.

-- Assume a table named social_media_mentions with a 'tweet_text' column
SELECT
    source_platform,
    COUNT(mention_id) AS total_mentions,
    AVG(LENGTH(TRIM(tweet_text))) AS avg_mention_length,
    MIN(LENGTH(TRIM(tweet_text))) AS min_mention_length,
    MAX(LENGTH(TRIM(tweet_text))) AS max_mention_length
FROM
    social_media_mentions
WHERE
    UPPER(tweet_text) LIKE '%YOURBRAND%' -- Case-insensitive search
    AND tweet_date >= '2025-06-01'
GROUP BY
    source_platform
HAVING
    COUNT(mention_id) > 10
ORDER BY
    total_mentions DESC;

This query analyzes your brand's mentions across different social media platforms. It cleans and standardizes the text, then uses aggregate functions with LENGTH() to report the total count of mentions and the average, minimum, and maximum length of those mentions per platform. This helps a social media strategist understand not only who is talking about the brand but also the nature and typical length of those conversations.

6. SQL Resource: SQL String Functions Cheatsheet

This resource provides a great overview of common SQL string functions. While geared towards SQL Server, the concepts and function names are largely consistent across different database systems. It includes clear examples and a concise format that helps you quickly reference and understand a wide array of text-manipulation functions beyond just TRIM(), UPPER(), and LENGTH(). It's a fantastic, free tool for expanding your knowledge of working with text data.