Welcome to Subscribe On Youtube

3832. Find Users with Persistent Behavior Patterns

Description

Table: activity

+--++
\| user_id      \| int     \|
\| action_date  \| date    \|
\| action       \| varchar \|
+--++-+--+
\| user_id \| action_date \| action \|
++-+--+

Output:

++++
\| user_id \| action \| streak_length \| start_date \| end_date   \|
++++
\| 3       \| view   \| 7             \| 2024-01-01 \| 2024-01-07 \|
\| 1       \| login  \| 5             \| 2024-01-01 \| 2024-01-05 \|
++++

Explanation:

  • User 1:
    • Performed login from 2024-01-01 to 2024-01-05 on consecutive days
    • Each day has exactly one action, and the action is the same
    • Streak length = 5 (meets minimum requirement)
    • The action changes on 2024-01-06, ending the streak
  • User 2:
    • Performed click for only 4 consecutive days
    • Does not meet the minimum streak length of 5
    • Excluded from the result
  • User 3:
    • Performed view for 7 consecutive days
    • This is the longest valid sequence for this user
    • Included in the result

The Results table is ordered by streak_length in descending order, then by user_id in ascending order

</div>

Solutions

Solution 1: Filtering + Grouping + Aggregation

We first need to filter user dates with only a single action per day, then identify consecutive intervals among these dates, and finally aggregate these intervals to calculate the streak length and filter records that meet the criteria.

  • import pandas as pd
    
    
    def find_behaviorally_stable_users(activity: pd.DataFrame) -> pd.DataFrame:
        activity['action_date'] = pd.to_datetime(activity['action_date'])
    
        # Filter users with only a single action per day
        df = activity.assign(
            cnt=activity.groupby(['user_id', 'action_date'])['action'].transform('count')
        )
        df = df[df['cnt'] == 1].sort_values(['user_id', 'action', 'action_date'])
    
        # Identify consecutive intervals
        df['rn'] = df.groupby(['user_id', 'action'])['action_date'].rank(method='first')
        df['grp'] = df['action_date'] - pd.to_timedelta(df['rn'], unit='D')
    
        # Aggregate streaks
        streaks = (
            df.groupby(['user_id', 'action', 'grp'])
            .agg(
                streak_length=('action_date', 'count'),
                start_date=('action_date', 'min'),
                end_date=('action_date', 'max'),
            )
            .reset_index()
        )
    
        # Filter and get the longest streak for each user
        res = streaks[streaks['streak_length'] >= 5].sort_values(
            ['streak_length', 'user_id'], ascending=[False, True]
        )
    
        return res.groupby('user_id').head(1)[
            ['user_id', 'action', 'streak_length', 'start_date', 'end_date']
        ]
    
    
  • # Write your MySQL query statement below
    WITH
        daily_counts AS (
            -- Step 1: Filter user dates with exactly one record per day (meeting the requirement of "exactly one action per day")
            SELECT
                user_id,
                action_date,
                action,
                COUNT(*) OVER (PARTITION BY user_id, action_date) AS cnt
            FROM activity
        ),
        filtered_activity AS (
            -- Step 2: Filter out data with multiple actions on the same day
            SELECT user_id, action_date, action
            FROM daily_counts
            WHERE cnt = 1
        ),
        streak_groups AS (
            -- Step 3: Group consecutive dates using the method of subtracting row number from date
            SELECT
                user_id,
                action,
                action_date,
                DATE_SUB(
                    action_date,
                    INTERVAL ROW_NUMBER() OVER (
                        PARTITION BY user_id, action
                        ORDER BY action_date
                    ) DAY
                ) AS grp
            FROM filtered_activity
        ),
        streak_summary AS (
            -- Step 4: Calculate the length of each consecutive segment and only keep records with length >= 5
            SELECT
                user_id,
                action,
                COUNT(*) AS streak_length,
                MIN(action_date) AS start_date,
                MAX(action_date) AS end_date,
                -- Sort different streaks for each user to facilitate getting the maximum value later
                ROW_NUMBER() OVER (
                    PARTITION BY user_id
                    ORDER BY COUNT(*) DESC
                ) AS rnk
            FROM streak_groups
            GROUP BY user_id, action, grp
            HAVING streak_length >= 5
        )
    -- Step 5: Extract the longest record for each qualified user and sort
    SELECT user_id, action, streak_length, start_date, end_date
    FROM streak_summary
    WHERE rnk = 1
    ORDER BY streak_length DESC, user_id ASC;
    
    

All Problems

All Solutions