3475. DNA Pattern Recognition

Description

Table: Samples

++
\| Column Name    \| Type    \| 
++
\| sample_id      \| int     \|
\| dna_sequence   \| varchar \|
\| species        \| varchar \|
++
sample_id is the unique key for this table.
Each row contains a DNA sequence represented as a string of characters (A, T, G, C) and the species it was collected from.

Biologists are studying basic patterns in DNA sequences. Write a solution to identify sample_id with the following patterns:

Sequences that start with ATG (a common start codon)
Sequences that end with either TAA, TAG, or TGA (stop codons)
Sequences containing the motif ATAT (a simple repeated pattern)
Sequences that have at least 3 consecutive G (like GGG or GGGG)

Return the result table ordered by sample_id in ascending order.

The result format is in the following example.

Example:

Input:

Samples table:

+--+
\| sample_id \| dna_sequence     \| species   \|
+--+
\| 1         \| ATGCTAGCTAGCTAA  \| Human     \|
\| 2         \| GGGTCAATCATC     \| Human     \|
\| 3         \| ATATATCGTAGCTA   \| Human     \|
\| 4         \| ATGGGGTCATCATAA  \| Mouse     \|
\| 5         \| TCAGTCAGTCAG     \| Mouse     \|
\| 6         \| ATATCGCGCTAG     \| Zebrafish \|
\| 7         \| CGTATGCGTCGTA    \| Zebrafish \|
+--+

Output:

+--++-+-++++
\| 1         \| ATGCTAGCTAGCTAA  \| Human       \| 1           \| 1          \| 0          \| 0          \|
\| 2         \| GGGTCAATCATC     \| Human       \| 0           \| 0          \| 0          \| 1          \|
\| 3         \| ATATATCGTAGCTA   \| Human       \| 0           \| 0          \| 1          \| 0          \|
\| 4         \| ATGGGGTCATCATAA  \| Mouse       \| 1           \| 1          \| 0          \| 1          \|
\| 5         \| TCAGTCAGTCAG     \| Mouse       \| 0           \| 0          \| 0          \| 0          \|
\| 6         \| ATATCGCGCTAG     \| Zebrafish   \| 0           \| 1          \| 1          \| 0          \|
\| 7         \| CGTATGCGTCGTA    \| Zebrafish   \| 0           \| 0          \| 0          \| 0          \|
+---+

Explanation:

Sample 1 (ATGCTAGCTAGCTAA):
- Starts with ATG (has_start = 1)
- Ends with TAA (has_stop = 1)
- Does not contain ATAT (has_atat = 0)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Sample 2 (GGGTCAATCATC):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, TAG, or TGA (has_stop = 0)
- Does not contain ATAT (has_atat = 0)
- Contains GGG (has_ggg = 1)
Sample 3 (ATATATCGTAGCTA):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, TAG, or TGA (has_stop = 0)
- Contains ATAT (has_atat = 1)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Sample 4 (ATGGGGTCATCATAA):
- Starts with ATG (has_start = 1)
- Ends with TAA (has_stop = 1)
- Does not contain ATAT (has_atat = 0)
- Contains GGGG (has_ggg = 1)
Sample 5 (TCAGTCAGTCAG):
- Does not match any patterns (all fields = 0)
Sample 6 (ATATCGCGCTAG):
- Does not start with ATG (has_start = 0)
- Ends with TAG (has_stop = 1)
- Starts with ATAT (has_atat = 1)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Sample 7 (CGTATGCGTCGTA):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, "TAG", or "TGA" (has_stop = 0)
- Does not contain ATAT (has_atat = 0)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)

Note:

The result is ordered by sample_id in ascending order
For each pattern, 1 indicates the pattern is present and 0 indicates it is not present

Solutions

Solution 1: Fuzzy Matching + Regular Expressions

We can use LIKE and REGEXP for pattern matching, where:

LIKE 'ATG%' checks if it starts with ATG
REGEXP 'TAA$\|TAG$\|TGA$' checks if it ends with TAA, TAG, or TGA ($ indicates the end of the string)
LIKE '%ATAT%' checks if it contains ATAT
REGEXP 'GGG+' checks if it contains at least 3 consecutive Gs

Python
SQL

import pandas as pd


def analyze_dna_patterns(samples: pd.DataFrame) -> pd.DataFrame:
    samples["has_start"] = samples["dna_sequence"].str.startswith("ATG").astype(int)
    samples["has_stop"] = (
        samples["dna_sequence"].str.endswith(("TAA", "TAG", "TGA")).astype(int)
    )
    samples["has_atat"] = samples["dna_sequence"].str.contains("ATAT").astype(int)
    samples["has_ggg"] = samples["dna_sequence"].str.contains("GGG+").astype(int)
    return samples.sort_values(by="sample_id").reset_index(drop=True)

# Write your MySQL query statement below
SELECT
    sample_id,
    dna_sequence,
    species,
    dna_sequence LIKE 'ATG%' AS has_start,
    dna_sequence REGEXP 'TAA$|TAG$|TGA$' AS has_stop,
    dna_sequence LIKE '%ATAT%' AS has_atat,
    dna_sequence REGEXP 'GGG+' AS has_ggg
FROM Samples
ORDER BY 1;

3475 - DNA Pattern Recognition

3475. DNA Pattern Recognition

Description

Solutions

Solution 1: Fuzzy Matching + Regular Expressions

All Problems

All Solutions

Welcome to Subscribe On Youtube

3475. DNA Pattern Recognition

Description

Solutions

Solution 1: Fuzzy Matching + Regular Expressions

All Problems

All Solutions