Welcome to Subscribe On Youtube
3475. DNA Pattern Recognition
Description
Table: Samples
++ \| Column Name \| Type \| ++ \| sample_id \| int \| \| dna_sequence \| varchar \| \| species \| varchar \| ++ sample_id is the unique key for this table. Each row contains a DNA sequence represented as a string of characters (A, T, G, C) and the species it was collected from.
Biologists are studying basic patterns in DNA sequences. Write a solution to identify sample_id
with the following patterns:
- Sequences that start with ATG (a common start codon)
- Sequences that end with either TAA, TAG, or TGA (stop codons)
- Sequences containing the motif ATAT (a simple repeated pattern)
- Sequences that have at least
3
consecutive G (like GGG or GGGG)
Return the result table ordered by sample_id in ascending order.
The result format is in the following example.
Example:
Input:
Samples table:
+--+ \| sample_id \| dna_sequence \| species \| +--+ \| 1 \| ATGCTAGCTAGCTAA \| Human \| \| 2 \| GGGTCAATCATC \| Human \| \| 3 \| ATATATCGTAGCTA \| Human \| \| 4 \| ATGGGGTCATCATAA \| Mouse \| \| 5 \| TCAGTCAGTCAG \| Mouse \| \| 6 \| ATATCGCGCTAG \| Zebrafish \| \| 7 \| CGTATGCGTCGTA \| Zebrafish \| +--+
Output:
+--++-+-++++ \| 1 \| ATGCTAGCTAGCTAA \| Human \| 1 \| 1 \| 0 \| 0 \| \| 2 \| GGGTCAATCATC \| Human \| 0 \| 0 \| 0 \| 1 \| \| 3 \| ATATATCGTAGCTA \| Human \| 0 \| 0 \| 1 \| 0 \| \| 4 \| ATGGGGTCATCATAA \| Mouse \| 1 \| 1 \| 0 \| 1 \| \| 5 \| TCAGTCAGTCAG \| Mouse \| 0 \| 0 \| 0 \| 0 \| \| 6 \| ATATCGCGCTAG \| Zebrafish \| 0 \| 1 \| 1 \| 0 \| \| 7 \| CGTATGCGTCGTA \| Zebrafish \| 0 \| 0 \| 0 \| 0 \| +---+
Explanation:
- Sample 1 (ATGCTAGCTAGCTAA):
- Starts with ATG (has_start = 1)
- Ends with TAA (has_stop = 1)
- Does not contain ATAT (has_atat = 0)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
- Sample 2 (GGGTCAATCATC):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, TAG, or TGA (has_stop = 0)
- Does not contain ATAT (has_atat = 0)
- Contains GGG (has_ggg = 1)
- Sample 3 (ATATATCGTAGCTA):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, TAG, or TGA (has_stop = 0)
- Contains ATAT (has_atat = 1)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
- Sample 4 (ATGGGGTCATCATAA):
- Starts with ATG (has_start = 1)
- Ends with TAA (has_stop = 1)
- Does not contain ATAT (has_atat = 0)
- Contains GGGG (has_ggg = 1)
- Sample 5 (TCAGTCAGTCAG):
- Does not match any patterns (all fields = 0)
- Sample 6 (ATATCGCGCTAG):
- Does not start with ATG (has_start = 0)
- Ends with TAG (has_stop = 1)
- Starts with ATAT (has_atat = 1)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
- Sample 7 (CGTATGCGTCGTA):
- Does not start with ATG (has_start = 0)
- Does not end with TAA, "TAG", or "TGA" (has_stop = 0)
- Does not contain ATAT (has_atat = 0)
- Does not contain at least 3 consecutive 'G's (has_ggg = 0)
Note:
- The result is ordered by sample_id in ascending order
- For each pattern, 1 indicates the pattern is present and 0 indicates it is not present
Solutions
Solution 1: Fuzzy Matching + Regular Expressions
We can use LIKE
and REGEXP
for pattern matching, where:
- LIKE
'ATG%'
checks if it starts with ATG - REGEXP
'TAA$\|TAG$\|TGA$'
checks if it ends with TAA, TAG, or TGA ($ indicates the end of the string) - LIKE
'%ATAT%'
checks if it contains ATAT - REGEXP
'GGG+'
checks if it contains at least 3 consecutive Gs
-
import pandas as pd def analyze_dna_patterns(samples: pd.DataFrame) -> pd.DataFrame: samples["has_start"] = samples["dna_sequence"].str.startswith("ATG").astype(int) samples["has_stop"] = ( samples["dna_sequence"].str.endswith(("TAA", "TAG", "TGA")).astype(int) ) samples["has_atat"] = samples["dna_sequence"].str.contains("ATAT").astype(int) samples["has_ggg"] = samples["dna_sequence"].str.contains("GGG+").astype(int) return samples.sort_values(by="sample_id").reset_index(drop=True)
-
# Write your MySQL query statement below SELECT sample_id, dna_sequence, species, dna_sequence LIKE 'ATG%' AS has_start, dna_sequence REGEXP 'TAA$|TAG$|TGA$' AS has_stop, dna_sequence LIKE '%ATAT%' AS has_atat, dna_sequence REGEXP 'GGG+' AS has_ggg FROM Samples ORDER BY 1;