May 12, 2025

RNA-seq Data Generation and 1D-CNN Classification Workflow for Visual Impairment Research

  • Dr. Ashit Kumar Dutta¹,
  • Nasser Ali AlJarallah¹,
  • Abdul Rahaman Wahab Sait²
  • ¹Almaarefa University
  • ²King Faisal University
Protocol Citation: Dr. Ashit Kumar Dutta, Nasser Ali AlJarallah, Abdul Rahaman Wahab Sait (2025). RNA-seq Data Generation and 1D-CNN Classification Workflow for Visual Impairment Research. protocols.io https://dx.doi.org/10.17504/protocols.io.6qpvrqz2zlmk/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: In development
We are still developing and optimizing this protocol
Created: May 12, 2025
Last Modified: May 12, 2025
Protocol Integer ID: 218082
Funders Acknowledgements:
King Salman Center for Disability Research
Grant ID: ERC-2024-01
Abstract
A robust, reproducible pipeline to (i) generate realistic synthetic RNA-seq data for visual-impairment research using a Negative Binomial model, (ii) build and train a 1D-CNN classifier, (iii) evaluate performance via ROC-AUC and SHAP interpretability, and (iv) publish both code and datasets to public repositories (Protocols.io & NCBI GEO).
Materials
- Hardware: Any workstation with ≥16 GB RAM and a GPU (e.g., NVIDIA RTX series)
- OS: Linux (Ubuntu 20.04+) or Windows 10/11
- Programming: Python 3.8+
- Libraries: numpy, pandas, scikit-learn, tensorflow (or PyTorch), shap, biopython
- NCBI Accounts: GEO submission account (https://www.ncbi.nlm.nih.gov/account/)
- Protocols.io: Free account (email/ORCID)
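A single command that installs every Python library used in the steps below (a suggested setup, not prescribed by the protocol; matplotlib is only needed for the optional plots):
pip install numpy pandas scipy scikit-learn tensorflow shap biopython matplotlib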
Prepare Synthetic RNA-seq Data
Open a Python environment and install the dependencies needed for this simulation step:
pip install numpy pandas scipy
Simulate counts for N samples × G genes (here, 100 samples × 500 genes) using a Negative Binomial distribution:
import numpy as np
np.random.seed(42)  # fix the seed so the synthetic dataset is reproducible
data = np.random.negative_binomial(n=10, p=0.3, size=(100, 500))  # counts matrix: 100 samples x 500 genes
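As an optional sanity check (not part of the original protocol), confirm that the simulated counts show the overdispersion typical of RNA-seq, i.e., per-gene variance exceeding the mean:
# Optional check: Negative Binomial counts should be overdispersed (variance > mean per gene)
gene_means = data.mean(axis=0)
gene_vars = data.var(axis=0)
print("Fraction of overdispersed genes:", (gene_vars > gene_means).mean())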
Assign binary labels (0=healthy, 1=visually impaired) and save as CSV:
import pandas as pd
labels = np.random.choice([0,1], size=100)
df = pd.DataFrame(data, columns=[f"Gene_{i+1}" for i in range(500)])
df["Condition"] = labels
df.to_csv("synthetic_RNAseq.csv", index=False)
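Before moving on, it can help to confirm that the file round-trips correctly. A minimal check (not in the original protocol):
# Optional: reload the CSV and confirm its shape and class balance
check = pd.read_csv("synthetic_RNAseq.csv")
print(check.shape)                        # expected (100, 501): 500 genes + Condition column
print(check["Condition"].value_counts())  # roughly balanced 0/1 labels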
Data Preprocessing
Load the CSV and apply log2-CPM normalization (log2 of counts-per-million plus a pseudo-count of 1):
import numpy as np, pandas as pd
df = pd.read_csv("synthetic_RNAseq.csv")
counts = df.iloc[:,:-1]
log_cpm = np.log2((counts.div(counts.sum(axis=1), axis=0) * 1e6) + 1)
Scale features (zero mean, unit variance):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(log_cpm)
y = df["Condition"].values
Train-Validation-Test Split
Split into 70% training, 15% validation, and 15% test sets, stratified by label:
from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)
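Note that fitting the StandardScaler on the full matrix before splitting leaks validation/test statistics into training. If a stricter setup is preferred, an optional variant (not the original protocol) is to split the log-CPM matrix first and fit the scaler on the training portion only:
# Optional stricter variant: split first, then fit the scaler on training data only
X_tr, X_tmp, y_tr, y_tmp = train_test_split(log_cpm.values, y, test_size=0.30, stratify=y, random_state=42)
X_v, X_te, y_v, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
scaler = StandardScaler().fit(X_tr)  # scaling statistics come from the training split only
X_tr, X_v, X_te = scaler.transform(X_tr), scaler.transform(X_v), scaler.transform(X_te)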
Build the 1D-CNN Model
Define architecture (TensorFlow/Keras example):
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1], 1)),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(128, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
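Optionally, inspect the architecture before training to confirm layer output shapes and parameter counts:
model.summary()  # prints each layer's output shape and trainable parameter count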
Model Training
Reshape data for Conv1D and train with early stopping:
X_train_c = X_train[..., np.newaxis]
X_val_c = X_val[..., np.newaxis]
es = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
history = model.fit(X_train_c, y_train, validation_data=(X_val_c, y_val), epochs=50, batch_size=32, callbacks=[es])
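To check for overfitting, the returned history object can be plotted (a minimal sketch, assuming matplotlib is installed):
# Optional: visualize training vs. validation loss to spot overfitting
import matplotlib.pyplot as plt
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.xlabel("Epoch")
plt.ylabel("Binary cross-entropy")
plt.legend()
plt.savefig("training_curves.png", dpi=300)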
Evaluation & Interpretation
Evaluate on test set:
X_test_c = X_test[..., np.newaxis]
model.evaluate(X_test_c, y_test)
Compute ROC-AUC:
from sklearn.metrics import roc_auc_score
y_pred = model.predict(X_test_c).ravel()
roc_auc_score(y_test, y_pred)
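Beyond ROC-AUC, per-class metrics at a 0.5 decision threshold can also be reported (an optional addition using scikit-learn):
# Optional: threshold the sigmoid outputs at 0.5 and report a confusion matrix and per-class metrics
from sklearn.metrics import classification_report, confusion_matrix
y_pred_label = (y_pred >= 0.5).astype(int)
print(confusion_matrix(y_test, y_pred_label))
print(classification_report(y_test, y_pred_label, target_names=["healthy", "visually impaired"]))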
SHAP analysis for feature importance (KernelExplainer expects 2-D inputs, so wrap the model to add the channel axis internally):
import shap
# Wrap prediction so that scaled 2-D samples are reshaped to (samples, genes, 1) for the Conv1D model
predict_fn = lambda x: model.predict(x[..., np.newaxis]).ravel()
explainer = shap.KernelExplainer(predict_fn, X_train[:50])  # 2-D background set of 50 training samples
shap_values = explainer.shap_values(X_test[:20])  # KernelExplainer is model-agnostic but slow; keep the explained set small
shap.summary_plot(shap_values, X_test[:20], feature_names=log_cpm.columns.tolist())
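To turn the SHAP output into a ranked gene list, genes can be ordered by mean absolute SHAP value (a minimal sketch; with synthetic data the names are the Gene_i placeholders):
# Optional: rank genes by mean absolute SHAP value across the explained samples
shap_arr = np.asarray(shap_values).reshape(-1, X_test.shape[1])  # collapse to (samples, genes)
top_genes = pd.Series(np.abs(shap_arr).mean(axis=0), index=log_cpm.columns).sort_values(ascending=False)
print(top_genes.head(10))  # ten most influential (synthetic) genes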