Transcription factors (TFs) are pivotal in regulating gene expression: they bind non-coding regulatory DNA regions through their DNA-binding domains (DBDs), thereby promoting or inhibiting transcription. Extensive research has centered on the interactions between DBDs and DNA, and TFs are classified into families based on their DBDs. In contrast, the role of effector domains (EDs) as main players in transcriptional regulation remains underexplored. Current evidence, mostly from lower eukaryotic organisms, suggests that key processes, including the recruitment of RNA polymerase, coactivators, histone modifiers and chromatin remodelers, all occur along the EDs. However, the presence of intrinsically disordered regions (IDRs) across the EDs makes them challenging to study with conventional structural biology techniques. Additionally, the proteome differs at the organism, tissue and cell levels, which further complicates extrapolation from the transcription factors of other organisms.
We developed a machine learning-based approach, which we named FALK22, to classify human transcription factors (HTFs) based on their effector domains. To develop it, we analyzed sequences from 1,639 HTFs, optimized descriptors that capture sequence-dependent properties, and optimized hyperparameter spaces for classification. We also used the Evolutionary Scale Model (ESM) for classification and compared our feature space with the embedding space generated by ESM for full-length HTFs. Using two independent unsupervised machine learning techniques, we identified two distinct classifications, comprising 20 and 30 clusters, based on regions outside the DBDs, each with unique patterns in amino acid composition and spacing. Coarse-grained simulations of full-sequence TFs from our classification further grouped the sequences into three different classes of protein-protein interactions within the dense phase. This methodology provides a foundation for future research on transcriptional regulation, the effect of condensation on gene expression, and its implications for human diseases.
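As an illustration of the general idea only, the following minimal sketch computes simple composition and spacing descriptors for the regions outside a DBD and groups toy sequences with a basic k-means. It is not the FALK22 pipeline: the abstract does not name the descriptors or the two unsupervised techniques, so k-means is a stand-in, and every function name, the DBD coordinates, and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def effector_region(sequence, dbd_start, dbd_end):
    """Drop the DBD (0-based, half-open interval) and join the flanking regions.

    Assumption: joining the flanks is a simplification; it creates one
    artificial junction that slightly affects the spacing descriptor.
    """
    return sequence[:dbd_start] + sequence[dbd_end:]


def composition(sequence):
    """Fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def normalized_spacing(sequence, residues):
    """Mean gap between consecutive residues of the given kind, divided by length."""
    positions = [i for i, aa in enumerate(sequence) if aa in residues]
    if len(positions) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return (sum(gaps) / len(gaps)) / len(sequence)


def descriptor(sequence, dbd_start, dbd_end):
    """Hypothetical feature vector: composition plus acidic/basic spacing."""
    region = effector_region(sequence, dbd_start, dbd_end)
    return composition(region) + [
        normalized_spacing(region, "DE"),
        normalized_spacing(region, "KR"),
    ]


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy, made-up sequences: effector flank + mock DBD + effector flank.
MOCK_DBD = "CKRCHKRC"
toy = [
    ("toy_acidic_1", "DEEDDLEDEEDSDE" + MOCK_DBD + "EEDDEDLE"),
    ("toy_proline_1", "PPQPPSPQPPAPPQ" + MOCK_DBD + "QPPSPPQP"),
    ("toy_acidic_2", "EDDEEDSEDDELED" + MOCK_DBD + "DEEDLEDD"),
    ("toy_proline_2", "QPPAPPQPSPPQPP" + MOCK_DBD + "PPQPSPPQ"),
]
features = [descriptor(seq, 14, 14 + len(MOCK_DBD)) for _, seq in toy]
labels = kmeans(features, k=2)
for (name, _), label in zip(toy, labels):
    print(name, label)
```

In practice the DBD boundaries would come from domain annotations, and the descriptor set, number of clusters, and clustering algorithm would all need to be chosen and validated for the real data.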