DISTRIBUTED PROCESSING SYSTEM DEVELOPMENT APPLIED TO VIRTUAL SCREENING
Arthur Araújo de Lacerda, Jennifer Stephanie Pereira das Neves, Edson Luiz Folador
ABSTRACT
The COVID-19 pandemic has brought significant challenges to public health worldwide. The search for effective treatments and the discovery of new drugs are essential to combat this global crisis. However, drug development is a complex process that requires substantial resources. Bioinformatics can assist in the development of these drugs through virtual drug screening; however, the limited availability and high cost of processing power have been a challenge for researchers. SSHPC, a simple, lightweight, general-purpose processing distribution software based on SSH (the Secure Shell protocol), was developed to harness the potential of idle machines, complementing the processing power of the main server at UFPB’s LAMBDA laboratory. This report presents the development of this software, designed to meet the laboratory's needs. The process was successfully completed, resulting in a significant and scalable increase in available processing power. This allowed for more efficient use of the laboratory's computational resources, positively impacting research activities that rely on virtual screening tasks. Because it is based on the SSH protocol, communication security and authentication between the machines involved in processing distribution are guaranteed, and remote commands and tasks are executed efficiently and reliably, maximizing resource utilization. This report provides detailed information on the software's architecture, its main features, and the solutions implemented for efficient processing distribution. In addition, the results obtained through testing and evaluation are presented, highlighting the performance gains achieved. SSHPC stands out as a simplified, dependency-free, SSH-based processing distribution solution. Its approach facilitates installation and usage, providing an efficient tool for optimizing processing tasks in an academic and scientific context.
Keywords: bioinformatics; COVID-19; distributed processing; virtual screening.
INTRODUCTION
SARS-CoV-2
The coronavirus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), which causes the disease COVID-19 (Coronavirus Disease 2019), has emerged as one of the biggest public health challenges of the 21st century. In December 2019, initial outbreaks were reported in Wuhan, China, and quickly spread globally, reaching Brazil in February 2020 (Cândido et al., 2020). In March 2020, the World Health Organization (WHO) declared a pandemic. Since then, COVID-19 has profoundly impacted people's health, the economy and everyday life around the world. In addition to the search for new medicines, the discovery of anti-COVID effects in existing medicines has also proven to be a promising strategy for combating the virus (Singh, 2020).
Virtual Screening
Computational methods such as docking and virtual screening have been used to accelerate the drug discovery process and are routinely used in industry and academia. Virtual screening involves screening large libraries of chemical compounds to identify those with the greatest therapeutic potential, based on scoring functions that classify the affinity between the ligand and the target (Leelananda, 2016). This assessment requires the calculation of binding energies, molecular conformations and other relevant properties, and therefore faces significant computational difficulties. Furthermore, the complexity and scale of molecular interactions increase the time required to run simulations (Shaw, 2007).
Distributed Processing
Computational chemistry speeds up the drug discovery process. High-performance computers are used to accelerate demanding calculations; however, such infrastructure comes with time and financial costs. Therefore, local and distributed computing presents itself as a viable alternative (Banegas-Luna, 2019). Distributed computing transcends the limitations of individual systems by bringing together the collective power of many physically separated processing units, which collaborate on overall tasks through robust communication protocols and management systems that coordinate the flow of data and tasks (Enslow, 1978). This model permeates many scenarios, from intensive data analysis to highly scalable web applications to the discovery of new drugs.
AIMS
Develop a distributed processing system that is simple to use and install. This entails:
Coding programs to manage processes on the server, distribute them to clients, and execute them there;
Running production tests using SARS-CoV-2 proteins for virtual screening;
Developing a module to install and configure the system on servers at other institutions;
Using the security and efficiency of the Secure Shell protocol to transfer files and information for server-client communication;
Ensuring the usability of cluster computers for users, even during data processing.
METHODOLOGY
The development of SSHPC (Secure Socket Shell based Processing Cluster) involved a combination of technological resources and tools to create an effective and efficient solution. The combination of these elements was used to create the seven Bash scripts that make up SSHPC. Below is a list of the materials used during development:
Linux Operating Systems: The development base was formed by computers with Linux operating systems, providing an environment for implementing and testing SSHPC. The choice of Linux operating systems as a development platform is due to their wide adoption in commercial and scientific server environments. Linux offers stability, security, and a diverse set of command-line tools, which are essential for creating a distributed processing system.
Base Scripting: The Bash scripting language was the choice to implement SSHPC. It was chosen due to its command-line nature, its familiarity in Unix-like environments and because it is native to the operating system, lacking any prerequisites for installing or using SSHPC. Bash is ideal for automating tasks and creating scripts, which involve task coordination and processing distribution.
Secure Shell Protocol: The SSH protocol was chosen for communication between SSHPC components due to its robust security. SSH guarantees authentication, encryption and data integrity during the exchange of information between server and client nodes, protecting the system against unauthorized access and interception.
Testing environment: A set of Linux machines from the LIDI laboratory at the UFPB Campus I Class Center were used to test SSHPC in different scenarios. This allowed us to verify the functionality, scalability and efficiency of the system.
The system is divided between (1) the server machine, responsible for managing and distributing processes, requests, results and instruction files and; (2) client machines, responsible for processing files sent by the server and sending requests when available for processing. They communicate over the internet, using the SSH protocol, to exchange files and information. This creates a processing cluster where the client machines, acting as nodes in this cluster, have their potential gathered and managed by the server machine.
To guarantee security, communication between client and server will take place over an encrypted channel, preventing access and tampering during transmission and ensuring data confidentiality and integrity. Additionally, communication between client and server to perform tasks will always be one-way, managed by the server, ensuring that the client computer never accesses the server's files. The only client-initiated communication will be to inform the server that the specific client is available for use, without login and without access to any file system or resource on the server. The proposal presented in this project will be developed in three stages, detailed below: implementation, production and expansion.
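The one-way pattern described above can be illustrated with standard OpenSSH commands. The sketch below is a dry run: it only builds the transfer commands the server would issue, and the host, user and paths are illustrative assumptions, not SSHPC's real values.

```shell
#!/bin/sh
# Sketch of the one-way, server-managed communication pattern (dry run).
# Host, user, and paths are illustrative assumptions, not SSHPC's real values.
SSHPC_USER="sshpc"
CLIENT_HOST="client01"

# The server pushes instruction files to the client and pulls results back;
# the client never initiates a login or touches the server's file system.
push_cmd="scp /home/sshpc/run/job001.txt ${SSHPC_USER}@${CLIENT_HOST}:/home/sshpc/"
pull_cmd="scp ${SSHPC_USER}@${CLIENT_HOST}:/home/sshpc/job001.out /home/sshpc/runned/"

echo "$push_cmd"
echo "$pull_cmd"
```

In a real deployment these `scp` calls would run from the server under key-based authentication, so no password ever travels over the network.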
Implementation
Both the programs (scripts) that will be executed on the client side and on the server side will be developed using the bash programming language, native to the GNU/Linux OS. The Extreme Programming (XP) software development methodology will be used, which aims to quickly develop a product with the highest possible quality. To enable development, the project will operate on two distinct parallel fronts, in a cyclical manner: development and testing. The development front will be responsible for coding the client module and server module programs, then tests will be run to execute, verify results, verify efficiency, verify security and generate feedback notes for the next stage of development to provide changes if necessary.
Testing and implementation environment
The environment where the distributed processing system (DPS) was implemented and tested is composed of a server computer located at the Biotechnology Center (CBiotec), a public digital inclusion laboratory (LIDI) containing 20 desktop computers, and the multi-user bioinformatics and data analysis laboratory (LAMBDA) containing eight desktop computers.
Both the server and the laboratory computers are currently used to carry out other activities, and remained so during the development of the project and after its end. This intentional design choice ensures that client computers are not commandeered for docking while being used for other purposes, always prioritizing in-person use.
Distributed processing system’s functionalities
The DPS contains the functionalities described below for its operationalization, which may be encoded in different modules or in the form of a function:
Client
Check processing idleness on the computer;
Inform processing availability to the server;
Server
Check directories and insert available services into the execution queue;
Check client processing availability requests;
Check whether the client service was stopped prematurely and restart;
Check the existence of services to be processed;
Assign and send service exclusively to a single client to perform;
Detect execution completion and retrieve processed results from the client;
Display reports of queued, running and executed services;
Display a report of clients that are performing or have already performed services;
RESULTS AND DISCUSSIONS
The system was developed and is in operation, computing molecular docking for several complexes. It was built from several scripts for the client and server computers, which run under dedicated system users. For the system to work, it must be installed on the client and server machines via the installation script available on the website. This script prepares the system for the start of SSHPC activities. The user directory structures, the scripts developed and the files required for the system to function are explained below:
Client machine
Client’s scripts
sshpc_client_install.sh
Installs the system on the client: creates the system user and its directory structure, downloads the other necessary scripts, configures the cron schedule (a tool that guarantees the continuous execution of processes), and creates the configuration file necessary for client-server communication.
sshpc_client_pair.sh
Connects the client to the server. After installing the system on the client, client-server pairing is necessary for communication to occur. This script sends a pairing message containing the client's IP address, which is added to the server's list of trusted IPs; from then on, free communication is established. This program is automatically executed by “sshpc_server_pair.sh” on the server computer.
sshpc_worker.sh
Checks the availability of the computer at regular intervals; if capacity is free, a job request is sent to the server. This script always leaves at least one core available to the local user.
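A minimal sketch of this availability check is shown below. The thresholds, the load-average heuristic and the notification step are illustrative assumptions, not the exact logic of sshpc_worker.sh.

```shell
#!/bin/sh
# Sketch of an sshpc_worker.sh-style availability check (illustrative logic).
# Typically scheduled from cron, e.g.:  * * * * * /home/sshpc/sshpc_worker.sh
TOTAL_CORES=$(nproc 2>/dev/null || echo 4)
# Approximate the number of busy cores from the 1-minute load average, rounded.
LOAD=$(awk '{printf "%d", $1 + 0.5}' /proc/loadavg 2>/dev/null || echo 0)

# Reserve one core for the local, in-person user, as described above.
AVAILABLE=$((TOTAL_CORES - LOAD - 1))
if [ "$AVAILABLE" -gt 0 ]; then
    echo "available: $AVAILABLE cores"   # the real script would notify the server here
else
    echo "busy"
fi
```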
Configuration file
The “sshpc.conf” file contains information about the client machine and server machine that guarantees fluid interaction.
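The report does not list the keys stored in “sshpc.conf”; a hypothetical example of the kind of information it might hold is sketched below. All key names and values here are illustrative assumptions, not the file's actual contents.

```shell
# Hypothetical sshpc.conf contents -- keys and values are illustrative assumptions.
SERVER_IP=192.0.2.10     # address the client contacts with availability requests
SERVER_USER=sshpc        # system user created by the install scripts
CHECK_INTERVAL=60        # seconds between availability checks
RESERVED_CORES=1         # cores kept free for the local user
```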
Server machine
Server’s scripts
sshpc_server_install.sh
Installs the system on the server: creates the system user and its directory structure, downloads the other necessary scripts, configures the cron schedule (a tool that guarantees the continuous execution of processes), and creates the SSH key, a file that enables free server-client communication.
sshpc_server_pair.sh
After installing the system on the client or in case of IP address changes, pairing between the client and server is necessary to establish communication. This script sends the SSH key to the client, allowing fluid and secure communication between the parties.
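Conceptually, this pairing step resembles standard SSH key distribution. The sketch below shows the equivalent using stock OpenSSH tools; the key path and client host are illustrative, and the real scripts automate this exchange.

```shell
#!/bin/sh
# Conceptual equivalent of the pairing step using standard OpenSSH tools.
# The key location and client hostname are illustrative assumptions.
KEYDIR=$(mktemp -d)
ssh-keygen -t ed25519 -N "" -f "$KEYDIR/id_sshpc" -q   # create the server's key pair
# ssh-copy-id -i "$KEYDIR/id_sshpc" sshpc@client01     # would install the key on a client
ls "$KEYDIR"
```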
sshpc_verify.sh
At regular intervals, this script checks for requests sent by a client with processing availability. If such a request is identified, the sshpc_run.sh program is triggered.
sshpc_run.sh
Allocates one of the instruction files present in the run folder to the client and sends the command to execute it. The result is returned to the server and stored in the runned folder. This program is automatically called by sshpc_verify.sh.
sshpc_summary.sh
This script summarizes system information according to the parameters passed to the command. The available parameters allow you to view results over time, on previous days or on specific dates. It is also possible to check how many processes are running, how many instruction files are still awaiting processing, monitor availability requests sent by clients and track clients that are currently processing.
Instruction files
Instruction files are required for SSHPC to function. These files are interpreted by the sshpc_run.sh script and contain directives that will be transmitted to the client. They have a specific structure and support, by default, the following directives: sending, which transfers files from the server to the client (optionally compressed); recovery, which transfers files from the client to the server (optionally compressed); execution, which runs a command on the client or server; and deletion, which removes files from the client.
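The exact directive syntax is not reproduced in this report; a hypothetical instruction file exercising the four directive types above might look like the following. The directive keywords, file names and the docking command are illustrative assumptions, not SSHPC's real format.

```
# hypothetical instruction file -- names are illustrative, not SSHPC's real syntax
sending    receptor.pdbqt ligand.pdbqt conf.txt    # server -> client (optionally compressed)
execution  vina --config conf.txt --out result.pdbqt
recovery   result.pdbqt                            # client -> server
deletion   receptor.pdbqt ligand.pdbqt result.pdbqt
```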
System’s operating cycle
The system operates in cycles triggered by clients' processing availability. At regular intervals, the client computers execute the sshpc_worker.sh script to check their availability; if capacity is free, a request is sent to the server. On the server, the sshpc_verify.sh script runs at regular intervals to check the requests sent by clients (stored in the access report). For each request, the sshpc_run.sh script is called, which assigns to the client an instruction file from the /home/sshpc/run folder, moving it after assignment to the /home/sshpc/running folder. Each line of the instruction file is then read and executed sequentially until the end. Upon completion, the file is moved to the /home/sshpc/runned folder, indicating that the task has finished, and the cycle ends. However, if an instruction fails, such as an attempt to send a non-existent file, the instruction file is moved to the /home/sshpc/fail folder instead.
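The queue-folder flow described above (run, running, runned, fail) can be sketched as follows. The sketch uses a temporary directory instead of /home/sshpc so it is self-contained, and the one-line job stands in for a real instruction file.

```shell
#!/bin/sh
# Sketch of the run -> running -> runned/fail folder flow described above.
# A temporary directory stands in for /home/sshpc to keep the sketch self-contained.
BASE=$(mktemp -d)
mkdir -p "$BASE/run" "$BASE/running" "$BASE/runned" "$BASE/fail"
echo 'echo hello' > "$BASE/run/job001.txt"     # a one-line stand-in instruction file

JOB=$(ls "$BASE/run" | head -n 1)              # pick the next queued instruction file
mv "$BASE/run/$JOB" "$BASE/running/"           # mark it as assigned

if sh "$BASE/running/$JOB" > /dev/null 2>&1; then
    mv "$BASE/running/$JOB" "$BASE/runned/"    # all instructions completed
else
    mv "$BASE/running/$JOB" "$BASE/fail/"      # an instruction failed
fi
ls "$BASE/runned"
```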
This work provides greater processing power, facilitating the screening of large drug libraries; since, according to the literature, limited computing power represents one of the main difficulties faced in drug discovery (Leelananda, 2016; Barbosa, 2012), SSHPC helps overcome it in a scalable and robust way. The system was tested with a large-scale analysis of SARS-CoV-2 proteins against the tannic acid ligand and its analogues. By adding 10 computers with a maximum of 4 processing cores each, it is possible to calculate an average of 400 interaction complexes per day, which reveals the potential of local distributed processing systems. Previously, with only the laboratory's dedicated processing server (AMD Ryzen Threadripper 64), 1,000 complexes were calculated per day. With the implementation of SSHPC in LIDI, it was possible to increase processing capacity by 40%, using mostly idle machines that remain available for common use.
CONCLUSION
In conclusion, the distributed processing system developed, simple to use and install, was implemented and is in operation; through scripts that manage processes on the server and distribute them to clients, it increased the processing capacity of LAMBDA-UFPB. Looking to the future, the aim is to scale SSHPC to other UFPB laboratories and other institutions in order to add more processing power and improve the drug discovery process, as well as to make SSHPC available to other interested institutions for asynchronous distributed processing.
REFERENCES