Biomedical Image Processing with Containers and Deep Learning: An Automated Analysis Pipeline

Here, a streamlined, scalable, laboratory approach is discussed that enables medium‐to‐large dataset analysis. The presented approach combines data management, artificial intelligence, containerization, cluster orchestration, and quality control in a unified analytic pipeline. The unique combination of these individual building blocks creates a new and powerful analysis approach that can readily be applied to medium‐to‐large datasets by researchers to accelerate the pace of research. The proposed framework is applied to a project that counts the number of plasmonic nanoparticles bound to peripheral blood mononuclear cells in dark‐field microscopy images. By using the techniques presented in this article, the images are automatically processed overnight, without user interaction, streamlining the path from experiment to conclusions.


Introduction
Advances in computer technology, data acquisition hardware, laser technology, and automated imaging platforms have transformed the field of biomedical research. Hardware systems that once required manual intervention can now be programed to run continuously for days or even weeks. High-content-screening systems enable the simultaneous testing of several experimental hypotheses automatically. [1][2][3][4][5] Precision mechanical advances enable optical systems that can scan and rescale entire centimeters of samples at subcellular resolution. [6] High-bandwidth communication networks enable large multisite scientific datasets. [7] Data storage technology has evolved substantially to the point where the units of data measurements are terabytes, enabling the exploration of increasingly complex scientific questions.
These technologies have had an impact on biomedical research. Where it was once adequate to acquire a few data points and images to address a hypothesis, today's tools enable considerably greater capabilities to drive the exploration of increasingly complex scientific questions.
Archiving and indexing large datasets is a nontrivial but feasible task, as demonstrated by the plethora of database repositories currently in use for biology research, [8][9][10][11][12][13][14][15][16][17][18] electron microscopy, [19] microarray analysis, [20] radiology, [21][22][23][24] or multidiscipline research. [25,26] While these tools allow us to acquire substantially more data, it is still incumbent on the scientist to transform data into useful and actionable information. Manual image processing is unfeasible for such datasets, and automated data analysis techniques and methods are continuously being developed, [27][28][29][30][31][32][33] even more so now with the advent of machine learning tools, [34][35][36][37][38][39][40] which often perform on par with human observers. Open-source artificial intelligence (AI) software libraries such as Keras [41,42] and TensorFlow [43] bring the power of neural networks and AI to the analysis of these datasets. Data inspection interfaces are built to inspect the results of such automated methods. [36,44] Beyond image processing, data visualization algorithms are being used to extract information from high-dimensional datasets. [45][46][47][48][49] Reusable and reproducible software built on open-source tools such as Python [50] and R [51] makes the development of software in repositories based on Git and GitHub commonplace. [52] Python notebooks, for example, enable the sharing of data analyses, [53] and containerization software such as Docker [54] enables such software to run across computers and operating systems (OS) with minimal configuration overhead. [55] Examples exist in the medical imaging community, to find image-based signatures of cancer, [56] and in the optical coherence tomography (OCT) community, [57] where high-resolution volumetric images of vessels, the esophagus, or the colon are routinely acquired.
All these tools can be daunting, especially to those who have developed many of their skills using more traditional, standalone computational analysis methods. This is the method predominantly taught to most scientists and engineers during their undergraduate and postgraduate training. While this approach is perfect for initial development, small datasets, and for exploring some outcomes, it generally suffers from a lack of scalability, making the transition to large datasets challenging.
In this article, we present data management, automated analysis, and software development methods and practices that are designed to enable the management and processing of medium-to-large datasets in a reliable and reproducible manner. Key aspects presented here are the following: a data policy to maintain order in projects that span several years; a data processing pipeline based on AI to automatically process imaging data; software techniques that ease the transition from prototyping to deployment; clustering techniques that can ease data analysis; and, finally, data inspection techniques to monitor the validity of automatically generated analyses. We propose to make a distinct split among these elements of the data cycle, even if they are performed on the same machine or by the same researchers. This division enables distinct processing by different entities and establishes quality control (QC) procedures for each step (Figure 1). We present a concrete example of these concepts: a novel image processing pipeline that, from 3D microscopy image stacks, extracts cells and cell contours using deep convolutional neural networks and then analyzes individual cells using computer vision image processing techniques. The pipeline is designed to be run overnight on a small cluster of computers. Further, a web-based QC interface is used to validate the results of the automated pipeline. This approach will be useful to researchers who seek to acquire and process significant amounts of data, all within the laboratory environment.

Standardized Data Enable Automated Processing
Data is the cornerstone of research. Strong efforts are being made towards data interoperability [58] and standardized data and metadata formats. [59][60][61] Guidelines for data storage and management are actively being published with a focus on open standards. [30][62][63][64][65][66] In this article we will detail methods to manage medium-to-large acquired datasets within the laboratory, with a focus on storing data for automated analysis.
Data in biomedical research is often acquired by devices that enable high-throughput data acquisition, such as microscopes, OCT devices, or magnetic resonance imaging scanners. The data may be initially stored on the same computer that is attached to the physical acquisition device. After the acquisition session has ended, the data should be immediately archived to a data storage system. This can be automated via free and open-source software such as rsync [67] to ensure that acquired data is safely stored.
Figure 1. Flow of data for an experiment. The data is acquired with an acquisition system and is archived in a data storage system. A processing cluster, with one or more nodes, retrieves the acquired data and processes it through an analysis pipeline, generating processed data and intermediate QC images. Internally, the processing cluster can split tasks among its nodes using cluster orchestration software. Finally, the processed data and QC images are inspected by the researcher to draw the conclusions of the experiment.
www.advancedsciencenews.com www.bioessays-journal.com
To maximize efficiency, there should always be an inherent hierarchy associated with the data. It should be defined before the start of the project and enforced throughout its duration. In our laboratory, for example, we keep the following hierarchy: Project/Data type (raw, processed, …)/Date of experiments/Subject/Experiment. The hierarchy can be reflected in the directory structure of the file server system of the laboratory, or through the imposition of relational databases. In practice, we favor the former solution for its ease of use. When using home-built programs and data acquisition software, the data storage structure can and should be built directly into the code. For example, in one of our projects, a home-built confocal microscopy software package was designed to automatically create a file structure based on user-defined conditions for each experiment. This automatically generated structure included the project name, the date of data acquisition, the type of data being stored, and a time stamp that accompanied each collected file. This automatic storage structure mechanism made the collected data instantly traceable and straightforward to analyze en masse postacquisition.
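As an illustration of building the storage hierarchy into code, the following minimal sketch composes a folder path from the five levels listed above. The function name, the mount point, and the project, subject, and experiment labels are hypothetical; only the ordering of the levels comes from the text.

```python
from datetime import date
from pathlib import PurePosixPath


def experiment_dir(root, project, data_type, day, subject, experiment):
    """Return the storage folder for one experiment.

    Layout: Project / Data type / Date of experiments / Subject / Experiment.
    """
    if data_type not in {"raw", "processed"}:  # hypothetical set of data types
        raise ValueError(f"unknown data type: {data_type!r}")
    # ISO dates (YYYY-MM-DD) sort chronologically when listed alphabetically.
    return PurePosixPath(root) / project / data_type / day.isoformat() / subject / experiment


# All labels below are placeholders chosen for the example.
path = experiment_dir("/mnt/nas", "pnp_counting", "raw",
                      date(2019, 3, 14), "subject_01", "exp_02")
print(path)
```

Generating paths through a single function, rather than typing them by hand, is one way to keep the hierarchy consistent over the lifetime of a project.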
For automated data processing, we make use of centralized data storage systems, such as network attached storage (NAS) devices. This approach allows the different members of the team to "mount" and externally access data on their networked desktop machines for inspection. Centralized data storage systems with a large hard disk (HD) capacity (around 50 TB) are available for only a few thousand dollars. The market for NAS systems has blossomed, especially with the addition of easy-to-use web interfaces, computing capabilities, and containerization to these servers. In our laboratory, for example, we have centralized the data for a single project into a ten-disk 40 TB NAS server, to which more HDs can be added as needed.
Alternatively, the data can be stored on the cloud and cloud computing can be used for its analysis. Cloud services refer to online, on-demand computational services, such as file storage, that are managed by a cloud provider such as Google. Universities and hospitals often make these services available directly to users. Recently, services such as Dropbox, Google Cloud, and Amazon Web Services can even be made compliant with the US Health Insurance Portability and Accountability Act (HIPAA) if the right care is taken in their configuration. Cloud services can offer many advantages at the cost of a longer setup time. Their use should be considered based on the amount of data to be collected, the time required to upload the data to cloud services, as well as the processing power needed to analyze the data. For example, Dropbox and other companies offer cloud storage tiers as large as tens of petabytes, storage sizes that are currently unavailable to many scientists. While setting up servers for remote computing and storage on a cloud provider can be slower than deploying on a locally accessed machine, the availability of virtualized computing power for some applications can more than make up for this. For example, Google offers the Google Cloud Platform for data analysis, where a user can log in to request a virtual computer with specific characteristics, such as the number of central processing units (CPUs), random-access memory (RAM) size, and number and type of graphics processing cards. Processing data on such cloud systems can make use of computing time on hardware, billed by the minute, that may otherwise be too expensive or exotic to be normally accessible. At the same time, cloud data storage and processing can be expensive; terabytes of cloud data storage can lead to prohibitively large costs if not planned or managed correctly. Table 1 shows a price comparison of data storage and computing possibilities between local and cloud-based solutions.

Automated Pipelines Turn Raw Data into Actionable Insights
Data processing can be abstracted in the concept of a pipeline: a set of processes applied to the data to synthesize conclusions. Often, at each step of the pipeline, the data is reduced by eliminating information that is irrelevant for the project. Depending on the specific project and analytical methods, the steps of the pipeline can be automated or require manual input. For many of our projects, the ultimate goal of automated data processing is to let the scientist draw conclusions from the acquired data without any manual input. This push towards full automation removes the burden of step-by-step manual data analysis, freeing the scientist to focus on experiment design, data acquisition, and interpretation. Recent advances in computer vision and AI have brought automated image analysis algorithms to a level that can often be considered on par with inter-reader agreement. Deep convolutional neural networks excel at object localization, image segmentation, and object tracking in video sequences. These findings have been adopted for biomedical image analysis. As an example, in Figure 2, we present an automated data processing pipeline for microscopy. Raw images are first acquired. Cells are then identified in each field of view (fov) using an object detector based on a deep convolutional neural network (single shot multibox detector [SSD]). [68] Regions of interest around the cells are then extracted from the raw data. From each of them we extract the contour of the cell, including bound plasmonic nanoparticles (PNPs), with a segmentation deep convolutional neural network (U-Net). [69] Finally, we count, using a standard image processing technique (blob detection), the number of gold nanoparticles attached to the cell. This tabular data is then ready for analysis via R or Python scripts. In this project, the raw data collected can be quite large. Each experiment consists of roughly 20 fovs, each consisting of an uncompressed 3D color image, 650 MB in size. If four experiments are carried out daily, the total data collection is nearly 52 GB per day.
It is worth noting that while the analytic tools developed here, including deep-learning-based methods, are used for postacquisition data analysis, such approaches can be implemented in the future to improve data acquisition. Neural networks and computer vision techniques could play a potential role in real-time processing of a live image feed to provide guidance and feedback to a scientist or operator. For example, images acquired on a camera system could be processed to guide the selection of optimal image fields, or to ensure that the cells of interest are at the center of the microscope focus. Similarly, neural networks could be used to process image data during acquisition in a QC step: real-time analysis of the image could provide a user information on the signal-to-noise, signal-to-background, viability of the cells, lineage of the cells, and so on. These automated feedback methods may be of importance when technology of this kind is commercialized to ensure quality data acquisition by minimally trained users or even enable fully automated, computer-guided image acquisition.

Deep Learning As a Pipeline Building Block
Training and development of deep learning neural networks is still the subject of research. However, for many applications, recently developed methods are readily applicable. In the processing pipeline we use deep learning to automate two tasks: finding cells in a fov and outlining the contour of the cells for subsequent analysis. Each task is solved by a different network structure. For instance, in cell detection networks, the input space is an image and the output is a list of boxes representing the location and size of the cells. For cell contouring, the input space is an image and the output is another image where pixels belonging to the cell are marked as "1," while pixels belonging to the background are marked as "0." A key aspect of training deep learning algorithms is the definition of a high-quality training database that spans all of the expected image variabilities.

Finding Cells in Fovs with Deep Learning Object Detection Networks
Finding objects in natural images has been an intense area of research in computer vision and AI. Open databases, such as ImageNet, [70] have been used to benchmark the performance of different algorithms, such as Fast R-CNN, [71] SSD, [68] or YOLO. [72] The choice of which algorithm to implement should be made on the basis of the performance of the network on images and the availability of open-source implementations that can be integrated with the rest of the pipeline. In our case there is only one type of object to be detected: cells. While this may seem a simplistic target at first glance, the cells in our experiments are extracted from blood, span a range of sizes and compositions, and are often accompanied by spurious image features such as dead cells and cellular debris. We have chosen to implement an SSD network due to the availability of a Python-based implementation, its good performance on ImageNet, and the underlying simplicity of the network structure.
Briefly, we reduce each input image stack of 2496 pixels × 3328 pixels × 40 layers × 3 RGB channels into a downsampled maximum intensity projection image of 624 pixels × 832 pixels × 3 RGB channels. Each cell occupies a window of roughly 20 × 20 pixels in the downsampled image. We define the output of the network as a 39 × 52 grid with four channels, where the first channel denotes the likelihood that the pixel belongs to the background, the second the likelihood that the pixel belongs to a cell, and the third and fourth channels model the displacement in normalized units in the x and y directions. The reference standard is composed of similar grids where background pixels have a 1 in the first channel and a 0 in the second. Pixels that contain cells have a 1 in the second channel and a 0 in the first channel, and the third and fourth channels encode their normalized displacement in the original image.
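To make the grid encoding concrete, the sketch below builds a 39 × 52 × 4 reference grid from a list of annotated cell centers. The stride of 16 pixels follows from the stated sizes (624/39 = 832/52 = 16). The displacement convention used here (offset from the middle of the grid position, in units of one grid position) is an assumption, since the text does not specify how the normalization is done; the function name and the example coordinates are likewise hypothetical.

```python
GRID_ROWS, GRID_COLS = 39, 52   # output grid stated in the text
STRIDE = 16                     # 624 / 39 = 832 / 52 = 16 pixels per grid position


def encode_targets(centers):
    """Build a 39 x 52 x 4 reference grid from (x, y) cell centers.

    Coordinates are in pixels of the 624 x 832 downsampled projection.
    Channels per grid position: [background, cell, dx, dy].
    """
    grid = [[[1.0, 0.0, 0.0, 0.0] for _ in range(GRID_COLS)]
            for _ in range(GRID_ROWS)]
    for x, y in centers:
        col, row = int(x // STRIDE), int(y // STRIDE)
        if not (0 <= row < GRID_ROWS and 0 <= col < GRID_COLS):
            continue  # annotation outside the image, skip it
        # Assumed convention: offset from the middle of the grid position,
        # in units of one grid position, so values lie in [-0.5, 0.5).
        dx = (x - (col + 0.5) * STRIDE) / STRIDE
        dy = (y - (row + 0.5) * STRIDE) / STRIDE
        grid[row][col] = [0.0, 1.0, dx, dy]
    return grid


# Two made-up annotations: one mid-image, one near the top-right corner.
targets = encode_targets([(100.0, 200.0), (820.0, 10.0)])
print(targets[12][6], targets[0][51])
```

If two annotated centers fall in the same grid position, this sketch keeps only the last one; a production encoder would need an explicit policy for that case.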
The network is trained using the adaptive moment estimation (Adam) optimizer and a hybrid cost function. The first term corresponds to the normalized cross entropy between the first two channels of the reference standard and the network's output. The second term is defined only on pixels that have cells in the reference standard and represents the mean error between computed and reference displacements. A factor λ between the two terms modulates the weight of correct localization with respect to detection; it is fixed experimentally to 0.1.
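The two-term cost function can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code of the project: the cross entropy is averaged over grid positions, the "mean error" is taken to be the mean absolute error over both displacement channels, and λ multiplies the localization term. The grid is passed as a flat list of positions for brevity.

```python
import math

LAMBDA = 0.1  # weight of the localization term, value stated in the text


def hybrid_loss(reference, predicted, eps=1e-7):
    """Cross entropy on the two class channels plus a weighted displacement error.

    Both inputs are lists of grid positions, each [background, cell, dx, dy].
    """
    cross_entropy = 0.0
    displacement_errors = []
    for ref, out in zip(reference, predicted):
        # Detection term: cross entropy between reference and predicted classes.
        for k in (0, 1):
            cross_entropy -= ref[k] * math.log(max(out[k], eps))
        # Localization term: only where the reference contains a cell.
        if ref[1] == 1:
            displacement_errors.append(abs(ref[2] - out[2]) + abs(ref[3] - out[3]))
    detection = cross_entropy / len(reference)
    if displacement_errors:
        localization = sum(displacement_errors) / len(displacement_errors)
    else:
        localization = 0.0
    return detection + LAMBDA * localization


reference = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.25, 0.0]]
uncertain = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
print(hybrid_loss(reference, reference), hybrid_loss(reference, uncertain))
```

With λ = 0.1, a displacement error contributes ten times less than a cross-entropy error of the same magnitude, which matches the stated intent of weighting detection above localization.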
The training database is formed of 7097 fovs manually annotated by image analysis experts by placing a dot at the central point of each cell. We use 4000 fovs for training, 1000 for validation, and 2097 for testing. A custom non-maximum suppression (NMS) method is applied to eliminate close-by detections. The Pearson correlation coefficient between the number of cells detected using this SSD-based method and the number of cells counted by an expert is 0.827. The resulting average error in the test set between the number of cells found in a fov and the number of cells present is 1.7 cells per fov. Such results were deemed of good quality by the researchers who are using the system. Most of the errors occurred for close-by cells or cells that were only partially in the fov. Improvements in the metrics would probably arise from a cleaner dataset. Examples of the fovs can be found in Figure 3. The network processes the 2097 testing images in 73 s, at a rate of 35 ms per image, using an Nvidia GTX1080Ti graphics card.
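The custom NMS method is not detailed in the text. As a generic illustration of the idea, the sketch below implements a simple greedy, distance-based suppression for point detections; it should not be read as the method used in the project. The function name, the score values, and the 10-pixel threshold (half of the roughly 20-pixel cell window) are assumptions.

```python
import math


def suppress_close_detections(detections, min_distance):
    """Greedy non-maximum suppression for point detections.

    detections: list of (x, y, score); higher score means more confident.
    Visits detections from best to worst score and keeps one only if it is
    at least min_distance pixels away from every detection already kept.
    """
    kept = []
    for x, y, score in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(math.hypot(x - kx, y - ky) >= min_distance for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


# Two candidates about 4.5 pixels apart and one isolated candidate.
candidates = [(100, 100, 0.9), (104, 102, 0.6), (300, 50, 0.8)]
cells = suppress_close_detections(candidates, min_distance=10)
print(cells)
```

A threshold that is too large merges genuinely adjacent cells, which is consistent with the observation that most errors occurred for close-by cells.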
It is important to note that, using current AI libraries, this network can be written in 300 lines of code, as shown in the GitHub repository of the project. [73]
Figure 3 (caption, partially recovered). Note how the method is robust to background artifacts. In the network images, blue boxes correspond to convolution operations, red boxes to batch normalization, orange to max-pooling, dark green to upsampling operations, concat stands for concatenate, and the yellow box stands for a convolution with a sigmoid activation function. Each layer has its associated parameters on the box.

Deep Learning Finds the Contours of Cells with Segmentation Networks
Encoder-decoder network structures with skip connections, such as the U-Net, [69] outperform most specialized segmentation methods in a wide variety of tasks. The U-Net has, since its creation, been subject to modifications and improvements, with the addition of more complex convolutional blocks [74] or squeeze-and-excite methods. [75] While the SSD network structure was well suited for object identification tasks, such as the automated identification of cells, the architecture of the U-Net is particularly optimized for the partitioning of images based on image features such as shapes and edges. This makes the U-Net a natural choice for image segmentation tasks.
In this work we have used a simplified version of the U-Net to segment the contour of the cells and bound PNPs, because it is easy to implement with current AI libraries. The network is composed of a three-step encoder path; each step has two convolution layers with 32, 64, or 128 filters, according to depth. An image of the network can be found in Figure 3. The network is trained with a database of 1704 manually segmented maximum intensity projections of 200 × 200 pixel cell images: 800 images were used for training, 200 for validation, and 704 for testing. The network was optimized using the Dice coefficient between the resulting segmentation and the reference standard. The Dice coefficient measures the similarity between two segmentation masks by taking twice the area of the pixelwise product of the two masks (their overlap) and dividing it by the sum of the areas of the two masks. The Dice coefficient approaches 1 when the masks are equal and is 0 when the masks share no pixels. Such a metric has been shown to have better properties than a per-pixel classification loss. [76] The network is trained with the Adam optimizer, with a learning rate of 1e-5 and otherwise default parameters. The average Dice coefficient on the test set was 0.935. Examples of the contours over the cells can be found in Figure 3. Once the data is loaded, cell contouring provides a mask for a cell in 1 ms using an Nvidia GTX1080Ti GPU.
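The Dice coefficient described above can be written directly from its definition. The following sketch operates on small binary masks stored as nested lists; the function name and the convention of returning 1 for two empty masks are assumptions made for the example.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity between two binary masks given as nested lists of 0/1."""
    flat_a = [v for row in mask_a for v in row]
    flat_b = [v for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(a * b for a, b in zip(flat_a, flat_b))  # area of the product
    total = sum(flat_a) + sum(flat_b)                      # sum of both areas
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as identical
    return 2.0 * overlap / total


reference = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
predicted = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
# Overlap is 3 pixels, areas are 4 and 3, so Dice = 2 * 3 / (4 + 3) = 6/7.
print(dice_coefficient(reference, predicted))
```

When used as a training objective, the same expression is typically evaluated on the continuous network output rather than on thresholded masks, and one minus the coefficient is minimized.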

Automated Algorithms Need to Be Quality-Controlled
While a given image analysis routine may seem to be performing well, it is critical to ensure that data is being processed correctly and reproducibly. We configure our code to generate intermediate QC images that can be readily checked so that researchers can trust the data generated via automated processes. QC intermediates, whether they be images, tables, or spreadsheets, allow for rapid inspection to make sure that the data and associated analysis are working correctly. A QC inspection step can save considerable time and effort and ensures that the data consistently meets standards long before a paper, presentation, or publication is written. QC is especially important in large-scale experiments, since it is often impossible to inspect each intermediate step of the pipeline manually. Semiautomated methods for the identification of outliers are valuable tools to control large experiments. [77][78][79][80] QC inspection interfaces can also be put into place. They are especially important when the conclusions of the experiments challenge the experimental setup or the data acquisition method. Current web frameworks, such as Django, [81] enable the creation of software that exposes the contents of a folder structure and creates links among its items with a few lines of code, enabling data analysis by all members of the team. An example of an interface developed to inspect the results acquired with the pipeline of Figure 2 can be seen in Figure 4. Such an interface has the capacity to mark cells for exclusion from analysis, for example, in case the image processing pipeline fails.
The analysis step typically follows an automated image analysis to validate the hypotheses of the experiment or infer whether more data needs to be acquired. The analysis is often performed on the researcher's computer, since graphical visualization of the extracted data and statistical methods are generally needed. Any data analysis software, from Python and R to MATLAB or SAS, can be used to perform such exploration. Both Jupyter and RStudio additionally provide outstanding notebook environments that allow for data analysis, exploration, and documentation. When analysis output is either too large to easily transport between computers or too complex to analyze on a researcher's computer, Jupyter and RStudio server software provide a browser-based means of remote graphical processing that processes data in place. For example, we have deployed multiprocessor and graphics processing unit (GPU)-equipped data processing servers running Jupyter to enable GPU-accelerated data analysis routines that would not be feasible on a personal computer.

Deploying Experimental Software for Routine Use
Software is often developed on a personal computer, validated on a subset of the data, and then encapsulated for deployment on the whole dataset. This process is often iterative, as improvements or adaptations are needed when new data arises. For good practices in software development we refer the reader to Wilson et al. [82] Software version control systems, such as Git, should be used to track changes in the developed software. General guidelines on their use can be found in Blischak et al. [52] Git provides a powerful set of tools for writing code both individually and as a team. Rolling back to an earlier code version is straightforward, and Git scales to large teams all writing code together. When codeveloping software, it is important that the team uses the same libraries on their different development machines. The Python programming language, for example, provides tools to list all installed software dependencies. New users can use such lists to regenerate the same development environment across different computers and OSs. Package managers for languages such as Python and R provide a simple means of identifying and installing all the required libraries for analysis. Examples of such lists are the requirements.txt files generated by the Pip Installs Packages (pip) package manager. To make code portable to other users, one generates a requirements.txt file and includes it with the packaged code. When a user downloads the code repository, pip can read the file and install the correct versions of all required libraries, typically into a "virtual" environment that isolates them from the rest of the system. Similarly, the conda package manager can also manage Python packages and generate virtual environment description files to accomplish the same purpose.
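As a small illustration of how a pinned dependency list supports reproducibility, the sketch below compares the exact pins of a requirements.txt-style list with the versions installed in the current environment. It relies on `importlib.metadata.version` from the Python standard library; the function name `check_pins` and the package versions in the example are hypothetical, and the demonstration uses a stand-in lookup so that it does not depend on what is installed.

```python
from importlib import metadata


def check_pins(requirement_lines, installed_version=metadata.version):
    """Return {package: (pinned, installed)} for every exact pin not satisfied."""
    problems = {}
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line or "==" not in line:
            continue                            # only exact pins are checked here
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None                        # package is not installed at all
        if found != pinned:
            problems[name] = (pinned, found)
    return problems


def fake_lookup(name):
    """Stand-in for the real environment, used only for this demonstration."""
    versions = {"numpy": "1.16.2"}              # hypothetical installed package
    if name not in versions:
        raise metadata.PackageNotFoundError(name)
    return versions[name]


pins = ["numpy==1.16.2", "keras==2.2.4  # hypothetical pin", "", "# a comment"]
print(check_pins(pins, fake_lookup))
```

Calling `check_pins` without the second argument queries the real environment, which can serve as a quick sanity check before a pipeline run.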
Running the same software on different machines under different OSs can be challenging, making multiplatform development cumbersome. Even when using the same OS, different installed versions of computational libraries can lead to incompatibilities that render software unreliable. A first abstraction to deal with such issues was the virtual machine: a guest OS that runs within the host OS on virtualized hardware. This model has recently been extended with the concept of OS virtualization, also called containerization. Each container holds a stripped-down version of the OS that encapsulates only the libraries required to run the specific code. Containers have the advantage of being relatively lightweight; they can be created as needed and replicated on demand to run processes in parallel. Today, the most popular containerization software is Docker. [54] Docker containers can be run locally using standard Linux tools and can be configured to run at scheduled times using cron. This approach is appropriate for cases when there is only one server processing the data. Importantly, Docker is available for Linux, Windows, and macOS, meaning that once a container is made, it can be run anywhere and on any machine.
Containers are now widely in use for development and web services and are finding new life in the burgeoning area of data science. In a scientific research environment, the allure of containers is clear: a program and all the software libraries necessary for its execution can simply be placed in a container and distributed. This ensures that the program will run consistently across all computers and servers in the laboratory, whether it is on a student's Windows laptop or a departmental Linux server. This can be ideal for research environments, as it becomes simple to distribute both data and its associated analysis as readily traceable and easily documented bundles. This approach is immediately attractive for translational research, where bench-side routines need to be transformed into analyses that can be run by clinical collaborators. Distributing containers can ease this process and ensure that the code and environment running on a clinical computer system matches the needs of the project. Containers are also a potential means by which scientists can address some concerns regarding data and analysis repeatability. Analysis routines written in one lab can be easily transferred to another and be guaranteed to work out-of-the-box. When used along with proper data storage approaches and version control, containerization can be a powerful way to ensure standardized analysis.

Parallelizing the Effort: Running the Code Automatically in Several Computers Simultaneously
Let us imagine we have already collected the data, generated manual annotations, trained AI models, generated a Docker infrastructure to guarantee that our code runs in any computer, and that we acquire 60 GB of data from a day of experiments. How do we process that data to generate actionable information? In our laboratory we process the data using the Linux utility called cron, [83] which runs a background process at a given time. Every evening at a fixed time, a script inspects the directory structure for unprocessed data, and in the case that it exists, triggers the image processing pipeline on it. This works well for normal, routine data collection in the laboratory, as the size of any new data is amenable for analysis within a few hours. In some cases, when the data processing is slow or the amount of new data is large, overnight processing in a single device might not be enough. In these cases, the pipeline is run for several days until complete. In some scenarios, however, this approach may result in days or weeks of analysis time, making parallel data analysis methods preferred.
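The nightly check for unprocessed data can be sketched as a short script. The version below assumes, purely for illustration, that fovs are stored as .tif files and that the pipeline writes a marker file into a folder once it has finished; neither detail is specified in the text, and the script path in the cron line is hypothetical.

```python
from pathlib import Path

DONE_MARKER = "processed.done"  # hypothetical marker written when the pipeline finishes


def find_unprocessed(raw_root):
    """Return experiment folders that contain images but no completion marker."""
    pending = []
    for folder in sorted(p for p in Path(raw_root).rglob("*") if p.is_dir()):
        has_images = any(folder.glob("*.tif"))        # assumed image extension
        if has_images and not (folder / DONE_MARKER).exists():
            pending.append(folder)
    return pending


# Example crontab entry (hypothetical paths) that runs a script built around
# find_unprocessed every evening at 20:00:
#   0 20 * * * /usr/bin/python3 /opt/pipeline/nightly.py
```

Using a marker file keeps the script stateless: rerunning it after an interrupted night simply picks up the folders that were not completed.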
Running the code on several computers is a complex topic covered by the field of clustering and cluster orchestration. In our case, clustering operations are simple, since each image is independent of the others. This situation, often called "embarrassingly parallel," only requires the infrastructure of a master node that allocates individual image analyses to the processing nodes according to their capacities. Parallelization reduces computation time when the image processing time is large in comparison with the image loading time.
While several cluster management and orchestration methods exist, we advocate for the use of Kubernetes, [84] an open-source platform, due to its availability and tight integration with Docker. This enables the use of heterogeneous nodes, as might be the case in small-scale laboratories. Kubernetes requires a Kubernetes cluster. One computer is assigned the role of the master node, with which the rest of the computers are registered. The master node dispatches pods (running processes) to the nodes according to their capacities. A job is the set of pods that are created to perform a set of operations on independent data (Figure 5).
Following the example of Figure 5, a pod processes a fov to extract cells and, for each of them, finds its contour and counts the number of PNPs bound to it. The job is to process all fovs. Pods are dispatched among the available servers according to their capacities. Several pods can run at the same time on the same node. In our laboratory, using three nodes, we have reduced the average processing time for each fov from 1 min to 12 s. This 5× improvement comes from parallelization both between the nodes and within the nodes.
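The effect of this speed-up on a typical day of acquisition can be worked out from the figures given in the text (20 fovs per experiment, four experiments per day, 650 MB per fov, 1 min versus 12 s per fov). The short calculation below is only a back-of-the-envelope estimate and ignores data transfer and scheduling overhead.

```python
FOVS_PER_EXPERIMENT = 20   # from the text
EXPERIMENTS_PER_DAY = 4    # from the text
MB_PER_FOV = 650           # from the text

fovs_per_day = FOVS_PER_EXPERIMENT * EXPERIMENTS_PER_DAY   # 80 fovs
gb_per_day = fovs_per_day * MB_PER_FOV / 1000              # 52.0 GB

single_node_minutes = fovs_per_day * 60 / 60               # 1 min per fov
cluster_minutes = fovs_per_day * 12 / 60                   # 12 s per fov
speedup = single_node_minutes / cluster_minutes

print(f"{fovs_per_day} fovs, {gb_per_day} GB per day")
print(f"single node: {single_node_minutes} min, cluster: {cluster_minutes} min")
print(f"speed-up: {speedup}x")
```

Under these figures a routine day of data fits comfortably in an overnight run on either configuration; the cluster becomes important when the backlog or the per-fov processing time grows.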
We run the jobs in the Kubernetes cluster every night, using a script triggered by the Linux cron utility. The script inspects the data structure, finds unprocessed images, and generates a Kubernetes pod description file for each such image. The master node then creates and orchestrates all pods, which are subsequently executed.
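A minimal sketch of how such pod description files could be generated is given below. It is an illustration under stated assumptions rather than the laboratory's script: the container image name `example-lab/fov-pipeline:latest`, the `fov-` name prefix, and the convention of passing the image path as the container argument are all hypothetical. The manifests are written as JSON, which Kubernetes accepts alongside YAML.

```python
import json
import re
from pathlib import Path

# Hypothetical container image holding the analysis code.
CONTAINER_IMAGE = "example-lab/fov-pipeline:latest"


def pod_name_for(image_path):
    """Derive a Kubernetes-compatible name (lowercase letters, digits, '-')."""
    stem = Path(image_path).stem.lower()
    safe = re.sub(r"[^a-z0-9-]+", "-", stem).strip("-")
    # Keep the name within 63 characters and avoid a trailing hyphen.
    return ("fov-" + safe)[:63].rstrip("-")


def pod_manifest(image_path):
    """Build a pod description that runs the pipeline on one image."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name_for(image_path)},
        "spec": {
            # The analysis runs once; the pod is not restarted afterwards.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "pipeline",
                    "image": CONTAINER_IMAGE,
                    # Assumption: the container takes the image path as its argument.
                    "args": [str(image_path)],
                }
            ],
        },
    }


def write_manifests(image_paths, out_dir):
    """Write one JSON pod description file per image; return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in image_paths:
        manifest = pod_manifest(image_path)
        target = out_dir / (manifest["metadata"]["name"] + ".json")
        target.write_text(json.dumps(manifest, indent=2))
        written.append(target)
    return written
```

The resulting directory of files could then be submitted to the cluster, for instance with `kubectl apply -f` followed by the directory path. Note that images sharing the same file name in different directories would map to the same pod name in this sketch, so a real script would need a more specific naming scheme.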
This use of Kubernetes demonstrates that these analysis pipelines can be parallelized and accelerated. A future challenge, especially if these types of analysis move out of the laboratory and into diagnostic settings, will be the need for far faster computing for on-demand or even real-time analysis. This so-called "edge computing" challenge may be well addressed by the emergence of purpose-built neural network processors that efficiently carry out machine learning tasks. While the training of neural network models largely requires GPU-powered workstations, the inference step can be carried out on these special-purpose chips, much as machine learning currently augments the cameras in many smartphones. These chips are being combined with system-on-a-chip devices (SoCs) in a new wave of single-board computers, such as the Google Coral. The low cost (≈$150) and small form factor of these edge computing devices may enable the construction of data analysis systems composed of dozens of boards capable of providing efficient and fast parallel computation for the pipeline approach described here.

Conclusions and Outlook
The analysis of large image datasets is an everyday research task for biomedical engineers. A concise data policy and automated image processing can help the everyday work of laboratories. Deep convolutional neural network methods show a high level of performance and can automate daunting tasks. Recent computer science techniques developed for cloud computing, data sharing, operating-system virtualization, and cluster orchestration provide abstractions of data and computation management applicable to biomedical engineering and translational research. While these technologies take some time to set up, in our experience, they can pay off greatly for medium-to-large research studies.
When dealing with medium-to-large datasets, we propose the following recommendations: a) centralize the data, b) establish a data storage policy, c) establish a data processing pipeline, d) process overnight automatically, e) use the same code for prototyping and processing, f) compartmentalize using container software, g) deploy using cluster management software if needed, and h) generate QC images of the results. These recommendations have been followed in our laboratory for the last 3 years and have enabled us to deal with multiterabyte research projects.
While the tools described here are applied to imaging data, they are not unique to images and can certainly be applied to other types of data, such as mass spectral data and DNA sequences.
We are now in the process of expanding the discussed analysis approaches into all current and future studies to eliminate analysis bottlenecks and accelerate the pace of translation. For those interested in further examples, a code repository has been set up via the Evans Laboratory GitHub page. [73]

Figure 5. Abstract layers of software development using Python as an example. The Python environment configuration is specified through the use of a requirements.txt file. The OS where the Python code runs is specified through the use of a Dockerfile. This OS is encapsulated within a container and the container within a Kubernetes pod. The cluster manager generates the set of pods that constitute a job (e.g., running a cell extraction routine of the pipeline of Figure 2 on all acquired images). The job is then executed on the processing cluster.