Recently, I saw a discussion of OLGA speed in the OLGA Users group on LinkedIn. The discussion starts with the question of why OLGA performs nearly the same on two different CPUs (an Intel Core i7 processor running at 3.4 GHz and an Intel Core i5 processor also running at 3.4 GHz). This result is surprising and troubling because the Core i5 is a considerably cheaper processor.
I have seen flow assurance companies buy expensive hardware in the hope of making OLGA go faster. Unfortunately, the results of such expense have been hit-or-miss. As a budding flow assurance consultant, I witnessed one of those misses. After purchasing very expensive hardware, we found that OLGA ran no faster than it did on desktop machines that were a year old. Since then, I have spent quite a bit of time looking at OLGA speed and working to understand what factors impact OLGA performance.
To help flow assurance companies considering such buying decisions, I thought it might be worthwhile sharing the knowledge I have gained through my investigation. Also, I thought it might be interesting to add some data and analysis to the discussion and look specifically at how the number of threads plays a role in OLGA speed. In the LinkedIn discussion, Torgeir Vanvik from Schlumberger offered some excellent insight into the way OLGA works, and I am hoping this post sheds more light on the topic of OLGA’s parallel performance.
Key factors that affect OLGA simulation speed
There are several key factors that affect OLGA simulation speed. Some have to do with the numerical modeling complexity and others have to do with the hardware on which OLGA is run.
On the modeling side, the most obvious factor is the complexity of the network being modeled. In general, single branch models run faster than networks, and simple converging networks run faster than diverging networks or networks with looped lines. Unfortunately, this is not something flow assurance engineers can control, so it is not worth discussing further.
Next on the list are the section lengths and the numerical time step. In OLGA, the simulation time step is controlled using the MAXDT parameter in the INTEGRATION specification and the DTCONTROL parameters. To ensure model stability, simulations are typically run with the CFL condition controlling the simulation time step. The CFL condition determines how much distance, relative to the length of a section in the model, the fluid in that section is allowed to move in one time step. The net effect is that the longer the section length, the longer your time steps are allowed to be, and vice versa. The DTCONTROL parameters, along with section lengths, have a profound impact on model speed. The model speed is typically governed by the smallest section in the network. I could write a whole treatise on this, but that is a topic for another day.
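As a rough illustration of how the CFL condition ties section length to time step, here is a minimal sketch. The section lengths and velocity are hypothetical numbers, not taken from any specific OLGA model:

```python
# Sketch of the CFL time-step limit: in one time step, the fluid may
# travel at most one section length (scaled by a CFL factor).
def max_cfl_dt(section_length_m, velocity_m_s, cfl=1.0):
    """Largest stable time step for one section under the CFL condition."""
    return cfl * section_length_m / velocity_m_s

# The whole network is limited by its most restrictive section:
sections = [173.0, 50.0, 11.0]   # section lengths in metres (hypothetical)
velocity = 5.0                   # mixture velocity in m/s (hypothetical)
dt_limit = min(max_cfl_dt(length, velocity) for length in sections)
print(dt_limit)  # 2.2 s -- the 11 m section governs the whole network
```

Note how a single short section drags the time step down for the entire network, which is exactly why the smallest section tends to govern model speed.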
The model speed is typically governed by the smallest section in the network
On the hardware side, the key factors that affect simulation speed are CPU and I/O speed.
Modern CPUs have two specifications that are important for our purposes – clock speed and number of cores. The clock speed is indicative of how many instructions are processed per second, and the number of cores indicates how many instruction streams can be processed in parallel. Modern versions of OLGA (6 and above) are able to exploit the power of multiple cores, whereas older versions (5 and below) get no benefit from multi-core processors.
No matter what the version of OLGA, clock speed is important. Ultimately, it comes down to how many instructions can be processed per second, so the GHz of the processor (the bigger the better) is important.
No matter what the version of OLGA, clock speed is important
For OLGA 6 and later versions, the number of cores also plays a role in speed. However, it is easy to fall into the trap of believing that more cores will always result in faster simulations. The unfortunate reality is that some tasks benefit from being processed in parallel while others don't. If the time to split a task into smaller sub-tasks is greater than the time saved by processing them in parallel, the task will actually run slower. In other words, depending on the problem, there is a theoretical limit to the gains from parallelization. This is also true for OLGA.
Answering practical questions like, “Is it better to have a 3.4 GHz, 4-core CPU or a 2.4 GHz, 16-core CPU?” requires some investigation into OLGA's parallelizability. In fact, we explore that very topic later in this post.
Depending on the problem there is a theoretical limit to the gains from parallelization
Since OLGA outputs simulation results to the disk as it is running, the speed at which it can write out the results can limit (sometimes severely) the run-time speed. There are two common hardware bottlenecks, the hard drive speed (when OLGA is saving locally) and the network bandwidth (when OLGA is writing to a network drive).
Most consumer-grade desktop computers and laptops ship with mechanical hard drives that spin at 5400 or 7200 rpm, while server-grade machines often come with 10k or 15k rpm drives. The read/write speed scales directly with the spin speed of the drive, so in general, the greater the spin rate, the better the hard drive when it comes to OLGA speed. Solid state drives (SSDs) are now cheaply available, and the technology has matured enough for commercial use. However, SSD speeds range from worse than mechanical drives to exceptionally fast depending on the manufacturer and model. In other words, not all SSDs are as blazing fast as their marketing would have you believe, so choose carefully. It is also important to consider the computer's bus interface, which determines internal data transfer rates (though these days that interface is rarely the bottleneck). Ultimately, hard drive performance can be as important to simulation speed as the CPU.
Ultimately, the hard drive performance can be as important to simulation speed as the CPU
When saving OLGA results to a network share, the network can also limit the ability of OLGA to write simulation results. Companies should therefore ensure that the bandwidth between the computer running OLGA and the network storage is as large as possible. This will alleviate any slowdowns in OLGA speed.
When saving OLGA results to a network share, the network can also limit the ability of OLGA to write simulation results
These bottlenecks can also usually be avoided by carefully considering the frequency and quantity of simulation outputs.
To understand the factors that influence parallel speedup, we used 8 different model configurations.
| Model | Description | Number of sections | Smallest section length (m) |
| --- | --- | --- | --- |
| 1 | Single pipeline model | 190 | 173 |
| 2 | Single branch – fine mesh | 7000 | 11 |
| 3 | Single branch – coarse mesh | 376 | 21 |
| 4 | Converging pipeline network | 335 | 17 |
| 5 | Converging network with pressure-pressure boundary | 383 | 14 |
| 6 | Converging pipeline network – no flow | 335 | 17 |
| 7 | Converging-diverging network (loop) | 60 | 50 |
| 8 | Two separate networks | 426 | 50 |
All models were run with no trend or profile outputs to eliminate the effect of I/O on parallel speedup. To ensure the results were repeatable, each model was run multiple times with a varying number of threads. A simple program was developed to run each model up to 20 times in 10 minutes (which ensured all models ran at least 2 times and many ran the full 20 times). The average run time was then calculated for each model and thread combination. It is worth noting that the run times for each simulation iteration were nearly identical. OLGA 2014.2 was used for this study (see the acknowledgments at the end). A thread is the smallest unit of a program that can be scheduled independently by the operating system; a single core in a modern CPU can typically handle two threads. The following command was used to set the number of threads used by OLGA:
```
opi.exe /t <num_threads> <input_file>
```
All simulations were run on a machine with 4 physical cores and capable of running 8 threads in parallel.
The first plot shows the speedup achieved by the various models. The ideal speedup line shows that a model using n threads should ideally achieve a speedup of n compared to the 1-thread run. Note that when the number of threads is not specified, OLGA defaults to the number of CPU cores (in our case, 4).
The plot above shows that the best performing model achieves a speedup of 3 using 4 threads, and a speedup of 4 using 8 threads. The worst performing models cap off at a speedup of ~1.6 and achieve no additional speedup beyond 5 threads. In fact, the speedup of a few of the models actually decreases when going from 7 threads to 8. However, this last artifact could be a result of using all available threads on the processor, leaving the OS to switch between the computational load and background services. We could only confirm this by running the test on an 8- or 16-core machine.
Another way to look at speedup is through a quantity called parallel efficiency, which is the ratio of the actual speedup to the ideal speedup.
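In code, these two quantities are one-liners. The timings in the example are hypothetical, chosen only to illustrate the arithmetic:

```python
def speedup(t1, tn):
    """Speedup of an n-thread run relative to the 1-thread run."""
    return t1 / tn

def parallel_efficiency(t1, tn, n):
    """Ratio of the actual speedup to the ideal speedup of n."""
    return speedup(t1, tn) / n

# Hypothetical: a model takes 120 s on 1 thread and 40 s on 4 threads.
print(speedup(120.0, 40.0))              # 3.0
print(parallel_efficiency(120.0, 40.0, 4))  # 0.75, i.e. 75% efficiency
```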
These two plots show that the parallel speedup tends to stagnate beyond 4 threads for most models. Most models are able to achieve a speedup of 2 or more when using 4 threads. However, by the time we get to 7 threads, only one model has a parallel efficiency of over 50%. In other words, we would be better off running two simulations simultaneously using 4 threads each, rather than running just one simulation using all 8 available threads.
Parallel speedup tends to stagnate beyond 4 threads for most models
The parallel speedup and efficiency plots showed that the efficiency of parallelization varied between model types. So the next question is: what makes a model more or less parallelizable? The flow chart below shows a simplified structure of a parallel program.
In OLGA, the main calculation loop would be the time loop that marches from the start to the end of the simulation. The initial sequential process would be reading input files, tab files, etc. The final post-processing might include closing file handles, releasing memory, etc.
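In code form, that simplified structure looks roughly like the sketch below. This is my own toy illustration of the sequential-setup / parallel-loop / sequential-teardown pattern, not OLGA's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(sections, n_steps, threads=4):
    """Sequential setup, a time loop with parallelizable per-section work,
    then sequential teardown."""
    state = [0.0] * len(sections)              # sequential: read inputs, initialize
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(n_steps):               # time loop: marches start to end
            # the per-section work inside a step is what can run in parallel
            state = list(pool.map(lambda s: s + 1.0, state))
    return sum(state)                          # sequential: post-processing
```

Only the work inside the loop body parallelizes; the setup, teardown, and the step-to-step dependency of the time loop itself remain sequential, which is what ultimately caps the speedup.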
With that background in mind, we fitted the parallel efficiency curves with an exponential function of the form E(n) = e^(k·(n − 1)), where n is the number of threads (so that E(1) = 1). I call the fitted constant k the parallel efficiency decay factor. We can then plot the decay factor as a function of various aspects of the model. Our analysis shows that the decay factor is a strong function of the model runtime and the number of sections in the model.
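Assuming the fitted form is E(n) = e^(k·(n − 1)) — my reading of the fit, since the exact functional form is not reproduced here — the decay factor k is simply the slope of ln(E) against (n − 1), which an ordinary least-squares fit recovers. The efficiencies below are hypothetical:

```python
import math

def decay_factor(threads, efficiency):
    """Least-squares slope of ln(E) vs (n - 1), assuming E(n) = exp(k*(n-1))."""
    x = [n - 1 for n in threads]
    y = [math.log(e) for e in efficiency]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

# Hypothetical efficiencies measured at 1..4 threads:
k = decay_factor([1, 2, 3, 4], [1.0, 0.85, 0.72, 0.61])
print(k)  # negative: efficiency decays as threads are added
```

A more negative k means efficiency falls off faster as threads are added; k near zero means the model parallelizes well.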
The plot above shows that the parallel efficiency is loosely a logarithmic function of the model run time. This makes sense and follows readily from the way parallel efficiency is formulated above, together with Amdahl's law. Writing T1 for the single-thread run time and ts for the strictly sequential portion of it, the n-thread run time is ts + (T1 − ts)/n, and, skipping some math jugglery, the parallel efficiency can be rearranged to:

E(n) = T1 / (T1 + (n − 1)·ts)

When ts ≈ T1, there is hardly any speedup, yielding a parallel efficiency of 1/n, and when ts ≪ T1, the speedup approaches n, yielding a parallel efficiency of 1. In between, we get a log-linear relationship.
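These two limits can be checked numerically with a small Amdahl-style sketch. Here T1 is the single-thread run time and ts the strictly sequential part of it (my notation, with hypothetical numbers):

```python
def efficiency(T1, ts, n):
    """Parallel efficiency under Amdahl's law: only (T1 - ts) parallelizes."""
    Tn = ts + (T1 - ts) / n          # n-thread run time
    return (T1 / Tn) / n             # speedup divided by the ideal speedup n

# Entirely sequential work: efficiency collapses to 1/n.
print(efficiency(100.0, 100.0, 4))   # 0.25
# Entirely parallel work: efficiency reaches 1.
print(efficiency(100.0, 0.0, 4))     # 1.0
```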
The plot below shows the parallel efficiency decay factor as a function of the number of sections in the model. As the number of sections increases, parallel efficiency gets better. Note that at 7000 sections, the decay factor is ~-0.1, which is probably close to the theoretical limit set by the strictly sequential parts of the simulation.
This also makes sense: the number of computations performed in each time step is directly proportional to the number of sections, and according to the OLGA manual, these are the computations that are performed in parallel. So, the higher the number of sections, the better the parallel efficiency. However, there is a limit to the parallel efficiency, as there are always sequential parts of the algorithm that cannot be parallelized.
The higher the number of sections, the better the parallel efficiency
To sum it up…
Getting back to the discussion of hardware choices and their impact on OLGA speed: the number of cores, clock speed, and I/O speed are all significant factors. Recent versions of OLGA are multi-threaded and can run faster by utilizing multiple processor cores. We did a detailed analysis of how the number of cores impacts OLGA speed and whether it is prudent to spend money on cores.
OLGA defaults to using as many threads as there are cores available. In our analysis, the best speedup we achieved with 4 threads was ~3, a 75% parallel efficiency. In general, the more compute intensive the simulation, the better the speedup. For short simulations, multi-threading did not help. Even for a long simulation with 7000 sections, going from 4 threads to 8 threads only bumped the speedup from 3 to 4. In general, the parallel efficiency tapers off as we go beyond 4 threads. Based on our analysis, I reckon that 4 threads is a sweet spot for running flow assurance models in OLGA. You could, of course, fiddle with this for individual models, but I would not recommend spending time on it.
Four threads is a sweet spot for running flow assurance models in OLGA
Keeping in line with our findings, the OLGA manual advises that it is better to use the available cores for simultaneous simulations rather than using them to speed up an individual simulation. However, this advice is a bit naive. For example, most professional desktop or laptop systems today have 4 cores but do not have the hard drive access speeds to support 4 simultaneous simulations writing data. The right choice lies somewhere in the middle.
If you are making a hardware buying decision, I would not go beyond 4 cores for a computer with a mechanical hard drive. If you have enough OLGA licenses and want to centralize your simulations on one machine, the storage choice is as important as the processor choice. I would also recommend setting the OMP_NUM_THREADS environment variable to 4 in order to run OLGA at an optimum parallel efficiency.
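If you launch batches of simulations from a script, one way to pin the thread count is to set OMP_NUM_THREADS just for the OLGA process rather than system-wide. A minimal sketch (the opi.exe invocation mirrors the command shown earlier; the input file is a placeholder):

```python
import os
import subprocess

def olga_env(threads=4):
    """Copy of the current environment with OMP_NUM_THREADS pinned."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)
    return env

def run_olga(input_file, threads=4):
    """Launch OLGA with the thread count set for this process only."""
    return subprocess.run(["opi.exe", input_file],
                          env=olga_env(threads), check=True)
```

Scoping the variable to the launched process avoids accidentally throttling other OpenMP applications on the same machine.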
We welcome you to share your experiences and provide us feedback. If there is enough interest, we will explore the effect of CPU clock speed and disk I/O in detail in future posts.
We thank Dr. Ivor Ellul and RPS Group for running OLGA simulations and for valuable suggestions related to the analysis presented here.