OLGA Surge Volume Bug?

We came across an interesting observation while designing the surge volume calculations in flotools, and I thought it was worth sharing with the flow assurance community and inviting comments. For those who are not 100% sure what surge volume is, let me first explain.

Calculating surge volumes is a routine part of a flow assurance engineer’s work. Operational scenarios like slugging, pigging, and production ramp-up in multiphase production systems can all result in large volumes of liquid being swept out of the pipeline and into the first vessel of the receiving facility. Often, these liquid surges arrive at rates that far exceed the receiving facility’s capacity to process liquids. The vessel, typically a slug catcher, therefore acts as a buffer where the surge of liquid can be collected and processed over time. One of the objectives of a flow assurance study is to quantify the maximum surge of liquid seen across the various operations in order to size the slug catcher appropriately. The maximum volume that a slug catcher has to hold for a given operation is called the surge volume.

OLGA provides a way to calculate surge volumes whenever at least one of ACCLIQ, ACCOIQ, and ACCWAQ is included in the list of trended outputs. The calculation assumes that the slug catcher is present just downstream of the location where these variables are trended and that the vessel can be drained at a fixed maximum drain rate during the operation.

The calculation performed by OLGA is described by the following equation:
(1)  V_{T_{start}}=0
(2)  V_{t+1}=\max{\bigg(0,V_t+ACC_{t+1}-ACC_t-Q_{drain}\cdot(T_{t+1}-T_t)\bigg)}
(3)  V_{surge}=\max{\big(V_t\big)}\:\text{ where }\:T_{start}\leq t\leq T_{end}
where

ACC_t is the OLGA reported cumulative volume of liquid at time step t,

T_t is the elapsed simulation time at time step t,

Q_{drain} is the maximum drain rate of the slug catcher,

T_{start} and T_{end}  mark the time window in the simulation where the calculation is done, and

V_{surge} is the calculated surge volume
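To make the recursion concrete, here is a minimal Python sketch of equations (1) to (3). It is my own illustration, not OLGA's code: the function name `surge_volume`, the trend values, and the units (hours and barrels) are assumptions, and the whole series is taken to be the T_{start} to T_{end} window.

```python
from typing import Sequence


def surge_volume(times: Sequence[float], acc: Sequence[float], q_drain: float) -> float:
    """Maximum slug catcher inventory following equations (1)-(3)."""
    if len(times) != len(acc):
        raise ValueError("times and acc must have the same length")
    volume = 0.0   # equation (1): the vessel starts empty
    v_surge = 0.0
    for i in range(len(times) - 1):
        inflow = acc[i + 1] - acc[i]                     # change in accumulated liquid
        drained = q_drain * (times[i + 1] - times[i])    # what the drain can remove
        volume = max(0.0, volume + inflow - drained)     # equation (2)
        v_surge = max(v_surge, volume)                   # equation (3)
    return v_surge


# Hypothetical trend data: time in hours, accumulated liquid in barrels.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
acc = [0.0, 100.0, 400.0, 450.0, 460.0]
print(surge_volume(times, acc, q_drain=100.0))  # 200.0
```

In this made-up example the vessel only fills during the second hour, when 300 barrels arrive against a drain capacity of 100 barrels, so the surge volume is 200 barrels.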

In this post I will look at two interesting properties of this calculation:

  • Why were the accumulation variables (ACC*) used instead of the instantaneous rate variables (QL*)?
  • Why is there a \max{(0,\:\ldots)} operation in equation (2)?

Accumulation Variables vs. Instantaneous Rate Variables

OLGA’s calculation of surge volume uses the accumulated variables as the basis of the surge volume calculation instead of the instantaneous liquid volume rate variables (QLT, QLTHL, QLTWT).  To understand why, let’s look at the instantaneous rate form of the surge volume equation.

When using the instantaneous rate variables, equation (2) becomes,
(4)  V_{t+1}=\max{\Bigg(0,V_t+\bigg(\frac{1}{2}\Big(Q_{t+1}+Q_t\Big)-Q_{drain}\bigg)\cdot (T_{t+1}-T_t)\Bigg)}
If we compare the accumulation terms from (2) and (4), we see the following assumed relationship:
(5)  ACC_{t+1}-ACC_{t}\simeq \frac{1}{2}\Big(Q_{t+1}+Q_t\Big)\cdot (T_{t+1}-T_t)
In other words, the average of the instantaneous rates in a particular time window is approximately equal to the average accumulation rate in that time window.  This is typically a bad assumption because the instantaneous rates capture rate spikes that are very short in duration and would not be indicative of the average rate for the corresponding time window. The average rate can be calculated as follows:
(6)  Q_{avg,t}=\frac{ACC_{t+1}-ACC_t}{T_{t+1}-T_t}
The following chart shows a comparison between an actual QLT output from OLGA and the associated average QLT calculated from ACCLIQ according to equation (6):

 

Figure: Instantaneous vs. Average Liquid Rate

You can see that the average QLT (from ACCLIQ) does not show the flowrate spikes that the QLT variable shows.  These spikes, while they probably do occur in a flowing system, typically occur in very short time windows smaller than the output interval of the simulation. The larger the output interval, the worse the assumption.
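The effect is easy to reproduce with a few made-up numbers. The sketch below compares the average rate from equation (6) with the trapezoidal estimate on the right-hand side of equation (5). The function names and the data are hypothetical: the liquid arrives at a steady 10 bbl/h on average, but one output sample happens to land on a short spike.

```python
def average_rates(times, acc):
    """Equation (6): interval-average rate from the accumulated volume."""
    return [
        (acc[i + 1] - acc[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


def trapezoid_increments(times, rates):
    """Right-hand side of equation (5): trapezoidal volume per interval."""
    return [
        0.5 * (rates[i + 1] + rates[i]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


# Hypothetical outputs sampled once per hour.
times = [0.0, 1.0, 2.0, 3.0]      # h
acc = [0.0, 10.0, 20.0, 30.0]     # bbl, accumulated liquid
qlt = [10.0, 10.0, 50.0, 10.0]    # bbl/h, the sample at t = 2 h catches a spike

avg = average_rates(times, acc)
trap = trapezoid_increments(times, qlt)
error = sum(trap) - (acc[-1] - acc[0])

print(avg)    # [10.0, 10.0, 10.0]
print(trap)   # [10.0, 30.0, 30.0]
print(error)  # 40.0
```

A single spiky sample inflates two neighbouring intervals, so the rate-based estimate overstates the 30 barrels that actually arrived by 40 barrels.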

The following chart shows an example of the resulting error in accumulation: the accumulation calculated from the rate variable minus the ACC variable reported by OLGA. While the maximum error in this example (~25 barrels) is not significant, the magnitude of the error depends entirely on the nature of the simulation and may be significant in some cases.

Figure: Error in accumulation calculated using flotools

In our view, OLGA has taken the correct approach and used the accumulated variables as the basis for the surge volume calculation.

Handling Negative Terms

Equation (2) features a \max operation. This ensures that the calculated volume in the slug catcher never goes below zero. But what happens when the quantity (ACC_{t+1}-ACC_t) becomes negative?

It is perfectly normal and valid for a numerical simulator to predict negative rates at an outlet boundary. When OLGA predicts negative rates at the outlet of the pipeline, the ACC variable may decrease from one time step to the next. When this happens, equation (2) reduces the calculated slug catcher volume at a rate faster than the assumed drain rate. Effectively, the calculation allows liquid to leave the slug catcher through its inlet as well as through the liquid drain. When you look at a schematic of a typical slug catcher, like the one shown below, it becomes apparent that this may not be such a sound assumption. Slug catchers are designed for gravity separation of phases, and hence the inlet nozzles are at or near the top of the vessel. Once the liquids go in, they quickly settle to the bottom. Any negative flow is likely to be mostly gas with very little liquid carried as droplets in the gas phase.

Figure: Slug catcher schematic

Depending on your case, using the OLGA basis for calculation may result in significant errors. In one case, we found a 10% error at a specific drain rate. The problem is that the error is not on the side of conservatism. We think the correct way to write equation (2) is as follows:
(7)  V_{t+1}=\max{\bigg(0,V_t+\max{\Big(ACC_{t+1}-ACC_t,0\Big)}-Q_{drain}\cdot(T_{t+1}-T_t)\bigg)}
In equation (7), we added another max function that bounds the quantity (ACC_{t+1}-ACC_t), the net liquid volume entering the slug catcher over a given time interval, so that it never drops below zero.
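The difference between equations (2) and (7) can be seen in a small sketch. This is my own illustration rather than OLGA or flotools code: the function name, the `clip_backflow` flag, and the trend values are hypothetical, with one interval in which the accumulated volume drops by 50 barrels because of negative flow at the outlet.

```python
def surge_volume(times, acc, q_drain, clip_backflow):
    """Surge volume per equation (2), or per equation (7) when clip_backflow is True."""
    volume = 0.0
    v_surge = 0.0
    for i in range(len(times) - 1):
        inflow = acc[i + 1] - acc[i]
        if clip_backflow:
            inflow = max(inflow, 0.0)   # equation (7): ignore negative accumulation
        drained = q_drain * (times[i + 1] - times[i])
        volume = max(0.0, volume + inflow - drained)
        v_surge = max(v_surge, volume)
    return v_surge


# Hypothetical trend: hours and barrels; ACC falls from 300 to 250 in the second hour.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
acc = [0.0, 300.0, 250.0, 550.0, 560.0]

olga_method = surge_volume(times, acc, q_drain=100.0, clip_backflow=False)
proposed_method = surge_volume(times, acc, q_drain=100.0, clip_backflow=True)
print(olga_method, proposed_method)  # 250.0 300.0
```

With these numbers, the back-flow interval empties the vessel by 150 barrels under equation (2) but only by the 100 barrels the drain can handle under equation (7), so the second surge starts from a higher level and the proposed method gives the larger, more conservative result.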

If the calculation is being done at the outlet of a pipeline that is connected to a pressure node, set the parameter GASFRACTION to 1.0 in your NODE specification. This will ensure that whenever there is negative flow at the outlet boundary, the negative flow is all gas. That said, we still think equation (7) is a better way to perform the surge volume calculation because it works well regardless of the boundary specification.

Figure: Comparison of Surge Volumes

The plot above shows a comparison of surge volumes calculated according to equations (2) and (7), labeled “OLGA Method” and “Proposed Method” respectively. We can see that filtering out the negative values results in larger surge volumes at lower drain rates. At large enough drain rates, the differences eventually disappear. Given that surge volume calculations are performed in order to size the slug catcher, we believe that equation (2) is not conservative and therefore should not be used. Instead, we recommend our modified version in equation (7), which gives a more conservative estimate of surge volume.

As always, your comments and feedback would be much appreciated.

The Truth about OLGA Speed


Recently, I saw a discussion of OLGA speed in the OLGA Users group on LinkedIn. The discussion starts with the question of why OLGA performs nearly the same on two different CPUs (an Intel Core i7 processor running at 3.4 GHz and an Intel Core i5 processor also running at 3.4 GHz). This result is surprising and troubling because the Core i5 is a considerably cheaper processor.

I have seen flow assurance companies buy expensive hardware in the hope of making OLGA go faster. Unfortunately, the results of such expense have been hit-or-miss. As a budding flow assurance consultant, I witnessed one of those misses. After purchasing very expensive hardware, we found that OLGA ran no faster than it did on desktop machines that were one year old. Since then, I have spent quite a bit of time looking at OLGA speed and working to understand what factors affect OLGA performance.

To help flow assurance companies considering such buying decisions, I thought it might be worthwhile sharing the knowledge I have gained through my investigation. Also, I thought it might be interesting to add some data and analysis to the discussion and look specifically at how the number of threads plays a role in OLGA speed. In the LinkedIn discussion, Torgeir Vanvik from Schlumberger offered some excellent insight into the way OLGA works, and I am hoping this post sheds more light on the topic of OLGA’s parallel performance.

Key factors that affect OLGA simulation speed

There are several key factors that affect OLGA simulation speed. Some have to do with the numerical modeling complexity and others have to do with the hardware on which OLGA is run.

On the modeling side, the most obvious factor is the complexity of the network being modeled. In general, single branch models run faster than networks and simple converging networks run faster than diverging networks or networks with looped lines. Unfortunately, this is not something flow assurance engineers can control so it is not worth discussing it further.

Next on the list are the section lengths and numerical time step. In OLGA, the simulation time step is controlled using the parameters MINDT and MAXDT in the INTEGRATION specification and also using the DTCONTROL parameters. To ensure model stability, simulations are typically run with the CFL condition controlling simulation time step. The CFL condition determines how much distance, relative to the length of a section in the model, the fluid in that section is allowed to move in one time step. The net effect is that the longer the section length, the longer your time steps are allowed to be, and vice versa. The INTEGRATION and DTCONTROL parameters along with section lengths have a profound impact on model speed. The model speed is typically governed by the smallest section in the network. I can write a whole treatise on this but that is a topic for another day.
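To illustrate why the smallest section tends to govern, here is a rough sketch of the textbook CFL limit, dt <= CFL * dx / |u|, applied section by section. It is not OLGA's time-step controller: the function name, the CFL number of 1.0, and the uniform 5 m/s velocity are all assumptions chosen for the example.

```python
def cfl_time_step(section_lengths, velocities, cfl=1.0):
    """Largest time step for which the fluid in every section moves
    at most `cfl` section lengths per step (textbook CFL criterion)."""
    return min(
        (cfl * dx / abs(u) for dx, u in zip(section_lengths, velocities) if u != 0.0),
        default=float("inf"),  # no moving fluid means no CFL restriction
    )


# Hypothetical pipeline: three sections with the same fluid velocity.
lengths = [173.0, 50.0, 11.0]   # m
velocities = [5.0, 5.0, 5.0]    # m/s

dt = cfl_time_step(lengths, velocities)
print(f"{dt:.2f} s")  # 2.20 s, set by the 11 m section
```

Under these assumptions, the 11 m section caps the time step for the entire model, even though the 173 m section on its own would allow a step more than fifteen times longer.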


On the hardware side, the key factors that affect simulation speed are CPU and I/O speed.

The processor

Modern CPUs have two specifications that are important for our purposes: clock speed and number of cores. The clock speed is indicative of how many instructions are processed per second, and the number of cores indicates how many instructions are processed in parallel. Modern versions of OLGA (6 and above) are able to exploit the power of multiple cores, whereas older versions of OLGA (5 and below) get no benefit from multi-core processors.

No matter what the version of OLGA, clock speed is important. Ultimately, it comes down to how many instructions can be processed per second, so the GHz of the processor (the bigger the better) is important.


For OLGA 6 and later versions, the number of cores will also play a role in the speed. However, it is easy to fall into the trap of believing that more cores will result in faster simulation speeds. The unfortunate reality is that some tasks benefit from being processed in parallel while others don’t. If the time to split the task into small problems is greater than the time savings resulting from parallel processing, the task will actually run slower. In other words, depending on the problem there is a theoretical limit to the gains from parallelization. This is also true for OLGA.
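A quick way to get a feel for this limit is Amdahl's law, which bounds the speedup by the fraction of the work that can run in parallel. The sketch below is illustrative only: the 90% parallel fraction is a made-up number, not a measured property of OLGA, and the law ignores the cost of splitting the work, so real speedups can be lower still.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: speedup when only `parallel_fraction` of the work scales with threads."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_threads)


def speedup_limit(parallel_fraction):
    """Speedup approached as the number of threads grows without bound."""
    return 1.0 / (1.0 - parallel_fraction)


p = 0.9  # hypothetical parallel fraction
for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(p, n), 2))
print("limit:", round(speedup_limit(p), 2))
```

Even with 90% of the work parallelized, the speedup in this example can never reach 10 no matter how many cores are added, and each doubling of the thread count buys less than the previous one.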

To answer practical questions like, “Is it better to have a 3.4 GHz, 4-core CPU or a 2.4 GHz, 16-core CPU?” requires some investigation into OLGA parallelizability. In fact, we explore that very topic later in this post.


I/O

Since OLGA outputs simulation results to the disk as it is running, the speed at which it can write out the results can limit (sometimes severely) the run-time speed. There are two common hardware bottlenecks, the hard drive speed (when OLGA is saving locally) and the network bandwidth (when OLGA is writing to a network drive).

Most commercial-grade desktop computers and laptops ship with mechanical hard drives that spin at 5400 or 7200 rpm, while server-grade machines often come with 10k or 15k rpm drives. The read/write access speed scales directly with the spin speed of the drive. In general, the greater the spin rate, the better the hard drive when it comes to OLGA speed. Solid state drives (SSDs) are now also available cheaply, and the technology has matured enough to be used in a commercial setting. However, the speed of SSDs ranges from worse than mechanical drives to exceptionally fast depending on the manufacturer and model. In other words, not all SSDs are as blazing fast as their marketing would have you believe, so choose carefully. It is also important to consider the computer bus interface, which determines the internal data transfer rates (though these days that interface is rarely the bottleneck). Ultimately, the hard drive performance can be as important to simulation speed as the CPU.


When saving OLGA results to a network share, the network can also limit how fast OLGA is able to write simulation results. Companies should therefore ensure that the bandwidth between the computer running OLGA and the network storage is as large as possible, which reduces the risk of the network slowing OLGA down.


These bottlenecks can also be avoided most of the time by carefully considering the frequency and quantity of simulation outputs.

The study

In order to understand the factors that influence parallel speedup, we used 8 different model configurations.

Model | Description                                        | Number of sections | Smallest section length (m)
1     | Single pipeline model                              | 190                | 173
2     | Single branch – fine mesh                          | 7000               | 11
3     | Single branch – coarse mesh                        | 376                | 21
4     | Converging pipeline network                        | 335                | 17
5     | Converging network with pressure-pressure boundary | 383                | 14
6     | Converging pipeline network – no flow              | 335                | 17
7     | Converging-diverging network (Loop)                | 60                 | 50
8     | Two separate networks                              | 426                | 50

Methodology

All models were run with no trend or profile outputs to eliminate the effect of I/O on parallel speedup. To ensure the results were repeatable, each model was run multiple times with a varying number of threads. A simple program was developed to run each model up to 20 times in 10 minutes (which ensured that all models ran at least 2 times and many ran the full 20 times). The average run time was then calculated for each model and thread combination. It is worth noting that the run times for each simulation iteration were nearly identical. OLGA 2014.2 was used for this study (see acknowledgments at the end). A thread is a part of a computer program that can be managed separately by the operating system. A single core in a modern CPU with simultaneous multithreading (for example, Intel Hyper-Threading) can handle two threads. The following command was used to set the number of threads used by OLGA:

opi.exe /t  <num_threads> <input_file>

All simulations were run on a machine with 4 physical cores and capable of running 8 threads in parallel.
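For readers who want to repeat the exercise, a timing harness along the lines described in the methodology could look like the sketch below. It is not the program we used: the function names, the input file name `model.genkey`, and the way the time budget is enforced (a new run only starts while the budget has not expired) are assumptions. The only OLGA-specific piece is the command line quoted above.

```python
import subprocess
import sys
import time


def olga_command(num_threads, input_file):
    """Build the command from the post: opi.exe /t <num_threads> <input_file>."""
    return ["opi.exe", "/t", str(num_threads), input_file]


def average_run_time(command, max_runs=20, budget_s=600.0):
    """Repeat `command` until max_runs is reached or the time budget expires.

    Returns (mean run time in seconds, number of completed runs).
    """
    durations = []
    started = time.perf_counter()
    while len(durations) < max_runs and time.perf_counter() - started < budget_s:
        t0 = time.perf_counter()
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - t0)
    return sum(durations) / len(durations), len(durations)


# "model.genkey" is a hypothetical input file name.
print(olga_command(4, "model.genkey"))

# Stand-in command so the sketch runs on a machine without OLGA installed.
mean_s, runs = average_run_time([sys.executable, "-c", "pass"], max_runs=3, budget_s=30.0)
print(f"{runs} runs, mean {mean_s:.3f} s")
```

To time a real model, pass `olga_command(n, input_file)` to `average_run_time` for each thread count `n` of interest.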

Results

The first plot shows the speedup achieved by the various models. The ideal speedup line shows that a model using n threads should be able to achieve a speedup of n compared to the 1-thread model. Note that when the number of threads is not specified, OLGA defaults to a thread count based on the number of CPU cores (in our case that is 4).

Figure: Speed-up achieved by various model types

The plot above shows that the best performing model achieves a speedup of 3 using 4 threads, and a speedup of 4 using 8 threads. The worst performing models cap off at a speedup of ~1.6 and achieve no additional speedup beyond 5 threads. In fact, the speedup of a few of the models drops when going from 7 threads to 8. However, this last artifact could be a result of using all available threads on the processor, leaving the OS to switch between the computational load and its own background services. We could only confirm this by running the test on an 8- or 16-core machine.

Another way to look at speedup is to look at a quantity called parallel efficiency which is the ratio of the actual speedup to the ideal speedup.
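As a small worked example, the sketch below computes both quantities from run times. The run times are hypothetical, picked so that they reproduce the best-case speedups quoted above (3 on 4 threads and 4 on 8 threads).

```python
def speedup(t_single, t_parallel):
    """Run time on one thread divided by run time on n threads."""
    return t_single / t_parallel


def parallel_efficiency(t_single, t_parallel, n_threads):
    """Actual speedup divided by the ideal speedup, which equals n_threads."""
    return speedup(t_single, t_parallel) / n_threads


# Hypothetical average run times in seconds, keyed by thread count.
run_times = {1: 1200.0, 4: 400.0, 8: 300.0}

for n, t in run_times.items():
    s = speedup(run_times[1], t)
    eff = parallel_efficiency(run_times[1], t, n)
    print(f"{n} threads: speedup {s:.1f}, efficiency {eff:.0%}")
```

With these numbers, 4 threads give an efficiency of 75%, while 8 threads give only 50%: the second set of four threads adds far less than the first.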

Figure: Parallel efficiency achieved by various model types

These two plots show that the parallel speedup tends to stagnate beyond 4 threads for most models. Most models are able to achieve a speedup of 2 or more when using 4 threads. However, by the time we get to 7 threads, only one model has a parallel efficiency of over 50%. In other words, we would be better off running two simulations simultaneously using 4 threads each, rather than running just one simulation using all 8 available threads.


Analysis

The parallel speedup and efficiency plots showed that the efficiency of parallelization varied between various model types. So the next question is what makes a model more or less parallelizable. The flow chart below shows a simplified program structure of a parallel program.

Figure: Typical program flow of a parallel numerical algorithm such as the one used in OLGA

 

In OLGA, the main calculation loop would be the time loop that marches time from start to the end of the simulation. The initial sequential process would be reading input files, tab files, etc. The final post-processing might include closing file handles, releasing memory, etc.

With that background in mind, we curve fitted the parallel efficiency curves with an exponential function of the following form:

\mu_p=e^{c(n_p-1)}

where

\mu_p is the parallel efficiency,

c is the parallel efficiency decay factor, and

n_p is the number of threads
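One simple way to obtain c from measured efficiencies is a least-squares fit in log space, since \ln(\mu_p) = c(n_p-1) is a straight line through the origin. The sketch below shows that approach on synthetic data generated with c = -0.1. Fitting in log space is an assumption made for this illustration, and the function name is hypothetical.

```python
import math


def fit_decay_factor(threads, efficiencies):
    """Least-squares slope through the origin of ln(mu_p) against (n_p - 1)."""
    xs = [n - 1 for n in threads]
    ys = [math.log(mu) for mu in efficiencies]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Synthetic efficiencies generated from mu_p = exp(c * (n_p - 1)) with c = -0.1.
threads = [1, 2, 4, 8]
efficiencies = [math.exp(-0.1 * (n - 1)) for n in threads]

c = fit_decay_factor(threads, efficiencies)
print(round(c, 6))
```

The fit needs at least one measurement with more than one thread; a decay factor closer to zero means the model keeps its efficiency as threads are added.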

I call the calculated c factor the parallel efficiency decay factor. We can then plot the decay factor as a function of various aspects of the model. Our analysis shows that the decay factor is a strong function of the model runtime and the number of sections in the model.

Figure: Parallel Efficiency Decay vs. Model Runtime

The plot above shows that the parallel efficiency decay factor is loosely a logarithmic function of the model run time. This makes sense and follows readily from the way parallel efficiency is formulated above (and from Amdahl’s law). Skipping some math jugglery, c can be written as:

c=\frac{\ln\left(\frac{t_s+t_p}{n_p\cdot t_s+t_p}\right)}{n_p-1}

where

t_s is the time spent in the sequential portions of the simulation, and

t_p is the time spent in the parallel portions of the simulation

When t_s\gg t_p, there is hardly any speedup, yielding a parallel efficiency of \frac{1}{n_p}; when t_p\gg t_s, c\rightarrow 0, yielding a parallel efficiency of 1. In between, we get a log-linear relationship.
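For those who want to see the skipped algebra, the expression for c can be reached in a few lines. This sketch assumes the Amdahl's law timing model, in which only the parallel portion t_p is divided among the threads; the symbols T(n_p) for run time and S(n_p) for speedup are introduced here for convenience.

```latex
\begin{align*}
T(n_p) &= t_s + \frac{t_p}{n_p}
  && \text{(run time on } n_p \text{ threads)} \\
S(n_p) &= \frac{T(1)}{T(n_p)} = \frac{t_s + t_p}{t_s + t_p/n_p}
  && \text{(speedup)} \\
\mu_p &= \frac{S(n_p)}{n_p} = \frac{t_s + t_p}{n_p\cdot t_s + t_p}
  && \text{(parallel efficiency)} \\
e^{c(n_p-1)} &= \frac{t_s + t_p}{n_p\cdot t_s + t_p}
  \;\;\Longrightarrow\;\;
  c = \frac{\ln\!\left(\frac{t_s+t_p}{n_p\cdot t_s+t_p}\right)}{n_p-1}
\end{align*}
```

The two limits follow directly from the third line: with t_p negligible the ratio reduces to 1/n_p, and with t_s negligible it reduces to 1.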

The plot below shows the parallel efficiency decay factor as a function of the number of sections in the model. As the number of sections increases, parallel efficiency gets better. Note that at 7000 sections, the decay factor is ~-0.1, which is probably close to the theoretical limit based on the strictly sequential parts of the simulation.

Figure: Parallel Efficiency Decay vs. Number of Sections

This also makes sense: according to the OLGA manual, the computations performed in each time step are the ones that run in parallel, and their number is directly proportional to the number of sections. So, the higher the number of sections, the better the parallel efficiency. However, there is a limit to the parallel efficiency, as there are always sequential parts of the algorithm that cannot be parallelized.


To sum it up…

Getting back to the discussion of hardware choices and their impact on OLGA speed, number of cores, clock speed and I/O speed are all significant factors. Recent versions of OLGA are multi-threaded and have the ability to run faster by utilizing multiple processor cores. We did a detailed analysis on how number of cores can impact OLGA speed and whether it is prudent to spend money on cores.

OLGA defaults to using as many threads as the number of cores available. In our analysis, the best speedup we achieved with 4 threads was ~3, a 75% parallel efficiency. In general, the more compute intensive the simulation, the better the speedup. For short simulations, multi-threading did not help. Even for a long simulation with 7000 sections, going from 4 threads to 8 threads only bumped up the speedup from 3 to 4. In general the parallel efficiency tapers off as we go beyond 4 threads. Based on our analysis, I reckon that 4 threads is a sweet spot for running flow assurance models in OLGA. You could of course fiddle with this for individual models but I would not recommend spending time on it.


Keeping in line with our findings, the OLGA manual advises that it is better to use the available cores for simultaneous simulations rather than using them to speed up an individual simulation. However, this advice is a bit naive. For example, most professional desktop or laptop systems today have 4 cores but do not have the hard drive access speeds to support 4 simultaneous simulations writing data. The right choice lies somewhere in the middle.

If you are making a hardware buying decision, I would not go beyond 4 cores when buying a computer with a mechanical hard drive. If you have enough OLGA licenses and want to centralize your simulations on one machine, the storage choice is as important as the processor choice. I would also recommend setting the OMP_NUM_THREADS environment variable to 4 in order to run OLGA at an optimum parallel efficiency.

We welcome you to share your experiences and provide us feedback. If there is enough interest, we will explore the effect of CPU clock speed and disk I/O in detail in future posts.

Acknowledgments

We thank Dr. Ivor Ellul and RPS Group for running OLGA simulations and for valuable suggestions related to the analysis presented here.
