- 21 Mar, 2023 4 commits
-
-
Kirill Smelkov authored
This way xlog.Reader can be sure that if it looked around in such a window and did not find a sync, then something is wrong with the stream and there is no need to look beyond that limit. This is a change of the protocol. But it is early days, and the existing logs that we use in the demo are all below the 1000-line limit, so they will continue to load fine. There is no direct test for actual Loss Of Sync detection: this functionality is a draft for now and should be improved later. However, for no-LOS cases xlog.Reader is already covered by tests.
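The bounded look-around described above can be sketched as follows. This is only an illustration, not the actual xlog.Reader code: the function name `find_sync`, the `LossOfSync` exception, and the assumption that every xlog entry is a JSON line whose sync events look like `{"meta": {"event": "sync", ...}}` are all hypothetical.

```python
import json

LOS_WINDOW = 1000   # lines; the limit mentioned in the commit message


class LossOfSync(Exception):
    """Raised when no sync event is found inside the look-around window."""


def find_sync(lines, window=LOS_WINDOW):
    """Return the index of the first sync entry among the first `window` lines.

    `lines` is any iterable of text lines. The entry layout is an assumption
    made for this sketch, not the documented xlog format.
    """
    for i, line in enumerate(lines):
        if i >= window:
            break
        try:
            entry = json.loads(line)
        except ValueError:
            continue    # e.g. a partial line at a chunk boundary
        meta = entry.get("meta") if isinstance(entry, dict) else None
        if isinstance(meta, dict) and meta.get("event") == "sync":
            return i
    raise LossOfSync("no sync event found while looking around")
```

A reader that starts in the middle of a stream would call `find_sync` first and only begin interpreting entries from the returned position.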
-
Kirill Smelkov authored
Let's use "sync(reason=start)" instead of a dedicated "start" event for uniformity. Periodic syncs are now "sync(reason=periodic)", and after log rotation support is added there will also be "pre-logrotate" and "post-logrotate" reasons. Emit "sync(reason=stop)" at xlog shutdown for uniformity and to make it clear, from looking at just enb.xlog, what the xlog state is at the end. Stop requiring "start" to be present in the header: we will soon rework the xlog reader to look around for a nearby sync automatically, so that reading can be started from any position in the stream.
-
Kirill Smelkov authored
We have been emitting config_get after every attach since the beginning of xlog in e0cc8a38 (amari.xlog: Initial draft). The reasoning is that it is useful by default to know the configuration of a service. In the previous patch we added sync events so that the xlog stream becomes self-synchronizing. To continue that line it is now useful to have config_get emitted after every such synchronization point instead of only after attaching to the service. That is what this patch does. As a bonus, the code is reworked so that config_get setup is no longer hardcoded, and users can now control config_get periodicity by explicitly specifying config_get in the logspec.
-
Kirill Smelkov authored
So that xlog stream becomes self-synchronized and could be used even if we start reading it from some intermediate point instead of only from the beginning. We will need this in general, to be able to start reading a long log not only from its beginning, and in particular for Wendelin systems, where logs are uploaded by Fluentd in chunks and some chunks could potentially be lost. Sync events are always emitted unconditionally, with the default sync interval being 10x the longest specified period. We also provide users a way to control sync periodicity by explicitly specifying a "meta.sync/period" query in the logspec. See !3 (comment 175796) and further for related discussion. This is a change of the xlog protocol. But it is early days and the only direct consumer of xlog is amari.kpi, which we adjust accordingly. So it should be ok.
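The default rule above (sync interval = 10x the longest specified period, unless the user sets it explicitly) can be illustrated with a small sketch. The helper name `sync_period` and the representation of a logspec as a plain `{query: period_in_seconds}` dict are assumptions made for this example and are not the xlog API.

```python
def sync_period(logspecs, factor=10):
    """Pick the sync period for a set of logspecs.

    `logspecs` maps query name -> period in seconds (hypothetical layout).
    An explicit "meta.sync" entry wins; otherwise the period is `factor`
    times the longest period among the regular queries.
    """
    explicit = logspecs.get("meta.sync")
    if explicit is not None:
        return explicit
    periods = [p for q, p in logspecs.items() if not q.startswith("meta.")]
    if not periods:
        raise ValueError("no periodic queries to derive the sync period from")
    return factor * max(periods)
```

With `stats` polled every 60s and `ue_get` every 10s this gives a sync every 600s, so a reader that joins mid-stream waits at most that long for the next synchronization point.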
-
- 17 Mar, 2023 2 commits
-
-
Kirill Smelkov authored
We will soon need to construct logspecs not only by way of parsing.
-
Kirill Smelkov authored
Patch 79d10eb9 wired ctx through xlog call chains and added corresponding handling of cancellation. But I overlooked one place where plain sleep was used. -> Fix it.
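The difference between a plain sleep and a cancellable one can be shown with the standard library alone. The real code works with pygolang contexts; the sketch below uses `threading.Event` as a stand-in for ctx cancellation, and the name `sleep_or_cancel` is hypothetical.

```python
import threading


def sleep_or_cancel(cancel, dt):
    """Sleep for up to `dt` seconds, waking up early if `cancel` is set.

    Returns True if the sleep was interrupted by cancellation and False if
    the full timeout elapsed. A plain time.sleep(dt) would ignore `cancel`
    and keep the caller blocked for the whole interval.
    """
    return cancel.wait(timeout=dt)
```

A polling loop would then check the return value and exit promptly instead of finishing its pause first.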
-
- 09 Mar, 2023 3 commits
-
-
Kirill Smelkov authored
The most interesting patches are

- d102ffaa (drb: Start of the package)
- 5bf7dc1c (amari.{drb,xlog}: Provide aggregated DRB statistics in the form of synthetic x.drb_stats message)
- 499a7c1b (amari.kpi: Teach LogMeasure to handle x.drb_stats messages)
- 2824f50d (kpi: Calc: Add support for E-UTRAN IP Throughput KPI)
- 4b2c8c21 (demo/kpidemo.*: Add support for E-UTRAN IP Throughput KPI + demonstrate it in the notebook)

The other patches introduce or adjust needed infrastructure. A byproduct of particular note is that kpi.Measurement now supports QCI.

A demo might be seen in the last part of https://nbviewer.org/urls/lab.nexedi.com/kirr/xlte/raw/43aac33e/demo/kpidemo.ipynb

And below we provide an overview of the implementation.

Overview of E-UTRAN IP Throughput computation
---------------------------------------------

Before we begin explaining how IP Throughput is computed, let's first refresh what it is and have a look at what is required to compute it reasonably.

This KPI is defined in TS 32.450 [1] and aggregates transmission volume and time over bursts of transmissions from an average UE point of view. It should be particularly noted that only the time during which transmission is going on should be accounted. For example if a UE receives 10KB over a 4ms burst and the rest of the time there is no transmission to it during, say, 1 minute, the downlink IP Throughput for that UE over the minute is 20Mbit/s (= 8·10KB/4ms), not 1.3Kbit/s (= 8·10KB/60s). This KPI basically shows what would be the speed to e.g. download a response for an HTTP request issued from a mobile.

[1] https://www.etsi.org/deliver/etsi_ts/132400_132499/132450/16.00.00_60/ts_132450v160000p.pdf#page=13

To compute IP Throughput we thus need to know Σ of transmitted amount of bytes, and Σ of the time of all transmission bursts.

Σ of the bytes is relatively easy to get. eNB already provides close values in overall `stats` and in per-UE `ue_get[stats]` messages.
However, nothing is readily available out of the box for Σ of bursts transmission time. Thus we need to measure the time of transmission bursts ourselves somehow.

It turns out that with the current state of things the only practical way to measure it to some degree is to poll eNB frequently with `ue_get[stats]` and estimate transmission time based on δ of `ue_get` timestamps.

Let's see how frequently we need to poll to get reasonable accuracy of the resulting throughput.

A common situation for HTTP requests issued via LTE is that response content downloading time takes only a few milliseconds. For example I used chromium network profiler to access various sites via internet tethered from my phone and saw that for many requests response content downloading time was e.g. 4ms, 5ms, 3.2ms, etc. The accuracy of measuring transmission time should thus be on the order of a millisecond to cover that properly. It makes a real difference for reported throughput whether, say, a download sample with 10KB took 4ms, or took e.g. "something under 100ms". In the first case we know that for that sample downlink throughput is 2500KB/s, while in the second case all we know is that downlink throughput is "higher than 100KB/s" - a 25 times difference and not certain. Similarly if we poll at 10ms rate we would get that throughput is "higher than 1000KB/s" - a 2.5 times difference from actual value. The accuracy of 1 millisecond coincides with TTI time and with how downlink/uplink transmissions generally work in LTE.

With the above the scheme to compute IP Throughput looks to be as follows: poll eNB at 1000Hz rate for `ue_get[stats]`, process retrieved information into per-UE and per-QCI streams, detect bursts on each UE/QCI pair, and aggregate `tx_bytes` and `tx_time` from every burst.

It looks to be straightforward, but 1000Hz polling will likely create non-negligible additional load on the system and disturb eNB itself, introducing much jitter and harming its latency requirements.
That's probably why eNB actually rate-limits WebSocket requests not to go higher than 100Hz - a frequency 10 times lower than what we need to get reasonable accuracy for IP throughput.

Fortunately there is additional information that provides a way to improve accuracy of measured `tx_time` even when polled every 10ms at 100Hz rate: that additional information is the number of transport blocks transmitted to/from a UE. If we know that during a 10ms frame e.g. 4 transport blocks were transmitted to the UE, that there were no retransmissions *and* that eNB is not congested, we can reasonably estimate that it was actually a 4ms transmission. And if eNB is congested we can still say that transmission time is somewhere in the `[4ms, 10ms]` interval because transmitting each transport block takes 1 TTI. Even if imprecise, that still provides some information that could be useful.

Also 100Hz polling turns out to be acceptable from a performance point of view and does not disturb the system much. For example on the callbox machine the process that issues polls takes only about 3% of CPU load and only on one core, the CPU usage of eNB practically does not change, and its reported tx/rx latency does not change either. For sure, there is some disturbance, but it appears to be small. To have a better idea of what rate of polling is possible, I've made an experiment with the poller accessing my own websocket echo server quickly implemented in python. Both the poller and the echo server are not optimized, but without rate-limiting they could go to 8000Hz frequency while reaching 100% CPU usage of one CPU core. That 8000Hz is 80 times higher than the 100Hz frequency actually allowed by eNB. This shows what kind of polling frequency limit the system can handle, if absolutely needed, and that 100Hz turns out to be not so high a frequency. Also the Linux 5.6 kernel, installed on the callbox from Fedora32, is configured with `CONFIG_HZ=1000`, which is likely helping here.

Implementation overview
~~~~~~~~~~~~~~~~~~~~~~~

The scheme to compute E-UTRAN IP Throughput is thus as follows: poll eNB at 100Hz frequency for `ue_get[stats]` and retrieve information about per-UE/QCI streams and the number of transport blocks dl/ul-ed to the UE in question during that 10ms frame. Estimate `tx_time` taking into account the number of transmitted transport blocks. And estimate whether eNB is congested or not based on `dl_use_avg`/`ul_use_avg` taken from `stats`. For the latter we also need to poll for `stats` at 100Hz frequency and synchronize `ue_get[stats]` and `stats` requests in time so that they both cover the same time interval of a particular frame.

Then organize the polling process to provide aggregated statistics in the form of a new `x.drb_stats` message, and teach `xamari xlog` to save those messages to `enb.xlog` together with `stats`.

Then further adjust `amari.kpi.LogMeasure` and generic `kpi.Measurement` and `kpi.Calc` to handle DRB-related data.

That is how it is implemented.

The main part, which performs 100Hz polling and flow aggregation, is in amari/drb.py. There `Sampler` extracts bursts of data transmissions from the stream of `ue_get[stats]` observations, and `x_stats_srv` organizes the whole 100Hz sampling process and provides aggregated `x.drb_stats` messages to `amari.xlog`.

Even though the main idea is relatively straightforward, several aspects deserve to be noted:

1. information about transmitted bytes and corresponding transmitted transport blocks is emitted by eNB not synchronized in time. The reason here is that, for example, for DL a block is transmitted via PDCCH+PDSCH during one TTI, and then the base station awaits HARQ ACK/NACK. That ACK/NACK comes later via PUCCH or PUSCH. The time window in between original transmission and reception of the ACK/NACK is 4 TTIs for FDD and 4-13 TTIs for TDD(*).
And Amarisoft LTEENB updates counters for dl_total_bytes and dl_tx at different times:

    ue.erab.dl_total_bytes      - right after sending data on PDCCH+PDSCH
    ue.cell.{dl_tx,dl_retx}     - after receiving ACK/NACK via PUCCH|PUSCH

this way an update to dl_total_bytes might be seen in one frame (= 10·TTI), while the corresponding update to dl_tx/dl_retx might be seen in either the same, or the next, or the next-next frame.

`Sampler` brings δ(tx_bytes) and #tx_tb in sync itself via `BitSync`.

(*) see e.g. Figure 8.1 in "An introduction to LTE, 2nd ed."

2. when we see multiple transmissions related to a UE on different QCIs, we cannot directly use the corresponding global number of transport blocks to estimate transmission times because we do not know how the eNB scheduler placed those transmissions onto the resource map. So without additional information we can only estimate corresponding lower and upper bounds.

3. for output stability and to avoid throughput being affected by partial fill of the tail TTI of a burst, E-UTRAN IP Throughput is required to be computed without taking into account the last TTI of every sample. We don't have that level of detail since all we have is the total amount of transmitted bytes in a burst and an estimation of how long in time the burst is. Thus, once again, we can only provide an estimation so that the resulting E-UTRAN IP Throughput uncertainty window covers the right value required by the 3GPP standard.

A curious reader might be interested to look at tests in `amari/drb_test.py`, and at the whole set of changes that brought E-UTRAN IP Throughput alive.

Limitations
~~~~~~~~~~~

Current implementation has the following limitations:

- we account whole PDCP instead of only IP traffic.
- the KPI is computed with an uncertainty window instead of being precise, even when the connection to eNB is alive all the time. The shorter the bursts, the greater the uncertainty.
- the implementation works correctly for FDD, but not for TDD. That's because BitSync currently supports only the "next frame" case and support for the "next-next frame" case is marked as TODO.
- eNB `t` monitor command practically stops working and now only reports ``Warning, remote API ue_get (stats = true) pending...`` instead of reporting useful information. This is because, contrary to `stats`, eNB does not maintain per-connection state for `ue_get` and uses global singleton counters.
- the performance overhead might be more noticeable on machines less powerful than the callbox.

To address the limitations I plan to talk to Amarisoft about eNB improvements so that E-UTRAN IP Throughput could be computed precisely from DRB statistics directly provided by eNB itself.

However it is still useful to have the current implementation, even with all its limitations, because it already works today with existing eNB versions.

Kirill
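The arithmetic used in the overview can be checked with a few lines of Python. This is a worked example only: the helper names are hypothetical, KB is taken as 1000 bytes to match the figures in the text, and the bounds function assumes there were no retransmissions in the frame.

```python
def throughput_bps(tx_bytes, tx_time_ms):
    """Throughput in bit/s for `tx_bytes` transmitted during `tx_time_ms`."""
    return 8 * tx_bytes * 1000 / tx_time_ms


def tx_time_bounds_ms(n_tb, congested, frame_ms=10, tti_ms=1):
    """Bounds on transmission time inside one polled frame.

    Each transport block takes one TTI, so n_tb blocks need at least
    n_tb TTIs. When eNB is congested the blocks may be spread over the
    whole frame, so the upper bound becomes the frame length.
    """
    lo = n_tb * tti_ms
    hi = frame_ms if congested else lo
    return lo, hi


burst = throughput_bps(10_000, 4)         # only the 4ms burst is counted
naive = throughput_bps(10_000, 60_000)    # whole minute counted instead
```

Here `burst` evaluates to 20 Mbit/s and `naive` to about 1.3 kbit/s, which are the two figures contrasted in the text, and `tx_time_bounds_ms(4, congested=True)` gives the `[4ms, 10ms]` interval.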
-
Kirill Smelkov authored
Noticed while developing support for E-UTRAN IP Throughput.
-
Kirill Smelkov authored
Show how to compute that KPI, add corresponding plotting routines, and teach kpidemo.py to display both E-RAB Accessibility and E-UTRAN IP Throughput simultaneously in the same window. Add a corresponding demonstration to the demo notebook with data from a throughput experiment showcasing several scenarios and how the E-UTRAN IP Throughput implementation handles them.
-
- 08 Mar, 2023 15 commits
-
-
Kirill Smelkov authored
- move code to load amari.kpi.LogMeasure -> kpi.MeasurementLog into load_measurements(). We will need to use that when showcasing E-UTRAN IP Throughput KPI to load another enb.xlog dataset.
- factor code to iterate over MeasurementLog and invoke kpi.Calc on each period into calc_each_period(). Same reason.
- factor plotting code into helper routines located only in kpidemo.py. The notebook version now uses those routines by way of importing. The plotting code is not helping to understand the KPI computation pipeline usage, so it makes sense not to show it out of the box in the demo notebook.
-
Kirill Smelkov authored
It is useful to verify E-UTRAN IP Throughput KPI implementation, as that KPI is defined in terms of burst samples.
-
Kirill Smelkov authored
This patch provides the final building block for E-UTRAN IP Throughput KPI. It continues

    d102ffaa (drb: Start of the package)
    5bf7dc1c (amari.{drb,xlog}: Provide aggregated DRB statistics in the form of synthetic x.drb_stats message)
    499a7c1b (amari.kpi: Teach LogMeasure to handle x.drb_stats messages)

Quoting those patches

    The scheme to compute E-UTRAN IP Throughput is thus as follows: poll eNB at 100Hz frequency for `ue_get[stats]` and retrieve information about per-UE/QCI streams and the number of transport blocks dl/ul-ed to the UE in question during that 10ms frame. Estimate `tx_time` taking into account the number of transmitted transport blocks. And estimate whether eNB is congested or not based on `dl_use_avg`/`ul_use_avg` taken from `stats`. For the latter we also need to poll for `stats` at 100Hz frequency and synchronize `ue_get[stats]` and `stats` requests in time so that they both cover the same time interval of particular frame.

    Then organize the polling process to provide aggregated statistics in the form of new `x.drb_stats` message, and teach `xamari xlog` to save those messages to `enb.xlog` together with `stats`.

    Then further adjust `amari.kpi.LogMeasure` and generic
    `kpi.Measurement` and `kpi.Calc` to handle DRB-related data.       <-- NOTE

So here we implement that last noted step:

We add Calc.eutran_ip_throughput() whose implementation is relatively straightforward as the hard part is done by amari.drb and amari.kpi - in the Calc we basically need to only divide provided DRB.IPVolDl / DRB.IPTimeDl.
-
Kirill Smelkov authored
This patch provides next building block for E-UTRAN IP Throughput KPI and continues

    d102ffaa (drb: Start of the package)
    5bf7dc1c (amari.{drb,xlog}: Provide aggregated DRB statistics in the form of synthetic x.drb_stats message)

Quoting those patches

    The scheme to compute E-UTRAN IP Throughput is thus as follows: poll eNB at 100Hz frequency for `ue_get[stats]` and retrieve information about per-UE/QCI streams and the number of transport blocks dl/ul-ed to the UE in question during that 10ms frame. Estimate `tx_time` taking into account the number of transmitted transport blocks. And estimate whether eNB is congested or not based on `dl_use_avg`/`ul_use_avg` taken from `stats`. For the latter we also need to poll for `stats` at 100Hz frequency and synchronize `ue_get[stats]` and `stats` requests in time so that they both cover the same time interval of particular frame.

    Then organize the polling process to provide aggregated statistics in the form of new `x.drb_stats` message, and teach `xamari xlog` to save those messages to `enb.xlog` together with `stats`.

    Then further adjust `amari.kpi.LogMeasure`                         <-- NOTE
    and generic `kpi.Measurement` and `kpi.Calc` to handle DRB-related data.

So here we implement the noted step:

We teach LogMeasure to take x.drb_stats messages into account and update IP Throughput related fields in appropriate Measurement from x.drb_stats data.

This process is relatively straightforward besides one place: for stable output E-UTRAN IP Throughput is required to be computed without taking into account last TTI of every sample. We don't have that level of details since all we have is total amount of transmitted bytes in a burst and estimation of how long in time the burst is. Thus we can only provide an estimation for the E-UTRAN IP Throughput as follows:

DRB.IPVol and DRB.IPTime are collected to compute throughput.

    thp = ΣB*/ΣT*

where B* is tx'ed bytes in the sample without taking last tti into account and T* is time of tx also without taking that sample's tail tti.

we only know ΣB (whole amount of tx), ΣT and ΣT* with some error.

-> thp can be estimated to be inside the following interval:

      ΣB             ΣB
    ─────  ≤ thp ≤ ──────       (1)
    ΣT_hi          ΣT*_lo

the upper layer in xlte.kpi will use the following formula for final throughput calculation:

          DRB.IPVol
    thp = ──────────            (2)
          DRB.IPTime

-> set DRB.IPTime and its error to mean and δ of ΣT_hi and ΣT*_lo so that (2) becomes (1).

for this to work we also need to introduce new fields to Measurement that represent error of DRB.IPTime. The hope is that introduction is temporary and should be removed once we rework DRB stats to provide B* and T* directly.
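To see why storing the mean and δ of ΣT_hi and ΣT*_lo lets formula (2) reproduce interval (1), here is a small numeric sketch. The function names and the sample numbers are hypothetical and do not come from the xlte code; δ is taken here as the half-width of the interval.

```python
def ip_time_and_error(t_hi, t_star_lo):
    """Return (mean, δ) of the two time bounds ΣT_hi and ΣT*_lo."""
    t = (t_hi + t_star_lo) / 2
    err = (t_hi - t_star_lo) / 2
    return t, err


def throughput_interval(vol, t, err):
    """Apply thp = vol / time at both ends of the time uncertainty window.

    The largest time (t + err = ΣT_hi) yields the lower throughput bound,
    the smallest time (t - err = ΣT*_lo) yields the upper bound.
    """
    return vol / (t + err), vol / (t - err)


# 10000 bytes with ΣT_hi = 10ms and ΣT*_lo = 4ms (made-up numbers)
t, err = ip_time_and_error(t_hi=10, t_star_lo=4)
thp_lo, thp_hi = throughput_interval(10_000, t, err)
```

With these numbers the stored time is 7ms ± 3ms and the resulting throughput window is [1000, 2500] bytes/ms, i.e. exactly ΣB/ΣT_hi and ΣB/ΣT*_lo.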
-
Kirill Smelkov authored
We added LogMeasure in 71087f67 (amari.kpi: New package with driver for Amarisoft LTE stack to retrieve KPI-related measurements from logs) and its original logic is to read `stats` messages and to create a Measurement that covers [Sx, Sx+1) only after seeing Sx+1. However in the next patch we will also need to take into account other smaller messages besides stats, and for those messages we need the Measurement being prepared to already exist, so that we can amend it with the partial data we see.

So we need to rework the process to create the Measurement that will cover [Sx, Sx+1) right after seeing Sx, without waiting for Sx+1 to come in.

This patch does that. Along the way it unifies how events and stats are handled. Previously events and stats were handled via different objects and the code had many scattered places that tried to handle cases like event-event, event-stats, stats-event and stats-stats. And for all those cases the intent was that we still want to emit a corresponding Measurement for each of them, even if it carries only NA data besides timestamps. Thus it does not make sense to split events and stats into different flows, as we can handle all combinations by considering just one flow of "stats or events". This simplifies the logic and removes several sporadic branches of code that emit M(ø) around events. It also uncovers several places where we were not emitting such M(ø) even though the intent was to do so. All this is fixed now with updated tests.
-
Kirill Smelkov authored
This patch provides next building block for E-UTRAN IP Throughput KPI and continues d102ffaa (drb: Start of the package). Quoting that patch

    The scheme to compute E-UTRAN IP Throughput is thus as follows: poll eNB at 100Hz frequency for `ue_get[stats]` and retrieve information about per-UE/QCI streams and the number of transport blocks dl/ul-ed to the UE in question during that 10ms frame. Estimate `tx_time` taking into account the number of transmitted transport blocks. And estimate whether eNB is congested or not based on `dl_use_avg`/`ul_use_avg` taken from `stats`. For the latter we also need to poll for `stats` at 100Hz frequency and synchronize `ue_get[stats]` and `stats` requests in time so that they both cover the same time interval of particular frame.

    Then organize the polling process to provide aggregated statistics in the form of  <-- NOTE
    new `x.drb_stats` message, and teach `xamari xlog` to save those messages to       <-- NOTE
    `enb.xlog` together with `stats`.                                                  <-- NOTE

    Then further adjust `amari.kpi.LogMeasure` and generic `kpi.Measurement` and `kpi.Calc` to handle DRB-related data.

So here we implement the noted step:

- add drb._x_stats_srv server that polls eNB at 100Hz rate, uses Sampler to extract bursts and aggregates information about those bursts.
- teach xlog to organize servers for synthetic messages and communicate with them, and register drb._x_stats_srv as such server to handle generation of x.drb_stats message.
-
Kirill Smelkov authored
A utility class to compute avg/std incrementally. Thanks to https://www.johndcook.com/blog/standard_deviation/ for the recipe of how to do it.
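For reference, the running mean/variance recurrence described in the linked recipe can be sketched as below. This is not the class added by the patch: the name `IncStats` and its method names are hypothetical, and the sketch follows the recipe in using the sample variance (division by n-1), which may differ from what the actual code does.

```python
import math


class IncStats:
    """Incrementally computed average and standard deviation."""

    def __init__(self):
        self.n = 0          # number of observations seen so far
        self._mean = 0.0    # running mean
        self._s = 0.0       # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self._mean
        self._mean += delta / self.n
        self._s += delta * (x - self._mean)   # uses the already updated mean

    def avg(self):
        if self.n == 0:
            raise ValueError("no data")
        return self._mean

    def var(self):
        # sample variance; defined as 0 for fewer than two observations
        return self._s / (self.n - 1) if self.n > 1 else 0.0

    def std(self):
        return math.sqrt(self.var())


s = IncStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    s.add(x)
```

The point of the recurrence is that no past observations need to be stored, which matters when values arrive at 100Hz for an unbounded time.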
-
Kirill Smelkov authored
This package will be used to implement E-UTRAN IP Throughput KPI. In this patch we add `drb.Sampler` that extracts samples of transmission bursts from `ue_get[stats]` observations.

Let's go through what E-UTRAN IP Throughput KPI is and how it motivates functionality provided by this patch.

Overview of E-UTRAN IP Throughput computation
---------------------------------------------

This KPI is defined in TS 32.450 [1] and aggregates transmission volume and time over bursts of transmissions from an average UE point of view. It should be particularly noted that only the time during which transmission is going on should be accounted. For example if a UE receives 10KB over a 4ms burst and the rest of the time there is no transmission to it during, say, 1 minute, the downlink IP Throughput for that UE over the minute is 20Mbit/s (= 8·10KB/4ms), not 1.3Kbit/s (= 8·10KB/60s). This KPI basically shows what would be the speed to e.g. download a response for an HTTP request issued from a mobile.

[1] https://www.etsi.org/deliver/etsi_ts/132400_132499/132450/16.00.00_60/ts_132450v160000p.pdf#page=13

To compute IP Throughput we thus need to know Σ of transmitted amount of bytes, and Σ of the time of all transmission bursts.

Σ of the bytes is relatively easy to get. eNB already provides close values in overall `stats` and in per-UE `ue_get[stats]` messages. However, nothing is readily available out of the box for Σ of bursts transmission time. Thus we need to measure the time of transmission bursts ourselves somehow.

It turns out that with the current state of things the only practical way to measure it to some degree is to poll eNB frequently with `ue_get[stats]` and estimate transmission time based on δ of `ue_get` timestamps.

Let's see how frequently we need to poll to get reasonable accuracy of the resulting throughput.

A common situation for HTTP requests issued via LTE is that response content downloading time takes only a few milliseconds.
For example I used chromium network profiler to access various sites via internet tethered from my phone and saw that for many requests response content downloading time was e.g. 4ms, 5ms, 3.2ms, etc. The accuracy of measuring transmission time should thus be on the order of a millisecond to cover that properly. It makes a real difference for reported throughput whether, say, a download sample with 10KB took 4ms, or took e.g. "something under 100ms". In the first case we know that for that sample downlink throughput is 2500KB/s, while in the second case all we know is that downlink throughput is "higher than 100KB/s" - a 25 times difference and not certain. Similarly if we poll at 10ms rate we would get that throughput is "higher than 1000KB/s" - a 2.5 times difference from actual value. The accuracy of 1 millisecond coincides with TTI time and with how downlink/uplink transmissions generally work in LTE.

With the above the scheme to compute IP Throughput looks to be as follows: poll eNB at 1000Hz rate for `ue_get[stats]`, process retrieved information into per-UE and per-QCI streams, detect bursts on each UE/QCI pair, and aggregate `tx_bytes` and `tx_time` from every burst.

It looks to be straightforward, but 1000Hz polling will likely create non-negligible additional load on the system and disturb eNB itself, introducing much jitter and harming its latency requirements. That's probably why eNB actually rate-limits WebSocket requests not to go higher than 100Hz - a frequency 10 times lower than what we need to get reasonable accuracy for IP throughput.

Fortunately there is additional information that provides a way to improve accuracy of measured `tx_time` even when polled every 10ms at 100Hz rate: that additional information is the number of transport blocks transmitted to/from a UE. If we know that during a 10ms frame e.g. 4 transport blocks were transmitted to the UE, that there were no retransmissions *and* that eNB is not congested, we can reasonably estimate that it was actually a 4ms transmission. And if eNB is congested we can still say that transmission time is somewhere in the `[4ms, 10ms]` interval because transmitting each transport block takes 1 TTI. Even if imprecise, that still provides some information that could be useful.

Also 100Hz polling turns out to be acceptable from a performance point of view and does not disturb the system much. For example on the callbox machine the process that issues polls takes only about 3% of CPU load and only on one core, the CPU usage of eNB practically does not change, and its reported tx/rx latency does not change either. For sure, there is some disturbance, but it appears to be small. To have a better idea of what rate of polling is possible, I've made an experiment with the poller accessing my own websocket echo server quickly implemented in python. Both the poller and the echo server are not optimized, but without rate-limiting they could go to 8000Hz frequency while reaching 100% CPU usage of one CPU core. That 8000Hz is 80 times higher than the 100Hz frequency actually allowed by eNB. This shows what kind of polling frequency limit the system can handle, if absolutely needed, and that 100Hz turns out to be not so high a frequency. Also the Linux 5.6 kernel, installed on the callbox from Fedora32, is configured with `CONFIG_HZ=1000`, which is likely helping here.

Implementation overview
~~~~~~~~~~~~~~~~~~~~~~~

The scheme to compute E-UTRAN IP Throughput is thus as follows: poll eNB at 100Hz frequency for `ue_get[stats]` and retrieve information about per-UE/QCI streams and the number of transport blocks dl/ul-ed to the UE in question during that 10ms frame. Estimate `tx_time` taking into account the number of transmitted transport blocks. And estimate whether eNB is congested or not based on `dl_use_avg`/`ul_use_avg` taken from `stats`.
For the latter we also need to poll for `stats` at 100Hz frequency and synchronize `ue_get[stats]` and `stats` requests in time so that they both cover the same time interval of a particular frame.

Then organize the polling process to provide aggregated statistics in the form of a new `x.drb_stats` message, and teach `xamari xlog` to save those messages to `enb.xlog` together with `stats`.

Then further adjust `amari.kpi.LogMeasure` and generic `kpi.Measurement` and `kpi.Calc` to handle DRB-related data.

----------------------------------------

In this patch we provide the first building block - `Sampler` that extracts bursts of data transmissions from the stream of `ue_get[stats]` observations.

Even though the main idea behind `Sampler` is relatively straightforward, several aspects deserve to be noted:

1. information about transmitted bytes and corresponding transmitted transport blocks is emitted by eNB not synchronized in time. The reason here is that, for example, for DL a block is transmitted via PDCCH+PDSCH during one TTI, and then the base station awaits HARQ ACK/NACK. That ACK/NACK comes later via PUCCH or PUSCH. The time window in between original transmission and reception of the ACK/NACK is 4 TTIs for FDD and 4-13 TTIs for TDD (*). And Amarisoft LTEENB updates counters for dl_total_bytes and dl_tx at different times:

    ue.erab.dl_total_bytes      - right after sending data on PDCCH+PDSCH
    ue.cell.{dl_tx,dl_retx}     - after receiving ACK/NACK via PUCCH|PUSCH

this way an update to dl_total_bytes might be seen in one frame (= 10·TTI), while the corresponding update to dl_tx/dl_retx might be seen in either the same, or the next, or the next-next frame.

We bring `δ(tx_bytes)` and `#tx_tb` in sync ourselves via _BitSync.

(*) see e.g. Figure 8.1 in "An introduction to LTE, 2nd ed."

2. when we see multiple transmissions related to a UE on different QCIs, we cannot directly use the corresponding number of transport blocks to estimate transmission times because we do not know how the eNB scheduler placed those transmissions onto the resource map. So without additional information we can only estimate corresponding lower and upper bounds.
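The core idea of burst extraction can be sketched in a few lines. This is a heavily simplified illustration, not the `Sampler` from the patch: the function name and the input layout are hypothetical, it handles a single UE/QCI stream, and it assumes that the byte and transport-block counters are already synchronized, which is exactly the part the real code has to solve via _BitSync.

```python
def extract_bursts(frames, frame_ms=10, tti_ms=1):
    """Group consecutive non-idle frames into bursts.

    `frames` is an iterable of (tx_bytes, tx_tb) pairs, one per polled 10ms
    frame. Each burst is returned as (tx_bytes, tx_time_lo_ms, tx_time_hi_ms):
    the lower time bound counts one TTI per transport block, the upper bound
    counts every frame touched by the burst in full.
    """
    bursts = []
    cur = None                  # [bytes, time_lo, time_hi] of the open burst
    for tx_bytes, tx_tb in frames:
        if tx_bytes > 0:
            if cur is None:
                cur = [0, 0, 0]
            cur[0] += tx_bytes
            cur[1] += tx_tb * tti_ms
            cur[2] += frame_ms
        elif cur is not None:   # an idle frame closes the burst
            bursts.append(tuple(cur))
            cur = None
    if cur is not None:
        bursts.append(tuple(cur))
    return bursts


observed = [(0, 0), (6000, 2), (4000, 2), (0, 0), (500, 1)]
bursts = extract_bursts(observed)
```

For the `observed` sequence this yields two bursts: 10000 bytes over [4ms, 20ms] and 500 bytes over [1ms, 10ms].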
-
Kirill Smelkov authored
We will soon need to run 2 threads:

- one with the main logger, and
- another one to serve requests for synthetic x.drb_stats queries

Both the main and the second thread will be run via sync.WorkGroup to cancel each other in case of failure somewhere. So since WorkGroup.wait(), similarly to all pygolang operations, is not interrupted by signals(*), we need to wire ctx to be passed through all operations and manage to cancel that context on SIGINT/SIGTERM.

This patch:

1. adjusts xlog to wire ctx through all call chains and moves ._xlog1() to be run in the thread.
2. adjusts amari.Conn to take ctx as argument on all operations and react reasonably on that ctx cancel. We need to do it here because xlog uses Conn internally.
3. adjusts the xamari main driver to set up a root context that is canceled on SIGINT/SIGTERM similarly e.g. to how nxdtest does it in nexedi/nxdtest@b0cf277d .

(*) see nexedi/pygolang@e18adbab for details.
-
Kirill Smelkov authored
We will soon add more levels of trying to this part of the code and linear defers are easier to follow compared to many levels of try/except nesting.
-
Kirill Smelkov authored
We will soon need this to know at runtime the address of eNB service attached by Conn to establish another connection attached to the same eNB.
-
Kirill Smelkov authored
Previously for Measurement fields with .QCI or .CAUSE suffix we had only the .sum value and no per-QCI nor per-CAUSE values. In other words support for QCI and CAUSE was a stub.

In this patch we add support for QCI: every field X.QCI is now automatically expanded into X[256] array and X.sum . For convenience we also provide X.<qci> aliases that alias X[qci]. For example field DRB.IPVolDl.9 aliases the 9th element of the DRB.IPVolDl array.

We will need QCI support for E-UTRAN IP Throughput KPI which is required to provide resulting values for every QCI individually.

CAUSE support remains a stub for now.
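The expansion of an X.QCI field into X[256], X.sum and X.<qci> aliases can be pictured with a toy model. Everything below is hypothetical: the real kpi.Measurement has its own storage and NA handling, and its Σqci helper takes the 'X.QCI' field name, while this sketch uses plain lists, `None` as NA, and the bare field name.

```python
NA = None   # stand-in for "not available"


class MiniMeasurement:
    """Toy container where each QCI field has 256 per-QCI slots and a sum."""

    def __init__(self, qci_fields):
        self._per_qci = {f: [NA] * 256 for f in qci_fields}
        self._sum = {f: NA for f in qci_fields}

    def _split(self, name):
        base, _, tail = name.rpartition('.')
        if base not in self._per_qci:
            raise KeyError(name)
        return base, tail

    def __getitem__(self, name):            # m['X.sum'] or m['X.<qci>']
        base, tail = self._split(name)
        if tail == 'sum':
            return self._sum[base]
        return self._per_qci[base][int(tail)]

    def __setitem__(self, name, value):
        base, tail = self._split(name)
        if tail == 'sum':
            self._sum[base] = value
        else:
            self._per_qci[base][int(tail)] = value


def sigma_qci(m, field):
    """Total over all QCIs: X.sum if present, else the sum of known X.<qci>.

    Returns NA, not 0, when nothing is known at all.
    """
    total = m[field + '.sum']
    if total is not NA:
        return total
    known = [v for v in (m['%s.%d' % (field, q)] for q in range(256))
             if v is not NA]
    return sum(known) if known else NA
```

The alias form keeps per-QCI access as simple as access to a scalar field, e.g. `m['DRB.IPVolDl.9']`.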
-
Kirill Smelkov authored
There was a thinko that led to returning 0 instead of NA when there are no .QCI or .CAUSE fields except .sum . Without the added fix, e.g. Σqci(Measurement(), 'ERAB.EstabInitAttNbr.QCI') was returning 0 instead of NA. -> Fix it.
-
Kirill Smelkov authored
DRB.IPThpVol  -> DRB.IPVol   (no "Thp" inside)
DRB.IPThpTime -> DRB.IPTime  (no "Thp" inside)

TS 32.450 and TS 32.425 define those names as in the corrected variants - please see corresponding references in the code for details. It was my thinko in dc1d5481 (kpi: Start of the package) to use "Thp" in the names.
-
Kirill Smelkov authored
We were not caring about that and so previously e.g. NA(np.int16) was giving int instead of np.int16 . Fix it.
-
- 05 Jan, 2023 2 commits
-
-
Kirill Smelkov authored
Before this patch it was complaining:

    (xlte3.venv) kirr@deca:~/src/wendelin/xlte$ check-manifest
    lists of files in version control and sdist do not match!
    missing from sdist:
      demo/kpidemo.ipynb
      demo/kpidemo.py
      xlte.py
    suggested MANIFEST.in rules:
      include *.py
      recursive-include demo *.ipynb
      recursive-include demo *.py

Noticed during !2 review.
-
Xavier Thompson authored
Make `python setup.py sdist` include CHANGELOG.rst in the tar.gz. Without it, installing from the released tar.gz will fail because setup.py expects to read CHANGELOG.rst to generate the long description. /reviewed-by @kirr /reviewed-on !2
-
- 03 Jan, 2023 2 commits
-
-
Xavier Thompson authored
Fix off-by-one error in `amari.xlog.LogSpec.parse` which truncated the last character of options, e.g. `stats[rf]/60s` would parse option `r` instead of `rf`. /reviewed-by @kirr /reviewed-on !1
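For illustration, a minimal parser for the `query[opts]/period` form could look as follows. It is a sketch under the assumption that a logspec is `<query>[<opt>,...]/<N>s`; parse_logspec is a hypothetical helper, not the actual `amari.xlog.LogSpec.parse` code.

```python
def parse_logspec(text):
    """Parse 'query[opt1,opt2]/period' into (query, opts, period_seconds)."""
    head, sep, period = text.partition("/")
    if not sep or not period.endswith("s"):
        raise ValueError("invalid logspec %r: expected <query>[<opts>]/<period>s" % text)
    query, opts = head, []
    if head.endswith("]"):
        lb = head.index("[")          # raises ValueError if '[' is missing
        query = head[:lb]
        # The slice ends at -1, so only the closing ']' is dropped.
        # Stopping one character earlier would turn 'rf' into 'r'.
        inner = head[lb + 1:-1]
        opts = [o for o in inner.split(",") if o]
    return query, opts, float(period[:-1])
```

The slice bounds are the part where an off-by-one of the kind described above typically hides, so they are worth covering with a test on a multi-character last option.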
-
Xavier Thompson authored
Fix `amari.Conn._send_msg` to include the arguments in the message sent. Without it, logspec options such as `stats[samples,rf]/60s` are ignored. /reviewed-by @kirr /reviewed-on !1
-
- 12 Dec, 2022 12 commits
-
-
Kirill Smelkov authored
Add JupyterLab notebook that shows how to build KPI-computing pipeline and to compute the KPIs. The notebook comes with extensive comments describing every step. Please see those comments for details.
-
Kirill Smelkov authored
Add demo program that shows how to build KPI-computing pipeline and to compute the KPIs. It can be used e.g. as follows:

    $ ./demo/kpidemo.py 60 https://lab.nexedi.com/kirr/misc/raw/162307b9/lte/20221211-overload.xlog

The program comes with extensive comments describing every step. Please see those comments for details. The next patch will also add analogous JupyterLab notebook.
-
Kirill Smelkov authored
kpi.Calc is calculator to compute KPIs. It can be instantiated on MeasurementLog and time interval over which to perform computations. It currently implements calculations for only one "E-RAB Accessibility KPI". Please see added docstrings and tests for details. The next patch will also add demo program that uses all kpi.Calc and other parts of KPI-computation pipeline to build and visualize E-RAB Accessibility from real data.
-
Kirill Smelkov authored
amari.kpi: New package with driver for Amarisoft LTE stack to retrieve KPI-related measurements from logs

amari.kpi provides LogMeasure that takes enb.xlog (TODO and enb.log) as input, and produces kpi.Measurements on output.

    enb.xlog     ─────────
    ─────────>  │   Log   │
                │         │ ────> []kpi.Measurement
    ─────────>  │ Measure │
    enb.log      ─────────

We read log data organizing periods around stats queries, and for now we build Measurement from stats' counters. To do so we take δ(stats_prev, stat) and process it mapping Amarisoft counters to 3GPP ones specified by kpi.Measurement. We emit measurement X after reading stats X+2 - i.e. we emit measurement for a period after reading data covering _next_ period. It is organized this way to account for init/fini correction:

               fini adjust
              -------------
              '           '
       Sx     v   Sx+1    '   Sx+2
    ────|───────────|───────────|────
         Measurement Measurement
              X          X+1

This approach has following limitations:

- for most of the counters there is no direct mapping in between Amarisoft and 3GPP. For example we currently use s1_erab_setup_request for ERAB.EstabAddAtt.sum, but this mapping is not strictly correct and will break if corresponding S1 E-RAB SETUP REQUEST message contains multiple ERABs. The code has corresponding FIXME marks where such approximations are used.

- it is not possible to implement init/fini correction precisely. From aggregated statistics we only get total amount for a fini value for a period - without knowing which part of it corresponds to init events from previous period, and which part to init events from current one. With that it is only possible to make a reasonable guess and try to preserve statistical properties, but not more. See m_initfini in the code for details.

- it is possible to handle eNB with single cell only. This limitation comes from the fact that in Amarisoft LTE stack S1-related counters come as "globals" ones, while e.g. RRC-related counters are "per-cell". It is thus not possible to see how much S1 connection establishments are associated with one particular cell if there are several of them.

TODO also parse enb.log to fix those issues.
-
Kirill Smelkov authored
xlog.Reader could be used to parse and read back data previously saved by xlog. In the next patch we will use it in Amarisoft driver for KPI measurements.
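As a rough picture of what reading such a log back involves, the sketch below parses xlog-style JSON lines and separates xlog's own events ({"meta": ...}) from messages that come from the service. It is not the xlog.Reader API; read_xlog and SAMPLE are hypothetical, and the sample lines follow the format that xlog output has elsewhere in this history.

```python
import io
import json

# two lines in the format emitted by xlog: one event and one message
SAMPLE = (
    '{"meta": {"event": "service attach", "time": 1670589223.1970558, '
    '"srv_name": "ENB", "srv_type": "ENB", "srv_version": "2022-12-01"}}\n'
    '{"message":"ue_get","ue_list":[],"message_id":2,"time":3272.292,"utc":1670589223.391}\n'
)


def read_xlog(stream):
    """Yield ('event', meta) for {"meta": ...} lines and ('message', msg) otherwise."""
    for line in stream:
        line = line.strip()
        if not line:
            continue                  # tolerate blank lines
        entry = json.loads(line)
        if "meta" in entry:
            yield "event", entry["meta"]
        else:
            yield "message", entry
```

For example list(read_xlog(io.StringIO(SAMPLE))) gives one event entry followed by one message entry.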
-
Kirill Smelkov authored
Start the package to process measurements and compute KPIs from them. In this patch we add kpi.Measurement - a central part to represent measurement results in intermediate generic form. kpi.Measurement will be used by both KPI calculator, and by drivers for particular LTE stacks to provide their KPI-related data in this uniform common format.

kpi.Measurement also establishes semantic for such measurement results to be followed by drivers. The semantic is stated in kpi.Measurement docstring and in comment for every field. Also in particular, according to TS 32.401 and common sense, measurement data are required to be correctly accounted for initiation/termination events to avoid discrepancies. Quoting kpi.Measurement documentation:

    Important note (init/fini correction):

    Termination events should be counted in the same granularity period, where corresponding initiation event occurred, even if termination event happens _after_ granularity period covering the initiation event. For example in the following illustration "ConnEstab Success" event should be counted in the same granularity period 1 as "ConnEstab Initiate" event:

                     -----------------------
                    '                       '
            | p e r ' i o d 1       | p e r ' i o d 2    |
            |       '               |       v            |
        ────'───────x───────────────'───────x────────────'────────────>
                ConnEstab               ConnEstab               time
                Initiate                Success

    This preserves invariant that N(initiations) is always ≥ N(results) and goes in line with what TS 32.401 4.3.2 "Perceived accuracy -> Same period for the same two events" requires.

kpi.Measurement comes accompanied by kpi.MeasurementLog which in essence is array of kpi.Measurements. We will use kpi.Measurement and kpi.MeasurementLog in later patches to both provide Amarisoft-specific data in this common format, and to compute KPIs from it.
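The init/fini rule can be shown on a toy example. The sketch assumes that every attempt is known as a pair (t_init, t_success), which is an idealization: a real driver may only see aggregated counters. count_per_period is a hypothetical helper, not part of kpi.

```python
def count_per_period(attempts, period):
    """Count initiations and successes per granularity period.

    attempts: list of (t_init, t_success or None), times in seconds.
    Both events are accounted to the period that contains t_init.
    """
    counts = {}
    for t_init, t_success in attempts:
        p = int(t_init // period)     # period index of the initiation
        n_init, n_succ = counts.get(p, (0, 0))
        counts[p] = (n_init + 1, n_succ + (t_success is not None))
    return counts
```

With a 60 s period, an attempt initiated at t=50 that succeeds at t=70 is counted entirely in period 0, so N(initiations) ≥ N(results) holds in every period.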
-
Kirill Smelkov authored
-
Kirill Smelkov authored
Upon rotation we want to emit trailing part to the old file, and emit new header into new log file. All this is custom and cannot be handled reliably when rotation is done by external tool.
-
Kirill Smelkov authored
Document kind of messages and events that could be emitted by xlog. The messages come from Amarisoft software directly, but events are xlog-specific and without proper documentation it is easy to miss what they are and which information and semantic they carry.
-
Kirill Smelkov authored
The time emitted in messages by Amarisoft is in seconds. It also makes sense to emit meta/event times in seconds as well for uniformity. This is backward-incompatible change, but it should be ok at this early time.
-
Kirill Smelkov authored
xlog logging is kind of slow - usually it comes once per several seconds or once per minute. And without flushing many entries can remain sitting up in the file buffer in userspace without being conveyed to OS kernel. Which is not very convenient because in such situation we cannot make good use of tools like `tail -f`. Since flushing is relatively cheap operation - it is just one write syscall - let's do it after every emitted line. The write syscall does not force data to be synced to disk, so it should not slow things down, but make it convenient to have latest logs right away in the filesystem view.
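A minimal sketch of flushing after every emitted line, assuming entries are written as JSON lines to a regular Python file object; emit is a hypothetical helper name.

```python
import json


def emit(f, entry):
    """Write one log entry as a JSON line and hand it to the OS right away."""
    f.write(json.dumps(entry) + "\n")
    # flush() moves data from the userspace buffer to the kernel via a write
    # syscall; it does not call fsync, so nothing is forced onto the disk.
    f.flush()
```

After emit returns, the line is visible to other readers of the file such as `tail -f`, even though the writer keeps the file open.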
-
Kirill Smelkov authored
Conn multiplexes many requests/responses over single WebSocket connection. To do so it organizes dedicated receive thread that continuously receives messages from underlying websocket connection and dispatches received replies back to threads that issued corresponding requests. An rx timeout in that receive thread is thus not something unexpected - it can happen e.g. if there are simply no requests sent. But I missed that in 61ad9032 (amari: Add functionality to interoperate with an Amarisoft LTE service via WebSocket) and implicitly did not ignore such global rx timeout. As the result `amari xlog` does not work properly if period of requests is greater than timeout value, for example:

            $ xamari xlog ws://localhost:9001 ue_get/30s
            {"meta": {"event": "start", "time": 1670588996.0623107, "generator": "xlog ws://localhost:9001 ue_get[]/30.0s"}}
            {"meta": {"event": "service attach", "time": 1670588996.1852894, "srv_name": "ENB", "srv_type": "ENB", "srv_version": "2022-12-01"}}
            {"message":"config_get", ...}
    note    {"message":"ue_get","ue_list":[],"message_id":2,"time":3045.323,"utc":1670588996.423}
    ---->   {"meta": {"event": "service detach", "time": 1670589026.3569217, "srv_name": "ENB", "srv_type": "ENB", "srv_version": "2022-12-01", "reason": "timed out"}}
            {"meta": {"event": "service attach", "time": 1670589029.485363, "srv_name": "ENB", "srv_type": "ENB", "srv_version": "2022-12-01"}}
            {"message":"config_get", ...}
            {"message":"ue_get","ue_list":[],"message_id":2,"time":3078.606,"utc":1670589029.706}
            ...

-> Fix it by ignoring global rx timeout.

NOTE: we must also add manual handling of per-request timeout when waiting for corresponding reply. If we don't do that a situation where particular reply does not come back, but replies for other requests are coming back ok, will never be detected.

Here is how fixed version works now:

    $ xamari xlog ws://localhost:9001 ue_get/30s
    {"meta": {"event": "start", "time": 1670589223.0339117, "generator": "xlog ws://localhost:9001 ue_get[]/30.0s"}}
    {"meta": {"event": "service attach", "time": 1670589223.1970558, "srv_name": "ENB", "srv_type": "ENB", "srv_version": "2022-12-01"}}
    {"message":"config_get", ...}
    {"message":"ue_get","ue_list":[],"message_id":2,"time":3272.292,"utc":1670589223.391}
    {"message":"ue_get","ue_list":[],"message_id":3,"time":3302.274,"utc":1670589253.373}
    {"message":"ue_get","ue_list":[],"message_id":4,"time":3332.266,"utc":1670589283.365}
    ...

Note that ue_get messages are coming sequentially and there is no "service detach" event, that was artificially popping up due to wrong timeout handling.
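The two timeout rules can be sketched on a toy multiplexer. This is not the amari.Conn code: MuxConn and LoopTransport are hypothetical, and the transport is assumed to offer send(msg) and recv(timeout), with recv raising TimeoutError when nothing arrives in time.

```python
import queue
import threading


class LoopTransport:
    """Fake transport for the sketch: echoes requests back, except 'silent' ones."""

    def __init__(self, silent=()):
        self._q = queue.Queue()
        self._silent = set(silent)

    def send(self, msg):
        if msg["message"] not in self._silent:
            self._q.put(dict(msg))

    def recv(self, timeout):
        try:
            return self._q.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("rx timeout")


class MuxConn:
    """Multiplex request/reply pairs over one transport via a receive thread."""

    def __init__(self, transport, rx_timeout=0.05):
        self._tr = transport
        self._rx_timeout = rx_timeout
        self._lock = threading.Lock()
        self._pending = {}            # message_id -> Queue awaiting the reply
        self._next_id = 0
        self._down = threading.Event()
        self._rx = threading.Thread(target=self._serve_recv, daemon=True)
        self._rx.start()

    def _serve_recv(self):
        while not self._down.is_set():
            try:
                msg = self._tr.recv(self._rx_timeout)
            except TimeoutError:
                continue              # idle link is normal: ignore global rx timeout
            with self._lock:
                q = self._pending.pop(msg.get("message_id"), None)
            if q is not None:
                q.put(msg)            # dispatch reply to the waiting requester

    def req(self, name, timeout):
        with self._lock:
            self._next_id += 1
            mid = self._next_id
            q = self._pending[mid] = queue.Queue(1)
        self._tr.send({"message": name, "message_id": mid})
        try:
            return q.get(timeout=timeout)     # per-request timeout
        except queue.Empty:
            with self._lock:
                self._pending.pop(mid, None)
            raise TimeoutError("no reply to %s (message_id=%d)" % (name, mid))

    def close(self):
        self._down.set()
        self._rx.join()
```

The receive loop treats silence as normal, while req detects a reply that never comes even when replies to other requests keep flowing.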
-