We encounter data redundancy and message length constraints when transmitting maritime distress safety information over the BeiDou SMC system: a single piece of information often requires multiple BeiDou short-message transmissions to be delivered in full. To enhance information transmission capability, we introduce an ME algorithm in this paper. The algorithm employs a pattern-matching-based method for information extraction, performing dictionary lookups for each message field to reduce space redundancy. Additionally, we propose the BPPMd algorithm, tailored to maritime distress safety information: we introduce a byte encoding module in the input phase and use an LSTM network to make predictions from the current input and historical information, using the predicted values for encoding and for updating the network. This algorithm further improves compressibility while minimizing text data redundancy, at a relatively minor cost in compression time.
3.1. ME Algorithm of Maritime Safety Information
After analyzing the characteristics of the Maritime Safety Information dataset, we found that the format of maritime distress safety information exhibits significant regularity. In natural language processing, for information with fixed rules, a simple and efficient rule-based pattern-matching method [47] can be adopted for information extraction.
The rule-based method refers to a technique in which a large amount of text is examined and the rule patterns present in the text are analyzed; these patterns are then systematically parsed and matched for information extraction [48]. Scholars have proposed various rule-based approaches, including a rule-based knowledge element attribute extraction method and a web information extraction method based on regular expressions [49]. Although the rule-based method has lower automation and universality, it offers higher accuracy and good flexibility for natural language text. It is straightforward to operate during the extraction process but relies heavily on the formulated rules (string patterns) and is suitable for structured text. In this method, regular expressions, composed of ordinary characters and special characters such as wildcards, are used to describe sets of strings. They are commonly employed for text content searches and can match text according to certain algorithms, facilitating the extraction of substrings from strings. Regular expressions form the foundation of rule-based pattern-matching methods for extracting text information.
For ordinary direct regex matching, if the matching rules are too loose and the boundaries are weak, there may be numerous match results, many of which are irrelevant to the task objectives. Therefore, we use a regular-expression matching method with a keyword-triggering mechanism. The process is as follows:
- (1)
For a string S, given a regular expression R_key for a keyword, obtain the match result set M.
- (2)
Obtain the set P of starting positions of the keyword matches in M.
- (3)
Set the search range to n characters based on the task’s target string length. Let the i-th element in set P be p_i. If the number of elements in P is m, then split the string S into m substrings, where each substring ranges from p_i to p_i + n. Denote the set of substrings as T.
- (4)
Given a target regular expression R_target, use R_target to match each element in T to obtain the final result set.
In summary, keyword-based range regular expression matching first locks onto a small part of the complete string based on keyword position information. Then, it performs regular expression matching within this specified range, yielding more precise results and avoiding excessive matches. The specific implementation of the ME algorithm is illustrated in Figure 3.
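The keyword-triggered matching steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `keyword_scoped_match`, the 40-character window, and the sample distress message are assumptions for the example.

```python
import re

def keyword_scoped_match(text, keyword_pattern, target_pattern, window=40):
    """Sketch of keyword-triggered regex matching:
    (1) find keyword matches, (2) take their start positions,
    (3) cut a substring of `window` characters from each position,
    (4) apply the target regex only inside those substrings."""
    results = []
    for m in re.finditer(keyword_pattern, text):       # steps (1)-(2)
        segment = text[m.start():m.start() + window]   # step (3)
        results.extend(re.findall(target_pattern, segment))  # step (4)
    return results

# Example: extract coordinates only near the keyword "position"
msg = "vessel in distress, position 30-15N 122-40E, 5 persons on board"
print(keyword_scoped_match(msg, r"position", r"\d{2}-\d{2}[NS] \d{3}-\d{2}[EW]"))
# ['30-15N 122-40E']
```

Restricting the target regex to the keyword's neighborhood is what suppresses the spurious matches that a document-wide search would produce.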
Zhejiang Navigation Police is one of the issuing authorities we define, and “East China Sea” is one of the affected areas we define. Taking the issuing authority and affected area as examples, the dictionary encodes “Zhejiang Navigation Police” as $P05 and “East China Sea” as $L04. This reduces the byte space occupied by the information and improves transmission efficiency.
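The dictionary lookup can be sketched as follows. Only the two codes $P05 and $L04 come from the example above; the dictionary variable names and the pass-through fallback for unknown fields are illustrative assumptions.

```python
# Hypothetical field dictionaries; only the two codes below are from the
# paper's example, the rest of the scheme is illustrative.
ISSUER_DICT = {"Zhejiang Navigation Police": "$P05"}
AREA_DICT = {"East China Sea": "$L04"}

def encode_fields(issuer, area):
    """Replace long field strings with short dictionary codes;
    unknown fields are passed through unchanged (an assumption)."""
    return ISSUER_DICT.get(issuer, issuer), AREA_DICT.get(area, area)

code_p, code_l = encode_fields("Zhejiang Navigation Police", "East China Sea")
print(code_p, code_l)  # $P05 $L04
```

A 26-byte authority name collapsing to a 4-byte code is exactly the kind of space-redundancy reduction the ME algorithm targets.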
3.2. BPPMd Lossless Compression Algorithm
After encoding maritime safety information with a specialized algorithm, we can further process the data using efficient lossless compression algorithms to reduce the space occupied by the textual data. The traditional LZ77 algorithm, as shown in Figure 4a, divides the sliding window into two regions: the dictionary area on the left and the area to be encoded on the right. The encoder searches the dictionary area until it finds a matching string. However, the LZ77 algorithm performs poorly when compressing maritime distress safety information. The PPMd algorithm, which has shown excellent results in text compression, is depicted in Figure 4b. It constructs a complex context indexing tree, recording the order and frequency information of the already-compressed data stream. It predicts the probability of each character in the data stream to be compressed based on the stored information and finally encodes the predicted probability values with an interval encoder. The PPMd algorithm already achieves good results on the maritime distress safety information dataset, but there is still room for improvement through multimodel fusion, and further research on this could yield even better results.
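The context-based prediction at the heart of PPM-style compressors can be illustrated with a toy frequency model. This is an order-2 sketch, not PPMd's actual context tree; the class name `OrderNModel` and the sample string are assumptions, and the escape mechanism of full PPM is only hinted at by the `None` return.

```python
from collections import defaultdict

class OrderNModel:
    """Toy order-N context model in the spirit of PPM: record symbol
    frequencies per context and predict the next symbol's probability."""
    def __init__(self, order=2):
        self.order = order
        self.freq = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # count `symbol` under the last `order` characters of history
        self.freq[history[-self.order:]][symbol] += 1

    def prob(self, history, symbol):
        ctx = history[-self.order:]
        total = sum(self.freq[ctx].values())
        if total == 0 or symbol not in self.freq[ctx]:
            return None  # would trigger an escape in full PPM
        return self.freq[ctx][symbol] / total

model = OrderNModel(order=2)
data = "mayday mayday"
for i, ch in enumerate(data):
    model.update(data[:i], ch)

# after "ay" the model has seen 'd' twice and ' ' once
print(model.prob("mayday may", "d"))  # 0.6666666666666666
```

PPMd's real structure is far richer (variable orders, escape frequencies, SEE), but the principle is the same: stored order-and-frequency information turns into a probability handed to the interval encoder.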
LSTM is a form of recurrent neural network designed to capture long-term dependencies. This makes it adept at retaining contextual information and understanding relationships between words when processing text data, rather than merely handling each word individually. By doing so, we can achieve a more nuanced representation of the text data, which is derived from learning the data itself. This representation is typically more compact than the original text data because it incorporates semantic understanding and feature extraction, removing some redundant information while preserving essential semantic and structural features. Once we have obtained this more compact and expressive representation, we can apply traditional compression algorithms to it. Since this representation has already been processed by the LSTM model, traditional compression algorithms can usually handle it more effectively, thereby achieving better compression of the text data.
Based on LSTM’s text processing capabilities, we incorporate it as a byte encoding module into the PPMd algorithm. As shown in Figure 4c, our method is called BPPMd. It aims to further enhance compression performance while minimizing the sacrifice of compression speed. The byte encoding module, illustrated in Figure 4d, comprises a preprocessor, a statistical analyzer, a predictor, and an encoder.
The preprocessor performs tasks such as tokenization, data cleaning, and normalization on the input data to prepare it for statistical analysis. Tokenization here involves reading the input data byte by byte and further segmenting it into bits. Data cleaning entails removing unnecessary characters based on the data’s characteristics or handling padding bytes at the end of files. Normalization standardizes the data so that it can be processed more effectively in subsequent steps. The statistical analyzer then performs statistical analysis on the preprocessed data to determine the frequency characteristics and distribution of words in the data. Based on the results of this analysis, a vocabulary is constructed. This vocabulary provides the predictor with all possible input symbols or data values needed for prediction, helping the predictor better understand the features and patterns of the input data, thereby improving the accuracy of predictions and the efficiency of encoding.
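The statistical analyzer's frequency-ranked vocabulary can be sketched as follows. The function name `build_vocabulary`, the byte-level granularity, and the tie-breaking rule are assumptions for illustration, not details specified by the module description.

```python
from collections import Counter

def build_vocabulary(data: bytes, max_size=256):
    """Sketch of the statistical analyzer: count byte frequencies in the
    preprocessed data and rank byte values by descending frequency
    (ties broken by byte value) to form the predictor's vocabulary."""
    counts = Counter(data)
    ranked = sorted(counts, key=lambda b: (-counts[b], b))
    return ranked[:max_size]

vocab = build_vocabulary(b"PAN PAN PAN all stations")
print(vocab[:3])  # [32, 65, 78] -> space, 'A', 'N'
```

Handing the predictor a frequency-ordered symbol set is what lets it concentrate probability mass on the bytes that actually occur in maritime safety messages.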
The predictor includes a prediction method and an update method. The predictor first updates an accumulated bit pattern based on the input bits. If the number of bits exceeds a set threshold, it invokes the prediction method. This method involves iterating through the neuron layers and performing forward propagation on each layer. The output of the current layer is copied to the input buffer of the next layer. Next, it computes the output value of each neuron in the output layer. The raw output values of the output layer are then exponentiated and normalized to represent a probability distribution.
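The final exponentiate-and-normalize step of the prediction method is a standard softmax. A minimal sketch follows; the max-subtraction for numerical stability is common practice we assume, not a detail stated above.

```python
import math

def softmax(raw_outputs):
    """Exponentiate the raw output-layer values and normalize them so
    they form a probability distribution over the possible symbols."""
    m = max(raw_outputs)                         # for numerical stability
    exps = [math.exp(v - m) for v in raw_outputs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # three probabilities that sum to 1.0
```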
The update method is then called, which involves performing backpropagation training for each neuron layer at each time step. It calculates and updates the output layer error for the current time step and updates the network weights based on the errors of the output layer and the hidden layers. Following this, the probability values of each possible byte are updated based on the vocabulary and the predictor’s output. For each byte in the vocabulary, if it is part of a word, the corresponding probability value is extracted from the predictor’s output.
The flowchart for the BPPMd algorithm is illustrated in Figure 5.
In the byte encoding stage, the byte encoding module begins by preparing the input data through preprocessing steps such as text segmentation and data cleaning. It constructs a vocabulary based on the characteristics of the data and initializes a predictor trained on symbol frequencies from the input data to generate a compression model. During the compression phase, the input data are read byte by byte and converted into a bitstream. The encoder uses the predictor’s output and the encoding algorithms to encode each bit, yielding a more compact and expressive representation of the text data after semantic understanding and feature extraction. This facilitates efficient data compression when combined with context models and range coders. The encoded data are written to a file via an output stream, while encoding rates and other statistical information are recorded. The entire process integrates LSTM networks for prediction and encoding, ensuring effective compression and decompression of the data. After the byte encoding module, the data are more compact than the original text data: redundant information has been removed, effectively reducing the byte space occupied by the text data.
During the data compression stage, when the data enter the compression module, they first pass through query module (I) for character querying. If the character to be compressed is matched at the current level, the match succeeds: the character is encoded based on the frequency information of the matched character, and the context tree is updated before compression continues with the next byte. If the match fails, an escape occurs at the current level. The escape character is encoded based on its frequency, and the process falls back to the previous level, using query module (II) to continue the matching operation. When the current order reaches −1, meaning an escape has occurred at every level, the compression process ends. The difference between query module (I) and query module (II) lies in how escape characters are predicted: the former uses a fixed escape character frequency (a frequency of 1), while the latter uses secondary escape estimation (SEE) to predict the probability of escape characters.
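The match-or-escape fallback through the context orders can be sketched as follows. This is an illustrative walk, not PPMd's implementation: the function name `ppm_lookup` and the toy frequency tables are assumptions, and the sketch only records the match/escape events rather than encoding the escape symbol's own probability (which PPMd derives from a fixed frequency of 1 or from SEE).

```python
def ppm_lookup(contexts, symbol):
    """Walk down the context orders: try to match `symbol` at the
    highest order; on failure record an escape and fall back one
    order. `contexts` is a list of (order, frequency-table) pairs,
    highest order first; order -1 is a uniform model over all bytes."""
    events = []
    for order, table in contexts:
        if symbol in table:
            total = sum(table.values())
            events.append(("match", order, table[symbol] / total))
            return events
        events.append(("escape", order))   # not found: escape this level
    events.append(("match", -1, 1 / 256))  # order -1 always matches
    return events

# order-2 table lacks 'q', so one escape, then a match at order 1
contexts = [(2, {"a": 3, "b": 1}), (1, {"q": 2, "a": 5}), (0, {"q": 1})]
print(ppm_lookup(contexts, "q"))  # escape at order 2, match at order 1 with p = 2/7
```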
The escape module accurately predicts the probability of escape characters by establishing an escape context model. The context model established by the escape module is simpler than the context model and prediction module established by the PPM algorithm. Additionally, the prediction process of the escape module is straightforward, directly producing prediction results by only querying one layer of context information. The escape module utilizes various contextual information, including the total frequency information of the current context, the number of characters, and the order of the context level, as well as the parent context information and child context information, to more accurately predict the probability of escape characters.
The prediction update module passes the predicted probability information to the encoder module. The encoder uses a range coding algorithm to encode the probabilities into the corresponding bitstream, which is then output through the output interface. During the prediction process of the PPMd algorithm, two types of prediction information are generated: probability information for binary contexts (those with only one successor character) and probability information for multisymbol contexts (those with more than one successor character).
When selecting the binary model, the algorithm requires only one multiplication operation, while selecting the multisymbol encoder requires one division operation and two multiplication operations. These calculations narrow down the range of the probability mapping. When the range of the interval becomes sufficiently small, the algorithm performs an interval expansion operation, enlarging the range of the interval by a factor of 256 while outputting the upper 8 bits of the original boundary data through the output buffer module. The flowchart of the range coding algorithm is shown in Figure 6.
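The interval expansion step can be sketched as follows. We assume a 32-bit coder with a 2^24 renormalization threshold, a common choice in range coders; the exact constants and structure of PPMd's coder may differ, and the function name `renormalize` is illustrative.

```python
def renormalize(low, range_, out):
    """Interval expansion as described above: while the range is too
    small, shift the upper 8 bits of the `low` boundary into the output
    buffer and widen both `low` and `range_` by a factor of 256."""
    TOP = 1 << 24            # renormalization threshold (assumed)
    MASK = (1 << 32) - 1     # keep values within 32 bits
    while range_ < TOP:
        out.append((low >> 24) & 0xFF)  # emit the upper 8 bits
        low = (low << 8) & MASK         # drop them from the boundary
        range_ = (range_ << 8) & MASK   # enlarge the interval 256 times
    return low, range_

out = []
low, rng = renormalize(0x12345678, 0x0000FF00, out)
print([hex(b) for b in out], hex(rng))  # ['0x12', '0x34'] 0xff000000
```

Two expansion rounds are needed here because one 256x widening still leaves the range below the threshold; each round emits one byte of the boundary, which is exactly the 8-bit output path the flowchart describes.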