What is the state of the art for predicting/classifying missing labels in partially labeled data?

Overview

Let's say I have the following data:

#                       Label\Target    VM
#2000-01-01 00:00:00        App3        VM9
#2000-01-01 01:00:00        App3        VM3
#2000-01-01 02:00:00        App1        VM1
#2000-01-01 03:00:00        App1        VM8
#2000-01-01 04:00:00    ->  None        VM1
#...                        ...         ...
#2000-12-31 19:00:00    ->  None        VM5
#2000-12-31 20:00:00        App3        VM1
#2000-12-31 21:00:00        App3        VM7
#2000-12-31 22:00:00        App1        VM3
#2000-12-31 23:00:00        App2        VM8

I have 3 classes (App1, App2, App3) + None.

None represents missing values in Label\Target column.

Problem definition

Problem: classify/predict the missing values, represented by None in the Label\Target column, given a time series.

None ==?==> (App1, App2, App3)?

I'm unsure which approach to take with partially labeled data.

Approach1: consider 3 classes (App1, App2, App3)
Approach2: consider 4 classes (App1, App2, App3) + None

Which method should I use?

The thing that crossed my mind is Self-Supervised Learning (SSL)

What about Self-Supervised Learning?

I did some research into best practices and the state of the art for this problem, and found this article:

How about Semi-Supervised Learning?

"Semi-supervised learning is particularly useful when there is a large amount of unlabeled data available, but it’s too expensive or difficult to label all of it." Ref

Still, I'm unsure whether this is a classification/prediction problem or a forecasting problem for the missing labels.

Reproducible data generation

The following generates sample data, in case you have a Pythonic solution:

import random
import numpy as np
import pandas as pd

# Seed both generators: the lists below are drawn with `random`, not NumPy
random.seed(2023)
np.random.seed(2023)

# Generate TS
ts = pd.date_range('2000-01-01', '2000-12-31 23:00', freq='H')

# number of samples
N = len(ts)

# Create a random dataset
data = {
    #"TS": ts,
    'Appx':  [random.choice(['App1', 'App2', 'App3', None])                for _ in range(N)],
    'VM':    [random.choice(['VM1' , 'VM2' , 'VM3', 'VM4', 'VM5', 'VM6'])  for _ in range(N)]
}
df = pd.DataFrame(data, index=ts)
df

Question

I'm looking for best practice on how to frame this problem correctly (predicting missing labels in partially labeled data over time), and for possible approaches (Approach1 or Approach2) based on state-of-the-art and recent methods (semi-/self-supervised learning). I have generated sample data in case someone wants to offer minimal solution(s).

Any help will be highly appreciated.

score 1 · Answer 1 · answered Oct 10 '23 at 22:18

I would frame this problem as a denoising task, that is, having a model learn to reconstruct a corrupted input. In this case, the noise to remove would be the missing labels/targets. For this, you would need enough fully labeled data to train the model; this data would be artificially corrupted by replacing some labels/targets with a special [MASK] element. The model's expected output would be the correct labels/targets at the masked positions.

Given that your data seems to be sequences of discrete elements, I would use a Transformer encoder, which can use the full sequence to make its predictions. Therefore, I would use something like BERT.

The idea would be as follows (this assumes knowledge of BERT; please have a look at this masked language modeling tutorial and the questions about BERT on this site for more):

The model receives a sequence of pairs of symbols: [(label/target1, VM1), (label/target2, VM2), ...]. A percentage (e.g. 15%) of the input positions have their label/target replaced with a special [MASK] symbol.
To feed each pair of symbols (label/target, VM) to the neural network, I would use separate embeddings for each component of the pair. At each position, we would embed each part of the pair and add the resulting vectors together, along with the Transformer's positional encodings.
The model generates the label/target at each position. Only the positions where the input was masked are taken into account in the loss function.
For the training data, you should use the sequences for which you have all the label/target information.
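The data-preparation side of the steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the Transformer encoder itself is omitted and random logits stand in for its output, the embedding tables are random rather than learned, and the names `mask_labels`, `masked_cross_entropy`, `MASK_ID` and `D_MODEL` are made up for this example. Note that the input vocabulary has four symbols (three classes plus [MASK]), while the output is assumed to range over the three real classes only.

```python
import numpy as np

rng = np.random.default_rng(0)

LABELS = ["App1", "App2", "App3"]   # output classes
MASK_ID = len(LABELS)               # extra input-only symbol, i.e. [MASK] / None
N_VMS = 6                           # assumed number of distinct VMs
D_MODEL = 8                         # assumed embedding size
label_to_id = {name: i for i, name in enumerate(LABELS)}


def mask_labels(label_ids, mask_prob=0.15):
    """Corrupt a fully labeled sequence: return (inputs, targets, loss_mask)."""
    label_ids = np.asarray(label_ids)
    loss_mask = rng.random(label_ids.shape) < mask_prob   # positions to hide
    inputs = np.where(loss_mask, MASK_ID, label_ids)      # hidden labels -> [MASK]
    return inputs, label_ids, loss_mask


def masked_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy computed over the masked positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[loss_mask].sum() / max(loss_mask.sum(), 1)


# Toy fully labeled sequence of (label, VM) pairs
sequence = ["App3", "App3", "App1", "App1", "App2", "App1", "App3", "App2"] * 5
label_ids = np.array([label_to_id[s] for s in sequence])
vm_ids = rng.integers(0, N_VMS, size=len(sequence))

inputs, targets, loss_mask = mask_labels(label_ids)

# Separate embedding tables per component, summed with positional vectors
label_emb = rng.normal(size=(len(LABELS) + 1, D_MODEL))   # +1 row for [MASK]
vm_emb = rng.normal(size=(N_VMS, D_MODEL))
pos_emb = rng.normal(size=(len(sequence), D_MODEL))       # stand-in for positional encodings
x = label_emb[inputs] + vm_emb[vm_ids] + pos_emb          # encoder input, shape (seq_len, D_MODEL)

# Stand-in for the encoder output: one score per real class at every position
logits = rng.normal(size=(len(sequence), len(LABELS)))
loss = masked_cross_entropy(logits, targets, loss_mask)
print(x.shape, float(loss))
```

At inference time, the rows whose label is genuinely missing would be fed with `MASK_ID` in the same way, and the predicted class at those positions would be taken as the imputed label.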

answered Oct 10 '23 at 22:18

noe


Thanks for your input. Does using your BERT-based approach mean we have 3 classes (App1, App2, App3), or do we need to consider 4 classes (App1, App2, App3) + None? – Mario Oct 10 '23 at 23:18
You would need the normal label/target plus an option to mask. In my answer I referred to such an option as [MASK] to be consistent with BERT, but you can call it None. – noe Oct 10 '23 at 23:30
Since we have a timestamp in the data, is there any time-series-based approach for our problem? Or do you think that a sequence-friendly Transformer (i.e., BERT) with [MASK] is good enough for this task, so that we don't need to treat the problem as time-series analysis? So, can we say that the state of the art for this problem is a language model like BERT? – Mario Oct 10 '23 at 23:59
From your data, I understood that the intervals were regular (1h), so I assumed the ordering info would be enough. If this assumption is not acceptable, then I would add more info to each position depending on what is considered relevant by domain experts; for instance, if the hour is the relevant part, I would add a new embedding to represent it (with 24 values, one for each hour of the day). – noe Oct 11 '23 at 07:07
Of course, I am surely biased toward my own area of expertise, which is NLP, LMs, etc., and you know, to a hammer, everything looks like a nail. There may be other time-series-based approaches that are as valid as or more valid than my proposal. – noe Oct 11 '23 at 07:10
May I know whether your proposed approach, using LMs like BERT with [MASK], is aligned with Skip-Gram, which counts as self-supervised? In a comment under this answer, they said: "The overall process of BERT, ELMo, etc. can be considered as 'Semi-supervised learning' because they including both LM(self-supervised) and supervised learning. The training process of LM is self-supervised learning." Do you confirm it? – Mario Oct 11 '23 at 19:39
Yes, the purpose of both the skipgram model and BERT is to learn word/token representations. – noe Oct 11 '23 at 20:49
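The hour-of-day embedding mentioned in the comments above (24 values, one per hour) could be derived from the timestamp index roughly as follows. This is a sketch only: the embedding table is random here rather than learned, and `D_MODEL` and `hour_emb` are illustrative names, not part of any library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D_MODEL = 8  # assumed embedding size

# Two days of hourly timestamps, same layout as the question's index
ts = pd.date_range("2000-01-01", periods=48, freq=pd.Timedelta(hours=1))

# Integer id in 0..23 for each position, taken from the timestamp
hour_ids = ts.hour.to_numpy()

# One vector per hour of the day; in a real model this table would be learned
hour_emb = rng.normal(size=(24, D_MODEL))

# Per-position vectors that would be added to the label, VM and positional embeddings
hour_vectors = hour_emb[hour_ids]
print(hour_ids[:5], hour_vectors.shape)
```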

Ggjj11 · Answer 2 · 2023-10-12T19:28:13.307

For tabular data, simple boosted ML models are still a good choice compared to state-of-the-art Fusion Transformers (2023); see https://arxiv.org/abs/2106.11959v3

Therefore, label propagation, as in https://scikit-learn.org/stable/modules/semi_supervised.html, is still a method worth trying.
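A minimal sketch of that idea on data shaped like the question's `df`, assuming scikit-learn is installed. It uses `LabelSpreading`, a variant from the same scikit-learn module, with a kNN kernel to keep the graph sparse. The feature choice (one-hot VM plus a cyclic hour encoding) and the hyperparameters are assumptions for illustration only. In scikit-learn's convention, None is not a fourth class: unlabeled rows are marked with -1 and the model predicts one of the three real classes for them. Because the sample data is random, the filled-in labels carry no real signal here.

```python
import numpy as np
import pandas as pd
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(2023)
classes = ["App1", "App2", "App3"]

# Two weeks of hourly data in the same layout as the question's df
ts = pd.date_range("2000-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
choices = np.array(classes + [None], dtype=object)
df = pd.DataFrame(
    {
        "Appx": rng.choice(choices, size=len(ts)),
        "VM": rng.choice([f"VM{i}" for i in range(1, 7)], size=len(ts)),
    },
    index=ts,
)

# Features (an assumption): one-hot VM plus a cyclic encoding of the hour of day
hour = df.index.hour.to_numpy()
X = np.column_stack(
    [
        pd.get_dummies(df["VM"]).to_numpy(dtype=float),
        np.sin(2 * np.pi * hour / 24),
        np.cos(2 * np.pi * hour / 24),
    ]
)

# Targets: 0..2 for the known classes, -1 marks an unlabeled (None) row
y = df["Appx"].map({c: i for i, c in enumerate(classes)}).fillna(-1).astype(int).to_numpy()

model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
model.fit(X, y)

# transduction_ holds the label inferred for every row, including the unlabeled ones
df["Appx_filled"] = np.array(classes)[model.transduction_]
print(df[df["Appx"].isna()].head())
```

On real data it would make sense to hide some known labels, run the same procedure, and check how many of them are recovered before trusting the imputed values.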

If you deal with big, clean data of a certain modality, then you might go with pretraining on unlabelled data (as suggested by noe, with the masked language modeling loss, etc.) and then add a task head for whatever task you want to solve. Training the task head will still require you to have enough data.

edited Oct 12 '23 at 19:28

answered Oct 12 '23 at 17:43

Ggjj11


Thanks for your input. Please feel free to extend your answer if you have minimal Pythonic implementation using df proposed at the end of my post. – Mario Oct 12 '23 at 18:48
Would you elaborate on how you frame the problem, e.g., as denoising? What is the best approach to tackle this problem (e.g., SSL, LM models, etc.)? You proposed label propagation for small data and pretraining for big data, since we need a big enough training set. Please comment on whether your approach considers 3 classes (App1, App2, App3) or 4 classes (App1, App2, App3) + None. In my scenario, missing Label\Targets are marked by the string None, which makes up 80% of the data, and the rest is labeled (App1, App2, App3). – Mario Oct 12 '23 at 20:16
