Unbalanced classes: class_weight for ML algorithms in Spark MLlib
In Python's scikit-learn, several algorithms (e.g. logistic regression, random forest) have a class_weight parameter to handle unbalanced data.
However, I cannot find such a parameter for the MLlib algorithms. Is there a plan to implement class_weight for some MLlib algorithms? Is there any other approach in MLlib for unbalanced data? Or do we have to handle all the up/downsampling ourselves in MLlib?
Thanks!
machine-learning apache-spark unbalanced-classes weighted-data
asked Dec 7 '16 at 0:08 by Edamame
Yes, the algorithms in Spark's MLlib are prepared to handle complex problems. Additionally, from my understanding there is not a way to perform a stratified split either. Thus, any performance metrics you obtain will not be representative.
– Samuel Sherman
Jan 6 '17 at 17:43
Here is an example of weighted logistic regression in MLlib from the 2.2 documentation.
– Emre
Oct 3 '17 at 22:14
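To make the weighted logistic regression idea from the comment above concrete, here is a minimal sketch, not the example from the Spark documentation. It assumes a DataFrame with "label" and "features" columns and uses the weightCol parameter of pyspark.ml.classification.LogisticRegression. The helper names and the "classWeight" column are made up for illustration, and the weight formula mirrors what scikit-learn calls class_weight='balanced'.

```python
def balanced_class_weights(counts):
    """Map each label to n_samples / (n_classes * count), so rare classes weigh more."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def fit_weighted_logistic_regression(train_df, label_col="label", features_col="features"):
    """Fit a Spark ML LogisticRegression with one weight per row (requires pyspark)."""
    # Imported lazily so the weight helper above can be used without Spark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import functions as F

    # Count rows per class on the cluster; only the small per-class table is collected.
    counts = {row[label_col]: row["count"]
              for row in train_df.groupBy(label_col).count().collect()}
    weights = balanced_class_weights(counts)

    # Build a CASE WHEN expression that assigns each row the weight of its class.
    weight_expr = None
    for label, weight in weights.items():
        condition = F.col(label_col) == label
        if weight_expr is None:
            weight_expr = F.when(condition, weight)
        else:
            weight_expr = weight_expr.when(condition, weight)

    # "classWeight" is a hypothetical column name chosen for this sketch.
    weighted_df = train_df.withColumn("classWeight", weight_expr)
    lr = LogisticRegression(labelCol=label_col, featuresCol=features_col,
                            weightCol="classWeight")
    return lr.fit(weighted_df)
```

With a 90/10 split, balanced_class_weights({0: 90, 1: 10}) gives the minority class a weight of 5.0 and the majority class roughly 0.56, so both classes contribute equally to the total weighted loss.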
3 Answers
Algorithms in MLlib are commonly used as baselines in production scenarios, and they indeed cannot handle some industrial problems, such as label imbalance. So if you want to use them, you have to balance your instances yourself.
Besides, Spark's BSP (bulk synchronous parallel) mechanism, which you can simply view as data parallelism, might be the main reason why Spark does not cover that problem. It might be hard for Spark to dispatch instances to all nodes in the cluster such that the partial set of instances on each node shares the same label distribution as the whole.
Finally, if you want to implement it yourself, you only have to weight the loss value of every minority-labeled instance during your iteration process.
Hope this helps, good luck :-)
answered Dec 7 '16 at 3:39 by joe
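As an illustration of weighting the loss of minority-labeled instances, here is a small sketch in plain Python. It is not MLlib code: the function name, the default minority_weight of 5.0, and the choice of binary cross-entropy as the loss are all assumptions made for the example.

```python
import math


def weighted_log_loss(y_true, p_pred, minority_label=1, minority_weight=5.0):
    """Weighted mean binary cross-entropy; minority rows count minority_weight times."""
    eps = 1e-12  # keeps log() away from 0
    total_loss = 0.0
    total_weight = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        # Standard per-instance log loss for a 0/1 label and predicted P(y = 1).
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight = minority_weight if y == minority_label else 1.0
        total_loss += weight * loss
        total_weight += weight
    return total_loss / total_weight
```

A model that predicts a low probability for every row looks acceptable under the unweighted loss on imbalanced data, but is penalized much more heavily once the minority rows are up-weighted, which is what pushes the optimizer to pay attention to them.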
One of the ways I've handled imbalanced classes in the past has been to build a classifier on a dataset sampled to have a 50/50 class split. This means using all of the data points associated with your minority class, and randomly sampling the same number of data points from your majority class.
Whether this will work depends on how much data you actually have in your minority class. If you have extreme class imbalance (<5% minority class instances), then you may want to consider synthetic oversampling.
You could probably look at pydf.rdd.takeSample() in Spark, or df.sample in pandas.
edited Aug 21 '18 at 19:36 by Stephen Rauch♦
answered Aug 21 '18 at 15:58 by ngopal
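The 50/50 downsampling described above can be sketched in plain Python as follows. This is only an illustration with made-up helper names, run on an in-memory list rather than a Spark DataFrame. On Spark, the dictionary returned by downsampling_fractions could probably be passed to DataFrame.sampleBy(col, fractions, seed), with the caveat that sampleBy samples each row independently, so the resulting class sizes are only approximately equal.

```python
import random


def downsampling_fractions(counts):
    """Per-class keep fraction that shrinks every class to the smallest class size."""
    smallest = min(counts.values())
    return {label: smallest / count for label, count in counts.items()}


def downsample(rows, label_of, seed=0):
    """Keep all rows of the smallest class and an equally sized random subset of the rest."""
    rng = random.Random(seed)
    # Group rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Sampling without replacement; the smallest class is kept in full.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

For example, with 95 majority rows and 5 minority rows, downsample returns 10 rows, 5 from each class, and downsampling_fractions({0: 95, 1: 5}) keeps all of class 1 and about 5% of class 0.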
I have handled class imbalance with the following methods:
1. Merging the classes that appear least frequently into other classes. Obviously, you should use some kind of domain knowledge instead of merging them randomly.
2. Using resampling techniques like oversampling, undersampling, SMOTE, or ADASYN. I don't recommend these techniques because the resampled data doesn't actually represent the real data, but in any case you can certainly take a look at them.
answered Sep 21 '18 at 1:37 by Siddhi Kiran Bajracharya
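For readers unfamiliar with SMOTE, the core idea is to create synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, brute-force illustration of that idea in plain Python. It is not the reference SMOTE algorithm and not a Spark API; the function name and the defaults are assumptions.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority points and near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority points, sorted by squared Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbours and a random position along the segment.
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic
```

Because every synthetic point lies between two existing minority points, this also shows the concern raised in the answer: the new rows are plausible interpolations, not real observations.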