What is the difference between “equivariant to translation” and “invariant to translation”?
$begingroup$
I'm having trouble understanding the difference between equivariant to translation and invariant to translation.
In the book Deep Learning (I. Goodfellow, Y. Bengio, and A. Courville, MIT Press, 2016), one can find the following statements about convolutional networks:
- [...] the particular form of parameter sharing causes the layer to have a property called equivariance to translation
- [...] pooling helps to make the representation become approximately invariant to small translations of the input
Is there any difference between them, or are the terms used interchangeably?
neural-network deep-learning convolution
$endgroup$
2
$begingroup$
In the old days of Statistics, as in the time of Pitman, invariant was used in the meaning of equivariant.
$endgroup$
– Xi'an
Oct 12 '18 at 18:12
edited 10 mins ago
nbro
asked Jan 4 '17 at 8:41
Aamir
3 Answers
$begingroup$
Equivariance and invariance are sometimes used interchangeably. As pointed out by @Xi'an, you can find such uses in the statistical literature, for instance in the notions of the invariant estimator and especially the Pitman estimator.
However, I would argue that the two terms are better kept separate: the prefix "in-" in invariant is privative (meaning "no variance" at all), while "equi-" refers to "varying in a similar or equivalent proportion".
Let us start from simple image features, and suppose that image $I$ has a unique maximum $m$ at location $(x_m,y_m)$, which is here the main classification feature.
An interesting property of classifiers is their ability to classify in the same manner some distorted versions $I'$ of $I$, for instance translations by any vector $(u,v)$. The maximum value $m'$ of $I'$ is invariant: $m'=m$, so the value is the same. Its location, however, moves to $(x'_m,y'_m)=(x_m-u,y_m-v)$ and is equivariant, meaning that it varies "equally" with the distortion.
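The maximum example above can be checked on a toy image. This is only a sketch under stated assumptions: the helper names `translate` and `max_and_location` are made up for illustration, the translated image is taken to be $I'(x,y)=I(x+u,y+v)$ (which matches the sign in the formula above), and the image wraps around at its borders so that no pixel is lost.

```python
def translate(image, u, v):
    """Return I'(x, y) = I(x + u, y + v), wrapping around at the borders (assumption)."""
    rows, cols = len(image), len(image[0])
    return [[image[(x + u) % rows][(y + v) % cols] for y in range(cols)]
            for x in range(rows)]


def max_and_location(image):
    """Return the maximum value and its (row, column) location."""
    value, x, y = max(
        (pixel, x, y)
        for x, row in enumerate(image)
        for y, pixel in enumerate(row)
    )
    return value, (x, y)


# Toy image with a unique maximum m = 9 at (x_m, y_m) = (2, 3).
image = [
    [0, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 1, 9],
    [0, 0, 0, 0],
]
u, v = 1, 2

m, (x_m, y_m) = max_and_location(image)
m_shifted, location_shifted = max_and_location(translate(image, u, v))

print(m_shifted == m)                          # the value is invariant
print(location_shifted == (x_m - u, y_m - v))  # the location is equivariant
```

Both checks print `True` here: the value stays at 9 while the location moves from (2, 3) to (1, 1).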
The precise formulations given in mathematics for equivariance may depend on the objects and transformations one considers, so I prefer here the notion that is most often used in practice (and I may get the blame from a theoretical stand-point).
Here, translations (or some more generic action) can be equipped with the structure of a group $G$, $g$ being one specific translation operator. A function or feature $f$ is invariant under $G$ if for all images in a class, and for any $g$,
$$f(g(I)) = f(I)\,.$$
It becomes equivariant if there exists another structure (often a group) $G'$ that reflects the transformations in $G$ in a meaningful way. In other words, for each $g$, there is a unique $g' \in G'$ such that
$$f(g(I)) = g'(f(I))\,.$$
In the above example on the group of translations, $g$ and $g'$ are the same (and hence $G'=G$): an integer translation of the image is reflected as the exact same translation of the maximum location.
Another common definition is:
$$f(g(I)) = g(f(I))\,.$$
I however used potentially different $G$ and $G'$ because sometimes $f(I)$ and $g(I)$ are not in the same domain. This happens for instance in multivariate statistics (see e.g. Equivariance and invariance properties of multivariate quantile and related functions, and the role of standardisation).
But here, the uniqueness of the mapping between $g$ and $g'$ allows one to recover the original transformation $g$.
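The three relations above can be tried out on a 1D signal. The following is an illustrative sketch, not a construction taken from the cited references: `shift`, `smooth`, and `smooth_stride2` are hypothetical helpers, all shifts are circular, and the strided filter is my own choice of an example where $g'$ differs from $g$ (the output lives on a grid half as fine, so an input shift by 2 corresponds to an output shift by 1).

```python
def shift(signal, k):
    """Circularly shift a 1D signal by k positions to the right."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def smooth(signal):
    """Circular two-tap sum; output has the same length as the input."""
    n = len(signal)
    return [signal[i] + signal[(i + 1) % n] for i in range(n)]


def smooth_stride2(signal):
    """Same filter, but keeping every second output sample (stride 2)."""
    return smooth(signal)[::2]


x = [0, 3, 2, 0, 0, 0, 0, 0]

# Invariance, f(g(I)) = f(I): the global maximum ignores the shift.
print(max(shift(x, 5)) == max(x))

# Equivariance with g' = g, f(g(I)) = g(f(I)): filter and shift commute.
print(smooth(shift(x, 3)) == shift(smooth(x), 3))

# Equivariance with g' != g, f(g(I)) = g'(f(I)): an input shift by 2
# becomes an output shift by 1 on the coarser grid.
print(smooth_stride2(shift(x, 2)) == shift(smooth_stride2(x), 1))
```

Note that the strided case only works for even input shifts, so the group $G$ is restricted to those translations in this sketch.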
Often people use the term invariance because the equivariance concept is unknown, or everybody else uses invariance, and equivariance would seem more pedantic.
For the record, other related notions (esp. in maths and physics) are termed covariance, contravariance, differential invariance.
In addition, translation-invariance, at least approximate, or in envelope, has been a quest for several signal and image processing tools. Notably, multi-rate (filter-banks) and multi-scale (wavelets or pyramids) transformations have been designed in the past 25 years, for instance under the names of shift-invariant, cycle-spinning, stationary, complex, and dual-tree wavelet transforms (for a review on 2D wavelets, see A panorama on multiscale geometric representations). The wavelets can absorb a few discrete scale variations. All these (approximate) invariances often come at the price of redundancy in the number of transformed coefficients.
$endgroup$
4
$begingroup$
Great! I really admire your effort for the detailed reply @Laurent Duval
$endgroup$
– Aamir
Jan 5 '17 at 8:32
$begingroup$
The terms are different:
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0.
Invariant to translation means that a translation of input features does not change the outputs at all. So if your pattern 0,3,2,0,0 on the input results in 0,1,0 in the output, then the pattern 0,0,3,2,0 would also lead to 0,1,0.
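The two number patterns above can be reproduced with a small sketch. The answer does not say which operations produce these outputs, so the following is one possible construction and purely an assumption: a hypothetical `detect` function that slides a window over the input and emits 1 where it sees the pattern 3,2, followed by a hypothetical `max_pool` with width 2, stride 2, and one zero of padding on each side.

```python
def detect(signal, pattern=(3, 2)):
    """Slide over the signal; emit 1 where the window equals `pattern` (assumed detector)."""
    width = len(pattern)
    return [1 if tuple(signal[i:i + width]) == tuple(pattern) else 0
            for i in range(len(signal) - width + 1)]


def max_pool(signal, width=2, stride=2, pad=1):
    """Max pooling with zero padding on both sides (assumed settings)."""
    padded = [0] * pad + list(signal) + [0] * pad
    return [max(padded[i:i + width])
            for i in range(0, len(padded) - width + 1, stride)]


a = [0, 3, 2, 0, 0]
b = [0, 0, 3, 2, 0]  # the same pattern, translated by one position

# Equivariance: the detection moves along with the input pattern.
print(detect(a))  # [0, 1, 0, 0]
print(detect(b))  # [0, 0, 1, 0]

# Invariance to this small translation: pooling gives the same output.
print(max_pool(detect(a)))  # [0, 1, 0]
print(max_pool(detect(b)))  # [0, 1, 0]
```

With these settings the pooled output is unchanged only because both detections fall into the same pooling window; a larger translation would change it, which is why the invariance is described as approximate and limited to small translations.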
For feature maps in convolutional networks to be useful, they typically need both properties in some balance. The equivariance allows the network to generalise edge, texture, shape detection in different locations. The invariance allows precise location of the detected features to matter less. These are two complementary types of generalisation for many image processing tasks.
$endgroup$
$begingroup$
A translated feature yields a translated output at some layer. Could you elaborate on the case of a whole object that has been translated considerably? It seems it would be detected even if the CNN was not trained on images containing it at different positions. Does equivariance hold in this case (it looks more like invariance)?
$endgroup$
– VladimirLenin
Jul 14 '17 at 10:14
$begingroup$
@VladimirLenin: I don't think that elaboration is required for this question, it is definitely not something the OP has asked here. I suggest you ask a separate question, with a concrete example if possible. Even if visually a "whole object" has been translated, that does not mean feature maps in a CNN are tracking the same thing as you expect.
$endgroup$
– Neil Slater
Jul 14 '17 at 10:24
$begingroup$
Just adding my 2 cents
Consider an image classification task solved with a typical CNN Architecture consisting of a Backend (Convolutions + NL + possibly Spatial Pooling), which performs Representation Learning, and a Frontend (e.g. Fully Connected Layers, MLP), which solves the specific task, in this case image classification. The idea is to build a function $ f : I \rightarrow L $ able to map from the Spatial Domain $ I $ (Input Image) to the Semantic Domain $ L $ (Label Set) in a 2 step process, which is
- Backend (Representation Learning) : $ f : I \rightarrow \mathcal{L} $ maps the Input to the Latent Semantic Space
- Frontend (Task Specific Solver) : $ f : \mathcal{L} \rightarrow L $ maps from the Latent Semantic Space to the Final Label Space
and it is performed using the following properties
- spatial equivariance, regarding the ConvLayer (Spatial 2D Convolution + NonLin, e.g. ReLU): a shift in the Layer Input produces a shift in the Layer Output (Note: it is about the Layer, not the single Convolution Operator)
- spatial invariance, regarding the Pooling Operator (e.g. Max Pooling passes on the max value in its receptive field regardless of its spatial position)
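The two properties listed above can be sketched in 1D. This is a toy illustration under assumptions of my own: `conv_relu` is a hypothetical single-channel layer with a circular boundary (real layers usually pad with zeros, which breaks exact equivariance near the borders), and `max_pool` uses non-overlapping windows.

```python
def shift(signal, k):
    """Circularly shift a 1D signal by k positions to the right."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def conv_relu(signal, kernel):
    """Circular 1D convolution layer followed by ReLU (assumed toy layer)."""
    n = len(signal)
    return [max(0, sum(w * signal[(i + j) % n] for j, w in enumerate(kernel)))
            for i in range(n)]


def max_pool(signal, width):
    """Max pooling over non-overlapping windows of the given width."""
    return [max(signal[i:i + width]) for i in range(0, len(signal), width)]


x = [0, 1, 3, 0, 0, 0, 0, 0]
kernel = [1, -1]  # responds to a falling edge

features = conv_relu(x, kernel)

# Equivariance of the layer: shifting the input shifts the feature map.
print(conv_relu(shift(x, 1), kernel) == shift(features, 1))

# Pooling absorbs a shift that stays inside one pooling window ...
print(max_pool(shift(features, 1), 4) == max_pool(features, 4))

# ... but not a shift that crosses a window boundary.
print(max_pool(shift(features, 2), 4) == max_pool(features, 4))

# A global max over the whole feature map ignores any circular shift.
print(max(shift(features, 2)) == max(features))
```

The third check prints `False`, which shows why local pooling gives only approximate invariance to small translations.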
The closer to the input layer, the closer to the purely spatial domain $ I $ and the more important the spatial equivariance property, which allows building spatially equivariant, hierarchical, and increasingly semantic representations.
The closer to the frontend, the closer to the purely semantic latent domain $ \mathcal{L} $ and the more important the spatial invariance, as the specific meaning of the image should be independent of the spatial positions of the features.
Using fully connected layers in the frontend makes the classifier sensitive to feature position to some extent, depending on the backend structure: how deep it is and how much the translation-invariant operator (Pooling) is used.
It has been shown in Quantifying Translation-Invariance in Convolutional Neural Networks that, to improve the translation invariance of a CNN classifier, it is more effective to act on the dataset bias (data augmentation) than on the inductive bias (architecture, hence depth, pooling, …).
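To make "acting on the dataset bias" concrete, here is a generic sketch of translation augmentation. It does not reproduce the protocol of the cited paper: `random_translate` and `augment` are hypothetical helpers, shifts wrap around the borders for simplicity (zero filling is a common alternative), and the tiny dataset is made up.

```python
import random


def random_translate(image, max_shift, rng):
    """Return a copy of `image` circularly shifted by a random offset."""
    rows, cols = len(image), len(image[0])
    du = rng.randint(-max_shift, max_shift)
    dv = rng.randint(-max_shift, max_shift)
    return [[image[(r - du) % rows][(c - dv) % cols] for c in range(cols)]
            for r in range(rows)]


def augment(dataset, copies, max_shift, seed=0):
    """Keep each (image, label) pair and add `copies` translated versions of it."""
    rng = random.Random(seed)
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        for _ in range(copies):
            augmented.append((random_translate(image, max_shift, rng), label))
    return augmented


# Made-up single-example dataset, for illustration only.
dataset = [([[0, 1, 0], [0, 2, 0], [0, 0, 0]], "cat")]
augmented = augment(dataset, copies=3, max_shift=1)
print(len(augmented))  # 4: the original plus three translated copies
```

The label is left unchanged for every translated copy, which is how the training set itself tells the classifier that position should not matter.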
$endgroup$
edited Dec 30 '18 at 14:17
answered Jan 4 '17 at 22:53
Laurent Duval
edited Jan 4 '17 at 20:45
answered Jan 4 '17 at 20:39
Neil Slater
$begingroup$
Just adding my 2 cents
Consider an image classification task solved with a typical CNN architecture made of a Backend (convolutions + nonlinearity + possibly spatial pooling), which performs representation learning, and a Frontend (e.g. fully connected layers, an MLP), which solves the specific task, here image classification. The idea is to build a function $ f : I \rightarrow L $ that maps from the spatial domain $ I $ (input image) to the semantic domain $ L $ (label set) in a two-step process:
- Backend (Representation Learning): $ f : I \rightarrow \mathcal{L} $ maps the input to the latent semantic space
- Frontend (Task-Specific Solver): $ f : \mathcal{L} \rightarrow L $ maps from the latent semantic space to the final label space
This is achieved using the following properties:
- spatial equivariance, for the ConvLayer (spatial 2D convolution + nonlinearity, e.g. ReLU): a shift in the layer input produces a shift in the layer output (note: this is about the layer, not the single convolution operator)
- spatial invariance, for the pooling operator (e.g. max pooling passes on the maximum value in its receptive field regardless of its spatial position)
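The invariance contributed by pooling is local, which a small 1D example can make concrete. This is an illustrative sketch only: `max_pool` is a hypothetical helper implementing non-overlapping max pooling with a window of 2, not the API of any particular library.

```python
def max_pool(values, window=2):
    """Non-overlapping 1D max pooling (stride equal to the window size)."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


a = [0, 0, 5, 0, 0, 0]   # peak at index 2
b = [0, 0, 0, 5, 0, 0]   # shifted by 1: peak stays inside the same window
c = [0, 0, 0, 0, 5, 0]   # shifted by 2: peak moves into the next window

pooled_a = max_pool(a)
pooled_b = max_pool(b)
pooled_c = max_pool(c)

# A small shift is absorbed by the window; a larger one is not.
same_for_small_shift = pooled_a == pooled_b
same_for_large_shift = pooled_a == pooled_c

# Only a global max is unaffected by where the peak ends up.
global_max_equal = max(a) == max(b) == max(c)

print(pooled_a, pooled_b, pooled_c)
```

So windowed pooling gives invariance only to shifts that keep a feature within its pooling window, while larger shifts still move the pooled response.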
The closer to the input layer, the closer to the purely spatial domain $ I $, and the more important the spatial equivariance property, which allows building a spatially equivariant, hierarchical and increasingly semantic representation.
The closer to the frontend, the closer to the purely semantic latent domain $ \mathcal{L} $, and the more important spatial invariance becomes, since the meaning of the image should be independent of the spatial positions of the features.
Using fully connected layers in the frontend makes the classifier sensitive to feature position to some extent, depending on the backend structure: the deeper it is and the more translation-invariant operators (pooling) it uses, the weaker this sensitivity.
It has been shown in Quantifying Translation-Invariance in Convolutional Neural Networks that, to improve the translation invariance of a CNN classifier, it is more effective to act on the dataset bias (data augmentation) than on the inductive bias (architecture, hence depth, pooling, …).
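As a rough idea of what acting on the dataset bias can look like, here is a minimal sketch of translation-based data augmentation. It is not the procedure from the cited paper: the helpers `translate` and `augment_with_shifts`, the zero fill value, the label `"digit"` and the shift range are all assumptions made for illustration.

```python
import random


def translate(image, dy, dx, fill=0):
    """Shift a 2D image (list of rows) by (dy, dx), filling exposed pixels."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = image[y][x]
    return out


def augment_with_shifts(image, label, copies, max_shift, rng):
    """Return `copies` randomly translated versions of one labelled image."""
    samples = []
    for _ in range(copies):
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        samples.append((translate(image, dy, dx), label))
    return samples


image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
rng = random.Random(0)   # fixed seed so the sketch is reproducible
augmented = augment_with_shifts(image, "digit", copies=3, max_shift=1, rng=rng)
```

Each augmented sample keeps the original label, so the classifier sees the same content at several positions and has to learn a position-tolerant decision instead of relying on the architecture alone.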
$endgroup$
answered Mar 15 '18 at 15:42
Nicola Bernini