Opened 6 years ago
Last modified 5 years ago
#142 new enhancement
Coordinate Type: Ensemble
Reported by: | markh | Owned by: | cf-conventions@… |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | Cc: |
Description (last modified by markh)
1. Title
Ensemble Coordinates
2. Moderator
TBC
3. Requirement
The description of the dimension of a data variable which describes an ensemble of forecasts may involve a number of elements of metadata.
Coordinate variables and auxiliary coordinates describing an ensemble need to be able to be labelled as such.
It is useful to encapsulate information about the nature of the ensemble with the coordinate. Current use cases for metadata standardisation are: the original size of the ensemble; the single-model or multi-model nature of the ensemble; and the presence of an explicit control forecast as realization=0.
4. Technical Proposal
4. Coordinate Types
... The attribute axis may be attached to a coordinate variable or auxiliary coordinate variable and given one of the values X, Y, Z, T or E. These labels stand for a longitude, latitude, vertical, time or ensemble axis respectively. ...
4.6 Ensemble Coordinate
Variables representing an ensemble or collection of realizations shall have an attribute axis with a value E. These variables are discrete, as described in section 4.5; they do not represent continuous quantities.
Ensemble variables have a number of optional standardised attributes available for use.
A data variable representing an ensemble will commonly have an ensemble coordinate with the standard name realization; this is not mandatory. The realization standard name is used to provide a unique identifying number to each ensemble member within the ensemble.
An ensemble coordinate providing a string label for each ensemble member within the ensemble may also be included, using the standard name ensemble_member_label. An ensemble_member_label coordinate shall have unique values and missing values are not allowed.
An ensemble coordinate with the standard name realization or ensemble_member_label may include an attribute named ensemble_control_member.
This value provides a definition that one member of the ensemble is the control member and identifies this member. This control member shall have the identified value within the ensemble coordinate's data.
The absence of this attribute shall be interpreted as a negative statement, explicitly stating that there is no control member identified within the ensemble: all members have equal status, no control member existed in the ensemble.
An ensemble coordinate may be identified as being from one model or multiple models by providing further variables identified as coordinates or auxiliary coordinates by the data variable. All of these coordinates shall have an attribute axis with a value E.
The standard names source and institution are used to identify the multiplicity of models from which the ensemble is taken; one, the other, or both may be present.
Such coordinates may be scalar coordinate variables or they may be attached to the same dimension(s) as other ensemble coordinates referenced by a data variable. Scalar coordinates are commonly used to define a single model ensemble; in this case, this is informationally equivalent to auxiliary coordinates with identical values.
The absence of such source and institution coordinates shall not be interpreted as a positive or negative statement. No inference about the multiplicity of models from which the ensemble is taken shall be drawn from the absence of such coordinates.
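As an illustration only, the single-model versus multi-model rule in the proposed section 4.6 could be sketched as below. The function name `model_multiplicity`, the dictionary layout and the model names are hypothetical and not part of the proposal; the sketch assumes each coordinate is held as a set of attributes plus a list of values.

```python
# Hypothetical in-memory layout: name -> {"attrs": {...}, "values": [...]}.
MODEL_NAMES = ("source", "institution")


def model_multiplicity(coords):
    """Classify an ensemble as 'single', 'multiple' or 'unknown' model."""
    model_coords = [
        c for c in coords.values()
        if str(c["attrs"].get("axis", "")).upper() == "E"
        and c["attrs"].get("standard_name") in MODEL_NAMES
    ]
    if not model_coords:
        # Absence of source/institution coordinates carries no information.
        return "unknown"
    distinct = max(len(set(c["values"])) for c in model_coords)
    return "single" if distinct == 1 else "multiple"


realization = {"attrs": {"axis": "E", "standard_name": "realization"},
               "values": [0, 1, 2]}

# Scalar source coordinate: one (made-up) model for the whole ensemble.
single = {"realization": realization,
          "source": {"attrs": {"axis": "E", "standard_name": "source"},
                     "values": ["model_a"]}}

# Auxiliary source coordinate along the ensemble dimension.
multi = {"realization": realization,
         "source": {"attrs": {"axis": "E", "standard_name": "source"},
                    "values": ["model_a", "model_b", "model_a"]}}

no_info = {"realization": realization}
```

A scalar coordinate and an auxiliary coordinate with identical values both give "single", which matches the statement that the two forms are informationally equivalent.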
Appendix A
| ensemble_control_member | D | C | ref:<<ensemble-coordinate>> | The identifier of the member of the ensemble which is the control member. |
Conformance Document
4 Coordinate Types
Requirements:
- The axis attribute may be attached to a coordinate variable or auxiliary coordinate variable only.
- The only legal values of axis are X, Y, Z, T and E (case insensitive).
- The axis attribute must be consistent with the coordinate type deduced from units and positive.
- A data variable must not have more than one coordinate variable with a particular value of the axis attribute.
4.6.2 Ensemble Control Member
Requirements
- A variable may have an attribute ensemble_control_member only if it has an axis attribute of E and a standard_name of either realization or ensemble_member_label.
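A conformance checker could test this requirement roughly as follows. This is a minimal sketch, assuming a variable's attributes are available as a plain dictionary; the function name `control_member_attribute_is_legal` is hypothetical.

```python
ALLOWED_STANDARD_NAMES = {"realization", "ensemble_member_label"}


def control_member_attribute_is_legal(attrs):
    """Check the proposed 4.6.2 requirement for one variable's attributes."""
    if "ensemble_control_member" not in attrs:
        return True  # the requirement only constrains variables that carry it
    # Values of axis are case insensitive, so compare in upper case.
    axis_is_e = str(attrs.get("axis", "")).upper() == "E"
    name_ok = attrs.get("standard_name") in ALLOWED_STANDARD_NAMES
    return axis_is_e and name_ok
```

For example, `{"axis": "E", "standard_name": "realization", "ensemble_control_member": 0}` passes, while the same attributes without `axis` do not.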
5. Benefits
Information regarding the nature of an ensemble is encoded in an ensemble coordinate, analogous to temporal coordinates.
Encoding of the presence of a control member and the single or multiple model nature of the ensemble is standardised.
Future standardisation of ensemble characteristics has a model to follow.
6. Status Quo
At present there is no standardised way of capturing information or characteristics about an ensemble; all information is encoded ad hoc by data producers.
For example, the size of the ensemble can only be inferred from the length of a realization dimension. If the ensemble is sliced, to leave only one member, the size of the ensemble is lost in the resulting data variable. This example is particularly problematic for conversion from CF to other data formats.
Change History (48)
comment:1 Changed 6 years ago by markh
comment:2 Changed 6 years ago by mgschultz
Dear Mark,
reads very well, I think. Only one suggestion: perhaps one should also try to standardize the "description" of the ensemble members? I guess, this would be a character variable with string length and ensemble size as dimensions. Would it make sense to suggest another optional attribute for the ensemble coordinate variable which points to this description variable?
Best regards, Martin
comment:4 follow-up: ↓ 6 Changed 6 years ago by jonathan
Dear Mark
Thanks for making this proposal. I didn't have the opportunity to comment in the email discussion because I was on holiday then.
I think you are right that ensemble axes need special recognition. We should also acknowledge that an ensemble axis is a special case of a discrete axis (CF sect 4.5). If we add a new section 4.6 for ensemble axes, that should be stated at the outset. Conversely, the ensemble axis should be added to the list of applications of discrete axes in sect 4.5.
When saying that an ensemble axis must have a standard name of realization, you are assuming it will have a coordinate variable. However, I don't think it needs to have one. The elements of an ensemble might be distinguished by a combination of auxiliary coordinate variables, and might not have any meaningful order, which is implied by a coordinate variable. However, I think it would be OK to say that a coordinate variable with a standard name of realization indicates it is an ensemble axis. I would say that units of 1 should not be required because dimensionless units are generally allowed to be omitted. (Requiring them would be something for the "strict" variety of CF which has been proposed.) If there is no other standard name for an ensemble axis for the moment, then I don't think we need a value of the axis attribute, because the standard name alone will do the job of identifying it, and the axis is redundant. (This is a different situation from the spatial axes, which have various means of identification, inherited from COARDS, so the axis attribute is a useful extra clue.)
Apart from the realization number, ensemble members might be identified by string-valued auxiliary coordinate variables. It was proposed a long time ago, but not agreed, that we could introduce standard names of institution and source for such variables, with the same meanings as the attributes of those names. In the context of CMIP, experiment would also be a good standard name to have, while the realizations in CMIP are identified by strings e.g. r1i1p1, not numbers. Therefore it would be good to allow realization coordinates to be string-valued. I think it would be useful to say something about the use of auxiliary coordinate variables to describe ensemble members in this way.
About ensemble_size. In your ticket 108, we discussed a more general mechanism for recording the original size of a dimension in cell_methods, as Karl has reminded us on the email list. That ticket hasn't been concluded, and I guess you think cell_methods is inappropriate because you want the information to be recorded for subsetting as well as statistical operations. In the earlier discussion, I believe that your use-case for subsetting was the selection of a single ensemble member, which I thought could be regarded as a point cell method. Do you have a use-case for a selection of a subset of several ensemble members? Maybe that could be recorded as a cell method too. Consider a situation where you have 10 ensemble members, you select 5 of them, and then you compute the ensemble mean. I think that recording the size 5, which would naturally belong in cell methods, would be at least as useful as recording the original size 10. I wonder what you use the original ensemble size for? It also seems to me that any treatment of this kind could apply to any discrete axis, not just an ensemble axis.
The standard name of source would be appropriate for a string-valued auxiliary coordinate variable that identifies the model (see CF sect 2.6.2). If we had that facility, I don't think you'd need the single_model_ensemble attribute. If they are all from the same model, it can be identified by a scalar coordinate of source; if they come from different models, there can be an auxiliary coordinate variable of source with the dimension of the ensemble size.
I haven't come across the idea behind the ensemble_control_member_0 before. This sounds rather specific to a particular application. Is it in sufficiently widespread use that it requires a standard? What does the "control member" mean? In climate model ensembles that I deal with, all members are equivalent.
Best wishes
Jonathan
comment:5 Changed 6 years ago by markh
To meet my use cases I have focussed on numerical realization coordinates.
However, I have no particular focus on coordinate variables; this identification mechanism should work for auxiliary coordinates just as well.
It sounds like there is a desire to have a string based identification and perhaps derived identifiers from multiple sources. These sound like valid use cases to me.
I suggest we generalise the proposal such that any coordinate can be used to describe an ensemble axis. This would suggest that the standard_name realization is not a good way to identify this coordinate type.
The introduction to chapter 4 states: The attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z or T which stand for a longitude, latitude, vertical, or time axis respectively.
Would you like this to imply that axis should only be used on a coordinate variable? (Appendix A does not place such a restriction) (This is not part of the compliance check, afaics)
Would you like this to imply that a data variable will reference 0 or 1 coordinates with an axis attribute of a particular value, or would it be suitable to have a data variable referencing multiple coordinates?
If neither of these restrictions applies, then we can add the axis = E identifier to any coordinate or auxiliary coordinate variable and use this to handle ensemble coordinate typing.
This seems like a good solution to me.
In light of this, I will reconsider the specification of a control member where the coordinate may not be numerical. This is a required use case, and commonly delivered in WMO-specific contexts for short-range ensemble forecasts.
Updates to follow
comment:6 in reply to: ↑ 4 Changed 6 years ago by markh
Replying to jonathan:
In your ticket 108, we discussed a more general mechanism for recording the original size of a dimension in cell_methods, as Karl has reminded us on the email list.
#108 is not aiming to store an original size; the intent of #108 is to capture the domain, analogous to the 'interval' in cell_methods currently available, but in a richer fashion. In this scenario, the domain could store the realization numbers which were input to a statistical process, but this may not be the original size.
These two pieces of information are independent.
I may want to support a work flow which subsets 15 realisations out of the original 19, then calculates the mean and standard deviation of the remaining realisations.
In order to output this comprehensively, I would like to state that this is a mean of realisations (0,2,3,4,.....) from an ensemble originally formed of 19 members.
That ticket hasn't been concluded, and I guess you think cell_methods is inappropriate because you want the information to be recorded for subsetting as well as statistical operations.
This is the case. More generally, the ensemble size is information about the coordinate, not about the data variable directly, and not about statistics calculated over the ensemble dimension, so cell_methods appears to me to be the wrong place to store this information.
I am more comfortable looking for analogies with space and time coordinates, such as calendar definitions for ways to define this metadata in a controlled manner.
comment:7 follow-up: ↓ 8 Changed 6 years ago by jonathan
Dear Mark
Yes, I agree that axis="E" would be a better way to identify an ensemble axis than depending on a particular standard name. As originally introduced, the axis attribute was intended for coordinate variables, not auxiliary coordinate variables, but we've agreed a ticket which allows it for aux coord vars too. As you say, axis="E" could be attached to any of the aux coord vars, or the coord var (if there is one), of an ensemble axis.
I may want to support a work flow which subsets 15 realisations out of the original 19, then calculates the mean and standard deviation of the remaining realisations. In order to output this comprehensively, I would like to state that this is a mean of realisations (0,2,3,4,.....) from an ensemble originally formed of 19 members
OK. I too am more comfortable looking for analogies with spatiotemporal coordinates, and you can imagine that there might be need to record the original dimension of other axes before a selection was made. Therefore I think it would be good to use an attribute which didn't say "ensemble" in its name.
Although I appreciate this doesn't feel quite like cell methods, there is a relationship, I suggest, since cell methods has the idea that each cell represents variation within itself, and therefore allows the original spacing of the data to be recorded. By an extension, you could regard a selection of the points along an axis as an operation e.g. a sample operation, which maps the original full range of variation into a smaller number of cells. Then you put e.g. cell_methods="ensemble: sample (dimension: original_ensemble) ensemble: mean". This would use a new cell method, instead of a new attribute. In your case, dimension ensemble=15 and original_ensemble=19.
Best wishes
Jonathan
comment:8 in reply to: ↑ 7 Changed 6 years ago by jonathan
Correction to my last posting. I didn't get the final paragraph quite right. In your example, the ensemble originally had 19 members, of which you selected a subset of 15, and then you computed e.g. the mean of them. So finally the ensemble axis has size 1, because it's been collapsed, but you may want to record the size it had before collapse, and before sampling. This could be done by using cell_methods to list these as consecutive operations, with an extension to record the dimension before each one: cell_methods="ensemble: sample (dimension: full_ensemble) ensemble: mean (dimension: sub_ensemble)", where dimensions full_ensemble=19, sub_ensemble=15 and ensemble=1.
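To make the suggested encoding concrete, a reader could recover the sizes before each operation along these lines. This is only a sketch: the `(dimension: ...)` comment is the extension suggested in this comment, not adopted CF syntax, the function name `parse_cell_methods` is hypothetical, and the pattern handles only the simple "name: method (dimension: dim)" form rather than full cell_methods grammar.

```python
import re

# One entry: "name: method", optionally followed by "(dimension: dimname)".
ENTRY = re.compile(r"(\w+):\s*(\w+)(?:\s*\(dimension:\s*(\w+)\))?")


def parse_cell_methods(text, dimensions):
    """Return (name, method, size_before_operation) for each entry in order."""
    steps = []
    for name, method, dim in ENTRY.findall(text):
        # An entry without a dimension comment yields an empty string here.
        size_before = dimensions[dim] if dim else None
        steps.append((name, method, size_before))
    return steps


# Dimension sizes from the example: 19 members, subset of 15, then collapsed.
dimensions = {"full_ensemble": 19, "sub_ensemble": 15, "ensemble": 1}
cell_methods = ("ensemble: sample (dimension: full_ensemble) "
                "ensemble: mean (dimension: sub_ensemble)")
steps = parse_cell_methods(cell_methods, dimensions)
```

Here `steps` lists the sample operation with a prior size of 19, followed by the mean with a prior size of 15.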
Cheers, Jonathan
comment:9 Changed 6 years ago by markh
- Description modified (diff)
It looks from the comments like we are in agreement about the use of axis=E to define that a coordinate is an ensemble coordinate. It also looks like there are good reasons not to rely on a standard name.
I am updating the summary to reflect this.
There is still discussion about how to capture associated metadata; further consideration is needed on this aspect.
comment:10 Changed 6 years ago by markh
The question of how to capture coordinate-level metadata is still outstanding.
Regarding 'size of ensemble':
- Jonathan has highlighted that it is plausible to address this as an extension to cell methods.
- I have proposed that this is captured as an attribute on an ensemble coordinate.
Both of these appear reasonable interpretations.
The other two use cases I have to address to meet my users' needs are:
- single model ensemble (boolean)
- ensemble control member (value)
Neither of these elements seem to fit the Cell Methods paradigm at all well.
With this in mind I think it is preferable to encapsulate all the ensemble coordinate metadata as named attributes on the ensemble coordinate, rather than extending cell methods to meet this use case and still requiring extra attributes.
Please may people share thoughts on why there is benefit in extending cell methods to enable 'original ensemble size' to be encoded there, when it is still necessary to define new limited scope attributes to address the use cases I have provided?
If we do not identify strong benefit from the cell methods approach, I think we should use defined name attributes for all these cases.
thank you mark
comment:11 Changed 6 years ago by jonathan
Dear Mark
Thanks for your posting. I would say that cell_methods offers a fairly natural way to encode more useful information than a simple attribute for the original ensemble size would do (as in my example above), and that there's no need to link this to your other two proposals, which don't belong in cell_methods, I agree. I commented above on these other two issues.
Rather than a new single_model_ensemble attribute, the source attribute could be used to identify the model. If there is just one model, there's just one word. If the ensemble has several models, it would be a list of words. Alternatively, you could have a string-valued auxiliary coordinate variable to record the model for each ensemble member.
I haven't come across the idea behind the ensemble_control_member_0 before. What does the "control member" mean? In climate model ensembles that I deal with, all members are equivalent. You mention that it is used in WMO contexts. Please could you give an example to illustrate the use case?
Best wishes
Jonathan
comment:12 Changed 5 years ago by markh
An ensemble control member is a deterministic forecast within the ensemble. There are no perturbations or stochastic physics applied to this member. It is generally stable through the development of the ensemble configuration.
Numerous models make a scientific decision to provide a control member as a distinct member of the ensemble; in these cases there is value in explicitly identifying this member. It is a member of the ensemble; it is not separate from it.
The Met Office GloSea ensemble makes a scientific decision not to have a control member: all members within the ensemble are perturbed, so there is no control.
Other models contributing to the ECMWF S2S project https://software.ecmwf.int/wiki/display/S2S/ explicitly include a control member and label that member as the control as part of their data submission.
comment:13 Changed 5 years ago by markh
- Description modified (diff)
comment:14 follow-up: ↓ 15 Changed 5 years ago by jonathan
Dear Mark
OK, I understand about the ensemble control member. Thanks. That information about its purpose would be useful to include in your next. I think the ensemble_control_member attribute should be allowed only if axis="E" is present. How is the member identified? For instance, is it an index within the ensemble dimension?
Best wishes
Jonathan
comment:15 in reply to: ↑ 14 Changed 5 years ago by markh
Replying to jonathan:
I think the ensemble_control_member attribute should be allowed only if axis="E" is present.
I agree, this attribute is only valid for a variable with axis="E" set.
How is the member identified? For instance, is it an index within the ensemble dimension?
I propose the value of the coordinate which identifies the control is the value of the attribute. So, if there is a coordinate where the control member is identified by a realization value of zero, the coordinate would be:
int realization(realization) ;
    realization:axis = "E" ;
    realization:ensemble_control_member = 0 ;
comment:16 follow-up: ↓ 17 Changed 5 years ago by jonathan
As I mentioned above, realizations might be identified by strings (as in CMIP) rather than numbers, which means that we couldn't specify the data type of ensemble_control_member. However, if it were an index within the ensemble dimension, it would definitely be a number.
Jonathan
comment:17 in reply to: ↑ 16 Changed 5 years ago by markh
Replying to jonathan:
As I mentioned above, realizations might be identified by strings (as in CMIP) rather than numbers, which means that we couldn't specify the data type of ensemble_control_member. However, if it were an index within the ensemble dimension, it would definitely be a number.
The data type of the attribute shall be consistent with the data type of the coordinate it is attached to.
Specifying the index is troublesome, as any change to the coordinate, whether in order, by subsetting, or anything similar, breaks the link.
This approach is too fragile, in my view. Specifying the coordinate's value is safer.
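The difference between the two options can be seen in a small sketch, assuming the coordinate is held as a plain list; the helper `subset` and the variable names are hypothetical.

```python
def subset(values, keep):
    """Select (and possibly reorder) coordinate values by position."""
    return [values[i] for i in keep]


realization = [0, 1, 2, 3]
control_value = 0                                  # store the coordinate value
control_index = realization.index(control_value)  # or store the position

# Reorder and drop a member, as a subsetting tool might.
reordered = subset(realization, [2, 3, 0])

# Looking up by value still lands on the control member.
found_by_value = reordered[reordered.index(control_value)]

# Reusing the stored position now points at a different member.
found_by_index = reordered[control_index]
```

After the reordering, the stored value still identifies realization 0, whereas the stored position selects realization 2.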
comment:18 follow-up: ↓ 19 Changed 5 years ago by jonathan
Dear Mark
Most CF attributes have a defined data type (string or numeric), though there are a few whose type depends on the variable viz. _FillValue, missing_value, flag_masks, flag_values. I agree that what you suggest would work, since the purpose of the attribute is to compare with the values in the variable to find the one which matches.
Another possibility occurs to me, which would avoid defining an attribute altogether - I'm sure you've noticed I tend to prefer not to define new machinery if we can make use of something we have! You could define a standard name such as ensemble_control_binary_mask, for an auxiliary coordinate variable which is 1 for the ensemble control member and 0 for the others. This would be perhaps even more robust, because it would survive subsetting and permutation of the ensemble axis, since all auxiliary coordinate variables have to be transformed consistently by such operations. If you deleted the ensemble control member from the ensemble, you would not have to remember to zap the attribute, since the 1 element would vanish from the mask at the same time.
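A sketch of this mask idea, assuming all coordinates on the ensemble axis are subset together; the helper `subset_axis` is hypothetical, and ensemble_control_binary_mask is only the name suggested above, not an agreed standard name.

```python
def subset_axis(coords, keep):
    """Apply the same positional selection to every coordinate on the axis."""
    return {name: [vals[i] for i in keep] for name, vals in coords.items()}


coords = {
    "realization": [0, 1, 2, 3],
    "ensemble_control_binary_mask": [1, 0, 0, 0],  # 1 marks the control member
}

# Permutation: the 1 moves with the control member.
permuted = subset_axis(coords, [3, 0, 1])

# Dropping the control member removes the 1 from the mask as well.
without_control = subset_axis(coords, [1, 2])
```

In `permuted` the mask still marks realization 0, and in `without_control` the mask contains no 1, so nothing needs to be removed by hand.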
Best wishes
Jonathan
comment:19 in reply to: ↑ 18 Changed 5 years ago by markh
Replying to jonathan:
Another possibility occurs to me, which would avoid defining an attribute altogether - I'm sure you've noticed I tend to prefer not to define new machinery if we can make use of something we have! You could define a standard name such as ensemble_control_binary_mask, for an auxiliary coordinate variable which is 1 for the ensemble control member and 0 for the others.
I can see what you are suggesting here and I can see that it would be functional.
I find the approach less clear than defining a particular coordinate value for the variable to provide this labelling function. I do not consider the 'control member' to be a masking operation, this seems like an odd fit to me.
This would be perhaps even more robust, because it would survive subsetting and permutation of the ensemble axis, since all auxiliary coordinate variables have to be transformed consistently by such operations. If you deleted the ensemble control member from the ensemble, you would not have to remember to zap the attribute, since the 1 element would vanish from the mask at the same time.
There is no need to remove such an attribute after operations, it is still valid, even if a lookup into the coordinate array returns false, due to subset operations. It still carries information in this case, that there was a control member defined.
char realization(realization) ;
    realization:axis = "E" ;
    realization:ensemble_control_member = "a" ;
is useful information, even when the only values in the array in the particular data set are
realization: ("c", "f")
I think this information is metadata about the coordinate: we are describing the coordinate, stating that there is a value which carries extra meaning, if it is present.
Encapsulating this information within the variable makes sense to me. I think it is well worth the price of adding one bespoke attribute to the convention.
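Read this way, the attribute distinguishes three situations, which a client might separate as in the sketch below. The function name `control_member_status` and the returned strings are hypothetical, and the first branch relies on the proposal text that absence of the attribute means no control member was identified.

```python
def control_member_status(values, attrs):
    """Interpret ensemble_control_member against the coordinate's values."""
    if "ensemble_control_member" not in attrs:
        return "no control member in the ensemble"
    if attrs["ensemble_control_member"] in values:
        return "control member present"
    # The attribute remains valid metadata even after subsetting.
    return "control member defined but not in this subset"


attrs = {"axis": "E", "ensemble_control_member": "a"}
status_subset = control_member_status(["c", "f"], attrs)
status_full = control_member_status(["a", "c", "f"], attrs)
status_none = control_member_status(["c", "f"], {"axis": "E"})
```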
mark
comment:20 Changed 5 years ago by markh
Similarly, the absence of this attribute on an ensemble coordinate explicitly states that there was no control member in the ensemble, which is also useful information.
comment:21 Changed 5 years ago by jonathan
Dear Mark
OK then for ensemble_control_member, thanks. My comments still stand regarding cell_methods and the indication of the number of models involved.
Best wishes and thanks
Jonathan
comment:22 Changed 5 years ago by markh
I think the suggestion about using source to provide the required information on number of models in the ensemble makes good sense. I think this will alleviate the need for a bespoke attribute
I will consider the wording for describing this and update the proposed text accordingly
comment:23 Changed 5 years ago by markh
I have reviewed #108, which has been suggested in context with the use of a CellMethod to store the size of the original ensemble.
It is my view that #108 needs significant revision if it is to be adopted to meet its current use case. More importantly it is aiming at a different use case so I don't think it is relevant without significant rethink.
To use a CellMethod to define the number of members an ensemble had when it was created, CF would need new CellMethod syntax. It is not clear to me that this is generally useful. I am wary of extending CellMethod's syntax only to meet this use case.
The 'number of members in the original ensemble' information is information about the ensemble coordinate. It is not information directly about the data variable. It is not information related to a 'cell'.
http://cf-metadata.github.io/#_data_representative_of_cells states:
When gridded data does not represent the point values of a field but instead represents some characteristic of the field within cells of finite "volume," a complete description of the variable should include metadata that describes the domain or extent of each cell, and the characteristic of the field that the cell values represent.
In this case, the gridded data may well be representing the point values of the field. Commonly there will be no 'cells of finite volume' for these data sets.
This information is about the ensemble coordinate. It is only relevant to the data variable through the ensemble coordinate. Keeping this information on the ensemble coordinate variable seems like a useful encapsulation of information.
As such, I reiterate my view that CF should encode the number of members an ensemble originally contained as an optional numerical attribute, specifically named for the purpose and only valid when attached to an ensemble coordinate.
I think this is a localised and sensible way to address the requirement to store this metadata interoperably.
Please may contributors with a preference for a CellMethods approach detail reasons why they feel this is a preferred solution to having a defined, limited scope, attribute?
thank you mark
comment:24 follow-ups: ↓ 25 ↓ 26 Changed 5 years ago by jonathan
Dear Mark
You are right that using cell_methods for this would be a generalisation. In fact the opening of sect 7, which you quote, is already too restrictive, in talking about "volume" (even with quotation marks), given that operations can be applied to any single or combination of coordinate axes. We should probably rewrite it anyway! In general, cell_methods indicates how the data you have represents the cells (hence the title of the section). I think it is plausible to regard taking a subset of the points within the cell, in order to represent the cell, as a cell method, and that is what you are doing when extracting a sub-ensemble. The minimum or the maximum, which are existing cell methods, are particular subsets (of size one) to represent the cell. The sample operation might also be performed on axes with continuous coordinate variables, taking every Nth point for example.
The advantages that I see of introducing a sample method with a dimension comment are (a) it meets your use case of recording that a subset has been extracted, and recording how big the full set was - both of these sorts of metadata may be useful in other ways too, (b) it allows you to record multiple subsetting, which your use-case needs (see comment 7 above) - you may want to know both the original size (19) and the number of members used to compute the mean (15), (c) it allows you to record when this subsetting occurred, in relation to other statistical reductions which might have been applied, (d) it doesn't require any new attributes, just an extension of an existing one.
Best wishes
Jonathan
comment:25 in reply to: ↑ 24 Changed 5 years ago by markh
Replying to jonathan:
thank you for your considered response
If we were to adopt this approach, what changes would be required to Cell Methods?
- Would we have to change the Cell Methods preamble to highlight its wider utility?
- What syntax would be added to Cell Methods to enable the storage of the size of a collection prior to an operation?
- note: this is different from the 'domain' discussion in #108, which is focussed on storing the values which were used from a coordinate which has been aggregated, not the size of that coordinate.
- Would subset be an explicit operation for a cell method?
- If so, would this be a true subset (< but not <=)?
- I do not expect to have a dimension in my file of length ensemble_size; in general I just have the number
For example, how would I write the metadata from:
- 1 GRIB2 message (ensemble 7 of a collection of size 19)?
- 19 GRIB2 messages (ensembles 0-18 of size 19)
thank you mark
comment:26 in reply to: ↑ 24 Changed 5 years ago by markh
Replying to jonathan:
a minor detail point
(b) it allows you to record multiple subsetting, which your use-case needs (see comment 7 above) - you may want to know both the original size (19) and the number of members used to compute the mean (15)
I do not need this for my use case. The only number I need to store here is 19, not 15.
Providing this capability goes beyond my requirements
comment:27 follow-up: ↓ 28 Changed 5 years ago by jonathan
Dear Mark
- I haven't drafted exact words (I could help with that if we decide to do it this way) but I don't think it would be a large change. I think the general idea of cell methods is to record how the value given for each cell represents the range of values which may occur within the extent of the cell. This idea applies to all the non-default methods (mean, minimum, etc.) while for point and sum there is only one value under consideration, so the idea is irrelevant.
- My suggestion is a standardised comment dimension: dimension in () after the cell method to record the dimension as it was before the operation.
- I suggested sample but it could be subset - I think that would also be OK for continuous axes. I don't think the question of < or <= can be answered in general; it's just a subset of the elements that the axis previously had. What do you mean by "explicit"?
- It's OK to store a number in a dimension, isn't it? A scalar variable could be used instead, but a dimension has the additional possibility that you could use it to preserve coordinate variables or auxiliary coordinate variables of the axis as it was before the subset was taken. I know you don't need it for your case, but it's good to have the option.
I'm not sure that I have understood your example - sorry. If you have a dimension full (=19) and take some subset of it, so you now have an ensemble with dimension sub (which could be 1 or 19 or anything between) the cell method entry would be sub: subset (dimension: full).
Best wishes
Jonathan
comment:28 in reply to: ↑ 27 Changed 5 years ago by markh
Replying to jonathan:
- Appendix E: Table E1 lists the current cell method names. These describe the numerical process carried out on the values: point, sum, mean, and so on.
- subset and sample are quite different in character. I don't think they fit in table E1 very well
- are these 'operations' just represented as 'point' cell_method instances?
- Dimensions are used to define the size of variables; when the size of a variable changes, its dimension changes. The value in this case is constant; it is an attribute of the ensemble coordinate. Storing it as a dimension is an indirection step which doesn't feel required and increases the risk of inconsistency. I prefer storing the explicit numerical value.
comment:29 Changed 5 years ago by jonathan
Dear Mark
I suppose we just feel differently about these things, so we'll have to see what others think. For my part, I think that taking a subset of the points in a cell is not radically different from taking a single one, such as the maximum or the minimum, which are special subsets. The facility of recording the size before the operation of cell methods would be generally useful; for example, to calculate a standard error from a standard deviation you need to know how big the sample was from which the statistic was calculated. If the single number has no purpose other than this, storing it in a dimension (rather than a scalar coordinate or in text in the attribute) is perhaps unnecessary machinery, but I don't see what inconsistency it would cause if nothing uses this dimension. On the other hand, it would be positively useful if you wanted to keep some auxiliary coordinate variables for the full sample, which I can imagine being valuable for ensembles. I know that this is anticipating a use-case which has not been raised, but at the time when we have a free choice about how to do things to meet an existing use-case, it's sensible to choose an approach which offers future possibilities. We could offer the alternatives of encoding the number as text in the cell_methods, or naming a dimension in the cell_methods. This would not be a difficulty for parsing since dimension names must begin with a letter.
Best wishes
Jonathan
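The standard-error point above can be sketched in Python as follows. This is only an illustration: the function name and the numbers are hypothetical, and it assumes the usual relation, standard error = standard deviation / sqrt(sample size).

```python
import math


def standard_error(std_dev: float, sample_size: int) -> float:
    """Standard error of the mean, given a standard deviation and the
    size of the sample from which that statistic was calculated."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return std_dev / math.sqrt(sample_size)


# Hypothetical values: a standard deviation of 2.0 computed from 16 members.
se = standard_error(2.0, 16)
```

Without the recorded sample size (16 here), the standard error cannot be recovered from the standard deviation alone, which is the motivation for storing the size somewhere in the metadata.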
comment:30 Changed 5 years ago by jonathan
Dear Mark
On further thought, I agree with you about the dimension. For your use-case, you don't need to keep old coordinates, so you don't need a dimension, and it's sufficient to give the size explicitly in the cell_methods. Your use case could be something like ensemble: subset (dimension: 19) ensemble: mean (dimension: 15) to indicate that first a subset of 15 was taken from 19, then the 15 were averaged. If you're not interested in noting that the mean was of only a subset, you could just put ensemble: mean (dimension: 19) to record the original full size. The dimension ensemble=1 is for the resulting axis after all this processing. If subsequently there is a use-case for recording coordinates as they were before cell_methods were applied, it would be an obvious and easy extension to put a dimension name instead of the number.
Best wishes
Jonathan
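A minimal Python sketch of how a reader might extract the size from such a cell_methods string. The "(dimension: ...)" comment is the syntax suggested in this discussion, not part of the published conventions, and the function name is hypothetical. The sketch handles only the forms shown in this ticket, not other cell_methods comments such as intervals.

```python
import re

# Matches entries such as "ensemble: mean (dimension: 19)".
# The "(dimension: ...)" comment is the syntax proposed in this ticket.
ENTRY = re.compile(r"(\w+):\s*(\w+)(?:\s*\(\s*dimension:\s*(\w+)\s*\))?")


def parse_cell_methods(text):
    """Return a list of (axis, method, size_before) tuples.

    size_before is an int when a number is given, a str when a dimension
    name is given, and None when the comment is absent.
    """
    entries = []
    for axis, method, dim in ENTRY.findall(text):
        if dim == "":
            size = None
        elif dim.isdigit():
            # Dimension names must begin with a letter, so an all-digit
            # token can only be an explicit size.
            size = int(dim)
        else:
            size = dim
        entries.append((axis, method, size))
    return entries


steps = parse_cell_methods(
    "ensemble: subset (dimension: 19) ensemble: mean (dimension: 15)"
)
```

Here `steps` holds one tuple per operation, in the order applied: the subset taken from 19 members, then the mean over 15.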
comment:31 Changed 5 years ago by markh
- Description modified (diff)
I have updated this proposal to take account of source and institution information for model multiplicity definition.
I have also added a short section on ensemble labelling, including a proposed new standard name: ensemble_member_label, canonical unit (a string value is expected) for providing string labels for ensemble members.
I remain worried by the approach under discussion for identifying original ensemble size using cell methods. I think that this information, in the context of my use case, is information about the coordinate variable, not the data variable.
Given the lack of agreement on this topic, I have removed it completely from this proposal. It is an additive change, whichever route is adopted. I propose we evaluate this ticket without that detail; we can return to it later if the rest of this proposal is deemed suitable.
Does this updated proposal have sufficient merit that it may be adopted? Are there remaining areas of concern?
thank you
markh
comment:32 follow-up: ↓ 35 Changed 5 years ago by jonathan
Dear Mark
Please could you write the proposal in a form which shows exactly the text changes you'd like to make to the conventions and conformance documents, i.e. which sections are to be modified (including Appendix A, I guess), where and how?
Many thanks
Jonathan
comment:33 Changed 5 years ago by markh
All of the text in the proposal in 4. Technical Proposal is new text
There's one line to add to Appendix A
| ensemble_control_member | D | C | <<ensemble-coordinate>> | The identifier of the member of the ensemble which is the control member.
I'll get the conformance doc updates up next
comment:34 Changed 5 years ago by jonathan
Dear Mark
What I mean is, the change should be described as a proposal for how the text should be changed exactly e.g. In section N, after the Nth paragraph, insert the following: ... and so on.
Best wishes
Jonathan
comment:35 in reply to: ↑ 32 Changed 5 years ago by markh
Replying to jonathan:
Dear Mark
Please could you write the proposal in a form which shows exactly the text changes you'd like to make to the conventions and conformance documents, i.e. which sections are to be modified (including Appendix A, I guess), where and how?
Many thanks
Jonathan
I have prepared a pull request
https://github.com/cf-convention/cf-conventions/pull/65
This does not include the conformance document changes, as this document is not yet under source control. I have a separate PR for that
https://github.com/cf-convention/cf-conventions/pull/64
which will be followed by a couple of details to make this easier
Does this get close to the information you are looking for?
mark
comment:36 Changed 5 years ago by jonathan
Dear Mark
It's not as easy to appreciate as if you would write down in this ticket what change is proposed and where. We might change to using GitHub in that kind of way, but we decided to postpone thinking about that until we get to CF 1.7 and more of us have some familiarity with it. For the moment, I would appreciate it if we could stick to the current procedure, which we have used for a long time as you know, namely to spell out the proposed change explicitly in the trac ticket. It has the advantage that it's easy to review the entire discussion and the versions of text under consideration in one place.
Best wishes
Jonathan
comment:37 follow-up: ↓ 38 Changed 5 years ago by markh
- Description modified (diff)
New Content
Conventions
4.6 Ensemble Coordinate
Variables representing an ensemble or collection of realizations shall have an attribute axis with a value E. These variables are discrete, as described in section 4.5; they do not represent continuous quantities.
Ensemble variables have a number of optional standardised attributes available for use. Further bespoke attributes to describe the ensemble in project specific ways are always allowed.
4.6.1 Identifying Members
A data variable representing an ensemble will commonly have an ensemble coordinate with the standard name realization; this is not mandatory. The realization standard name is used to provide a unique identifying number to each ensemble member within the ensemble.
An ensemble coordinate providing a string label for each ensemble member within the ensemble may also be included. The standard name ensemble_member_label is proposed (canonical unit = (string expected)); this name is not yet approved as a standard_name. An ensemble_member_label shall have unique values and missing values are not allowed.
4.6.2 Ensemble Control Member
An ensemble coordinate with the standard name realization or ensemble_member_label may include an attribute named ensemble_control_member.
This value provides a definition that one member of the ensemble is the control member and identifies this member. This control member shall have the identified value within the ensemble coordinate's data.
The absence of this attribute shall be interpreted as a negative statement, explicitly stating that there is no control member identified within the ensemble.
4.6.3 Single or Multiple Model Ensemble
An ensemble coordinate may be identified as being from one model or multiple models by providing further variables identified as coordinates or auxiliary coordinates by the data variable. All of these coordinates shall have an attribute axis with a value E.
The standard names source and institution are used to identify the multiplicity of models from which the ensemble is taken; one, the other, or both may be present.
Such coordinates may be scalar coordinate variables or they may be attached to the same dimension(s) as other ensemble coordinates referenced by a data variable. Scalar coordinates are commonly used to define a single model ensemble; in this case, this is informationally equivalent to auxiliary coordinates with identical values.
The absence of such source and institution coordinates shall not be interpreted as a positive or negative statement. No inference about the multiplicity of models from which the ensemble is taken shall be drawn from the absence of such coordinates.
Appendix A
| ensemble_control_member | D | C | ref:<<ensemble-coordinate>> | The identifier of the member of the ensemble which is the control member. |
Conformance Document
4 Coordinate Types
Requirements:
- The axis attribute may be attached to a coordinate variable or auxiliary coordinate variable only.
- The only legal values of axis are X, Y, Z, T and E (case insensitive).
- The axis attribute must be consistent with the coordinate type deduced from units and positive.
Removed requirements:
- The axis attribute is not allowed for auxiliary coordinate variables.
- A data variable must not have more than one coordinate variable with a particular value of the axis attribute.
4.6.2 Ensemble Control Member
Requirements
- A variable may have an attribute ensemble_control_member only if it has an axis attribute of E and a standard_name of either realization or ensemble_member_label.
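A conformance checker could test this requirement roughly as follows. This is a sketch only: it assumes each variable's attributes are available as a plain Python dict, and the function name is hypothetical.

```python
def control_member_errors(name, attrs):
    """Return conformance messages for one variable.

    name  -- the variable name (used only in messages)
    attrs -- dict of the variable's attributes
    """
    if "ensemble_control_member" not in attrs:
        return []  # the requirement only constrains variables that carry the attribute
    errors = []
    # Values of axis are case insensitive in the conformance document.
    if str(attrs.get("axis", "")).upper() != "E":
        errors.append(f"{name}: ensemble_control_member requires axis E")
    if attrs.get("standard_name") not in ("realization", "ensemble_member_label"):
        errors.append(
            f"{name}: ensemble_control_member requires standard_name "
            "realization or ensemble_member_label"
        )
    return errors


# Hypothetical variables for illustration.
ok = control_member_errors(
    "realization",
    {"axis": "E", "standard_name": "realization", "ensemble_control_member": 0},
)
bad = control_member_errors(
    "time", {"axis": "T", "standard_name": "time", "ensemble_control_member": 0}
)
```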
comment:38 in reply to: ↑ 37 Changed 5 years ago by markh
I hope 37 is helpful
comment:39 follow-up: ↓ 41 Changed 5 years ago by jonathan
Dear Mark
Thanks for writing out the proposed changes explicitly. That's very helpful. I think the proposal is generally fine. I have a few comments.
- A modification is also needed to the second para of preamble of sect 4.5. Replace the sentence starting "The attribute axis" with "The attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z, T or E, which stand for a longitude, latitude, vertical, time or ensemble axis respectively."
- Maybe a better title for sect 4.6 would be "Ensemble axis", like following sect 4.5 and the text in the previous bullet? It's not obvious to me what "ensemble coordinate" means. It's not like the phrase "longitude coordinate", which means a value of longitude; "ensemble" does not have a value. Elsewhere in the text I think you mean "ensemble coordinate variable".
- You propose to allow a given value of axis attribute to appear more than once for a given data variable. The reason we currently prohibit that is to reduce the possibility of inconsistency. I guess you want to allow it for axis="E" particularly because there may well be no (Unidata) coordinate variable and you'd like to attach it to all the aux coord vars. I would suggest an alternative: let's keep the existing prohibition, so that axis="E" is allowed only once, and allow it to be attached only to a coordinate variable of realization or a string-valued aux coord var of ensemble_member_label. That means, to identify it as an ensemble axis by the definition of 4.6, one or both of these kinds of coordinate must be provided (which would probably be a good thing) and that you know where to find the axis attribute if you want it.
- The first para of 4.6 seems a bit confused to me because it's talking about both data and coordinate variables without distinction.
- I don't think the statement about bespoke attributes is necessary because there is a general statement about this applying to the whole of CF in sect 2.6.
- Since the text is quite short, I think subsubsections aren't necessary - you could just use paragraphs instead.
- I'm not sure what you mean by the absence of the ensemble_control_member att. Since this att identifies the ensemble control member, omitting it obviously means that no control member has been identified - there's no need to state that. I suppose you mean that omitting implies that there is no control member i.e. all members have equal status.
Considering all the above, I suggest a subsection 4.6 as follows:
Data variables representing an ensemble or collection of realizations shall have an ensemble axis. This is a discrete axis, as described in section 4.5. Various attributes are available for use with ensemble axes, as described in this section.
An ensemble axis can optionally have a coordinate variable with the standard name realization, used to provide a unique identifying number to each member of the ensemble. A string-valued auxiliary coordinate variable of the ensemble axis may also be included with the standard name ensemble_member_label. Unlike other auxiliary coordinate variables, this variable must have a unique value for each element and missing values are not allowed. To indicate the ensemble axis, either the realization or the ensemble_member_label variable may have an axis attribute with the value E.
A realization or ensemble_member_label variable with attribute axis="E" can optionally have an ensemble_control_member attribute, whose presence indicates that one member of the ensemble is the control member and whose value is the realization or the ensemble_member_label of the control member. The absence of this attribute implies that the ensemble has no control member.
An ensemble may be described as being from one model or from multiple models by providing string-valued auxiliary coordinates with standard_name source and institution. Such variables can have the ensemble dimension or they can be scalars, indicating a single-model ensemble (equivalent to a variable with the ensemble dimension in which all values are equal). The absence of these variables does not imply anything about the number of models involved in the ensemble.
The text doesn't have to state that ensemble_member_label, source and institution are not yet defined standard_names. If this proposal is agreed, the standard_names should be proposed and I expect they will be approved too.
Best wishes and thanks
Jonathan
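The uniqueness and no-missing-values rule for ensemble_member_label could be checked along these lines. This is a sketch under stated assumptions: labels are given as a Python list, and a missing value is represented by None or an empty string, which is a choice made for this example rather than anything the proposal specifies.

```python
def label_errors(labels):
    """Check a list of ensemble member labels (str, or None for missing)."""
    errors = []
    # Assumption: None or "" stands for a missing value.
    if any(label is None or label == "" for label in labels):
        errors.append("missing values are not allowed")
    present = [label for label in labels if label]
    if len(set(present)) != len(present):
        errors.append("labels must be unique")
    return errors


# Hypothetical label sets for illustration.
good = label_errors(["control", "perturbed_01", "perturbed_02"])
duplicated = label_errors(["control", "perturbed_01", "perturbed_01"])
```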
comment:40 Changed 5 years ago by markh
- Description modified (diff)
I have modified the proposal, addressing the majority of the points raised by Jonathan.
comment:41 in reply to: ↑ 39 Changed 5 years ago by markh
Replying to jonathan:
- You propose to allow a given value of axis attribute to appear more than once for a given data variable. The reason we currently prohibit that is to reduce the possibility of inconsistency. I guess you want to allow it for axis="E" particularly because there may well be no (Unidata) coordinate variable and you'd like to attach it to all the aux coord vars. I would suggest an alternative: let's keep the existing prohibition, so that axis="E" is allowed only once, and allow it to be attached only to a coordinate variable of realization or a string-valued aux coord var of ensemble_member_label. That means, to identify it as an ensemble axis by the definition of 4.6, one or both of these kinds of coordinate must be provided (which would probably be a good thing) and that you know where to find the axis attribute if you want it.
I have not addressed this point. The ability to have multiple variables: coordinate variables, auxiliary coordinate variables and scalar coordinate variables identified as ensemble coordinates from a data variable's point of view is intrinsic to the proposal details.
In a number of cases, these will define a number of degrees of freedom; the ensemble may well not be a single 'axis'.
This proposal is noticeably diminished by that change.
What drives the desire to keep this restriction please?
'The reason we currently prohibit that is to reduce the possibility of inconsistency.'
does not appear a strong driver to me. What kind of inconsistency is feared?
comment:42 Changed 5 years ago by jonathan
Dear Mark
Where is your currently proposed text? Is it different from the version I suggested in my last comment? The text at the start of this ticket is quite different from that. The changes I proposed above were intended to improve logic and clarity. Maybe I did not succeed in those aims!
The potential inconsistency is that if there are various coordinate or 1D auxiliary coordinate variables for a particular dimension all with axis attributes, these attributes might have different values. That could happen in your case too for the ensemble axis. That is why I made my alternative suggestion that axis should be allowed only on realization or ensemble_member_label variables, and not on both or on any other variables. That prevents inconsistency, and also means you have to include realization or ensemble_member_label to define an ensemble axis, which seems to me a good thing. What's the particular need to label other coordinate or auxiliary coordinate variables with axis="E"? We don't do this for any other kind of axis.
Best wishes
Jonathan
comment:43 Changed 5 years ago by markh
- Description modified (diff)
Hello Jonathan
I have not adopted the text you proposed straight away. I have included a set of updates in the ticket summary which pick up the uncontentious points, from my point of view.
There are material changes in the text you propose which I want to discuss further and I felt it was easier for me to be able to contrast between the proposal I have raised and your suggested alterations by doing this.
It is not clear to me why the limitation on the use of the axis attribute exists. I don't see what problems occur by having multiple variables with an axis attribute of E. This is metadata about a coordinate (| aux coord | scalar coord), it is an 'Ensemble coordinate'.
With this in mind, I think I do disagree with changing the title of the section to Ensemble Axis. The aim here is to define one or more ensemble coordinate (| aux coord | scalar coord) instances.
thank you for your continued interest mark
comment:44 Changed 5 years ago by markh
Wow, the 'change description' diffs render awfully in the email notification, yet they come through very nicely to the trac interface, with diff links and so on.
It seems to me quite a quandary which communication mechanism to pander to; I am sure I can't please everybody.
comment:45 Changed 5 years ago by markh
Hello Jonathan
Please may you provide some more information on the value of the constraints:
- The axis attribute is not allowed for auxiliary coordinate variables.
- A data variable must not have more than one coordinate variable with a particular value of the axis attribute.
and how these could be altered by the changes under discussion here?
thank you mark
comment:46 Changed 5 years ago by jonathan
Dear Mark
The axis attribute is allowed on auxiliary coordinate variables. (It didn't used to be.) The reason for the prohibition on there being more than one occurrence of any particular value of axis attribute for a given data variable is that they might be inconsistent. For example, you might have one-dimensional (not auxiliary) coordinate variables for two different dimensions, both having axis="X". This would probably be a mistake, and it's easy to identify such mistakes by not allowing X (or any other value) to appear more than once. Of course, several 1D auxiliary coordinate variables for the same dimension all having axis="E" would not be a mistake, but it's more complicated to formulate the rule to identify mistakes if we allow repetition under some circumstances, especially as some auxiliary coordinate variables are multi-dimensional, and also allowed to have an axis attribute.
You might feel this is excessively prohibitive, but it does keep things simple. I feel that there is no great advantage to be allowed to label many coord or aux coord vars as E. If any one of them is labelled like that, it identifies the axis, and I propose that the labelled variable should have one of two possible standard names - that also will make an ensemble axis easier to identify in practice, I think.
I suggest changing the name of the section because I don't think "ensemble coordinate" is a meaningful phrase. An ensemble is a collection of realisations. It's the realisations which have coords and aux coords to distinguish them, not the ensemble. However, I think "ensemble axis" makes sense because the axis is a property of the entire ensemble.
Best wishes
Jonathan
comment:47 Changed 5 years ago by markh
Hello Jonathan
thank you for your response, I understand more now.
The axis attribute is allowed on auxiliary coordinate variables. (It didn't used to be.)
I don't believe that this allowance is written into the conformance document yet.
I think I follow your explanation.
Is it your intent to state that:
- A data variable shall have one coordinate or auxiliary coordinate variable for each allowed axis value in the set (x, y, z, t, e), or zero; never many.
?
thank you
mark
comment:48 Changed 5 years ago by jonathan
Dear Mark
The preamble of section 5 says "If an axis attribute is attached to an auxiliary coordinate variable, it can be used by applications in the same way the axis attribute attached to a coordinate variable is used. However, it is not permissible for a data variable to have both a coordinate variable and an auxiliary coordinate variable, or more than one of either type of variable, having an axis attribute with any given value e.g. there must be no more than one axis attribute for X for any data variable." This appears in section 4 of the conformance document, however, which is confusing. Moreover, that section is out of date, as you say, because it only allows the axis attribute on coord vars (not aux coord vars). We should update the conformance document with a defect ticket, I suppose. After updating, one of the requirements would read, "A data variable must not have more than one coordinate variable or auxiliary coordinate variable with a particular value of the axis attribute."
Best wishes
Jonathan
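The updated requirement quoted above lends itself to a simple mechanical check. The following Python sketch assumes the attributes of every coordinate and auxiliary coordinate variable of one data variable have been collected into a dict keyed by variable name; the function and variable names are hypothetical.

```python
from collections import Counter


def repeated_axis_values(coordinate_attrs):
    """Return the axis values that occur on more than one variable.

    coordinate_attrs maps variable name -> attribute dict, covering all
    coordinate and auxiliary coordinate variables of a single data variable.
    """
    counts = Counter(
        str(attrs["axis"]).upper()  # axis values are case insensitive
        for attrs in coordinate_attrs.values()
        if "axis" in attrs
    )
    return sorted(axis for axis, n in counts.items() if n > 1)


# Hypothetical data variable with two variables labelled E.
repeats = repeated_axis_values(
    {
        "time": {"axis": "T"},
        "realization": {"axis": "E", "standard_name": "realization"},
        "member_label": {"axis": "E", "standard_name": "ensemble_member_label"},
        "lat": {"units": "degrees_north"},
    }
)
```

Under the requirement as worded above, any non-empty result is a conformance error; under the proposal in this ticket, a repeated E would instead be permitted.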
This ticket leads on from the mailing list discussion: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2015/058424.html and the 'next message (by thread)' postings
The historical discussion: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/057010.html and the 'next message (by thread)' postings are also relevant