Opened 13 years ago
Closed 10 years ago
#37 closed enhancement (fixed)
Conventions for Point Observation Data
Reported by: | caron | Owned by: | stevehankin |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | cf-conventions | Version: | |
Keywords: | point | Cc: |
Description (last modified by painter1)
1. Title
Conventions for Point Observation Data
2. Moderator
Steve Hankin
3. Requirement
Current conventions are oriented towards gridded data. This proposal extends the framework to specify how to encode "point observation" data.
4. Initial Statement of Technical Proposal
We show six types of point observational data, and describe a general way to encode many variations. The main technical extension is a simple way to describe ragged arrays, for the case when rectangular arrays are too inefficient.
5. Benefits
- Many data providers would like to use CF conventions when storing their observational data.
- This will allow a standard for converting things like BUFR data into netCDF.
6. Status Quo
Currently sections 5.4 and 5.5 describe 2 examples of point observations (station time series and trajectories). This proposal generalizes those.
7. Detailed Proposal
Because of the length of this, I have written it as a Microsoft Word file to make it easy to edit. I can reformat later when it is close to being finished.
Some background docs and earlier drafts:
Attachments (2)
Change History (80)
comment:1 Changed 13 years ago by jonathan
comment:2 Changed 13 years ago by caron
Hi Jonathan:
These are both good points.
The two methods (contiguous and non-contiguous) have different trade-offs for both reading and writing. The contiguous proposal has better space utilization, and optimizes the common case of reading all children for a given parent (e.g. all observations for a given time). The non-contiguous case is easier to write because it allows data to be written in any order, but it will be rather inefficient for reading all children of a parent when using the unlimited dimension, since you will have to scan the entire file to satisfy that request.
I did eliminate another method (linked lists) that is being used by at least one important data provider (FSL MADIS), in the interests of simplicity, but I was trying to satisfy what I think is the minimum that real data providers will need.
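As a rough illustration of the trade-off described above, the following Python sketch contrasts the two layouts using plain lists. The names `row_size` and `parent_index`, the helper functions, and the sample values are assumptions made for this example and are not taken from the proposal.

```python
# Hypothetical data: 3 parents (e.g. stations) with 2, 0 and 3 children (observations).

# Contiguous layout: children are stored grouped by parent, in parent order,
# and each parent only records how many children it has.
temps_contiguous = [10.1, 10.4, 7.3, 7.9, 8.2]
row_size = [2, 0, 3]  # assumed name: number of children per parent


def children_contiguous(values, row_size, parent):
    """Read one parent's children with a single slice (no scan of the data)."""
    start = sum(row_size[:parent])
    return values[start:start + row_size[parent]]


# Non-contiguous layout: children are stored in arbitrary (write) order,
# and each child records which parent it belongs to.
temps_indexed = [7.3, 10.1, 7.9, 10.4, 8.2]
parent_index = [2, 0, 2, 0, 2]  # assumed name: parent of each child


def children_indexed(values, parent_index, parent):
    """Read one parent's children by scanning every stored child."""
    return [v for v, p in zip(values, parent_index) if p == parent]
```

In this sketch the contiguous reader touches only the slice it needs, while the non-contiguous reader has to visit every record, which mirrors the "scan the entire file" cost mentioned above. Conversely, appending a new observation to the non-contiguous layout is a simple append, whereas the contiguous layout requires the children of each parent to be written together.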
Having said that, I would be happy if some proportion of real data providers told me that they could live with the single non-contiguous case. So anyone reading this who falls into that category, please give your thoughts.
As for the PointFeature attribute, I don't think that it's possible for software to reliably distinguish all cases without this. For example, 5.8.3.1 (trajectory) looks the same as 5.8.1 (point). In any case, its absence makes generic client software that much harder to write. I feel like forcing the provider to classify their data is a reasonable burden, and is very valuable metadata that can be used in discovery and digital libraries by systems that will not be capable of computing it themselves. I also expect the classification to be the basis of future "feature services", but that is more speculative.
Regards, John
comment:3 Changed 13 years ago by lowry
Hello Jonathan,
Just to say that I agree with John on the value of the PointFeature attribute.
Cheers, Roy.
comment:4 Changed 13 years ago by caron
I have changed the proposal to live at
http://cf-pcmdi.llnl.gov/trac/wiki/PointObservationConventions
and asked Velemir to modify the original proposal above to reflect that, to eliminate confusion.
This will allow versions of the proposal to coexist.
comment:5 Changed 13 years ago by mlaker1
- Description modified (diff)
Per John's request, changed the link to Detailed Proposal in the ticket description.
comment:6 Changed 12 years ago by mlaker1
- Description modified (diff)
Steve Hankin volunteered to be the moderator.
comment:7 follow-up: ↓ 8 Changed 12 years ago by dmurray
Thanks, John for getting this discussion going. We have long awaited an expanded convention for point data.
- In synoptic meteorology, I would say that the predominant access of station data is to access many stations over a time range to produce a map. In that case, the station information is really part of the data and breaking it out as station obs doesn't gain you much. If you are primarily accessing a single station for a time range, then indexing on station makes sense. Having the station variables (id, lat, lon, alt) for each observation (essentially always treating them as point data) doesn't seem like a lot of overhead especially if you have 50 variables that you are measuring at each point. Also, station locations may change through time. For example, the location of Pacific Tsunameter station 46413 changed yesterday at 18:15 UTC. If the station info is stored once in the file, that change will not be reflected.
- categorizing data into station/point or trajectory seems arbitrary. Take synoptic ship observations for example. I would want to access them as above (plot a time animation of ship observations over an area). However, I would also want to create a trajectory of a particular ship (or several ships) and view that over time. In this case, I want to access the point observations either as points or as trajectories. Since the ships are moving, they are not really stations in the sense used here. As a user, I want to access them sometimes as points, but other times want to be able to link the points of a particular ship as a trajectory. One person's points is another person's trajectory.
- for radiosonde observations, there are several profiles at one point (mandatory, significant levels, troposphere, etc). I'm not sure how that would fit into this proposal.
Don Murray
comment:8 in reply to: ↑ 7 ; follow-ups: ↓ 10 ↓ 11 Changed 12 years ago by caron
Replying to dmurray:
Thanks, John for getting this discussion going. We have long awaited an expanded convention for point data.
- In synoptic meteorology, I would say that the predominant access of station data is to access many stations over a time range to produce a map. In that case, the station information is really part of the data and breaking it out as station obs doesn't gain you much. If you are primarily accessing a single station for a time range, then indexing on station makes sense. Having the station variables (id, lat, lon, alt) for each observation (essentially always treating them as point data) doesn't seem like a lot of overhead especially if you have 50 variables that you are measuring at each point. Also, station locations may change through time. For example, the location of Pacific Tsunameter station 46413 changed yesterday at 18:15 UTC. If the station info is stored once in the file, that change will not be reflected.
It seems to me that the use case that you want to cover is when a data provider wants to write station data, but doesn't want to factor out the station information, i.e. wants to repeat it in each observation (this is analogous to unnormalized tables in an RDBMS). Of course, they could always just make it plain point data, but they prefer the station feature type so that they can ask for a time series at a particular station. I agree this is something to allow, so let me think about a proposal that would cover it.
The spec is about giving data providers a standard encoding for doing whatever they need to. Because you might have 1000s of observations for every station, the efficiency of factoring out the station info will be important sometimes.
The issue of stations moving is a tricky one. If it happens often enough, it's a good reason to put lat, lon info into the observation, in addition to or instead of factoring out the station info. I will add some words about this and see what you think.
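A minimal sketch of the two encodings discussed here, written in Python with plain tuples. The station ids, coordinates, values, and the `flatten` helper are all hypothetical and are not part of the proposal; they only show what "factoring out" versus "repeating" the station information means.

```python
# Factored-out (normalized) form: station info is stored once per station.
stations = {"A": (40.0, -105.0), "B": (41.5, -104.0)}  # id -> (lat, lon)

# Each observation refers to its station by id: (station_id, time, temperature).
obs = [("A", 0, 281.2), ("B", 0, 279.9), ("A", 1, 281.9)]


def flatten(stations, obs):
    """Repeat the station's lat/lon in every observation (unnormalized form)."""
    return [(sid, *stations[sid], t, value) for sid, t, value in obs]


flat = flatten(stations, obs)
# Each flattened record is (station_id, lat, lon, time, temperature).
```

In the flattened form every record carries its own position, so a station that moves can simply report a different lat/lon from some time onward. The cost is that the position is repeated for every observation, which matters when there are thousands of observations per station.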
- categorizing data into station/point or trajectory seems arbitrary. Take synoptic ship observations for example. I would want to access them as above (plot a time animation of ship observations over an area). However, I would also want to create a trajectory of a particular ship (or several ships) and view that over time. In this case, I want to access the point observations either as points or as trajectories. Since the ships are moving, they are not really stations in the sense used here. As a user, I want to access them sometimes as points, but other times want to be able to link the points of a particular ship as a trajectory. One person's points is another person's trajectory.
There's not a lot to do about the fact that different data consumers will want to view data differently. This spec is about letting data providers encode the data in the way that they want. Perhaps some best practices can evolve to further guide them to satisfy consumer needs.
- for radiosonde observations, there are several profiles at one point (mandatory, significant levels, troposphere, etc). I'm not sure how that would fit into this proposal.
yes, that use case looks rather complicated. let me create a sample CDL for it.
comment:9 follow-up: ↓ 13 Changed 12 years ago by caron
braeckel@… sent me these comments:
Why are stationTimeSeries and stationProfileTimeSeries specific to a station? The optional use of a station name is useful to include in some cases, but I can see running into other cases where several data points are gathered in a time series but there is no associated station. In a couple of documents the stationTimeSeries is likened to the CSML PointSeriesFeature, but PointSeriesFeature does not include a station name or related terminology.
The ":CF\:pointFeature" name could be generalized to ":CF\:cdmFeature" or ":CF\:featureType", assuming that there is no additional and necessary semantics to these identifiers being point types vs general types.
comment:10 in reply to: ↑ 8 ; follow-up: ↓ 12 Changed 12 years ago by caron
Don Murray wrote:
- categorizing data into station/point or trajectory seems arbitrary. Take synoptic ship observations for example. I would want to access them as above (plot a time animation of ship observations over an area). However, I would also want to create a trajectory of a particular ship (or several ships) and view that over time. In this case, I want to access the point observations either as points or as trajectories. Since the ships are moving, they are not really stations in the sense used here. As a user, I want to access them sometimes as points, but other times want to be able to link the points of a particular ship as a trajectory. One person's points is another person's trajectory.
BTW, your example of the moving ship would be a "section: a collection of profiles which originate along a trajectory", not a "stationProfileTimeSeries: a time-series of profiles at a named location". I wasn't sure if that was clear.
comment:11 in reply to: ↑ 8 ; follow-up: ↓ 16 Changed 12 years ago by edavis
Replying to caron:
Replying to dmurray:
- categorizing data into station/point or trajectory seems arbitrary. Take synoptic ship observations for example. I would want to access them as above (plot a time animation of ship observations over an area). However, I would also want to create a trajectory of a particular ship (or several ships) and view that over time. In this case, I want to access the point observations either as points or as trajectories. Since the ships are moving, they are not really stations in the sense used here. As a user, I want to access them sometimes as points, but other times want to be able to link the points of a particular ship as a trajectory. One person's points is another person's trajectory.
There's not a lot to do about the fact that different data consumers will want to view data differently. This spec is about letting data providers encode the data in the way that they want. Perhaps some best practices can evolve to further guide them to satisfy consumer needs.
My hope for this is that the data model for these feature types (and the software that implements it) will support both subsetting and feature type conversion. So, for example, a grid could be asked for a time series at a particular point and a trajectory could be asked for itself as a collection of points.
In which case, data providers may be inclined to try to capture as much of the semantics of the data as possible by specifying the highest level feature type available that fits the dataset.
Ethan
comment:12 in reply to: ↑ 10 ; follow-up: ↓ 14 Changed 12 years ago by dmurray
Replying to caron:
BTW, your example of the moving ship would be a "section: a collection of profiles which originate along a trajectory", not a "stationProfileTimeSeries: a time-series of profiles at a named location". I wasn't sure if that was clear.
In this case it's not a profile. It's a single synoptic surface observation at a point.
Don
comment:13 in reply to: ↑ 9 Changed 12 years ago by caron
Replying to caron:
braeckel@… sent me these comments:
Why are stationTimeSeries and stationProfileTimeSeries specific to a station? The optional use of a station name is useful to include in some cases, but I can see running into other cases where several data points are gathered in a time series but there is no associated station. In a couple of documents the stationTimeSeries is likened to the CSML PointSeriesFeature, but PointSeriesFeature does not include a station name or related terminology.
If the observations are not at the same location, then you would use a point or profile collection, respectively. If you have collections of observations at the same location, then it becomes a time series. The common case is that the location is named, which I call a station. If there is no real name, then you would make one up ("loc1", "loc2", "loc3", ...). The name is needed so that you can refer to that time series by name.
A single stationTimeSeries corresponds to a CSML PointSeriesFeature. But when you have a collection of multiple stationTimeSeries/PointSeriesFeature, you will need a name in order to distinguish them.
The ":CF\:pointFeature" name could be generalized to ":CF\:cdmFeature" or ":CF\:featureType", assuming that there is no additional and necessary semantics to these identifiers being point types vs general types.
Yes, CF:featureType is probably a better name, so that this attribute could be used for types other than points. I will propose to change.
It's important that the values be from a controlled vocabulary. I will list the proposed set of values.
comment:14 in reply to: ↑ 12 ; follow-up: ↓ 15 Changed 12 years ago by caron
Replying to dmurray:
Replying to caron:
BTW, your example of the moving ship would be a "section: a collection of profiles which originate along a trajectory", not a "stationProfileTimeSeries: a time-series of profiles at a named location". I wasn't sure if that was clear.
In this case it's not a profile. It's a single synoptic surface observation at a point.
Sorry, I misread your post. A writer could encode as a point or a trajectory. If they encode as a trajectory, it could also be automatically processed as a point collection, but if they encode as points, it could not easily and automatically be processed as a trajectory. I think this is Ethan's point.
As you will see in the new proposed CDM API, a trajectory feature can be flattened into a point collection.
comment:15 in reply to: ↑ 14 ; follow-up: ↓ 17 Changed 12 years ago by dmurray
BTW, your example of the moving ship would be a "section: a collection of profiles which originate along a trajectory", not a "stationProfileTimeSeries: a time-series of profiles at a named location". I wasn't sure if that was clear.
In this case it's not a profile. It's a single synoptic surface observation at a point.
Sorry, I misread your post. A writer could encode as a point or a trajectory. If they encode as a trajectory, it could also be automatically processed as a point collection, but if they encode as points, it could not easily and automatically be processed as a trajectory. I think this is Ethan's point.
While it may be difficult, I think it would be useful to be able to turn a point dataset into a set of trajectories based on the value of a particular variable (e.g. station name).
As you will see in the new proposed CDM API, a trajectory feature can be flattened into a point collection.
Yes, that is useful. We use that in the existing netCDF-Java trajectory code for displaying aircraft tracks as both trajectories and points.
My use case is to take a set of point ship observations and create trajectories based on the station id.
Don
comment:16 in reply to: ↑ 11 Changed 12 years ago by dmurray
- categorizing data into station/point or trajectory seems arbitrary. Take synoptic ship observations for example. I would want to access them as above (plot a time animation of ship observations over an area). However, I would also want to create a trajectory of a particular ship (or several ships) and view that over time. In this case, I want to access the point observations either as points or as trajectories. Since the ships are moving, they are not really stations in the sense used here. As a user, I want to access them sometimes as points, but other times want to be able to link the points of a particular ship as a trajectory. One person's points is another person's trajectory.
There's not a lot to do about the fact that different data consumers will want to view data differently. This spec is about letting data providers encode the data in the way that they want. Perhaps some best practices can evolve to further guide them to satisfy consumer needs.
My hope for this is that the data model for these feature types (and the software that implements it) will support both subsetting and feature type conversion. So, for example, a grid could be asked for a time series at a particular point and a trajectory could be asked for itself as a collection of points.
I agree completely. This was the design behind ADDE in McIDAS.
In which case, data providers may be inclined to try to capture as much of the semantics of the data as possible by specifying the highest level feature type available that fits the dataset.
The trick is to anticipate other ways people might want to look at your data, not just how you are used to looking at it.
Don
comment:17 in reply to: ↑ 15 Changed 12 years ago by caron
Replying to dmurray:
BTW, your example of the moving ship would be a "section: a collection of profiles which originate along a trajectory", not a "stationProfileTimeSeries: a time-series of profiles at a named location". I wasn't sure if that was clear.
In this case it's not a profile. It's a single synoptic surface observation at a point.
Sorry, I misread your post. A writer could encode as a point or a trajectory. If they encode as a trajectory, it could also be automatically processed as a point collection, but if they encode as points, it could not easily and automatically be processed as a trajectory. I think this is Ethan's point.
While it may be difficult, I think it would be useful to be able to turn a point dataset into a set of trajectories based on the value of a particular variable (e.g. station name).
As you will see in the new proposed CDM API, a trajectory feature can be flattened into a point collection.
Yes, that is useful. We use that in the existing netCDF-Java trajectory code for displaying aircraft tracks as both trajectories and points.
My use case is to take a set of point ship observations and create trajectories based on the station id.
Don
This would be a good feature for the CDM API. I don't think there's anything to be done in this CF convention, other than to encourage, as you say, providers to "anticipate other ways people might want to look at your data, not just how you are used to looking at it".
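The two conversions discussed in this sub-thread (flattening trajectories into a point collection, and grouping points into trajectories by station id) can be sketched in a few lines of Python. This is only an illustration under assumed names and data: the tuple layout, the ship ids, and both helper functions are hypothetical and do not represent the CDM API.

```python
from collections import defaultdict

# Hypothetical point observations: (station_id, time, lat, lon, pressure).
points = [
    ("SHIP1", 3, 10.2, -40.0, 1012.0),
    ("SHIP2", 1, 55.0, -20.0, 1003.5),
    ("SHIP1", 1, 10.0, -40.5, 1011.0),
    ("SHIP1", 2, 10.1, -40.2, 1011.4),
]


def points_to_trajectories(points):
    """Group point observations by station id and order each group by time."""
    trajectories = defaultdict(list)
    for obs in points:
        trajectories[obs[0]].append(obs)
    for track in trajectories.values():
        track.sort(key=lambda obs: obs[1])
    return dict(trajectories)


def trajectories_to_points(trajectories):
    """Flatten a set of trajectories back into a single point collection."""
    return [obs for track in trajectories.values() for obs in track]


tracks = points_to_trajectories(points)
```

The grouping step only works because each point carries an identifier for its platform. A point dataset written without such an id could not be converted this way, which is one reason the choice of encoding by the provider matters.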
comment:18 Changed 12 years ago by caron
I have made changes to the proposal, specifically:
- change attribute name pointFeature to featureType
- add section 5.8.2.5 "flattened representation" for station data
- in overview: clarify what coordinates are required
- add standard_name on station id - otherwise there's no way to identify the station id
comment:19 follow-up: ↓ 20 Changed 12 years ago by ngalbraith
Quick question on this definition - "trajectory: a connected set of data points along a 1D curve in time and space."
For drifting buoys, there can be a depth dimension as well as time, with multiple sensors deployed at different depths under the buoy; it's not clear to me if the specification will allow that. I guess the term "1D curve in time and space" is confusing me, and I'm not sure why it's in the definition.
comment:20 in reply to: ↑ 19 Changed 12 years ago by caron
Replying to ngalbraith:
Quick question on this definition - "trajectory: a connected set of data points along a 1D curve in time and space."
For drifting buoys, there can be a depth dimension as well as time, with multiple sensors deployed at different depths under the buoy; it's not clear to me if the specification will allow that. I guess the term "1D curve in time and space" is confusing me, and I'm not sure why it's in the definition.
Hi Nan:
A trajectory is intended to deal with the case where there is no depth dimension, eg an aircraft track or drifting surface buoy. For your case, you want a "section" datatype, see section 5.8.6.
John
comment:21 follow-ups: ↓ 22 ↓ 24 Changed 12 years ago by lowry
Hi John,
I'm a little concerned about your response to Nan. I would see an aircraft track as analogous to a drifting buoy (or AUV) and therefore also a section and not a trajectory.
What you're proposing is that 'section' is data at one or more z values for each (x,y,t) value in the coverage. This makes a 'trajectory' a special case of a 'section' in which there is only one z value, which has (by implication) a value of zero. This challenges some of my prejudices about what constitutes a 'section', but they're open to challenge. Is everybody else comfortable with single-depth sections?
comment:22 in reply to: ↑ 21 ; follow-up: ↓ 23 Changed 12 years ago by mccann
Replying to lowry:
Hi John,
I'm a little concerned about your response to Nan. I would see an aircraft track as analogous to a drifting buoy (or AUV) and therefore also a section and not a trajectory.
I think a drifting buoy is more analogous to a neutrally buoyant balloon than to an aircraft. The difference is whether the measuring platform is traveling through the fluid or drifting with it. Though both ways of measurement will yield an x,y,z,t collection of data (with z=0 for the drifting buoy, unless it has a string of instruments below it), the interpretation is different for these two types of measurements. 'Trajectory' applies to the measurement platform moving through the fluid (aircraft, AUV). Do we have an appropriate word for measurements produced from a drifting measurement platform?
comment:23 in reply to: ↑ 22 Changed 12 years ago by caron
Replying to mccann:
Replying to lowry:
Hi John,
I'm a little concerned about your response to Nan. I would see an aircraft track as analogous to a drifting buoy (or AUV) and therefore also a section and not a trajectory.
I think a drifting buoy is more analogous to a neutrally buoyant balloon than to an aircraft. The difference is whether the measuring platform is traveling through the fluid or drifting with it. Though both ways of measurement will yield an x,y,z,t collection of data (with z=0 for the drifting buoy, unless it has a string of instruments below it), the interpretation is different for these two types of measurements. 'Trajectory' applies to the measurement platform moving through the fluid (aircraft, AUV). Do we have an appropriate word for measurements produced from a drifting measurement platform?
Hi Mike:
Let me just emphasize that the proposed "feature types" are based mostly on topological properties of the data (dimensionality, connectivity). A trajectory is a smooth 1D line in space/time. A section is a trajectory with hair, ie vertical lines attached along the trajectory.
A more sophisticated categorization using standard names or other mechanism could be added, but is not currently a part of this proposal.
So this proposal would allow a drifting buoy, balloon track (neutrally buoyant or not), and an aircraft track to all be classified as a trajectory. The case where the z value remains constant is not explicitly called out. However, by factoring the z out of the "nested table", one could clearly indicate that z is constant, which is a lot better than having to examine a coordinate variable to see if all the values are identical. However, if that case is extremely important, we might want to consider it as a separate type.
The choice of data type is not completely rigid. A balloon sounding, eg, might be classified as a profile, when the horizontal drift of the balloon is not important.
The data types are influenced pragmatically by actual files mostly from atmospheric/ocean data. Interestingly, Andrew Wolf came up with remarkably similar types as part of CSML.
comment:24 in reply to: ↑ 21 Changed 12 years ago by caron
Replying to lowry:
Hi John,
I'm a little concerned about your response to Nan. I would see an aircraft track as analogous to a drifting buoy (or AUV) and therefore also a section and not a trajectory.
What you're proposing is that 'section' is data at one or more z values for each (x,y,t) value in the coverage. This makes a 'trajectory' a special case of a 'section' in which there is only one z value, which has (by implication) a value of zero. This challenges some of my prejudices about what constitutes a 'section', but they're open to challenge. Is everybody else comfortable with single-depth sections?
Hi Roy:
Just to clarify: a good example of a trajectory is an aircraft track, which has x,y,z,t all varying in the general case. Z may be constant, and only t must be monotonic.
A trajectory could be a degenerate section, but making it so would be more complex than I think most data writers would prefer, without any gain. There's a fair amount of extra complexity to represent the section, as you can see from comparing 5.8.3 and 5.8.6.
comment:25 follow-ups: ↓ 26 ↓ 27 Changed 12 years ago by lowry
Hi John,
I'm 100% with you on Mike's comment: it's the spatio-temporal shape of the data co-ordinates that governs feature type and not the means by which that shape has been attained.
My point - which may not have been expressed that clearly - is whether a trajectory is 3-D or 2-D. Your answer to Nan seemed to me to be contradictory. On the one hand you were saying that a sub-surface buoy series was a 'section' but on the other you were saying that an aircraft flight - to me the mirror image of a sub-surface buoy track on the other side of the air-sea interface - was a trajectory.
There may be a muddying of the thinking here brought about by Argos buoys, which I would definitely treat as sections because they are a series of profiles at varying x, y, z. However, there are other sub-surface buoys that drift around on an isopycnal measuring temperature (either logging the data for access on recovery or transmitting it acoustically) at that particular depth as a function of x,y,z,t.
So, let's forget about buoys and think about AUVs such as Autosub and aircraft flights. Both cases have monotonic t with consequential x,y,z. Are these trajectories (my preference) or sections? If they are trajectories, why is the data from a simple logging ALACE float also not a trajectory?
I'm totally comfortable with a trajectory definition of monotonic t with associated x, y, z but not with the idea that a trajectory is 2-D, which was 50% implied by your reply to Nan. Am I right in assuming that this trajectory definition is your intention (it's certainly my take on CSML) and that buoys mapping to multiple feature types has created a red herring?
comment:26 in reply to: ↑ 25 Changed 12 years ago by dmurray
Replying to lowry:
I'm 100% with you on Mike's comment: it's the spatio-temporal shape of the data co-ordinates that governs feature type and not the means by which that shape has been attained.
My point - which may not have been expressed that clearly - is whether a trajectory is 3-D or 2-D. Your answer to Nan seemed to me to be contradictory. On the one hand you were saying that a sub-surface buoy series was a 'section' but on the other you were saying that an aircraft flight - to me the mirror image of a sub-surface buoy track on the other side of the air-sea interface - was a trajectory.
The section would be when the buoy is measuring parameters at multiple depths for any given txy of the buoy itself (maybe that's the Argos case below). For a drifting buoy that is measuring data at only one z position for every txy, then that would be a trajectory.
There may be a muddying of the thinking here brought about by Argos buoys, which I would definitely treat as sections because they are a series of profiles at varying x, y, z. However, there are other sub-surface buoys that drift around on an isopycnal measuring temperature (either logging the data for access on recovery or transmitting it acoustically) at that particular depth as a function of x,y,z,t.
So, let's forget about buoys and think about AUVs such as Autosub and aircraft flights. Both cases have monotonic t with consequential x,y,z. Are these trajectories (my preference) or sections? If they are trajectories, why is the data from a simple logging ALACE float also not a trajectory?
I think they would also be trajectories if they are just measuring parameters at a particular txyz. Z can vary, but if you had multiple Z at each txy, then it would be a section.
I'm totally comfortable with a trajectory definition of monotonic t with associated x, y, z but not with the idea that a trajectory is 2-D, which was 50% implied by your reply to Nan. Am I right in assuming that this trajectory definition is your intention (it's certainly my take on CSML) and that buoys mapping to multiple feature types has created a red herring?
In general, trajectories would be 3D spatial on a 1D manifold (time). However a hurricane track is an example of a trajectory that conventionally has no z dimension. Values are assumed to be at sea level (unless otherwise specified), so we should allow for storing a trajectory without having to specify the Z dimension (2D spatial on a 1D manifold).
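The rule of thumb stated in this reply (one z per t,x,y is a trajectory; several z per t,x,y is a section) can be written down as a small Python check. This is a sketch only: the function name `classify_feature`, the tuple layout, and the sample data are assumptions for illustration, and the proposal itself asks the provider to declare the feature type rather than having software infer it.

```python
from collections import Counter


def classify_feature(samples):
    """Classify (t, x, y, z) samples by how many z values share one (t, x, y).

    z may be None when no vertical coordinate is stored, as in the
    hurricane-track case where values are assumed to be at sea level.
    """
    per_position = Counter((t, x, y) for t, x, y, _z in samples)
    if max(per_position.values(), default=0) > 1:
        return "section"
    return "trajectory"


# Hypothetical samples for three of the platforms mentioned in the thread.
aircraft = [(0, 0.0, 0.0, 100.0), (1, 0.1, 0.0, 900.0), (2, 0.2, 0.1, 3000.0)]
buoy_with_chain = [
    (0, 5.0, 5.0, 0.0), (0, 5.0, 5.0, 10.0),
    (1, 5.1, 5.0, 0.0), (1, 5.1, 5.0, 10.0),
]
hurricane = [(0, 20.0, -60.0, None), (1, 21.0, -61.5, None)]
```

Here the aircraft and the hurricane track both come out as trajectories even though one has a varying z and the other has none, while the buoy with instruments at two depths comes out as a section.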
Don
comment:27 in reply to: ↑ 25 Changed 12 years ago by caron
Replying to lowry:
Hi John,
I'm 100% with you on Mike's comment: it's the spatio-temporal shape of the data co-ordinates that governs feature type and not the means by which that shape has been attained.
My point - which may not have been expressed that clearly - is whether a trajectory is 3-D or 2-D. Your answer to Nan seemed to me to be contradictory. On the one hand you were saying that a sub-surface buoy series was a 'section' but on the other you were saying that an aircraft flight - to me the mirror image of a sub-surface buoy track on the other side of the air-sea interface - was a trajectory.
There may be a muddying of the thinking here brought about by Argos buoys, which I would definitely treat as sections because they are a series of profiles at varying x, y, z. However, there are other sub-surface buoys that drift around on an isopycnal measuring temperature (either logging the data for access on recovery or transmitting it acoustically) at that particular depth as a function of x,y,z,t.
So, let's forget about buoys and think about AUVs such as Autosub and aircraft flights. Both cases have monotonic t with consequential x,y,z. Are these trajectories (my preference) or sections? If they are trajectories, why is the data from a simple logging ALACE float also not a trajectory?
I'm totally comfortable with a trajectory definition of monotonic t with associated x, y, z but not with the idea that a trajectory is 2-D, which was 50% implied by your reply to Nan. Am I right in assuming that this trajectory definition is your intention (it's certainly my take on CSML) and that buoys mapping to multiple feature types has created a red herring?
Hi Roy:
Yes, I agree with you, and I probably misunderstood Nan's example. An Autosub would be a trajectory, not a section, unless it was measuring at multiple depths at the same time. I think Don also has clarified this.
thanks for clearing this example up.
comment:28 follow-ups: ↓ 29 ↓ 30 Changed 12 years ago by bnl
Can I just sort of summarise this discussion and see if I've understood you all? I think we've come to the CSML+MOLES position, but I want to be sure:
The key issue seems to be that there is a confusion between the path taken by the platform from which (or at which) measurements are taken, and the measurements themselves. The feature type of a path taken by a platform would always seem to be a trajectory (whether one dimension is collapsed - so it's 2D+t - or not, 3D+t). In some cases there is a "sampling" at or on the platform, in which case the feature type of the data is also a trajectory, or, there are measurements taken in profiles from the platform, in which case the feature type of the set is section.
(To spell this out, the path taken by the platform is trajectory data, where the parameter measured is the actual location of the platform, data collected on, or at, the platform is trajectory data where the parameter measured is whatever ... data measured from the platform may have different feature types.)
comment:29 in reply to: ↑ 28 ; follow-up: ↓ 31 Changed 12 years ago by ngalbraith
Replying to bnl:
The key issue seems to be that there is a confusion between the path taken by the platform from which (or at which) measurements are taken, and the measurements themselves. The feature type of a path taken by a platform would always seem to be a trajectory (whether one dimension is collapsed - so it's 2D+t - or not, 3D+t). In some cases there is a "sampling" at or on the platform, in which case the feature type of the data is also a trajectory, or, there are measurements taken in profiles from the platform, in which case the feature type of the set is section.
By definition, trajectories may have x, y, and z all varying with time, though it is also reasonable to allow a depth dimension of 1 - I can't see the point of using a more constricted definition in the convention.
I'd like to be able to distinguish between the kind of data from something like an Argo float, with profiles along a track (where you clearly can think of the data as a section) and the kind of drifting surface buoy we sometimes deploy, with a few instruments at a constant depth. The dimensions might be
time (unlimited)
lon (time)
lat (time)
depth (3)
temp (time,depth)
To describe this as a 'section' seems misleading, especially given the definition provided in the Point Observation document: a 'series of profiles.'
The document shows a mechanism for storing multiple trajectories together, in section 5.8.3.2; however the dimensions indicate that these are not adjacent, parallel trajectories. Maybe it would be useful to generalize that description to include this type of data - I can't think of another practical use for "storing multiple trajectories in the same file, and the number of observations in each trajectory is the same" other than this one (although I suppose somebody has that sort of data).
comment:30 in reply to: ↑ 28 Changed 12 years ago by caron
Replying to bnl:
Can I just sort of summarise this discussion and see if I've understood you all? I think we've come to the CSML+MOLES position, but I want to be sure:
The key issue seems to be that there is a confusion between the path taken by the platform from which (or at which) measurements are taken, and the measurements themselves. The feature type of a path taken by a platform would always seem to be a trajectory (whether one dimension is collapsed - so it's 2D+t - or not, 3D+t). In some cases there is a "sampling" at or on the platform, in which case the feature type of the data is also a trajectory, or, there are measurements taken in profiles from the platform, in which case the feature type of the set is section.
(To spell this out, the path taken by the platform is trajectory data, where the parameter measured is the actual location of the platform, data collected on, or at, the platform is trajectory data where the parameter measured is whatever ... data measured from the platform may have different feature types.)
Hi Brian:
since i stare at netcdf files all day, not "measuring platforms", i don't actually use that terminology, but i think that your description would be equivalent in actual practice.
note that the idea of a trajectoryFeature is that the point measurements are "connected", which probably means that it may make sense to interpolate between the points. If the platform was measuring sporadically, one might decide to classify as pointFeature, which are not connected.
comment:31 in reply to: ↑ 29 Changed 12 years ago by caron
Hi Nan, comments are embedded:
Replying to ngalbraith:
Replying to bnl:
The key issue seems to be that there is a confusion between the path taken by the platform from which (or at which) measurements are taken, and the measurements themselves. The feature type of a path taken by a platform would always seem to be a trajectory (whether one dimension is collapsed - so it's 2D+t - or not, 3D+t). In some cases there is a "sampling" at or on the platform, in which case the feature type of the data is also a trajectory, or, there are measurements taken in profiles from the platform, in which case the feature type of the set is section.
By definition, trajectories may have x, y, and z all varying with time, though it is also reasonable to allow a depth dimension of 1 - I can't see the point of using a more constricted definition in the convention.
I'd like to be able to distinguish between the kind of data from something like an Argo float, with profiles along a track (where you clearly can think of the data as a section) and the kind of drifting surface buoy we sometimes deploy, with a few instruments at a constant depth. The dimensions might be
time (unlimited)
lon (time)
lat (time)
depth (3)
temp (time,depth)
To describe this as a 'section' seems misleading, especially given the definition provided in the Point Observation document: a 'series of profiles.'
technically, it could be described as a section, since it has multiple depths. Or one could call it a trajectory with constant z = surface, and consider temp as a vector, ie treat the depth as internal to the "point observation".
The document shows a mechanism for storing multiple trajectories together, in section 5.8.3.2; however the dimensions indicate that these are not adjacent, parallel trajectories. Maybe it would be useful to generalize that description to include this type of data - I can't think of another practical use for "storing multiple trajectories in the same file, and the number of observations in each trajectory is the same" other than this one (although I suppose somebody has that sort of data).
yes, that representation would probably not be very common.
comment:32 follow-up: ↓ 33 Changed 12 years ago by Heinke
Dear all,
we are on the way to include point observation data into the WDCC repository. Your proposal is very welcome to us. We are working together with the University of Bonn. Andreas Hense has some remarks to the feature types:
profile is now restricted to a vertical line. Could we expand this to a more general definition: profile: a set of data points along a vertical or horizontal line. or profile: a set of data points along a straight, linear measurement line to be described / parametrized by a single coordinate
We need it in the COPS Project for scintillometer measurements along a horizontal line.
section is restricted to a trajectory. Could we expand this for example to include ppi-scans of a radar (conical section) and rhi-scans of radar (triangular section).
A suggestion: section: a collection of profiles in an orthogonal direction to the measurement line (orthogonal not only in the cartesian sense)
Best wishes Heinke
comment:33 in reply to: ↑ 32 Changed 12 years ago by caron
Hi Heinke:
We are balancing the complexity of possible "data types" with the practicality of stuffing this kind of data into a rather simple storage format. The various data types are intended to be categories of data with which one can take advantage of some regularity to store the data more simply or efficiently.
The generalization of a profile currently is a trajectory, which is parametrized by a single coordinate, but we don't have a way to represent in CDL that the data is along a straight line. A profile, OTOH, fixes x and y to be a constant, which is easily expressible in CDL, eg:
float x;
float y;
float z(z);
Profile data is fairly common (at least in met data) and because the regularity is expressible in CDL, is a good candidate for a special data type. I haven't seen scintillometer data before, but if others think we need a special case for it, we could add another data type.
Scanning radar data, i believe, falls outside the scope of this proposal. Because it's so common and large, it deserves careful attention. There is a current effort at NCAR to define a CF Convention for radial data. I can put you in touch with this group offline if you like.
comment:34 follow-up: ↓ 35 Changed 12 years ago by ngalbraith
- Priority changed from high to medium
We're using NetCDF (CF) for station data from surface moorings, and one of the most annoying limitations of NetCDF for this purpose could probably be addressed in this proposal, but doesn't seem to be. The problem, for stations at which multiple instruments at different depths produce a 2-d time series data set, is the lack of structure in the variable attributes. We may have several types of instruments at various depths, all producing sea_water_temperature but with different instrument attributes (model number, serial number, accuracy, resolution, etc).
There are some workarounds, but they're messy; making attribute names like SN_1, SN_2, or comma-separating the values for different bins (instrument_resolution: .01, 0001 ...). These would not easily translate into a more structured metadata language like sensorml, and, since there is no standard approach, there are not likely to be tools built to accommodate any of these workarounds.
It would be great to decide on a standard way to define attributes for different depth bins. Is this something that could be covered in the stationTimeSeries proposal?
comment:35 in reply to: ↑ 34 Changed 12 years ago by jonathan
Dear Nan
Replying to ngalbraith:
The problem, for stations at which multiple instruments at different depths produce a 2-d time series data set, is the lack of structure in the variable attributes. We may have several types of instruments at various depths, all producing sea_water_temperature but with different instrument attributes (model number, serial number, accuracy, resolution, etc).
I would say that the natural CF way to handle this would be to record the sensor information not in an attribute but in a string-valued auxiliary coordinate variable, as described in CF 6.1 (though there it doesn't have this purpose). That's a 2D char array. One dimension is the max string length, the other would be (in this case) the depth dimension. The variable is pointed to by a coordinates attribute of the data variable. This gives you a way to supply a different string for each depth level. An auxiliary coordinate variable, like any coordinate variable, can have attributes such as standard name to identify what it contains.
In the single-level case, you could use a string-valued scalar coordinate variable to supply sensor information. Scalar coordinate variables can have metadata to describe them in just the same way as data variables, and they may be a better mechanism because (exactly as you point out) there always exists the possibility of wanting to generalise them into multivalued things that correspond to dimensions of the data.
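As a rough illustration of the 2D char array described above, the sketch below packs one sensor label per depth level into fixed-width rows (depth x max string length) and recovers the strings again. The labels, the `strlen` width, and the helper names are hypothetical and only serve to show the layout; this is not code from CF or from any netCDF library.

```python
def pack_strings(labels, strlen):
    """Pad each label to a fixed width, giving one row of characters per depth level."""
    rows = []
    for label in labels:
        if len(label) > strlen:
            raise ValueError(f"label {label!r} exceeds strlen={strlen}")
        # Pad with NUL characters, as is common for netCDF char arrays.
        rows.append(list(label.ljust(strlen, "\0")))
    return rows


def unpack_strings(rows):
    """Join each row of characters back into a string, dropping the padding."""
    return ["".join(row).rstrip("\0") for row in rows]


# Hypothetical sensor labels, one per depth level (assumption for this sketch).
sensor_model = ["SBE37", "SBE39", "TR-1060"]
char_array = pack_strings(sensor_model, strlen=8)
```

Each row of `char_array` shares the depth dimension with the data variable, which is what lets a different string be supplied for each depth level.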
Cheers
Jonathan
comment:36 follow-up: ↓ 37 Changed 12 years ago by caron
I agree with Jonathan, this kind of metadata should be in a variable, not an attribute, so that you can use the shared depth dimension.
One can also make it into an auxiliary coordinate variable by placing it in an :coordinates attribute. As Jonathan says, these should be on the data variables, not the depth coordinate. I would say that this part is optional, since probably only your software would know what to do with that.
As an aside, these kinds of auxiliary coordinates have the danger that generic software might confuse them with georeferencing coordinates. Eg if an auxiliary coordinate had units of "degrees_north", it might be impossible to distinguish it from the latitude coordinate. So in general i would advocate that CF move away from the use of units to distinguish coordinate types. We already have other mechanisms, so it's probably mostly a matter of best practices.
comment:37 in reply to: ↑ 36 Changed 12 years ago by jonathan
Replying to caron:
As an aside, these kinds of auxiliary coordinates have the danger that generic software might confuse them with georeferencing coordinates. Eg if an auxiliary coordinate had units of "degrees_north", it might be impossible to distinguish it from the latitude coordinate. So in general i would advocate that CF move away from the use of units to distinguish coordinate types. We already have other mechanisms, so it's probably mostly a matter of best practices.
I completely agree with this comment of John's. Units are not the right way to distinguish quantities. Other metadata, such as standard names, are intended for that purpose. CF supports distinction by units in special cases for COARDS compatibility.
Jonathan
comment:38 Changed 12 years ago by ngalbraith
Thanks, all - using a variable instead of an attribute is clearly the way to go - for some reason I just hadn't thought of it as an option.
Actually, using the attribute name "ancillary_variables" instead of "auxiliary coordinates" to tie these to the appropriate metadata variable would be more clear and would prevent the possible mis-construction of these fields as alternate spatial coordinates.
This attribute name was listed in the Conventions document (in Appendix A, which I've apparently never read before) and says "ancillary_variables : Identifies a variable that contains closely associated data, e.g., the measurement uncertainties of instrument data."
This will be particularly useful to us because we are allowing several QA/QC related fields to be provided either with dimensions (Z) or (T,Z). So ... unless I misunderstand the documentation, this problem seems to be solved. It might be useful to record it in the station section of the point obs specification, in any case.
comment:39 Changed 12 years ago by jonathan
Dear Nan
The ancillary_variables attribute is intended to point from one data variable to another data variable which supplies point-by-point metadata. I agree that is appropriate for quality data; that is indeed one of the intentions of it, and it might be possible to use one of the standard name modifiers to identify it (CF 3.3 and Appendix C). It would be exactly as intended for quality data (T,Z) i.e. the same dimensions as the data variable. I agree that it seems natural to allow it also for metadata which has a subset of the dimensions of the data variable, and it doesn't appear to be excluded by the standard or the conformance requirements. If there is any possibility of the variable being useful as a coordinate, I think it would be fine and helpful to point to it with both the coordinates and the ancillary_variables attributes.
Cheers
Jonathan
comment:40 follow-up: ↓ 42 Changed 11 years ago by Heinke
Dear John,
I like to make some comments about the new version. You have changed the definition of 'point':
old: * 'point': one or more parameters measured at one point in time and space
new: * 'point': one or more parameters measured at a set of 'point' in time and space
I think with the new definition, every observation data has the feature type 'point'. Every feature type is a subset of 'point'. Is this your intention?
'point' looks like a timeseries of an unstructured grid.
The second I like to mention is the semantic concerning 'TimeSeries'. If you expect every 'observation data' has a time dimension then you don't need 'TimeSeries' in the feature type name. 'stationTimeSeries' should be named with 'station'. If not stationProfile should be named with 'stationProfileTimeSeries'...
Best wishes Heinke
comment:41 follow-up: ↓ 43 Changed 11 years ago by dmurray
John-
Thanks for the update and the work on this. Just a couple of comments:
- If there is only one trajectory in a file, is it really necessary to have a trajectory_id?
- I would like to take a point dataset (say ship tracks, or airplane reports) and look at it as a set of trajectories identified by a particular variable. The data for a particular trajectory might not be continuous, but it seems like I could easily just add the trajectory_id to the variable that I'd like to use as the identifier. It would be up to the reader to synthesize the coordinate variable and trajectory_index, but would allow the flexibility of using NcML to arbitrarily change the variable used for the trajectory without it being hardcoded in the file.
Regarding Heinke's comment, perhaps point could be defined as:
one or more parameters measured at an arbitrary set of locations in time and space
comment:42 in reply to: ↑ 40 Changed 11 years ago by caron
Hi Heinke:
Replying to Heinke:
Dear John,
I like to make some comments about the new version. You have changed the definition of 'point':
old: * 'point': one or more parameters measured at one point in time and space
new: * 'point': one or more parameters measured at a set of 'point' in time and space
I think with the new definition, every observation data has the feature type 'point'. Every feature type is a subset of 'point'. Is this your intention?
actually, yes. All point feature types are collections of points. They differ in their connectivity. A point feature collection is an unconnected collection. A reviewer suggested changing to the plural in the definition of point feature.
'point' looks like a timeseries of an unstructured grid.
The second I like to mention is the semantic concerning 'TimeSeries'. If you expect every 'observation data' has a time dimension then you don't need 'TimeSeries' in the feature type name. 'stationTimeSeries' should be named with 'station'. If not stationProfile should be named with 'stationProfileTimeSeries'...
A time series of points all must have the same lat/lon. we call this the stationTimeSeries feature type, though just station would be fine also.
this is an important case because the number of stations is typically fixed, while the number of observations is unbounded. One can thus use much simpler techniques for finding all the data in a bounding box than the general case.
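To make the bounding-box point concrete, here is a minimal sketch: because the station table is small and fixed, one can test each station's lat/lon once and then select observations by station, rather than testing every observation's coordinates. All values and names are hypothetical, the per-observation station index is assumed to be available, and longitude wrap-around at the dateline is ignored.

```python
# Hypothetical station table (one entry per station) and observation table.
station_lat = [40.0, 52.5, -10.0]
station_lon = [-105.0, 13.4, 120.0]
station_index = [0, 1, 0, 2, 1, 1]   # which station each observation belongs to
temp = [12.1, 8.4, 12.3, 27.9, 8.6, 8.1]


def stations_in_bbox(lats, lons, lat_min, lat_max, lon_min, lon_max):
    """Return the set of station numbers whose position lies inside the box."""
    return {
        i
        for i, (lat, lon) in enumerate(zip(lats, lons))
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
    }


def obs_for_stations(index, wanted):
    """Return observation numbers that belong to any of the wanted stations."""
    return [i for i, s in enumerate(index) if s in wanted]


wanted = stations_in_bbox(station_lat, station_lon, 30.0, 60.0, -110.0, 20.0)
selected = [temp[i] for i in obs_for_stations(station_index, wanted)]
```

The spatial test touches only three stations here, however many observations accumulate along the unbounded dimension.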
thanks for your comments.
comment:43 in reply to: ↑ 41 Changed 11 years ago by caron
Replying to dmurray:
John-
Thanks for the update and the work on this. Just a couple of comments:
- If there is only one trajectory in a file, is it really necessary to have a trajectory_id?
good question, i also thought about that. one of my hidden agendas is to deal with aggregations of multiple files. one often sees a bunch of files with one trajectory in each file. i need a trajectory id for each. I could use the filename if i had to, but the data provider can often give a much better name, and i thought it would be better to force her to do so.
- I would like to take a point dataset (say ship tracks, or airplane reports) and look at it as a set of trajectories identified by a particular variable. The data for a particular trajectory might not be continuous, but it seems like I could easily just add the trajectory_id to the variable that I'd like to use as the identifier. It would be up to the reader to synthesize the coordinate variable and trajectory_index, but would allow the flexibility of using NcML to arbitrarily change the variable used for the trajectory without it being hardcoded in the file.
i know what you mean, this is a request for a client-side library like the CDM. This spec is for data writers, to give them a framework for defining their view of the data. So at the moment, I don't think your request would affect any of these conventions, do you?
Regarding Heinke's comment, perhaps point could be defined as:
one or more parameters measured at an arbitrary set of locations in time and space
or perhaps "unconnected set of locations" ?
comment:44 follow-up: ↓ 61 Changed 11 years ago by rsignell
Are there example NetCDF files or (preferably) OpenDAP data URLs for the various types of Point Feature datasets described here?
comment:45 follow-ups: ↓ 48 ↓ 50 ↓ 51 ↓ 52 ↓ 53 Changed 11 years ago by jonathan
Dear John
It's taken me a couple of months to make time to look at this again - sorry.
- It is evidently welcome to have this clarity about the representation of these data types. You start by saying that the existing 5.4 and 5.5 should be deprecated. Actually I think they are consistent with your representation, aren't they? Perhaps the right thing would be to replace those sections by new ones, based on your proposal, since data written in accordance with 5.4 and 5.5 would still be acceptable.
- In practice netCDF files are often single-quantity datasets (all those in CMIP3 are, for instance), but why do you want to prohibit mixing feature types in a file? There doesn't appear to be a structural limitation preventing more than one type being in the file. You could make the featureType attribute either global (if there was only one feature type in the file) or an attribute of the data variable (to permit more than one type).
- I would suggest that featureType attribute should be optional (like nearly all CF features), but if it is present, it implies various requirements, which you could tabulate for each featureType. These are mainly requirements for which aux coord variables must be present for each type, as that's what distinguishes the types. Is it really necessary to require an order of dimensions (station on the outside, for instance)? Elsewhere in CF we have relaxed ordering restrictions (that COARDS imposed). Software should be able to work out which dimension is which from the metadata; it's less robust to depend on an order.
- Missing data in coordinate variables is not allowed generally. I can't see why you want to allow it. Where does this need arise? It's not allowed because it violates the requirement that coordinate variables must be monotonic; I think that's a Unidata rule. Missing data would be OK in auxiliary coordinate variables.
- station_name and station_id should also be auxiliary coordinate variables i.e. named by the coordinates attribute. They are labels, as described by CF section 6.
- As in my previous comment in this ticket, I much prefer the index representation to the contiguous representation of ragged arrays. No-one replied to your remark on this, I think.
- The contiguous representation is a bit more efficient of space, but both of them are vastly more efficient than a sparsely filled rectangular array.
- The contiguous representation has auxiliary coordinate variables with a dimension (station) which is not a dimension of the data variable; that is not allowed by the CF definition of aux coord vars, which would have to be modified.
- The index representation is more flexible.
- It would be better to have only one of them, on grounds of simplicity.
- The index representation is rather similar to the existing mechanism of "compression by gathering" in CF 8.2. That is for a rather different purpose, but it's the same kind of indirection. Your example is
dimensions:
  station = 23 ;
  obs = UNLIMITED ;
variables:
  float lon(station) ;
    lon:units = "degrees_east";
  int stationIndex(obs) ;
    stationIndex:long_name = "which station this obs is for" ;
  float temp(obs) ;
    temp:long_name = "temperature" ;
    temp:units = "Celsius" ;
    temp:coordinates = "time lat lon alt" ;
According to CF 8.2, this would be
dimensions:
  station = 23 ;
  obs = UNLIMITED ;
variables:
  float lon(station) ;
    lon:units = "degrees_east";
  int obs(obs) ;
    obs:long_name = "which station this obs is for" ;
    obs:compress="station";
  float temp(obs) ;
    temp:long_name = "temperature" ;
    temp:units = "Celsius" ;
    temp:coordinates = "time lat lon alt" ;
The compress attribute indicates that the variable is an index into the named dimension. In CF 8.2, it's used for compressing a sparsely filled combination of dimensions into 1D. Here it would be used to expand, by repetition, a 1D dimension into another 1D dimension. But the machinery would be rather similar. Since this already exists in CF, I'd say that's another point in its favour over the contiguous representation.
- station_altitude could be just altitude, which is an existing standard name.
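The "expand, by repetition" idea in the index representation above can be sketched in a few lines: a per-station variable such as lon(station) is looked up through the per-observation index to give one value per observation. The values are made up, and the function name is hypothetical; this is an illustration of the indirection, not an implementation of CF 8.2.

```python
# lon(station) and the per-observation index, with hypothetical values.
lon = [-105.0, 13.4, 120.0]
station_index = [0, 1, 0, 2, 1, 1]


def expand_by_index(station_values, index):
    """Repeat per-station values so that there is one value per observation."""
    return [station_values[i] for i in index]


lon_per_obs = expand_by_index(lon, station_index)
```

Because each observation carries its own index, observations may be appended in any order, which is the flexibility claimed for this representation.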
Best wishes for 2010
Jonathan
comment:46 follow-up: ↓ 49 Changed 11 years ago by rsignell
Jonathan,
Regarding your point 2: I agree with you that it would be very useful to allow mixing feature types in the same dataset. We often have NetCDF files such as ADCP data that contain both "stationProfile" (the velocity time series at different depth bins) with "station" data (the single time series of temperature and other variables near the transducer).
Regarding your point 4: I think it WOULD be useful to have CF allow missing_data in coordinate variables, as in coastal ocean modeling we often have curvilinear but logically rectangular grids where the lon/lat values (or x,y values) that constitute the grid have been created on the water side only. This is the case for Delft3D grids, for example. So in the masked land regions the coordinate values do not exist and are therefore written as masked values. This currently causes these datasets to be non-CF compliant with no good reason.
comment:47 Changed 11 years ago by jonathan
Dear Rich
Do you mean that there is missing data in the 2D lat-lon coordinates? I think that is OK, because they are auxiliary coordinate variables. It is the 1D coordinate variables (in the Unidata sense) which must be monotonic with no missing data.
Cheers
Jonathan
comment:48 in reply to: ↑ 45 ; follow-up: ↓ 55 Changed 11 years ago by caron
Hi Jonathan:
Thanks for your thorough review and comments. I will break these into pieces so each can be discussed.
Replying to jonathan:
Dear John
It's taken me a couple of months to make time to look at this again - sorry.
- It is evidently welcome to have this clarity about the representation of these data types. You start by saying that the existing 5.4 and 5.5 should be deprecated. Actually I think they are consistent with your representation, aren't they? Perhaps the right thing would be to replace those sections by new ones, based on your proposal, since data written in accordance with 5.4 and 5.5 would still be acceptable.
yes, that would be fine, I'll look to see what minor modifications would be needed.
- In practice netCDF files are often single-quantity datasets (all those in CMIP3 are, for instance), but why do you want to prohibit mixing feature types in a file? There doesn't appear to be a structural limitation preventing more than one type being in the file. You could make the featureType attribute either global (if there was only one feature type in the file) or an attribute of the data variable (to permit more than one type).
- I would suggest that featureType attribute should be optional (like nearly all CF features), but if it is present, it implies various requirements, which you could tabulate for each featureType. These are mainly requirements for which aux coord variables must be present for each type, as that's what distinguishes the types. Is it really necessary to require an order of dimensions (station on the outside, for instance)? Elsewhere in CF we have relaxed ordering restrictions (that COARDS imposed). Software should be able to work out which dimension is which from the metadata; it's less robust to depend on an order.
My feeling is that the featureType attribute cannot be optional. The problem is that one cannot in a general way figure out the feature type by examining the file. Further, without knowing the feature type, one cannot correctly figure out how to read the data and map it to things like CDM Feature Types and the OGC WFS. Of course, one can always write specialized software that knows what kind of data is in the file, but for a general purpose library like the CDM, this would be impossible.
I'm quite concerned about how complex it is already to decode all the possibilities in this proposal. (A summary of the possibilities is at http://www.unidata.ucar.edu/software/netcdf-java/reference/FeatureDatasets/CFencodingTable.html). I have implemented decoding these in the latest CDM 4.1 library. I'm quite sure that I could not do so without knowing the featureType explicitly. Further, it would be much harder if I had to deal with the possibility of multiple featureTypes in a file. My feeling is that the right way to proceed is to get this working, and if there is a compelling real-life need for multiple feature types in a file, to figure out how to extend this later.
I agree in principle that "Software should be able to work out which dimension is which from the metadata", but in this case there is a combinatorial explosion of possibilities that makes the client implementation on the edge of infeasible. I will have a look at relaxing the dimension ordering, but for now I would again advocate starting with this simpler and more restricted case, which I know is possible. Better to relax this later if needed, than to allow something now which makes it all but impossible to handle in generic software.
(I'll respond to the rest in separate posts.)
Regards,
John
comment:49 in reply to: ↑ 46 Changed 11 years ago by caron
Hi Rich:
Replying to rsignell:
Jonathan,
Regarding your point 2: I agree with you that it would be very useful to allow mixing feature types in the same dataset. We often have NetCDF files such as ADCP data that contain both "stationProfile" (the velocity time series at different depth bins) with "station" data (the single time series of temperature and other variables near the transducer).
StationProfile data is allowed to have data that doesn't have a z dimension. I'm hoping this would cover your use case.
comment:50 in reply to: ↑ 45 Changed 11 years ago by caron
Replying to jonathan:
- Missing data in coordinate variables is not allowed generally. I can't see why you want to allow it. Where does this need arise? It's not allowed because it violates the requirement that coordinate variables must be monotonic; I think that's a Unidata rule. Missing data would be OK in auxiliary coordinate variables.
This situation comes up most obviously when you use the multidimensional representation, which creates rectangular arrays out of ragged arrays. For example you have 100 stations, all with 100 time points, except for some with less. So you have
float data1(station, obs);
float data2(station, obs);
float data3(station, obs);
...
float time(station, obs);
and place missing data in the time coordinates where needed, so the application can read just the time coordinate to find out where the data lies, instead of examining all the data arrays.
Note that time is an auxiliary coordinate. I believe all such cases are (or could be) auxiliary coordinates, so I'm being a bit sloppy when i talk about coordinates having missing values. I can clarify that in the wording.
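A small sketch of this use of missing values, assuming NaN stands in for whatever fill value a real file would declare: the reader inspects only the time(station, obs) array to learn which slots of the padded rectangular arrays hold real observations. The arrays and helper names are hypothetical.

```python
import math

MISSING = float("nan")  # stand-in for a _FillValue; an assumption for this sketch

# time(station, obs): station 1 has only two observations, so its row is padded.
time = [
    [0.0, 3600.0, 7200.0],
    [0.0, 3600.0, MISSING],
]
data1 = [
    [11.2, 11.5, 11.9],
    [9.8, 9.6, MISSING],
]


def valid_obs(time_row):
    """Return the obs positions whose time value is not missing."""
    return [j for j, t in enumerate(time_row) if not math.isnan(t)]


def read_station(data, time, station):
    """Pair each valid time with its data value for one station."""
    return [(time[station][j], data[station][j]) for j in valid_obs(time[station])]


series = read_station(data1, time, 1)
```

The same list of valid positions can be reused for data2, data3, and so on, so none of the data arrays has to be scanned for fill values.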
comment:51 in reply to: ↑ 45 ; follow-up: ↓ 56 Changed 11 years ago by caron
Replying to jonathan:
- station_name and station_id should also be auxiliary coordinate variables i.e. named by the coordinates attribute. They are labels, as described by CF section 6.
- The contiguous representation has auxiliary coordinate variables with a dimension (station) which is not a dimension of the data variable; that is not allowed by the CF definition of aux coord vars, which would have to be modified.
Yes, both the contiguous and the indexed representation refer to auxiliary coordinate variables with a dimension which is not a dimension of the data variable. This is needed to be able to not repeat those values for every observation. Essentially we "factor out" the common information into the station "table", then "join" the station to the data using the contiguous or indexed method. This is a new way of indicating coordinates that has not been used in CF previously. So it sounds like we would need to extend the definition of aux coord vars for this case.
station_name and station_id could be auxiliary coordinate variables, although they share this same issue of having a dimension not used by the data variable. My feeling is that the semantics of the contiguous and the indexed representation is clearer without adding the station_name and station_id as auxiliary coordinates.
comment:52 in reply to: ↑ 45 Changed 11 years ago by caron
Replying to jonathan:
- As in my previous comment in this ticket, I much prefer the index representation to the contiguous representation of ragged arrays. No-one replied to your remark on this, I think.
- The contiguous representation is a bit more efficient of space, but both of them are vastly more efficient than a sparsely filled rectangular array.
- The index representation is more flexible.
- It would be better to have only one of them, on grounds of simplicity.
- The index representation is rather similar to the existing mechanism of "compression by gathering" in CF 8.2. That is for a rather different purpose, but it's the same kind of indirection. Your example is
  dimensions:
    station = 23 ;
    obs = UNLIMITED ;
  variables:
    float lon(station) ;
      lon:units = "degrees_east";
    int stationIndex(obs) ;
      stationIndex:long_name = "which station this obs is for" ;
    float temp(obs) ;
      temp:long_name = "temperature" ;
      temp:units = "Celsius" ;
      temp:coordinates = "time lat lon alt" ;
According to CF 8.2, this would be
  dimensions:
    station = 23 ;
    obs = UNLIMITED ;
  variables:
    float lon(station) ;
      lon:units = "degrees_east";
    int obs(obs) ;
      obs:long_name = "which station this obs is for" ;
      obs:compress="station";
    float temp(obs) ;
      temp:long_name = "temperature" ;
      temp:units = "Celsius" ;
      temp:coordinates = "time lat lon alt" ;
The compress attribute indicates that the variable is an index into the named dimension. In CF 8.2, it's used for compressing a sparsely filled combination of dimensions into 1D. Here it would be used to expand, by repetition, a 1D dimension into another 1D dimension. But the machinery would be rather similar. Since this already exists in CF, I'd say that's another point in its favour over the contiguous representation.
The advantage of the contiguous representation is that it can be much more efficient for the reader to extract all of the observations for a particular station/profile/trajectory, in the case where you are storing large numbers of station/profile/trajectories in the same file. The contiguous representation requires reading the size array and then reading the contiguously-stored observations. The indexed representation requires the reader to read the entire index array, which can be very costly if it is stored along the unlimited dimension, and then to read observations possibly scattered throughout the file.
The indexed representation is more general and easier for the writer, and the contiguous representation is easier and potentially much more efficient for the reader. I've experimented with writing these files a lot in the context of using CF as a transfer format, subsetting large datasets (possibly not stored in netcdf) and returning CF-compliant files. I think it's a reasonable tradeoff in complexity for efficient reading.
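The reading trade-off described above can be sketched roughly as follows. This is plain Python over in-memory lists, not netCDF I/O; the function names and the sample values are made up for illustration. The point is only that the contiguous case needs the small per-station size array plus one slice, while the indexed case has to scan the whole per-observation index.

```python
from itertools import accumulate


def read_contiguous(row_size, obs, station):
    """Contiguous: offsets come from the small per-station size array."""
    starts = [0] + list(accumulate(row_size))
    return obs[starts[station]:starts[station + 1]]


def read_indexed(station_index, obs, station):
    """Indexed: the whole per-observation index array must be scanned."""
    return [v for i, v in zip(station_index, obs) if i == station]


# The same six observations stored both ways (values are invented).
obs_contig = [10.0, 11.0, 20.0, 30.0, 31.0, 32.0]   # grouped by station
row_size = [2, 1, 3]                                # obs count per station

obs_indexed = [10.0, 20.0, 30.0, 11.0, 31.0, 32.0]  # stored in arrival order
station_index = [0, 1, 2, 0, 2, 2]                  # station of each obs

print(read_contiguous(row_size, obs_contig, 2))
print(read_indexed(station_index, obs_indexed, 2))
```

Both calls return the same three observations for station 2; only the amount of index data that has to be read differs.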
comment:53 in reply to: ↑ 45 ; follow-up: ↓ 54 Changed 11 years ago by caron
Replying to jonathan:
- station_altitude could be just altitude, which is an existing standard name.
Thanks for pointing this issue out. The problem is that there's a lot of data, eg profile data, whose z coordinate is relative to the ground, i.e. the altitude of the station. So we need two coordinates: "altitude" and "station_altitude".
This is part of the larger problem of specifying the reference surface of z coordinates. I'd be happy to discuss that in more detail, but for this proposal, adding "station_altitude" probably would suffice for now.
Best wishes for the new year, John
comment:54 in reply to: ↑ 53 ; follow-up: ↓ 57 Changed 11 years ago by jonathan
Dear John
Replying to caron:
Replying to jonathan:
- station_altitude could be just altitude, which is an existing standard name.
Thanks for pointing this issue out. The problem is that there's a lot of data, eg profile data, whose z coordinate is relative to the ground, i.e. the altitude of the station. So we need two coordinates: "altitude" and "station_altitude".
The existing standard name for "vertical distance above the ground" is height. The standard_name of altitude means "vertical distance above the geoid".
Jonathan
comment:55 in reply to: ↑ 48 ; follow-up: ↓ 58 Changed 11 years ago by jonathan
Dear John
Replying to caron:
My feeling is that the featureType attribute cannot be optional.
It seems to me that it must be optional in some sense, because nothing prevents me from writing these kinds of data variable without including a featureType attribute, does it? They can mostly be decoded according to the existing CF conventions. I think the featureType attribute is a clue to the data-reader about the interpretation of the data.
The problem is that one cannot in a general way figure out the feature type by examining the file.
That seems to imply that some of the feature types are really degenerate. If a human can't tell which it is by looking at the file, is there really a distinction? You say you could not decode the data without the featureType. What extra essential information does it give you?
Is there anything in principle that prevents featureType being used as an attribute of a data variable, rather than a global attribute? That would be useful flexibility. Again, you could write data of several "structures" in one file anyway, without declaring a featureType variable. If you don't want to support files with several featureTypes in your software initially, that's fine. It could just give an error if presented with the featureType as an attribute of a data variable.
Jonathan
comment:56 in reply to: ↑ 51 ; follow-up: ↓ 59 Changed 11 years ago by jonathan
Dear John
Replying to caron:
Yes, both the contiguous and the indexed representation refer to auxiliary coordinate variables with a dimension which is not a dimension of the data variable.
You're right, this is a problem with the indexed representation too, and it is also not dealt with by CF 8.2 on compression by gathering, as it should be. I think we need to add some rules to deal with the dimensions of auxiliary coord variables in these situations.
station_name and station_id could be auxiliary coordinate variables, although they share this same issue of having a dimension not used by the data variable. My feeling is that the semantics of the contiguous and the indexed representation is clearer without adding the station_name and station_id as auxiliary coordinates.
We should sort out the first issue. I don't agree with the second point; they seem just like the other aux coord vars to me. If they are not listed by the coordinates attribute, they are not formally associated with the data variable. In the existing CF conventions about station data etc., these kinds of variable are named as auxiliary coord variables, and they're the same kind of variable in your proposal.
Best wishes
Jonathan
comment:57 in reply to: ↑ 54 ; follow-up: ↓ 60 Changed 11 years ago by caron
Replying to jonathan:
Dear John
Replying to caron:
Replying to jonathan:
- station_altitude could be just altitude, which is an existing standard name.
Thanks for pointing this issue out. The problem is that there's a lot of data, eg profile data, whose z coordinate is relative to the ground, i.e. the altitude of the station. So we need two coordinates: "altitude" and "station_altitude".
The existing standard name for "vertical distance above the ground" is height. The standard_name of altitude means "vertical distance above the geoid".
Ah, very good to know, I will incorporate that. Is there anything for "height of ground at this location", or do you think "station_altitude" is appropriate?
comment:58 in reply to: ↑ 55 Changed 11 years ago by caron
Replying to jonathan:
Dear John
Replying to caron:
My feeling is that the featureType attribute cannot be optional.
It seems to me that it must be optional in some sense, because nothing prevents me from writing these kinds of data variable without including a featureType attribute, does it? They can mostly be decoded according to the existing CF conventions. I think the featureType attribute is a clue to the data-reader about the interpretation of the data.
I would put it this way: the existence of the featureType attribute implies that the file is written in conformance to the CF Point Observation Conventions (chap 9 of the standard). Software can use that fact to do conformance checking.
The problem is that one cannot in a general way figure out the feature type by examining the file.
That seems to imply that some of the feature types are really degenerate. If a human can't tell which it is by looking at the file, is there really a distinction? You say you could not decode the data without the featureType. What extra essential information does it give you?
The featureType attribute gives you topological connectivity information, not discoverable by examining the file's metadata. For example, a collection of points (9.2) and a single station timeseries (9.3.4) look identical. A collection of profiles (9.5.2) and a single section of profiles along a trajectory (9.7.2) look identical. Such connectivity is implicit in gridded data, but must be explicitly specified in point data.
Is there anything in principle that prevents featureType being used as an attribute of a data variable, rather than a global attribute? That would be useful flexibility. Again, you could write data of several "structures" in one file anyway, without declaring a featureType variable. If you don't want to support files with several featureTypes in your software initially, that's fine. It could just give an error if presented with the featureType as an attribute of a data variable.
In principle, I think not, although most "features" need to be represented by multiple data variables, so the possibility of confusion is much greater. I would prefer to add this later if it proves to be important and feasible to implement. I think it's essential that there be implementations of conventions before accepting them.
Jonathan
comment:59 in reply to: ↑ 56 Changed 11 years ago by caron
Replying to jonathan:
Dear John
Replying to caron:
Yes, both the contiguous and the indexed representation refer to auxiliary coordinate variables with a dimension which is not a dimension of the data variable.
You're right, this is a problem with the indexed representation too, and it is also not dealt with by CF 8.2 on compression by gathering, as it should be. I think we need to add some rules to deal with the dimensions of auxiliary coord variables in these situations.
I've added this wording as part of the proposal:
In section 5, first paragraph, change:
"The dimensions of an auxiliary coordinate variable must be a subset of the dimensions of the variable with which the coordinate is associated (an exception is label coordinates (Section 6.1, “Labels”) which contain a dimension for maximum string length)"
to
"The dimensions of an auxiliary coordinate variable must be a subset of the dimensions of the variable with which the coordinate is associated (with two exceptions: 1) label coordinates (see Section 6.1, “Labels”) contain a dimension for maximum string length, and 2) the Point Observation indexed and contiguous representations (see Section 9, “Point Observations”) allow special kinds of coordinates which are connected in a different way than by the dimension)"
What do you think?
station_name and station_id could be auxiliary coordinate variables, although they share this same issue of having a dimension not used by the data variable. My feeling is that the semantics of the contiguous and the indexed representation is clearer without adding the station_name and station_id as auxiliary coordinates.
We should sort out the first issue. I don't agree with the second point; they seem just like the other aux coord vars to me. If they are not listed by the coordinates attribute, they are not formally associated with the data variable. In the existing CF conventions about station data etc., these kinds of variable are named as auxiliary coord variables, and they're the same kind of variable in your proposal.
The proposal uses a new convention for this situation, eg for station data:
"All variables that have the station dimension as their outer dimension are considered to be station information, and are called 'station variables'."
I think that this is a fairly natural way to express that a collection of data variables should be treated collectively. If we were using Netcdf-4, we could put this into a Structure, so this is a substitute for that.
Anyway, all the "station information" is associated with the observational data which has that station through the various mechanisms (indexed, contiguous, multidim, etc). Listing all the station variables in the coordinates attribute seems unneeded and cluttered.
This pattern of associating sets of data variables into these "pseudo-structures" is used in many places in the proposal, wherever factoring out information is desired.
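The "station variables" rule quoted above could be applied by a reader along these lines. This is only a sketch: the file layout is a hypothetical dictionary of variable names to dimension tuples, and `station_variables` is an invented helper, not an existing API.

```python
def station_variables(var_dims, station_dim="station"):
    """Names of variables whose outer (first) dimension is station_dim."""
    return sorted(
        name
        for name, dims in var_dims.items()
        if dims and dims[0] == station_dim
    )


# Hypothetical file layout: variable name -> tuple of dimension names.
layout = {
    "lon": ("station",),
    "lat": ("station",),
    "station_name": ("station", "strlen"),
    "row_size": ("station",),
    "time": ("obs",),
    "temp": ("obs",),
}

print(station_variables(layout))
```

Everything selected this way forms the station "pseudo-structure"; the observation variables are joined to it through the contiguous or indexed mechanism rather than through the coordinates attribute.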
Warm regards, John
comment:60 in reply to: ↑ 57 Changed 11 years ago by jonathan
Dear John
Replying to caron:
Replying to jonathan:
Replying to caron:
Replying to jonathan:
- station_altitude could be just altitude, which is an existing standard name.
Thanks for pointing this issue out. The problem is that there's a lot of data, eg profile data, whose z coordinate is relative to the ground, i.e. the altitude of the station. So we need two coordinates: "altitude" and "station_altitude".
The existing standard name for "vertical distance above the ground" is height. The standard_name of altitude means "vertical distance above the geoid".
Ah, very good to know, I will incorporate that. Is there anything for "height of ground at this location", or do you think "station_altitude" is appropriate?
The standard name for the altitude of the ground is surface_altitude.
Cheers
Jonathan
comment:61 in reply to: ↑ 44 Changed 11 years ago by caron
Replying to rsignell:
Are there example NetCDF files or (preferably) OpenDAP data URLs for the various types of Point Feature datasets described here?
I have a bunch of synthetic (NcML) datasets for testing. Do you want me to make them available?
comment:62 follow-ups: ↓ 63 ↓ 64 ↓ 65 ↓ 68 Changed 11 years ago by jonathan
Dear John
Since the proposal is complicated and detailed, I feel it would be helpful to put more in the overview. At present, the proposal is mainly demonstrated through the many examples, which are valuable. People wanting to do a certain thing can look for the appropriate example to find out how to do it, and I expect that is how it will be mainly used. However the overview could help by setting out the intentions, the meaning of the classifications, and the structures used in general.
Starting at the top, I think it could have a clearer title than point observation data. As some comments show, point is confusing because these structures are not for unconnected points, and also because one of the featureTypes proposed is itself called point. Observation is too restrictive. Although you are dealing with observational data, the same structures are applicable to model data, which may in fact want to use them with the aim of emulating observational data. I think the defining characteristic of these datasets is that they have one or more dimensions which are arbitrary and discrete, rather than continuous and geophysical; these dimensions are used to aggregate/collect/bundle/concatenate data which would otherwise be stored in separate data variables. The motivation for doing this is to save space in the file and limit the number of dimensions and coordinate variables required, isn't it? The motivation is not principally one of providing descriptive metadata, although the featureType is useful as a description; the proposal is made to address the practical inconvenience of having numerous data variables and dimensions. If that's right, it would be useful to state this at the outset, saying that this is mainly a concern for observational data measured in-situ and model data which emulates observational sampling. I am not sure what title would describe this purpose - does any of the words aggregate, collect, bundle or concatenate sound right to you?
Then you describe the featureTypes. I think it could be helpful for the reader to be more explicit about the structures implied. I think they are as follows:
featureType | data | auxiliary coordinates |
---|---|---|
point | data(o) | x(o) y(o) z(o) t(o) n(o) |
stationTimeSeries | data(s,o) | x(s) y(s) z(s) t(s,o) n(s) |
trajectory | data(s,o) | x(s,o) y(s,o) z(s,o) t(s,o) n(s) |
profile | data(s,o) | x(s) y(s) z(s,o) t(s) n(s) |
stationProfile | data(s,p,o) | x(s) y(s) z(s,p,o) t(s,p) n(s) |
section | data(s,p,o) | x(s,o) y(s,o) z(s,p,o) t(s,p) n(s) |
where o s p are index dimensions, x y z t are spatiotemporal coordinates and n is a string-valued coordinate (a label). It seems to me, as we have discussed before, that you can deduce what kind of data you have by examining the dimensions of the data and auxiliary coordinates. Therefore featureType is really redundant. I understand that it could be helpful if it can be relied on, but if it is going to be included then it must be possible to verify that it is correct, and for that we need the above rules to be stated. The CF checker will have to be able to work out what featureType describes the data by examining the data and auxiliary coordinates. Any program could do that.
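As a rough sketch of the kind of check meant here, the structures in the table above can be told apart by the number of dimensions of the data variable and of the x, z and t auxiliary coordinates. The lookup below only covers the table exactly as written; it assumes all of x, z and t are present and ignores optional coordinates and any alternative structures. The function name is invented for illustration.

```python
# Ranks (number of dimensions) of data, x, z and t, taken row by row
# from the table above.
SIGNATURES = {
    (1, 1, 1, 1): "point",
    (2, 1, 1, 2): "stationTimeSeries",
    (2, 2, 2, 2): "trajectory",
    (2, 1, 2, 1): "profile",
    (3, 1, 3, 2): "stationProfile",
    (3, 2, 3, 2): "section",
}


def guess_feature_type(data_dims, x_dims, z_dims, t_dims):
    """Return the featureType matching these dimension lists, or None."""
    key = tuple(len(d) for d in (data_dims, x_dims, z_dims, t_dims))
    return SIGNATURES.get(key)


# A profile collection: data(s,o), x(s), z(s,o), t(s).
print(guess_feature_type(("s", "o"), ("s",), ("s", "o"), ("s",)))
```

A result of `None` means the structure is not one of the six rows, which is where a checker would need further rules.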
I have not clarified above which auxiliary coordinate variables are optional. You say x y t are mandatory but z is sometimes optional. I think the rules for this should be explicitly stated too, so the presence of coordinate variables can be checked.
Next, you could describe the flattened representation. This is not included in all your subsections, but it is applicable to all of them except point. Also, it is not entirely "flat" in 9.6.4 and it would perhaps make a simpler convention if it was flat. I think "flattening" means combining all the index dimensions into one index dimension, and repeating values as often as necessary in the auxiliary coordinates. That means making any kind of data look like the point type.
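To make the idea of flattening concrete, here is a minimal sketch for the stationTimeSeries case, assuming data(s,o), x(s) and t(s,o) are held as nested Python lists. The function name and values are illustrative only. Each station coordinate is repeated once per observation so that the result looks like the point type.

```python
def flatten(x, t, data):
    """Flatten data(s,o), x(s), t(s,o) into point-like 1D lists."""
    flat_x, flat_t, flat_data = [], [], []
    for s, row in enumerate(data):
        for o, value in enumerate(row):
            flat_x.append(x[s])      # station value repeated per observation
            flat_t.append(t[s][o])
            flat_data.append(value)
    return flat_x, flat_t, flat_data


x = [10.0, 20.0]                              # x(s)
t = [[0, 1, 2], [0, 1, 2]]                    # t(s,o)
data = [[1.5, 1.6, 1.7], [2.5, 2.6, 2.7]]     # data(s,o)

fx, ft, fd = flatten(x, t, data)
print(fx, ft, fd)
```

The repetition in `fx` is the space cost that the ragged representations are meant to avoid.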
Then the overview can describe, in general terms, the three structures proposed to contain these data: multidimensional, ragged contiguous, and ragged indexed. Multidimensional is straightforward: it means having the dimensions in the above table. Ragged representations are not needed for the point type, but they are for the others, which have more than one index dimension. Apart from describing the intention of the data, the presence of the featureType attribute means that the ragged representations are allowed. The flattened and multidimensional representations are existing CF structures, so they would be allowed without this proposal.
I'd like to suggest a small change to the ragged contiguous representation. It could be described generically like this for a 2D case:
  float A(s);
  float N(s,strlen); // or N(o,strlen)
  int s(s);
    s:count="o";
  float B(o);
  float D(o);
    D:coordinates="A B N";
where A and B are (examples of) numerical auxiliary coordinate variables, N is a string-valued auxiliary coordinate variable (as in CF 6) and D is a data variable. The coordinate variable s is what you have called row_size. That is, I am proposing making this a coordinate variable in the Unidata sense, instead of depending on the name row_size. To assign a "special" variable name would be un-CF-like, since nothing in CF depends on the names of variables (as it says in CF 2.3, third para). To indicate that it has a special role, I propose above the new count attribute, which indicates the dimension whose entries s counts. (Actually my s(s) is not legally a coordinate variable, because its values are not monotonic. The presence of the count attribute can be used to exempt it from that requirement.) In your proposal, you have no formal link between D and s; presumably you depend on there being only one s-like dimension in the file. Making an explicit link could simplify the software which is decoding the data, and also means that the proposal could be generalised to have more than one s-like and o-like dimension, and more than one featureType.
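A decoder using the suggested count attribute might expand a station-dimensioned coordinate onto the observation dimension roughly as below. This is a sketch only: the arrays are plain Python lists with invented values, and `expand_by_count` is a hypothetical helper.

```python
def expand_by_count(station_values, counts):
    """Repeat each station value counts[i] times along the o dimension."""
    expanded = []
    for value, n in zip(station_values, counts):
        expanded.extend([value] * n)
    return expanded


A = [10.0, 20.0, 30.0]            # A(s)
s = [2, 0, 3]                     # s(s) with s:count = "o"
D = [1.0, 1.1, 3.0, 3.1, 3.2]     # D(o), stored contiguously per station

A_per_obs = expand_by_count(A, s)
print(A_per_obs)
```

The counts here are deliberately not monotonic, which is the situation the count attribute would have to exempt from the usual coordinate variable rule.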
As you see, I still think N should be named as an auxiliary coordinate variable. These string-valued coordinates were included in CF for precisely the reason you want to use them. Listing them in coordinates makes the link between the data variable and its string-valued coordinates.
The 3D case for the ragged contiguous representation is more complicated, of course. It seems that your example in 9.7.3 is actually a combination of the ragged contiguous and ragged indexed structures, one for each of the dimensions. Would it not be better to use one or the other purely? I find it rather mind-bending to be honest. I do think the ragged indexed representation has the advantage in terms of clarity.
I propose that the ragged indexed representation should look generically like this in 2D:
  float A(s);
  float N(s,strlen); // or N(o,strlen)
  int o(o);
    o:expand="s";
  float B(o);
  float D(o);
    D:coordinates="A B N";
where again I am using a (Unidata) coordinate variable o(o) instead of a variable with a special name - you say parent_index in the description. As I said before, this structure is just like the compression by gathering described in CF 8.2, except I have called the attribute expand instead of compress because it doesn't have the same role. Whereas a compressed index never repeats values, a ragged one usually does. (Hence o(o) is not a legal coordinate variable, because it's not monotonic; again, expand can exempt it from that requirement.) The expand attribute indicates that the o coordinate specifies which s coordinate applies to each entry along the o dimension. This is easier than the contiguous representation to extend to 3D, and in fact CF 8.2 is mainly intended for the case where more than one dimension is compressed. You just have o:expand="s p", for instance, which means that o is indices into the combination of the (s,p) dimensions, collapsed into 1D.
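The lookup implied by the suggested expand attribute can be sketched as follows, again with plain Python lists and invented values. For the two-dimension case `o:expand="s p"`, the sketch assumes the same zero-based, last-dimension-fastest ordering that CF 8.2 uses for compression by gathering; that ordering is an assumption here, since the comment does not spell it out.

```python
def expand_by_index(parent_values, index):
    """Pick the parent (s) value that applies to each entry along o."""
    return [parent_values[i] for i in index]


def unravel(flat_index, n_p):
    """Split an index into the combined (s,p) dimensions into (s, p),
    assuming p varies fastest."""
    return divmod(flat_index, n_p)


A = [10.0, 20.0, 30.0]     # A(s)
o = [0, 2, 2, 0, 1]        # o(o) with o:expand = "s"; values may repeat

A_per_obs = expand_by_index(A, o)
print(A_per_obs)

# With o:expand = "s p" and a p dimension of size 4, index 6 means s=1, p=2.
print(unravel(6, 4))
```

Unlike the contiguous form, the observations here can be stored in any order, because each one carries its own parent index.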
All the above does look rather complicated, and it certainly is! That's why your explicit examples of all the cases are useful. Nonetheless I think it should also be described in general terms how these structures work.
Cheers
Jonathan
comment:63 in reply to: ↑ 62 Changed 11 years ago by caron
Hi Jonathan:
I will again break my replies into several parts:
Replying to jonathan:
Dear John
Since the proposal is complicated and detailed, I feel it would be helpful to put more in the overview. At present, the proposal is mainly demonstrated through the many examples, which are valuable. People wanting to do a certain thing can look for the appropriate example to find out how to do it, and I expect that is how it will be mainly used. However the overview could help by setting out the intentions, the meaning of the classifications, and the structures used in general.
That would be useful.
Starting at the top, I think it could have a clearer title than point observation data. As some comments show, point is confusing because these structures are not for unconnected points, and also because one of the featureTypes proposed is itself called point. Observation is too restrictive. Although you are dealing with observational data, the same structures are applicable to model data, which may in fact want to use them with the aim of emulating observational data. I think the defining characteristic of these datasets is that they have one or more dimensions which are arbitrary and discrete, rather than continuous and geophysical; these dimensions are used to aggregate/collect/bundle/concatenate data which would otherwise be stored in separate data variables. The motivation for doing this is to save space in the file and limit the number of dimensions and coordinate variables required, isn't it? The motivation is not principally one of providing descriptive metadata, although the featureType is useful as a description; the proposal is made to address the practical inconvenience of having numerous data variables and dimensions. If that's right, it would be useful to state this at the outset, saying that this is mainly a concern for observational data measured in-situ and model data which emulates observational sampling. I am not sure what title would describe this purpose - does any of the words aggregate, collect, bundle or concatenate sound right to you?
For me, the motivation is to describe collections of data collocated at specified space/time locations. The underlying data fields are continuous and geophysical. The difference between this proposal and gridded data is the sampling topology, which leads to differing representations than gridded data. However, I think it would be misleading to name the proposal by the representations. I've thought about "in-situ" data, but it also covers profile data taken from satellites. Perhaps others could chime in on what vocabulary they use in their domain for these data.
comment:64 in reply to: ↑ 62 Changed 11 years ago by caron
Replying to jonathan:
Then you describe the featureTypes. I think it could be helpful for the reader to be more explicit about the structures implied. I think they are as follows:
featureType | data | auxiliary coordinates |
---|---|---|
point | data(o) | x(o) y(o) z(o) t(o) n(o) |
stationTimeSeries | data(s,o) | x(s) y(s) z(s) t(s,o) n(s) |
trajectory | data(s,o) | x(s,o) y(s,o) z(s,o) t(s,o) n(s) |
profile | data(s,o) | x(s) y(s) z(s,o) t(s) n(s) |
stationProfile | data(s,p,o) | x(s) y(s) z(s,p,o) t(s,p) n(s) |
section | data(s,p,o) | x(s,o) y(s,o) z(s,p,o) t(s,p) n(s) |
where o s p are index dimensions, x y z t are spatiotemporal coordinates and n is a string-valued coordinate (a label). It seems to me, as we have discussed before, that you can deduce what kind of data you have by examining the dimensions of the data and auxiliary coordinates. Therefore featureType is really redundant. I understand that it could be helpful if it can be relied on, but if it is going to be included then it must be possible to verify that it is correct, and for that we need the above rules to be stated. The CF checker will have to be able to work out what featureType describes the data by examining the data and auxiliary coordinates. Any program could do that.
It might be possible using your representation above, but once you add the variants as shown here, you can see that it's quite impossible.
I'm not sure why you think it's important to be able to eliminate the featureType attribute. It's extremely useful metadata.
I have not clarified above which auxiliary coordinate variables are optional. You say x y t are mandatory but z is sometimes optional. I think the rules for this should be explicitly stated too, so the presence of coordinate variables can be checked.
This depends on the representation, and I intended to clarify them in each of the examples. If you see one that's unclear, can you point it out?
comment:65 in reply to: ↑ 62 Changed 11 years ago by caron
I'd like to suggest a small change to the ragged contiguous representation. It could be described generically like this for a 2D case:
  float A(s);
  float N(s,strlen); // or N(o,strlen)
  int s(s);
    s:count="o";
  float B(o);
  float D(o);
    D:coordinates="A B N";
where A and B are (examples of) numerical auxiliary coordinate variables, N is a string-valued auxiliary coordinate variable (as in CF 6) and D is a data variable. The coordinate variable s is what you have called row_size. That is, I am proposing making this a coordinate variable in the Unidata sense, instead of depending on the name row_size. To assign a "special" variable name would be un-CF-like, since nothing in CF depends on the names of variables (as it says in CF 2.3, third para).
The proposal is to use a standard name to identify the row_size variable:
"The row_size variable contains the number of observations for each station, and is identified by having a standard_name of "ragged_row_size"."
I don't depend on variable naming anywhere in this proposal.
To indicate that it has a special role, I propose above the new count attribute, which indicates the dimension whose entries s counts. (Actually my s(s) is not legally a coordinate variable, because its values are not monotonic. The presence of the count attribute can be used to exempt it from that requirement.) In your proposal, you have no formal link between D and s; presumably you depend on there being only one s-like dimension in the file. Making an explicit link could simplify the software which is decoding the data, and also means that the proposal could be generalised to have more than one s-like and o-like dimension, and more than one featureType.
Currently the observation dimension is identified by examining the coordinates. I like your idea of explicitly naming the observation dimension. In fact, naming the parent dimension would be helpful also. I will consider that.
comment:66 follow-up: ↓ 67 Changed 11 years ago by bnl
Replying to caron:
Replying to jonathan:
I'm not sure why you think it's important to be able to eliminate the featureType attribute. It's extremely useful metadata.
Extremely, and it's so totally pervasive folk don't know when they're depending on it ... e.g. "gridded" data carries all the baggage of a feature type ... one knows exactly what is meant in this context, and one can build software to handle it, but it's built on the knowledge of the relationships between the coordinate variables and the grid which is implicit in this contextual knowledge as to what "grid" means ... if one only meant gridded in the mathematical/algorithmic sense, then grids cover all these "observational" types too ...
We have found it necessary to use the feature type concept in much of our work at BADC, and explicitly by name to help software exactly as John proposes here.
comment:67 in reply to: ↑ 66 ; follow-up: ↓ 69 Changed 11 years ago by jonathan
Replying to bnl:
Replying to caron:
Replying to jonathan:
I'm not sure why you think it's important to be able to eliminate the featureType attribute. It's extremely useful metadata.
Extremely, and it's so totally pervasive folk don't know when they're depending on it ...
I didn't say I wanted to eliminate it. I said I could see it would be useful even though it is redundant (in the sense of duplicating information which is already present in other ways).
I said that I think the structures associated with each featureType should be written down, so they can be checked, and because it could be helpful to a potential user to see them tabulated. Personally, I find it much more helpful to be given a list of things like that, than to have to work them out from examples, which I find rather unsettling, in case I miss something or misunderstand something. No doubt this has to do with learning styles, and we all have different ways of doing it.
John's useful table, which probably I should have read before, is comfortingly similar to what I deduced from the examples. However, it also shows that some featureTypes have more than one structure associated with them. I don't think it's "impossible" to write these down - in fact, John's done it! - but I would say that the featureTypes which actually correspond to more than one structure should be separated into more than one distinct featureType.
There are two motivations for the proposal, then. One is to label featureTypes, and the other is to provide ragged representations for them, to save space and variables. OK. It would be good to be able to say what the different structures have in common that means they need the featureType description. Is it that they have zero or one independent continuously varying geophysical coordinate, and any other coordinates are dependent on this? Again, that is a structural description, and not easily comprehensible, I'm afraid.
Jonathan
comment:68 in reply to: ↑ 62 ; follow-up: ↓ 70 Changed 11 years ago by caron
Replying to jonathan:
The 3D case for the ragged contiguous representation is more complicated, of course. It seems that your example in 9.7.3 is actually a combination of the ragged contiguous and ragged indexed structures, one for each of the dimensions. Would it not be better to use one or the other purely? I find it rather mind-bending to be honest. I do think the ragged indexed representation has the advantage in terms of clarity.
I propose that the ragged indexed representation should look generically like this in 2D:
  float A(s);
  float N(s,strlen); // or N(o,strlen)
  int o(o);
    o:expand="s";
  float B(o);
  float D(o);
    D:coordinates="A B N";
where again I am using a (Unidata) coordinate variable o(o) instead of a variable with a special name - you say parent_index in the description. As I said before, this structure is just like the compression by gathering described in CF 8.2, except I have called the attribute expand instead of compress because it doesn't have the same role. Whereas a compressed index never repeats values, a ragged one usually does. (Hence o(o) is not a legal coordinate variable, because it's not monotonic; again, expand can exempt it from that requirement.) The expand attribute indicates that the o coordinate specifies which s coordinate applies to each entry along the o dimension. This is easier than the contiguous representation to extend to 3D, and in fact CF 8.2 is mainly intended for the case where more than one dimension is compressed. You just have o:expand="s p", for instance, which means that o is indices into the combination of the (s,p) dimensions, collapsed into 1D.
The "compression by gathering" feature is for sparse rectangular arrays. This is for ragged arrays, and the meaning of the values needed is different. I don't think we can use expand = "s p", if I understand it, since that would require a rectangular array.
I originally tried to allow any combination of representations, but it proved too difficult, so I decided to choose one. The indexed representation would require that the reader read the entire file to extract a single profile. So I made the assumption that the writer would have all the obs for a profile, and so could write them contiguously. Choosing the indexed representation to associate profiles with sections was fairly arbitrary, but I chose it because it's more general, and allows profiles for multiple sections to be written in any order. Reading section_index(profile) is pretty fast, since profile probably won't be the unlimited dimension. However, one does have to choose a profile dimension size ahead of time, but that's a generic weakness in netcdf-3 (only one unlimited dimension).
I do acknowledge the complexity here; it's one of my main concerns. However, in my experience many data writers are happy to be given a template to use, as long as it meets their needs, so I'm hoping their minds won't be as boggled as ours.
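The read-side trade-off between the two layouts can be sketched in Python. This is only an illustration under stated assumptions: plain lists stand in for netCDF variables, and the names row_size, parent_index, read_contiguous and read_indexed are hypothetical, not part of the proposal text.

```python
# Hypothetical data: three profiles holding 2, 3 and 1 observations.

# Contiguous ragged layout: observations of each profile are adjacent,
# and row_size holds one count per profile.
obs_contiguous = [10.0, 11.0, 20.0, 21.0, 22.0, 30.0]
row_size = [2, 3, 1]

# Indexed ragged layout: observations may arrive in any order,
# and parent_index holds one profile index per observation.
obs_indexed = [10.0, 20.0, 11.0, 30.0, 21.0, 22.0]
parent_index = [0, 1, 0, 2, 1, 1]


def read_contiguous(values, row_size, profile):
    """Return one profile by slicing; only the counts are scanned."""
    start = sum(row_size[:profile])
    return values[start:start + row_size[profile]]


def read_indexed(values, parent_index, profile):
    """Return one profile by scanning every observation's parent index."""
    return [v for v, p in zip(values, parent_index) if p == profile]


profile_1_a = read_contiguous(obs_contiguous, row_size, 1)
profile_1_b = read_indexed(obs_indexed, parent_index, 1)
```

Both calls return the same three observations for profile 1, but the indexed read has to visit every element of the observation dimension, which is the cost referred to above when that dimension is the unlimited one.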
comment:69 in reply to: ↑ 67 Changed 11 years ago by caron
Replying to jonathan:
Replying to bnl:
Replying to caron:
Replying to jonathan:
I'm not sure why you think it's important to be able to eliminate the featureType attribute. It's extremely useful metadata.
Extremely, and it's so totally pervasive folk don't know when they're depending on it ...
I didn't say I wanted to eliminate it. I said I could see it would be useful even though it is redundant (in the sense of duplicating information which is already present in other ways).
I said that I think the structures associated with each featureType should be written down, so they can be checked, and because it could be helpful to a potential user to see them tabulated. Personally, I find it much more helpful to be given a list of things like that, than to have to work them out from examples, which I find rather unsettling, in case I miss something or misunderstand something. No doubt this has to do with learning styles, and we all have different ways of doing it.
John's useful table, which probably I should have read before, is comfortingly similar to what I deduced from the examples. However, it also shows that some featureTypes have more than one structure associated with them. I don't think it's "impossible" to write these down - in fact, John's done it! - but I would say that the featureTypes which actually correspond to more than one structure should be separated into more than one distinct featureType.
There are two motivations for the proposal, then. One is to label featureTypes, and the other is to provide ragged representations for them, to save space and variables. OK. It would be good to be able to say what the different structures have in common that means they need the featureType description. Is it that they have zero or one independent continuously varying geophysical coordinate, and any other coordinates are dependent on this? Again, that is a structural description, and not easily comprehensible, I'm afraid.
Jonathan
One of the ways that I think of this (I'd love to hear from others) is that the canonical representation of this kind of data is:
  float data(obs);
  float lat(obs);
  float lon(obs);
  float alt(obs);
  float time(obs);
as opposed to gridded data:
  float data(t,z,y,x);
  float t(t);
  float z(z);
  float y(y);
  float x(x);
The first says that the data is on a one-dimensional manifold (subset) embedded in 4-dimensional space/time; the second says that the data is a four-dimensional manifold embedded in 4-dimensional space/time.
OTOH, looking at dimensions like this isn't generally that clear, since dimensions in netcdf are used for data layout as well, so it's pretty much impossible to examine the data structures in a file and figure out what the meaning is.
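A small Python sketch of the difference between the two layouts, counting how many coordinate values each one stores. The grid shape and the function names are made up for illustration.

```python
from math import prod


def point_coordinate_values(n_obs, n_coords=4):
    """Point layout: every observation carries its own lat, lon, alt, time."""
    return n_obs * n_coords


def grid_coordinate_values(shape):
    """Grid layout: one 1-D coordinate variable per dimension (t, z, y, x)."""
    return sum(shape)


shape = (2, 3, 4, 5)   # hypothetical sizes of t, z, y, x
n_data = prod(shape)   # number of data values on the grid

grid_coords = grid_coordinate_values(shape)
point_coords = point_coordinate_values(n_data)
```

With this shape the grid holds 120 data values described by 14 coordinate values, while the same 120 values stored as points would need 480. The grid can do this only because its coordinates are independent; in the point layout all four coordinates vary together along the single obs dimension.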
comment:70 in reply to: ↑ 68 Changed 11 years ago by jonathan
Dear John
Replying to caron:
The "compression by gathering" feature is for sparse rectangular arrays. This is for ragged arrays, and the meaning of the values needed is different. I don't think we can use expand = "s p", if I understand it, since that would require a rectangular array.
(s,p) is an implied rectangular array (in CDL dimension order); they are independent dimensions in the multidimensional representation. For instance, in 9.7.1 you have s=22 p=33. Collapsing (s,p) into 1D (as in compression by gathering), index 0 means s-index 0 p-index 0; index 1 means s-index 0 p-index 1, ..., index 33 means s-index 1 p-index 0, ... If the o-variable with expand="s p" includes 33,33,33,33 that indicates station 1 profile 0 has four observations. This is like compression by gathering, except the indices are repeated, so the data of dimension o is mapped onto a smaller array (s,p), with several elements of (o) being assigned to each element of (s,p). In gathering, there are fewer elements in (o) than in (s,p), which is only sparsely populated by the mapping. I think the same attribute compress could be used formally but I suggested a different one because (a) the routine to do the mapping would be quite different (b) in compression, o can be strictly monotonic, whereas it generally will not be in expansion because of the repeated elements.
Bottom line: the ragged indexed representation looks formally similar to compression by gathering, which CF supports already, so it would be logical and parsimonious to use the existing kind of machinery.
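The index arithmetic in that example can be checked with a short Python sketch. It assumes the s=22, p=33 sizes quoted from 9.7.1 and row-major (CDL) dimension order; the function name decode and the sample values of o are hypothetical, and expand is the attribute suggested in this comment, not an existing CF attribute.

```python
from collections import Counter

N_S, N_P = 22, 33  # sizes of the s and p dimensions, as quoted from 9.7.1


def decode(index, n_p=N_P):
    """Split a collapsed (s,p) index into (s-index, p-index), row-major."""
    return divmod(index, n_p)


# Hypothetical values of the o variable carrying expand="s p".
# Repeated values mean several observations belong to the same (s,p) cell.
o = [0, 0, 33, 33, 33, 33, 34]

# Number of observations assigned to each (s-index, p-index) pair.
counts = Counter(decode(i) for i in o)
```

Here decode(33) gives (1, 0), and counts[(1, 0)] is 4, matching the statement that station 1 profile 0 has four observations.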
Jonathan
comment:71 follow-up: ↓ 73 Changed 11 years ago by ngalbraith
Here is a question about the scope of this ticket. Is the main goal to define the structure of NetCDF files for in situ and other observational data using feature types, or is there a broader aim to provide mechanisms (and vocabularies) to help make CF-NetCDF meet the needs of observational oceanography?
I'm thinking about ways to describe instruments and non-geophysical "instrument variables" within NetCDF and CF, and I'm looking for some kind of consensus (if that's possible; I'll settle for a conversation and/or community guidelines) about what really needs to be standardized and, therefore, potentially, interoperable.
- Do we want or need to have standard vocabularies for all variables we carry along with a dataset, or just for geophysical quantities?
- What is the best way to encode information on sensors, such as
- sample schemes (beyond what's available via cell methods)
- complete precision/accuracy
- important sensor deployment info (mount, orientation)
- relevant set-up information
I think a lot of people are inserting this info into NetCDF, but relegating it to the status of a comment, by using free-field names for these concepts and processes. Conversely, the OceanSITES project is planning to put this kind of information in separate metadata files (SensorML or other); but it seems like we should be able to include it in the NetCDF, since that is supposed to be self-describing.
I'm using ancillary variables to encode instrument information, such as: sensor_vendor, sensor_name, sensor_mount, sensor_orientation, sensor_depth (e.g. for adcp), sensor_sampling_period, sensor_sampling_frequency, sensor_reporting_time, sensor_serial_number, and providing setup commands verbatim where they're needed to document the data (e.g. for ADCPs).
My question is whether the people who worked on the point obs specification are also working on other means of making NetCDF and CF more useful for in situ data. Or, is there another group working on this? Any resources you know of where guidelines for including instrument metadata have been developed?
Thanks for any ideas on this - Nan
comment:72 Changed 11 years ago by stevehankin
Nan,
It is an important subtlety that you've raised: defining a general structure for in situ data versus meeting the needs of operational oceanography. I'd argue that the role of CF, per se, should be limited to the former -- to define the structures needed to carry in situ data collections. It would then be the task of the oceanography community to define a more detailed profile that builds upon CF and requires the specific metadata elements that operational oceanographers need. (In the satellite community the GHRSST data standards documents (the 'GDS') provide an excellent example of this approach. Ocean-Sites is doing this, too, isn't it?)
The boundary between these two goals, however, is a gray area. It would be a positive contribution to CF if we can add those attributes (metadata content) into it that are generally applicable to many types of in situ observations. I'd argue, though, that those discussions should take place in a *separate* trac ticket from this one. This ticket already has a very full plate in dealing with the general in situ data structure issues.
Care to start a new trac ticket?
comment:73 in reply to: ↑ 71 Changed 11 years ago by caron
Hi Nan:
As Steve suggests, this would be a good new topic, as we are not trying to deal with this kind of "domain metadata" in this ticket.
John
comment:74 Changed 10 years ago by caron
We have a new draft of this proposal ready for your comments. It's available at:
http://www.unidata.ucar.edu/staff/caron/public/CFch9-feb25_jg.pdf
http://www.unidata.ucar.edu/staff/caron/public/CFch9-feb25_jg.docx
I've attached the docx file to this Trac page, but the pdf is too big (450K; can we increase the limit on the Trac site?).
Thanks for your patience with this process,
John Caron, Jonathan Gregory, Steve Hankin
comment:75 Changed 10 years ago by painter1
- Description modified (diff)
Changed 10 years ago by painter1
comment:76 Changed 10 years ago by painter1
- Description modified (diff)
comment:77 Changed 10 years ago by stevehankin
- Owner changed from cf-conventions@… to stevehankin
- Status changed from new to assigned
The comments that have been received on this ticket have been discussed and incorporated into the latest Word version of this chapter. (Final editorial changes to be posted on-line shortly.) Within a few weeks the chapter will be posted in an HTML and PDF format that is consistent with the main body of CF.
As there haven't been any further objections or comments for more than three weeks, in my capacity as the moderator of this ticket, I declare this chapter -- Chapter 9, Discrete Sampling Geometries -- to be accepted according to the rules of the CF standards process.
Steve Hankin
comment:78 Changed 10 years ago by stevehankin
- Resolution set to fixed
- Status changed from assigned to closed
Dear John
As you say, the main technical addition which is needed is a way to represent ragged arrays, for instance so that timeseries of different lengths from various stations can be contained in one data variable. You propose two ways to do this.
The second method is more general. It's quite easy to use: it's like a "select" with "where" in SQL. It takes more space because the index array has the same dimension as the data array. However that is no more than a doubling of space, and so it's still likely to be more economical than not using a ragged array, if there is a large spread in the length of the timeseries. So could we just adopt the second method, for the sake of simplicity?
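The space argument can be made concrete with a little arithmetic in Python. This sketch assumes the index array uses the same storage width as the data values, and the timeseries lengths are invented for illustration.

```python
def rectangular_size(lengths):
    """Values stored when every timeseries is padded to the longest one."""
    return len(lengths) * max(lengths)


def ragged_indexed_size(lengths):
    """Values stored as one data array plus an index array of equal length."""
    return 2 * sum(lengths)


# Large spread in length: the ragged indexed form is smaller.
spread = [10, 1000, 5]
rect_spread = rectangular_size(spread)
ragged_spread = ragged_indexed_size(spread)

# No spread at all: the ragged indexed form is exactly twice as large.
uniform = [100, 100, 100]
rect_uniform = rectangular_size(uniform)
ragged_uniform = ragged_indexed_size(uniform)
```

For the first case the rectangular array needs 3000 values against 2030 for the ragged indexed form; for the second it is 300 against 600, which is the worst case of doubling mentioned above.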
You also propose a PointFeature attribute. What is the need for this? The cases are formally similar. Why do they need to be distinguished, other than by the metadata which describes the coordinate and data variables, from which one can deduce the intended purpose (as station, trajectory, etc.)?
Best wishes
Jonathan