dataless mth5 from fdsn #151

Closed
kkappler opened this issue Jul 8, 2023 · 16 comments · Fixed by #142

@kkappler
Collaborator

kkappler commented Jul 8, 2023

As part of widescale testing on earthscope, one exercise undertaken was to try to create mth5s from the fdsn StationXML served by IRIS/Earthscope.

The iteration was ultimately sourced by stations scraped from SPUD transfer functions.
Station and network codes were extracted directly from the "data" XML files (files accessed from URLs of the form https://ds.iris.edu/spudservice/data/) in every case where grepping for the string "mda" returned a match.

Moreover, any remote stations listed in said files were also added to the iterator.

In all, on a first pass, 2007 independent cases were identified.
When a remote-reference station was accessed, it was assumed to be in the same network as the "primary" station (the one associated with the transfer function), because the TF XML does not associate a network code with remote stations.

The results of the first-pass attempt to build these 2007 mth5 files (with metadata only) were recorded in a dataframe and are linked to this ticket in the following table. 02_local_metadata_coverage.csv

In summary, 1631 h5 files were built successfully and 376 exceptions were encountered.

The exception type value counts are as follows:

IndexError 198
AttributeError 102
XMLSyntaxError 59
ValueError 16
TypeError 1
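
For reference, a tally like the one above can be produced from the results table in a few lines. This is a minimal sketch using only the standard library; the `exception` column name matches the dataframe snippet further down in this comment, the network/station pairs are the sample cases cited in this ticket, and the "XX"/"DEMO1" success row is made up for illustration.

```python
from collections import Counter

def count_exceptions(rows):
    """Tally exception types, skipping rows that built without error."""
    return Counter(row["exception"] for row in rows if row.get("exception"))

# Illustrative rows only; the real table has 2007 of them.
rows = [
    {"network": "EM", "station": "FL001", "exception": "IndexError"},
    {"network": "8P", "station": "CAV09", "exception": "AttributeError"},
    {"network": "EM", "station": "NEN28", "exception": "XMLSyntaxError"},
    {"network": "EM", "station": "CON25", "exception": "ValueError"},
    {"network": "8P", "station": "REU09", "exception": "TypeError"},
    {"network": "XX", "station": "DEMO1", "exception": ""},  # hypothetical success
]
for name, n in count_exceptions(rows).most_common():
    print(name, n)
```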

IndexError cases likely indicate that data were not found. This could be for a few reasons, such as:

  1. The remote reference station was from Intermagnet, not earthscope (there is at least one such case, where FRD was used as RR)
  2. The data are not yet archived at Earthscope
  3. Some other condition

Here is a sample URL to the response level metadata for Network=EM, station=FL001, which has the IndexError:

https://service.iris.edu/fdsnws/station/1/query?net=EM&sta=FL001&level=response&format=xml&includecomments=true&includeavailability=true&nodata=404

There were 78 unique AttributeError messages, all associated with the "Person" object. A few are listed below;
the full set can be seen by reading in the csv and calling unique on the AttributeError rows, i.e.:

import pandas as pd

df = pd.read_csv("02_local_metadata_coverage.csv")  # the csv linked above
attribute_error_df = df[df.exception == "AttributeError"]
print(attribute_error_df.error_message.unique())

"'Person' object has no attribute 'Ey polarity was reversed for all runs, as verified by visually remote referencing the time series against neighboring sites'",
"'Person' object has no attribute 'S 48m followed by 48m E&W'",
"'Person' object has no attribute 'flat forest'",
...
"'Person' object has no attribute 'woody leafy ground cover'",
...
"'Person' object has no attribute 'swapped out E-W electrodes'",
"'Person' object has no attribute 'no guests welcome'",
"'Person' object has no attribute 'site located in a watershed on a slope'"

Here is a sample URL to the response level metadata for Network=8P, station=CAV09, which has the AttributeError:

https://service.iris.edu/fdsnws/station/1/query?net=8P&sta=CAV09&level=response&format=xml&includecomments=true&includeavailability=true&nodata=404

The XMLSyntaxError cases had only a single unique value:
"Start tag expected, '<' not found, line 1, column 1"

Here is a sample URL to the response level metadata for Network=EM, station=NEN28, which has the XMLSyntaxError:
https://service.iris.edu/fdsnws/station/1/query?net=EM&sta=NEN28&level=response&format=xml&includecomments=true&includeavailability=true&nodata=404

The ValueError cases were all of the form:

'value 361.4 out of bounds (0, 360)',
...
'value 361.5 out of bounds (0, 360)',
'value 360.7 out of bounds (0, 360)',
'value 360.2 out of bounds (0, 360)',

Here is a sample URL to the response level metadata for Network=EM, station=CON25, which has the ValueError:
https://service.iris.edu/fdsnws/station/1/query?net=EM&sta=CON25&level=response&format=xml&includecomments=true&includeavailability=true&nodata=404

And the lone TypeError was:

'Index must be a string or integer value.'

Here is the URL to the response level metadata for Network=8P, station=REU09, which had the TypeError:

https://service.iris.edu/fdsnws/station/1/query?net=8P&sta=REU09&level=response&format=xml&includecomments=true&includeavailability=true&nodata=404

@kkappler
Collaborator Author

kkappler commented Jul 15, 2023

@kujaku11 After re-running with the latest updates here is the summary:

Identified 6 exception types
['IndexError', 'ValueError', 'XMLSyntaxError', 'AttributeError', 'TypeError', 'MTValidatorError']

  • IndexError: 198 instances; Likely not Earthscope stations, need to verify though
  • 3 of these stations: ZYX, T, Z come from emtf_id=20619435, data_id=21057339, which corresponds to ZU, ARU42, in the XML we have:
<Filename>ARU42e_ZYX_T_ARY42_ARV40_ARW40_ARW39_ARU42-ARU42e_ARY42_ARV40_ARW40_ARW39_ARU42-ARU42e_ARV40_ARU42-ARU42cd_ZYZ_T41-ARU42cd_ARU42_TNV44-ARU42cd_T41.png</Filename>

which are getting parsed as bogus station names

  • 56 of these are 3-char names, which look like they have the state-code omitted
  • 139 are 5-char codes that are likely not (yet?) archived at Earthscope
  • 1 is a 4 char code NMX2, maybe should be NMX02 ? NMX20?, it comes from 8P_RET11 which associates within the XML a multi-station processing filename:

RET11a_RET11_AZV19_TXD27_TXD25_AZU19-RET11a_U19-RET11a_RET11_AZV19_TXD27_TXD25_AZU19-NVT11a_COS21_RER14_NMX21_NVT11-NVT11a_RER14_NVT11.png

and processing Tag: RET11a_RET11_AZV19_TXD27_TXD25_AZU19-RET11a_U19-RET11a_RET11_AZV19_TXD27_TXD25_AZU19-NVT11a_COS21_RER14_NMX2 so it is probably supposed to be NMX21
  • ValueError: 12 instances; These are all of the form 'value 361.4 out of bounds (0, 360)'. This is an obspy error. Options are to file an issue or a pull request on obspy, or to correct the data so that azimuths are modulo 360.
  • XMLSyntaxError has 25 instances, with 1 unique message: "Start tag expected, '<' not found, line 1, column 1"
    Sample URL for Network: EM Station NEN28
  • AttributeError: 139 instances; 119 unique messages, all of the form "'Person' object has no attribute 'Ey polarity was reversed for all runs, as verified by visually remote referencing the time series against neighboring sites'", or some other seemingly random block of text flagged as not being a Person attribute. Sample URL for response level metadata for Network=8P, station=CAV09
  • TypeError: 1 instance; ['Index must be a string or integer value.']. Sample URL to the response level metadata for Network=8P, station=REU09, which had the TypeError
  • MTValidatorError: 1 instance, ['Attribute name cannot start with a number, 90 to 180,270 -- jade timing shifted -2310s to 10-apr-2022 23']. Sample URL for Network ZU, Station TNV46
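
The length-based triage of the IndexError station codes in the list above could be scripted along these lines. A minimal sketch: `group_by_length` is an illustrative helper, not part of the test harness, and the sample list only contains a few of the codes mentioned above.

```python
from collections import defaultdict

def group_by_length(codes):
    """Map code length -> sorted list of station codes of that length."""
    groups = defaultdict(list)
    for code in codes:
        groups[len(code)].append(code)
    return {n: sorted(v) for n, v in sorted(groups.items())}

# A handful of codes from the discussion above, not the full set of 198.
sample = ["ZYX", "T", "Z", "NMX2", "ARU42"]
for length, codes in group_by_length(sample).items():
    print(length, codes)
```

Codes shorter than five characters are the ones most likely to come from mis-parsed filenames or truncated tags, so they are worth reviewing first.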

Below is pasted a condensed output of error messages.

IndexError has 198 instances, with 1 unique message(s)
['list index out of range']
ValueError has 12 instances, with 11 unique message(s)
['value 361.4 out of bounds (0, 360)' 'value 360.9 out of bounds (0, 360)'
... 'value 366.7 out of bounds (0, 360)']
XMLSyntaxError has 25 instances, with 1 unique message(s)
["Start tag expected, '<' not found, line 1, column 1"]
AttributeError has 139 instances, with 119 unique message(s)
["'Person' object has no attribute 'flat forest'"
"'Person' object has no attribute '48m S, 47'"
...
"'Person' object has no attribute 'west of a 2 track'"]
TypeError has 1 instances, with 1 unique message(s)
['Index must be a string or integer value.']
MTValidatorError has 1 instances, with 1 unique message(s)
['Attribute name cannot start with a number, 90 to 180,270 -- jade timing shifted -2310s to 10-apr-2022 23']
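
If correcting the metadata is the chosen route for the ValueError cases, wrapping azimuths into range is short. A sketch, assuming the out-of-range values are genuine wraparound (e.g. 361.4 meaning 1.4) rather than data-entry errors; `wrap_azimuth` is a hypothetical helper, not an obspy or mt_metadata function.

```python
def wrap_azimuth(value):
    """Wrap an azimuth in degrees into the half-open range [0, 360)."""
    return float(value) % 360.0

# Values taken from the error messages above, plus two sanity checks.
for raw in (361.4, 360.2, 370.1, -10.0, 45.0):
    print(raw, "->", round(wrap_azimuth(raw), 1))
```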

@kkappler
Collaborator Author

Re-ran this test on 20230720, using wildcards ["F", "Q"] as channel selection. Results were:

{'FDSNNoDataException': 198, 'ValueError': 18, 'IndexError': 143, 'AttributeError': 158, 'TypeError': 1, 'MTValidatorError': 1}

TOTAL #Exceptions 519 of 2007 Cases

However, only two channels are getting saved to the h5, probably something to do with the length of the request df when wildcards are used. I'll look into this.
02_exceptions_summary_20230720.txt
02_local_metadata_coverage.csv

@kkappler
Collaborator Author

kkappler commented Jul 22, 2023

When wildcards are not used, and explicit channel lists are passed the stats improve slightly:

There were 2007 unique network-station pairs in 2007 rows

{'IndexError': 227, 'ValueError': 18, 'AttributeError': 158, 'TypeError': 1, 'FDSNTimeoutException': 1, 'MTValidatorError': 1}
TOTAL #Exceptions 406 of 2007 Cases

02_exceptions_summary_2023-07-22_082438.txt
02_local_metadata_coverage.csv
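
To make the change between the wildcard run and the explicit-channel run easier to see, the two summary dicts can be diffed. A small sketch using the counts quoted in the two comments above; `diff_runs` is an illustrative helper name.

```python
wildcard_run = {"FDSNNoDataException": 198, "ValueError": 18, "IndexError": 143,
                "AttributeError": 158, "TypeError": 1, "MTValidatorError": 1}
explicit_run = {"IndexError": 227, "ValueError": 18, "AttributeError": 158,
                "TypeError": 1, "FDSNTimeoutException": 1, "MTValidatorError": 1}

def diff_runs(before, after):
    """Return {exception: after - before} for every type whose count changed."""
    names = sorted(set(before) | set(after))
    delta = {n: after.get(n, 0) - before.get(n, 0) for n in names}
    return {n: d for n, d in delta.items() if d != 0}

print(sum(wildcard_run.values()), sum(explicit_run.values()))
print(diff_runs(wildcard_run, explicit_run))
```

The totals reproduce the 519 and 406 reported above, and the diff shows that the only movement is between FDSNNoDataException, IndexError and the single timeout.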

@kkappler
Collaborator Author

kkappler commented Aug 5, 2023

@kujaku11 Ignoring wildcards for now, there is a more fundamental problem, which is that the filters in many of the mth5 are not correct. I added two more columns to the widescale dataless mth5 building summary dataframe.
One column is called num_filterless_channels, which is correctly zero in the pasted sample below, and the other column is a string representation of a dictionary, keyed by "component", with value "num_filters" for each row of the dataframe.

Importantly, the pasted example, for network EM and station ORF08, shows that ex is associated with 2 filters but ey is associated with only one filter. This is not correct: the source xml indicates 2 stages for both MQE and MQN, so the xml is not being parsed correctly.

Also, in many of the files (including CAS04, which processes in agreement with SPUD TFs) there are 6 stages for E-fields and 3 for magnetics. So there is also a question about why the filters are stored in different ways, but the first and foremost problem is to sort out why the second stage of ey is sometimes being ignored.

*CAS04 looks like: {'ex': 6, 'ey': 6, 'hx': 3, 'hy': 3, 'hz': 3}

num_filterless_channels num_filter_details
0 {'ex': 2, 'ey': 1, 'hx': 1, 'hy': 1, 'hz': 1}

02_local_metadata_coverage.csv
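
Since num_filter_details is stored as the string form of a dict, rows like the ORF08 one can be flagged automatically. A minimal sketch, assuming the column always holds a Python-literal dict keyed by component; `electric_mismatch` is a hypothetical helper name.

```python
import ast

def electric_mismatch(num_filter_details):
    """Return True when ex and ey report different filter counts."""
    counts = ast.literal_eval(num_filter_details)
    return counts.get("ex") != counts.get("ey")

orf08 = "{'ex': 2, 'ey': 1, 'hx': 1, 'hy': 1, 'hz': 1}"
cas04 = "{'ex': 6, 'ey': 6, 'hx': 3, 'hy': 3, 'hz': 3}"
print(electric_mismatch(orf08), electric_mismatch(cas04))
```

`ast.literal_eval` is used rather than `eval` so that only literal structures are accepted from the csv.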

@kkappler
Collaborator Author

kkappler commented Aug 6, 2023

The problem with not getting all the filters entered was in:

def _xml_response_to_mt(self, xml_channel, existing_filters={}):

This function parses the filters from obspy into mt filters

The method _add_filter_number() was issuing redundant numbers when encountering multiple new, nameless filters.
I tried to fix this and created a branch (fix_issue_151)

Also, note that if the filter has_a name, we do not assess whether or not it is new, ... is this OK??
I think so, because if it has_a name, it will be registered in ch_filter_dict which is returned by _xml_response_to_mt and then merged into the survey before the next channel is processed.
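
To illustrate the kind of collision described above, here is one way unique names could be handed out to several nameless filters in a row. This is only a sketch of the idea, not the code in `_add_filter_number()` or on the fix_issue_151 branch; the function name, the `filter_<n>` naming scheme and the sample name "dipole_92.000" are all assumptions made for the example.

```python
def name_new_filters(existing_names, new_names, prefix="filter"):
    """Give each nameless filter a unique '<prefix>_<n>' name."""
    used = set(existing_names)
    counter = 0
    assigned = []
    for name in new_names:
        if not name:
            # Advance until the number is free, counting names handed out
            # earlier in this same loop as taken.
            while f"{prefix}_{counter}" in used:
                counter += 1
            name = f"{prefix}_{counter}"
        used.add(name)
        assigned.append(name)
    return assigned

print(name_new_filters(["filter_0"], ["", "", "dipole_92.000"]))
```

The key point is that names issued within the current channel are added to the "used" set immediately, so two nameless stages in the same response cannot receive the same number.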

kkappler added a commit that referenced this issue Aug 6, 2023

Relates to issue #151
@kujaku11
Owner

kujaku11 commented Aug 18, 2023

@kkappler Attribute Error should be fixed now, it had to do with parsing the comments.

@kujaku11
Owner

@kkappler TypeError fixed, the run id was None, added logic to skip.

@kujaku11
Owner

@kkappler The IndexError seems to be coming from the fact that the XMLs pulled from IRIS are transfer functions and not stationxmls. Can you verify that is correct? I'm basing this on the links you pasted for the IndexError examples, which return a transfer function xml.

@kujaku11
Owner

@kkappler As for the filters, this is a MakeMTH5 issue: the stationxml is agnostic to the runs, so when the data contain multiple runs that are not in the StationXML, the channel metadata are currently not populated. Thus we need to fix this in MakeMTH5.

@kkappler
Collaborator Author

kkappler commented Aug 20, 2023

Re-ran the data-less wide-scale build on Sat 19 Aug, 2023. 5 error types. I think 2 are moot and will confirm, and I will debug another this week, leaving 2 errors I don't understand.

===============================================

*** EXCEPTIONS SUMMARY ***

Identified 5 exception types
['XMLSyntaxError', 'IndexError', 'ValueError', 'NotImplementedError', 'TypeError']

5 instances of XMLSyntaxError, with 1 unique error(s)
["Start tag expected, '<' not found, line 1, column 1"]

EM,IDK11,,,0,0,-1,,XMLSyntaxError,"Start tag expected, '<' not found, line 1, column 1",14864917,14864916,14864917_EM_IDH12.xml
example link

However, rerunning the code the next day, EM_IDK11 seems to have built fine. Weird. Makes me wonder if this is some error during transmission of data, and sometimes a malformed xml is received? I will start tracking which Network-Station pairs give the error, to see if there is a pattern.
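
One way to do that tracking is to keep the set of failing network-station pairs from each run and look for pairs that fail in some runs but not in all of them. A minimal sketch; the helper name and the two example sets are made up for illustration (only the EM/IDK11 behaviour is taken from the observation above).

```python
def intermittent_failures(runs):
    """Return pairs that failed in at least one run but not in every run."""
    failed_any = set().union(*runs)
    failed_all = set.intersection(*runs) if runs else set()
    return failed_any - failed_all

# Hypothetical per-run sets of (network, station) pairs with XMLSyntaxError.
run_1 = {("EM", "IDK11"), ("EM", "NEN28")}
run_2 = {("EM", "NEN28")}
print(sorted(intermittent_failures([run_1, run_2])))
```

Pairs that show up in this set are candidates for a transmission problem, while pairs that fail in every run more likely point at the served content itself.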

  • 196 instances of IndexError, with 1 unique error(s)
    ['list index out of range']
    At least most of these were because these are not archived at IRIS (404 Not Found)
    Still analysing some details...

  • 18 instances of ValueError, with 17 unique error(s)

['value 361.4 out of bounds (0, 360)' 'value 362.6 out of bounds (0, 360)'
'value 360.9 out of bounds (0, 360)' 'value 360.8 out of bounds (0, 360)'
'value 361.8 out of bounds (0, 360)' 'value 363.4 out of bounds (0, 360)'
'value 362.0 out of bounds (0, 360)' 'value 361.5 out of bounds (0, 360)'
'value 360.7 out of bounds (0, 360)' 'value 361.1 out of bounds (0, 360)'
'value 360.2 out of bounds (0, 360)' 'value 370.1 out of bounds (0, 360)'
'value 369.6 out of bounds (0, 360)' 'value 361.0 out of bounds (0, 360)'
'value 367.3 out of bounds (0, 360)' 'value 366.8 out of bounds (0, 360)'
'value 366.7 out of bounds (0, 360)']
This is an obspy + metadata issue, either fix the metadata or file PR on obspy to add modulo360 to their azimuth handling

377 instances of NotImplementedError, with 2 unique error(s)
['Expected a unique network associated with 1970-01-01 00:00:00--2023-08-19 00:00:00Instead found 9 networks'
'Expected a unique network associated with 1970-01-01 00:00:00--2023-08-19 00:00:00Instead found 4 networks']
These are new -- @kkappler will debug issue

1 instances of TypeError, with 1 unique error(s)
Index must be a string or integer value.

There were 2005 unique network-station pairs in 2005 rows

{'XMLSyntaxError': 5, 'IndexError': 196, 'ValueError': 18, 'NotImplementedError': 377, 'TypeError': 1}
TOTAL #Exceptions 597 of 2005 Cases
Total scraping & review time 1736.10s using 1 partitions

02_exceptions_summary_2023-08-19_171631.txt (https://github.com/kujaku11/mt_metadata/files/12386965/02_exceptions_summary_2023-08-19_171631.txt)
02_local_metadata_coverage.csv

@kkappler
Collaborator Author

kkappler commented Aug 20, 2023

Rerunning with latest updates yields:

{'XMLSyntaxError': 6, 'FDSNNoDataException': 110, 'ValueError': 18, 'NotImplementedError': 460, 'TypeError': 1}
TOTAL #Exceptions 595 of 2005 Cases

  • 1 instances of TypeError, with 1 unique error(s) Index must be a string or integer value.

@kujaku11 can you try parsing the xml for 8P, REU09? (example link)
02_exceptions_summary_2023-08-20_112742.txt
02_local_metadata_coverage.csv

@kkappler
Collaborator Author

kkappler commented Aug 20, 2023

The NotImplementedError cases are happening because a unique network obspy object cannot be determined from the date range.
This is because I am just using 1970-present as the date range. Network ZU has 9 instances in that interval. The best solution for this is to specify the data time interval, because then a unique network should be returned. The only way I can think of to do that would be to consult IRIS data availability.
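
The effect of narrowing the date range can be shown with a toy example. This is a sketch with made-up epochs for a reused network code and plain dicts instead of obspy objects; the real epochs would come from the inventory and the real interval from the data availability tables.

```python
from datetime import datetime

def overlapping(epochs, start, end):
    """Return the epochs whose [start, end) span intersects the query span."""
    return [e for e in epochs if e["start"] < end and start < e["end"]]

# Made-up epochs for a reused network code; not the actual ZU deployments.
epochs = [
    {"code": "ZU", "start": datetime(2010, 1, 1), "end": datetime(2012, 1, 1)},
    {"code": "ZU", "start": datetime(2015, 1, 1), "end": datetime(2017, 1, 1)},
    {"code": "ZU", "start": datetime(2020, 1, 1), "end": datetime(2024, 1, 1)},
]
wide = overlapping(epochs, datetime(1970, 1, 1), datetime(2023, 8, 19))
narrow = overlapping(epochs, datetime(2022, 4, 1), datetime(2022, 5, 1))
print(len(wide), len(narrow))
```

With the 1970-to-present window every epoch matches, whereas a window taken from the actual data interval selects a single network.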

To handle this, I have created a custom error in the earthscope test_utils called DataAvailabilityError, which is triggered when Laura's data availability tables show no available data. There were 196 (out of 2005) instances of this error. All 196 were confirmed to return 404 Not Found at the following url:
url = f"https://service.iris.edu/fdsnws/station/1/query?net={net}&sta={sta}&level=response&format=xml&includecomments=true&includeavailability=true&nodata=404"

This brings the current set of errors down to:
196 DataAvailabilityError
18 ValueError

Which are both understood. The only remaining were:
1 FDSNTimeoutException: EM, OHM52, this was re-run and built successfully
1 TypeError: 8P, REU09, this is a real error, it occurs in
mt_station = self.station_translator.xml_to_mt(xml_station)
1 IndexError: 8P, REX11. It looks like the issue is that LQN is in a separate Network from the other channels. This is odd, since it also built in the past, but I need to check whether it built correctly then.
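
For anyone wanting to repeat the 404 check above, it could be scripted roughly as follows with the standard library. A sketch only: `station_url` and `has_metadata` are hypothetical helpers (not part of the earthscope test_utils), the 60 s timeout is an arbitrary choice, and the `opener` argument exists so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

BASE = "https://service.iris.edu/fdsnws/station/1/query"

def station_url(net, sta):
    """Build the response-level StationXML query used in this ticket."""
    params = {"net": net, "sta": sta, "level": "response", "format": "xml",
              "includecomments": "true", "includeavailability": "true",
              "nodata": "404"}
    return f"{BASE}?{urlencode(params)}"

def has_metadata(net, sta, opener=urllib.request.urlopen):
    """Return True for HTTP 200 and False for HTTP 404; other HTTP errors propagate."""
    try:
        with opener(station_url(net, sta), timeout=60) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

print(station_url("EM", "FL001"))
```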

@kujaku11
Owner

@kkappler I was able to parse 8P.REU09, at least from the console, just by downloading the link you sent.

@kkappler
Collaborator Author

@kujaku11 I confirm 8P_REU09.h5 builds.

The only outstanding case is now 8P REX11.

@kujaku11
Owner

@kkappler Can you confirm that the 8P_REX11 stationxml can be read? I think the issue with building the H5 is that channel LQN only has a tag for run d. I don't know if this is a mistake in the metadata or whether LQN was only recorded for run d. This is what the metadata from IRIS says.

image

And here's the data availability:

#Network Station Location Channel Quality SampleRate Earliest Latest
8P REX11 -- LFE M 1.0 2019-07-21T17:00:28.000000Z 2019-07-21T17:34:45.000000Z
8P REX11 -- LFE M 1.0 2019-07-21T17:49:40.000000Z 2019-08-03T02:11:34.000000Z
8P REX11 -- LFE M 1.0 2019-08-03T02:20:10.000000Z 2019-08-03T14:58:30.000000Z
8P REX11 -- LFE M 1.0 2019-08-03T17:19:18.000000Z 2019-08-03T17:49:04.000000Z
8P REX11 -- LFE M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z
8P REX11 -- LFN M 1.0 2019-07-21T17:00:28.000000Z 2019-07-21T17:34:45.000000Z
8P REX11 -- LFN M 1.0 2019-07-21T17:49:40.000000Z 2019-08-03T02:11:34.000000Z
8P REX11 -- LFN M 1.0 2019-08-03T02:20:10.000000Z 2019-08-03T14:58:30.000000Z
8P REX11 -- LFN M 1.0 2019-08-03T17:19:18.000000Z 2019-08-03T17:49:04.000000Z
8P REX11 -- LFN M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z
8P REX11 -- LFZ M 1.0 2019-07-21T17:00:28.000000Z 2019-07-21T17:34:45.000000Z
8P REX11 -- LFZ M 1.0 2019-07-21T17:49:40.000000Z 2019-08-03T02:11:34.000000Z
8P REX11 -- LFZ M 1.0 2019-08-03T02:20:10.000000Z 2019-08-03T14:58:30.000000Z
8P REX11 -- LFZ M 1.0 2019-08-03T17:19:18.000000Z 2019-08-03T17:49:04.000000Z
8P REX11 -- LFZ M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z
8P REX11 -- LQE M 1.0 2019-07-21T17:00:28.000000Z 2019-07-21T17:34:45.000000Z
8P REX11 -- LQE M 1.0 2019-07-21T17:49:40.000000Z 2019-08-03T02:11:34.000000Z
8P REX11 -- LQE M 1.0 2019-08-03T02:20:10.000000Z 2019-08-03T14:58:30.000000Z
8P REX11 -- LQE M 1.0 2019-08-03T17:19:18.000000Z 2019-08-03T17:49:04.000000Z
8P REX11 -- LQE M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z
8P REX11 -- LQN M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z

So it looks like LQN was only recorded for run d. Not sure where the problem arises. I'll continue to investigate.
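
The per-channel segment counts can be read off the availability text programmatically. A minimal sketch that assumes whitespace-separated columns in the order shown in the header line above; the sample here is abridged to three rows of the table.

```python
from collections import Counter

def segments_per_channel(text):
    """Count availability rows per channel, ignoring '#' header lines."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        counts[fields[3]] += 1  # columns: network, station, location, channel, ...
    return counts

sample = """#Network Station Location Channel Quality SampleRate Earliest Latest
8P REX11 -- LQE M 1.0 2019-07-21T17:00:28.000000Z 2019-07-21T17:34:45.000000Z
8P REX11 -- LQE M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z
8P REX11 -- LQN M 1.0 2019-08-03T18:04:35.000000Z 2019-08-11T15:37:29.000000Z"""
print(segments_per_channel(sample))
```

Run against the full table, a channel with fewer segments than its neighbours (here LQN) stands out immediately.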

@kujaku11
Owner

@kkappler I was able to build an H5 for REX11. Here's the channel summary.

image

Can you confirm or refute the building of REX11?

@kujaku11 kujaku11 linked a pull request Sep 21, 2023 that will close this issue