Opened 4 years ago

Last modified 4 years ago

#85 new enhancement

Enable survex parser to be used by more external programs

Reported by: Wookey
Owned by: Olly Betts
Priority: minor
Milestone:
Component: Other
Version:
Keywords:
Cc:

Description

Survex files are hard to parse fully and correctly. Quite a few other programs have implemented parsers and mostly get it wrong.

Using the real parser is a much better plan. However this is only currently possible by snapshotting the img.c and img.h files, and as they are C it only works sensibly with C and C++ programs. Anyone parsing from Java (Tunnel or caveconverter) or Python (troggle) is out of luck.

Dump3d partly solves this problem, but only a subset of the info is in the .3d files.

Making an actual library and/or API which could be used by other programs would make life a lot easier for many things.

Being able to support both survex and therion formats would be even better, but that should be another ticket.

Change History (5)

comment:1 Changed 4 years ago by Wookey

SWIG is one sensible approach to making the library available to many more languages.

comment:2 Changed 4 years ago by Olly Betts

Priority: major → minor

Can you clarify what you mean by "survex files"?

You talk about wrapping the img code with SWIG - img reads processed data (.3d, .plt, etc), and writes .3d. Wrapping img with SWIG would be pretty easy to do (no deep familiarity with the Survex code required).

If you're talking about ".svx" files (which "support both survex and therion formats" seems to suggest, or did you mean "lox" by "therion format"?) that's a very different matter. The parsing of ".svx" files is not a single module which you can easily lift out of the cavern code. It would indeed be nicer if it was, but changing that is a significant amount of work - every single place where the parser interacts with the rest of the code would need replacing with some sort of external interface, and I'm not entirely sure that the benefits really justify the effort (especially as I'm not short of useful things to work on).

And most cases where people try to parse .svx files would be better done from the processed data anyway (that's true of troggle and tunnel).

comment:3 Changed 4 years ago by Wookey

Good point. I am mixing up the .3d reader (img.*, easily separated) with the .svx parsing (not easily separated).

If everything is available in .3d then yes reading from there may well be sufficient. But do we have title, team? Exports? references, co-ordinate systems?

I did point out to Julian at Eurospeleo that everything he needed (all names in equates, and all connectivity) came out of the dump3d format, and he agreed, so the need for parsing .svx is removed.

He pointed out that there is no actual reference-ID for each point, just the XYZ co-ordinates, which could potentially clash, and certainly having an ID would make the files shorter. In practice this works well enough, but a reference would be better.
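To make the coordinate-matching idea concrete, here is a minimal sketch of recovering equates and connectivity purely from XYZ equality. The station names, the in-memory tuples and both function names are hypothetical; this is not dump3d's actual output format or any Survex API, just an illustration of the approach.

```python
from collections import defaultdict

def group_by_coords(stations):
    """Map each (x, y, z) to the sorted list of station names found there."""
    by_xyz = defaultdict(list)
    for name, xyz in stations:
        by_xyz[xyz].append(name)
    return {xyz: sorted(names) for xyz, names in by_xyz.items()}

def legs_by_name(legs, by_xyz):
    """Resolve both ends of each leg to the group of names at those coordinates."""
    return [(by_xyz[a], by_xyz[b]) for a, b in legs]

# Hypothetical sample data: integer coordinates, made-up station names.
stations = [
    ("cave.1", (0, 0, 0)),
    ("cave.2", (100, 250, -30)),
    ("cave.side.1", (100, 250, -30)),  # same XYZ as cave.2, so treated as equated
    ("cave.side.2", (400, 250, -80)),
]
legs = [
    ((0, 0, 0), (100, 250, -30)),
    ((100, 250, -30), (400, 250, -80)),
]

groups = group_by_coords(stations)
print(groups[(100, 250, -30)])  # ['cave.2', 'cave.side.1']
print(legs_by_name(legs, groups)[0])  # (['cave.1'], ['cave.2', 'cave.side.1'])
```

The potential clash is visible here: two unrelated stations that happened to share coordinates would land in the same group, and nothing in the data could tell them apart.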

comment:4 Changed 4 years ago by Olly Betts

But do we have title

Yes, but only for the top-level.

team?

No - I'd like to, but the obstacle is parsing - so far cavern has done no checking of the format, and the examples I have of actual use don't match the Survex manual and vary wildly. Same issue for "instruments".

Exports?

Yes (see blue blobs in aven).

references

No, that's just waiting for the next time the .3d format is revised (it's a free-form field so no parsing issues).

co-ordinate systems?

Yes (needed for things like GPX and KML export, loading DEMs).

It'd be useful to know what to prioritise for adding to what is carried over into the 3d format - was that just an ad-hoc list, or are those the things which troggle or something needs access to in order to stop trying to parse .svx files?


He pointed out that there is no actual reference-ID for each point, just the XYZ co-ordinates, which could potentially clash,

I pointed this wrinkle out to Julian on expo actually. It's pretty much theoretical, but at the very least it is slightly unsatisfying.

and certainly having an ID would make the files shorter.

I'm not sure that's necessarily true - each point is going to be used very close to 2 times on average (because the number of legs and stations is close to equal and each leg has two ends). So you save on storing the coordinates an extra time per station, but you have to store the reference ID three times (once with the coordinates, and once each time it is used). Each set of coordinates is three 4-byte integers, so if the reference ID is 4 bytes (e.g. all-region.svx has >60,000 stations) you save nothing, and probably lose a bit due to needing a byte or two to say "this is a reference ID" before each one is defined. I think you'd have to use some sort of variable width encoding for the reference ID if you hope to save space with this change.
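The arithmetic above can be checked with a rough sketch. The function names and the marker-byte parameter are made up for illustration, and the 7-bits-per-byte varint is just one assumed example of a "variable width encoding", not something the .3d format defines.

```python
COORD_BYTES = 3 * 4  # three 4-byte integers per point, as stated above

def net_saving_per_station(id_bytes, marker_bytes=0, id_uses=3, coord_copies_saved=1):
    """Bytes saved per station by using reference IDs (negative means a loss).

    One copy of the coordinates is saved, but the ID is stored id_uses times:
    once with the coordinates and once for each of the ~2 uses as a leg end.
    """
    return coord_copies_saved * COORD_BYTES - id_uses * id_bytes - marker_bytes

def varint_len(n):
    """Bytes needed to store non-negative n at 7 payload bits per byte."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

print(net_saving_per_station(4))                  # 0: a fixed 4-byte ID saves nothing
print(net_saving_per_station(4, marker_bytes=1))  # -1: a 1-byte marker makes it a loss
print(varint_len(60000))                          # 3: bytes for an ID around 60,000
print(net_saving_per_station(varint_len(60000)))  # 3: bytes saved before any marker
```

So under these assumptions even a variable-width ID only saves a few bytes per station in a large dataset, and a marker byte eats into that.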

In practice this works well enough, but a reference would be better.

One issue is that it needs to work in such a way as to provide an API which can load files which rely on coordinate equality (existing .3d files for a start).

In the source data a leg is actually between two named stations, so maybe the leg ends should be station names (rather than these reference ids) to avoid losing the information about which station in an equated group was in the source data. You'd also need to store equate information explicitly too or else you still have to assume that equal coordinates means an equate. Both of those are actually on the TODO list (top section): http://survex.com/todo.html
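As a rough sketch of what that could look like to a caller: legs keyed by station name, equates stored explicitly, and a fallback to coordinate equality for existing files that carry no equate information. The Survey class and its fields are hypothetical, not anything in img.h or the .3d format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Survey:
    """Hypothetical in-memory model; not an existing Survex structure."""
    coords: dict = field(default_factory=dict)  # station name -> (x, y, z)
    legs: list = field(default_factory=list)    # (from_name, to_name) pairs
    equates: Optional[list] = None              # explicit groups, None if not recorded

    def equate_groups(self):
        """Use explicit equates when recorded, else infer them from equal coordinates."""
        if self.equates is not None:
            return [sorted(group) for group in self.equates]
        by_xyz = {}
        for name, xyz in self.coords.items():
            by_xyz.setdefault(xyz, []).append(name)
        return [sorted(names) for names in by_xyz.values() if len(names) > 1]

coords = {"a.1": (0, 0, 0), "b.1": (0, 0, 0), "a.2": (100, 0, 0)}

# Existing-style file: no equate information, so fall back to coordinates.
old_style = Survey(coords=coords, legs=[("a.1", "a.2")])
print(old_style.equate_groups())  # [['a.1', 'b.1']]

# Revised-style file: equates recorded explicitly (here: none at all),
# so a coincidental coordinate match is not mistaken for an equate.
new_style = Survey(coords=coords, legs=[("a.1", "a.2")], equates=[])
print(new_style.equate_groups())  # []
```

Keeping "not recorded" (None) distinct from "recorded, but empty" is what lets one API serve both kinds of file.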

comment:5 Changed 4 years ago by Philip Schuchardt

I would like to have support for parsing *.svx. It's useful for porting Survex files to other formats or importing the original data into external programs, specifically Cavewhere. I envision having a library that can parse *.svx files and generate a data structure of the files in memory. The parsed data structure could then be exported to other formats or be passed to a survey processing routine in the library. The processing routine then creates another structure, which I think already exists in img.h. The img code can still produce 3d files, which the library can read back into the same structure that was created by the processing routine.

I'm not sure what this would take, but I'm willing to work on it! If you have any pointers, please let me know; they would be super helpful.

Also are there any test cases or unit tests I can use to make sure I don't break anything?
