#85 new enhancement

Enable survex parser to be used by more external programs

Reported by:	Wookey	Owned by:	Olly Betts
Priority:	minor	Milestone:
Component:	Other	Version:
Keywords:		Cc:

Description

Survex files are hard to parse fully and correctly. Quite a few other programs have implemented parsers and mostly get it wrong.

Using the real parser is a much better plan. However, this is currently only possible by snapshotting the img.c and img.h files, and as they are C it only works sensibly with C and C++ programs. Anyone parsing from Java (Tunnel or caveconverter) or Python (troggle) is out of luck.

dump3d partly solves this problem, but only a subset of the info is in the .3d files.

Making an actual library and/or API which could be used by other programs would make life a lot easier for many things.

Being able to support both survex and therion formats would be even better, but that should be another ticket.

Change History (7)

comment:1 Changed 9 years ago by Wookey

SWIG is one sensible approach to making the library available to many more languages.

comment:2 Changed 9 years ago by Olly Betts

Priority:	major → minor

Can you clarify what you mean by "survex files"?

You talk about wrapping the img code with SWIG - img reads processed data (.3d, .plt, etc), and writes .3d. Wrapping img with SWIG would be pretty easy to do (no deep familiarity with the Survex code required).

If you're talking about ".svx" files (which "support both survex and therion formats" seems to suggest, or did you mean "lox" by "therion format"?) that's a very different matter. The parsing of ".svx" files is not a single module which you can easily lift out of the cavern code. It would indeed be nicer if it was, but changing that is a significant amount of work - every single place where the parser interacts with the rest of the code would need replacing with some sort of external interface, and I'm not entirely sure that the benefits really justify the effort (especially as I'm not short of useful things to work on).

And most cases where people try to parse .svx files would be better done from the processed data anyway (that's true of troggle and tunnel).

comment:3 Changed 9 years ago by Wookey

Good point. I am mixing up the .3d reader (img.*, easily separated) with the .svx parsing (not easily separated).

If everything is available in .3d then yes reading from there may well be sufficient. But do we have title, team? Exports? references, co-ordinate systems?

I did point out to Julian at Eurospeleo that everything he needed (all names in equates, and all connectivity) came out of the dump3d format, and he agreed, so the need for parsing .svx is removed.

He pointed out that there is no actual reference-ID for each point, just the XYZ co-ordinates, which could potentially clash, and certainly having an ID would make the files shorter. In practice this works well enough, but a reference would be better.
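To illustrate what "just the XYZ co-ordinates" means for a consumer, here is a minimal sketch of rebuilding equates and connectivity by treating equal coordinates as the same point. The input tuples and the `build_graph` name are made up for the example; they are not the dump3d output format or any part of the img API.

```python
from collections import defaultdict

def build_graph(labels, legs):
    """Group station names and leg ends by exact coordinate equality.

    labels: iterable of (name, (x, y, z)) -- hypothetical input shape
    legs:   iterable of ((x, y, z), (x, y, z)) -- hypothetical input shape
    """
    # Names sharing a coordinate are assumed to be equated stations.
    names_at = defaultdict(set)
    for name, xyz in labels:
        names_at[xyz].add(name)

    # Connectivity is keyed on coordinates, since there is no point ID.
    neighbours = defaultdict(set)
    for a, b in legs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return names_at, neighbours

# Invented sample data: two names equated at the origin, and two legs.
labels = [
    ("cave.entrance", (0, 0, 0)),
    ("cave.1", (0, 0, 0)),
    ("cave.2", (100, 50, -20)),
    ("cave.3", (180, 60, -35)),
]
legs = [
    ((0, 0, 0), (100, 50, -20)),
    ((100, 50, -20), (180, 60, -35)),
]
names_at, neighbours = build_graph(labels, legs)
```

The theoretical clash is visible here: two unrelated stations that happened to land on identical coordinates would be merged into one entry of `names_at`.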

comment:4 Changed 9 years ago by Olly Betts

But do we have title

Yes, but only for the top-level.

team?

No - I'd like to, but the obstacle is parsing - so far cavern has done no checking of the format, and the examples I have of actual use don't match the Survex manual and vary wildly. Same issue for "instruments".

Exports?

Yes (see blue blobs in aven).

references

No, that's just waiting for the next time the .3d format is revised (it's a free-form field so no parsing issues).

co-ordinate systems?

Yes (needed for things like GPX and KML export, loading DEMs).

It'd be useful to know what to prioritise for adding to what is carried over into the 3d format - was that just an ad-hoc list, or are those the things which troggle or something needs access to in order to stop trying to parse .svx files?

He pointed out that there is no actual reference-ID for each point, just the XYZ co-ordinates, which could potentially clash,

I pointed this wrinkle out to Julian on expo actually. It's pretty much theoretical, but at the very least it is slightly unsatisfying.

and certainly having an ID would make the files shorter.

I'm not sure that's necessarily true - each point is going to be used very close to 2 times on average (because the number of legs and stations is close to equal and each leg has two ends). So you save on storing the coordinates an extra time per station, but you have to store the reference ID three times (once with the coordinates, and once each time it is used). Each set of coordinates is three 4-byte integers, so if the reference ID is 4 bytes (e.g. all-region.svx has >60,000 stations) you save nothing, and probably lose a bit due to needing a byte or two to say "this is a reference ID" before each one is defined. I think you'd have to use some sort of variable width encoding for the reference ID if you hope to save space with this change.
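The arithmetic above can be checked with a small sketch. The function names, the 60,000 station count, and the 7-bits-per-byte variable-width scheme are assumptions chosen for illustration; none of this describes the actual .3d encoding.

```python
COORD_BYTES = 3 * 4  # three 4-byte integers per set of coordinates

def total_coords(stations, uses=2):
    """Bytes when each use of a point stores its coordinates again."""
    return stations * uses * COORD_BYTES

def total_fixed_ids(stations, id_bytes=4, uses=2, marker_bytes=0):
    """Bytes when coordinates are stored once and referenced by a fixed-width ID.

    The ID is stored 1 + uses times: once at definition, once per use.
    """
    per_station = COORD_BYTES + marker_bytes + (1 + uses) * id_bytes
    return stations * per_station

def varint_len(n):
    """Bytes needed for n in a 7-bits-per-byte variable-width encoding."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length

def total_varint_ids(stations, uses=2, marker_bytes=0):
    """As total_fixed_ids, but IDs 0..stations-1 use the variable-width form."""
    return sum(
        COORD_BYTES + marker_bytes + (1 + uses) * varint_len(i)
        for i in range(stations)
    )

stations = 60_000
print(total_coords(stations))                      # coordinates repeated per use
print(total_fixed_ids(stations))                   # 4-byte IDs, no marker
print(total_fixed_ids(stations, marker_bytes=1))   # 4-byte IDs plus 1-byte marker
print(total_varint_ids(stations))                  # variable-width IDs
```

Under these assumptions the fixed 4-byte ID case comes out the same size as repeating the coordinates, a marker byte makes it larger, and only the variable-width case is smaller.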

In practice this works well enough, but a reference would be better.

One issue is that it needs to work in such a way as to provide an API which can load files which rely on coordinate equality (existing .3d files for a start).

In the source data a leg is actually between two named stations, so maybe the leg ends should be station names (rather than these reference ids) to avoid losing the information about which station in an equated group was in the source data. You'd also need to store equate information explicitly, or else you still have to assume that equal coordinates means an equate. Both of those are actually on the TODO list (top section): http://survex.com/todo.html

comment:5 Changed 9 years ago by Philip Schuchardt

I would like to have support for parsing *.svx. It's useful for porting Survex files to other formats or importing the original data into external programs, specifically Cavewhere. I envision having a library that can parse *.svx files and generate a data structure of the files in memory. The parsed data structure could then be exported to other formats or be passed to a survey processing routine in the library. The processing method then creates another structure, which I think already exists in img.h. The img code can still produce 3d files, which the library can read back into the same structure that was created by the processing routine.

I'm not sure what this would take, but I'm willing to work on it! If you have any pointers, please let me know - it would be super helpful.

Also are there any test cases or unit tests I can use to make sure I don't break anything?

comment:6 Changed 2 years ago by Pawczak

Hey, is there any action on this topic? I'm also interested in parsing svx files in external applications. I could also help with coding such a library / parser which could be used outside survex.

comment:7 Changed 6 months ago by Olly Betts

There's a library to read .3d files (and has been for a long time). That's actually a better answer for many uses. Not all metadata is currently in the .3d file, which is the main limitation there (so you can't get team names, etc., but cavern doesn't actually parse these at all currently either).

I suspect there are different ideas about what this hypothetical ".svx library" would look like. If it's literally just the parser then you'd get some sort of AST (https://en.wikipedia.org/wiki/Abstract_syntax_tree), which seems to be what's described by comment:5. Currently nothing like that is created inside cavern - it just parses as it goes.

The real problem is that parsing of .svx data is deeply entwined with the rest of the cavern code - extracting it as a library seems like a huge project to me. If we actually built all the data as an AST and then consumed it like comment:5 seems to suggest then that means a major overhaul of the cavern code, and the end result for cavern would be significant space overhead (and probably time overhead too), and likely some new bugs introduced in the process. I think it would really need to be a more callback oriented approach.
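To make the AST versus callback distinction concrete, here is a toy sketch of a callback-oriented interface: the parser hands each item to the caller as it goes and builds no tree itself. Everything in it is hypothetical (`parse_lines`, `on_command`, `on_leg`), it is not cavern code, and it only handles a tiny simplified subset of .svx (`;` comments, `*` commands, and data lines of the form "from to followed by readings").

```python
def parse_lines(lines, on_command, on_leg):
    """Toy callback-style parser: reports items as it goes, stores nothing."""
    for raw in lines:
        # Drop anything after a ';' comment marker, then surrounding space.
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        if line.startswith("*"):
            parts = line[1:].split()
            if parts:
                # Command name is lower-cased; arguments are left as strings.
                on_command(parts[0].lower(), parts[1:])
        else:
            fields = line.split()
            if len(fields) >= 3:
                # Readings are passed through unparsed in this sketch.
                on_leg(fields[0], fields[1], fields[2:])

# A caller that wants an in-memory structure can build its own from callbacks.
sample = [
    "*begin cave ; start of survey",
    "1 2 10.5 045 -3",
    "",
    "; a comment-only line",
    "*END cave",
]
commands = []
legs = []
parse_lines(
    sample,
    on_command=lambda name, args: commands.append((name, args)),
    on_leg=lambda frm, to, readings: legs.append((frm, to, readings)),
)
```

With this shape the parser itself carries no extra memory overhead; a consumer that does want a tree pays for it only in its own callbacks.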

If I was writing cavern from scratch now I'd probably structure it in a much more modular way, but trying to retrofit a modular structure at this point doesn't really seem feasible.

If you're not wanting an AST but wanting the leg information (and fixed points) then that's quite a different API. I'd really recommend using the .3d file instead if you can for that.

Overall I'm afraid this is a low priority compared to most of the other open tickets.
