I am trying to parse a large amount of text that other software cannot readily consume because it was laid out for human readers. However, every "section" of the text has the same format: lines 1-10 (section 1) have the same format as lines 11-20 (section 2), and so on. My plan is to loop through the text, gather all the sections into a list, and then convert that list into a CSV file.

Example:

####################################
123456789   first-name last-name   2/25/2018
------------------------------------
more-user-info1               12345
other-info1
123                   user-name1
------------------------------------
even-more-data1
####################################
112233445   first-name last-name   1/1/2018
------------------------------------
more-user-info2               78900
other-info2
555                   user-name2
------------------------------------
even-more-data2
####################################
<piece 3 here> ...

To give you an idea, a line of # characters (###...) marks the start of a new section.

My question is: what is the best approach to parsing this data? This is just an example; a real section has much more data and many edge cases and optional fields. The data also comes from a very old program, so I am not aware of any blueprint or business rules for it.

I am currently using regex to find the data I need to store. Are there any options besides regex for parsing loosely structured text?

rys
    I think that the general question 'how to parse some data that is understandable by (some) humans' is a machine learning task that is too hard to have a practical solution. It is also too broad for a question here. The question 'how to parse this particular piece of data' might be a better start. If you're OK with general methods as an answer, I think that would still be on-topic here. – Discrete lizard Feb 26 '18 at 10:41

1 Answer

My approach is to check, while reading, that my assumptions about the format of the data are correct. For example, I verify that I get the expected line of # characters, and that the next line starts with a number and ends with a date in mm/dd/yyyy format. On any failure, I throw an exception with a message such as "expected so and so at line xxx but found yyy". That way, if things go south, I am left with a precise message instead of some cryptic regex-related error.
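Here is a minimal sketch of that idea in Python, based only on the two-section example in the question. The names FormatError, expect, is_rule, parse_header and parse_sections are hypothetical, and the layout it checks (a # line, then a header line with id, name and date, then a - line) is an assumption that would need adjusting for the real data. Everything after the first - line is kept as raw lines, since the real fields are not known.

```python
from datetime import datetime


class FormatError(ValueError):
    """Raised when a line does not match the assumed layout."""


def expect(condition, expected, line_no, found):
    # Fail with a message that names the line and what was actually there.
    if not condition:
        raise FormatError(
            f"expected {expected} at line {line_no} but found {found!r}"
        )


def is_rule(line, char):
    # True for a non-empty line made up only of `char`, e.g. "####".
    stripped = line.strip()
    return stripped != "" and set(stripped) == {char}


def parse_header(line, line_no):
    # Assumed layout: numeric id, free-form name, date as mm/dd/yyyy.
    parts = line.split()
    expect(len(parts) >= 3, "an id, a name and a date", line_no, line)
    expect(parts[0].isdigit(), "a numeric id", line_no, parts[0])
    try:
        date = datetime.strptime(parts[-1], "%m/%d/%Y").date()
    except ValueError:
        date = None
    expect(date is not None, "a date as mm/dd/yyyy", line_no, parts[-1])
    return {
        "id": parts[0],
        "name": " ".join(parts[1:-1]),
        "date": date.isoformat(),
    }


def parse_sections(text):
    lines = text.splitlines()
    records = []
    pos = 0
    while pos < len(lines):
        expect(is_rule(lines[pos], "#"), "a line of # characters",
               pos + 1, lines[pos])
        pos += 1
        if pos == len(lines):
            break  # a trailing # line just closes the last section
        record = parse_header(lines[pos], pos + 1)
        pos += 1
        found = lines[pos] if pos < len(lines) else "end of input"
        expect(is_rule(found, "-"), "a line of - characters", pos + 1, found)
        pos += 1
        # Everything up to the next # line is kept unparsed for now.
        record["rest"] = []
        while pos < len(lines) and not is_rule(lines[pos], "#"):
            record["rest"].append(lines[pos])
            pos += 1
        records.append(record)
    return records
```

Each record is a plain dict, so the list can later be written out with csv.DictWriter once the remaining fields have been given the same treatment. As new edge cases turn up in the real data, the failing expect call points at the exact line that broke the assumption.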

Tarik