# Installation
**Requirements for running:**
- Python version 3.11.4
- pip version 23.0.1
- `genderize` library version 0.3.1
- `gender-guesser` library version 0.4.0
- `xmltodict` library version 0.13.0
- `chardet` library version 5.2.0
- `lxml` library version 5.2.2

To install the mentioned libraries, use the following commands:
```bash
pip install chardet
pip install xmltodict
pip install genderize
pip install gender-guesser
pip install lxml
```

# Running
Run the tool using the command:
```bash
python vote2json.py -mf <meta-file>
```

The meta file must be in JSON format. The paths to the meta file and to the scripts must be correct.
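
Because the tool expects valid JSON, it can save time to check a meta file before a long run. The following is a minimal sketch using only the standard `json` module. The helper name `check_meta_file` is hypothetical and is not part of `vote2json.py`.

```python
import json


def check_meta_file(path):
    """Return the parsed meta file, or raise ValueError naming the parse location."""
    with open(path, encoding="utf-8") as handle:
        try:
            return json.load(handle)
        except json.JSONDecodeError as err:
            # lineno and colno point at the first character the parser rejected.
            raise ValueError(
                f"{path}: line {err.lineno}, column {err.colno}: {err.msg}"
            ) from err
```

Calling `check_meta_file("<meta-file>")` before starting the tool reports the line and column of the first syntax error instead of failing partway through a run.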

# Meta File
To create a new meta file, it is recommended to use the template in the `metafiles/metafile_template.json` file or a meta file with a structure similar to the input data file.

# Practical Recommendations for Creating a Meta File
1) For attribute *data* values, it is recommended to use placeholder values.  
Example: 
```json
"idPolitickeSubjekty": {
    "data": [
        1
    ]
},
```

2) Redirect the standard output to a log file, e.g., *out.log*. It is easier to debug errors using the log file.

3) When creating user-defined functions (i.e., **FunctionStringType**), the most effective way to debug them is to add debug output. Ideally, redirect it to the error output using the `sys` library.  
Example:
```python
import sys

def function(value):
    print("ERROR XY", file=sys.stderr)
    return value
```

4) A user function is applied to the entire remaining part of the tree after the last *tree_position* section, so write it to handle that whole subtree.

5) The *lxml* library is used to load HTML and transform it into XML. The *xmltodict* library is used to transform XML into a Python dictionary (*dict*). The *csv* library is used to read CSV files. Refer to the documentation of these libraries to understand the syntax for defining meta models and accessing values:
   - [lxml documentation](https://lxml.de)
   - [xmltodict documentation](https://pypi.org/project/xmltodict/)
   - [CSV library documentation](https://docs.python.org/3/library/csv.html)

6) For visualizing HTML or XML files and attribute names after transformation into another format, use one of the scripts in `help_scripts/visualise_xml` or `help_scripts/visualise_html`.

7) When building a meta model for data split into many files, it is recommended to incrementally provide input data: 1 voting -> 2 votings -> 1 sitting -> 2 sittings -> X sittings -> the entire dataset.

8) When building a meta model for HTML data, the structure of the input files may change from one file to the next. The log shows the last file used, which helps locate where such a change occurred.

# Configuration Section
Explanation of attributes in the configuration section:

| Parameter         | Meaning                                      |
|-------------------|----------------------------------------------|
| format            | Format of input data                        |
| output_file       | Name of the output file                     |
| directory_input   | Folder containing input data                |
| _delimiter        | Cell delimiter in the tabular file          |
| _user_function_file | File with user-defined functions          |
| _cleaning         | Methods for data cleaning                   |
| _data_to_combine  | Indicates whether only files need merging. Only for already transformed datasets! |

```json
"config": {
    "format": <file_format>,
    "output_file": <output_file_name>,
    "directory_input": <input_folder>,
    "_delimiter": <delimiter>,
    "_user_function_file": <user_functions_file>,
    "_cleaning": <list_of_cleaning_methods>,
    "_data_to_combine": <boolean>
}
```

# Templates for Attributes

## Template for Tabular Data Attributes
| Parameter                 | Meaning                                    | Allowed Value Types                          |
|---------------------------|--------------------------------------------|---------------------------------------------|
| data                      | Non-derivable data                        | List of non-derivable data or null          |
| start_x                   | Starting column position                  | int or "end"                                |
| end_x                     | Ending column position                    | int or "end"                                |
| start_y                   | Starting row position                     | int or "end"                                |
| end_y                     | Ending row position                       | int or "end"                                |
| _type                     | String type                               | String (type name) or null                  |
| parameters                | Parameters for string type                | List of parameters                          |
| _list_type                | List type                                 | String (type name) or null                  |
| _split_number             | Split list after this many elements       | int or null                                 |
| _delete_all_values_except_one | Keep only the first value in the list | true, false, or null                        |

*Note*: "end" indicates the end of the file. Negative values can be used to indicate positions from the end of the list.
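
The position rules in the note can be pictured with ordinary Python indexing. This is only a sketch under stated assumptions: the helper `resolve_position` is hypothetical, and treating `"end"` as one past the last element (an exclusive end) is an assumption, since the text does not say whether end positions are inclusive.

```python
def resolve_position(value, length):
    """Turn a meta-file position into a non-negative index for a sequence of `length` items."""
    if value == "end":
        return length  # assumed: "end" means one past the last element
    if value < 0:
        return length + value  # e.g. -1 refers to the last element
    return value


row = ["id", "name", "party", "vote"]
start = resolve_position(1, len(row))
end = resolve_position("end", len(row))
print(row[start:end])  # ['name', 'party', 'vote']
```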

### Syntax in Meta File
```json
"data": <non_derivable_data>,
"start_x": <position_value>,
"end_x": <position_value>,
"start_y": <position_value>,
"end_y": <position_value>,
"_type": <string_type>,
"parameters": [<list_of_parameters>],
"_split_number": <split_value>,
"_delete_all_values_except_one": <boolean_value>,
"_list_type": <list_type>
```

## Template for Tree-Structured Data Attributes
| Parameter                 | Meaning                                    | Allowed Value Types                          |
|---------------------------|--------------------------------------------|---------------------------------------------|
| data                      | Non-derivable data                        | List of non-derivable data or null          |
| tree_position             | Data position in tree                     | String                                       |
| _type                     | String type                               | String (type name) or null                  |
| parameters                | Parameters for string type                | List of parameters                          |
| _list_type                | List type                                 | String (type name) or null                  |
| _key_is_value             | Use the key as the value                  | true, false, or null                        |
| _make_list_one_dimension  | Flatten the resulting list to 1D          | true, false, or null                        |

### Syntax in Meta File
```json
"data": <non_derivable_data>,
"tree_position": <tree_position>,
"_type": <string_type>,
"parameters": [<list_of_parameters>],
"_list_type": <list_type>,
"_key_is_value": <boolean_value>,
"_make_list_one_dimension": <boolean_value>
```

# Tree Position Indicators
| Ending Symbols    | Function                                      |
|-------------------|----------------------------------------------|
| .                 | Traverse all parts of the tree at one level lower |
| `                 | Traverse only a single part of the tree at one level lower |
| \|                | Combine multiple keys at the lowest level into a single value list |
| +                 | Concatenate values from multiple keys at the lowest level into a single value |
| *                 | Used for cleaning tree data with `change_deputy_party_value_in_date_interval` method |
| ""                | Empty symbol denoting the end of a position string |

"!" is used to apply any <b>StringType</b> to a specific attribute in the + notation.



**Note**: Square brackets `[ ]` can select parts of a list during nesting:

| Value in Brackets       | Meaning                                      |
|-------------------------|----------------------------------------------|
| number                  | Index in the list                           |
| number:number           | Range in the list                           |
| string:value            | Extracts only matching values               |

Example:
``` `zastupitelstvo`radaZastImport.navrhUsneseni.vysledekHlasovani`pro|proti|zdrzelSe|nepritomen|omluven|neomluven` ```

**Note**: `_split_date` divides datasets by date in `%Y-%m-%d` format.

```json
"_split_date": "<date>"
```
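
A date string can be checked against the `%Y-%m-%d` format with the standard `datetime` module before it goes into the meta file. The helper name `is_valid_split_date` is hypothetical and only illustrates the format.

```python
from datetime import datetime


def is_valid_split_date(text):
    """Return True if `text` parses with the %Y-%m-%d format used by _split_date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(is_valid_split_date("2022-10-15"))  # True
print(is_valid_split_date("15.10.2022"))  # False
```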

# Cleaning Methods for Tabular Data

## CombineCellsAndDeleteSecond
Combines two cells and deletes the second.

|Parameter|Meaning|
|:-:|:-:|
|first_cell_x|location of first cell on x axis|
|second_cell_x|location of second cell on x axis|
|first_cell_y|location of first cell on y axis|
|second_cell_y|location of second cell on y axis|
### Meta file syntax
```json
"method": "CombineCellsAndDeleteSecond",
"parameters": {
"first_cell_x": <position>,
"second_cell_x": <position>,
"first_cell_y": <position>,
"second_cell_y": <position>
}
```
## CombineCellsAndKeepBoth
Combines two cells, writes the concatenated string to the first cell, and keeps the second cell intact.

|Parameter|Meaning|
|:-:|:-:|
|first_cell_x|location of first cell on x axis|
|second_cell_x|location of second cell on x axis|
|first_cell_y|location of first cell on y axis|
|second_cell_y|location of second cell on y axis|
|_first_cell_cut|cut index of the string in the first cell|
|_second_cell_cut|cut index of the string in the second cell|
|cell_divider|divider of the concatenated cells|
### Meta file syntax
```json
"method": "CombineCellsAndKeepBoth",
"parameters": {
"first_cell_x": <position>,
"second_cell_x": <position>,
"first_cell_y": <position>,
"second_cell_y": <position>,
"_first_cell_cut": <cut_index>,
"_second_cell_cut": <cut_index>,
"cell_divider": <delimiter>
}
```


## DeleteRow
Deletes the row at the given index.

|Parameter|Meaning|
|:-:|:-:|
|row|index of row to be deleted|
### Meta file syntax
```json
"method": "DeleteRow",
"parameters": {
"row": 33
}
```

# Cleaning Methods for Tree Data

## ChangeValueInTree
Changes values throughout the file based on tree positions.

|Parameter|Meaning|
|:-:|:-:|
|tree_position|location of data in tree|
|old_name|original value|
|new_name|new value|
### Meta file syntax
```json
"method": "ChangeValueInTree",
"parameters": {
"tree_position": <tree_position>,
"old_name": <old_value>,
"new_name": <new_value>
}
```
The `tree_position` parameter should be written in this format:
`"data.start_date[date]*voting_results.deputies_details.name[name]*party"`
The asterisk indicates that the value is obtained by the parameter, and the square brackets denote the key used to determine the parameter.


## ChangeDeputyPartyValueInDateInterval
Modifies the political party value for a deputy within a specific date range.

|Parameter|Meaning|
|:-:|:-:|
|tree_position|location of data in tree|
|original_date_format|original date format|
|start_date_limit|start of time interval|
|end_date_limit|end of time interval|
|name|deputy name|
|new_value|new party name value|
### Meta file syntax
```json
"method": "ChangeDeputyPartyValueInDateInterval",
"parameters": {
"tree_position": <tree_position>,
"original_date_format": <date_format>,
"start_date_limit": <start_date>,
"end_date_limit": <end_date>,
"name": <deputy_name>,
"new_value": <new_party_value>
}
```
The `tree_position` parameter should be written in this format:
`"data.start_date[date]*voting_results.deputies_details.name[name]*party"`
The asterisk indicates that the value is used to obtain the parameter, and the square brackets contain the key used to determine the parameter.




# String Types
## FromToStringType
Extracts a substring from a string based on indexes.

|Parameter|Meaning|
|:-:|:-:|
|od|start index of the substring|
|do|end index of the substring|

### Meta file syntax
```json
"_type": "FromToStringType",
"parameters": [
<start_index>,
<end_index>
]
```
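
Index-based extraction corresponds to Python slicing. The sketch below uses a hypothetical helper `from_to` and assumes that the end index is exclusive, as in a Python slice. The text does not state whether the tool includes the character at the end index.

```python
def from_to(value, start, end):
    """Return the substring value[start:end]; the end index is exclusive here (an assumption)."""
    return value[start:end]


print(from_to("2021-05-17 14:30", 0, 10))  # 2021-05-17
```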

## VotesNameChangeStringType
Maps voting values to values defined in a reference model.

|Parameter|Meaning|
|:--:|:--:|
|for|vote for|
|against|vote against|
|absent|deputy was absent|
|did not vote|deputy did not vote|
|abstained|deputy abstained|
|excused|deputy was excused|

### Meta file syntax
```json
"_type": "VotesNameChangeStringType",
"parameters": [
<for_value>,
<against_value>,
<absent_value>,
<not_voted_value>,
<abstain_value>,
<excused_value>
]
```
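
The idea of the positional parameter list can be sketched as a lookup table that pairs each source value with the category at the same position. The labels in `REFERENCE_LABELS`, the helper `build_vote_mapping`, and the sample source values are all hypothetical. The actual reference-model values are not given here and may differ.

```python
# Hypothetical target labels; the real reference-model values may differ.
REFERENCE_LABELS = ["for", "against", "absent", "did not vote", "abstained", "excused"]


def build_vote_mapping(parameters):
    """Pair each source value with the reference label at the same position."""
    if len(parameters) != len(REFERENCE_LABELS):
        raise ValueError("expected six values, one per vote category")
    return dict(zip(parameters, REFERENCE_LABELS))


# Sample source values, listed in the same order as the parameter table.
mapping = build_vote_mapping(
    ["pro", "proti", "nepritomen", "nehlasoval", "zdrzelSe", "omluven"]
)
print(mapping["zdrzelSe"])  # abstained
```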

## VoteApprovedChangeStringType
Converts approval status values (e.g., "accepted", "rejected") to the values defined in the reference model.

|Parameter|Meaning|
|:--:|:--:|
|accepted|voting accepted|
|rejected|voting rejected|
### Meta file syntax
```json
"_type": "VoteApprovedChangeStringType",
"parameters": [
<approved_value>,
<rejected_value>
]
```

## RegexSplitStringType
Splits a string into a list using a regular expression.

|Parameter|Meaning|
|:--:|:--:|
|regex|regular expression used for splitting|
|index|index of the wanted value in the resulting list|

### Meta file syntax
```json
"_type": "RegexSplitStringType",
"parameters": [
<regex>,
<index>
]
```
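
The behavior described above matches `re.split` followed by indexing. This is a sketch with a hypothetical helper `regex_split`, not the tool's implementation.

```python
import re


def regex_split(value, regex, index):
    """Split `value` on `regex` and return the part at position `index`."""
    return re.split(regex, value)[index]


print(regex_split("ODS ; Jan Novak ; for", r"\s*;\s*", 1))  # Jan Novak
```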

## RegexSplitAndConcatStringType
Splits a string using a regex and concatenates parts based on two indexes.

|Parameter|Meaning|
|:--:|:--:|
|regex|regular expression used for splitting|
|index1|index of the first part in the list produced by the regex split|
|index2|index of the second part in the list produced by the regex split|


### Meta file syntax
```json
"_type": "RegexSplitAndConcatStringType",
"parameters": [
<regex>,
<index1>,
<index2>
]
```
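
A possible reading of this type is a split followed by joining the two selected parts. The helper `regex_split_and_concat` is hypothetical, and the single-space separator is an assumption because the text does not say how the parts are joined.

```python
import re


def regex_split_and_concat(value, regex, index1, index2, separator=" "):
    """Split `value` on `regex` and join the parts at index1 and index2."""
    parts = re.split(regex, value)
    return parts[index1] + separator + parts[index2]


print(regex_split_and_concat("Novak, Jan, ODS", r",\s*", 1, 0))  # Jan Novak
```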

## RegexFindAllStringType
Finds all matches of a regex within a string and returns the match at a given index.

|Parameter|Meaning|
|:--:|:--:|
|regex|regular expression; every substring that matches it is collected into a list|
|index|index of the wanted match in that list|

### Meta file syntax
```json
"_type": "RegexFindAllStringType",
"parameters": [
<regex>,
<index>
]
```
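
In Python terms this corresponds to `re.findall` followed by indexing. The helper `regex_find_all` is a hypothetical sketch. Note that `re.findall` returns the captured groups rather than the whole match when the pattern contains groups.

```python
import re


def regex_find_all(value, regex, index):
    """Collect all non-overlapping matches of `regex` and return the one at `index`."""
    return re.findall(regex, value)[index]


print(regex_find_all("For: 12, Against: 7, Abstained: 3", r"\d+", 1))  # 7
```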

## DateStringType
Extracts a date from a string using a specified format.

|Parameter|Meaning|
|:--:|:--:|
|date format|date format specification|

### Meta file syntax
```json
"_type": "DateStringType",
"parameters": [
<date_format>
]
```

## RegexDateStringType
Uses a regex to find a substring and extracts a date from it based on a format.

|Parameter|Meaning|
|:--:|:--:|
|regex|regular expression to find substring in a string|
|date format|date format specification|

### Meta file syntax
```json
"_type": "RegexDateStringType",
"parameters": [
<regex>,
<date_format>
]
```
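
The two-step idea (locate the substring, then parse it) can be sketched with `re.search` and `datetime.strptime`. The helper `regex_date` is hypothetical. Returning a `datetime.date` object, and `None` when nothing matches, are assumptions about the output.

```python
import re
from datetime import datetime


def regex_date(value, regex, date_format):
    """Find the first substring matching `regex` and parse it with `date_format`."""
    match = re.search(regex, value)
    if match is None:
        return None  # assumed behavior when no substring matches
    return datetime.strptime(match.group(0), date_format).date()


text = "Sitting held on 17.05.2021 in Prague"
print(regex_date(text, r"\d{2}\.\d{2}\.\d{4}", "%d.%m.%Y"))  # 2021-05-17
```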

## TimeStringType
Extracts a time from a string based on a specified format.

|Parameter|Meaning|
|:--:|:--:|
|time format|time format specification|

### Meta file syntax
```json
"_type": "TimeStringType",
"parameters": [
<time_format>
]
```

## RegexTimeStringType
Uses a regex to find a substring and extracts a time based on a format.

|Parameter|Meaning|
|:--:|:--:|
|regex|regular expression to find substring in a string|
|time format|time format specification|

### Meta file syntax
```json
"_type": "RegexTimeStringType",
"parameters": [
<regex>,
<time_format>
]
```

## ConcatWithDateStringType
Concatenates a date value with a provided string.

|Parameter|Meaning|
|:--:|:--:|
|date format|date format specification|
|string to concat|string concatenated in front of the date|

### Meta file syntax
```json
"_type": "ConcatWithDateStringType",
"parameters": [
<date_format>,
<string_to_concat>
]
```
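
One way to picture this type: parse the value with the given date format, then put the provided string in front of the date. The helper `concat_with_date` is hypothetical, and writing the date in ISO form (`YYYY-MM-DD`) is an assumption about the output format.

```python
from datetime import datetime


def concat_with_date(value, date_format, prefix):
    """Parse `value` as a date and return `prefix` followed by the ISO date."""
    parsed = datetime.strptime(value, date_format).date()
    return prefix + parsed.isoformat()


print(concat_with_date("17.05.2021", "%d.%m.%Y", "sitting-"))  # sitting-2021-05-17
```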

## FunctionStringType
Applies a user-defined function to the value.
The **entire** remaining structure is passed into the function as the value, so the function must work with that structure rather than with a single value. The function itself is defined in a separate file of user functions, whose path is given in the `_user_function_file` attribute of the `config` section.

|Parameter|Meaning|
|:--:|:--:|
|function_parameters|function parameters without the value itself|
|function_name|name of the function to be loaded into the global namespace|
|_apply_on_list|whether the function is applied to the whole list or to each value separately|

### Meta file syntax
```json
"_type": "FunctionStringType",
"parameters": {
"function_parameters": <function_parameters>,
"function_name": <function_name>,
"_apply_on_list": <boolean>
}
```
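
A user function for this type might look like the sketch below. Everything in it is illustrative: the function name `join_name`, the keys `first_name` and `last_name`, and the assumption that the value arrives as the first argument, followed by the entries of `function_parameters`.

```python
import sys


def join_name(value, separator):
    """Hypothetical user function: build a full name from the remaining subtree."""
    # `value` is assumed to be the remaining structure (here a dict), not a plain string.
    if not isinstance(value, dict):
        print(f"join_name: unexpected input {value!r}", file=sys.stderr)
        return value
    parts = [value.get("first_name", ""), value.get("last_name", "")]
    return separator.join(parts).strip()


print(join_name({"first_name": "Jan", "last_name": "Novak"}, " "))  # Jan Novak
```

Writing unexpected input to the error output keeps the debug messages separate from the standard output that is redirected to the log.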

# List Types
## GetMinListType
Extracts the minimum value from a list.
### Meta file syntax
```json
"_list_type": "GetMinListType"
```

## GetMaxListType
Extracts the maximum value from a list.
### Meta file syntax
```json
"_list_type": "GetMaxListType"
```

## GetMinDateListType
Finds the earliest date in a list.
### Meta file syntax
```json
"_list_type": "GetMinDateListType"
```

## GetMaxDateListType
Finds the latest date in a list.
### Meta file syntax
```json
"_list_type": "GetMaxDateListType"
```
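
The four list types behave like Python's built-in `min` and `max`. The sketch below only illustrates those semantics with sample data. How the tool itself parses dates inside a list is not described here. Dates need to be compared as date values: ISO strings such as `2021-05-17` happen to sort chronologically, but strings such as `17.05.2021` do not.

```python
from datetime import date

votes = [12, 7, 3]
sittings = [date(2021, 5, 17), date(2020, 11, 3), date(2022, 1, 9)]

print(min(votes), max(votes))  # 3 12
print(min(sittings))  # 2020-11-03
print(max(sittings))  # 2022-01-09
```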
