Compare commits

...

32 Commits

SHA1        Message                                                                        Date
71dc906a96  chore(release): 1.7.0                                                          2025-02-26 21:57:13 -06:00
24c6e311c1  feat: adds configurable skip if file exists                                    2025-02-26 21:55:12 -06:00
4dd3004a7b  chore(release): 1.6.0                                                          2025-02-26 21:08:00 -06:00
46f6b6cdf1  feat: Adds ability to parse bayesruns without timestamps                       2025-02-26 21:01:19 -06:00
c8435b4b2a  feat: allows negative log magnitude strings in models                          2025-02-24 08:34:11 -06:00
c2375e6f5c  chore(release): 1.5.0                                                          2024-12-29 21:23:30 -06:00
a1b59cd18b  feat: add configurable max number of dipoles to write                          2024-12-29 21:14:59 -06:00
53f8993f2b  feat: add configurable max number of dipoles to write                          2024-12-29 21:13:34 -06:00
700f32ea58  chore(release): 1.4.0                                                          2024-09-04 13:58:56 -05:00
3737252c4b  log: adds additional logging of dipole count                                   2024-09-04 13:56:09 -05:00
6f79a49e59  log: adds additional logging of dipole count                                   2024-09-04 13:54:50 -05:00
d962ecb11e  feat: indexifier now has len                                                   2024-08-26 03:34:57 -05:00
7beca501bf  fmt: ran formatter                                                             2024-08-26 03:34:50 -05:00
5425ce1362  feat: allows some betetr matching for single_dipole runs                       2024-08-26 03:31:15 -05:00
6a5c5931d4  fix: update log file arg names in cli scripts                                  2024-05-21 16:10:02 -05:00
36ff75576c  chore: removes redundant import                                                2024-05-21 15:55:25 -05:00
e76c619c8b  fmt: formatting changes                                                        2024-05-21 15:54:55 -05:00
c881da2837  feat: add subset sim probs command for bayes for subset simulation results [CI: build failed]  2024-05-21 15:54:08 -05:00
1a1ecc01ea  chore: adds vscode to gitignore                                                2024-05-21 15:53:21 -05:00
9cfd484d7c  chore(release): 1.3.0                                                          2024-05-19 22:13:39 -05:00
09fad2e102  feat: improve initial cost calculation to allow multiprocessing, adds ability to specify a number of levels to do with direct mc instead of subset simulation  2024-05-19 22:11:50 -05:00
24ac65bf9c  fix: fix seeding to avoid recreating seed combinations across multi runs      2024-05-19 22:10:40 -05:00
8fbae32111  doc: some commenting and logging changes                                       2024-05-19 22:09:52 -05:00
b1c01b25c8  fix: Adds ugly hack for stdevs for this uniform range to multiply by root3, proper fix would be in pdme  2024-05-19 22:08:44 -05:00
a14d9834e5  doc: note on refactoring for subset sim probs                                  2024-05-19 22:01:42 -05:00
8d04803eb3  fmt: formatting, nicer log, removing comment                                   2024-05-19 02:29:59 -05:00
92b49fce7c  feat: add multi run to wrap multi model and repeat runs                        2024-05-19 02:27:11 -05:00
8845b2875f  feat: adds a filter that works with cost functions                             2024-05-19 02:26:00 -05:00
72791f2d0f  deps: update pdme                                                              2024-05-19 02:25:29 -05:00
d258cfbec7  chore(release): 1.2.1                                                          2024-05-11 20:51:05 -05:00
b3bf4cde97  perf: precompile the magic regexes for probs parsing                           2024-05-11 20:49:45 -05:00
60f29b0b2f  perf: avoid recalculating product dict in indexifier to improve performance for probs  2024-05-11 20:49:26 -05:00
26 changed files with 1426 additions and 304 deletions

.gitignore (vendored, 2 lines changed)

@@ -145,3 +145,5 @@ cython_debug/
 *.csv
 local_scripts/
+.vscode

CHANGELOG.md

@@ -2,6 +2,60 @@
 All notable changes to this project will be documented in this file. See [standard-version](https://github.com/conventional-changelog/standard-version) for commit guidelines.
+
+## [1.7.0](https://gitea.deepak.science:2222/physics/deepdog/compare/1.6.0...1.7.0) (2025-02-27)
+
+### Features
+
+* adds configurable skip if file exists ([24c6e31](https://gitea.deepak.science:2222/physics/deepdog/commit/24c6e311c1d3067eb98cc60e6ca38d76373bf08e))
+
+## [1.6.0](https://gitea.deepak.science:2222/physics/deepdog/compare/1.5.0...1.6.0) (2025-02-27)
+
+### Features
+
+* Adds ability to parse bayesruns without timestamps ([46f6b6c](https://gitea.deepak.science:2222/physics/deepdog/commit/46f6b6cdf15c67aedf0c871d201b8db320bccbdf))
+* allows negative log magnitude strings in models ([c8435b4](https://gitea.deepak.science:2222/physics/deepdog/commit/c8435b4b2a6e4b89030f53b5734eb743e2003fb7))
+
+## [1.5.0](https://gitea.deepak.science:2222/physics/deepdog/compare/1.4.0...1.5.0) (2024-12-30)
+
+### Features
+
+* add configurable max number of dipoles to write ([a1b59cd](https://gitea.deepak.science:2222/physics/deepdog/commit/a1b59cd18b30359328a09210d9393f211aab30c2))
+* add configurable max number of dipoles to write ([53f8993](https://gitea.deepak.science:2222/physics/deepdog/commit/53f8993f2b155228fff5cbee84f10c62eb149a1f))
+
+## [1.4.0](https://gitea.deepak.science:2222/physics/deepdog/compare/1.3.0...1.4.0) (2024-09-04)
+
+### Features
+
+* add subset sim probs command for bayes for subset simulation results ([c881da2](https://gitea.deepak.science:2222/physics/deepdog/commit/c881da28370a1e51d062e1a7edaa62af6eb98d0a))
+* allows some betetr matching for single_dipole runs ([5425ce1](https://gitea.deepak.science:2222/physics/deepdog/commit/5425ce1362919af4cc4dbd5813df3be8d877b198))
+* indexifier now has len ([d962ecb](https://gitea.deepak.science:2222/physics/deepdog/commit/d962ecb11e929de1d9aa458b5d8e82270eff0039))
+
+### Bug Fixes
+
+* update log file arg names in cli scripts ([6a5c593](https://gitea.deepak.science:2222/physics/deepdog/commit/6a5c5931d4fc849d0d6a0f2b971523a0f039d559))
+
+## [1.3.0](https://gitea.deepak.science:2222/physics/deepdog/compare/1.2.1...1.3.0) (2024-05-20)
+
+### Features
+
+* add multi run to wrap multi model and repeat runs ([92b49fc](https://gitea.deepak.science:2222/physics/deepdog/commit/92b49fce7c86f14484deb1c4aaaa810a6f69c08a))
+* adds a filter that works with cost functions ([8845b28](https://gitea.deepak.science:2222/physics/deepdog/commit/8845b2875f2c91c91dd3988fabda26400c59b2d7))
+* improve initial cost calculation to allow multiprocessing, adds ability to specify a number of levels to do with direct mc instead of subset simulation ([09fad2e](https://gitea.deepak.science:2222/physics/deepdog/commit/09fad2e1024d9237a6a4f7931f51cb4c84b83bf8))
+
+### Bug Fixes
+
+* Adds ugly hack for stdevs for this uniform range to multiply by root3, proper fix would be in pdme ([b1c01b2](https://gitea.deepak.science:2222/physics/deepdog/commit/b1c01b25c8f2c3947be23f5b2c656c37437dab17))
+* fix seeding to avoid recreating seed combinations across multi runs ([24ac65b](https://gitea.deepak.science:2222/physics/deepdog/commit/24ac65bf9c74c454fec826ca9de640fe095f5a17))
+
+### [1.2.1](https://gitea.deepak.science:2222/physics/deepdog/compare/1.2.0...1.2.1) (2024-05-12)
+
 ## [1.2.0](https://gitea.deepak.science:2222/physics/deepdog/compare/1.1.0...1.2.0) (2024-05-09)

deepdog/cli/probs/args.py

@@ -13,7 +13,7 @@ def parse_args() -> argparse.Namespace:
         "probs", description="Calculating probability from finished bayesrun"
     )
     parser.add_argument(
-        "--log_file",
+        "--log-file",
         type=str,
         help="A filename for logging to, if not provided will only log to stderr",
         default=None,

deepdog/cli/probs/main.py

@@ -72,6 +72,7 @@ def main(args: argparse.Namespace):
         for f in tqdm.tqdm(out_files, desc="reading files", leave=False)
     ]
+    # Refactor here to allow for arbitrary likelihood file sources
     _logger.info("building uncoalesced dict")
     uncoalesced_dict = deepdog.cli.probs.dicts.build_model_dict(parsed_output_files)

deepdog/cli/subset_sim_probs/__init__.py (new file)

@@ -0,0 +1,5 @@
from deepdog.cli.subset_sim_probs.main import wrapped_main

__all__ = [
    "wrapped_main",
]

deepdog/cli/subset_sim_probs/args.py (new file)

@@ -0,0 +1,52 @@
import argparse
import os


def parse_args() -> argparse.Namespace:
    def dir_path(path):
        if os.path.isdir(path):
            return path
        else:
            raise argparse.ArgumentTypeError(f"readable_dir:{path} is not a valid path")

    parser = argparse.ArgumentParser(
        "subset_sim_probs",
        description="Calculating probability from finished subset sim run",
    )
    parser.add_argument(
        "--log-file",
        type=str,
        help="A filename for logging to, if not provided will only log to stderr",
        default=None,
    )
    parser.add_argument(
        "--results-directory",
        "-d",
        type=dir_path,
        help="The directory to search for bayesrun files, defaulting to cwd if not passed",
        default=".",
    )
    parser.add_argument(
        "--indexify-json",
        help="A json file with the indexify config for parsing job indexes. Will skip if not present",
        default="",
    )
    parser.add_argument(
        "--outfile",
        "-o",
        type=str,
        help="output filename for coalesced data. If not provided, will not be written",
        default=None,
    )

    confirm_outfile_overwrite_group = parser.add_mutually_exclusive_group()
    confirm_outfile_overwrite_group.add_argument(
        "--never-overwrite-outfile",
        action="store_true",
        help="If a duplicate outfile is detected, skip confirmation and automatically exit early",
    )
    confirm_outfile_overwrite_group.add_argument(
        "--force-overwrite-outfile",
        action="store_true",
        help="Skips checking for duplicate outfiles and overwrites",
    )
    return parser.parse_args()
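
The parser above gives the new subset sim probs tool a small CLI surface. A hypothetical invocation sketch (the installed command name depends on how wrapped_main is wired up as an entry point, so treat the script name as an assumption):

# Hypothetical usage of the parser defined above:
#   subset_sim_probs --results-directory ./results --indexify-json spec.json -o probs.csv
# The two overwrite flags sit in a mutually exclusive group, so argparse
# rejects an invocation that passes both:
#   subset_sim_probs --never-overwrite-outfile --force-overwrite-outfile  # error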

deepdog/cli/subset_sim_probs/dicts.py (new file)

@@ -0,0 +1,136 @@
import typing
from deepdog.results import GeneralOutput
import logging
import csv
import tqdm

_logger = logging.getLogger(__name__)


def build_model_dict(
    general_outputs: typing.Sequence[GeneralOutput],
) -> typing.Dict[
    typing.Tuple, typing.Dict[typing.Tuple, typing.Dict["str", typing.Any]]
]:
    """
    Maybe someday do something smarter with the coalescing and stuff but don't want to so i won't
    """
    # assume that everything is well formatted and the keys are the same across entire list and initialise list of keys.
    # model dict will contain a model_key: {calculation_dict} where each calculation_dict represents a single calculation for that model,
    # the uncoalesced version, keyed by the specific file keys
    model_dict: typing.Dict[
        typing.Tuple, typing.Dict[typing.Tuple, typing.Dict["str", typing.Any]]
    ] = {}

    _logger.info("building model dict")
    for out in tqdm.tqdm(general_outputs, desc="reading outputs", leave=False):
        for model_result in out.results:
            model_key = tuple(v for v in model_result.parsed_model_keys.values())
            if model_key not in model_dict:
                model_dict[model_key] = {}
            calculation_dict = model_dict[model_key]
            calculation_key = tuple(v for v in out.data.values())
            if calculation_key not in calculation_dict:
                calculation_dict[calculation_key] = {
                    "_model_key_dict": model_result.parsed_model_keys,
                    "_calculation_key_dict": out.data,
                    "num_finished_runs": int(
                        model_result.result_dict["num_finished_runs"]
                    ),
                    "num_runs": int(model_result.result_dict["num_runs"]),
                    "estimated_likelihood": float(
                        model_result.result_dict["estimated_likelihood"]
                    ),
                }
            else:
                raise ValueError(
                    f"Got {calculation_key} twice for model_key {model_key}"
                )

    return model_dict
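
# Editor's sketch (not part of the diff): the uncoalesced structure built
# above, with invented field names and values, looks roughly like:
# {
#     ("-10", "10", "free", "1"): {            # model_key, from parsed_model_keys
#         ("slug", "42"): {                    # calculation_key, from out.data values
#             "_model_key_dict": {"xmin": "-10", ...},
#             "_calculation_key_dict": {"filename_slug": "slug", ...},
#             "num_finished_runs": 100,
#             "num_runs": 128,
#             "estimated_likelihood": 3.2e-05,
#         },
#     },
# }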
def coalesced_dict(
    uncoalesced_model_dict: typing.Dict[
        typing.Tuple, typing.Dict[typing.Tuple, typing.Dict["str", typing.Any]]
    ],
):
    """
    pass in uncoalesced dict
    the minimum_count field is what we use to make sure our probs are never zero
    """
    coalesced_dict = {}

    # we're already iterating anyway, and performance doesn't matter here, so just count the keys ourselves
    num_keys = 0

    # first pass coalesce
    for model_key, model_dict in uncoalesced_model_dict.items():
        num_keys += 1
        for calculation in model_dict.values():
            if model_key not in coalesced_dict:
                coalesced_dict[model_key] = {
                    "_model_key_dict": calculation["_model_key_dict"].copy(),
                    "calculations_coalesced": 1,
                    "num_finished_runs": calculation["num_finished_runs"],
                    "num_runs": calculation["num_runs"],
                    "estimated_likelihood": calculation["estimated_likelihood"],
                }
            else:
                _logger.error(f"We shouldn't be here! Double key for {model_key=}")
                raise ValueError()

    # second pass do probability calculation
    prior = 1 / num_keys
    _logger.info(f"Got {num_keys} model keys, so our prior will be {prior}")

    total_weight = 0
    for coalesced_model_dict in coalesced_dict.values():
        model_weight = coalesced_model_dict["estimated_likelihood"] * prior
        total_weight += model_weight

    total_prob = 0
    for coalesced_model_dict in coalesced_dict.values():
        likelihood = coalesced_model_dict["estimated_likelihood"]
        prob = likelihood * prior / total_weight
        coalesced_model_dict["prob"] = prob
        total_prob += prob

    _logger.debug(
        f"Got a total probability of {total_prob}, which should be close to 1 up to float/rounding error"
    )
    return coalesced_dict
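
# Editor's sketch (not part of the diff): the second pass above is Bayes'
# rule with a uniform prior; with two invented likelihoods it reduces to:
#   likelihoods = {"model_a": 0.02, "model_b": 0.06}
#   prior = 1 / 2                          # num_keys == 2
#   total_weight = 0.02 * 0.5 + 0.06 * 0.5  # == 0.04
#   prob_a = 0.02 * 0.5 / 0.04              # == 0.25
#   prob_b = 0.06 * 0.5 / 0.04              # == 0.75, and the probs sum to 1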
def write_coalesced_dict(
    coalesced_output_filename: typing.Optional[str],
    coalesced_model_dict: typing.Dict[typing.Tuple, typing.Dict["str", typing.Any]],
):
    if coalesced_output_filename is None or coalesced_output_filename == "":
        _logger.warning("No coalesced output filename provided, not going to write")
        return

    first_value = next(iter(coalesced_model_dict.values()))
    model_field_names = set(first_value["_model_key_dict"].keys())
    _logger.info(f"Detected model field names {model_field_names}")

    collected_fieldnames = list(model_field_names)
    collected_fieldnames.extend(
        ["calculations_coalesced", "num_finished_runs", "num_runs", "prob"]
    )
    with open(coalesced_output_filename, "w", newline="") as coalesced_output_file:
        writer = csv.DictWriter(coalesced_output_file, fieldnames=collected_fieldnames)
        writer.writeheader()
        for model_dict in coalesced_model_dict.values():
            row = model_dict["_model_key_dict"].copy()
            row.update(
                {
                    "calculations_coalesced": model_dict["calculations_coalesced"],
                    "num_finished_runs": model_dict["num_finished_runs"],
                    "num_runs": model_dict["num_runs"],
                    "prob": model_dict["prob"],
                }
            )
            writer.writerow(row)

deepdog/cli/subset_sim_probs/main.py (new file)

@@ -0,0 +1,113 @@
import logging
import argparse
import json

import deepdog.cli.subset_sim_probs.args
import deepdog.cli.subset_sim_probs.dicts
import deepdog.cli.util
import deepdog.results
import deepdog.indexify
import pathlib
import tqdm
import os
import tqdm.contrib.logging

_logger = logging.getLogger(__name__)


def set_up_logging(log_file: str):
    log_pattern = "%(asctime)s | %(levelname)-7s | %(name)s:%(lineno)d | %(message)s"
    if log_file is None:
        handlers = [
            logging.StreamHandler(),
        ]
    else:
        handlers = [logging.StreamHandler(), logging.FileHandler(log_file)]
    logging.basicConfig(
        level=logging.DEBUG,
        format=log_pattern,
        # it's okay to ignore this mypy error because who cares about logger handler types
        handlers=handlers,  # type: ignore
    )
    logging.captureWarnings(True)


def main(args: argparse.Namespace):
    """
    Main function with passed in arguments and no additional logging setup in case we want to extract out later
    """

    with tqdm.contrib.logging.logging_redirect_tqdm():
        _logger.info(f"args: {args}")

        if "outfile" in args and args.outfile:
            if os.path.exists(args.outfile):
                if args.never_overwrite_outfile:
                    _logger.warning(
                        f"Filename {args.outfile} already exists, and never want overwrite, so aborting."
                    )
                    return
                elif args.force_overwrite_outfile:
                    _logger.warning(f"Forcing overwrite of {args.outfile}")
                else:
                    # need to confirm
                    confirm_overwrite = deepdog.cli.util.confirm_prompt(
                        f"Filename {args.outfile} exists, overwrite?"
                    )
                    if not confirm_overwrite:
                        _logger.warning(
                            f"Filename {args.outfile} already exists and do not want overwrite, aborting."
                        )
                        return
                    else:
                        _logger.warning(f"Overwriting file {args.outfile}")

        indexifier = None
        if args.indexify_json:
            with open(args.indexify_json, "r") as indexify_json_file:
                indexify_spec = json.load(indexify_json_file)
                indexify_data = indexify_spec["indexes"]
                if "seed_spec" in indexify_spec:
                    seed_spec = indexify_spec["seed_spec"]
                    indexify_data[seed_spec["field_name"]] = list(
                        range(seed_spec["num_seeds"])
                    )
                # _logger.debug(f"Indexifier data looks like {indexify_data}")
                indexifier = deepdog.indexify.Indexifier(indexify_data)

        results_dir = pathlib.Path(args.results_directory)
        out_files = [
            f for f in results_dir.iterdir() if f.name.endswith("subsetsim.csv")
        ]
        _logger.info(
            f"Reading {len(out_files)} subsetsim.csv files in directory {args.results_directory}"
        )
        # _logger.info(out_files)
        parsed_output_files = [
            deepdog.results.read_subset_sim_file(f, indexifier)
            for f in tqdm.tqdm(out_files, desc="reading files", leave=False)
        ]

        # Refactor here to allow for arbitrary likelihood file sources
        _logger.info("building uncoalesced dict")
        uncoalesced_dict = deepdog.cli.subset_sim_probs.dicts.build_model_dict(
            parsed_output_files
        )

        _logger.info("building coalesced dict")
        coalesced = deepdog.cli.subset_sim_probs.dicts.coalesced_dict(uncoalesced_dict)

        if "outfile" in args and args.outfile:
            deepdog.cli.subset_sim_probs.dicts.write_coalesced_dict(
                args.outfile, coalesced
            )
        else:
            _logger.info("Skipping writing coalesced")


def wrapped_main():
    args = deepdog.cli.subset_sim_probs.args.parse_args()
    set_up_logging(args.log_file)
    main(args)
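
The indexify JSON loaded above is expected to hold an "indexes" object mapping field names to value lists, plus an optional "seed_spec". A hypothetical spec consistent with this parsing (all field names invented for illustration):

# spec.json, a sketch of the structure main() reads:
# {
#     "indexes": {
#         "orientation": ["free", "fixedz"],
#         "log_magnitude": ["-1", "-2"]
#     },
#     "seed_spec": {"field_name": "seed", "num_seeds": 100}
# }
# The seed_spec branch expands into indexify_data["seed"] = list(range(100))
# before the Indexifier is constructed.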

deepdog/cli/util/__init__.py (new file)

@@ -0,0 +1,3 @@
from deepdog.cli.util.confirm import confirm_prompt
__all__ = ["confirm_prompt"]

deepdog/cli/util/confirm.py (new file)

@@ -0,0 +1,23 @@
_RESPONSE_MAP = {
    "yes": True,
    "ye": True,
    "y": True,
    "no": False,
    "n": False,
    "nope": False,
    "true": True,
    "false": False,
}


def confirm_prompt(question: str) -> bool:
    """Prompt with the question and returns yes or no based on response."""
    prompt = question + " [y/n]: "

    while True:
        choice = input(prompt).lower()
        if choice in _RESPONSE_MAP:
            return _RESPONSE_MAP[choice]
        else:
            print('Respond with "yes" or "no"')
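
A minimal usage sketch, assuming deepdog is importable; the prompt loops until it sees one of the recognised responses:

import deepdog.cli.util

if deepdog.cli.util.confirm_prompt("Overwrite existing output file?"):
    print("overwriting")  # e.g. proceed with the write
else:
    print("aborting")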

View File

@@ -0,0 +1,24 @@
from deepdog.direct_monte_carlo.direct_mc import DirectMonteCarloFilter
from typing import Callable
import numpy


class CostFunctionTargetFilter(DirectMonteCarloFilter):
    def __init__(
        self,
        cost_function: Callable[[numpy.ndarray], numpy.ndarray],
        target_cost: float,
    ):
        """
        Filters dipoles by cost, only leaving dipoles with cost below target_cost
        """
        self.cost_function = cost_function
        self.target_cost = target_cost

    def filter_samples(self, samples: numpy.ndarray) -> numpy.ndarray:
        current_sample = samples
        costs = self.cost_function(current_sample)
        current_sample = current_sample[costs < self.target_cost]
        return current_sample
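
A usage sketch for the new filter; the cost function and array shapes here are invented for illustration, since real cost functions come from pdme measurement comparisons:

import numpy

def example_cost(samples: numpy.ndarray) -> numpy.ndarray:
    # Invented cost: how far each configuration's first parameter is from zero.
    return numpy.abs(samples[:, 0, 0])

filt = CostFunctionTargetFilter(example_cost, target_cost=0.5)
samples = numpy.random.default_rng(0).normal(size=(100, 1, 7))
kept = filt.filter_samples(samples)  # keeps only the rows with cost below 0.5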

deepdog/direct_monte_carlo/direct_mc.py

@@ -1,3 +1,5 @@
+import re
+import pathlib
 import csv
 import pdme.model
 import pdme.measurement
@@ -36,9 +38,35 @@ class DirectMonteCarloConfig:
     tag: str = ""
     cap_core_count: int = 0  # 0 means cap at num cores - 1
     chunk_size: int = 50
-    write_bayesrun_file = True
-    bayesrun_file_timestamp = True
     # chunk size of some kind
+    write_bayesrun_file: bool = True
+    bayesrun_file_timestamp: bool = True
+    skip_if_exists: bool = False
+
+    def get_filename(self) -> str:
+        """
+        Generate a filename for the output of this run.
+        """
+        # set starting execution timestamp
+        timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
+        if self.bayesrun_file_timestamp:
+            timestamp_str = f"{timestamp}-"
+        else:
+            timestamp_str = ""
+        filename = f"{timestamp_str}{self.tag}.realdata.fast_filter.bayesrun.csv"
+        _logger.debug(f"Got filename {filename}")
+        return filename
+
+    def get_filename_regex(self) -> str:
+        """
+        Generate a regex for the output of this run.
+        """
+        # having both timestamp and the hyphen separately optional is a bit of a hack
+        # too loose, but will never matter
+        pattern = rf"(?P<timestamp>\d{{8}}-\d{{6}})?-?{self.tag}\.realdata\.fast_filter\.bayesrun\.csv"
+        return pattern

 # Aliasing dict as a generic data container
@@ -145,15 +173,21 @@ class DirectMonteCarloRun:
         single run wrapped up for multiprocessing call.
         takes in a tuple of arguments corresponding to
-        (model_name_pair, seed)
+        (model_name_pair, seed, return_configs)
+        return_configs is a boolean, if true then will return tuple of (count, [matching configs])
+        if false, return (count, [])
         """
         # here's where we do our work
-        model_name_pair, seed = args
+        model_name_pair, seed, return_configs = args
         cycle_success_configs = self._single_run(model_name_pair, seed)
         cycle_success_count = len(cycle_success_configs)
-        return cycle_success_count
+        if return_configs:
+            return (cycle_success_count, cycle_success_configs)
+        else:
+            return (cycle_success_count, [])

     def execute_no_multiprocessing(self) -> Sequence[DirectMonteCarloResult]:
@@ -198,9 +232,11 @@ class DirectMonteCarloRun:
             )
             dipole_count = numpy.array(cycle_success_configs).shape[1]
             for n in range(dipole_count):
+                number_dipoles_to_write = self.config.target_success * 5
+                _logger.info(f"Limiting to {number_dipoles_to_write=}")
                 numpy.savetxt(
                     f"{self.config.tag}_{step_count}_{cycle_i}_dipole_{n}.csv",
-                    sorted_by_freq[:, n],
+                    sorted_by_freq[:number_dipoles_to_write, n],
                     delimiter=",",
                 )
             total_success += cycle_success_count
@@ -222,8 +258,27 @@ class DirectMonteCarloRun:
     def execute(self) -> Sequence[DirectMonteCarloResult]:
-        # set starting execution timestamp
-        timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
+        filename = self.config.get_filename()
+        if self.config.skip_if_exists:
+            _logger.info(f"Checking if {filename} exists")
+            cwd = pathlib.Path.cwd()
+            if (cwd / filename).exists():
+                _logger.info(f"File {filename} exists, skipping")
+                return []
+            if self.config.bayesrun_file_timestamp:
+                _logger.info(
+                    "Also need to check file endings because of possible past or current timestamps, check only occurs if writing timestamp is set"
+                )
+                pattern = self.config.get_filename_regex()
+                for file in cwd.iterdir():
+                    match = re.match(pattern, file.name)
+                    if match is not None:
+                        _logger.info(f"Matched {file.name} to {pattern}")
+                        _logger.info(f"File {filename} exists, skipping")
+                        return []
+                _logger.info(
+                    f"Finished checking against pattern {pattern}, hopefully didn't take too long!"
+                )
         count_per_step = (
             self.config.monte_carlo_count_per_cycle * self.config.monte_carlo_cycles
@@ -259,15 +314,71 @@ class DirectMonteCarloRun:
             seeds = seed_sequence.spawn(self.config.monte_carlo_cycles)
-            pool_results = sum(
+            raw_pool_results = list(
                 pool.imap_unordered(
                     self._wrapped_single_run,
-                    [(model_name_pair, seed) for seed in seeds],
+                    [
+                        (
+                            model_name_pair,
+                            seed,
+                            self.config.write_successes_to_file,
+                        )
+                        for seed in seeds
+                    ],
                     self.config.chunk_size,
                 )
             )
+            pool_results = sum(result[0] for result in raw_pool_results)
             _logger.debug(f"Pool results: {pool_results}")
+            if self.config.write_successes_to_file:
+                _logger.info("Writing dipole results")
+                cycle_success_configs = numpy.concatenate(
+                    [result[1] for result in raw_pool_results]
+                )
+                dipole_count = numpy.array(cycle_success_configs).shape[1]
+                max_number_dipoles_to_write = self.config.target_success * 5
+                _logger.debug(
+                    f"Limiting to {max_number_dipoles_to_write=}, have {len(cycle_success_configs)}"
+                )
+                if len(cycle_success_configs):
+                    sorted_by_freq = numpy.array(
+                        [
+                            pdme.subspace_simulation.sort_array_of_dipoles_by_frequency(
                                dipole_config
+                            )
+                            for dipole_config in cycle_success_configs[
+                                :max_number_dipoles_to_write
+                            ]
+                        ]
+                    )
+                    for n in range(dipole_count):
+                        dipole_filename = (
+                            f"{self.config.tag}_{step_count}_dipole_{n}.csv"
+                        )
+                        _logger.debug(
+                            f"Writing {min(len(cycle_success_configs), max_number_dipoles_to_write)} to {dipole_filename}"
+                        )
+                        numpy.savetxt(
+                            dipole_filename,
+                            sorted_by_freq[:, n],
+                            delimiter=",",
+                        )
+                else:
+                    _logger.debug(
+                        "Instructed to write results, but none obtained"
+                    )
             total_success += pool_results
             total_count += count_per_step
             _logger.debug(
@@ -285,14 +396,6 @@ class DirectMonteCarloRun:
         if self.config.write_bayesrun_file:
-            if self.config.bayesrun_file_timestamp:
-                timestamp_str = f"{timestamp}-"
-            else:
-                timestamp_str = ""
-            filename = (
-                f"{timestamp_str}{self.config.tag}.realdata.fast_filter.bayesrun.csv"
-            )
             _logger.info(f"Going to write to file [{filename}]")
             # row: Dict[str, Union[int, float, str]] = {}
             row = {}
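
A quick sketch of what the new skip_if_exists check matches, using the two methods added above (tag and timestamp values invented):

# config = DirectMonteCarloConfig(tag="example", bayesrun_file_timestamp=True, ...)
# config.get_filename()       -> "20250226-215500-example.realdata.fast_filter.bayesrun.csv"
# config.get_filename_regex() matches both forms, since the timestamp group
# and its trailing hyphen are each optional:
#   "20250226-215500-example.realdata.fast_filter.bayesrun.csv"
#   "example.realdata.fast_filter.bayesrun.csv"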

View File

@@ -31,10 +31,14 @@ class Indexifier:
     def __init__(self, list_dict: typing.Dict[str, typing.Sequence]):
         self.dict = list_dict
+        self.product_dict = _dict_product(self.dict)

     def indexify(self, n: int) -> typing.Dict[str, typing.Any]:
-        product_dict = _dict_product(self.dict)
-        return product_dict[n]
+        return self.product_dict[n]
+
+    def __len__(self) -> int:
+        weights = [len(v) for v in self.dict.values()]
+        return math.prod(weights)

     def _indexify_indices(self, n: int) -> typing.Sequence[int]:
         """

deepdog/results/__init__.py

@@ -5,64 +5,38 @@ import logging
 import deepdog.indexify
 import pathlib
 import csv
+from deepdog.results.read_csv import (
+    parse_bayesrun_row,
+    BayesrunModelResult,
+    parse_general_row,
+    GeneralModelResult,
+)
+from deepdog.results.filename import parse_file_slug

 _logger = logging.getLogger(__name__)

-FILENAME_REGEX = r"(?P<timestamp>\d{8}-\d{6})-(?P<filename_slug>.*)\.realdata\.fast_filter\.bayesrun\.csv"
-
-MODEL_REGEXES = [  # probably a better way but who cares
-    r"geom_(?P<xmin>-?\d+)_(?P<xmax>-?\d+)_(?P<ymin>-?\d+)_(?P<ymax>-?\d+)_(?P<zmin>-?\d+)_(?P<zmax>-?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
-    r"geom_(?P<xmin>-?\d+)_(?P<xmax>-?\d+)_(?P<ymin>-?\d+)_(?P<ymax>-?\d+)_(?P<zmin>-?\d+)_(?P<zmax>-?\d+)-magnitude_(?P<log_magnitude>\d*\.?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
-    r"geom_(?P<xmin>-?\d*\.?\d+)_(?P<xmax>-?\d*\.?\d+)_(?P<ymin>-?\d*\.?\d+)_(?P<ymax>-?\d*\.?\d+)_(?P<zmin>-?\d*\.?\d+)_(?P<zmax>-?\d*\.?\d+)-magnitude_(?P<log_magnitude>\d*\.?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
-]
-
-FILE_SLUG_REGEXES = [
-    r"mock_tarucha-(?P<job_index>\d+)",
-    r"(?:(?P<mock>mock)_)?tarucha(?:_(?P<tarucha_run_id>\d+))?-(?P<job_index>\d+)",
-    r"(?P<tag>\w+)-(?P<job_index>\d+)",
-]
+FILENAME_REGEX = re.compile(
+    r"(?P<timestamp>\d{8}-\d{6})-(?P<filename_slug>.*)\.realdata\.fast_filter\.bayesrun\.csv"
+)
+
+NO_TIMESTAMP_FILENAME_REGEX = re.compile(
+    r"(?P<filename_slug>.*)\.realdata\.fast_filter\.bayesrun\.csv"
+)
+
+SUBSET_SIM_FILENAME_REGEX = re.compile(
+    r"(?P<filename_slug>.*)-(?:no_adaptive_steps_)?(?P<num_ss_runs>\d+)-nc_(?P<n_c>\d+)-ns_(?P<n_s>\d+)-mmax_(?P<mmax>\d+)\.multi\.subsetsim\.csv"
+)

 @dataclasses.dataclass
 class BayesrunOutputFilename:
-    timestamp: str
+    timestamp: typing.Optional[str]
     filename_slug: str
     path: pathlib.Path

-class BayesrunColumnParsed:
-    """
-    class for parsing a bayesrun while pulling certain special fields out
-    """
-
-    def __init__(self, groupdict: typing.Dict[str, str]):
-        self.column_field = groupdict["field_name"]
-        self.model_field_dict = {
-            k: v for k, v in groupdict.items() if k != "field_name"
-        }
-        self._groupdict_str = repr(groupdict)
-
-    def __str__(self):
-        return f"BayesrunColumnParsed[{self.column_field}: {self.model_field_dict}]"
-
-    def __repr__(self):
-        return f"BayesrunColumnParsed({self._groupdict_str})"
-
-    def __eq__(self, other):
-        if isinstance(other, BayesrunColumnParsed):
-            return (self.column_field == other.column_field) and (
-                self.model_field_dict == other.model_field_dict
-            )
-        return NotImplemented
-
-@dataclasses.dataclass
-class BayesrunModelResult:
-    parsed_model_keys: typing.Dict[str, str]
-    success: int
-    count: int

 @dataclasses.dataclass
 class BayesrunOutput:
     filename: BayesrunOutputFilename
@@ -70,88 +44,52 @@ class BayesrunOutput:
     results: typing.Sequence[BayesrunModelResult]

-def _batch_iterable_into_chunks(iterable, n=1):
-    """
-    utility for batching bayesrun files where columns appear in threes
-    """
-    for ndx in range(0, len(iterable), n):
-        yield iterable[ndx : min(ndx + n, len(iterable))]
+@dataclasses.dataclass
+class GeneralOutput:
+    filename: BayesrunOutputFilename
+    data: typing.Dict["str", typing.Any]
+    results: typing.Sequence[GeneralModelResult]

-def _parse_bayesrun_column(
-    column: str,
-) -> typing.Optional[BayesrunColumnParsed]:
-    """
-    Tries one by one all of a predefined list of regexes that I might have used in the past.
-    Returns the groupdict for the first match, or None if no match found.
-    """
-    for pattern in MODEL_REGEXES:
-        match = re.match(pattern, column)
-        if match:
-            return BayesrunColumnParsed(match.groupdict())
-    else:
-        return None
+def _parse_string_output_filename(
+    filename: str,
+) -> typing.Tuple[typing.Optional[str], str]:
+    if match := FILENAME_REGEX.match(filename):
+        groups = match.groupdict()
+        return (groups["timestamp"], groups["filename_slug"])
+    elif match := NO_TIMESTAMP_FILENAME_REGEX.match(filename):
+        groups = match.groupdict()
+        return (None, groups["filename_slug"])
+    else:
+        raise ValueError(f"Could not parse {filename} as a bayesrun output filename")

-def _parse_bayesrun_row(
-    row: typing.Dict[str, str],
-) -> typing.Sequence[BayesrunModelResult]:
-    results = []
-    batched_keys = _batch_iterable_into_chunks(list(row.keys()), 3)
-    for model_keys in batched_keys:
-        parsed = [_parse_bayesrun_column(column) for column in model_keys]
-        values = [row[column] for column in model_keys]
-        if parsed[0] is None:
-            raise ValueError(f"no viable success row found for keys {model_keys}")
-        if parsed[1] is None:
-            raise ValueError(f"no viable count row found for keys {model_keys}")
-        if parsed[0].column_field != "success":
-            raise ValueError(f"The column {model_keys[0]} is not a success field")
-        if parsed[1].column_field != "count":
-            raise ValueError(f"The column {model_keys[1]} is not a count field")
-        parsed_keys = parsed[0].model_field_dict
-        success = int(values[0])
-        count = int(values[1])
-        results.append(
-            BayesrunModelResult(
-                parsed_model_keys=parsed_keys,
-                success=success,
-                count=count,
-            )
-        )
-    return results

 def _parse_output_filename(file: pathlib.Path) -> BayesrunOutputFilename:
     filename = file.name
-    match = re.match(FILENAME_REGEX, filename)
+    timestamp, slug = _parse_string_output_filename(filename)
+    return BayesrunOutputFilename(timestamp=timestamp, filename_slug=slug, path=file)
+
+
+def _parse_ss_output_filename(file: pathlib.Path) -> BayesrunOutputFilename:
+    filename = file.name
+    match = SUBSET_SIM_FILENAME_REGEX.match(filename)
     if not match:
-        raise ValueError(f"{filename} was not a valid bayesrun output")
+        raise ValueError(f"{filename} was not a valid subset sim output")
     groups = match.groupdict()
     return BayesrunOutputFilename(
-        timestamp=groups["timestamp"], filename_slug=groups["filename_slug"], path=file
+        filename_slug=groups["filename_slug"], path=file, timestamp=None
     )

-def _parse_file_slug(slug: str) -> typing.Optional[typing.Dict[str, str]]:
-    for pattern in FILE_SLUG_REGEXES:
-        match = re.match(pattern, slug)
-        if match:
-            return match.groupdict()
-    else:
-        return None

-def read_output_file(
+def read_subset_sim_file(
     file: pathlib.Path, indexifier: typing.Optional[deepdog.indexify.Indexifier]
-) -> BayesrunOutput:
-    parsed_filename = tag = _parse_output_filename(file)
-    out = BayesrunOutput(filename=parsed_filename, data={}, results=[])
+) -> GeneralOutput:
+    parsed_filename = tag = _parse_ss_output_filename(file)
+    out = GeneralOutput(filename=parsed_filename, data={}, results=[])
     out.data.update(dataclasses.asdict(tag))
-    parsed_tag = _parse_file_slug(parsed_filename.filename_slug)
+    parsed_tag = parse_file_slug(parsed_filename.filename_slug)
     if parsed_tag is None:
         _logger.warning(
             f"Could not parse {tag} against any matching regexes. Going to skip tag parsing"
@@ -176,8 +114,53 @@ def read_output_file(
         row = rows[0]
     else:
         raise ValueError(f"Confused about having multiple rows in {file.name}")
-    results = _parse_bayesrun_row(row)
+    results = parse_general_row(
+        row, ("num_finished_runs", "num_runs", None, "estimated_likelihood")
+    )
     out.results = results
     return out

+
+def read_output_file(
+    file: pathlib.Path, indexifier: typing.Optional[deepdog.indexify.Indexifier]
+) -> BayesrunOutput:
+    parsed_filename = tag = _parse_output_filename(file)
+    out = BayesrunOutput(filename=parsed_filename, data={}, results=[])
+    out.data.update(dataclasses.asdict(tag))
+    parsed_tag = parse_file_slug(parsed_filename.filename_slug)
+    if parsed_tag is None:
+        _logger.warning(
+            f"Could not parse {tag} against any matching regexes. Going to skip tag parsing"
+        )
+    else:
+        out.data.update(parsed_tag)
+        if indexifier is not None:
+            try:
+                job_index = parsed_tag["job_index"]
+                indexified = indexifier.indexify(int(job_index))
+                out.data.update(indexified)
+            except KeyError:
+                # This isn't really that important of an error, apart from the warning
+                _logger.warning(
+                    f"Parsed tag to {parsed_tag}, and attempted to indexify but no job_index key was found. skipping and moving on"
+                )
+
+    with file.open() as input_file:
+        reader = csv.DictReader(input_file)
+        rows = [r for r in reader]
+    if len(rows) == 1:
+        row = rows[0]
+    else:
+        raise ValueError(f"Confused about having multiple rows in {file.name}")
+    results = parse_bayesrun_row(row)
+    out.results = results
+    return out

 __all__ = ["read_output_file", "BayesrunOutput"]

deepdog/results/filename.py (new file)

@@ -0,0 +1,22 @@
import re
import typing

FILE_SLUG_REGEXES = [
    re.compile(pattern)
    for pattern in [
        r"(?P<tag>\w+)-(?P<job_index>\d+)",
        r"mock_tarucha-(?P<job_index>\d+)",
        r"(?:(?P<mock>mock)_)?tarucha(?:_(?P<tarucha_run_id>\d+))?-(?P<job_index>\d+)",
        r"(?P<tag>\w+)-(?P<included_dots>[\w,]+)-(?P<target_cost>\d*\.?\d+)-(?P<job_index>\d+)",
    ]
]


def parse_file_slug(slug: str) -> typing.Optional[typing.Dict[str, str]]:
    for pattern in FILE_SLUG_REGEXES:
        match = pattern.match(slug)
        if match:
            return match.groupdict()
    else:
        return None
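
Since parse_file_slug returns the groupdict of the first pattern that matches, ordering matters: with the generic tag pattern now listed first, a slug like "mock_tarucha-17" parses as a plain tag/job_index pair. A small sketch:

from deepdog.results.filename import parse_file_slug

print(parse_file_slug("mock_tarucha-17"))  # {"tag": "mock_tarucha", "job_index": "17"}
print(parse_file_slug("no dash here"))     # None, nothing matched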

deepdog/results/read_csv.py (new file, 141 lines)

@@ -0,0 +1,141 @@
import typing
import re
import dataclasses

MODEL_REGEXES = [
    re.compile(pattern)
    for pattern in [
        r"geom_(?P<xmin>-?\d+)_(?P<xmax>-?\d+)_(?P<ymin>-?\d+)_(?P<ymax>-?\d+)_(?P<zmin>-?\d+)_(?P<zmax>-?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
        r"geom_(?P<xmin>-?\d+)_(?P<xmax>-?\d+)_(?P<ymin>-?\d+)_(?P<ymax>-?\d+)_(?P<zmin>-?\d+)_(?P<zmax>-?\d+)-magnitude_(?P<log_magnitude>\d*\.?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
        r"geom_(?P<xmin>-?\d*\.?\d+)_(?P<xmax>-?\d*\.?\d+)_(?P<ymin>-?\d*\.?\d+)_(?P<ymax>-?\d*\.?\d+)_(?P<zmin>-?\d*\.?\d+)_(?P<zmax>-?\d*\.?\d+)-magnitude_(?P<log_magnitude>\d*\.?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
        r"geom_(?P<xmin>-?\d+)_(?P<xmax>-?\d+)_(?P<ymin>-?\d+)_(?P<ymax>-?\d+)_(?P<zmin>-?\d+)_(?P<zmax>-?\d+)-magnitude_(?P<log_magnitude>-?\d*\.?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
        r"geom_(?P<xmin>-?\d*\.?\d+)_(?P<xmax>-?\d*\.?\d+)_(?P<ymin>-?\d*\.?\d+)_(?P<ymax>-?\d*\.?\d+)_(?P<zmin>-?\d*\.?\d+)_(?P<zmax>-?\d*\.?\d+)-magnitude_(?P<log_magnitude>-?\d*\.?\d+)-orientation_(?P<orientation>free|fixedxy|fixedz)-dipole_count_(?P<avg_filled>\d+)_(?P<field_name>\w*)",
    ]
]


@dataclasses.dataclass
class BayesrunModelResult:
    parsed_model_keys: typing.Dict[str, str]
    success: int
    count: int


@dataclasses.dataclass
class GeneralModelResult:
    parsed_model_keys: typing.Dict[str, str]
    result_dict: typing.Dict[str, str]


class BayesrunColumnParsed:
    """
    class for parsing a bayesrun while pulling certain special fields out
    """

    def __init__(self, groupdict: typing.Dict[str, str]):
        self.column_field = groupdict["field_name"]
        self.model_field_dict = {
            k: v for k, v in groupdict.items() if k != "field_name"
        }
        self._groupdict_str = repr(groupdict)

    def __str__(self):
        return f"BayesrunColumnParsed[{self.column_field}: {self.model_field_dict}]"

    def __repr__(self):
        return f"BayesrunColumnParsed({self._groupdict_str})"

    def __eq__(self, other):
        if isinstance(other, BayesrunColumnParsed):
            return (self.column_field == other.column_field) and (
                self.model_field_dict == other.model_field_dict
            )
        return NotImplemented


def _parse_bayesrun_column(
    column: str,
) -> typing.Optional[BayesrunColumnParsed]:
    """
    Tries one by one all of a predefined list of regexes that I might have used in the past.
    Returns the groupdict for the first match, or None if no match found.
    """
    for pattern in MODEL_REGEXES:
        match = pattern.match(column)
        if match:
            return BayesrunColumnParsed(match.groupdict())
    else:
        return None


def _batch_iterable_into_chunks(iterable, n=1):
    """
    utility for batching bayesrun files where columns appear in threes
    """
    for ndx in range(0, len(iterable), n):
        yield iterable[ndx : min(ndx + n, len(iterable))]


def parse_general_row(
    row: typing.Dict[str, str],
    expected_fields: typing.Sequence[typing.Optional[str]],
) -> typing.Sequence[GeneralModelResult]:
    results = []
    batched_keys = _batch_iterable_into_chunks(list(row.keys()), len(expected_fields))
    for model_keys in batched_keys:
        parsed = [_parse_bayesrun_column(column) for column in model_keys]
        values = [row[column] for column in model_keys]

        result_dict = {}
        parsed_keys = None
        for expected_field, parsed_field, value in zip(expected_fields, parsed, values):
            if expected_field is None:
                continue
            if parsed_field is None:
                raise ValueError(
                    f"No viable row found for {expected_field=} in {model_keys=}"
                )
            if parsed_field.column_field != expected_field:
                raise ValueError(
                    f"The column {parsed_field.column_field} does not match expected {expected_field}"
                )
            result_dict[expected_field] = value
            if parsed_keys is None:
                parsed_keys = parsed_field.model_field_dict

        if parsed_keys is None:
            raise ValueError(f"Somehow parsed keys is none here, for {row=}")
        results.append(
            GeneralModelResult(parsed_model_keys=parsed_keys, result_dict=result_dict)
        )
    return results


def parse_bayesrun_row(
    row: typing.Dict[str, str],
) -> typing.Sequence[BayesrunModelResult]:
    results = []
    batched_keys = _batch_iterable_into_chunks(list(row.keys()), 3)
    for model_keys in batched_keys:
        parsed = [_parse_bayesrun_column(column) for column in model_keys]
        values = [row[column] for column in model_keys]
        if parsed[0] is None:
            raise ValueError(f"no viable success row found for keys {model_keys}")
        if parsed[1] is None:
            raise ValueError(f"no viable count row found for keys {model_keys}")
        if parsed[0].column_field != "success":
            raise ValueError(f"The column {model_keys[0]} is not a success field")
        if parsed[1].column_field != "count":
            raise ValueError(f"The column {model_keys[1]} is not a count field")
        parsed_keys = parsed[0].model_field_dict
        success = int(values[0])
        count = int(values[1])
        results.append(
            BayesrunModelResult(
                parsed_model_keys=parsed_keys,
                success=success,
                count=count,
            )
        )
    return results
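
A sketch of the three-column convention parse_bayesrun_row expects, using an invented model prefix; the third column in each triple is read but otherwise ignored:

from deepdog.results.read_csv import parse_bayesrun_row

prefix = "geom_-10_10_-10_10_0_5-orientation_free-dipole_count_1"
row = {
    f"{prefix}_success": "7",
    f"{prefix}_count": "100",
    f"{prefix}_prob": "0.07",  # third column, unused by the parser
}
results = parse_bayesrun_row(row)
# results[0].success == 7, results[0].count == 100, and
# results[0].parsed_model_keys carries the geometry fields, e.g. "xmin": "-10"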

View File

@ -1,9 +1,11 @@
import logging import logging
import multiprocessing
import numpy import numpy
import pdme.measurement import pdme.measurement
import pdme.measurement.input_types import pdme.measurement.input_types
import pdme.model
import pdme.subspace_simulation import pdme.subspace_simulation
from typing import Sequence, Tuple, Optional from typing import Sequence, Tuple, Optional, Callable, Union, List
from dataclasses import dataclass from dataclasses import dataclass
@ -18,47 +20,63 @@ class SubsetSimulationResult:
under_target_cost: Optional[float] under_target_cost: Optional[float]
under_target_likelihood: Optional[float] under_target_likelihood: Optional[float]
lowest_likelihood: Optional[float] lowest_likelihood: Optional[float]
messages: Sequence[str]
@dataclass
class MultiSubsetSimulationResult:
child_results: Sequence[SubsetSimulationResult]
model_name: str
estimated_likelihood: float
arithmetic_mean_estimated_likelihood: float
num_children: int
num_finished_children: int
clean_estimate: bool
class SubsetSimulation: class SubsetSimulation:
def __init__( def __init__(
self, self,
model_name_pair, model_name_pair,
dot_inputs, # actual_measurements: Sequence[pdme.measurement.DotMeasurement],
actual_measurements: Sequence[pdme.measurement.DotMeasurement], cost_function: Callable[[numpy.ndarray], numpy.ndarray],
n_c: int, n_c: int,
n_s: int, n_s: int,
m_max: int, m_max: int,
target_cost: Optional[float] = None, target_cost: Optional[float] = None,
level_0_seed: int = 200, level_0_seed: Union[int, Sequence[int]] = 200,
mcmc_seed: int = 20, mcmc_seed: Union[int, Sequence[int]] = 20,
use_adaptive_steps=True, use_adaptive_steps=True,
default_phi_step=0.01, default_phi_step=0.01,
default_theta_step=0.01, default_theta_step=0.01,
default_r_step=0.01, default_r_step=0.01,
default_w_log_step=0.01, default_w_log_step=0.01,
default_upper_w_log_step=4, default_upper_w_log_step=4,
num_initial_dmc_gens=1,
keep_probs_list=True, keep_probs_list=True,
dump_last_generation_to_file=False, dump_last_generation_to_file=False,
initial_cost_chunk_size=100, initial_cost_chunk_size=100,
initial_cost_multiprocess=True,
cap_core_count: int = 0, # 0 means cap at num cores - 1
): ):
name, model = model_name_pair name, model = model_name_pair
self.model_name = name self.model_name = name
self.model = model self.model = model
_logger.info(f"got model {self.model_name}") _logger.info(f"got model {self.model_name}")
self.dot_inputs_array = pdme.measurement.input_types.dot_inputs_to_array( # dot_inputs = [(meas.r, meas.f) for meas in actual_measurements]
dot_inputs # self.dot_inputs_array = pdme.measurement.input_types.dot_inputs_to_array(
) # dot_inputs
# )
# _logger.debug(f"actual measurements: {actual_measurements}") # _logger.debug(f"actual measurements: {actual_measurements}")
self.actual_measurement_array = numpy.array([m.v for m in actual_measurements]) # self.actual_measurement_array = numpy.array([m.v for m in actual_measurements])
def cost_function_to_use(dipoles_to_test): # def cost_function_to_use(dipoles_to_test):
return pdme.subspace_simulation.proportional_costs_vs_actual_measurement( # return pdme.subspace_simulation.proportional_costs_vs_actual_measurement(
self.dot_inputs_array, self.actual_measurement_array, dipoles_to_test # self.dot_inputs_array, self.actual_measurement_array, dipoles_to_test
) # )
self.cost_function_to_use = cost_function_to_use self.cost_function_to_use = cost_function
self.n_c = n_c self.n_c = n_c
self.n_s = n_s self.n_s = n_s
@ -68,16 +86,25 @@ class SubsetSimulation:
self.mcmc_seed = mcmc_seed self.mcmc_seed = mcmc_seed
self.use_adaptive_steps = use_adaptive_steps self.use_adaptive_steps = use_adaptive_steps
self.default_phi_step = default_phi_step self.default_phi_step = (
default_phi_step * 1.73
) # this is a hack to fix a missing sqrt 3 in the proposal function code.
self.default_theta_step = default_theta_step self.default_theta_step = default_theta_step
self.default_r_step = default_r_step self.default_r_step = (
self.default_w_log_step = default_w_log_step default_r_step * 1.73
) # this is a hack to fix a missing sqrt 3 in the proposal function code.
self.default_w_log_step = (
default_w_log_step * 1.73
) # this is a hack to fix a missing sqrt 3 in the proposal function code.
self.default_upper_w_log_step = default_upper_w_log_step self.default_upper_w_log_step = default_upper_w_log_step
_logger.info("using params:") _logger.info("using params:")
_logger.info(f"\tn_c: {self.n_c}") _logger.info(f"\tn_c: {self.n_c}")
_logger.info(f"\tn_s: {self.n_s}") _logger.info(f"\tn_s: {self.n_s}")
_logger.info(f"\tm: {self.m_max}") _logger.info(f"\tm: {self.m_max}")
_logger.info(f"\t{num_initial_dmc_gens=}")
_logger.info(f"\t{mcmc_seed=}")
_logger.info(f"\t{level_0_seed=}")
_logger.info("let's do level 0...") _logger.info("let's do level 0...")
self.target_cost = target_cost self.target_cost = target_cost
@ -87,158 +114,176 @@ class SubsetSimulation:
self.dump_last_generations = dump_last_generation_to_file self.dump_last_generations = dump_last_generation_to_file
self.initial_cost_chunk_size = initial_cost_chunk_size self.initial_cost_chunk_size = initial_cost_chunk_size
self.initial_cost_multiprocess = initial_cost_multiprocess
self.cap_core_count = cap_core_count
self.num_dmc_gens = num_initial_dmc_gens
def _single_chain_gen(self, args: Tuple):
threshold_cost, stdevs, rng_seed, (c, s) = args
rng = numpy.random.default_rng(rng_seed)
return self.model.get_repeat_counting_mcmc_chain(
s,
self.cost_function_to_use,
self.n_s,
threshold_cost,
stdevs,
initial_cost=c,
rng_arg=rng,
)
def execute(self) -> SubsetSimulationResult: def execute(self) -> SubsetSimulationResult:
probs_list = [] probs_list = []
output_messages = []
# If we have n_s = 10 and n_c = 100, then our big N = 1000 and p = 1/10
# The DMC stage would normally generate 1000, then pick the best 100 and start counting prob = p/10.
# Let's say we want our DMC stage to go down to level 2.
# Then we need to filter out p^2, so our initial has to be N_0 = N / p = n_c * n_s^2
initial_dmc_n = self.n_c * (self.n_s**self.num_dmc_gens)
initial_level = (
self.num_dmc_gens - 1
) # This is perfunctory but let's label it here really explicitly
_logger.info(f"Generating {initial_dmc_n} for DMC stage")
sample_dipoles = self.model.get_monte_carlo_dipole_inputs( sample_dipoles = self.model.get_monte_carlo_dipole_inputs(
self.n_c * self.n_s, initial_dmc_n,
-1, -1,
rng_to_use=numpy.random.default_rng(self.level_0_seed), rng_to_use=numpy.random.default_rng(self.level_0_seed),
) )
# _logger.debug(sample_dipoles) # _logger.debug(sample_dipoles)
# _logger.debug(sample_dipoles.shape) # _logger.debug(sample_dipoles.shape)
raw_costs = [] _logger.debug("Finished dipole generation")
_logger.debug( _logger.debug(
f"Using iterated cost function thing with chunk size {self.initial_cost_chunk_size}" f"Using iterated multiprocessing cost function thing with chunk size {self.initial_cost_chunk_size}"
) )
for x in range(0, len(sample_dipoles), self.initial_cost_chunk_size): # core count etc. logic here
_logger.debug(f"doing chunk {x}") core_count = multiprocessing.cpu_count() - 1 or 1
raw_costs.extend( if (self.cap_core_count >= 1) and (self.cap_core_count < core_count):
self.cost_function_to_use( core_count = self.cap_core_count
sample_dipoles[x : x + self.initial_cost_chunk_size] _logger.info(f"Using {core_count} cores")
)
with multiprocessing.Pool(core_count) as pool:
# Do the initial DMC calculation in a multiprocessing
chunks = numpy.array_split(
sample_dipoles,
range(
self.initial_cost_chunk_size,
len(sample_dipoles),
self.initial_cost_chunk_size,
),
) )
costs = numpy.array(raw_costs) if self.initial_cost_multiprocess:
_logger.debug("Multiprocessing initial costs")
raw_costs = pool.map(self.cost_function_to_use, chunks)
else:
_logger.debug("Single process initial costs")
raw_costs = []
for chunk_idx, chunk in enumerate(chunks):
_logger.debug(f"doing chunk #{chunk_idx}")
raw_costs.append(self.cost_function_to_use(chunk))
costs = numpy.concatenate(raw_costs)
_logger.debug("finished initial dmc cost calculation")
# _logger.debug(f"costs: {costs}")
sorted_indexes = costs.argsort()[::-1]
_logger.debug(f"costs: {costs}") # _logger.debug(costs[sorted_indexes])
sorted_indexes = costs.argsort()[::-1] # _logger.debug(sample_dipoles[sorted_indexes])
_logger.debug(costs[sorted_indexes]) sorted_costs = costs[sorted_indexes]
_logger.debug(sample_dipoles[sorted_indexes]) sorted_dipoles = sample_dipoles[sorted_indexes]
sorted_costs = costs[sorted_indexes] all_dipoles = numpy.array(
sorted_dipoles = sample_dipoles[sorted_indexes] [
pdme.subspace_simulation.sort_array_of_dipoles_by_frequency(samp)
threshold_cost = sorted_costs[-self.n_c] for samp in sorted_dipoles
]
all_dipoles = numpy.array( )
[ all_chains = list(zip(sorted_costs, all_dipoles))
pdme.subspace_simulation.sort_array_of_dipoles_by_frequency(samp) for dmc_level in range(initial_level):
for samp in sorted_dipoles # if initial level is 1, we want to print out what the level 0 threshold would have been?
] _logger.debug(f"Get the pseudo statistics for level {dmc_level}")
) _logger.debug(f"Whole chain has length {len(all_chains)}")
all_chains = list(zip(sorted_costs, all_dipoles)) pseudo_threshold_index = -(
self.n_c * (self.n_s ** (self.num_dmc_gens - dmc_level - 1))
mcmc_rng = numpy.random.default_rng(self.mcmc_seed) )
for i in range(self.m_max):
next_seeds = all_chains[-self.n_c :]
if self.dump_last_generations:
_logger.info("writing out csv file")
next_dipoles_seed_dipoles = numpy.array([n[1] for n in next_seeds])
for n in range(self.model.n):
_logger.info(f"{next_dipoles_seed_dipoles[:, n].shape}")
numpy.savetxt(
f"generation_{self.n_c}_{self.n_s}_{i}_dipole_{n}.csv",
next_dipoles_seed_dipoles[:, n],
delimiter=",",
)
next_seeds_as_array = numpy.array([s for _, s in next_seeds])
stdevs = self.get_stdevs_from_arrays(next_seeds_as_array)
_logger.info(f"got stdevs: {stdevs.stdevs}")
all_long_chains = []
for seed_index, (c, s) in enumerate(
next_seeds[:: len(next_seeds) // 20]
):
# chain = mcmc(s, threshold_cost, n_s, model, dot_inputs_array, actual_measurement_array, mcmc_rng, curr_cost=c, stdevs=stdevs)
# until new version gotta do
_logger.debug(f"\t{seed_index}: doing long chain on the next seed")
long_chain = self.model.get_mcmc_chain(
s,
self.cost_function_to_use,
1000,
threshold_cost,
stdevs,
initial_cost=c,
rng_arg=mcmc_rng,
)
for _, chained in long_chain:
all_long_chains.append(chained)
all_long_chains_array = numpy.array(all_long_chains)
for n in range(self.model.n):
_logger.info(f"{all_long_chains_array[:, n].shape}")
numpy.savetxt(
f"long_chain_generation_{self.n_c}_{self.n_s}_{i}_dipole_{n}.csv",
all_long_chains_array[:, n],
delimiter=",",
)
if self.keep_probs_list:
for cost_index, cost_chain in enumerate(all_chains[: -self.n_c]):
probs_list.append(
(
((self.n_c * self.n_s - cost_index) / (self.n_c * self.n_s))
/ (self.n_s ** (i)),
cost_chain[0],
i + 1,
)
)
next_seeds_as_array = numpy.array([s for _, s in next_seeds])
stdevs = self.get_stdevs_from_arrays(next_seeds_as_array)
_logger.info(f"got stdevs: {stdevs.stdevs}")
_logger.debug("Starting the MCMC")
all_chains = []
for seed_index, (c, s) in enumerate(next_seeds):
# chain = mcmc(s, threshold_cost, n_s, model, dot_inputs_array, actual_measurement_array, mcmc_rng, curr_cost=c, stdevs=stdevs)
# until new version gotta do
_logger.debug( _logger.debug(
f"\t{seed_index}: getting another chain from the next seed" f"Have a pseudo_threshold_index of {pseudo_threshold_index}, or {len(all_chains) + pseudo_threshold_index}"
) )
chain = self.model.get_mcmc_chain( pseudo_threshold_cost = all_chains[-pseudo_threshold_index][0]
s, _logger.info(
self.cost_function_to_use, f"Pseudo-level {dmc_level} threshold cost {pseudo_threshold_cost}, at P = (1 / {self.n_s})^{dmc_level + 1}"
self.n_s,
threshold_cost,
stdevs,
initial_cost=c,
rng_arg=mcmc_rng,
) )
for cost, chained in chain: all_chains = all_chains[pseudo_threshold_index:]
try:
filtered_cost = cost[0]
except (IndexError, TypeError):
filtered_cost = cost
all_chains.append((filtered_cost, chained))
_logger.debug("finished mcmc")
# _logger.debug(all_chains)
all_chains.sort(key=lambda c: c[0], reverse=True) long_mcmc_rng = numpy.random.default_rng(self.mcmc_seed)
_logger.debug("finished sorting all_chains") mcmc_rng_seed_sequence = numpy.random.SeedSequence(self.mcmc_seed)
threshold_cost = all_chains[-self.n_c][0] threshold_cost = all_chains[-self.n_c][0]
_logger.info( _logger.info(
f"current threshold cost: {threshold_cost}, at P = (1 / {self.n_s})^{i + 1}" f"Finishing DMC threshold cost {threshold_cost} at level {initial_level}, at P = (1 / {self.n_s})^{initial_level + 1}"
) )
if (self.target_cost is not None) and (threshold_cost < self.target_cost): _logger.debug(f"Executing the MCMC with chains of length {len(all_chains)}")
_logger.info(
f"got a threshold cost {threshold_cost}, less than {self.target_cost}. will leave early"
)
cost_list = [c[0] for c in all_chains] # Now we move on to the MCMC part of the algorithm
over_index = reverse_bisect_right(cost_list, self.target_cost)
shorter_probs_list = [] # This is important, we want to allow some extra initial levels so we need to account for that here!
for cost_index, cost_chain in enumerate(all_chains): for i in range(self.num_dmc_gens, self.m_max):
if self.keep_probs_list: _logger.info(f"Starting level {i}")
next_seeds = all_chains[-self.n_c :]
if self.dump_last_generations:
_logger.info("writing out csv file")
next_dipoles_seed_dipoles = numpy.array([n[1] for n in next_seeds])
for n in range(self.model.n):
_logger.info(f"{next_dipoles_seed_dipoles[:, n].shape}")
numpy.savetxt(
f"generation_{self.n_c}_{self.n_s}_{i}_dipole_{n}.csv",
next_dipoles_seed_dipoles[:, n],
delimiter=",",
)
next_seeds_as_array = numpy.array([s for _, s in next_seeds])
stdevs = self.get_stdevs_from_arrays(next_seeds_as_array)
_logger.info(f"got stdevs: {stdevs.stdevs}")
all_long_chains = []
for seed_index, (c, s) in enumerate(
next_seeds[:: len(next_seeds) // 20]
):
# chain = mcmc(s, threshold_cost, n_s, model, dot_inputs_array, actual_measurement_array, mcmc_rng, curr_cost=c, stdevs=stdevs)
# until new version gotta do
_logger.debug(
f"\t{seed_index}: doing long chain on the next seed"
)
long_chain = self.model.get_mcmc_chain(
s,
self.cost_function_to_use,
1000,
threshold_cost,
stdevs,
initial_cost=c,
rng_arg=long_mcmc_rng,
)
for _, chained in long_chain:
all_long_chains.append(chained)
all_long_chains_array = numpy.array(all_long_chains)
for n in range(self.model.n):
_logger.info(f"{all_long_chains_array[:, n].shape}")
numpy.savetxt(
f"long_chain_generation_{self.n_c}_{self.n_s}_{i}_dipole_{n}.csv",
all_long_chains_array[:, n],
delimiter=",",
)
if self.keep_probs_list:
for cost_index, cost_chain in enumerate(all_chains[: -self.n_c]):
probs_list.append( probs_list.append(
( (
( (
@@ -250,26 +295,105 @@ class SubsetSimulation:
                             i + 1,
                         )
                     )
-                shorter_probs_list.append(
-                    (
-                        cost_chain[0],
-                        ((self.n_c * self.n_s - cost_index) / (self.n_c * self.n_s))
-                        / (self.n_s ** (i)),
-                    )
-                )
-            # _logger.info(shorter_probs_list)
-            result = SubsetSimulationResult(
-                probs_list=probs_list,
-                over_target_cost=shorter_probs_list[over_index - 1][0],
-                over_target_likelihood=shorter_probs_list[over_index - 1][1],
-                under_target_cost=shorter_probs_list[over_index][0],
-                under_target_likelihood=shorter_probs_list[over_index][1],
-                lowest_likelihood=shorter_probs_list[-1][1],
-            )
-            return result
-            # _logger.debug([c[0] for c in all_chains[-n_c:]])
-            _logger.info(f"doing level {i + 1}")
+            next_seeds_as_array = numpy.array([s for _, s in next_seeds])
+            stdevs = self.get_stdevs_from_arrays(next_seeds_as_array)
+            _logger.debug(f"got stdevs, begin: {stdevs.stdevs[:10]}")
+            _logger.debug("Starting the MCMC")
+            all_chains = []
+            seeds = mcmc_rng_seed_sequence.spawn(len(next_seeds))
+            pool_results = pool.imap_unordered(
+                self._single_chain_gen,
+                [
+                    (threshold_cost, stdevs, rng_seed, test_seed)
+                    for rng_seed, test_seed in zip(seeds, next_seeds)
+                ],
+                chunksize=50,
+            )
+            # count for ergodicity analysis
+            samples_generated = 0
+            samples_rejected = 0
+            for rejected_count, chain in pool_results:
+                for cost, chained in chain:
+                    try:
+                        filtered_cost = cost[0]
+                    except (IndexError, TypeError):
+                        filtered_cost = cost
+                    all_chains.append((filtered_cost, chained))
+                samples_generated += self.n_s
+                samples_rejected += rejected_count
+            _logger.debug("finished mcmc")
+            _logger.debug(f"{samples_rejected=} out of {samples_generated=}")
+            if samples_rejected * 2 > samples_generated:
+                reject_ratio = samples_rejected / samples_generated
+                rejectionmessage = f"On level {i}, rejected {samples_rejected} out of {samples_generated}, {reject_ratio=} is too high and may indicate ergodicity problems"
+                output_messages.append(rejectionmessage)
+                _logger.warning(rejectionmessage)
+            # _logger.debug(all_chains)
+            all_chains.sort(key=lambda c: c[0], reverse=True)
+            _logger.debug("finished sorting all_chains")
+            threshold_cost = all_chains[-self.n_c][0]
+            _logger.info(
+                f"current threshold cost: {threshold_cost}, at P = (1 / {self.n_s})^{i + 1}"
+            )
+            if (self.target_cost is not None) and (
+                threshold_cost < self.target_cost
+            ):
+                _logger.info(
+                    f"got a threshold cost {threshold_cost}, less than {self.target_cost}. will leave early"
+                )
+                cost_list = [c[0] for c in all_chains]
+                over_index = reverse_bisect_right(cost_list, self.target_cost)
+                winner = all_chains[over_index][1]
+                _logger.info(f"Winner obtained: {winner}")
+                shorter_probs_list = []
+                for cost_index, cost_chain in enumerate(all_chains):
+                    if self.keep_probs_list:
+                        probs_list.append(
+                            (
+                                (
+                                    (self.n_c * self.n_s - cost_index)
+                                    / (self.n_c * self.n_s)
+                                )
+                                / (self.n_s ** (i)),
+                                cost_chain[0],
+                                i + 1,
+                            )
+                        )
+                    shorter_probs_list.append(
+                        (
+                            cost_chain[0],
+                            (
+                                (self.n_c * self.n_s - cost_index)
+                                / (self.n_c * self.n_s)
+                            )
+                            / (self.n_s ** (i)),
+                        )
+                    )
+                # _logger.info(shorter_probs_list)
+                result = SubsetSimulationResult(
+                    probs_list=probs_list,
+                    over_target_cost=shorter_probs_list[over_index - 1][0],
+                    over_target_likelihood=shorter_probs_list[over_index - 1][1],
+                    under_target_cost=shorter_probs_list[over_index][0],
+                    under_target_likelihood=shorter_probs_list[over_index][1],
+                    lowest_likelihood=shorter_probs_list[-1][1],
+                    messages=output_messages,
+                )
+                return result
+            # _logger.debug([c[0] for c in all_chains[-n_c:]])
+            _logger.info(f"doing level {i + 1}")
         if self.keep_probs_list:
             for cost_index, cost_chain in enumerate(all_chains):
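A note on the bookkeeping in the hunk above: levels below num_dmc_gens come from direct Monte Carlo, the MCMC levels run from num_dmc_gens up to m_max - 1, and an entry at position cost_index of the descending-sorted all_chains at level i is assigned likelihood ((n_c * n_s - cost_index) / (n_c * n_s)) / n_s ** i. A quick check of that formula with hypothetical values (n_c, n_s, i, cost_index chosen only for illustration):

    n_c, n_s, i = 10, 100, 2
    cost_index = 250  # position in the descending-sorted all_chains
    within_level = (n_c * n_s - cost_index) / (n_c * n_s)  # fraction of this level at or below the cost
    print(within_level / n_s**i)  # 0.75 * (1/100)**2 = 7.5e-05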
@ -285,8 +409,8 @@ class SubsetSimulation:
_logger.info( _logger.info(
f"final threshold cost: {threshold_cost}, at P = (1 / {self.n_s})^{self.m_max + 1}" f"final threshold cost: {threshold_cost}, at P = (1 / {self.n_s})^{self.m_max + 1}"
) )
for a in all_chains[-10:]: # for a in all_chains[-10:]:
_logger.info(a) # _logger.info(a)
# for prob, prob_cost in probs_list: # for prob, prob_cost in probs_list:
# _logger.info(f"\t{prob}: {prob_cost}") # _logger.info(f"\t{prob}: {prob_cost}")
probs_list.sort(key=lambda c: c[0], reverse=True) probs_list.sort(key=lambda c: c[0], reverse=True)
@@ -300,6 +424,7 @@ class SubsetSimulation:
             under_target_cost=None,
             under_target_likelihood=None,
             lowest_likelihood=min_likelihood,
+            messages=output_messages,
         )
         return result
@@ -358,6 +483,116 @@ class SubsetSimulation:
         return stdevs
+
+
+class MultiSubsetSimulations:
+    def __init__(
+        self,
+        model_name_pairs: Sequence[Tuple[str, pdme.model.DipoleModel]],
+        # actual_measurements: Sequence[pdme.measurement.DotMeasurement],
+        cost_function: Callable[[numpy.ndarray], numpy.ndarray],
+        num_runs: int,
+        n_c: int,
+        n_s: int,
+        m_max: int,
+        target_cost: float,
+        num_initial_dmc_gens: int = 1,
+        level_0_seed_seed: int = 200,
+        mcmc_seed_seed: int = 20,
+        use_adaptive_steps=True,
+        default_phi_step=0.01,
+        default_theta_step=0.01,
+        default_r_step=0.01,
+        default_w_log_step=0.01,
+        default_upper_w_log_step=4,
+        initial_cost_chunk_size=100,
+        cap_core_count: int = 0,  # 0 means cap at num cores - 1
+    ):
+        self.model_name_pairs = model_name_pairs
+        self.cost_function = cost_function
+        self.num_runs = num_runs
+        self.n_c = n_c
+        self.n_s = n_s
+        self.m_max = m_max
+        self.target_cost = target_cost  # This is not optional here!
+        self.num_dmc_gens = num_initial_dmc_gens
+        self.level_0_seed_seed = level_0_seed_seed
+        self.mcmc_seed_seed = mcmc_seed_seed
+        self.use_adaptive_steps = use_adaptive_steps
+        self.default_phi_step = default_phi_step
+        self.default_theta_step = default_theta_step
+        self.default_r_step = default_r_step
+        self.default_w_log_step = default_w_log_step
+        self.default_upper_w_log_step = default_upper_w_log_step
+        self.initial_cost_chunk_size = initial_cost_chunk_size
+        self.cap_core_count = cap_core_count
+
+    def execute(self) -> Sequence[MultiSubsetSimulationResult]:
+        output: List[MultiSubsetSimulationResult] = []
+        for model_index, model_name_pair in enumerate(self.model_name_pairs):
+            ss_results = [
+                SubsetSimulation(
+                    model_name_pair,
+                    self.cost_function,
+                    self.n_c,
+                    self.n_s,
+                    self.m_max,
+                    self.target_cost,
+                    num_initial_dmc_gens=self.num_dmc_gens,
+                    level_0_seed=[model_index, run_index, self.level_0_seed_seed],
+                    mcmc_seed=[model_index, run_index, self.mcmc_seed_seed],
+                    use_adaptive_steps=self.use_adaptive_steps,
+                    default_phi_step=self.default_phi_step,
+                    default_theta_step=self.default_theta_step,
+                    default_r_step=self.default_r_step,
+                    default_w_log_step=self.default_w_log_step,
+                    default_upper_w_log_step=self.default_upper_w_log_step,
+                    keep_probs_list=False,
+                    dump_last_generation_to_file=False,
+                    initial_cost_chunk_size=self.initial_cost_chunk_size,
+                    cap_core_count=self.cap_core_count,
+                ).execute()
+                for run_index in range(self.num_runs)
+            ]
+            output.append(coalesce_ss_results(model_name_pair[0], ss_results))
+        return output
+
+
+def coalesce_ss_results(
+    model_name: str, results: Sequence[SubsetSimulationResult]
+) -> MultiSubsetSimulationResult:
+    num_finished = sum(1 for res in results if res.under_target_likelihood is not None)
+
+    estimated_likelihoods = numpy.array(
+        [
+            res.under_target_likelihood
+            if res.under_target_likelihood is not None
+            else res.lowest_likelihood
+            for res in results
+        ]
+    )
+    _logger.info(estimated_likelihoods)
+    geometric_mean_estimated_likelihoods = numpy.exp(
+        numpy.log(estimated_likelihoods).mean()
+    )
+    _logger.info(geometric_mean_estimated_likelihoods)
+    arithmetic_mean_estimated_likelihoods = estimated_likelihoods.mean()
+
+    result = MultiSubsetSimulationResult(
+        child_results=results,
+        model_name=model_name,
+        estimated_likelihood=geometric_mean_estimated_likelihoods,
+        arithmetic_mean_estimated_likelihood=arithmetic_mean_estimated_likelihoods,
+        num_children=len(results),
+        num_finished_children=num_finished,
+        clean_estimate=num_finished == len(results),
+    )
+    return result
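The coalesced estimate numpy.exp(numpy.log(...).mean()) is the geometric mean of the per-run likelihoods, which for two runs reduces to sqrt(x1 * x2); the snapshot values in the tests further down follow directly. For instance:

    import math

    # Two finished runs with likelihoods 0.8 and 0.6, as in the snapshot below:
    print(math.sqrt(0.8 * 0.6))  # 0.6928203230275509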
 def reverse_bisect_right(a, x, lo=0, hi=None):
     """Return the index where to insert item x in list a, assuming a is sorted in descending order.

poetry.lock (generated)
@@ -786,13 +786,13 @@ files = [

 [[package]]
 name = "pdme"
-version = "1.2.0"
+version = "1.5.0"
 description = "Python dipole model evaluator"
 optional = false
 python-versions = "<3.10,>=3.8.1"
 files = [
-    {file = "pdme-1.2.0-py3-none-any.whl", hash = "sha256:602710a053f22921b4adbc03d46d284149fe2367a65455cde56608708e01c84b"},
-    {file = "pdme-1.2.0.tar.gz", hash = "sha256:412806d7ae384c048515e0f2cba70252778bf153800829a1d3265a0596872263"},
+    {file = "pdme-1.5.0-py3-none-any.whl", hash = "sha256:1b4fa30ba98a336957b3029563552d73286a3a5f932809ac1330e65a1f61c363"},
+    {file = "pdme-1.5.0.tar.gz", hash = "sha256:cc0ac4ffab2994e08b4efde2991c6d9dccb2942c7e33c4be3b52e068366526d1"},
 ]

 [package.dependencies]

@@ -1275,4 +1275,4 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p

 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<3.10"
-content-hash = "918b6736766a9c1b6732a56e1ef2e7a53241f2e25babb884881e49c299801fc9"
+content-hash = "85114054176aa164964acea6fdc085581ee7fc2f94c1cd03ad77611b82e52c79"

pyproject.toml
@@ -1,12 +1,12 @@
 [tool.poetry]
 name = "deepdog"
-version = "1.2.0"
+version = "1.7.0"
 description = ""
 authors = ["Deepak Mallubhotla <dmallubhotla+github@gmail.com>"]

 [tool.poetry.dependencies]
 python = ">=3.8.1,<3.10"
-pdme = "^1.2.0"
+pdme = "^1.5.0"
 numpy = "1.22.3"
 scipy = "1.10"
 tqdm = "^4.66.2"

@@ -22,6 +22,7 @@ syrupy = "^4.0.8"

 [tool.poetry.scripts]
 probs = "deepdog.cli.probs:wrapped_main"
+subset_sim_probs = "deepdog.cli.subset_sim_probs:wrapped_main"

 [build-system]
 requires = ["poetry-core>=1.0.0"]

(new test file: DirectMonteCarloConfig filename checks)
@@ -0,0 +1,26 @@
+import re
+import deepdog.direct_monte_carlo
+
+
+def test_config_check_self():
+    config = deepdog.direct_monte_carlo.DirectMonteCarloConfig(
+        tag="test_tag",
+        bayesrun_file_timestamp=False,
+    )
+    expected_filename = "test_tag.realdata.fast_filter.bayesrun.csv"
+    actual_filename = config.get_filename()
+    assert actual_filename == expected_filename
+    regex = config.get_filename_regex()
+    assert re.match(regex, actual_filename) is not None
+
+
+def test_config_check_self_with_timestamp():
+    config = deepdog.direct_monte_carlo.DirectMonteCarloConfig(
+        tag="test_tag",
+        bayesrun_file_timestamp=True,
+    )
+    expected_filename_ending = "test_tag.realdata.fast_filter.bayesrun.csv"
+    actual_filename = config.get_filename()
+    assert actual_filename.endswith(expected_filename_ending)
+    regex = config.get_filename_regex()
+    assert re.match(regex, actual_filename) is not None

(new test file: direct Monte Carlo cost function filter)
@@ -0,0 +1,42 @@
+import deepdog.direct_monte_carlo.cost_function_filter
+import numpy
+
+
+def test_px_cost_function_filter_example():
+    dipoles_1 = [
+        [1, 2, 3, 4, 5, 6, 7],
+        [2, 3, 2, 5, 4, 7, 6],
+    ]
+    dipoles_2 = [
+        [15, 9, 8, 7, 6, 5, 3],
+        [30, 4, 4, 7, 3, 1, 4],
+    ]
+    dipoleses = numpy.array([dipoles_1, dipoles_2])
+
+    def cost_function(dipoleses: numpy.ndarray) -> numpy.ndarray:
+        return dipoleses[:, :, 0].max(axis=-1)
+
+    expected_costs = numpy.array([2, 30])
+    numpy.testing.assert_array_equal(cost_function(dipoleses), expected_costs)
+
+    filter = deepdog.direct_monte_carlo.cost_function_filter.CostFunctionTargetFilter(
+        cost_function, 5
+    )
+    actual_filtered = filter.filter_samples(dipoleses)
+    expected_filtered = numpy.array([dipoles_1])
+    assert actual_filtered.size != 0
+    numpy.testing.assert_array_equal(actual_filtered, expected_filtered)
+
+    filter_stricter = (
+        deepdog.direct_monte_carlo.cost_function_filter.CostFunctionTargetFilter(
+            cost_function, 0.5
+        )
+    )
+    actual_filtered_stricter = filter_stricter.filter_samples(dipoleses)
+    assert actual_filtered_stricter.size == 0
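This test pins down the filter's contract: filter_samples keeps exactly the sample sets whose cost falls below the target. A minimal sketch of that contract (an illustration, not the library's implementation):

    import numpy

    def filter_samples_sketch(dipoleses, cost_function, target_cost):
        # Keep sample sets whose cost is under target, matching the assertions above.
        costs = cost_function(dipoleses)
        return dipoleses[costs < target_cost]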

(existing test file: indexifier)
@@ -10,3 +10,12 @@ def test_indexifier():
     _logger.debug(f"setting up indexifier {indexifier}")
     assert indexifier.indexify(0) == {"key_1": 1, "key_2": "a"}
     assert indexifier.indexify(5) == {"key_1": 2, "key_2": "c"}
+    assert len(indexifier) == 9
+
+
+def test_indexifier_length_short():
+    weight_dict = {"key_1": [1, 2, 3], "key_2": ["b", "c"]}
+    indexifier = deepdog.indexify.Indexifier(weight_dict)
+    _logger.debug(f"setting up indexifier {indexifier}")
+    assert len(indexifier) == 6
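Both length assertions read as len being the size of the cartesian product of the weight lists (evidently 3 × 3 = 9 in the existing test, 3 × 2 = 6 here):

    import math

    weight_dict = {"key_1": [1, 2, 3], "key_2": ["b", "c"]}
    assert math.prod(len(v) for v in weight_dict.values()) == 6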

(existing test file: bayesrun column parsing)
@@ -1,4 +1,4 @@
-import deepdog.results
+import deepdog.results.read_csv


 def test_parse_groupdict():
@@ -6,9 +6,9 @@ def test_parse_groupdict():
         "geom_-20_20_-10_10_0_5-orientation_free-dipole_count_100_success"
     )

-    parsed = deepdog.results._parse_bayesrun_column(example_column_name)
+    parsed = deepdog.results.read_csv._parse_bayesrun_column(example_column_name)
     assert parsed is not None
-    expected = deepdog.results.BayesrunColumnParsed(
+    expected = deepdog.results.read_csv.BayesrunColumnParsed(
         {
             "xmin": "-20",
             "xmax": "20",
@@ -29,9 +29,9 @@ def test_parse_groupdict_with_magnitude():
         "geom_-20_20_-10_10_0_5-magnitude_3.5-orientation_free-dipole_count_100_success"
     )

-    parsed = deepdog.results._parse_bayesrun_column(example_column_name)
+    parsed = deepdog.results.read_csv._parse_bayesrun_column(example_column_name)
     assert parsed is not None
-    expected = deepdog.results.BayesrunColumnParsed(
+    expected = deepdog.results.read_csv.BayesrunColumnParsed(
         {
             "xmin": "-20",
             "xmax": "20",
@@ -48,6 +48,28 @@ def test_parse_groupdict_with_magnitude():
     assert parsed == expected


+def test_parse_groupdict_with_negative_magnitude():
+    example_column_name = "geom_-20_20_-10_10_0_5-magnitude_-3.5-orientation_free-dipole_count_100_success"
+
+    parsed = deepdog.results.read_csv._parse_bayesrun_column(example_column_name)
+    assert parsed is not None
+    expected = deepdog.results.read_csv.BayesrunColumnParsed(
+        {
+            "xmin": "-20",
+            "xmax": "20",
+            "ymin": "-10",
+            "ymax": "10",
+            "zmin": "0",
+            "zmax": "5",
+            "orientation": "free",
+            "avg_filled": "100",
+            "log_magnitude": "-3.5",
+            "field_name": "success",
+        }
+    )
+    assert parsed == expected
+
+
 # def test_parse_no_match_column_name():
 #     parsed = deepdog.results.parse_bayesrun_column("There's nothing here")
 #     assert parsed is None

(new test file: bayesrun output filename parsing)
@@ -0,0 +1,19 @@
+import deepdog.results
+import pytest
+
+
+def test_parse_bayesrun_filename():
+    valid1 = "20250226-204120-dot1-dot1-2-0.realdata.fast_filter.bayesrun.csv"
+    timestamp, slug = deepdog.results._parse_string_output_filename(valid1)
+    assert timestamp == "20250226-204120"
+    assert slug == "dot1-dot1-2-0"
+
+    valid2 = "dot1-dot1-2-0.realdata.fast_filter.bayesrun.csv"
+    timestamp, slug = deepdog.results._parse_string_output_filename(valid2)
+    assert timestamp is None
+    assert slug == "dot1-dot1-2-0"
+
+    with pytest.raises(ValueError):
+        deepdog.results._parse_string_output_filename("not_a_valid_filename")
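Per these assertions, a filename is an optional YYYYMMDD-HHMMSS timestamp prefix, a slug, and the fixed .realdata.fast_filter.bayesrun.csv suffix. A regex sketch matching the three cases above (an assumed pattern, not deepdog's own):

    import re

    pattern = re.compile(
        r"^(?:(?P<timestamp>\d{8}-\d{6})-)?(?P<slug>.+)\.realdata\.fast_filter\.bayesrun\.csv$"
    )
    assert pattern.match("20250226-204120-dot1-dot1-2-0.realdata.fast_filter.bayesrun.csv")
    assert pattern.match("dot1-dot1-2-0.realdata.fast_filter.bayesrun.csv")
    assert pattern.match("not_a_valid_filename") is None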

(new syrupy snapshot file for the coalescing tests)
@@ -0,0 +1,10 @@
# serializer version: 1
# name: test_subset_simulation_multi_result_coalescing_easy_arithmetic
MultiSubsetSimulationResult(child_results=[SubsetSimulationResult(probs_list=(), over_target_cost=1, over_target_likelihood=1, under_target_cost=0.99, under_target_likelihood=0.8, lowest_likelihood=0.5, messages=[]), SubsetSimulationResult(probs_list=(), over_target_cost=1, over_target_likelihood=1, under_target_cost=0.99, under_target_likelihood=0.6, lowest_likelihood=0.01, messages=[])], model_name='test', estimated_likelihood=0.6928203230275509, arithmetic_mean_estimated_likelihood=0.7, num_children=2, num_finished_children=2, clean_estimate=True)
# ---
# name: test_subset_simulation_multi_result_coalescing_easy_geometric
MultiSubsetSimulationResult(child_results=[SubsetSimulationResult(probs_list=(), over_target_cost=1, over_target_likelihood=1, under_target_cost=0.99, under_target_likelihood=0.1, lowest_likelihood=0.5, messages=[]), SubsetSimulationResult(probs_list=(), over_target_cost=1, over_target_likelihood=1, under_target_cost=0.99, under_target_likelihood=0.001, lowest_likelihood=0.01, messages=[])], model_name='test', estimated_likelihood=0.010000000000000004, arithmetic_mean_estimated_likelihood=0.0505, num_children=2, num_finished_children=2, clean_estimate=True)
# ---
# name: test_subset_simulation_multi_result_coalescing_include_dirty
MultiSubsetSimulationResult(child_results=[SubsetSimulationResult(probs_list=(), over_target_cost=1, over_target_likelihood=1, under_target_cost=0.99, under_target_likelihood=0.8, lowest_likelihood=0.5, messages=[]), SubsetSimulationResult(probs_list=(), over_target_cost=1, over_target_likelihood=1, under_target_cost=0.99, under_target_likelihood=0.08, lowest_likelihood=0.01, messages=[]), SubsetSimulationResult(probs_list=(), over_target_cost=None, over_target_likelihood=None, under_target_cost=None, under_target_likelihood=None, lowest_likelihood=0.0001, messages=[])], model_name='test', estimated_likelihood=0.01856635533445112, arithmetic_mean_estimated_likelihood=0.29336666666666666, num_children=3, num_finished_children=2, clean_estimate=False)
# ---

(new test file: multi subset simulation result coalescing)
@@ -0,0 +1,92 @@
+import deepdog.subset_simulation.subset_simulation_impl as impl
+import numpy
+
+
+def test_subset_simulation_multi_result_coalescing_include_dirty(snapshot):
+    res1 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=1,
+        over_target_likelihood=1,
+        under_target_cost=0.99,
+        under_target_likelihood=0.8,
+        lowest_likelihood=0.5,
+        messages=[],
+    )
+    res2 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=1,
+        over_target_likelihood=1,
+        under_target_cost=0.99,
+        under_target_likelihood=0.08,
+        lowest_likelihood=0.01,
+        messages=[],
+    )
+    res3 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=None,
+        over_target_likelihood=None,
+        under_target_cost=None,
+        under_target_likelihood=None,
+        lowest_likelihood=0.0001,
+        messages=[],
+    )
+    combined = impl.coalesce_ss_results("test", [res1, res2, res3])
+    assert combined == snapshot
+
+
+def test_subset_simulation_multi_result_coalescing_easy_arithmetic(snapshot):
+    res1 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=1,
+        over_target_likelihood=1,
+        under_target_cost=0.99,
+        under_target_likelihood=0.8,
+        lowest_likelihood=0.5,
+        messages=[],
+    )
+    res2 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=1,
+        over_target_likelihood=1,
+        under_target_cost=0.99,
+        under_target_likelihood=0.6,
+        lowest_likelihood=0.01,
+        messages=[],
+    )
+    combined = impl.coalesce_ss_results("test", [res1, res2])
+    assert combined.arithmetic_mean_estimated_likelihood == 0.7
+    assert combined == snapshot
+
+
+def test_subset_simulation_multi_result_coalescing_easy_geometric(snapshot):
+    res1 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=1,
+        over_target_likelihood=1,
+        under_target_cost=0.99,
+        under_target_likelihood=0.1,
+        lowest_likelihood=0.5,
+        messages=[],
+    )
+    res2 = impl.SubsetSimulationResult(
+        probs_list=(),
+        over_target_cost=1,
+        over_target_likelihood=1,
+        under_target_cost=0.99,
+        under_target_likelihood=0.001,
+        lowest_likelihood=0.01,
+        messages=[],
+    )
+    combined = impl.coalesce_ss_results("test", [res1, res2])
+    numpy.testing.assert_allclose(combined.estimated_likelihood, 0.01)
+    assert combined == snapshot
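In the include_dirty case the unfinished run contributes its lowest_likelihood (0.0001) in place of an under-target likelihood, so the snapshot's estimated_likelihood is the cube root of the product:

    # 0.8 * 0.08 * 0.0001 = 6.4e-06; its cube root matches the snapshot value.
    print((0.8 * 0.08 * 0.0001) ** (1 / 3))  # ~0.018566355334451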